After a marathon debugging session we found and fixed two performance-related bugs in the service last night. The first caused unnecessary disk I/O whenever a device performed a re-sync with contacts enabled: the data we store was being re-written even when it hadn't changed. Many devices were re-syncing yesterday as they converted to multiple calendars, which led to disk saturation and hence service slowness. A slow service in turn increases load, because some devices time out and trigger yet more re-syncing. The second problem was a memory leak that became worse with the increased load, eventually bogging the service down completely and triggering an automated restart. Both bugs were fixed late last night.
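For the technically curious, the general idea behind avoiding redundant writes can be sketched as follows. This is only an illustration, not our actual service code: the `ContactStore` class, its `save` method and the use of a SHA-256 digest to detect unchanged data are all assumptions made for the example.

```python
import hashlib


class ContactStore:
    """Hypothetical in-memory stand-in for an on-disk record store."""

    def __init__(self):
        self._records = {}    # key -> stored payload
        self._digests = {}    # key -> digest of the stored payload
        self.disk_writes = 0  # counts simulated disk writes

    def save(self, key, payload: bytes) -> bool:
        """Write the payload only if it differs from what is already stored.

        Returns True if a write happened, False if it was skipped.
        """
        digest = hashlib.sha256(payload).hexdigest()
        if self._digests.get(key) == digest:
            # Nothing changed during this re-sync, so skip the write.
            return False
        self._records[key] = payload
        self._digests[key] = digest
        self.disk_writes += 1
        return True


store = ContactStore()
store.save("contact-1", b"Alice")  # first sync: written
store.save("contact-1", b"Alice")  # re-sync with identical data: skipped
store.save("contact-1", b"Alicia") # a real change: written
```

With a check like this, a wave of re-syncs that carries no real changes costs a digest comparison per record rather than a disk write.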
The worst aspect of this incident was that Apple devices interacted in an unfortunate way with the degraded service: if a device wasn't able to sync a change made on its end, it could trigger an out-of-sync condition in the service. The iPhone handles this condition by deleting all its existing data and then re-fetching fresh data from the service. Normally this takes a few seconds, but yesterday the service wasn't always responding to the re-fetch, so the device ended up empty until it was able to re-sync successfully.
For us this is a worse situation than if the service were completely down, because in that case you'd still have all your data on the phone. We therefore modified the service code last night to allow us to globally block these out-of-sync triggers on a temporary basis. This means we can guarantee that nobody will have their contacts vanish all of a sudden. The downside is that nothing will sync for the affected devices while the block is in place. This safety mode was enabled last night. It will be turned off once we're sure the service is stable this morning (things look good at present).
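As a rough illustration of what such a safety mode looks like, here is a minimal sketch. It is not our real implementation: the `SyncService` class, the `block_out_of_sync` flag and the string results are hypothetical names chosen for the example.

```python
class SyncService:
    """Hypothetical sketch of a service with a global safety switch."""

    def __init__(self):
        # When True, the service never tells a device it is out of sync.
        self.block_out_of_sync = False

    def handle_failed_change(self, device_id: str) -> str:
        """Decide what to tell a device whose change could not be synced."""
        if self.block_out_of_sync:
            # Safety mode: the device keeps its local data, but nothing
            # syncs for it until the switch is turned off again.
            return "deferred"
        # Normal mode: the device is told to wipe its data and re-fetch.
        return "out_of_sync"


service = SyncService()
service.block_out_of_sync = True
result = service.handle_failed_change("device-123")  # "deferred"
```

The trade-off is the one described above: with the switch on, no device is told to wipe its data, but affected devices also stop syncing until it is turned off.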
All the devices that got into the half-synced state, where contacts and calendar events had 'vanished', should have picked up sync again and fetched new data.
We do have plenty of servers and network bandwidth available; the slowdown came from the two bugs, not from a lack of capacity.
Update: people have asked 'what can I do?' to get syncing again. The answer is: nothing. Sync should pick up again naturally. The safety mode has now been turned off, so unless there are further service load problems everyone should see normal service return soon.