After having played around with Twitter data for a while, I had a question: how Twitter samples the supposedly random tweets to send out through its sample streaming API?
I vaguely remember that it used to say "1% random sample" somewhere on the official documentation but I can no longer find that statement. So I decided to investigate the question by experiments. The result turns out to be far more fascinating than I expected (such as the appearance of 666).
This task would be trivial if I had firehose access but I do not. I initially thought of crawling tweets with ID's near the ones received in the stream sample and then do the counting. But I quickly found out how terribly inefficient that was: the tweet ids seem often to be very sparse. Then, thanks to Twitter's commitment to open source, I found their tweet ID generator on Github, wittily named snowflake (after a snowflake's large number of possible configurations, I suppose). In order to create a distributed solution to global unique ID generation, the essential idea of snowflake is to use timestamp and unique worker ID together to ensure uniqueness in an independent manner.
The first thing I noticed in snowflake is that whereas the 'created_at' property of the returned JSON tweet objects provides timing information at per-second resolution, one can recover per-millisecond timing information from snowflake! With this more precise timing information, some intriguing pattern emerges from the tweets in sample stream: within each second, all received tweets fall within a 10-millisecond-wide window. So we get 10/1000 = 1% of the millisecond timestamps which translates to roughly 1% of all tweets (assuming good randomness in tweet creation time) confirming the claim in my memory. But the surprise does not stop there, that sampling window is the same for every second! It is fixed exactly between the 657th and the 666th millisecond. So there is the 666 in the title. I wonder what is the story behind choosing 666 and this particular scheme of "random" sampling.
To make the post more complete, I should add that: 1. snowflake is used not only for tweet ID's but also direct message ID's. 2. before snowflake was activated sometime on 11/4/2010, Twitter used incremental ID's (the earliest existing tweet being 20).
To start playing with snow, you can use my little python module to create and melt a snowflake ID. (Indeed, you might soon find that not every tweet is delivered even in that 10 milliseconds window.)
If you find this interesting, leave a comment. We can also talk on twitter: @falcondai