6/20/2013

666 and How Twitter Samples Tweets in Streaming API

After playing around with Twitter data for a while, I had a question: how does Twitter sample the supposedly random tweets that it sends out through its sample streaming API?

I vaguely remember that the official documentation used to say "1% random sample" somewhere, but I can no longer find that statement. So I decided to investigate the question experimentally. The result turned out to be far more fascinating than I expected (such as the appearance of 666).

This task would be trivial if I had firehose access, but I do not. I initially thought of crawling tweets with IDs near the ones received in the stream sample and then counting them. But I quickly found out how terribly inefficient that was: tweet IDs are often very sparse. Then, thanks to Twitter's commitment to open source, I found their tweet ID generator on GitHub, wittily named snowflake (after a snowflake's large number of possible configurations, I suppose). Snowflake is a distributed solution to globally unique ID generation: its essential idea is to combine a timestamp with a unique worker ID, so that each worker can generate unique IDs independently of the others.

The first thing I noticed in snowflake is that whereas the 'created_at' property of the returned JSON tweet objects provides timing information at per-second resolution, one can recover per-millisecond timing information from a snowflake ID! With this more precise timing information, an intriguing pattern emerges from the tweets in the sample stream: within each second, all received tweets fall within a 10-millisecond-wide window. So we get 10/1000 = 1% of the millisecond timestamps, which translates to roughly 1% of all tweets (assuming tweet creation times are spread evenly across each second), confirming the claim as I remembered it. But the surprise does not stop there: the sampling window is the same for every second! It always spans the 657th through the 666th millisecond. Hence the 666 in the title. I wonder what the story is behind choosing 666 and this particular scheme of "random" sampling.
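To make the timestamp recovery concrete, here is a minimal sketch of how the millisecond offset can be read out of a tweet ID. It assumes the layout in the open-sourced snowflake code (the timestamp sits above the low 22 bits and counts milliseconds since the epoch constant 1288834974657). The function names are made up for illustration, and the 657 to 666 bounds are simply the window observed above.

```python
from collections import Counter

# Assumptions taken from the open-sourced snowflake code:
TWEPOCH_MS = 1288834974657   # custom epoch, in Unix milliseconds
TIMESTAMP_SHIFT = 22         # low 22 bits hold machine id and sequence


def snowflake_to_unix_ms(tweet_id):
    """Recover the creation time (Unix milliseconds) from a snowflake ID."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWEPOCH_MS


def millisecond_of_second(tweet_id):
    """Millisecond offset (0..999) within the second the tweet was created."""
    return snowflake_to_unix_ms(tweet_id) % 1000


def in_sample_window(tweet_id, first_ms=657, last_ms=666):
    """True if the tweet falls inside the window observed in the sample stream."""
    return first_ms <= millisecond_of_second(tweet_id) <= last_ms


def millisecond_histogram(tweet_ids):
    """Count how many of the given tweets fall on each millisecond offset."""
    return Counter(millisecond_of_second(t) for t in tweet_ids)
```

Feeding the IDs collected from the sample stream into `millisecond_histogram` is one way to repeat the experiment: if the pattern holds, every key lands between 657 and 666. A side observation from the arithmetic: the epoch constant itself ends in 657 ms, so under this layout the window corresponds to raw timestamp fields whose value modulo 1000 is 0 through 9. Whether that is the story behind the number, I cannot say.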

To make the post more complete, I should add two things: 1. snowflake is used not only for tweet IDs but also for direct message IDs. 2. Before snowflake was activated sometime on 11/4/2010, Twitter used incremental IDs (the earliest existing tweet being 20).

To start playing with snow, you can use my little Python module to create and melt a snowflake ID. (Indeed, you might soon find that not every tweet is delivered even in that 10-millisecond window.)
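For readers who just want the idea, here is a rough sketch of what creating and melting a snowflake ID could look like. This is not the module mentioned above. It is a stand-in written under the assumption that IDs follow the layout in the open-sourced snowflake code: a millisecond timestamp, then 5 bits of datacenter ID, 5 bits of worker ID, and 12 bits of sequence number. All names here are hypothetical.

```python
from typing import NamedTuple

# Assumed layout, following the open-sourced snowflake code.
TWEPOCH_MS = 1288834974657
SEQUENCE_BITS = 12
WORKER_BITS = 5
DATACENTER_BITS = 5

WORKER_SHIFT = SEQUENCE_BITS                     # 12
DATACENTER_SHIFT = WORKER_SHIFT + WORKER_BITS    # 17
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS  # 22


class Snowflake(NamedTuple):
    unix_ms: int
    datacenter_id: int
    worker_id: int
    sequence: int


def make_snowflake(unix_ms, datacenter_id=0, worker_id=0, sequence=0):
    """Pack the four fields into a single integer ID."""
    if unix_ms < TWEPOCH_MS:
        raise ValueError("timestamp is earlier than the snowflake epoch")
    if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
        raise ValueError("datacenter_id out of range")
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    return (((unix_ms - TWEPOCH_MS) << TIMESTAMP_SHIFT)
            | (datacenter_id << DATACENTER_SHIFT)
            | (worker_id << WORKER_SHIFT)
            | sequence)


def melt(snowflake_id):
    """Unpack an ID back into its timestamp, datacenter, worker and sequence."""
    return Snowflake(
        unix_ms=(snowflake_id >> TIMESTAMP_SHIFT) + TWEPOCH_MS,
        datacenter_id=(snowflake_id >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        worker_id=(snowflake_id >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1),
        sequence=snowflake_id & ((1 << SEQUENCE_BITS) - 1),
    )
```

A handy use of `make_snowflake` is to build the smallest and largest possible IDs for a given millisecond (all lower fields zero, or all at their maximum), which gives an ID range to compare against tweets received from the stream.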

If you find this interesting, leave a comment. We can also talk on Twitter: @falcondai

11 comments:

  1. Replies
    1. I haven't used Instagram's API so I wouldn't know, maybe you can try it and share your experience. This analysis takes advantage of Twitter's id generator, i.e. snowflake (which is open-sourced), to study their streaming API's sampling scheme.

  2. Hi, good information on your website. May I ask, do you know of any source code or API that can retrieve up to 1000 tweets? The source code I have found has a limit of 100 tweets.

    Thanks for your response, sir.

  3. Here is the documentation.

    https://twittercommunity.com/t/potential-adjustments-to-streaming-api-sample-volumes/31628

  4. I'd be interested if anyone has tried this again recently? Has the game changed? Also, I wonder if they miss tweets during high volumes, thus reducing the sample rate?



  5. I am quite new to the Twitter API and Tweepy, and I am confused by the rate-limiting concept. I am using the streaming API and want to gather sample tweets without any filters such as hashtags or location. Some sources state that I should not get rate limited with sample tweets since I am getting 1% of tweets, and some state otherwise. I keep getting error 420 very often, and I was wondering if there is a way to avoid it or make it smoother? Thanks~ Anne from tailored software solutions
