6/20/2013

666 and How Twitter Samples Tweets in Its Streaming API

After playing around with Twitter data for a while, I had a question: how does Twitter sample the supposedly random tweets it sends out through its sample streaming API?

I vaguely remember that the official documentation used to say "1% random sample" somewhere, but I can no longer find that statement. So I decided to investigate the question experimentally. The result turned out to be far more fascinating than I expected (including the appearance of 666).

This task would be trivial if I had firehose access, but I do not. I initially thought of crawling tweets with IDs near the ones received in the sample stream and then counting how many of them showed up in the sample. But I quickly found out how terribly inefficient that was: tweet IDs are often very sparse. Then, thanks to Twitter's commitment to open source, I found their tweet ID generator on GitHub, wittily named snowflake (after a snowflake's large number of possible configurations, I suppose). Snowflake is a distributed solution to globally unique ID generation. Its essential idea is to combine a timestamp with a unique worker ID, so that each worker can generate unique IDs independently of the others.
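To make the idea concrete, here is a minimal Python sketch of how such an ID can be packed together. The bit widths (a millisecond timestamp in the high bits, then a 5-bit datacenter ID, a 5-bit worker ID, and a 12-bit sequence number) and the epoch constant follow the open-source snowflake code as far as I can tell; the function name make_snowflake is made up for this illustration and is not part of any Twitter library.

```python
# Constants as they appear in the open-source snowflake generator
# (treat them as assumptions if the upstream code has changed).
TWEPOCH_MS = 1288834974657   # custom epoch, in milliseconds since 1970-01-01 UTC
DATACENTER_ID_BITS = 5
WORKER_ID_BITS = 5
SEQUENCE_BITS = 12

WORKER_SHIFT = SEQUENCE_BITS                           # 12
DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS      # 17
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_ID_BITS  # 22


def make_snowflake(timestamp_ms, datacenter_id, worker_id, sequence):
    """Pack the four fields into one integer ID (hypothetical helper)."""
    if not 0 <= datacenter_id < (1 << DATACENTER_ID_BITS):
        raise ValueError("datacenter_id out of range")
    if not 0 <= worker_id < (1 << WORKER_ID_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    if timestamp_ms < TWEPOCH_MS:
        raise ValueError("timestamp precedes the snowflake epoch")
    return (
        ((timestamp_ms - TWEPOCH_MS) << TIMESTAMP_SHIFT)
        | (datacenter_id << DATACENTER_SHIFT)
        | (worker_id << WORKER_SHIFT)
        | sequence
    )
```

Because the timestamp sits in the highest bits, IDs sort roughly by creation time, and two workers never collide as long as their (datacenter, worker) pairs differ. It also explains the sparseness: most of the possible low-bit combinations within a given millisecond are never used.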

The first thing I noticed in snowflake is that, whereas the 'created_at' property of the returned JSON tweet objects provides timing information at per-second resolution, one can recover per-millisecond timing information from the snowflake ID! With this more precise timing information, an intriguing pattern emerges from the tweets in the sample stream: within each second, all received tweets fall within a 10-millisecond-wide window. So we get 10/1000 = 1% of the millisecond timestamps, which translates to roughly 1% of all tweets (assuming tweet creation times are spread evenly within each second), confirming the claim in my memory. But the surprises do not stop there: the sampling window is the same for every second! It always runs from the 657th to the 666th millisecond, inclusive. So there is the 666 in the title. I wonder what the story is behind choosing 666 and this particular scheme of "random" sampling.
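The check described above can be sketched in a few lines of Python. This is only an illustration: the epoch constant and the 22-bit shift come from the open-source snowflake code, while the function names and the default window bounds (657 and 666, from the observation above) are chosen here for the example.

```python
TWEPOCH_MS = 1288834974657   # snowflake's custom epoch (ms since 1970-01-01 UTC)
TIMESTAMP_SHIFT = 22         # 12 sequence bits + 10 worker/datacenter bits


def snowflake_to_timestamp_ms(tweet_id):
    """Recover the creation time in Unix milliseconds from a snowflake ID."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWEPOCH_MS


def millisecond_of_second(tweet_id):
    """Millisecond offset (0-999) of the tweet within its second."""
    return snowflake_to_timestamp_ms(tweet_id) % 1000


def in_sample_window(tweet_id, first_ms=657, last_ms=666):
    """True if the tweet falls in the observed window (inclusive bounds)."""
    return first_ms <= millisecond_of_second(tweet_id) <= last_ms
```

Running millisecond_of_second over the IDs collected from the sample stream and tallying the results is enough to reproduce the pattern. One detail the sketch makes visible: the epoch constant itself ends in 657, so a millisecond offset between 657 and 666 is the same thing as the raw timestamp field of the ID, taken modulo 1000, falling between 0 and 9.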

To make the post more complete, I should add that: 1. snowflake is used not only for tweet IDs but also for direct message IDs; 2. before snowflake was activated sometime on 11/4/2010, Twitter used incremental IDs (the earliest existing tweet having ID 20).
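The activation date can be cross-checked against the epoch constant in the open-source snowflake code. The short Python sketch below converts that constant to a calendar date; that the epoch marks roughly when snowflake went live is an assumption, and the helper name is made up for this example.

```python
from datetime import datetime, timedelta, timezone

TWEPOCH_MS = 1288834974657  # epoch constant from the open-source snowflake code


def ms_to_utc_datetime(timestamp_ms):
    """Convert Unix milliseconds to a timezone-aware UTC datetime (exact, no floats)."""
    unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return unix_epoch + timedelta(milliseconds=timestamp_ms)


snowflake_epoch = ms_to_utc_datetime(TWEPOCH_MS)
print(snowflake_epoch.isoformat())
```

The epoch lands on November 4, 2010 (UTC), which agrees with the date above.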

To start playing with snow, you can use my little Python module to create and melt a snowflake ID. (Indeed, you might soon find that not every tweet is delivered even within that 10-millisecond window.)

If you find this interesting, leave a comment. We can also talk on Twitter: @falcondai

8 comments:

  1. Replies
    1. I haven't used Instagram's API so I wouldn't know, maybe you can try it and share your experience. This analysis takes advantage of Twitter's id generator, i.e. snowflake (which is open-sourced), to study their streaming API's sampling scheme.

  2. Hi, good information on your website. May I ask, do you know of any API source code that is able to retrieve a maximum of 1000 tweets? The source code I have found has a limit of 100 tweets.

    Thank you for your response, sir.

  3. Here is the documentation.

    https://twittercommunity.com/t/potential-adjustments-to-streaming-api-sample-volumes/31628

  4. I'd be interested if anyone has tried this again recently? Has the game changed? Also, I wonder if they miss tweets during high volumes, thus reducing the sample rate?

  6. I am quite new to the Twitter API and Tweepy, and I am confused by the rate-limiting concept. I am using the streaming API and want to gather sample tweets without any filters such as hashtags or location. Some sources state that I should not get rate limited with sample tweets since I am getting 1% of tweets, and some state otherwise. I keep getting error 420 very often, and I was wondering if there is a way to avoid it or make it smoother? Thanks~ Anne from tailored software solutions
