6/20/2013

666 and How Twitter Samples Tweets in Streaming API

After playing around with Twitter data for a while, I had a question: how does Twitter sample the supposedly random tweets that it sends out through its sample streaming API?

I vaguely remember that the official documentation used to say "1% random sample" somewhere, but I can no longer find that statement. So I decided to investigate the question experimentally. The result turned out to be far more fascinating than I expected (such as the appearance of 666).

This task would be trivial if I had firehose access, but I do not. I initially thought of crawling tweets with IDs near the ones received in the stream sample and then counting them. But I quickly found out how terribly inefficient that was: tweet IDs often seem to be very sparse. Then, thanks to Twitter's commitment to open source, I found their tweet ID generator on GitHub, wittily named snowflake (after a snowflake's large number of possible configurations, I suppose). To generate globally unique IDs in a distributed way, snowflake's essential idea is to combine a timestamp with a unique worker ID, so that each worker can guarantee uniqueness independently of the others.
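
As a rough illustration of that idea, here is a minimal sketch of how such an ID could be packed. The bit layout (41-bit millisecond timestamp, 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence) and the epoch constant are assumptions taken from my reading of the open-source snowflake project, and `make_snowflake` is a hypothetical helper name, not Twitter's code.

```python
# Assumed layout (from the open-source snowflake project, not verified here):
#   [ timestamp: ms since custom epoch | 5 bits datacenter | 5 bits worker | 12 bits sequence ]
TWEPOCH_MS = 1288834974657  # assumed custom epoch, in Unix milliseconds

DATACENTER_BITS = 5
WORKER_BITS = 5
SEQUENCE_BITS = 12

WORKER_SHIFT = SEQUENCE_BITS
DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_BITS + DATACENTER_BITS  # 22


def make_snowflake(timestamp_ms, datacenter_id, worker_id, sequence):
    """Pack the four fields into one integer ID (hypothetical helper)."""
    if timestamp_ms < TWEPOCH_MS:
        raise ValueError("timestamp is before the assumed epoch")
    if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
        raise ValueError("datacenter_id out of range")
    if not 0 <= worker_id < (1 << WORKER_BITS):
        raise ValueError("worker_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    return (
        ((timestamp_ms - TWEPOCH_MS) << TIMESTAMP_SHIFT)
        | (datacenter_id << DATACENTER_SHIFT)
        | (worker_id << WORKER_SHIFT)
        | sequence
    )


if __name__ == "__main__":
    # Two workers generating an ID in the same millisecond still get distinct IDs.
    same_ms = TWEPOCH_MS + 5
    print(make_snowflake(same_ms, 1, 2, 0))
    print(make_snowflake(same_ms, 1, 3, 0))
```

Because the timestamp occupies the high bits, IDs sort roughly by creation time, and because the worker fields differ per machine, no coordination between workers is needed. It would also explain why neighbouring ID values are so sparse: most combinations of the low 22 bits are never used.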

The first thing I noticed in snowflake is that, whereas the 'created_at' property of the returned JSON tweet objects provides timing information at per-second resolution, one can recover per-millisecond timing information from the snowflake ID! With this more precise timing information, an intriguing pattern emerges from the tweets in the sample stream: within each second, all received tweets fall within a 10-millisecond-wide window. So we get 10/1000 = 1% of the millisecond timestamps, which translates to roughly 1% of all tweets (assuming tweet creation times are spread evenly within each second), confirming the claim in my memory. But the surprise does not stop there: that sampling window is the same for every second! It runs exactly from the 657th to the 666th millisecond, inclusive. So there is the 666 in the title. I wonder what the story is behind choosing 666 and this particular scheme of "random" sampling.
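
The check described above can be sketched in a few lines. This is not the module mentioned later in this post, only an illustration: the 22-bit shift and the epoch constant are assumptions taken from the open-source snowflake project, and the function names are made up for this example.

```python
from collections import Counter

TWEPOCH_MS = 1288834974657  # assumed snowflake epoch, in Unix milliseconds
TIMESTAMP_SHIFT = 22        # assumed: 10 machine bits + 12 sequence bits below the timestamp


def melt_timestamp_ms(tweet_id):
    """Recover the creation time in Unix milliseconds from a snowflake ID."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWEPOCH_MS


def millisecond_of_second(tweet_id):
    """Millisecond offset (0..999) within the second the ID was created."""
    return melt_timestamp_ms(tweet_id) % 1000


def in_sample_window(tweet_id, first_ms=657, last_ms=666):
    """True if the ID falls in the window observed in the sample stream."""
    return first_ms <= millisecond_of_second(tweet_id) <= last_ms


def offset_histogram(tweet_ids):
    """Count how many IDs land on each millisecond offset."""
    return Counter(millisecond_of_second(t) for t in tweet_ids)


if __name__ == "__main__":
    # Synthetic IDs (not real tweets): timestamp fields 0, 9, 10 and 1000 ms
    # after the assumed epoch, with all lower bits set to zero.
    synthetic_ids = [ms << TIMESTAMP_SHIFT for ms in (0, 9, 10, 1000)]
    for tweet_id in synthetic_ids:
        print(tweet_id, millisecond_of_second(tweet_id), in_sample_window(tweet_id))
    print(offset_histogram(synthetic_ids))
```

Running `offset_histogram` over IDs collected from the sample stream should, if the observation holds, produce counts only at offsets 657 through 666. One thing worth noting, if the assumed epoch constant is right: it ends in 657 ms itself, so the window would correspond to raw timestamp fields whose value modulo 1000 is 0 through 9.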

To make the post more complete, I should add two things: 1. snowflake is used not only for tweet IDs but also for direct message IDs. 2. Before snowflake was activated sometime on 11/4/2010, Twitter used incremental IDs (the earliest existing tweet being 20).

To start playing with snow, you can use my little Python module to create and melt a snowflake ID. (Indeed, you might soon find that not every tweet is delivered, even within that 10-millisecond window.)

If you find this interesting, leave a comment. We can also talk on Twitter: @falcondai

16 comments:

  1. Replies
    1. I haven't used Instagram's API so I wouldn't know, maybe you can try it and share your experience. This analysis takes advantage of Twitter's id generator, i.e. snowflake (which is open-sourced), to study their streaming API's sampling scheme.

  2. Hi, good information on your website. May I ask, do you know of any API source code that can retrieve a maximum of 1000 tweets? The source code I have found has a limit of 100 tweets.

    Thank you for your response, sir.

  3. Here is the documentation.

    https://twittercommunity.com/t/potential-adjustments-to-streaming-api-sample-volumes/31628

  4. I'd be interested if anyone has tried this again recently? Has the game changed? Also, I wonder if they miss tweets during high volumes, thus reducing the sample rate?


  5. I am quite new to the Twitter API and Tweepy, and I am confused by the rate-limiting concept. I am using the streaming API and want to gather sample tweets without any filters such as hashtags or location. Some sources state that I should not get rate limited with sample tweets because I am getting 1% of tweets, and some state otherwise. I keep getting error 420 very often, and I was wondering if there is a way to avoid it or make it smoother? Thanks~ Anne from tailored software solutions
