As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that the service currently allows. We agree: some of the sexiest uses of data require processing not just what is happening now, but the vast historical record. Twitter doesn’t provide the only use case for this, but until now its historical bulk data has been hard to find.
Today we are publishing a few items collected from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006. The initial datasets are a part of our Twitter Census collection.
The first dataset, a Token Count, counts the tokens (hashtags, smileys, and URLs) that have been tweeted. The data is available for free by month and for pay by hour. Think about comparing this data to the stock market, new movies, new video games, or even trendingtopics.org. For example, use it to track the adoption of Google Wave by the rate of its mentions. On one payload’s page you will find a snippet with a sample taken during Kanye West’s outburst in September, and on another’s you can see that the “:)” emoticon has been used 135,000 times.
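To make the idea of a token count concrete, here is a minimal Python sketch of how counts like these could be produced from raw tweet text. It is an illustration only, not the pipeline behind the dataset: the regular expressions, the `count_tokens` function, and the sample tweets are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical token patterns. The post does not describe the real
# tokenization rules, so these are assumptions for illustration.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
SMILEY = re.compile(r"[:;]-?[()DP]")


def count_tokens(tweets):
    """Count (month, token) pairs over an iterable of (month, text) tuples."""
    counts = Counter()
    for month, text in tweets:
        urls = URL.findall(text)
        # Strip URLs before looking for hashtags and smileys, so that a
        # fragment such as "#section" inside a link is not counted twice.
        rest = URL.sub(" ", text)
        for token in urls + HASHTAG.findall(rest) + SMILEY.findall(rest):
            counts[(month, token)] += 1
    return counts


# Made-up sample tweets, keyed by month.
sample = [
    ("2009-09", "Watching the VMAs :) #kanye http://bit.ly/w2dGf"),
    ("2009-09", "Unbelievable #kanye :("),
    ("2009-10", "Trying Google Wave :) #wave"),
]

counts = count_tokens(sample)
for (month, token), n in sorted(counts.items()):
    print(month, token, n)
```

Grouping by month mirrors the free monthly view mentioned above; keying on an hour string instead would give the hourly view.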
The second dataset solves a large problem developers have when they use Twitter’s Search API and the Twitter API: each API returns a different unique identifier for every user on Twitter. This dataset maps user IDs between the two APIs for 24.5 million users. This mapping should be a godsend to Twitter app developers, as it lets them combine data from both APIs, so that API calls for friends lists mix easily with searches on the Twitter Search API.
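As a rough sketch of how such a mapping might be used, the Python below joins search results to friends lists through an ID lookup table. Everything here is hypothetical: the field names (`from_user_id`, `rest_user_id`), the `attach_friends` helper, and the sample IDs are assumptions for illustration, not the dataset's actual layout.

```python
# Hypothetical mapping rows: (search_api_id, twitter_api_id).
mapping_rows = [(1001, 42), (1002, 57)]
search_to_rest = dict(mapping_rows)

# Made-up Search API hits, identified by the Search API's user ID.
search_hits = [
    {"from_user_id": 1001, "text": "Twitter census data published!"},
    {"from_user_id": 1003, "text": "stay tuned"},
]

# Made-up friends lists, keyed by the Twitter API's user ID.
friends_by_id = {42: [57, 99], 57: [42]}


def attach_friends(hits, id_map, friends):
    """Annotate each search hit with the mapped user ID and friends list.

    Hits whose user is missing from the mapping get None and an empty list.
    """
    combined = []
    for hit in hits:
        rest_id = id_map.get(hit["from_user_id"])
        combined.append(
            {**hit, "rest_user_id": rest_id, "friends": friends.get(rest_id, [])}
        )
    return combined


combined = attach_friends(search_hits, search_to_rest, friends_by_id)
for row in combined:
    print(row)
```

A plain dictionary is enough for a small example like this; at 24.5 million rows, a database table or on-disk key-value store would be the more practical choice.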
These datasets are only views from the massive collection we have been growing over the last year. We will be releasing additional datasets regularly over the next few weeks, so please check back for updates. If you’d like a custom slice or analysis done on this data, please get in touch at imw@infochimps.org.
With the release of this data, we hope to send a signal that this data is valuable and useful to real-time search engines, Twitter apps, and social media researchers. This should start a conversation about where value really lies in this type of data and about the various ownership and privacy issues that arise, and it should show that Infochimps.org is the place to go to find data. We invite interested parties to get in touch and begin uploading their data (try invite code “newsupplier”) today as part of the Infochimps marketplace.

[...] This post was mentioned on Twitter by infochimps News, andyhickl. andyhickl said: RT @infochimps: Twitter census data published! stay tuned! http://bit.ly/w2dGf [...]
Curious– how did you get access to this data? Or have you been (only) sampling data via the API for a long while now?
The data has been collected from the API for a year. The entire collection is over half a terabyte.
[...] Twitter Census: Publishing the First of Many Datasets | blog …2 hours ago by Joseph Kelly As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that service currently allows. We. [...]
Wow, this is HUGE. Charging for access to the historical record was a potential revenue stream I identified back in February — http://citrusfortress.com/wp/2009/02/how-twitter-could-start-making-money-now-without-fucking-up-a-very-very-good-thing/ … Nice work, guys, and best of luck!
(Moderator, please delete my previous post; this one has the correct link. Thanks!)
[...] Original post: Twitter Census: Publishing the First of Many Datasets | blog … [...]
[...] geeks have been watching for a year. Austin-based Infochimps announced this afternoon that it is now selling two important and very large sets of Twitter data. Limited samples of the data are available for free and a third, most important, set of data still [...]
[...] Twitter Census: Publishing the First of Many Datasets | blog …14 hours ago by Joseph Kelly As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that service currently allows. We. – [...]
Aren’t you worried about getting sued?
@James — Twitter is legendarily open, and their Terms of Service are very clear:
“Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services.”
“We encourage and permit broad re-use of Content. The Twitter API exists to enable this.”
(highlighting mine)
[...] Twitter Census: Publishing the First of Many Datasets | blog …11 Nov 2009 by Joseph Kelly As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that service currently allows. We. – [...]
[...] Twitter Census: Publishing the First of Many Datasets [...]
[...] written about this fairly extensively already, so I won’t belabor the point. But we’re going to [...]
@mrflip
Wow. I was ignorant of how open Twitter was, thanks!
[...] Read more here: Twitter Census: Publishing the First of Many Datasets | blog … [...]
[...] Infochimps says it hopes “to send a signal that this data is valuable and useful to real-time search engines, Twitter apps, and social media researchers.” It also hopes to “start a conversation about where value really lies in this type of data, [and] the various ownership and privacy issues that arise.” Given the complaints from Twitter the first time data was posted, it’s a smart move on the part of Infochimps to add this disclosure and thoroughly anonymize the data. The company very much wants to avoid any sort of ill will or backlash from the Twitterati over the release of the data sets. Back in 2006, AOL Research released 20 million search keywords attached to user IDs for researchers to use. A number of individuals were identified as a result of the “anonymized” data, leading to a number of concerns over what sorts of data are kosher to be released. [...]
Are you guys ever actually going to publish any twitter datasets, or just post grand announcements about publishing datasets?
Thanks for the post. I am assuming that this data is not free. Am I correct?
@Taylor: ouch. Trust us, it is KILLING me every second you don’t have the data. We’re working on getting people broad access to the whole thing rather than trying to trickle out a few more datasets here and there. But please note that the datasets grandly announced above are actually published and can be actually downloaded by actual you.
@John — The smiley face dataset is free, but yeah the others in this first batch are not. There will be a lot of data released for free, and there will be a larger pool of data released for academics. If you’re an academic researcher or hobbyist, please post something brief about what you’d do with the data over here — it will help us figure out how to address your needs, and we can put you on our list for early access.
[...] Twitter Census: Publishing the First of Many Datasets (tags: twitter social data download statistics 2009 socialmedia dataset census analytics stats publishing trends open infochimps datasets) [...]