
Massive Scrape of Twitter’s Friend Graph 1 year, 8 months ago. by mrflip

UPDATE:

We’ve posted several Twitter datasets on Infochimps. Take a look and build something cool!

UPDATE:

We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”


The infochimps have gathered a massive scrape of the Twitter friend graph.  Right now it weighs in at

  • about 2.7M users: we have most of the “giant component”
  • 10M tweets
  • 58M edges

(These and other details will be updated as further drafts are released. See below for technical info.)  This is still a rough, rough draft, but this dataset is so amazingly rich we couldn’t help sharing it.  We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev.  We’ll also have a much larger dump of tweets off the public datamining feed.
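For anyone unfamiliar with the term, the “giant component” is the largest set of users who are all reachable from one another by following friend links, ignoring direction. Here is a minimal sketch of how one might pull it out of an edge list with a union-find structure. The input shape (an iterable of `(user_a, user_b)` pairs) and the function name `largest_component` are assumptions made for illustration; they do not describe the actual field layout of the dump.

```python
from collections import Counter


def largest_component(edges):
    """Return the set of nodes in the largest connected component.

    `edges` is an iterable of (user_a, user_b) pairs (a hypothetical format);
    edge direction is ignored.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        # Walk up to the root of this node's tree.
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    # Merge the two components joined by each edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    if not parent:
        return set()

    # Count members per root and keep the biggest group.
    sizes = Counter(find(node) for node in list(parent))
    biggest_root, _ = sizes.most_common(1)[0]
    return {node for node in list(parent) if find(node) == biggest_root}
```

For a graph of this size the dictionary holds one entry per user, so memory scales with the number of users rather than the number of edges, and the edges can be streamed from disk.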

The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.

Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.


The gems of our collection — The best of what's to come 2 years, 5 months ago. by mrflip

Hooray! The infochimps have been waxy’ed.  Let’s see how the server bonobos stand up.

It’s been suggested that I highlight some of the “gems” of our collection, which we’re going to spend the whole weekend shoveling into the pile. These first few are really deep, and somewhat hard to get / not widely known:

  • Full game state for every play of every baseball game in 2007, majors and minors.  Additionally, for about half of the major league games, *pitch by pitch* trajectory and game state information.  (MLB Gameday)
  • Word frequencies in written text for ~800,000 word tokens (British National Corpus)
  • All the wikipedia infoboxes, turned on their side and put into a table for each infobox type.
  • 250,000+ Material Safety Data Sheets – the chemical and safety information required by OSHA
  • 100 years of hourly weather data; from 1973 on there are about 10,000 stations all taking hourly readings … put another way, it’s 475,000+ station-years of hourly readings and weighs in at ~15 GB compressed.

(Incidentally, many of those datasets sell for inexcusable and malicious prices.  For those with a commercial bent, something tells me there’s room in the market if you’re willing to accept a markup of less than 10,000 times).

These are a bit silly but interesting for their ridiculous depth:
  • A variety of mathematical constants (pi, e, Catalan’s constant, the Golden Ratio, others) calculated to, in some cases, a preposterous 100 billion decimal places (I’ll probably chop them off at a still-ludicrous 500 million).
  • 5000 years of solar eclipse times, 6000 years of precise lunar phase, 6000 years of Venus transits.
  • Odds of Dying for every Cause of Death listed in the US in a given year.

There are also, of course, the well-known collections: IMDB.com, musicbrainz, dbpedia, CIA factbook, geonames, citeseer, census, statistical abstract and the like.  So let’s see how much of the low-hanging fruit we can toss up there this weekend (the hard parts are adding metadata, and getting the non-copyrightable data out of the copyrighted screenscrapes, so what you’ll see are minimal metadata and the non-screenscraped datasets — still beats paying $1200+/GB though.)

[edit: dates for holidays by country, year-by-year odds of dying for all causes of death from the most recent 8 years, NIST values for physical and chemical constants, mechanical properties of common engineering materials, and the spoken and written word frequencies for ~800,000 word tokens datasets should be up later today -- if the site is down briefly we're pushing that update to the server.  (If the site is down not-briefly we've been del.waxyslashdiggdotted.)  Thanks to my friend Ned for helping do some drudge work to get those out.]

All of Wikipedia's infoboxes & templates, in individual tables for each kind 2 years, 5 months ago. by mrflip

FINALLY — got the wikipedia infoboxen posted to the site, along with some tiny fixes.

This is 3000+ tables on everything from ABA Teams through Simpsons Episodes to Zodiac Signs.  There’s a fair amount of cruft in these, but until I have live metadata editing going I’m not going to worry about it: it takes about 8 hours start to finish to process this dataset, and while the tables are not perfect they are perfectly usable.
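To give a rough idea of what “a table for each kind” means, here is a minimal sketch of the grouping step. It assumes the infoboxes have already been parsed into `(infobox_type, fields)` pairs, where `fields` is a dict of field name to value. That input shape and the function name `infoboxes_to_tables` are hypothetical, and this is not the code that produced the posted tables.

```python
from collections import defaultdict


def infoboxes_to_tables(infoboxes):
    """Group (infobox_type, fields) pairs into one table per infobox type.

    Returns {infobox_type: (columns, rows)}. `columns` is the sorted list of
    every field name seen for that type; each row lists the values in column
    order, with None where an infobox did not set a field.
    """
    # Collect all records of the same type together.
    grouped = defaultdict(list)
    for infobox_type, fields in infoboxes:
        grouped[infobox_type].append(fields)

    tables = {}
    for infobox_type, records in grouped.items():
        # The column set is the union of field names across the type's records.
        columns = sorted({name for record in records for name in record})
        rows = [[record.get(name) for name in columns] for record in records]
        tables[infobox_type] = (columns, rows)
    return tables
```

Taking the union of field names keeps every value but is also one source of the cruft mentioned above: a field that is misspelled in a single article gets its own nearly empty column.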

I have the weather dataset and baseball datasets almost ready to go (along with a whole buncha others), but I’m going to take some time to get the site running better first.  Here’s a rough TODO list:

  1. live, versioned metadata editing
  2. uploading
  3. Allow grouping of datasets by collection and add category tags
  4. Make it so fields & contributors tie together.  (For complicated reasons, each dataset creates a new personal version of the field so you can’t actually walk from one “stock price” field to other datasets with that tag.)

Then I’ll turn some intensive attention finally to the InfiniteMonkeywrench code.  We need better tools to wrangle these huge datasets into shape.

Stock Market dataset is up 2 years, 5 months ago. by mrflip

40 Years of data on every NYSE, AMEX and NASDAQ listed stock:

These links were busted before but should be worky now.

Statistical Abstract of the United States 2 years, 6 months ago. by mrflip

Added the Statistical Abstract of the United States: the messily, messily formatted analyzed tables released by the US Census Bureau.  1350+ tables, yum.