blog.infochimps.org – Organizing Huge Information Sources

Something about Everything about Something

Open a banana like a Monkey does

Open a Banana like a Monkey – most human primates do it wrong!

To go with open banana here is open banana data:

Written by mrflip

12 Jul 2009 at 6:16 pm

Posted in infochimps.org

It’s Hot, Damn Hot. So Hot I saw a Chimp in Orange Robes Burst into Flames.

It’s been ridiculously hot ridiculously early this year in Austin. A friend passed along this link to a visualization of 100+ degree days over the last 10 years. The author couldn’t find data extending back farther than 2000, but luckily I knew where to look.

I pulled the NCDC weather for Austin from 1948-present (see infochimps.org link for details) and got my Tufte on.

This temperature cycle is hotter than but comparable to the 1950-1965 era. I’ve got no idea if it’s global warming or the peak of a cycle. The fundamental conclusion — that this year so far, 2000 and 2008 were damn hot — stands up well.
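For anyone who wants to repeat the count, here is a minimal sketch of the tally behind a chart like this: the number of days per year whose high reached 100 degrees. The function name, the (date, high) input shape, and the sample readings are all assumptions for illustration; the real NCDC files need their own parsing, and the readings below are made up, not NCDC values.

```python
from collections import Counter
from datetime import date


def hot_days_per_year(daily_highs, threshold_f=100.0):
    """Count, for each year, the days whose high met or beat the threshold.

    daily_highs is an iterable of (date, high_in_fahrenheit) pairs.
    """
    counts = Counter()
    for day, high_f in daily_highs:
        if high_f >= threshold_f:
            counts[day.year] += 1
    return dict(counts)


# Made-up readings for illustration only; these are not NCDC values.
sample = [
    (date(2008, 6, 1), 101.0),
    (date(2008, 6, 2), 99.0),
    (date(2008, 6, 3), 100.0),
    (date(2009, 6, 1), 104.0),
]

print(hot_days_per_year(sample))
```

Feeding it the full 1948-present record would give one count per year, which is the series you would want to plot.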

Read the rest of this entry »

Written by mrflip

11 Jul 2009 at 1:21 pm

Posted in infochimps.org

Congrats Retrosheet – another decade of rich Baseball data online

Congrats to Retrosheet, who now have full major-league baseball box scores from 1920-1930 online! (This is in addition to full box score coverage for 1953-2008, and broad coverage of box scores and play-by-play data from 1871-2008). As Nate Silver has said, “Baseball is the perfect dataset”, and we would not have this astonishingly rich and detailed dataset if not for the dedicated crowdsourced efforts of the Retrosheet team.

Infochimps metadata entries for these datasets:

Written by mrflip

6 Jul 2009 at 1:47 pm

On Jessica Hagy’s “Indexed”, the “Start-up Checklist“:

Written by mrflip

30 Jun 2009 at 10:26 am

Posted in infochimps.org

Freebase Hack Day & Updates

Our friends at Freebase are having another Hack Day in San Francisco this July. It’s only two weeks away now and the remaining tickets can go fast, so get involved: http://blog.freebase.com/2009/06/26/two-weeks-til-freebase-hack-day-sign-up-now/

Learn about the many cool things that Freebase is doing with their data, and the tools that can be built using their platform.

On a side note, http://infochimps.org/ has gotten a facelift.  We’d love feedback on it: info@infochimps.org.   We hope your browsing experience is better, and we will be happy to roll out new features soon!

Written by kellyjoseph

29 Jun 2009 at 1:32 am

Posted in news

What’s New

Infochimps has been acknowledged as a finalist by the Capital Factory for 2009.

Infochimps is also a finalist in PepsiCo’s pitch competition.

Infochimps has a Facebook page! Become a fan.

Katherine at The New Civilization is aiding us in UX design for our Beta, to be launched at the end of May.  Eve Simon in Washington DC is helping us with the site design.  Our two big goals for the Beta are:

1) Improved browseability of the datasets, including a search bar and better surfing through tags, categories, and collections.

2) Uploading capability.  Users will be able to create accounts and upload datasets, as well as edit the descriptions of other data on the site.

Drop us a line anytime at info@infochimps.org

Written by kellyjoseph

27 Apr 2009 at 1:43 pm

Posted in infochimps.org, main

@mrflip’s OpenGov Talk: Data Commons and Transparent Government

Here is my (mrflip’s) SxSW OpenGov talk, “How Open Data will help build Open Government“:

There is nothing more painful than watching yourself talk. So I haven’t gone all the way through this video — if you see me don’t give away the ending. Huge thanks to Silona Bonewald (League of Technical Voters) for organizing this, and to Terry Walhus (spring.net) for taping and copying and editing and uploading the videos.

Written by mrflip

20 Apr 2009 at 4:25 pm

Posted in infochimps.org

I love it when a plan comes together…

with 3 comments

So Simon Willison (@simonw), one of the architects of The Guardian’s Open Platform and co-creator of a modestly popular web framework, is here at SxSW and gave an informal talk (on Zeppelins, of course – what else?). Freebase community manager Kirrily Robert (@skud) saw my tweet and proposed a meetup. After iteratively solving the three body problem, we put out the word on Sunday morning for a meetup on Sunday evening… SemWebAustin @juansequeda and Freebase @jameshome each pinged their 1-neighborhood and next thing you know I’m sitting next to Jure Cuhalev of Zemanta and machine learning machine @Nikete trying to orchestrate overflow seating for 25+ data geeks.

The reason for the gossip-column style of this post is to show the size and breadth of the data geek crowd. James Home and I agree that we need to turn out this Cyrus’ army of data geeks to take over a much larger part of SxSW next year. We need talks on column-store databases and hadoop, linked data and the construction of the data commons, how NLP and machine learning can power inspiring audience-driven websites, on the developing grammar of Information Visualization, on Processing and Prefuse and R. Pete Skomoroch, Mike Driscoll and Christian Chabot all ended up skipping SxSW this year; we need them leading a panel discussion on how to visualize >10M point datasets with limited-bandwidth desktop and web interfaces. I’d like to hear Deepak Singh and one of the @cloudera’ns drop science about scalable cloud computing.

The evening was just informal mingling and conversation, but at the request of @mndoci and @dataspora, here is our name-droppy slice of the whirlwind:

@mrflip: Learned about how Zemanta is already putting Linked Data and NLP together to make blogging better. Jure is excited about infinite monkeywrench and might be brave enough to pre-alpha its inchoate HTML munger. Got to hear what Blaine Cook of Osmosoft is doing to solve the fractured twitter/facebook/identi.ca/500M-person-strong-local-social-networks-you’ve-never-heard-of ecosystem, and he gave some great feedback on our upcoming Twitter Census. Also got to learn, after pontificating that OAuth is hard, that I was talking to its architect; a great discussion with Blaine and ENTP Uruguay Evan Henshaw-Plath followed about the Rails authorization/identity/authentication stack.

Mike Migurski of @stamen is going to get together with infochimp @dhruvbansal to push the Open Street Maps dataset into Amazon’s Public Data Sets collection. Harper Reed of Threadless was running off for a 6am (ugh) flight to babysit servers in Chicago by the time we chatted, but pointed towards his Chicago Transit API project. His post on Hidden APIs is a great read BTW. Ran into @Slicehost Matt Tanase at a party after; Rackspace is getting much Cloud-ier, including a 1.5 cents/hour pay-as-you-go 256MB slice offering. I’m hoping to talk later about our MachetEC2 project and get his thoughts about how to put open data on tap in the cloud. Jon Pierce and I discussed the Mets’ chances this year and what he sees for big data startup possibilities. Only got to briefly intersect with Andrew Turner about open geocommons, and was chagrined to learn I was shoulder to shoulder with someone from gnip but didn’t get to chat. Hope to fix that later.

This meeting alone made SxSW worth it, and I’m looking forward to more discussion later. You can stalk me on twitter as @mrflip or at http://sxsw2009.sched.org/flip. By the way, I’m giving a lightning talk on Open Data in government at Fiddler’s Hearth, 301 Barton Springs Rd at 12:30 — drop by or catch the webcast later.

Written by mrflip

16 Mar 2009 at 9:13 am

Posted in infochimps.org

Amazon Web Services hosts DBpedia, Freebase data sets

The Infochimps.org community played a part in pushing the DBpedia and Freebase data sets to Amazon Web Services. This is an auxiliary effort by Infochimps.org to increase access to data. It is important to have the data in places where there are the right tools for people to use it. AWS is the place: look at creating an Amazon Machine Image to start working with the new data sets. Our MachetEC2 can help; please let us know how your experience was in using it.

Thanks to Kingsley Idehen with Linked Open Data for being a good point of contact. 

We will upload more data sets to AWS in the near future.  Any requests?

Written by kellyjoseph

26 Feb 2009 at 4:05 pm

Start hacking: machetEC2 released!

with 8 comments

machetEC2, the Infochimps Amazon Machine Image (AMI) designed for data processing, analysis, and visualization, has been released!

Amazon’s Cloud Computing services give you transformatively cheap and scalable computing power, and their Public Data Sets (AWS/PDS) collection (which infochimps is contributing to) is helping to put the world of free, open data at your fingertips.  MachetEC2 lets you summon a “batteries included” computer — or a hundred computers — from the cloud.  As soon as it loads, you’re ready to start crunching and transforming and visualizing data, whether from AWS/PDS, or infochimps.org, or your own pool.

When you SSH into an instance of machetEC2 (brief instructions after the jump), check the README files: they describe what’s installed, how to deal with volumes and Amazon Public Datasets, and how to use X11-based applications. You can also visit the machetEC2 GitHub page to see the full list of packages installed, the list of gems, and the list of programs installed from source.

This machete is only as sharp as it is complete. If there’s software that you find indispensable, we encourage you to suggest it here, or even better to help add it to the toolkit (instructions are within).

Read the rest of this entry »

Written by dhruvbansal

6 Feb 2009 at 4:03 pm

Posted in machetEC2

Hacking through the Amazon with a shiny new MachetEC2

with 11 comments

Hold on to your pith helmets: the Infochimps are releasing an Amazon Machine Image designed for data processing, analysis, and visualization.

Amazon’s Elastic Compute Cloud (EC2) allows users to instantiate a virtual computer with a pre-installed operating system, software packages, and up to 1 TB of data loaded on disk, ready to work with, from a shared image (an “Amazon Machine Image”, or AMI).

MachetEC2 is an effort by a group of Infochimps to create an AMI for data processing, analysis, and visualization. If you create an instance of MachetEC2, you’ll have an environment with tools designed for working with data ready to go. You can load in your own data, grab one of our datasets, or try grabbing the data from one of Amazon’s Public Data Sets. No matter what, you’ll be hacking in minutes.

We’re taking suggestions for what software the community would be most interested in having installed on the image (peek inside to see what we’ve thought of so far…)

Read the rest of this entry »

Written by dhruvbansal

28 Jan 2009 at 11:07 am

Posted in infochimps.org

Twittersong

Took the 50M twitter messages we saw between mid-November and mid-January and used Wordle to make a word cloud:  http://bit.ly/tweetcloud Fun!

(If you’re not familiar with a word cloud: the larger a word, the more often it was used. The colors & positions don’t mean anything, they’re just for fun. We stripped out the little words (a, the, with, …), leaving everything that appeared more than 10,000 times in the 50 million+ tweets we examined.)
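The filtering described above comes down to a few lines. This is only a sketch of the idea, not the script we ran: the tokenizing regex, the tiny stop-word list, and the function name are assumptions, and the real job needs something that can chew through 50 million messages.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list; a real one is much longer.
STOPWORDS = {"a", "the", "with", "and", "of", "to", "in", "is"}


def frequent_words(messages, min_count, stopwords=STOPWORDS):
    """Return (word, count) pairs for words used more than min_count times,
    most frequent first, after dropping the stop words."""
    counts = Counter()
    for message in messages:
        # Lowercase, then keep runs of letters, digits and apostrophes.
        for word in re.findall(r"[a-z0-9']+", message.lower()):
            if word not in stopwords:
                counts[word] += 1
    return [(word, n) for word, n in counts.most_common() if n > min_count]


# Toy messages for illustration only.
messages = ["Love the snow", "snow with a video", "The snow is awesome"]
print(frequent_words(messages, min_count=1))
```

With the real data the threshold would be 10,000 rather than 1, and the resulting list is what goes into the word cloud.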

Then I looked again at the filtered list and noticed something… just awesome.

Here are the forty most-commonly used words, in their exact order of decreasing frequency:

It’s time, Twitter. Love/Christmas blog:

Home! Thanks, people…

Night post:

Getting happy
watching morning
that’s tonight.
Tomorrow: looking news, trying nice? Check.

2009: Hope.
Week: 2008.

Little video:

snow.

Live free. Life. Awesome days!

Doing:

Feel house ready.
Look cool.
Sleep.
Yeah world!

I like your poem, Twitter.
A lot.

Read the rest of this entry »

Written by mrflip

22 Jan 2009 at 10:23 pm

The Asdrubal Cabrera Hall of Fame

Prompted by my friend’s skepticism that the ballplayer Milton Bradley is really so named, I’m exhuming this old post from elsewhere. — flip

During the 2007 baseball playoffs, announcer Tim McCarver perspicaciously observed that “Asdrubal Cabrera is the only player in the majors with that first name”. Thus inspired, I present The Asdrubal Cabrera Hall of Fame: Major League ballplayers in unique possession of their particular first name. (Some are nicknames, many are not — but these are their official names, as used in newspapers and the rolls of history. F’reals.)

You may be familiar with Honus Wagner, Eppa Rixey, Boog Powell or Yogi Berra. But have you heard recounted the storied diamond exploits of Firpo Mayberry, Zoilo Versalles, Pi Schwert or Bevo LeBourveau? OK, then how about Mysterious Walker, The Only Nolan, or Phenomenal Smith? Mul Holland, Sixto Lezcano, Welcome Gaston or Mox McQuery? There’s a bunch more after the jump, and a complete listing here, including links to each player’s baseball reference page.

For some dinnertime fun over the holidays, discuss the relative merits of naming your next child after Urban Shocker, Twink Twining, Pussy Tebeau, Bris Lord, Boob Fowler, Crazy Schmit, Creepy Crespi, Cuddles Marshall, Vinegar Bend Mizell, or Buttercup Dickerson. (Unfortunately, 12 other “Rusty”s keep fan favorite Rusty Kuntz off this list, and believe it or not two other “Stubby”s bar the way for Stubby Clapp. I apologize to anyone whose internet filter has or has not prevented reading this apology.)

Thanks to the Baseball Databank and Retrosheet, I had this dataset on hand, and thanks to a monastic life of nerdity I had the SQL chops to pull up this query between innings.  But I should be able to do this with anything, whether or not I know a SQL Query from a Queer-Eye Sequel, for silly stunts and for changing lives alike.
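For the curious, the shape of such a query is roughly the following: group the players by first name and keep the groups of size one. This is a sketch, not the query I ran. The table and column names (players, name_first, name_last) are hypothetical stand-ins for the Baseball Databank schema, and the four rows are a toy sample loaded into an in-memory SQLite database.

```python
import sqlite3

# Toy in-memory table; the table and column names are hypothetical stand-ins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name_first TEXT, name_last TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [
        ("Asdrubal", "Cabrera"),
        ("Rusty", "Kuntz"),
        ("Rusty", "Staub"),
        ("Yogi", "Berra"),
    ],
)

# Group by first name and keep only the names held by exactly one player.
# With a single row per group, MIN(name_last) is simply that player's last name.
query = """
    SELECT name_first, MIN(name_last) AS name_last
    FROM players
    GROUP BY name_first
    HAVING COUNT(*) = 1
    ORDER BY name_first
"""
unique_names = conn.execute(query).fetchall()
print(unique_names)
```

In this toy sample the two Rustys knock each other out, which is the same fate that keeps Rusty Kuntz off the real list.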

Imagine instead I were a public health expert, interested in the effects of limiting medical residents to an 80-hour work week. Might lives be saved if I could effortlessly pull up historical data on rates of doctor-induced complications, board of medicine complaints, relative rates of med school and law school applications, and open-government data on medical regulations?

The long-term mission of infochimps.org is to democratize this: to put the world’s analytic data at our fingertips, supporting tools that let anyone manipulate, interrogate, visualize and explore that data.  Giving baseball geeks a chance to show up Tim McCarver isn’t much of a start, but here we are.

More awesome first names after the jump….

Read the rest of this entry »

Written by mrflip

11 Jan 2009 at 5:59 pm

Massive Scrape of Twitter’s Friend Graph

with 25 comments

UPDATE:

We’ve taken the data down for the moment, at Twitter’s request. STAY CALM. They want to support research on the twitter graph, but feel that since this is users’ data there should be terms of use in place. We’ve taken the data down while those terms are formulated. I pass along from @ev: “Thank you for your patience and cooperation.”


The infochimps have gathered a massive scrape of the Twitter friend graph.  Right now it weighs in at

  • about 2.7M users: we have most of the “giant component”
  • 10M tweets
  • 58M edges

(These and other details will be updated as further drafts are released. See below for technical info).  This is still in rough, rough draft but this dataset is so amazingly rich we couldn’t help sharing it.  We have not done all the double-checking we’d like, and the field order will change in the next (12/30) rev.  We’ll also have a much larger dump of tweets off the public datamining feed.
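If the “giant component” is unfamiliar: it is the largest set of users who can all reach one another by hopping along edges. Here is a minimal sketch of finding it from an edge list with a breadth-first search. It assumes follow edges are treated as undirected, which is an assumption made for this example, and the function name and toy edges are illustrative rather than part of the dataset or its tools.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return the set of nodes in the largest connected component,
    treating every (a, b) edge as undirected."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    best = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search outward from an unvisited node.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return best


# Toy edges for illustration only.
edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(sorted(largest_component(edges)))
```

At 58M edges you would want something more memory-conscious than Python sets, but the logic is the same.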

The data is offline at the moment pending some TOS from twitter.com. If you’re interested in hearing when it’s released, follow the low-traffic @infochimps on twitter or look for a post here.

Big huge thanks to twitter.com: they have given us permission to share this freely. Please go build tools with this data that make both twitter.com and yourself rich and famous: then more corporations will free their data.

Read the rest of this entry »

Written by mrflip

29 Dec 2008 at 8:55 pm

Geography of Newspaper Endorsements in the 2008 US Presidential Election

with 36 comments

First in a series of visualizations / experiments with interconnecting datasets:

Geography of Newspaper Endorsements in the 2008 US Presidential Election

Apart from the unsurprising evidence that (choose one: [[Obama is the overwhelming choice]] -OR- [[there is overwhelming liberal media bias]]), I’m struck by the mismatch between papers’ endorsements and their “Red State” vs “Blue State” alignment.

  • I think the amount of red in the blue states is a market effect. If you’re the Boston Herald, there’s no percentage in agreeing with the Boston Globe; similarly Daily News vs New York Post, SF Examiner vs SF Chronicle. That’s why the Tribune endorsement, even accounting for hometown bias, is so striking. I don’t mean that they’re cynically pandering; rather that in a market with multiple papers, readers and journalists are efficiently sorted into two separate camps. (And the axis doesn’t have to be political: though the Chronic and the Statesman are politically distinct I see their main difference being lifestyle vs. traditional news).
  • The amount of blue in the red states highlights how foolishly incomplete the “Red State/Blue State” model is for anything but electoral college returns. The largest part of the Red/Blue split is Rural/Urban: look at the electoral cartogram for the last election and almost every city is blue, even in the south and mountain; and almost all of our rural area is red. The exceptions, chiefly Dallas, Houston and Boise, stand noticeably alone as having red unpaired with blue. (Though in this election even the Houston Chronicle is endorsing Obama.) I’m going to try to make a map colored by county, but there are no good off-the-shelf tools for doing this (that I’ve found).

This seems to speak to why so many on the right feel there’s a MSM bias: 50% of the country is urban, 50% rural, but newspapers are located exclusively in urban areas [see below]. So, surprisingly, the major right-leaning papers are all located in parts of the country we consider highly leftish. The urban areas that are the largest are thus both the most liberal and the most likely to have a sizeable conservative target audience.

Read the rest of this entry »

Written by mrflip

22 Oct 2008 at 10:14 am