Open a banana like a Monkey does
Open a Banana like a Monkey – most human primates do it wrong!
To go with the open banana, here is some open banana data:
- The USDA Nutrient Database will help you find the nutritional value of a banana (online search | infochimps entry)
- Per Capita Consumption of Major Food Commodities: 1980 to 2005
- Fresh Fruits and Vegetables–Supply and Use: 2000 to 2006
It’s Hot, Damn Hot. So Hot I saw a Chimp in Orange Robes Burst into Flames.
It’s been ridiculously hot ridiculously early this year in Austin. A friend passed along this link to a visualization of 100+ degree days over the last 10 years. The author couldn’t find data extending back farther than 2000, but luckily I knew where to look.
I pulled the NCDC weather for Austin from 1948-present (see infochimps.org link for details) and got my Tufte on.
This temperature cycle is hotter than but comparable to the 1950-1965 era. I’ve got no idea if it’s global warming or the peak of a cycle. The fundamental conclusion — that this year so far, 2000 and 2008 were damn hot — stands up well.
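For anyone who wants to try the same tally, here is a minimal sketch of counting 100+ degree days per year. It is not the code behind the chart above. The record layout (a date paired with a daily high in Fahrenheit) and the function name `hot_days_per_year` are assumptions for illustration, and the sample readings are made up, not NCDC data.

```python
from collections import Counter
from datetime import date


def hot_days_per_year(records, threshold_f=100.0):
    """Count, per year, the days whose daily high met or exceeded threshold_f.

    records: iterable of (date, tmax_f) pairs (an assumed layout);
    tmax_f may be None when a reading is missing.
    """
    counts = Counter()
    for day, tmax_f in records:
        # Skip missing readings rather than treating them as cool days.
        if tmax_f is not None and tmax_f >= threshold_f:
            counts[day.year] += 1
    return dict(counts)


# Made-up sample readings, for illustration only.
sample = [
    (date(2008, 6, 1), 101.0),
    (date(2008, 6, 2), 99.0),
    (date(2008, 6, 3), 100.0),
    (date(2009, 6, 1), 103.0),
    (date(2009, 6, 2), None),
]

print(hot_days_per_year(sample))  # {2008: 2, 2009: 1}
```

Years with no qualifying days are simply absent from the result, so fill in zeros before plotting a long time series.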
Congrats Retrosheet – another decade of rich Baseball data online
Congrats to Retrosheet, who now have full major-league baseball box scores from 1920-1930 online! (This is in addition to full box score coverage for 1953-2008, and broad coverage of box scores and play-by-play data from 1871-2008.) As Nate Silver has said, “Baseball is the perfect dataset”, and we would not have this astonishingly rich and detailed dataset if not for the dedicated crowdsourcing efforts of the Retrosheet team.
Infochimps metadata entries for these datasets:
- Box Scores
- Game Logs (play-by-play)
- Ballparks, 1903-current
- Transactions, 1873-current
- Awards and Honors
Freebase Hack Day & Updates
Our friends at Freebase are having another Hack Day in San Francisco this July. It’s only two weeks away now and the remaining tickets can go fast, so get involved at http://blog.freebase.com/2009/06/26/two-weeks-til-freebase-hack-day-sign-up-now/.
Learn about the many cool things that Freebase is doing with their data, and the tools that can be built using their platform.
On a side note, http://infochimps.org/ has gotten a facelift. We’d love feedback on it: info@infochimps.org. We hope your browsing experience is better, and we will be happy to roll out new features soon!
What’s New
Infochimps has been acknowledged as a finalist by the Capital Factory for 2009.
Infochimps is also a finalist in PepsiCo’s pitch competition.
Infochimps has a Facebook page! Become a fan.
Katherine at The New Civilization is aiding us in UX design for our Beta, to be launched at the end of May. Eve Simon in Washington DC is helping us with the site design. Our two big goals for the Beta are:
1) Improved browseability of the datasets, including a search bar and better surfing through tags, categories, and collections.
2) Uploading capability. Users will be able to create accounts and upload datasets, as well as edit the descriptions of other data on the site.
Drop us a line anytime at info@infochimps.org
@mrflip’s OpenGov Talk: Data Commons and Transparent Government
Here is my (mrflip’s) SxSW OpenGov talk, “How Open Data will help build Open Government“:
There is nothing more painful than watching yourself talk. So I haven’t gone all the way through this video — if you see me don’t give away the ending. Huge thanks to Silona Bonewald (League of Technical Voters) for organizing this, and to Terry Walhus (spring.net) for taping and copying and editing and uploading the videos.
I love it when a plan comes together…
So Simon Willison (@simonw), one of the architects of The Guardian’s Open Platform and co-creator of a modestly popular web framework, is here at SxSW and gave an informal talk (on Zeppelins, of course – what else?). Freebase community manager Kirrily Robert (@skud) saw my tweet and proposed a meetup. After iteratively solving the three-body problem, we put out the word on Sunday morning for a meetup on Sunday evening… SemWebAustin @juansequeda and Freebase @jameshome each pinged their 1-neighborhood and next thing you know I’m sitting next to Jure Cuhalev of Zemanta and machine learning machine @Nikete trying to orchestrate overflow seating for 25+ data geeks.
The reason for the gossip-column style of this post is to show the size and breadth of the data geek crowd. James Home and I agree that we need to turn out this Cyrus’ army of data geeks to take over a much larger part of SxSW next year. We need talks on column-store databases and hadoop, linked data and the construction of the data commons, how NLP and machine learning can power inspiring audience-driven websites, on the developing grammar of Information Visualization, on Processing and Prefuse and R. Pete Skomoroch, Mike Driscoll and Christian Chabot all ended up skipping SxSW this year; we need them leading a panel discussion on how to visualize >10M point datasets with limited-bandwidth desktop and web interfaces. I’d like to hear Deepak Singh and one of the @cloudera’ns drop science about scalable cloud computing.
The evening was just informal mingling and conversation, but at the request of @mndoci and @dataspora, here is our name-droppy slice of the whirlwind:
@mrflip: Learned about how Zemanta is already putting Linked Data and NLP together to make blogging better. Jure is excited about infinite monkeywrench and might be brave enough to pre-alpha its inchoate HTML munger. Got to hear what Blaine Cook of Osmosoft is doing to solve the fractured twitter/facebook/identi.ca/500M-person-strong-local-social-networks-you’ve-never-heard-of ecosystem, and he gave some great feedback on our upcoming Twitter Census. Also got to learn, after pontificating that OAuth is hard, that I was talking to its architect; a great discussion with Blaine and ENTP Uruguay Evan Henshaw-Plath followed about the Rails authorization/identity/authentication stack.
Mike Migurski of @stamen is going to get together with infochimp @dhruvbansal to push the Open Street Maps dataset into Amazon Public Data Sets collection. Harper Reed of Threadless was running off for a 6am (ugh) flight to babysit servers in Chicago by the time we chatted, but pointed towards his Chicago Transit API project. His post on Hidden APIs is a great read BTW. Ran into @Slicehost Matt Tanase at a party after; Rackspace is getting much Cloud-ier, including a 1.5cents/hour pay-as-you-go 256MB slice offering. I’m hoping to talk later about our MachetEC2 project and get his thoughts about how to put open data on tap in the cloud. Jon Pierce and I discussed the Mets’ chances this year and what he sees for big data startup possibilities. Only got to briefly intersect with Andrew Turner about open geocommons, and was chagrined to learn I was shoulder to shoulder with one of gnip but didn’t get to chat. Hope to fix that later.
This meeting alone made SxSW worth it, and I’m looking forward to more discussion later. You can stalk me on twitter as @mrflip or at http://sxsw2009.sched.org/flip. By the way, I’m giving a lightning talk on Open Data in government at Fiddler’s Hearth, 301 Barton Springs Rd at 12:30 — drop by or catch the webcast later.
Amazon Web Services hosts DBpedia, Freebase data sets
The Infochimps.org community played a part in pushing the DBpedia and Freebase data sets to Amazon Web Services. This is an auxiliary effort by Infochimps.org to increase access to data. It is important to have the data in places where there are the right tools for people to use it, and AWS is such a place: look at creating an Amazon Machine Image to start working with the new data sets. Our MachetEC2 can help; please let us know how your experience with it goes.
Thanks to Kingsley Idehen with Linked Open Data for being a good point of contact.
We will upload more data sets to AWS in the near future. Any requests?
Start hacking: machetEC2 released!
machetEC2, the Infochimps Amazon Machine Image (AMI) designed for data processing, analysis, and visualization, has been released!
Amazon’s Cloud Computing services give you transformatively cheap and scalable computing power, and their Public Data Sets (AWS/PDS) collection (which infochimps is contributing to) is helping to put the world of free, open data at your fingertips. MachetEC2 lets you summon a “batteries included” computer — or a hundred computers — from the cloud. As soon as it loads, you’re ready to start crunching and transforming and visualizing data, whether from AWS/PDS, or infochimps.org, or your own pool.
When you SSH into an instance of machetEC2 (brief instructions after the jump), check the README files: they describe what’s installed, how to deal with volumes and Amazon Public Datasets, and how to use X11-based applications. You can also visit the machetEC2 GitHub page to see the full list of packages installed, the list of gems, and the list of programs installed from source.
This machete is only as sharp as it is complete. If there’s software that you find indispensable, we encourage you to suggest it here, or even better to help add it to the toolkit (instructions are within).
Hacking through the Amazon with a shiny new MachetEC2
Hold on to your pith helmets: the Infochimps are releasing an Amazon Machine Image designed for data processing, analysis, and visualization.
Amazon’s Elastic Compute Cloud (EC2) allows users to instantiate a virtual computer with a pre-installed operating system, software packages, and up to 1 TB of data loaded on disk, ready to work with, from a shared image (an “Amazon Machine Image”, or AMI).
MachetEC2 is an effort by a group of Infochimps to create an AMI for data processing, analysis, and visualization. If you create an instance of MachetEC2, you’ll have an environment with tools designed for working with data ready to go. You can load in your own data, grab one of our datasets, or try grabbing the data from one of Amazon’s Public Data Sets. No matter what, you’ll be hacking in minutes.
We’re taking suggestions for what software the community would be most interested in having installed on the image (peek inside to see what we’ve thought of so far…)
The Asdrubal Cabrera Hall of Fame
Prompted by my friend’s skepticism that the ballplayer Milton Bradley is really so named, I’m exhuming this old post from elsewhere. — flip
During the 2007 baseball playoffs, announcer Tim McCarver perspicaciously observed that “Asdrubal Cabrera is the only player in the majors with that first name”. Thus inspired, I present The Asdrubal Cabrera Hall of Fame: Major League ballplayers in unique possession of their particular first name. (Some are nicknames, many are not — but these are their official names, as used in newspapers and the rolls of history. F’reals.)
You may be familiar with Honus Wagner, Eppa Rixey, Boog Powell or Yogi Berra. But have you heard recounted the storied diamond exploits of Firpo Mayberry, Zoilo Versalles, Pi Schwert or Bevo LeBourveau? OK, then how about Mysterious Walker, The Only Nolan, or Phenomenal Smith? Mul Holland, Sixto Lezcano, Welcome Gaston or Mox McQuery? There’s a bunch more after the jump, and a complete listing here, including links to each player’s baseball reference page.
For some dinnertime fun over the holidays, discuss the relative merits of naming your next child after Urban Shocker, Twink Twining, Pussy Tebeau, Bris Lord, Boob Fowler, Crazy Schmit, Creepy Crespi, Cuddles Marshall, Vinegar Bend Mizell, or Buttercup Dickerson. (Unfortunately, 12 other “Rusty”s keep fan favorite Rusty Kuntz off this list, and believe it or not two other “Stubby”s bar the way for Stubby Clapp. I apologize to anyone whose internet filter has or has not prevented reading this apology.)
Thanks to the Baseball Databank and Retrosheet, I had this dataset on hand, and thanks to a monastic life of nerdity I had the SQL chops to pull up this query between innings. But I should be able to do this with anything, whether or not I know a SQL Query from a Queer-Eye Sequel, for silly stunts and for changing lives alike.
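For the curious, a query of this flavor can be sketched in a few lines. This is not the original query, which isn't shown here. The `players` table, its `name_first` and `name_last` columns, and the helper `unique_first_names` are hypothetical stand-ins rather than the Baseball Databank's actual schema, and the four rows are a toy sample.

```python
import sqlite3

# Keep only first names that belong to exactly one player.
# Table and column names are hypothetical, chosen for this sketch.
UNIQUE_FIRST_NAMES_SQL = """
    SELECT name_first, MIN(name_last) AS name_last
    FROM players
    GROUP BY name_first
    HAVING COUNT(*) = 1
    ORDER BY name_first
"""


def unique_first_names(rows):
    """Load (first, last) name pairs into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE players (name_first TEXT, name_last TEXT)")
    conn.executemany("INSERT INTO players VALUES (?, ?)", rows)
    result = conn.execute(UNIQUE_FIRST_NAMES_SQL).fetchall()
    conn.close()
    return result


# Toy sample, for illustration only.
sample = [
    ("Asdrubal", "Cabrera"),
    ("Rusty", "Kuntz"),
    ("Rusty", "Staub"),
    ("Zoilo", "Versalles"),
]

print(unique_first_names(sample))
# [('Asdrubal', 'Cabrera'), ('Zoilo', 'Versalles')]
```

Since each surviving group has exactly one row, `MIN(name_last)` just hands back that player's last name; the two Rustys cancel each other out.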
Imagine instead I were a public health expert, interested in the effects of limiting medical residents to an 80-hour work week. Might lives be saved if I could effortlessly pull up historical data on rates of doctor-induced complications, board of medicine complaints, relative rates of med school and law school applications, and open-government data on medical regulations?
The long-term mission of infochimps.org is to democratize this: to put the world’s analytic data at our fingertips, supporting tools that let anyone manipulate, interrogate, visualize and explore that data. Giving baseball geeks a chance to show up Tim McCarver isn’t much of a start, but here we are.
More awesome first names after the jump….
Geography of Newspaper Endorsements in the 2008 US Presidential Election
First in a series of visualizations / experiments with interconnecting datasets:
Geography of Newspaper Endorsements in the 2008 US Presidential Election
Apart from the unsurprising evidence that (choose one: [[Obama is the overwhelming choice]] -OR- [[there is overwhelming liberal media bias]]), I’m struck by the mismatch between papers’ endorsements and their “Red State” vs “Blue State” alignment.
- I think the amount of red in the blue states is a market effect. If you’re the Boston Herald, there’s no percentage in agreeing with the Boston Globe; similarly Daily News vs New York Post, SF Examiner vs SF Chronicle. That’s why the Tribune endorsement, even accounting for hometown bias, is so striking. I don’t mean that they’re cynically pandering; rather that in a market with multiple papers, readers and journalists are efficiently sorted into two separate camps. (And the axis doesn’t have to be political: though the Chronic and the Statesman are politically distinct, I see their main difference being lifestyle vs. traditional news.)
- The amount of blue in the red states highlights how foolishly incomplete the “Red State/Blue State” model is for anything but electoral college returns. The largest part of the Red/Blue split is Rural/Urban: look at the electoral cartogram for the last election and almost every city is blue, even in the south and mountain states, and almost all of our rural area is red. The exceptions, chiefly Dallas, Houston and Boise, stand noticeably alone as having red unpaired with blue. (Though in this election even the Houston Chronicle is endorsing Obama.) I’m going to try to make a map colored by county, but there are no good off-the-shelf tools for doing this (that I’ve found).
This seems to speak to why so many on the right feel there’s an MSM bias: 50% of the country is urban, 50% rural, but newspapers are located exclusively in urban areas [see below]. So, surprisingly, the major right-leaning papers are all located in parts of the country we consider highly leftish. The urban areas that are the largest are thus both the most liberal and the most likely to have a sizeable conservative target audience.
