Hadoop World 2010 & New Propaganda 25 days ago. by mrflip

Yay! Infochimps is going to Hadoop World 2010. Watch out New York! I (flip) am giving a talk titled “Millionfold Mashups” — I’ll talk about how we store, process and analyze massively numerous datasets and datasets of massive size.

We’re going to order propaganda stickers to give out, and we want to get your feedback on which to print.

Favorites? Terrible puns of your own to add? Want us to send you a set? Let us know in the comments!

  • Live Fast and Leave a Beautiful Corpus at Infochimps.org
  • Where Hot Singles come to Dataset
  • Upload Yours.
  • Hadoop-de-doo for you
  • Dammit, No, the Other NLP
  • I’m Consistently Available. Want to see my Partition?
  • Intoxication by Miners is OK at infochimps.org
  • Fit your Curves at infochimps.org
  • Head in the Clouds?
  • Expose your Bits at infochimps.org
  • Support Vector Machines!
  • Free Variables
  • Everyone at our Datacenter has a Nice Rack
  • Bayesians Against Discrimination
  • Map Reduce, Map Reuse, Map Recycle
  • PAXOS in our time
  • Pro Axiom of Choice
  • Big Chimpin’
  • We have the most Cunning Linguists
  • P = NP
  • P != NP

Several of the slogans were shamelessly stolen from this protest by CMU Machine Learning researchers, which I love so much it hurts.

Our 7 Most Popular Data Categories 4 months, 23 days ago. by maegan

Because we run a marketplace for data, people often ask us which types of data are the most popular. Interestingly, the answer isn’t very intuitive. For example, who would have guessed that one of our most downloaded datasets is a crossword puzzle word list? With this in mind, we decided to investigate and come up with a more complete answer.

Using metrics from the Infochimps website, such as search queries, downloaded content, data requests, and page views, we came up with a list of the 7 most searched-for data categories:

1. Social Networks
2. Economics & Finance
3. Demographics
4. Education
5. Sports
6. Geography
7. Music
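For the curious, here is a minimal sketch of one way several site metrics could be combined into a single ranking. It is not the method we actually used. The function name `rank_categories`, the metric names, the placeholder categories, and every number are assumptions made up for illustration. Each metric is scaled by its largest value so that no single metric dominates, and the scaled values are summed per category.

```python
def rank_categories(metrics):
    """Rank categories by the sum of their max-normalized metric values.

    metrics: {metric_name: {category: raw_count}} (hypothetical layout).
    Returns category names, highest combined score first.
    """
    scores = {}
    for per_category in metrics.values():
        if not per_category:
            continue  # nothing recorded for this metric
        top = max(per_category.values())
        if top == 0:
            continue  # avoid dividing by zero when a metric is all zeros
        for category, count in per_category.items():
            scores[category] = scores.get(category, 0.0) + count / top
    # Sort by score descending, then by name so ties are deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Made-up placeholder numbers, not real Infochimps metrics.
sample_metrics = {
    "searches": {"category_a": 100, "category_b": 50, "category_c": 10},
    "downloads": {"category_a": 10, "category_b": 40, "category_c": 30},
}

print(rank_categories(sample_metrics))
```

A plain sum treats every metric as equally important; weighting, say, downloads above page views would be a reasonable variation.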

Why is this list important?
Many different people may find this information useful, from data sellers who want to know what kinds of data people would buy, to trend watchers tracking what is popular, to anyone curious about what types of data are out there.

What can we learn from this list?
One takeaway is that data is becoming more mainstream. The data people search for isn’t limited to the academically oriented information that traditional data users (such as researchers) rely on. Categories like sports and music interest a wider audience.

Furthermore, there is demand for these types of data, but not always a supply. Even as more people recognize how useful data is, the infrastructure for exchanging it is still forming. Infochimps is one of the platforms that aims to bridge that gap by providing a venue for data exchange, with features like the dataset request page. If you’ve got data in these categories, know that there are people out there who want it.

SXSW Data Panels 1 year, 0 months ago. by Joseph Kelly

We are especially excited to announce that big data is coming to SXSW.  Here are the panels we like:

Pete Skomoroch of DataWrangling: Petabyte As Platform, Making Big Data Accessible Online – We have long been fans of Pete Skomoroch’s work; this is your chance to hear from him about web applications built on massive datasets.

Our own mrflip: Scraping the Social Web – Flip has done extensive work building massive datasets from social media sites.  Hear him talk about the nuances involved and ask him about best practices.

Michael Driscoll of Dataspora: Cloud Crunching Big Data with HIVE/Hadoop and R and Become a Sexy Data Geek in One Week – Another friend of ours, Michael, will be talking about how to use the right tools to massage and produce results from big datasets, and will profile what you need to do to be a data geek.

Stu Hood of Rackspace: Using Hadoop to Manage a Ton of Data – Hadoop might be the most important tool to know for working with terabytes and terabytes of data.

Ian Davis of Talis: Set Your Data Free – Talis does great work.  Listen to Ian cover topics very relevant to Infochimps.org’s collection: data copyright and licensing.

Dave Bowker of Designing the News: Engaging Data Visualizations and Infographic Communication – Glad to see some data viz stuff at SXSW.

Casey Caplowe of GOOD: Interactive Infographics – More visualizations, GOOD stuff.

Leave a comment if you know of any other good ones.

Infochimps receives a donation from Smart Bear 1 year, 1 month ago. by Joseph Kelly

Smart Bear Software is an Austin-based company whose founder, Jason Cohen, is one of our favorite people.  Jason grew Smart Bear from the ground up, and he has helped the Infochimps team in the past with practical advice.  Jason blogs about marketing and small business at http://blog.asmartbear.com/ and he is well worth reading.  

The Infochimps team relies on agile methods to build Infochimps.org, a process that can benefit from a code review tool.  Smart Bear’s product, Code Collaborator, is a well-known online peer code review tool that simplifies and expedites code reviews, helping teams produce higher-quality, tested, finished code more efficiently.

Smart Bear’s latest promotion offered 5 seats of one of their code review tools for $5.  As a part of this promotion, they selected a start-up company to receive the funds collected from the promotion.  Infochimps won!  Smart Bear has graciously donated $2220 to Infochimps to help our mission of increasing the world’s access to data.  We appreciate their acknowledgment of our work and we know we can put the funds to good use.

To see how we reacted to the news, check out the video below:

[youtube=http://www.youtube.com/watch?v=ZLtR8_qw_yM&hl=en&fs=1&]

Open a banana like a Monkey does 1 year, 1 month ago. by mrflip

Open a Banana like a Monkey – most human primates do it wrong!

[youtube=http://www.youtube.com/watch?v=nBJV56WUDng]

To go with the open banana, here is open banana data:

It's Hot, Damn Hot. So Hot I saw a Chimp in Orange Robes Burst into Flames. 1 year, 1 month ago. by mrflip

It’s been ridiculously hot ridiculously early this year in Austin. A friend passed along this link to a visualization of 100+ degree days over the last 10 years. The author couldn’t find data extending back farther than 2000, but luckily I knew where to look.

I pulled the NCDC weather for Austin from 1948-present (see infochimps.org link for details) and got my Tufte on.

This temperature cycle is hotter than but comparable to the 1950-1965 era. I’ve got no idea if it’s global warming or the peak of a cycle. The fundamental conclusion — that this year so far, 2000 and 2008 were damn hot — stands up well.
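If you want to try the same tally yourself, here is a minimal sketch of counting 100+ degree days per year. It assumes you have already parsed the weather file into `(date, daily high in °F)` pairs. The function name `hot_days_per_year` and the sample rows are made up for illustration; they are not the real NCDC file layout or real observations.

```python
from collections import Counter


def hot_days_per_year(records, threshold_f=100.0):
    """Count the days in each year whose daily high is at or above threshold_f.

    records: iterable of (iso_date_string, max_temp_f) pairs, where
    max_temp_f may be None for a missing observation (assumed layout).
    """
    counts = Counter()
    for date_str, max_temp_f in records:
        if max_temp_f is None:
            continue  # skip days with no recorded high
        if max_temp_f >= threshold_f:
            counts[int(date_str[:4])] += 1  # the year is the first 4 characters
    return dict(counts)


# Made-up sample rows, not real NCDC observations.
sample = [
    ("1954-07-01", 101.0),
    ("1954-07-02", 99.0),
    ("2000-08-30", 104.0),
    ("2000-08-31", 100.0),
    ("2008-06-15", None),
]

print(hot_days_per_year(sample))
```

Years with no qualifying days simply don't appear in the result, so fill in zeros before plotting if you want an unbroken time axis.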

Congrats Retrosheet – another decade of rich Baseball data online 1 year, 1 month ago. by mrflip

Congrats to Retrosheet, who now have full major-league baseball box scores from 1920-1930 online! (This is in addition to full box score coverage for 1953-2008, and broad coverage of box scores and play-by-play data from 1871-2008.) As Nate Silver has said, “Baseball is the perfect dataset,” and we would not have this astonishingly rich and detailed dataset if not for the dedicated crowdsourced efforts of the Retrosheet team.

Infochimps metadata entries for these datasets:

Start-up Checklist 1 year, 2 months ago. by mrflip

On Jessica Hagy’s “Indexed”, the “Start-up Checklist”:

What's New 1 year, 4 months ago. by Joseph Kelly

Infochimps has been named a 2009 finalist by the Capital Factory.

Infochimps is also a finalist in PepsiCo’s pitch competition.

Infochimps has a Facebook page! Become a fan.

Katherine at The New Civilization is aiding us in UX design for our Beta, to be launched at the end of May.  Eve Simon in Washington DC is helping us with the site design.  Our two big goals for the Beta are:

1) Improved browsability of the datasets, including a search bar and better surfing through tags, categories, and collections.

2) Uploading capability.  Users will be able to create accounts and upload datasets, as well as edit the descriptions of other data on the site.

Drop us a line anytime at info@infochimps.org

@mrflip's OpenGov Talk: Data Commons and Transparent Government 1 year, 4 months ago. by mrflip

Here is my (mrflip’s) SxSW OpenGov talk, “How Open Data will help build Open Government”:

[youtube=http://www.youtube.com/watch?v=4Aprr9a6XEM]

There is nothing more painful than watching yourself talk, so I haven’t gone all the way through this video. If you see me, don’t give away the ending. Huge thanks to Silona Bonewald (League of Technical Voters) for organizing this, and to Terry Walhus (spring.net) for taping and copying and editing and uploading the videos.