
Our 7 Most Popular Data Categories 4 months, 23 days ago. by maegan

As a marketplace for data, we are often asked which types of data are the most popular. Interestingly, the answer isn’t very intuitive. For example, who would have guessed that one of our most downloaded datasets is a crossword puzzle word list? With this in mind, we decided to investigate and come up with a more complete answer.

Using metrics from the Infochimps website, such as search queries, downloaded content, data requests, and page views, we came up with a list of the 7 most searched-for data categories:

1. Social Networks
2. Economics & Finance
3. Demographics
4. Education
5. Sports
6. Geography
7. Music

Why is this list important?
Many different people may find this information useful, from data sellers who want to know what kinds of data people would buy, to trend watchers who are tracking what is popular, to anyone who is curious about what types of data are out there.

What can we learn from this list?
One thing we can take away from this list is that data is becoming more mainstream. The data being searched for isn’t limited to the academically oriented information that traditional data users (such as researchers) would use. Categories like sports and music are of interest to a wider audience.

Furthermore, there is a demand for these types of data, but not always a supply. The usefulness of data is becoming more widely recognized, but the infrastructure to facilitate its exchange is still forming. Infochimps is one of the platforms that aims to bridge the gap, through functions like the dataset request page, by providing a venue for data exchange. If you’ve got data in these categories, know that there are people out there who want your data.

Congrats Retrosheet – another decade of rich Baseball data online 1 year, 1 month ago. by mrflip

Congrats to Retrosheet, who now have full major-league baseball box scores from 1920-1930 online! (This is in addition to full box score coverage for 1953-2008, and broad coverage of box scores and play-by-play data from 1871-2008.) As Nate Silver has said, “Baseball is the perfect dataset”, and we would not have this astonishingly rich and detailed dataset if not for the dedicated crowdsourced efforts of the Retrosheet team.

Infochimps metadata entries for these datasets:

Amazon Web Services hosts DBpedia, Freebase data sets 1 year, 6 months ago. by Joseph Kelly

The Infochimps.org community played a part in pushing the DBpedia and Freebase data sets to Amazon Web Services. This is an auxiliary effort by Infochimps.org to increase access to data. It is important to have the data in places where the right tools exist for people to use it. AWS is that place: consider creating an Amazon Machine Image to start working with the new data sets. Our MachetEC2 can help; please let us know how your experience using it goes.

Thanks to Kingsley Idehen with Linked Open Data for being a good point of contact. 

We will upload more data sets to AWS in the near future.  Any requests?

The Asdrubal Cabrera Hall of Fame 1 year, 7 months ago. by mrflip

Prompted by my friend’s skepticism that the ballplayer Milton Bradley is really so named, I’m exhuming this old post from elsewhere. — flip

During the 2007 baseball playoffs, announcer Tim McCarver perspicaciously observed that “Asdrubal Cabrera is the only player in the majors with that first name”. Thus inspired, I present The Asdrubal Cabrera Hall of Fame: Major League ballplayers in unique possession of their particular first name. (Some are nicknames, many are not — but these are their official names, as used in newspapers and the rolls of history. F’reals.)

You may be familiar with Honus Wagner, Eppa Rixey, Boog Powell or Yogi Berra. But have you heard recounted the storied diamond exploits of Firpo Mayberry, Zoilo Versalles, Pi Schwert or Bevo LeBourveau? OK, then how about Mysterious Walker, The Only Nolan, or Phenomenal Smith? Mul Holland, Sixto Lezcano, Welcome Gaston or Mox McQuery? There’s a bunch more after the jump, and a complete listing here, including links to each player’s baseball reference page.

For some dinnertime fun over the holidays, discuss the relative merits of naming your next child after Urban Shocker, Twink Twining, Pussy Tebeau, Bris Lord, Boob Fowler, Crazy Schmit, Creepy Crespi, Cuddles Marshall, Vinegar Bend Mizell, or Buttercup Dickerson. (Unfortunately, 12 other “Rusty”s keep fan favorite Rusty Kuntz off this list, and believe it or not two other “Stubby”s bar the way for Stubby Clapp. I apologize to anyone whose internet filter has or has not prevented reading this apology.)

Thanks to the Baseball Databank and Retrosheet, I had this dataset on hand, and thanks to a monastic life of nerdity I had the SQL chops to pull up this query between innings.  But I should be able to do this with anything, whether or not I know a SQL Query from a Queer-Eye Sequel, for silly stunts and for changing lives alike.
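
For the curious, here is a minimal sketch of what such a query might look like, run through Python's built-in sqlite3 module. The table name `players`, the columns `name_first` and `name_last`, and the handful of sample rows are assumptions made for illustration; the real Baseball Databank schema, and the query actually used for this post, may well differ.

```python
import sqlite3

# Hypothetical sample rows (first name, last name). A real run would load
# the full player table from the Baseball Databank instead.
SAMPLE_PLAYERS = [
    ("Asdrubal", "Cabrera"),
    ("Yogi", "Berra"),
    ("Rusty", "Kuntz"),
    ("Rusty", "Staub"),
    ("Stubby", "Clapp"),
    ("Stubby", "Overmire"),
]

# Group players by first name and keep only the groups with one member.
# MIN(name_last) is safe here because each surviving group has a single row.
UNIQUE_FIRST_NAMES_SQL = """
    SELECT name_first, MIN(name_last) AS name_last
    FROM players
    GROUP BY name_first
    HAVING COUNT(*) = 1
    ORDER BY name_first
"""


def unique_first_names(rows):
    """Return (first, last) pairs for players who alone hold their first name."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE players (name_first TEXT, name_last TEXT)")
        conn.executemany("INSERT INTO players VALUES (?, ?)", rows)
        return conn.execute(UNIQUE_FIRST_NAMES_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for first, last in unique_first_names(SAMPLE_PLAYERS):
        print(first, last)
```

With these sample rows, the two "Rusty"s and two "Stubby"s knock each other out, which is the same fate that keeps Rusty Kuntz and Stubby Clapp off the list above.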

Imagine instead I were a public health expert, interested in the effects of limiting medical residents to an 80-hour work week. Might lives be saved if I could effortlessly pull up historical data on rates of doctor-induced complications, board of medicine complaints, relative rates of med school and law school applications, and open-government data on medical regulations?

The long-term mission of infochimps.org is to democratize this: to put the world’s analytic data at our fingertips, supporting tools that let anyone manipulate, interrogate, visualize and explore that data.  Giving baseball geeks a chance to show up Tim McCarver isn’t much of a start, but here we are.

More awesome first names after the jump….


Infinite Monkeywrench hosted on GitHub 2 years, 2 months ago. by dhruvbansal

Rejoice, you open-source orangutans, for the powerful, the weighty, the Infinite Monkeywrench is now hosted on GitHub! Download a copy and start hacking, if you will, and send us your questions and concerns.

The Infinite Monkeywrench (IMW) turns all the screws in the heaving contraption we call infochimps.org but can also be put to good use on more modest projects. Learn more about IMW at the official IMW website.

The gems of our collection — The best of what's to come 2 years, 5 months ago. by mrflip

Hooray! The infochimps have been waxy’ed.  Let’s see how the server bonobos stand up.

It’s been suggested that I highlight some of the “gems” of our collection, which we’re going to spend the whole weekend shoveling into the pile. These first few are really deep, and somewhat hard to get / not widely known:

  • Full game state for every play of every baseball game in 2007, majors and minors.  Additionally, for about half of the major league games, *pitch by pitch* trajectory and game state information.  (MLB Gameday)
  • Word frequencies in written text for ~800,000 word tokens (British National Corpus)
  • All the Wikipedia infoboxes, turned on their side and put into a table for each infobox type.
  • 250,000+ Material Safety Data Sheets: the chemical and safety information required by OSHA
  • 100 years of hourly weather data; from 1973 on there are about 10,000 stations all taking hourly readings … put another way, it’s 475,000+ station-years of hourly readings and weighs in at ~15 GB compressed.

(Incidentally, many of those datasets sell for inexcusable and malicious prices.  For those with a commercial bent, something tells me there’s room in the market if you’re willing to accept a markup of less than 10,000 times).

These are a bit silly but interesting for their ridiculous depth:
  • A variety of mathematical constants (pi, e, Catalan’s constant, the Golden Ratio, others), in some cases calculated to a preposterous 100 billion decimal places (I’ll probably chop them off at a still-ludicrous 500 million).
  • 5000 years of solar eclipse times, 6000 years of precise lunar phase, 6000 years of Venus transits.
  • Odds of Dying for every Cause of Death listed in the US in a given year.

There are also, of course, the well-known collections: IMDB.com, musicbrainz, dbpedia, CIA factbook, geonames, citeseer, census, statistical abstract and the like.  So let’s see how much of the low-hanging fruit we can toss up there this weekend (the hard parts are adding metadata, and getting the non-copyrightable data out of the copyrighted screenscrapes, so what you’ll see are minimal metadata and the non-screenscraped datasets — still beats paying $1200+/GB though.)

[edit: the datasets for holiday dates by country, year-by-year odds of dying for all causes of death over the most recent 8 years, NIST values for physical and chemical constants, mechanical properties of common engineering materials, and spoken and written word frequencies for ~800,000 word tokens should be up later today -- if the site is down briefly, we're pushing that update to the server.  (If the site is down not-briefly, we've been del.waxyslashdiggdotted.)  Thanks to my friend Ned for helping do some drudge work to get those out.]

Good Neighbors and Open Grazing: Datasets, Creative Works and Copyright 2 years, 5 months ago. by mrflip

Many people don’t know how broad our rights to factual data actually are.  Unlike the mishegaas that reigns in copyright land, the world of data is largely open (and rightfully so).  To arrive at the age of ubiquitous information with a sound policy, however, we have to exercise those rights assertively, respectfully and prudently.

Let me start with the traditional IANAL and point out that if you take legal advice from a chimpanzee you deserve what you get. Instead, read iusmentis on database law and bitlaw on compilations and databases. (In which case you can probably skip the rest of this post.) (Also, the following only applies to the US, where the database laws are actually more liberal than elsewhere; I have no idea what the situation is outside the US)

In general, a comprehensive assemblage of facts cannot be copyrighted. Copyright only applies where there is creative content. A comprehensive list of cars and retail prices cannot be copyrighted; a comprehensive collection of reviews of those cars can be copyrighted. A list of all the musical albums released each year is data; the lyrics and music within them are creative. A list of word tokens sorted by artist, genre, release date and song length is data, and a list of the top-100 selling albums by year is data. This principle comes from the important Feist Publications v. Rural Telephone Service case:

“Facts, whether alone or as part of a compilation, are not original and therefore may not be copyrighted. A factual compilation is eligible for copyright if it features an original selection or arrangement of facts, but the copyright is limited to the particular selection or arrangement. In no event may copyright extend to the facts themselves.” — Sandra Day O’Connor for the Supreme Court

“A collections of facts are not copyrightable per se … A compilation, like any other work, is copyrightable only if it satisfies the originality requirement (“an original work of authorship”). Facts are never original, so the compilation author can claim originality, if at all, only in the way the facts are presented. The facts must be selected, coordinated, or arranged “in such a way” as to render the work as a whole original.” — Sandra Day O’Connor for the Supreme Court

A presentation of data can be creative — you can’t xerox the blue book and hand that out. However, a conversion of otherwise unrestricted data into your own creative presentation satisfies this restriction. So would a presentation (original or converted) that did not arise from a creative act — you couldn’t claim copyright on a .CSV file of some dataset.

Besides “presentation” and a couple of edge cases (“hot news”, “selection and arrangement”), the main restriction to be aware of is “Terms of Service”. If you have to agree to terms of service that restrict the data, but you take it anyway, you can be guilty of trespass. My understanding is that if a) the site can be accessed by robot (no person clicks anything) AND b) there is no robots.txt, they shouldn’t be able to sustain a claim that it’s a restricted resource.
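
On the purely mechanical side, checking what a site's robots.txt allows is easy to automate. Below is a minimal sketch using Python's standard urllib.robotparser module. The function name `may_fetch`, the user agent string, and the example.com URLs are made up for illustration, and the sketch parses robots.txt text you have already downloaded rather than fetching it. It only tells you what the rules say; it settles nothing about the legal question.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt text permits user_agent to fetch url.

    An empty robots_txt string (no rules at all) permits everything.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules: every robot is asked to stay out of /private/.
    rules = "User-agent: *\nDisallow: /private/\n"
    for path in ("/public/data.csv", "/private/data.csv"):
        url = "http://example.com" + path
        print(url, may_fetch(rules, "example-bot", url))
```

A robot that runs a check like this before every request is also a decent start on the "be a good neighbor" principle.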

I personally go by balancing two principles:

  1. It’s our world, and we deserve access to the information that describes it.  Besides our legal rights, we have an even stronger moral claim to the chronicle of our collective story.  And we all stand to benefit: there have to be incentives to gather and organize data, but the modest benefits of making a data provider a lot richer don’t stand against the much larger marginal benefit of making the world a tiny bit smarter.
  2. Be a good neighbor.  A lot of work goes into gathering, processing, verifying, and distributing an interesting dataset.  If we infochimps run around ignoring people’s requests for modest usage conditions, we’ll have a bit extra of open data and a lot extra of pissed-off ex-kindred souls who feel like we stole their cake.  Inevitably, this will mean that people won’t put data online at all for public access.

The best approach is:

  • Scrupulously credit contributions, make clear that their efforts are recognized, and that we’ll link back to them for their ultimate benefit.
  • Clearly state the usage restrictions requested by the contributor, adhere to them, and ask that recipients of the data do the same.
  • Make clear the benefits to the world for making this data available.
  • Make clear the benefits to the contributor — this data will, for free, be enhanced with metadata, converted for use by diverse tools, interlinked with other rich datasets, and power interesting projects.  If your mission statement is “build reliable and exciting cars” or “make powerful music”, then your mission statement isn’t “explore and explain unexpected correlations among disparate rich information pools”.  Let someone else do it for you, and let them build the tools to do so around your data.  Consider how much Baseball has benefitted from its statistical revolution — fed by its incredibly rich ecosystem of open data.
  • Finally, as far as scientific or government prepared data that’s otherwise rights-free: gloves off, we’re taking that data.  If you’re a researcher, and you’re not openly sharing your data, you’re not only a bad scientist but also a bad person.  Ditto for data collected at taxpayer expense.

Stock Market dataset is up 2 years, 5 months ago. by mrflip

40 Years of data on every NYSE, AMEX and NASDAQ listed stock:

These links were busted before but should be worky now.