
The Asdrubal Cabrera Hall of Fame 1 year, 2 months ago. by mrflip

Prompted by my friend’s skepticism that the ballplayer Milton Bradley is really so named, I’m exhuming this old post from elsewhere. — flip

During the 2007 baseball playoffs, announcer Tim McCarver perspicaciously observed that “Asdrubal Cabrera is the only player in the majors with that first name”. Thus inspired, I present The Asdrubal Cabrera Hall of Fame: Major League ballplayers in unique possession of their particular first name. (Some are nicknames, many are not — but these are their official names, as used in newspapers and the rolls of history. F’reals.)

You may be familiar with Honus Wagner, Eppa Rixey, Boog Powell or Yogi Berra. But have you heard recounted the storied diamond exploits of Firpo Mayberry, Zoilo Versalles, Pi Schwert or Bevo LeBourveau? OK, then how about Mysterious Walker, The Only Nolan, or Phenomenal Smith? Mul Holland, Sixto Lezcano, Welcome Gaston or Mox McQuery? There’s a bunch more after the jump, and a complete listing here, including links to each player’s baseball reference page.

For some dinnertime fun over the holidays, discuss the relative merits of naming your next child after Urban Shocker, Twink Twining, Pussy Tebeau, Bris Lord, Boob Fowler, Crazy Schmit, Creepy Crespi, Cuddles Marshall, Vinegar Bend Mizell, or Buttercup Dickerson. (Unfortunately, 12 other “Rusty”s keep fan favorite Rusty Kuntz off this list, and believe it or not two other “Stubby”s bar the way for Stubby Clapp. I apologize to anyone whose internet filter has or has not prevented reading this apology.)

Thanks to the Baseball Databank and Retrosheet, I had this dataset on hand, and thanks to a monastic life of nerdity I had the SQL chops to pull up this query between innings.  But I should be able to do this with anything, whether or not I know a SQL Query from a Queer-Eye Sequel, for silly stunts and for changing lives alike.
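For the curious, here is a minimal sketch of the kind of query involved, run through Python's built-in sqlite3 module. The `players` table, its `first_name` and `last_name` columns, and the handful of sample rows are assumptions made up for illustration; the real Baseball Databank schema may well differ, and this is not the query that was actually run.

```python
import sqlite3

# Toy sample rows for illustration only; the real dataset covers every
# major leaguer, and its table and column names may differ from these.
PLAYERS = [
    ("Asdrubal", "Cabrera"),
    ("Zoilo", "Versalles"),
    ("Rusty", "Kuntz"),
    ("Rusty", "Staub"),
    ("Stubby", "Clapp"),
    ("Stubby", "Overmire"),
]

# Group players by first name and keep only the names held by exactly one player.
UNIQUE_FIRST_NAMES_SQL = """
    SELECT first_name, MIN(last_name) AS last_name
    FROM players
    GROUP BY first_name
    HAVING COUNT(*) = 1
    ORDER BY first_name
"""


def unique_first_names(rows):
    """Return (first_name, last_name) pairs whose first name occurs only once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE players (first_name TEXT, last_name TEXT)")
    conn.executemany("INSERT INTO players VALUES (?, ?)", rows)
    result = conn.execute(UNIQUE_FIRST_NAMES_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for first, last in unique_first_names(PLAYERS):
        print(first, last)
```

In this toy sample the two “Rusty”s and two “Stubby”s knock each other out, which is the same rule that keeps Rusty Kuntz and Stubby Clapp off the real list.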

Imagine instead I were a public health expert, interested in the effects of limiting medical residents to an 80-hour work week. Might lives be saved if I could effortlessly pull up historical data on rates of doctor-induced complications, board of medicine complaints, relative rates of med school and law school applications, and open-government data on medical regulations?

The long-term mission of infochimps.org is to democratize this: to put the world’s analytic data at our fingertips, supporting tools that let anyone manipulate, interrogate, visualize and explore that data.  Giving baseball geeks a chance to show up Tim McCarver isn’t much of a start, but here we are.

More awesome first names after the jump….


The gems of our collection — The best of what's to come 1 year, 11 months ago. by mrflip

Hooray! The infochimps have been waxy’ed.  Let’s see how the server bonobos stand up.

It’s been suggested that I highlight some of the “gems” of our collection, which we’re going to spend the whole weekend shoveling into the pile. These first few are really deep, and somewhat hard to get / not widely known:

  • Full game state for every play of every baseball game in 2007, majors and minors.  Additionally, for about half of the major league games, *pitch by pitch* trajectory and game state information.  (MLB Gameday)
  • Word frequencies in written text for ~800,000 word tokens (British National Corpus)
  • All the Wikipedia infoboxes, turned on their side and put into a table for each infobox type.
  • 250,000+ Material Safety Data Sheets – the chemical and safety information required by OSHA
  • 100 years of hourly weather data; from 1973 on there are about 10,000 stations all taking hourly readings … put another way, it’s 475,000+ station-years of hourly readings, and it weighs in at ~15 GB compressed.
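To get a feel for the weather numbers, here is a rough back-of-the-envelope sketch. It assumes every station reports every hour of every year (so the reading count is an upper bound), ignores leap days, and takes 1 GB as 10^9 bytes; the function and constant names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365        # assumption: ignore leap days
STATION_YEARS = 475_000          # the "475,000+ station-years" quoted above
COMPRESSED_BYTES = 15 * 10**9    # the "~15 GB" quoted above, taking 1 GB = 10^9 bytes


def hourly_readings(station_years, hours_per_year=HOURS_PER_YEAR):
    """Upper bound on the number of readings if no hour is ever missed."""
    return station_years * hours_per_year


def bytes_per_reading(total_bytes, readings):
    """Average compressed size of a single reading."""
    return total_bytes / readings


readings = hourly_readings(STATION_YEARS)
print(f"{readings:,} readings")                                      # 4,161,000,000 readings
print(f"{bytes_per_reading(COMPRESSED_BYTES, readings):.1f} bytes")  # 3.6 bytes
```

So under those assumptions the pile is on the order of four billion readings, at a few compressed bytes apiece.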

(Incidentally, many of those datasets sell for inexcusable and malicious prices.  For those with a commercial bent, something tells me there’s room in the market if you’re willing to accept a markup of less than 10,000 times).

These are a bit silly but interesting for their ridiculous depth:

  • A variety of mathematical constants (pi, e, Catalan’s constant, the Golden Ratio, others) calculated, in some cases, to a preposterous 100 billion decimal places (I’ll probably chop them off at a still-ludicrous 500 million).
  • 5000 years of solar eclipse times, 6000 years of precise lunar phase, 6000 years of Venus transits.
  • Odds of Dying for every Cause of Death listed in the US in a given year.

There are also, of course, the well-known collections: IMDB.com, musicbrainz, dbpedia, CIA factbook, geonames, citeseer, census, statistical abstract and the like.  So let’s see how much of the low-hanging fruit we can toss up there this weekend (the hard parts are adding metadata, and getting the non-copyrightable data out of the copyrighted screenscrapes, so what you’ll see are minimal metadata and the non-screenscraped datasets — still beats paying $1200+/GB though.)

[edit: dates for holidays by country, year-by-year odds of dying for all causes of death over the most recent 8 years, NIST values for physical and chemical constants, mechanical properties of common engineering materials, and the spoken and written word frequencies for ~800,000 word tokens should all be up later today. If the site is down briefly, we're pushing that update to the server. (If the site is down not-briefly, we've been del.waxyslashdiggdotted.) Thanks to my friend Ned for helping do some drudge work to get those out.]

All of Wikipedia's infoboxes & templates, in individual tables for each kind 1 year, 11 months ago. by mrflip

FINALLY — got the wikipedia infoboxen posted to the site, along with some tiny fixes.

This is 3000+ tables on everything from ABA Teams through Simpsons Episodes to Zodiac Signs.  There’s a fair amount of cruft in these, but until I have live metadata editing going I’m not going to worry about it: it takes about 8 hours start to finish to process this dataset, and while the tables aren’t perfect they are perfectly usable.

I have the weather dataset and baseball datasets almost ready to go (along with a whole buncha others), but I’m going to take some time to get the site running better first.  Here’s a rough TODO list:

  1. live, versioned metadata editing
  2. uploading
  3. Allow grouping of datasets by collection and add category tags
  4. Make it so fields & contributors tie together.  (For complicated reasons, each dataset creates a new personal version of the field, so you can’t actually walk from one “stock price” field to other datasets with that tag.)

Then I’ll turn some intensive attention finally to the InfiniteMonkeywrench code.  We need better tools to wrangle these huge datasets into shape.

Statistical Abstract of the United States 2 years, 0 months ago. by mrflip

Added the Statistical Abstract of the United States: the messily, messily formatted analyzed tables released by the US Census Bureau.  1350+ tables, yum.

infochimps.org is live 2 years, 0 months ago. by mrflip

Just in time for SxSWi – the site is live.

Now that we’ve got the skeleton of the website in place, we can go back and apply the necessary metadata/package/import workflow we’ve developed.

Here’s a rundown of the datasets you can look forward to seeing over the next few weeks:

  • demographics
    • world
      • world bank development data—variety of country data from world bank
      • CIA factbook
    • us
      • Statistical Abstract of the US —an exhaustive categorization of demographic, commercial and social data for the US
      • The full US Census Summary File 3, at the zipcode level.
  • money:
    • US Stock market daily—Daily open/close/lo/hi for all listed stocks since 1970
    • US Campaign finance—Expenditures in US presidential, senate, house and governor races in 2004
    • Constantcurrency—Variety of currencies in constant dollars/pounds etc back to the 1600s
  • huge & Miscellaneous:
    • infoboxen: All the infoboxes from Wikipedia broken out into individual semantically labelled tables
  • joins:
    • Common coding systems for
      • country codes, including a useful keying database from common names (“USA”, “U.S.A”, “United States”, …) to ISO country code
      • languages
      • currencies, etc.
    • time—conversions among all of the (curiously many) competing means of measuring dates and times
  • health
    • odds of dying—all causes of death in the US, broken down by category and given as rate and odds
    • middle east conflict casualties—civilian and military deaths in Iraq (OIF) and Afghanistan (OEF) since 2003/2001
  • science, math & engineering:
    • nasa_eclipse: 5000 years of solar and lunar eclipse, lunar phase, and planetary transits from NASA
    • 270,000+ MSDS (Materials Safety datasheets) listing properties and hazards of common and industrial chemical substances
    • material properties—basic chemical and physical properties for common chemical substances
    • powergrid: Network of Power Grid Connections in the Western US (Strogatz1998)
    • fastenerdata: Screw, Bolt, and Threaded Fasteners: Dimensions, Mechanical Strengths and Properties, and other useful information
    • mechanical properties: Mechanical Properties for a variety of useful materials
    • consts and units: Universal constants and unit conversion factors
    • standard mathematical tables: Tables of Elementary functions (log, bessel, etc) over large range
    • mathematical constants: The fundamental mathematical constants calculated to millions and occasionally billions of decimal places
  • Art and Culture
    • Every movie, act(or|ess), and film courtesy of imdb.com
    • Every musician, album, track and label, courtesy of musicbrainz.org
    • WANTED: ISBN => author, book, publisher dataset. If you have this please contact us.
  • geo:
    • A huge assortment of GIS layers from nationalatlas.gov
    • Geographical place names & locations from geonames.org
    • TigerLine, a mapping from street address to location for the full US (this will take a while)
    • Postal codes – map from zip code to city and latitude/longitude
  • time
    • tzinfo: time zone info for everywhere
    • calendar_kitchensink: 3000 years of time zone, calendar conversion, moon phase, accounting information, etc
    • accounting_calendar: last fridays of each month, adjusted for holidays etc.
    • holidays: major repeating holidays for most countries
  • language:
    • Usage frequency (in speech and print) of every English word, from the British National Corpus
    • Moby Word lists – Word Lists, Multiple Language Lists of Common Words, Hyphenation, Part of Speech, Pronunciation, Thesaurus
    • Natural Language Toolkit Corpora: NLTK’s Word lists, semantic networks, lexical data, large text corpora; several languages
    • All the words legal to play in Scrabble™
  • sport:
    • Baseball:
      • retrosheet gamelogs: Game outcome and box score for every MLB game back to 1890s
      • retrosheet event files: Play by play information for almost every game back to 1957 (and all since the mid-1970s).
      • baseballdatabank: Season and Career stats for every MLB player, team, etc of all time
      • MLB Gameday: Players, Game state, Pitch-by-Pitch trajectory and Outcome for ~half of the 2007 MLB games.
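
As an illustration of the country-code keying database mentioned under “joins” above, here is a minimal sketch of how a lookup from common names to ISO country codes might work. The handful of rows, the normalization rules, and the names `NAME_TO_ISO`, `normalize` and `iso_code` are all assumptions for illustration, not the actual dataset or its format.

```python
import re

# Illustrative rows only; a real keying table would be far larger.
# Values are ISO 3166-1 alpha-2 codes.
NAME_TO_ISO = {
    "usa": "US",
    "united states": "US",
    "united states of america": "US",
    "uk": "GB",
    "united kingdom": "GB",
    "great britain": "GB",
}


def normalize(name):
    """Drop periods, collapse runs of whitespace, and lowercase."""
    name = name.replace(".", "")
    return re.sub(r"\s+", " ", name).strip().lower()


def iso_code(name):
    """Return the ISO code for a common country name, or None if unknown."""
    return NAME_TO_ISO.get(normalize(name))


if __name__ == "__main__":
    for raw in ("USA", "U.S.A", "United States", "Atlantis"):
        print(repr(raw), "->", iso_code(raw))
```

Normalizing before the lookup keeps the table small: “USA”, “U.S.A” and “u.s.a.” all collapse to one key, so only genuinely different names need their own row.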