
Vote for our SxSW Panel Talk, Get People Thinking about how the Web will help tame the Data Flood 1 year, 7 months ago. by mrflip

Aaron Swartz of get.theinfo.org and watchdog.org, Kurt Bollacker from freebase.com, Shawn O’Connor from timepedia.org, and we infochimps have each put in panel proposals for the SxSWi 2009 conference.  Please consider clicking through to rate (and comment!) on these talks:

By my cursory count, there are about three times as many proposals this year as last that center on using the web for large-scale data exploration, data mashups, visualization, etc. Even if you are not attending, though, your vote will help get more people learning about the current state and future possibilities of massive data exploration on the web.

Descriptions of those talks:

Beyond Mashup: Weaving the Global Data Tapestry
http://panelpicker.sxsw.com/ideas/view/1500
Data mashups of not a few but a few thousand sources become possible as community efforts, enabled by new tools and Creative Commons licensing, unify the world’s exploding store of free, open data. Come find out what’s awesome, what’s hard, and what’s possible when you discover there’s really only one dataset. (P Kromer, infochimps.org)

How the Internet is Transforming Governance
http://panelpicker.sxsw.com/ideas/view/1038
The Internet is starting to revolutionize everything about politics and governance. Panelists will discuss new initiatives that harness the power of the Web to engage citizens in online activism, collaborative governance and oversight in ways that are radically shifting political power structures and fostering more transparency and accountability by elected officials. (Gabriela Schneider, Sunlight Foundation)

Petabyte as Platform – Building “Everything about Something” Sites
http://panelpicker.sxsw.com/ideas/view/1449
Find a topic some audience cares deeply about (their neighborhood, our government, every motorcycle ever made), let visitors see, explore and understand it, and you make the world a better place. We’ll discuss how participating in the open, global data commons beneficially transforms our culture and economy. (Kurt Bollacker, Freebase.com)

Powers of Often: Powers of Ten in Time
http://panelpicker.sxsw.com/ideas/view/1649
In 1977, Charles & Ray Eames made a fascinating short film, Powers of Ten, showing the relative scales in the universe: from picnic, to city, to solar system, to galaxy, and so on, back to cells, molecules, and atomic nuclei. In the same spirit, Powers of Often will explore relative scales in time using real data and hard estimates: patterns of daily life, demographics, census data, generations, long term trends, forecasts, historical cycles, high-frequency finance, and solar cycles. (Shawn O’Connor, Timepedia.org)

Among all talks with “data” in the description, these also look interesting:

If you see any other worthwhile topics, please reply.

Thanks!
flip

The gems of our collection — The best of what's to come 1 year, 11 months ago. by mrflip

Hooray! The infochimps have been waxy’ed.  Let’s see how the server bonobos stand up.

It’s been suggested that I highlight some of the “gems” of our collection, which we’re going to spend the whole weekend shoveling into the pile. These first few are really deep, and somewhat hard to get / not widely known:

  • Full game state for every play of every baseball game in 2007, majors and minors.  Additionally, for about half of the major league games, *pitch by pitch* trajectory and game state information.  (MLB Gameday)
  • Word frequencies in written text for ~800,000 word tokens (British National Corpus)
  • All the Wikipedia infoboxes, turned on their side and put into a table for each infobox type.
  • 250,000+ Material Safety Data Sheets – the chemical and safety information required by OSHA
  • 100 years of hourly weather data; from 1973 on there are about 10,000 stations all taking hourly readings … put another way, it’s 475,000+ station-years of hourly readings and weighs in at ~15 GB compressed.
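
For a rough sense of scale, the station-years figure for the weather data can be turned into a reading count and a compressed size per reading. This is only a back-of-envelope sketch: it assumes every station-year is complete (one reading per hour, leap days ignored) and that 1 GB means 10^9 bytes, neither of which the dataset guarantees.

```python
# Back-of-envelope scale check for the hourly weather data.
# Assumptions (not from the dataset itself): complete station-years,
# 365-day years, and 1 GB = 10**9 bytes.
HOURS_PER_YEAR = 24 * 365


def scale_estimate(station_years, compressed_bytes):
    """Return (total hourly readings, compressed bytes per reading)."""
    readings = station_years * HOURS_PER_YEAR
    return readings, compressed_bytes / readings


readings, per_reading = scale_estimate(475_000, 15 * 10**9)
print(f"{readings:,} readings, {per_reading:.1f} bytes each (compressed)")
```

Under those assumptions that works out to about 4.2 billion readings at roughly 3.6 compressed bytes apiece. Real station records have gaps, so the true count is likely lower.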

(Incidentally, many of those datasets sell for inexcusable and malicious prices.  For those with a commercial bent, something tells me there’s room in the market if you’re willing to accept a markup of less than 10,000 times).

These are a bit silly but interesting for their ridiculous depth:

  • A variety of mathematical constants (pi, e, Catalan’s constant, the Golden Ratio, others), calculated in some cases to a preposterous 100 billion decimal places (I’ll probably chop them off at a still-ludicrous 500 million).
  • 5000 years of solar eclipse times, 6000 years of precise lunar phases, 6000 years of Venus transits.
  • Odds of dying for every cause of death listed in the US in a given year.

There are also, of course, the well-known collections: IMDB.com, MusicBrainz, DBpedia, CIA Factbook, GeoNames, CiteSeer, Census, Statistical Abstract and the like.  So let’s see how much of the low-hanging fruit we can toss up there this weekend. (The hard parts are adding metadata and getting the non-copyrightable data out of the copyrighted screenscrapes, so what you’ll see are minimal metadata and the non-screenscraped datasets. Still beats paying $1200+/GB, though.)

[edit: the following datasets should be up later today: dates for holidays by country, year-by-year odds of dying for all causes of death over the most recent 8 years, NIST values for physical and chemical constants, mechanical properties of common engineering materials, and the spoken and written word frequencies for ~800,000 word tokens. If the site is down briefly, we're pushing that update to the server.  (If the site is down not-briefly, we've been del.waxyslashdiggdotted.)  Thanks to my friend Ned for helping do some drudge work to get those out.]

All of Wikipedia's infoboxes & templates, in individual tables for each kind 1 year, 11 months ago. by mrflip

FINALLY: got the Wikipedia infoboxen posted to the site, along with some tiny fixes.

This is 3000+ tables on everything from ABA Teams through Simpsons Episodes to Zodiac Signs.  There’s a fair amount of cruft in these, but until I have live metadata editing going I’m not going to worry about it: it takes about 8 hours start to finish to process this dataset, and while the tables are not perfect, they are perfectly usable.
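
To make the "one table per infobox type" idea concrete, here is a minimal sketch of how already-parsed infoboxes might be regrouped. It is not the actual processing code: the function name `infoboxes_to_tables`, the input tuple shape, and the sample rows are all made up for illustration, and it assumes the wikitext has already been parsed into key/value pairs.

```python
from collections import defaultdict


def infoboxes_to_tables(infoboxes):
    """Group parsed infoboxes into one table per infobox type.

    `infoboxes` is an iterable of (infobox_type, page_title, fields) tuples,
    where `fields` is a dict of that infobox's key/value pairs.
    Returns {infobox_type: (columns, rows)}.
    """
    grouped = defaultdict(list)
    for infobox_type, page_title, fields in infoboxes:
        grouped[infobox_type].append((page_title, fields))

    tables = {}
    for infobox_type, entries in grouped.items():
        # Columns are the union of keys seen for this type, in first-seen order.
        columns = []
        for _, fields in entries:
            for key in fields:
                if key not in columns:
                    columns.append(key)
        # Each infobox becomes one row; keys it lacks are left blank.
        rows = [
            [page_title] + [fields.get(col, "") for col in columns]
            for page_title, fields in entries
        ]
        tables[infobox_type] = (["page"] + columns, rows)
    return tables


# Illustrative sample input only, not taken from the real dataset.
sample = [
    ("Zodiac Sign", "Aries", {"symbol": "Ram", "element": "Fire"}),
    ("Zodiac Sign", "Taurus", {"symbol": "Bull", "ruler": "Venus"}),
    ("ABA Team", "Indiana Pacers", {"founded": "1967"}),
]
tables = infoboxes_to_tables(sample)
```

Taking the union of keys per type is one source of the cruft mentioned above: every misspelled or one-off infobox parameter ends up as its own mostly empty column.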

I have the weather dataset and baseball datasets almost ready to go (along with a whole buncha others), but I’m going to take some time to get the site running better first.  Here’s a rough TODO list:

  1. Live, versioned metadata editing
  2. Uploading
  3. Allow grouping of datasets by collection and add category tags
  4. Make it so fields & contributors tie together.  (For complicated reasons, each dataset creates a new personal version of the field, so you can’t actually walk from one “stock price” field to other datasets with that tag.)
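
The last item is easier to see with a toy model. The sketch below shows one way shared fields could work: a single registry keyed by a normalized field name, so two datasets tagged “stock price” land on the same entry. The `FieldRegistry` class and the dataset names are hypothetical, and the site's real schema may look nothing like this.

```python
from collections import defaultdict


class FieldRegistry:
    """Hypothetical shared index from field name to the datasets that use it."""

    def __init__(self):
        self._datasets_by_field = defaultdict(set)

    @staticmethod
    def normalize(name):
        # Lowercase and collapse whitespace so "Stock  Price" == "stock price".
        return " ".join(name.lower().split())

    def tag(self, dataset, field_name):
        """Record that `dataset` has a field called `field_name`."""
        self._datasets_by_field[self.normalize(field_name)].add(dataset)

    def related(self, dataset, field_name):
        """Other datasets sharing this field, sorted for stable output."""
        key = self.normalize(field_name)
        return sorted(self._datasets_by_field.get(key, set()) - {dataset})


# Made-up dataset names, purely for illustration.
registry = FieldRegistry()
registry.tag("nyse-daily", "Stock Price")
registry.tag("nasdaq-daily", "stock  price")
registry.tag("weather-hourly", "temperature")
neighbors = registry.related("nyse-daily", "stock price")
```

With per-dataset copies of each field, the equivalent of `related` has nothing to join on, which is the problem described in item 4.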

Then I’ll finally turn some intensive attention to the InfiniteMonkeywrench code.  We need better tools to wrangle these huge datasets into shape.