blog.infochimps.org - Organizing Huge Information Sources

Something about Everything about Something

All of Wikipedia’s infoboxes & templates, in individual tables for each kind


FINALLY: got the Wikipedia infoboxen posted to the site, along with some tiny fixes.

This is 3000+ tables on everything from ABA Teams through Simpsons Episodes to Zodiac Signs.  There’s a fair amount of cruft in these, but until I have live metadata editing going I’m not going to worry about it: it takes about 8 hours start to finish to process this dataset, and while the tables aren’t perfect, they are perfectly usable.

I have the weather dataset and baseball datasets almost ready to go (along with a whole buncha others), but I’m going to take some time to get the site running better first.  Here’s a rough TODO list:

  1. live, versioned metadata editing
  2. uploading
  3. Allow grouping of datasets by collection and add category tags
  4. Make it so fields & contributors tie together.  (For complicated reasons, each dataset currently creates its own private copy of a field, so you can’t actually walk from one “stock price” field to the other datasets that use that field.)
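
To make item 4 concrete, here is a rough sketch of what "tying fields together" could look like: one shared index keyed by a normalized field name, instead of a private field copy per dataset. This is only an illustration, not the site's actual code. The `Dataset` class, the `build_field_index` and `related_datasets` helpers, and the dataset names are all made up for the example, and matching fields by lowercased name is an assumption.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a dataset record and its field names."""
    name: str
    field_names: list[str] = field(default_factory=list)


def build_field_index(datasets):
    """Map each normalized field name to the datasets that use it."""
    index = defaultdict(list)
    for ds in datasets:
        for fname in ds.field_names:
            # Assumption: two fields are "the same" if their names match
            # after trimming whitespace and lowercasing.
            key = fname.strip().lower()
            if ds.name not in index[key]:
                index[key].append(ds.name)
    return dict(index)


def related_datasets(index, field_name, start):
    """Walk from a field on one dataset to the other datasets sharing it."""
    key = field_name.strip().lower()
    return [name for name in index.get(key, []) if name != start]


# Hypothetical datasets, invented for illustration only.
datasets = [
    Dataset("nyse_daily", ["Stock Price", "volume"]),
    Dataset("nasdaq_daily", ["stock price", "ticker"]),
    Dataset("weather_hourly", ["temperature"]),
]

index = build_field_index(datasets)
print(related_datasets(index, "stock price", "nyse_daily"))
```

With a shared index like this, the "stock price" field on one dataset leads straight to every other dataset carrying the same field, which is exactly the walk that per-dataset field copies prevent.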

Then I’ll finally turn some intensive attention to the InfiniteMonkeywrench code.  We need better tools to wrangle these huge datasets into shape.
