Real geeks don’t use IE – Infochimps Browser Usage Analytics 11 hours, 31 minutes ago. by Jesse Crouch

Browser usage by the somewhat normal web

When scoping out a web project, one of the first things a designer or web programmer wants to know is “what browsers are we supporting?” The decision is usually settled by a quick googling that turns up a page like W3Schools’, which tells you:

2010    IE8     IE7    IE6    Firefox  Chrome  Safari  Opera
July    15.6%   7.6%   7.2%   46.4%    16.7%   3.4%    2.3%

About 30.4% of the browser world belongs to IE (much better than the way things were just a few years ago). Almost 15% of your users are on a version of IE so old that you may be tempted to code with IE6 or 7 as your lowest common denominator.

Browser usage by Infochimps users

Consider who is visiting your site though. Are your users more net savvy? Are they geeks? Here’s what our visitors use:

About 10% of infochimps.org users use IE, almost a third of the norm.
Half of our IE users use IE8 (a much more capable version of IE), leaving a meager 5% in the IE6/7 realm, split half and half (2.5% total IE6 users, again about a third of the norm).
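
If you want to check the comparison yourself, a few lines of Ruby will do it. This is only a sketch of the arithmetic: the general-web figures come from the July 2010 table above, the Infochimps figures are the rounded ones quoted in this post, and the variable names are made up for illustration.

```ruby
# Browser share (percent) for the general web, July 2010, from the table above.
general_web = {
  'IE8' => 15.6, 'IE7' => 7.6, 'IE6' => 7.2,
  'Firefox' => 46.4, 'Chrome' => 16.7, 'Safari' => 3.4, 'Opera' => 2.3
}

# Rounded Infochimps figures quoted in this post (treated as exact for the sketch).
infochimps_ie_total = 10.0
infochimps_ie6      = 2.5

# Total IE share and the share still on IE6/7 for the general web.
ie_total  = general_web.values_at('IE8', 'IE7', 'IE6').sum.round(1)
legacy_ie = general_web.values_at('IE7', 'IE6').sum.round(1)

# How our visitors compare with the norm, as a fraction.
ie_ratio  = (infochimps_ie_total / ie_total).round(2)
ie6_ratio = (infochimps_ie6 / general_web['IE6']).round(2)

puts "General web: #{ie_total}% IE overall, #{legacy_ie}% on IE6/7"
puts "Infochimps IE share is #{ie_ratio} of the norm; IE6 share is #{ie6_ratio}"
```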

Conclusion: Real nerds don’t use Internet Explorer

As far as design philosophy goes, we strive to design our sites (infochimps.org, api.infochimps.com) in a progressive enhancement fashion so that all browsers can be supported well (enough) and accessibility is simple and works. IE6 isn’t number one on our list of things to deal with.

When you have limited resources (like a startup), check who is actually using your site before spending those resources on any one group of browsers.

What's Next: Infinite Monkeywrench starting to take form. 1 year, 11 months ago. by mrflip

We’re starting beta testing of infochimps.org v1.0 — see the following post. In order to start really populating infochimps.org with dataset payloads, the Infinite Monkeywrench is about to get some major love. The following syntax is still evolving, but we’re already using it to do some really fun stuff: here’s a preview.

One of the datasets we’re proud to be liberating is the National Climatic Data Center’s global weather data. To use that data, you need the file describing each of the NCDC weather stations. (I’ll just describe the station metadata file; the extraction cartoon for the main dataset is basically the same but like 10 feet wide.)

The weather station metadata lives at ftp://ftp.ncdc.noaa.gov/pub/data/gsod/ish-history.txt. It’s a flat file with a 17-line header, it contains fields describing each station’s latitude, longitude, call sign and all that, and it has lines that look like

# USAF   WBAN  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV(.1M)
# 010014 99999 SOERSTOKKEN                   NO NO    ENSO  +59783 +005350 +00500

Here’s what a complete Infinite Monkeywrench script to download that file, spin each line into a table row, and export as CSV, YAML, and marked-up XML would look like:

    #!/usr/bin/env ruby
    require 'imw'; include IMW
    imw_components :datamapper, :flat_file_parser

    # Stage as an in-memory Sqlite3 connection:
    DataMapper.setup(:staging_db, 'sqlite3::memory:')

    # Load the infochimps schema -- this has table and field names including type info
    ncdc_station_schema = ICSSchema.load('ncdc_station_schema.icss.yaml')

    # Create the tables from the schema
    ncdc_station_schema.auto_migrate!

    # Parse the station info file
    stations = FlatFileParser.new({
      :database  => :staging_db,
      :schema    => ncdc_station_schema,
      :each_line => :station,
      :filepaths => [:ripd, ['ftp://ftp.ncdc.noaa.gov/pub/data/gsod/ish-history.txt']],
      :skip_head => 17,
      :cartoon   => %q{
      # USAF   WBAN  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV(.1M)
        s6    .s5   .s30                           s2.s2.s2.s4  ..ci5   .ci6    .ci5
      },
    })

    # Dump as CSV, YAML and XML
    stations.dump_all :out_file => [:fixd, "weather_station_info"], :formats => [:csv, :xml, :yaml]

Almost all of that is setup and teardown. Once the infochimps schema has field names, the only part you really have to figure out is the cartoon,

      s6    .s5   .s30                           s2.s2.s2.s4  ..ci5   .ci6    .ci5

If you’ve used perl’s unpack(), you’ll get the syntax — this says ‘take the USAF call sign from the initial 6-character string; ignore one junk character; … take one character as the latitude sign, and an integer of up to 5 digits as the scaled latitude, ….’
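
If you haven’t met unpack(), here’s a rough equivalent of that cartoon using plain Ruby’s String#unpack. It’s a sketch, not how the Monkeywrench does it: the sample line is rebuilt with format so the column widths are explicit, the field names (especially country_alt) are guesses, and the thousandths-of-a-degree scale for latitude and longitude is an assumption. The tenths-of-a-meter elevation comes from the ELEV(.1M) header.

```ruby
# Rebuild the sample station line with explicit column widths so the layout is unambiguous.
line = format('%-6s %-5s %-30s%-2s %-2s %-2s %-4s  %-6s %-7s %-6s',
              '010014', '99999', 'SOERSTOKKEN', 'NO', 'NO', '', 'ENSO',
              '+59783', '+005350', '+00500')

# A<n> reads an n-character string and strips trailing spaces; x skips one junk character.
# Mirrors the cartoon: s6 . s5 . s30 s2 . s2 . s2 . s4 .. ci5 . ci6 . ci5
# (each sign character is read together with its digits, so ci5 becomes A6).
TEMPLATE = 'A6 x A5 x A30 A2 x A2 x A2 x A4 x2 A6 x A7 x A6'

usaf, wban, name, country, country_alt, state, call, lat, lon, elev = line.unpack(TEMPLATE)

station = {
  usaf: usaf, wban: wban, name: name,
  country: country, country_alt: country_alt, state: state, call: call,
  latitude:  lat.to_i / 1000.0,   # assumed: thousandths of a degree
  longitude: lon.to_i / 1000.0,   # assumed: thousandths of a degree
  elevation: elev.to_i / 10.0     # ELEV(.1M): tenths of a meter
}

p station
```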

Rather load it into a database? Leave the last line out, and stage right into your DB. (Any of MySQL 4.x+, Postgres 8.2+, SQLite3+ work.)

    # Load parsed files to the 'ncdc_weather' database in a remote MySQL DB store
    DataMapper.setup(:master_weather_db, 'mysql://remotedb.mycompany.com/ncdc_weather')

Surely a hand-tuned script would do this more thoroughly (and more quickly), but you can write this in a few minutes, set it loose on the gigabytes of data, and do all the rest from the comfort of your DB, your Hadoop cluster, or a script that starts with populated data structures given by a YAML file.

Another example. The US National Institute of Standards and Technology (NIST) publishes an authoritative guide to conversion factors for units of measurement. It is, unhelpfully, only available as an HTML table or a PDF file.

If we feed into the Infinite Monkeywrench

  • The schema
        fields:
          - { name: unit_from,                  type: str }
          - { name: unit_to,                    type: str }
          - { name: conversion_mantissa,        type: float }
          - { name: conversion_exponent,        type: float }
          - { name: is_exact,                   type: boolean }
          - { name: footnotes,
              type: seq,
              sequence: str }
  • The cartoon
          { :each    => '//table.texttable/tr[@valign="top"]:not(:first-child)',
            :makes   => :unit_conversion, # a UnitConversion struct
            :mapping => [
              '/td'   => [:unit_from, :unit_to, :conversion_mantissa, :conversion_exponent],
              '/td/b' => :is_exact,
              '/td/a' => :footnotes,
            ]
          }
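
To make the cartoon concrete, here is roughly what it asks for, written out by hand with REXML, which ships with Ruby. This is a sketch and not the Monkeywrench’s implementation: the table markup is a made-up stand-in for the NIST page, and the inner_text helper and the E-01 exponent notation are assumptions for illustration.

```ruby
require 'rexml/document'

UnitConversion = Struct.new(:unit_from, :unit_to, :conversion_mantissa,
                            :conversion_exponent, :is_exact, :footnotes,
                            keyword_init: true)

# Hypothetical stand-in for the NIST table; the real page's markup may differ.
HTML = <<~XML
  <table class="texttable">
    <tr valign="top"><td>To convert from</td><td>to</td><td>Multiply by</td><td></td></tr>
    <tr valign="top"><td>carat, metric</td><td>gram (g)</td><td><b>2.0</b></td><td><b>E-01</b></td></tr>
    <tr valign="top"><td>centimeter of mercury (0 °C) <a href="#f13">13</a></td><td>pascal (Pa)</td><td>1.33322</td><td>E+03</td></tr>
  </table>
XML

# Concatenate every text node under an element, however deeply nested.
def inner_text(node)
  case node
  when REXML::Text    then node.value
  when REXML::Element then node.children.map { |child| inner_text(child) }.join
  else ''
  end
end

doc  = REXML::Document.new(HTML)
rows = REXML::XPath.match(doc, "//table[@class='texttable']/tr[@valign='top']")

# drop(1) plays the role of :not(:first-child): skip the header row.
conversions = rows.drop(1).map do |row|
  cells = REXML::XPath.match(row, 'td').map { |td| inner_text(td).strip }
  UnitConversion.new(
    unit_from:           cells[0],
    unit_to:             cells[1],
    conversion_mantissa: cells[2].to_f,
    conversion_exponent: cells[3].delete('E').to_f,
    is_exact:            !REXML::XPath.first(row, 'td/b').nil?,            # bold marks an exact factor
    footnotes:           REXML::XPath.match(row, 'td/a').map { |a| inner_text(a) }
  )
end

conversions.each { |conversion| p conversion.to_h }
```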

We’d get back something like

  - unit_from:           'dyne centimeter (dyn · cm)'
    unit_to:             ' newton meter (N · m)'
    conversion_mantissa: 1.0
    conversion_exponent: -7

  - unit_from:           'carat, metric'
    unit_to:             'gram (g)'
    conversion_mantissa: 2.0
    conversion_exponent: -1
    is_exact:            true

  - unit_from:           'centimeter of mercury (0 °C) <a href="http://physics.nist.gov/Pubs/SP811/footnotes.html#f13">13</a>'
    unit_to:             ' pascal (Pa)'
    conversion_mantissa: 1.33322
    conversion_exponent: 3
    footnotes:           [ '<a href="http://physics.nist.gov/Pubs/SP811/footnotes.html#f13">13</a>' ]
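
Records in this shape are easy to pick up from a script. As a sketch (the YAML below is a trimmed copy of two of the records, and conversion_factor is a hypothetical helper, not part of the Monkeywrench), the multiplier is just the mantissa times ten to the exponent:

```ruby
require 'yaml'

# Two records in the same shape as the output above, trimmed for brevity.
records = YAML.safe_load(<<~DOC)
  - unit_from: 'carat, metric'
    unit_to: 'gram (g)'
    conversion_mantissa: 2.0
    conversion_exponent: -1
    is_exact: true
  - unit_from: 'centimeter of mercury (0 °C)'
    unit_to: 'pascal (Pa)'
    conversion_mantissa: 1.33322
    conversion_exponent: 3
DOC

# Hypothetical helper: multiplier = mantissa * 10^exponent.
def conversion_factor(record)
  record['conversion_mantissa'] * 10.0**record['conversion_exponent']
end

records.each do |record|
  puts "1 #{record['unit_from']} = #{conversion_factor(record)} #{record['unit_to']}"
end
```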

Now with some tweaking, you could do even more (and you’ll find you need to hand-correct a couple rows), but note:

  • Once one person’s done it nobody else has to.
  • This snippet gets you most of the way to a semantic dataset in your choice of universal formats.
  • In fact, there’s so little actual code left over we can eventually just take schema + url + cartoon as entered on the website, crawl the relevant pages, and provide each such dataset as CSV, XML, YAML, JSON, zip’d sqlite3 file … you get the idea — and we can do that without having to run code from strangers on our server.
  • Most importantly, for an end user this isn’t like trusting some random dude’s CSV file uploaded to a site named after a chimpanzee. The transformation from NIST’s data to something useful is so simple you can verify it by inspection. Of course, you can run the scripts yourself to check; or you can trace the Monkeywrench code itself; and once we have digital fingerprinting set up on infochimps.org anyone willing to stake their reputation on the veracity of a file can sign it — but it’s pretty easy to accept something this terse but expressive as valid. Our goal is to give transparent provenance of infochimps.org data to any desired degree.
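
As a rough idea of what a fingerprint could look like, here is a sketch using a cryptographic digest. The post doesn’t say which scheme infochimps.org will use, so SHA-256, the fingerprint helper and the sample CSV are all assumptions for illustration. The point is only that the digest changes if even one value in the file does.

```ruby
require 'digest'

# Hypothetical helper: hex SHA-256 digest of a dataset's raw bytes.
def fingerprint(data)
  Digest::SHA256.hexdigest(data)
end

# Made-up sample file contents, plus a copy with one value altered.
original = "unit_from,unit_to,factor\ncarat (metric),gram (g),0.2\n"
tampered = original.sub('0.2', '0.3')

puts fingerprint(original)
puts fingerprint(original) == fingerprint(tampered) ? 'unchanged' : 'contents differ'
```

A digest only shows that a file hasn’t changed; vouching for it would still take a signature over that digest.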

Infinite Monkeywrench hosted on GitHub 2 years, 2 months ago. by dhruvbansal

Rejoice, you open-source orangutans, for the powerful, the weighty, the Infinite Monkeywrench is now hosted on GitHub! Download a copy and start hacking, if you will, and send us your questions and concerns.

The Infinite Monkeywrench (IMW) turns all the screws in the heaving contraption we call infochimps.org, but it can also be put to good use on more modest projects. Learn more about IMW at the official IMW website.

The gems of our collection — The best of what's to come 2 years, 5 months ago. by mrflip

Hooray! The infochimps have been waxy’ed.  Let’s see how the server bonobos stand up.

It’s been suggested that I highlight some of the “gems” of our collection, which we’re going to spend the whole weekend shoveling into the pile. These first few are really deep, and somewhat hard to get / not widely known:

  • Full game state for every play of every baseball game in 2007, majors and minors.  Additionally, for about half of the major league games, *pitch by pitch* trajectory and game state information.  (MLB Gameday)
  • Word frequencies in written text for ~800,000 word tokens (British National Corpus)
  • All the wikipedia infoboxes, turned on their side and put into a table for each infobox type.
  • 250,000+ Material Safety Data Sheets – the chemical and safety information required by OSHA
  • 100 years of Hourly weather data; from 1973 on there’s about 10,000 stations all taking hourly readings … put another way, it’s 475,000+ station-years of hourly readings and weighs in at ~15 GB compressed.

(Incidentally, many of those datasets sell for inexcusable and malicious prices.  For those with a commercial bent, something tells me there’s room in the market if you’re willing to accept a markup of less than 10,000 times).

These are a bit silly but interesting for their ridiculous depth:

  • A variety of mathematical constants (pi, e, Catalan’s constant, the Golden Ratio, others), calculated in some cases to a preposterous 100 billion decimal places (I’ll probably chop them off at a still-ludicrous 500 million).
  • 5000 years of solar eclipse times, 6000 years of precise lunar phase, 6000 years of Venus transits.
  • Odds of Dying for every Cause of Death listed in the US in a given year.

There are also, of course, the well-known collections: IMDB.com, musicbrainz, dbpedia, CIA factbook, geonames, citeseer, census, statistical abstract and the like.  So let’s see how much of the low-hanging fruit we can toss up there this weekend (the hard parts are adding metadata, and getting the non-copyrightable data out of the copyrighted screenscrapes, so what you’ll see are minimal metadata and the non-screenscraped datasets — still beats paying $1200+/GB though.)

[edit: dates for holidays by country, year-by-year odds of dying for all causes of death from the most recent 8 years, NIST values for physical and chemical constants, mechanical properties of common engineering materials, and the spoken and written word frequencies for ~800,000 word tokens datasets should be up later today -- if the site is down briefly we're pushing that update to the server.  (If the site is down not-briefly we've been del.waxyslashdiggdotted)  Thanks to my friend Ned for helping do some drudge work to get those out.]

All of Wikipedia's infoboxes & templates, in individual tables for each kind 2 years, 5 months ago. by mrflip

FINALLY — got the wikipedia infoboxen posted to the site, along with some tiny fixes.

This is 3000+ tables on everything from ABA Teams through Simpsons Episodes to Zodiac Signs.  There’s a fair amount of cruft in these, but until I have live metadata editing going I’m not going to worry about it: it takes about 8 hours start to finish to process this dataset, and while the tables aren’t perfect, they are perfectly usable.
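
For a feel of what “one table per kind of infobox” means, here is a toy sketch of the pivot. The sample records and their field names are made up, and this is not the actual processing script; it only shows how key/value infoboxes can be grouped by type and laid out as rows and columns.

```ruby
# Made-up sample infoboxes: each is a bag of key/value pairs tagged with its template type.
infoboxes = [
  { type: 'Zodiac Sign', article: 'Aries',  fields: { 'symbol' => 'Ram',  'element' => 'Fire' } },
  { type: 'Zodiac Sign', article: 'Taurus', fields: { 'symbol' => 'Bull', 'element' => 'Earth', 'ruler' => 'Venus' } },
  { type: 'ABA Team',    article: 'Indiana Pacers', fields: { 'founded' => '1967' } }
]

# One table per infobox type: columns are the union of keys seen, one row per article.
tables = infoboxes.group_by { |box| box[:type] }.transform_values do |boxes|
  columns = boxes.flat_map { |box| box[:fields].keys }.uniq
  rows = boxes.map do |box|
    [box[:article]] + columns.map { |column| box[:fields][column] }
  end
  { columns: ['article'] + columns, rows: rows }
end

# Print each table as tab-separated text; missing values come out as empty cells.
tables.each do |type, table|
  puts "== #{type} =="
  puts table[:columns].join("\t")
  table[:rows].each { |row| puts row.join("\t") }
end
```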

I have the weather dataset and baseball datasets almost ready to go (along with a whole buncha others), but I’m going to take some time to get the site running better first.  Here’s a rough TODO list:

  1. live, versioned metadata editing
  2. uploading
  3. Allow grouping of datasets by collection and add category tags
  4. Make it so fields & contributors tie together.  (For complicated reasons, each dataset creates a new personal version of the field so you can’t actually walk from one “stock price” field to other datasets with that tag.)

Then I’ll turn some intensive attention finally to the InfiniteMonkeywrench code.  We need better tools to wrangle these huge datasets into shape.