
SXSW Data Panels 7 months, 0 days ago. by Joseph Kelly

We are especially excited to announce and share that big data is coming to SXSW.  Here are the panels we like:

Pete Skomoroch of DataWrangling: Petabyte As Platform, Making Big Data Accessible Online – We have long been fans of Pete Skomoroch’s work; this is your chance to hear from him about web applications built on massive datasets.

Our own mrflip: Scraping the Social Web – Flip has done extensive work building massive datasets from social media sites.  Hear him talk about the nuances involved and ask him about best practices.

Michael Driscoll of Dataspora: Cloud Crunching Big Data with HIVE/Hadoop and R and Become a Sexy Data Geek in One Week – Another friend of ours, Michael will talk about how to use the right tools to massage big datasets and produce results from them, and will profile what you need to do to be a data geek.

Stu Hood of Rackspace: Using Hadoop to Manage a Ton of Data – Hadoop might be the most important tool to know for working with terabytes and terabytes of data.

Ian Davis of Talis: Set Your Data Free – Talis does great work.  Listen to Ian cover topics very relevant to Infochimps.org’s collection: data copyright and licensing.

Dave Bowker of Designing the News: Engaging Data Visualizations and Infographic Communication – Glad to see some data viz stuff at SXSW.

Casey Caplowe of GOOD: Interactive Infographics – More visualizations, GOOD stuff.

Leave a comment if you know of any other good ones.

Infochimps receives a donation from SmartBear 7 months, 20 days ago. by Joseph Kelly

Smart Bear Software is an Austin-based company whose founder, Jason Cohen, is one of our favorite people.  Jason grew Smart Bear from the ground up, and he has helped the Infochimps team in the past with practical advice.  Jason blogs about marketing and small business at http://blog.asmartbear.com/ and he is well worth reading.  

The Infochimps rely on agile methods for the building of Infochimps.org, a process which can benefit from a code review tool.  Smart Bear’s product, Code Collaborator, is a well-known online peer code review tool that simplifies and expedites code reviews, helping teams produce higher-quality, tested and done code more efficiently.

Smart Bear’s latest promotion offered 5 seats of one of their code review tools for $5.  As a part of this promotion, they selected a start-up company to receive the funds collected from the promotion.  Infochimps won!  Smart Bear has graciously donated $2220 to Infochimps to help our mission of increasing the world’s access to data.  We appreciate their acknowledgment of our work and we know we can put the funds to good use.

To see how we reacted to the news, check out the video below:

[youtube=http://www.youtube.com/watch?v=ZLtR8_qw_yM&hl=en&fs=1&]

Open a banana like a Monkey does 8 months, 7 days ago. by mrflip

Open a Banana like a Monkey – most human primates do it wrong!

[youtube=http://www.youtube.com/watch?v=nBJV56WUDng]

To go with open banana here is open banana data:

It's Hot, Damn Hot. So Hot I saw a Chimp in Orange Robes Burst into Flames. 8 months, 9 days ago. by mrflip

It’s been ridiculously hot ridiculously early this year in Austin. A friend passed along this link to a visualization of 100+ degree days over the last 10 years. The author couldn’t find data extending back farther than 2000, but luckily I knew where to look.

I pulled the NCDC weather for Austin from 1948-present (see infochimps.org link for details) and got my Tufte on.

This temperature cycle is hotter than but comparable to the 1950-1965 era. I’ve got no idea if it’s global warming or the peak of a cycle. The fundamental conclusion — that this year so far, 2000 and 2008 were damn hot — stands up well.
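For anyone who wants to redo the count at home, here is a minimal sketch of the tallying step, not the code behind the chart. It assumes the daily highs have already been parsed into (ISO date, high in °F) pairs; the function name `count_hot_days` and the sample readings are made up for illustration, and the actual NCDC file layout is not shown here.

```python
from collections import Counter


def count_hot_days(daily_highs, threshold_f=100.0):
    """Count, per year, the days whose high met or exceeded threshold_f.

    daily_highs: iterable of (iso_date_string, high_temp_f) pairs,
    e.g. ("2009-06-15", 103.0). A high of None marks a missing reading.
    """
    counts = Counter()
    for date_str, high_f in daily_highs:
        if high_f is None:
            continue  # skip days with no recorded high
        if high_f >= threshold_f:
            year = int(date_str[:4])  # ISO dates start with the 4-digit year
            counts[year] += 1
    return dict(counts)


# Made-up sample readings (NOT real NCDC data), only to show the input shape.
sample = [
    ("2008-06-01", 101.0),
    ("2008-06-02", 99.0),
    ("2009-06-14", 100.0),
    ("2009-06-15", 103.0),
    ("2009-06-16", None),
]

hot_days = count_hot_days(sample)
print(hot_days)
```

Years with no qualifying days simply do not appear in the result, so fill in zeros before plotting a year-by-year series.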

(more…)

Congrats Retrosheet – another decade of rich Baseball data online 8 months, 14 days ago. by mrflip

Congrats to Retrosheet, who now have full major-league baseball box scores from 1920-1930 online! (This is in addition to full box score coverage for 1953-2008, and broad coverage of box scores and play-by-play data from 1871-2008.) As Nate Silver has said, “Baseball is the perfect dataset”, and we would not have this astonishingly rich and detailed dataset if not for the dedicated crowdsourced efforts of the Retrosheet team.

Infochimps metadata entries for these datasets:

Start-up Checklist 8 months, 20 days ago. by mrflip

On Jessica Hagy’s “Indexed”, the “Start-up Checklist”:

What's New 10 months, 24 days ago. by Joseph Kelly

Infochimps has been acknowledged as a finalist by the Capital Factory for 2009.

Infochimps is also a finalist in PepsiCo’s pitch competition.

Infochimps has a Facebook page! Become a fan.

Katherine at The New Civilization is aiding us in UX design for our Beta, to be launched at the end of May.  Eve Simon in Washington DC is helping us with the site design.  Our two big goals for the Beta are:

1) Improved browseability of the datasets, including a search bar and better surfing through tags, categories, and collections.

2) Uploading capability.  Users will be able to create accounts and upload datasets, as well as edit the descriptions of other data on the site.

Drop us a line anytime at info@infochimps.org

@mrflip's OpenGov Talk: Data Commons and Transparent Government 11 months, 1 day ago. by mrflip

Here is my (mrflip’s) SxSW OpenGov talk, “How Open Data will help build Open Government”:

[youtube=http://www.youtube.com/watch?v=4Aprr9a6XEM]

There is nothing more painful than watching yourself talk. So I haven’t gone all the way through this video — if you see me don’t give away the ending. Huge thanks to Silona Bonewald (League of Technical Voters) for organizing this, and to Terry Walhus (spring.net) for taping and copying and editing and uploading the videos.

I love it when a plan comes together… 1 year, 0 months ago. by mrflip

So Simon Willison (@simonw), one of the architects of The Guardian’s Open Platform and co-creator of a modestly popular web framework, is here at SxSW and gave an informal talk (on Zeppelins, of course – what else?). Freebase community manager Kirrily Robert (@skud) saw my tweet and proposed a meetup. After iteratively solving the three body problem, we put out the word on Sunday morning for a meetup on Sunday evening… SemWebAustin @juansequeda and Freebase @jameshome each pinged their 1-neighborhood and next thing you know I’m sitting next to Jure Cuhalev of Zemanta and machine learning machine @Nikete trying to orchestrate overflow seating for 25+ data geeks.

The reason for the gossip-column style of this post is to show the size and breadth of the data geek crowd. James Home and I agree that we need to turn out this Cyrus’ army of data geeks to take over a much larger part of SxSW next year. We need talks on column-store databases and hadoop, linked data and the construction of the data commons, how NLP and machine learning can power inspiring audience-driven websites, on the developing grammar of Information Visualization, on Processing and Prefuse and R. Pete Skomoroch, Mike Driscoll and Christian Chabot all ended up skipping SxSW this year; we need them leading a panel discussion on how to visualize >10M point datasets with limited-bandwidth desktop and web interfaces. I’d like to hear Deepak Singh and one of the @cloudera’ns drop science about scalable cloud computing.

The evening was just informal mingling and conversation, but at the request of @mndoci and @dataspora, here is our name-droppy slice of the whirlwind:

@mrflip: Learned about how Zemanta is already putting Linked Data and NLP together to make blogging better. Jure is excited about infinite monkeywrench and might be brave enough to pre-alpha its inchoate HTML munger. Got to hear what Blaine Cook of Osmosoft is doing to solve the fractured twitter/facebook/identi.ca/500M-person-strong-local-social-networks-you’ve-never-heard-of ecosystem, and he gave some great feedback on our upcoming Twitter Census. Also got to learn, after pontificating that OAuth is hard, that I was talking to its architect; a great discussion with Blaine and ENTP Uruguay Evan Henshaw-Plath followed about the Rails authorization/identity/authentication stack.

Mike Migurski of @stamen is going to get together with infochimp @dhruvbansal to push the Open Street Maps dataset into the Amazon Public Data Sets collection. Harper Reed of Threadless was running off for a 6am (ugh) flight to babysit servers in Chicago by the time we chatted, but pointed towards his Chicago Transit API project. His post on Hidden APIs is a great read BTW. Ran into @Slicehost Matt Tanase at a party after; Rackspace is getting much Cloud-ier, including a 1.5cents/hour pay-as-you-go 256MB slice offering. I’m hoping to talk later about our MachetEC2 project and get his thoughts about how to put open data on tap in the cloud. Jon Pierce and I discussed the Mets’ chances this year and what he sees for big data startup possibilities. Only got to briefly intersect with Andrew Turner about open geocommons, and was chagrined to learn I was shoulder to shoulder with one of gnip but didn’t get to chat. Hope to fix that later.

This meeting alone made SxSW worth it, and I’m looking forward to more discussion later. You can stalk me on twitter as @mrflip or at http://sxsw2009.sched.org/flip. By the way, I’m giving a lightning talk on Open Data in government at Fiddler’s Hearth, 301 Barton Springs Rd at 12:30 — drop by or catch the webcast later.

Amazon Web Services hosts DBpedia, Freebase data sets 1 year, 0 months ago. by Joseph Kelly

The Infochimps.org community played a part in pushing the DBpedia and Freebase data sets to Amazon Web Services.  This is an auxiliary effort by Infochimps.org to increase access to data.  It is important to have the data in places where there are the right tools for people to use it.  AWS is the place: look at creating an Amazon Machine Image to start working with the new data sets.  Our MachetEC2 can help; please let us know how your experience was in using it.

Thanks to Kingsley Idehen with Linked Open Data for being a good point of contact. 

We will upload more data sets to AWS in the near future.  Any requests?