
Partner with us 14 days ago. by Joseph Kelly

2009 was a great year for us.  We made lots of progress on the website (with a long way to go), but we were especially excited for all the great contacts we made with other developers and companies interested in data.  We strongly encourage all of our followers (you!) to get in touch with us to talk about your expertise and data needs.

We will create a page on the site soon which lists our network of data mechanics, data tools, and solution providers.  One of the issues with our site is that many of the datasets can’t be used by everybody – some are too large for Excel and average tools, and others require specialized skills in order to use.  Our TAKS and Twitter datasets are just some examples of datasets that can be really powerful for a lot of businesses only after an expert has had the chance to analyze them.

Here are a few of the great companies who have worked with us so far:

QlikApps: QlikApps is a marketplace for QlikView applications.  QlikView is business intelligence software that supports third-party applications.  Having trouble understanding your AWS reports?  QlikApps has a free AWS report analyzer that we’ve found useful.

UPDATE: Check out QlikApps’ slideshow on some of the data from our Twitter Census!

Data Applied: Data Applied’s application is truly magic.  Their software is putting the power of machine learning into the world’s hands.  Techniques and algorithms that people wrote Ph.D. theses on 20 years ago are here at the click of a mouse.  Try them out with a free account.

DataMiningTools.net: A startup based in India, DataMiningTools.net is doing a wonderful job working to educate the masses on data mining tools and resources.  Find tutorials on clustering analysis, R, Matlab – you name it.  Check out videos on Data Applied and your very own Infochimps!

If you are a Data Mechanic, another data company, or just interested in being listed as a solutions provider, please get in touch with me at joe@infochimps.org.  Likewise, if you’re a Ruby/Rails developer, we’re hiring!

Data.gov import 27 days ago. by Joseph Kelly

Infochimps is pleased to announce a recent import of all of the data from Data.gov!  Data.gov was one of the more exciting things to happen last year for the world community, and it has had a big impact in the US and internationally by setting a precedent for government data sharing.  We hope that including these datasets in our collection increases their visibility and makes them useful to the world at large.

The fact that users can edit this data makes it much more usable and interesting.  Unlike Data.gov, users on Infochimps can upload datasets and even upload different versions of datasets to the site.  So when a dataset comes from the government in some messy, incomprehensible format, you can do what Infochimps user Ganglion did and upload a better version.  This type of Wikipedia-style curation of datasets is where Infochimps got its name.  Because data drudge work (column titles, formatting issues, etc.) is fit for a chimp, this type of work should only be done once.  And may the result live on Infochimps!

Take a look at the Data.gov collection to get started.

Visualizing Chinese media 1 month, 12 days ago. by nickster

For data geeks interested in the developing world, few places are more compelling to gather numbers about than China. This owes much to its legendary economic growth, the staggering size of its population and global footprint, and hybrid political system. But there is another, often overlooked characteristic of the country at work here: its relentless pursuit of what it calls “scientific development”, which emphasizes the use of scientific research as a means to achieve social harmony and balanced economic growth, has led to an explosion in data-fueled, science-based policy. As a result, China is now one of the largest and most sophisticated data-gathering entities in the world.

There’s a good reason for this. Unlike China’s early post-revolution cadres, the ranks of China’s top leadership today are brimming with scientists and engineers, including President Hu Jintao, who has a degree in hydraulic engineering. When government “works”, these technocrats steer Chinese policy down a painfully cautious course based on five-, ten-, and even twenty-year plans crafted to satisfy discrete social, economic, and technological benchmarks. At any given moment, the country is teeming with pilot projects spanning areas like subsidized housing, health care, industrial development, and family planning, which will ultimately be scrutinized by the country’s National Development and Reform Commission for use at the national level.

This science-based approach is exactly why China has recently come forward with ambitious carbon emissions targets–global warming has a direct, significant impact on its population, and therefore social stability. None of these projects could be completed without good data, and China knows it.

While we’ve been emphasizing social media data with recent posts, we hope to shine more light on the state of Chinese data and bring more of it into the repository in the near future. To this end, and as a special holiday treat, we’re releasing a visualization of major Chinese websites we scraped this past October during the country’s meticulously executed 60th anniversary of its founding. We find the bright colors and flashing lights to be particularly seasonally appropriate.

Click here for the visualization

Open Data Applications 2 months, 1 day ago. by maegan

With President Obama’s Open Government Directive and news about Data.gov’s overhaul, more and more people have been talking about the benefits of open data. Yes, this includes greater transparency and a more accountable government, but it also gives birth to useful apps that use these newly available datasets.

A lot of these apps have been made for competitions like Sunlight Labs’ Apps for America and various cities’ own initiatives like NYC BigApps. Understandably, they provide appealing incentives for programmers. (If not the recognition, the cash prizes are appealing.)

All that said, these competitions have spawned very useful apps. Here are 5 that we feel are great examples of the good that can be done with government data:

1. This We Know (www.thisweknow.org)
This We Know is an excellent tool that provides a wealth of information sourced mainly from Data.gov. You name a place and it tells you what we know about that location – things like demographics or the number of factories in the area. It’s also presented in a very clear fashion, condensing data into an easily understandable and still useful format.

2. StumbleSafely (www.outsideindc.com/stumblesafely)
This app from DC literally helps you stumble safely home. It uses data on crime and geography to map out safe routes from the more (in)famous bars in the city, no matter what time you like to party – day, evening or night.

3. NYC Way (www.nycway.com)
An iPhone app, NYC Way puts a plethora of useful information at the fingertips of locals and tourists alike. Location-aware, it draws from a variety of datasets in the NYC.gov Data Mine and gives you facts about nearby zoos, wi-fi spots, emergency rooms, and a lot of other useful places to help you find your way in the big city.

4. EveryBlock (www.everyblock.com)
This one’s not yet available in Austin, but it does have versions for 15 cities across the nation. EveryBlock provides you with a newsfeed of things going on around a user-specified address or location in these cities. It also allows you to browse by topic and track trends over time.

5. iKidNY (www.ikidny.com)
Not all apps are useful just for adults – this iPhone app, iKidNY, helps you find kid-friendly places all over NYC. It provides you with locations and information about activities, kid-friendly restaurants, playgrounds, and even changing tables and subway elevators.

If you want to look at more apps, these competitions’ submission galleries are worth a look:
Apps for America 2
Apps for Democracy
NYC BigApps
DataSF

Did we miss out on your favorite app? Let us know! We’d love to check it out.

Twitter data, open questions to Developers, Academics, and Data Geeks 2 months, 21 days ago. by Joseph Kelly

We are excited to announce the re-release of the Twitter datasets, and a discount to the Twitter API Map dataset.  Again, the datasets are:

and

Conversation Metrics, with Token Count of:

This time the data is being released with Twitter’s approval.  We are talking with them about how we can increase access to more and more bulk data, and need your help in showing them how useful this data really is.

We want to make clear to people with privacy concerns that we absolutely hear and respect your points, and so does Twitter.  These datasets contain NO personally identifiable information, they do NOT contain whole tweets, and they meet the guidelines laid out in this EFF document (on personally id’able info).

We encourage everybody to take advantage of this weekend’s discount and go build great things with this data.  Let’s show Twitter and the world what is possible when one has access to bulk data:

  • Data geeks and Visualization studs: what would you do if you could run jobs across our massive crawl (or the full Twitter graph)?
  • App devs: what data do you want those nerds to extract?  How would it improve the experience of Twitter or enable new things?
  • Businesses: how can this data improve your services?  How can this data make you money?
  • Academic researchers: what amazing things will you uncover by exploring the social network’s deep structure?

Reach out to us in the comments or send us ideas at info@infochimps.org.

The data landscape (Part 2), and Microsoft 2 months, 22 days ago. by Joseph Kelly

The data platform industry has a new entrant this week!  Yesterday Microsoft announced a data store of their own at their developer conference.  Called Dallas, their offering is another example of a data marketplace.  The market for selling data online in an open way is still young (how many platforms besides ours and Microsoft’s do you know?) and so it is validating to see another entrant in this space.  We know that Microsoft will encourage the developer community to explore what these new platforms make possible.

Like many other services, Dallas meters out data through an API, which is helpful to programmers with limited resources.  With Infochimps, however, developers get full datasets in bulk, which is better for many applications and essential for any kind of analytic work.

Both our marketplaces have the same value proposition: open up your data and profit.  When trying to convince an organization to open up its data, APIs can be an easier sell.  Even though they are costly to build and run, organizations may prefer the control APIs give them over what people can access, compared with our simple and cheap bulk solution.

It is still unclear what the size and format restrictions are on Dallas.  If they are like other services out there (Socrata, Factual), they need data that comes in a structured, rectangular format.  These constraints enable these services to display their data live online.  While Infochimps doesn’t have that feature (yet!), we can handle datasets at the terabyte scale as well as those that don’t fit the spreadsheet paradigm, such as social network graphs.

Dallas is also part of a platform that forces users to integrate with other Microsoft services.  Infochimps’ mission is simply to connect people with the data they’re looking for, and we let anyone download data without having to register for an account.

We are proud to be a part of a strong community that’s grown over the past year, and to continue our commitment to an open data commons.  On the commercial side, we are narrowing our focus to the right verticals after months of talking with this new market about what is possible.  That ultimately is what this is about – enabling something that couldn’t be done before, and connecting buyers to sellers and people to knowledge.

Twitter data update 2 months, 25 days ago. by Joseph Kelly

Our launch of the Twitter data was a great success, and we thank Marshall Kirkpatrick at ReadWriteWeb (also) and Jordan Golson at GigaOm for their coverage. The community reaction has been overwhelming and energizing. We accomplished our two main goals: crack open some issues close to our hearts and kick-start the conversation about sharing data online.

Twitter has advanced some reasonable concerns, however, and has asked us to take the datasets down. We have temporarily disabled downloads while we discuss licensing terms. The outcome of discussions will, we hope, encourage more internet services to open up and share data in bulk. The two biggest issues this data release highlighted are third-party redistribution and user privacy.

Redistribution rights. Twitter maintains a legendarily open API:

“Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services.
“We encourage and permit broad re-use of Content. The Twitter API exists to enable this.” [highlighting added by us]

However, Twitter wants to more closely control who has access to data at massive scale and to prevent its malicious use. We understand this concern — innovation is always a double-edged sword. The applications and services that can use this data to make the world a better place far outnumber those with bad intentions, however, and good people need better access to this type of data. The best solution is to apply a reasonable license to the data. We are addressing this in our talks with Twitter, and we expect to have a resolution soon.

User privacy. What little criticism we heard from the community concerned the potential for a breach of user privacy. This is an issue with many types of internet data, and one we take seriously. We ensured that the datasets released posed no such dangers. The Token Count data contained no personally identifying information, only what the entire mass of Twitter users was discussing over time. The API ID Mapping Dataset is simply a sort of phone book for the Twitter APIs: it converts screen names to numeric IDs and reveals absolutely nothing about the corresponding user. Infochimps.org’s policy is to not host any personally identifying information of non-consenting individuals, and we apply this rule to any data that goes on the site from any source.

These are hard issues and it took a bold move to bring them into the open. It will take further sharing and discussion to establish best practices for these concerns so that Twitter and other internet services (Facebook, Amazon, etc.) can share their data to the benefit of the greater online community. Stay tuned while we agree upon appropriate licensing for open sharing of this social data.

Twitter Census: Publishing the First of Many Datasets 3 months, 0 days ago. by Joseph Kelly

As useful as the Twitter API is, developers, designers, and researchers have long clamored for more than the trickle of data that service currently allows. We agree — some of the sexiest uses of data require processing not just all that is now, but the vast historical record. Twitter doesn’t provide the only use case for this, but until now its historical bulk data has been hard to find.

Today we are publishing a few items collected from our large scrape of Twitter’s API. The data was collected, cleaned, and packaged over twelve months and contains almost the entire history of Twitter: 35 million users, one billion relationships, and half a billion Tweets, reaching back to March 2006. The initial datasets are a part of our Twitter Census collection.

The first dataset, a Token Count, counts the number of tokens (hashtags, smileys, and URLs) that have been tweeted. The data is available for free by month and for pay by hour. Think about comparing this data to the stock market, new movies, new video games, or even trendingtopics.org. For example, use it to track the adoption of Google Wave by the rate of its mentions. On one payload’s page you will find a snippet with a sample taken during Kanye West’s outburst in September, and on another’s you can see that the “:)” emoticon has been used 135,000 times.
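To give a feel for what a token count involves, here is a minimal Python sketch. It is not the code we used to build the dataset: the regular expressions, the `count_tokens` name, and the hour labels are all assumptions made for illustration, and matching is case-sensitive.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; the real extraction rules may differ.
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")
SMILEY = re.compile(r"[:;]-?[()DP]")


def count_tokens(tweets):
    """Count hashtags, URLs, and smileys per time bucket.

    `tweets` is an iterable of (hour, text) pairs, where `hour` is any
    hashable bucket label such as "2009-09-13T21".
    Returns a Counter keyed by (hour, token).
    """
    counts = Counter()
    for hour, text in tweets:
        # Pull URLs out first so a "#fragment" inside a URL is not
        # miscounted as a hashtag.
        urls = URL.findall(text)
        remainder = URL.sub(" ", text)
        tokens = urls + HASHTAG.findall(remainder) + SMILEY.findall(remainder)
        for token in tokens:
            counts[(hour, token)] += 1
    return counts
```

Feeding it a stream of (hour, text) pairs yields one tally per hour and token, which can then be rolled up to days or months.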

The second dataset solves a large problem developers have when they use Twitter’s Search API and the Twitter API, as each API gives back a different unique string for every user on Twitter. This dataset maps user IDs between the two APIs for 24.5 million users. This mapping should be a godsend to Twitter app developers, as it allows them to easily combine data from each API, letting API calls for friends lists mix easily with searches on the Twitter Search API.
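As a rough sketch of how such a mapping might be used, the Python below joins search results to the main API’s user IDs. The row layout (screen name, Search API ID, Twitter API ID), the `from_user` field, and both function names are assumptions for illustration, not the dataset’s actual schema.

```python
def build_lookup(mapping_rows):
    """Index mapping rows by lowercased screen name.

    Each row is assumed to be (screen_name, search_id, rest_id).
    """
    return {
        name.lower(): (search_id, rest_id)
        for name, search_id, rest_id in mapping_rows
    }


def attach_rest_ids(search_results, lookup):
    """Yield each search result with the main API's user ID added.

    Results whose author is missing from the mapping get None.
    """
    for result in search_results:
        ids = lookup.get(result["from_user"].lower())
        yield {**result, "rest_user_id": ids[1] if ids else None}
```

With the lookup held in memory, each search result can be annotated in a single pass before asking the main API for things like friends lists.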

These datasets are only views from the massive collection we have been growing over the last year. We will be releasing additional datasets regularly over the next few weeks so please check back for updates. If you’d like a custom slice or analysis done on this data, please get in touch at imw@infochimps.org.

With the release of this data, we hope to send a signal that this data is valuable and useful to real-time search engines, Twitter apps, and social media researchers. This should start a conversation about where value really lies in this type of data and about the various ownership and privacy issues that arise, and it should show that Infochimps.org is the place to go to find data. We invite interested parties to get in touch and begin uploading their data (try invite code “newsupplier”) today as part of the Infochimps marketplace.

Eric Ries’ Startup Lessons Learned 3 months, 8 days ago. by Joseph Kelly

In June the Infochimps attended an event in Austin where Eric Ries gave a talk about the Lean Startup. His ideas inspired further reading, and we have been applying his methodology to making Infochimps.org a sustainable and profitable web service. Here is a breakdown of two of the ideas Eric writes about, which also cross over with Steve Blank’s wonderful book, The Four Steps to the Epiphany.

1) Product development vs. customer development: In product development the team builds a product that they spec’d out themselves in the early stages. Customer development instead is about developing the market. It is a more holistic approach to building a company and launching a product. And customer development deeply integrates with agile software development. Every code deploy happens for a reason – it is in the service of some story that solves an identified need of the customer or users. How do you know what those needs are? You need to have talked to real customers and users.

Our site is built by two physics researchers – scientists intimately familiar with the problems of finding and sharing data on the web. They have thought well into the future about how our site can solve these issues. Our feature list is long and describes a killer application. Problems arise, however, when we try to organize and prioritize this list. User testing helps tremendously. Observing how people use the site teaches us which features our users have trouble with and which features we can neglect because they aren’t being used. For example, user testing showed that Search is our most important feature, and that browsing by categories was less important.

Once we started talking to customers, our organizational priorities became much clearer as well. Through talking to Data Suppliers, we learned what features are most important to them on the site, which clauses of our Data Supplier Agreement they had most trouble with, and what the best way is to talk to them about selling their data on our site.

2) What type of market are you in? Steve Blank drives this point home in nearly every chapter of his book. Is your product competing in a market that already exists? If so, does it resegment that market by price or niche? Or is your product creating a new market?

Steve’s clearest example of this is the PDA market. When the first PDA came out, it created a new market. People could now do something they had never been able to do before – that is, sync their computer with a handheld device and work on the go. Marketing and PR efforts had to go towards educating people on these new tools and what they could do, rather than talking about product features. Once PDAs became an existing market with multiple players, marketing and PR efforts had to switch goals, and the conversations became less about the new possibilities and more about individual features, like whether this PDA had 8MB of memory and a 10in screen.

Infochimps has to split its pitch between the existing markets we resegment and the new markets we create. Data is already sold in the Market Research and Finance industries – our website resegments this existing industry by offering different features and benefits. When we spoke to Zogby we didn’t have to tell them they could sell their data; they already do this. We just had to show them why Infochimps is different and a better solution. Data is not already sold by businesses everywhere, but our website is enabling just this. It is much harder to talk a taxicab company into selling their data – we first have to make the case that this is a profitable possibility. Our job is to educate this mainstream market about the new opportunities they can take advantage of with their data.

The data landscape online, as we see it. Part 1 3 months, 28 days ago. by Joseph Kelly

Nathan at FlowingData did a wonderful job last week culling 30 great resources from the world wide web for finding data. Yesterday another site launched – Factual, making great resource number 31. We are excited to see a growing number of companies spring up that in turn increase everyone’s access to data. Solving the problems with data online is no small task, and not one fit for any single player. It’s a team effort, which we are proud to be a part of.

We thought we would take a minute today to talk about the problems as we see them, and how players within the online data market are choosing to tackle these problems.

The first problems are finding and sharing data. Most of these sources already solve this problem. Socrata and Factual let users upload data onto their sites, and each company’s datasets are easily searchable along with what’s on Data.gov and Numbrary.

There are also other, more technical issues. Swivel, Socrata, Factual, Many Eyes – all of these websites allow users to play around with data live on the site. This opens up costly issues for the hosting company.

1. The data has to live in their platform and reconcile with the whole.

2. Many new datasets are on the order of gigabytes in size.

Whereas datasets on Infochimps can be of any size, format, or shape, their datasets must be in a standard csv/tsv/xls format and are limited to a few hundred megabytes. In reality, statisticians want data in .sas formats, and geographical data comes in .gis formats. Because of the larger size of today’s datasets, tools within a browser will be insufficient to work with and understand the data, and a person’s options for distributing that data are also limited.
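To illustrate the point about size: a file far too large for a spreadsheet or a browser tool can still be summarized by reading it one row at a time. The Python sketch below is only an example under assumed conditions (a delimited text file with a header row); the `count_by_column` name and its parameters are hypothetical.

```python
import csv
from collections import Counter


def count_by_column(path, column, delimiter=","):
    """Tally the values of one column while streaming the file.

    Only one row is held in memory at a time, so the file itself can be
    far larger than available RAM (the tally still grows with the number
    of distinct values).
    """
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            counts[row[column]] += 1
    return counts
```

Passing `delimiter="\t"` handles tab-separated files the same way.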

Data, especially valuable data, is often proprietary. The owners of that data won’t release it unless there are clear licenses and terms of use. We differ from these other open data players in our commitment to host open data for free and maintain our open data commons for everyone’s benefit, but we will also host licensed data. Unfortunately, open data doesn’t include all of the data in the world. Instead, what we offer organizations is the ability to permit only users that have agreed to a license or paid for access to download their data. As the data marketplace grows, we believe more and more buyers will realize the value proposition in looking for data on Infochimps. Our aim is to give incentive to the long tail of businesses with data gathering dust on hard drives that could otherwise be useful to another person or organization.