
Eric Ries’ Startup Lessons Learned 10 months, 3 days ago. by Joseph Kelly

In June the Infochimps attended an event in Austin where Eric Ries gave a talk about the Lean Startup. His ideas inspired further reading, and we have been applying his methodology to making Infochimps.org a sustainable and profitable web service. Here is a breakdown of two of the ideas Eric writes about, which also overlap with Steve Blank’s wonderful book, The Four Steps to the Epiphany.

1) Product development vs. customer development: In product development the team builds a product that they spec’d out themselves in the early stages. Customer development instead is about developing the market. It is a more holistic approach to building a company and launching a product. And customer development deeply integrates with agile software development. Every code deploy happens for a reason – it is in the service of some story that solves an identified need of the customer or users. How do you know what those needs are? You need to have talked to real customers and users.

Our site is built by two physics researchers, scientists intimately familiar with the problems of finding and sharing data on the web. They have thought well into the future about how our site can solve these issues. Our feature list is long and describes a killer application. Problems arise, however, when we try to organize and prioritize this list. User testing helps tremendously. Observing how people use the site teaches us which features our users have trouble with and which we can neglect because they aren’t being used. For example, user testing showed that Search is our most important feature and that browsing by categories is less important.

Once we started talking to customers, our organizational priorities became much clearer as well. By talking to Data Suppliers, we learned which features of the site matter most to them, which clauses of our Data Supplier Agreement gave them the most trouble, and how best to talk to them about selling their data on our site.

2) What type of market are you in? Steve Blank drives this point home in nearly every chapter of his book. Is your product competing in a market that already exists? If so, does it resegment that market by price or niche? Or is your product creating a new market?

Steve’s clearest example of this is the PDA market. When the first PDA came out, it created a new market. People could now do something they had never been able to do before: sync their computer with a handheld device and work on the go. Marketing and PR efforts had to go towards educating people about these new tools and what they could do, rather than towards product features. Once PDAs became an existing market with multiple players, marketing and PR efforts had to switch goals, and the conversations became less about the new possibilities and more about individual features, like whether this PDA had 8MB of memory and a 10in screen.

At Infochimps we have to split our pitch between the existing markets we resegment and the new markets we create. Data is already sold in the Market Research and Finance industries, and our website resegments this existing market by offering different features and benefits. When we spoke to Zogby we didn’t have to tell them they could sell their data, because they already do. We just had to show them why Infochimps is a different and better solution. Most other businesses do not yet sell their data, and enabling them to do so is exactly what our website is for. It is much harder to talk a taxicab company into selling their data, because we first have to make the case that this is a profitable possibility. Our job is to educate this mainstream market about the new opportunities they can take advantage of with their data.

The data landscape online, as we see it. Part 1 10 months, 23 days ago. by Joseph Kelly

Nathan at FlowingData did a wonderful job last week culling 30 great resources from the world wide web for finding data. Yesterday another site launched, Factual, making great resource number 31. We are excited to see a growing number of companies spring up that in turn increase everyone’s access to data. Solving the problems with data online is no small task, and it is not one that any single player can take on alone. It’s a team effort, which we are proud to be a part of.

We thought we would take a minute today to talk about the problems as we see them, and how players within the online data market are choosing to tackle these problems.

The first problems are finding and sharing data. Most of these sources already solve these problems. Socrata and Factual let users upload data onto their sites, and each company’s datasets are easily searchable, as are those on Data.gov and Numbrary.

There are also other, more technical issues. Swivel, Socrata, Factual, Many Eyes: all of these websites allow users to play around with data live on the site. This creates two costly problems for the hosting company.

1. The data has to live in their platform and reconcile with the whole.

2. Many new datasets are on the order of gigabytes in size.

Whereas datasets on Infochimps can be of any size, format, or shape, their datasets must be in a standard csv/tsv/xls format and are limited to a few hundred megabytes. In reality, statisticians want data in SAS formats, and geographical data comes in GIS formats. Because today’s datasets are larger, tools within a browser will be insufficient to work with and understand the data, and a person’s options for distributing that data are also limited.

Data, especially valuable data, is often proprietary. The owners of that data won’t release it unless there are clear licenses and terms of use. We are committed to hosting open data for free and maintaining our open data commons for everyone’s benefit. Where we differ from these other open data players is that we will also host licensed data. Unfortunately, open data doesn’t include all of the data in the world. For the rest, what we offer organizations is the ability to permit only users that have agreed to a license or paid for access to download their data. As the data marketplace grows, we believe more and more buyers will realize the value proposition in looking for data on Infochimps. Our aim is to give an incentive to the long tail of businesses with data gathering dust on hard drives that could otherwise be useful to another person or organization.
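
To make the access rule above concrete, here is a minimal sketch of how such a check might look. It is not our actual implementation: the `Dataset` and `User` classes, their fields, and the `can_download` function are all hypothetical names made up for illustration, and we assume that a dataset can carry a license requirement, a price, both, or neither.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Dataset:
    # Hypothetical model: None means no license has to be accepted (open data).
    name: str
    license_required: Optional[str] = None
    price_usd: float = 0.0


@dataclass
class User:
    # Hypothetical model: licenses the user agreed to and datasets they bought.
    accepted_licenses: Set[str] = field(default_factory=set)
    purchases: Set[str] = field(default_factory=set)


def can_download(user: User, dataset: Dataset) -> bool:
    """Allow the download only if every condition the owner set is met."""
    if dataset.price_usd > 0 and dataset.name not in user.purchases:
        return False  # priced dataset that has not been paid for
    if dataset.license_required and dataset.license_required not in user.accepted_licenses:
        return False  # license terms not yet accepted
    return True  # open data, or all conditions satisfied


open_data = Dataset("open-example")
for_sale = Dataset("paid-example", license_required="example-terms", price_usd=15.0)

visitor = User()
buyer = User(accepted_licenses={"example-terms"}, purchases={"paid-example"})

print(can_download(visitor, open_data))  # True
print(can_download(visitor, for_sale))   # False
print(can_download(buyer, for_sale))     # True
```

Open data passes straight through, so the free commons and the licensed datasets can live behind the same check.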

Calling all Pollsters 10 months, 26 days ago. by Joseph Kelly

Carl Bialik, of the WSJ Numbers Guy blog, highlighted the recent controversy in the opinion polling industry over Strategic Vision’s decision not to share their polling methodology or raw data. Pollster.com and FiveThirtyEight have also weighed in on the problem.

Our message to opinion polling firms is this: share or sell your data on Infochimps.org.

Public polls can be distributed for free on our site. If you’d like to charge for the download of your data, set your own price. Your data will live in a place where the whole world can find it, bringing you a larger and broader audience.

Get in touch at upload@infochimps.org.

New site is live 11 months, 15 days ago. by Joseph Kelly

Thanks to everyone new who has come by. We appreciate the coverage from www.gigaom.com and others. We thought we’d spend a moment to cover what we hope to accomplish with this launch.

With this launch anyone can edit or add datasets to the site. Very soon, uploading will work and we can host and distribute openly licensed datasets for free. These are our steps towards building an open data commons.

Additionally, this new site offers a few datasets for sale. These datasets are not ours, but owned by others. We make a commission on the sale of these datasets. An example is the TAKS dataset, which contains all of the standardized test score data for students in the state of Texas. This dataset cost one particular researcher $1400 to free from the government coffers, and the format it came in was awful. On Infochimps you can find the same dataset in a cleaned-up format, and for a much lower price: $15.

We consider this marketplace offering an incentive to the world of data gatherers to put their data somewhere others can find it. By letting people charge for their data, we encourage data to come out of the woodwork that might otherwise remain behind closed doors.

We hope you enjoy playing around with the site. If you are excited to send data our way before we get upload working, please get in touch: upload@infochimps.org.

jammin’ to data 11 months, 16 days ago. by maegan

While swinging through the jungle, one of the Infochimps came across this awesome They Might Be Giants video, Meet the Elements, featured on FlowingData.

[youtube=http://www.youtube.com/watch?v=d0zION8xjbM]

Inspired, we decided to start hunting for more awesome data viz music videos.

Here’s Radiohead’s House of Cards (uses 3D data)

[youtube=http://www.youtube.com/watch?v=8nTFjVm9sTQ]

Another one of our favorites is Röyksopp’s Remind Me

[youtube=http://www.youtube.com/watch?v=1Xhdy9zBEws]

Hungry for more? Check out search results from FlowingData on music and video.

APIs and Datasets, living in harmony 11 months, 20 days ago. by Joseph Kelly

The most popular way to access data on the web right now is through an API. APIs provide real-time data, an incredible advantage, and outsourced computation. These are advantages for the end user and the developer, while the API provider has to eat the cost of providing such a service. It is worth it for the provider, though, as a myriad of services can be joined with their primary service.

There are some things an API can’t give you, though. An API generally cannot give you historical data; Twitter’s API, for example, only lets you go back XX number of tweets. This means that a service built late in the game may not carry the same value as a service that was built in the early days of an API, as the latter’s data goes back further.

Next, APIs only give you pieces. The scale of questions you can ask is limited by the rate limit and by the size of the pieces returned. Services can’t ask for everything, and they may be further limited by the bandwidth and load on the primary API.
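
A bit of back-of-the-envelope arithmetic shows how quickly those limits bite. The sketch below estimates how long it would take to pull an entire dataset through a rate-limited API. The function name and every number in it are hypothetical, chosen only for illustration; they are not the actual limits of Twitter’s or anyone else’s API.

```python
import math


def crawl_hours(total_records: int, records_per_request: int, requests_per_hour: int) -> float:
    """Hours needed to fetch every record once through a rate-limited API."""
    requests_needed = math.ceil(total_records / records_per_request)
    return requests_needed / requests_per_hour


# Hypothetical figures: 50 million records, 100 per response, 150 requests an hour.
hours = crawl_hours(50_000_000, 100, 150)
print(f"{hours:,.0f} hours, or about {hours / 24:,.0f} days")
# prints "3,333 hours, or about 139 days"
```

Under these made-up numbers a single client would need months of uninterrupted crawling just to see everything once, and by then the early pages would be out of date.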

The types of questions we’re talking about have to do with the deep structure of the data in question. One of the reasons our near-complete scrape of Twitter’s friend graph was so popular is that this type of dataset is extremely valuable to network researchers. The sort of research a graph like Twitter’s makes possible is phenomenal. Without such a dataset, researchers are left like the Antarctic explorers of the past, slowly crawling new territory, making maps and filling in details only as they come along, piece by piece.

The value APIs provide to the service and the outside world is undeniable. The problems that APIs leave open can be solved if those services periodically provide complete dumps. These datasets of complete and historical data will not only let researchers get to work improving their science, but will also allow applications to seed their service with the latest dataset, then begin updating through the API.
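
Here is a rough sketch of that seed-then-update pattern. Everything in it is an assumption made for illustration: records are simple `(id, payload)` pairs with increasing ids, and `fetch_since` stands in for whatever call a real API offers to return records newer than a given id. It is not any particular service’s API.

```python
from typing import Callable, Dict, Iterable, Tuple

Record = Tuple[int, str]  # hypothetical record shape: (id, payload)


def seed_then_update(
    dump: Iterable[Record],
    fetch_since: Callable[[int], Iterable[Record]],
) -> Dict[int, str]:
    """Load a complete dump, then ask the API only for what is newer."""
    store = dict(dump)               # 1. seed from the periodic dump
    last_id = max(store, default=0)  # 2. newest record the dump contains
    for record_id, payload in fetch_since(last_id):
        store[record_id] = payload   # 3. top up through the API
    return store


# Stand-ins for a downloaded dump and a live API.
dump = [(1, "a"), (2, "b"), (3, "c")]
live = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]


def fake_fetch_since(last_id: int) -> Iterable[Record]:
    return [record for record in live if record[0] > last_id]


store = seed_then_update(dump, fake_fetch_since)
print(sorted(store))  # [1, 2, 3, 4, 5]
```

The application only asks the API for the two records the dump was missing, which is the whole point: the dump carries the history and the API carries the news.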

Should services share their data on a platform like Infochimps, they not only provide a great service to applications and researchers, but they also reduce their own costs. The load on their API is lighter, as fewer requests have to be made for data. And when researchers have the complete dataset sitting on their hard drive, they no longer depend on the API provider for compute time, because local access to the data makes their work much faster and easier.

The two solutions for sharing data are complementary. Freebase does a great job at this, and we are hoping other services will soon follow suit.