
SxSW 2010 — Lecture Notes 3 days, 3 hours ago. by mrflip

Here are notes from Infochimps on interesting talks at SxSW!


Announcing bulk redistribution of MySpace data 5 days, 6 hours ago. by Joseph Kelly

Today, we’re excited to announce the availability of MySpace data for bulk download on Infochimps. We started speaking with MySpace in December and these datasets are the result of an agreement with them to redistribute their data, with revenue share, on Infochimps. This is a major step forward for MySpace in our eyes – a move that signifies their seriousness about data and the developers and academics that work with that data.

This data is not sold by MySpace, but given out for free from their API and then packaged by Infochimps for redistribution. By giving developers free access to publicly available real-time data (such as status updates, music, photos, videos), MySpace reinforces its commitment to powering the real-time social Web and the development of open standards.

  • Every day, MySpace processes over 32 million activities and updates
  • MySpace opened up its real-time data with free-to-use APIs letting developers create robust products
  • MySpace offers more scale and richer content like music, photo, videos, apps than anyone else
  • Real-time data input and the ability to then share that in real-time will drive the socialization of content on the web
  • Data available for bulk download will help usher the next generation of data-driven research and application development. Now, using a dataset like word count by hour, developers and content providers can better understand how things are talked about and when.

    The benefits of having data available for bulk download instead of just an API are numerous. Developers can start with a sample dataset and get their apps started faster. Academics are much better served by a .csv than an API, and developers can take advantage of the datasets these experts create as a result of their research. Opening one’s data to the big data community makes all this and much more possible.

    APIs aren’t enough. New tools like Hadoop allow for the processing of huge datasets but necessitate having a local copy of the entire dataset. The advanced analytics that come from computing on top of a huge dataset (and at 25GB/day the MySpace stream is massive) will power the next generation of applications.

    The developers looking for this data can come to Infochimps to find the data they need. Let’s foster a division of labor between the people who are experts in mining this data for insight, and the pros who can develop the applications on top of those discoveries. For example, Ryan Rosario of UCLA created a dataset of users’ moods by zip code, a historical emotional context for researchers, psychologists, and possibly a developer looking to take advantage of this MySpace feature.

    With that said, we hope everyone from our community will get involved, and upload their own creations of MySpace datasets. We’ll premiere the “best of MySpace” datasets in the hopes of supporting a relationship between MySpace and data-driven research and development. And any API owners out there should get in touch to talk about how we can make your data computable for the big data community.

    UPDATE: Here is a visualization of Users with geolocations from our dataset, User locations by lat/long:

    Data Cluster Meetup 8 days ago. by maegan

    Austin, TX may be the live music capital of the world, but next weekend Rackspace, together with Infochimps, WolframAlpha, Factual, and knowmore, are putting together an event that will prove it’s not just about the music.

    Data geeks from all over the nation will come together to discuss the latest developments in the world of data during birds-of-a-feather sessions, talks and pure and simple mingling (not to mention munching on free food) at the Data Cluster Meetup (Sunday, March 14, 6pm at Opal Divine’s Freehouse).

    Not excited yet? Read on…

    Non-relational Database Smackdown
    Stu Hood of the Cassandra project will lead a discussion that will debate the merits of various non-relational databases. Any CouchDB or MongoDB users out there? RSVP and get in touch to be involved in the panel.

    Birds-of-a-feather
    There will be five birds-of-a-feather sessions going on concurrently. The discussion topics were chosen so that you’ll be able to find one that interests you most:

    1. Operations (managing data) – Stu Hood of Rackspace and the Apache Cassandra project will lead a discussion on non-relational databases
    2. Analytics (exploring data) – [No moderators locked in, interested? Email info@infochimps.org]
    3. Web Applications (humanizing data) – [No moderators locked in, interested? Email info@infochimps.org]
    4. Visualization (seeing data) – [No moderators locked in, interested? Email info@infochimps.org]
    5. Data Commons (freeing data) – Infochimps’s own Flip Kromer, together with Factual’s Gil Elbaz will lead a discussion on building a cross-domain data commons.

    Mingling
    The best part of this event is the people. You’ll have time to talk, eat, and network with some of the greatest minds in the data world and exchange cutting edge ideas.

    If you’re a really smart data geek, you can’t miss out on this chance to immerse yourself in the world you love. RSVP now at http://datacluster.infochimps.org. Afterwards, check out our Facebook event page for more information on who’s coming and the latest updates.

    None of this would be possible without our sponsors, Rackspace, Infochimps, WolframAlpha, Factual and knowmore. To all of them, thank you!

    How to create datasets that the rest of the world needs 13 days ago. by Joseph Kelly

    We recently created a dataset for the web site that maps IP addresses to zip codes and census demographic information. The work involved is representative of the type of community we want to have involved with Infochimps in the future. The people who will find this dataset useful – web site owners, internet advertisers – are not always going to be the same people who can create such a dataset. This division of labor can only happen when experts at data gathering can share their data in a place where people who want to use the data can find it.

    Our social media expert Maegan recently interviewed Carl, a member of our data team, to talk about this dataset creation process. You can find the IP-Census data he’s talking about here: http://infochimps.org/collections/ip-address-to-us-census-data.

    M: Hi Carl, would you start by introducing yourself and telling us what you do for Infochimps?

    C: I’m a member of the data team here at Infochimps. Basically, the team in charge of gathering data that’s available on the web, cleaning it up and making it more useful for other people out there that are looking for this sort of data.

    M: I can imagine how appealing that data is to a lot of people. Speaking of useful data, I heard that you recently came up with a collection of datasets that link IP addresses to Census information. Can you tell me more about it?

    C: Well, we heard from a few people that that sort of thing might be interesting. There are a lot of people out there who want to know more about the people that come to their website. Using this dataset, they can get demographic details by using the IP address of their visitors. That way they can improve their understanding of their audience and target the content on their website better. The dataset that we have links IP addresses to zip codes, and then zip codes to all sorts of demographic data from the Census.

    M: I saw that you have so many different types of information from the Census. Where did you go to find the data to mash together?

    C: For the Census data, that’s a fairly well-known source. The US government has a Census website, Factfinder.census.gov, where you can go to download all sorts of information. As far as the IP to geolocation data, there are a lot of datasets available. We were looking for one that had good coverage of IP addresses, was available for free, and had a license that allowed us to take that data, do what we wanted with it and make it available on our site.

    M: Is this a new kind of dataset? Or is it available elsewhere?

    C: The IP to geolocation dataset is available from where we got it – at MaxMind. Linking that to the Census data is something that I don’t think we’ve seen elsewhere.

    M: How did the process work once you had the data?

    C: The Census data is divided into a lot of different geographic segments – national, state, city, county and all those sorts of things, but the IP geolocation data only uses zip codes. We wanted just the data from the Census that’s associated with the zip codes, so I had to comb through the Census data and pull out just the lines of the data that are associated with zip codes and then use that to match up to the IP addresses in the geolocation data.

    M: Is it just how they’re organized?

    C: Yeah, it’s more of how it’s organized. The Census data is organized into a few different files. You have one file that lists all the different breakdowns of how the data is divided up – like how I was saying, by state, city, zip code or the country. Each of those breakdowns was associated with this logical record number. Then, the actual Census data files have the logical record number at the beginning and then all the numbers associated with the different fields in the rest of the file. I had to pick out just all the logical record numbers that were associated with the zip codes in the first files and then pull all those out of the Census data to match it to the zip codes from the IP addresses.

    M: I would imagine that Census data would involve big files – did this make them difficult to manage?

    C: Yeah, the Census data files are really large and so it took a lot of space to load everything into memory. Then, I made a list of what data we needed from the Census data files and searched through them line by line to match zip codes to demographic information.
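As a rough illustration of the two-step filter Carl describes, here is a minimal Python sketch. The file layouts, the `zip` level label, and the function names are simplified assumptions made for this example; the real Census files are laid out differently. It collects the logical record numbers of the zip-code rows from the geography file, then streams the data file one line at a time and keeps only the matching rows, so the large file never has to sit in memory all at once.

```python
import csv
import io

# Hypothetical, simplified layouts (the real Census files differ):
#   geography file rows: logical_record_number, summary_level, area_name
#   data file rows:      logical_record_number, value_1, value_2, ...
ZIP_LEVEL = "zip"  # assumed label marking zip-code-level rows


def zip_record_numbers(geo_lines):
    """Map logical record number -> zip code, for zip-level rows only."""
    records = {}
    for recno, level, name in csv.reader(geo_lines):
        if level == ZIP_LEVEL:
            records[recno] = name
    return records


def zip_rows(data_lines, records):
    """Stream the data file line by line, yielding (zip, values) for kept rows."""
    for row in csv.reader(data_lines):
        recno, values = row[0], row[1:]
        if recno in records:
            yield records[recno], values


# Tiny made-up inputs standing in for the real files.
geo = io.StringIO(
    "0001,state,Texas\n0002,zip,78701\n0003,county,Travis\n0004,zip,78702\n"
)
data = io.StringIO("0001,100,200\n0002,10,20\n0003,50,60\n0004,30,40\n")

records = zip_record_numbers(geo)
result = list(zip_rows(data, records))
print(result)
```

Only the small record-number lookup is held in memory; the data rows are read and discarded as the scan proceeds.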

    M: That sounds like a lot of work. Did you have to do anything else to process the data?

    C: The other thing that I did was figure out the column headings to make it more useful. The way it was presented by the US Census bureau is that each column of data has a column heading that is just a code that you look up somewhere else to figure out what it actually meant. I went through and did a lot of manual editing to make the column headings more readable. Now if you just look at it, you have a better idea of what’s actually going on and it’s not just meaningless code.
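Carl did this relabeling by hand, but once a lookup table of codes exists the renaming itself is mechanical. The sketch below assumes a small table of made-up codes and labels (`C001`, `C002` and their names are invented for illustration, not real Census codes) and leaves any code it does not recognize untouched.

```python
# Hypothetical codes and labels, for illustration only; the real Census
# column codes and their meanings come from the Census documentation.
READABLE = {
    "C001": "total_population",
    "C002": "median_age",
}


def readable_header(header, labels):
    """Replace each coded heading with its label, leaving unknown codes as-is."""
    return [labels.get(code, code) for code in header]


header = ["zip", "C001", "C002", "C999"]
new_header = readable_header(header, READABLE)
print(new_header)
```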

    M: How did you find data with licenses that actually let you mash them?

    C: We were looking for specific datasets that had the licenses with certain properties that let you freely download, mash and mix up the data with other datasets, and sell it on your own site or do anything commercial with it. Of course, most of these licenses have attribution requirements, so we made sure to list all our sources in the dataset. The final dataset that we have available clearly says that this data originally came from the US Census Bureau and this MaxMind website.

    M: In the end, what licenses did you put on the dataset that you made?

    C: The license that is on there now is a very open license that lets users use the data for whatever they need. It is the Open Database License.

    M: Are there any other difficulties you faced?

    C: One of the issues that we wanted to make sure was cleared up was that the IP address data that we got was reliable and would cover a lot of IP addresses. It needed to have broad coverage of general IP addresses. We did a quick test and used the logs from our own website, took IP addresses from 6 months’ worth of page visits, and ran all those IP addresses through the IP address database. It turned out that it matched over 90% of the IP addresses that we had, and so that was a pretty good indication that the IP address dataset we had was fairly complete and had very good coverage compared to others which we heard would have only 50% coverage.
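A coverage check along these lines can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the sample ranges, the zip codes, the function names, and the choice to count each distinct address once (the interview does not say whether repeat visitors were deduplicated). The sketch also assumes the ranges do not overlap, which lets a binary search find the candidate range.

```python
import bisect
import ipaddress

# Hypothetical sample ranges: (first address, last address, zip code).
RANGES = [
    ("10.0.0.0", "10.0.0.255", "78701"),
    ("10.0.2.0", "10.0.2.255", "78702"),
]

# Convert to sorted integer ranges so lookups can use binary search.
_table = sorted(
    (int(ipaddress.ip_address(lo)), int(ipaddress.ip_address(hi)), z)
    for lo, hi, z in RANGES
)
_starts = [lo for lo, _, _ in _table]


def lookup_zip(ip):
    """Return the zip code for an IP address, or None if no range covers it."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(_starts, n) - 1  # last range starting at or before n
    if i >= 0 and n <= _table[i][1]:
        return _table[i][2]
    return None


def coverage(ips):
    """Fraction of distinct IP addresses that fall inside some range."""
    unique = set(ips)
    if not unique:
        return 0.0
    matched = sum(1 for ip in unique if lookup_zip(ip) is not None)
    return matched / len(unique)


# Made-up addresses standing in for six months of web logs.
log_ips = ["10.0.0.5", "10.0.2.9", "10.0.1.1", "10.0.0.5"]
rate = coverage(log_ips)
print(f"{rate:.0%} of distinct addresses matched")
```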

    M: Is the availability of the IP addresses a privacy concern?

    C: I don’t think it’s a privacy concern because it’s not matching it up to a specific address, but it’s matching it up to a zip code. Since zip codes have a very large number of people, it’s hard to determine if that IP address is coming from one specific person or even one specific household.

    M: Ok, thank you very much, Carl.

    Partner with us 1 month, 20 days ago. by Joseph Kelly

    2009 was a great year for us.  We made lots of progress on the website (with a long way to go), but we were especially excited for all the great contacts we made with other developers and companies interested in data.  We strongly encourage all of our followers (you!) to get in touch with us to talk about your expertise and data needs.

    We will create a page on the site soon which lists our network of data mechanics, data tools, and solution providers.  One of the issues with our site is that many of the datasets can’t be used by everybody – some are too large for Excel and average tools, and others require specialized skills in order to use.  Our TAKS and Twitter datasets are just some examples of datasets that can be really powerful for a lot of businesses only after an expert has had the chance to analyze them.

    Here are a few of the great companies who have worked with us so far:

    QVApps: QVApps is a marketplace for QlikView applications.  QlikView is a business intelligence software that supports third party applications.  Having trouble understanding your AWS reports?  QVApps has a free AWS report analyzer that we’ve found useful.

    UPDATE: Check out QVApps’ slideshow on some of the data from our Twitter Census! See below for the embedded slideshow.

    Data Applied: Data Applied’s application is truly magic.  Their software is putting the power of machine learning into the world’s hands.  Techniques and algorithms that people wrote PhD theses on 20 years ago are here at the click of a mouse.  Try them out with a free account.

    DataMiningTools.net: A startup based in India, DataMiningTools.net is doing a wonderful job working to educate the masses on data mining tools and resources.  Find tutorials on clustering analysis, R, Matlab – you name it.  Check out videos on Data Applied and your very own Infochimps!

    If you are a Data Mechanic, another data company, or just interested in being listed as a solutions provider, please get in touch with me at joe@infochimps.org.  Likewise, if you’re a Ruby/Rails developer, we’re hiring!

    Data.gov import 2 months, 3 days ago. by Joseph Kelly

    Infochimps is pleased to announce a recent import of all of the data from Data.gov!  Data.gov was one of the more exciting things to happen last year for the world community and it has had a big impact in the US and internationally by setting precedent for government data sharing.  We hope that these datasets’ inclusion in our collection increases the visibility for all these datasets and becomes useful for the world at large.

    The fact that users can edit these datasets makes them much more usable and interesting.  Unlike Data.gov, users on Infochimps can upload datasets and even upload different versions of datasets to the site.  So when a dataset comes from the government in some messy, incomprehensible format, you can do what Infochimps user Ganglion did and upload a better version.  This type of Wikipedia-style curation of datasets is where Infochimps got its name.  Because data drudge work (column titles, formatting issues, etc.) is fit for a chimp, this type of work should only be done once.  And may the result live on Infochimps!

    Take a look at the Data.gov collection to get started.

    Open Data Applications 3 months, 7 days ago. by maegan

    With President Obama’s Open Government Directive and news about Data.gov’s overhaul, more and more people have been talking about the benefits of open data. Yes, this includes greater transparency and a more accountable government, but it also gives birth to useful apps that use these newly available datasets.

    A lot of these apps have been made for competitions like Sunlight Labs’ Apps for America and various cities’ own initiatives like NYC BigApps. Understandably, they provide appealing incentives for programmers. (If not the recognition, the cash prizes are appealing.)

    All that said, these competitions have spawned very useful apps. Here are 5 that we feel are great examples of the good that can be done with government data:

    1. This We Know (www.thisweknow.org)
    This We Know is an excellent tool that provides a wealth of information sourced mainly from Data.gov. You name a place and it tells you what we know about that location – things like demographics or the number of factories in the area. It’s also presented in a very clear fashion, condensing data into an easily understandable and still useful format.

    2. StumbleSafely (www.outsideindc.com/stumblesafely)
    This app from DC literally helps you stumble safely home. It uses data on crime and geography to map out safe routes from the more (in)famous bars in the city, no matter what time you like to party – day, evening or night.

    3. NYC Way (www.nycway.com)
    An iPhone app, NYC Way provides you with a plethora of useful information for locals and tourists alike right at your fingertips. Location aware, it draws from a bunch of various datasets from the NYC.gov Data Mine and gives you facts about nearby zoos, wi-fi spots, emergency rooms, and a lot of other useful places to help you find your way in the big city.

    4. EveryBlock (www.everyblock.com)
    This one’s not yet available in Austin, but it does have versions for 15 cities across the nation. EveryBlock provides you with a newsfeed of things going on around a user-specified address or location in these cities. It also allows you to browse by topic and track trends over time.

    5. iKidNY (www.ikidny.com)
    Not all apps are useful just for adults – this iPhone app, iKidNY, helps you find kid-friendly places all over NYC. It provides you with locations and information about activities, kid-friendly restaurants, playgrounds, and even changing tables and subway elevators.

    If you want to look at more apps, these competitions’ submission galleries are worth a look:
    Apps for America 2
    Apps for Democracy
    NYC BigApps
    DataSF

    Did we miss out on your favorite app? Let us know! We’d love to check it out.

    Twitter data, open questions to Developers, Academics, and Data Geeks 3 months, 27 days ago. by Joseph Kelly

    We are excited to announce the re-release of the Twitter datasets, and a discount on the Twitter API Map dataset.  Again, the datasets are:

    and

    Conversation Metrics, with Token Count of:

    This time the data is being released with Twitter’s approval.  We are talking with them about how we can increase access to more and more bulk data, and need your help in showing them how useful this data really is.

    We want to make clear to people with privacy concerns that we absolutely hear and respect your points, and so does Twitter.  These datasets contain NO personally identifiable information, they do NOT contain whole tweets, and they meet the guidelines laid out in this EFF document (on personally id’able info).

    We encourage everybody to take advantage of this weekend’s discount and go build great things with this data.  Let’s show Twitter and the world what is possible when one has access to bulk data:

    • Data geeks and Visualization studs: what would you do if you could run jobs across our massive crawl (or the full Twitter graph)?
    • App devs: what data do you want those nerds to extract?  How would it improve the experience of Twitter or enable new things?
    • Businesses: how can this data improve your services?  How can this data make you money?
    • Academic researchers: what amazing things will you uncover by exploring the social network’s deep structure?

    Reach out to us in the comments or send us ideas at info@infochimps.org

    The data landscape (Part 2), and Microsoft 3 months, 28 days ago. by Joseph Kelly

    The data platform industry has a new entrant this week!  Yesterday Microsoft announced a data store of their own at their developer conference.  Called Dallas, their offering is another example of a data marketplace.  The market for selling data online in an open way is still young (how many platforms besides ours and Microsoft’s do you know?) and so it is validating to see another entrant in this space.  We know that Microsoft will encourage the developer community to explore what these new platforms make possible.

    Like many other services, Dallas meters out data through an API which is helpful to programmers with limited resources.  With Infochimps, however, developers get full datasets in bulk, which is better for many applications and essential for any kind of analytic work.

    Both our marketplaces have the same value proposition: open up your data and profit.  When trying to convince an organization to open up its data, APIs can be an easier sell.  Even though they are costly to build and run, organizations may prefer the control they get over what people can access when compared to our simple and cheap bulk solution.

    It is still unclear what the size and format restrictions are on Dallas.  If they are like other services out there (Socrata, Factual), they need data that comes in a structured, rectangular format.  These constraints enable these services to display their data live online.  While Infochimps doesn’t have that feature (yet!), we can handle datasets at the terabyte scale as well as those that don’t fit the spreadsheet paradigm, such as social network graphs.

    Dallas is also part of a platform that forces users to integrate with other Microsoft services.  Infochimps’ mission is simply to connect people with the data they’re looking for, and we let anyone download data without having to register for an account.

    We are proud to be a part of a strong community that’s grown over the past year, and to continue our commitment to an open data commons.  On the commercial side, we are narrowing focus on the right verticals after months of talking with this new market about what is possible.  That ultimately is what this is about – enabling something that couldn’t be done before, and connecting buyers to sellers and people to knowledge.

    Twitter data update 4 months, 1 day ago. by Joseph Kelly

    Our launch of the Twitter data was a great success, and we thank Marshall Kirkpatrick at ReadWriteWeb and Jordan Golson at GigaOm for their coverage. The community reaction has been overwhelming and energizing. We accomplished our two main goals: crack open some issues close to our hearts and kick-start the conversation about sharing data online.

    Twitter has advanced some reasonable concerns, however, and has asked us to take the datasets down. We have temporarily disabled downloads while we discuss licensing terms. The outcome of discussions will, we hope, encourage more internet services to open up and share data in bulk. The two biggest issues this data release highlighted are third party redistribution and user privacy.

    Redistribution rights. Twitter maintains a legendarily open API:

    “Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services.
    “We encourage and permit broad re-use of Content. The Twitter API exists to enable this.” [highlighting added by us]

    However, Twitter wants to more closely control who has access to data at massive scale and to prevent its malicious use. We understand this concern — innovation is always a double-edged sword. The applications and services that can use this data to make the world a better place far outnumber those with bad intentions, however, and good people need better access to this type of data. The best solution is to apply a reasonable license to the data. We are addressing this in our talks with Twitter, and we expect to have a resolution soon.

    User privacy. What little criticism we heard from the community concerned the potential for a breach of user privacy. This is an issue with many types of internet data, and one we take seriously. We ensured that the datasets released posed no such dangers. The Token Count data contained no personally identifying information, only what the entire mass of Twitter users were discussing over time. The API ID Mapping Dataset is simply a sort of phone book for the Twitter APIs: it converts screen names to numeric IDs and reveals absolutely nothing about the corresponding user. Infochimps.org’s policy is to not host any personally identifying information of non-consenting individuals; we apply this rule to any data that goes on the site from any source.

    These are hard issues and it took a bold move to bring them into the open. It will take further sharing and discussion to establish best practices for these concerns so that Twitter and other internet services (Facebook, Amazon, etc.) can share their data to the benefit of the greater online community. Stay tuned while we agree upon appropriate licensing for open sharing of this social data.