<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Hacking through the Amazon with a shiny new MachetEC2</title>
	<atom:link href="http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/</link>
	<description>Organizing huge information resources</description>
	<lastBuildDate>Fri, 12 Mar 2010 23:07:33 -0600</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: M. Edward (Ed) Borasky</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-143</link>
		<dc:creator>M. Edward (Ed) Borasky</dc:creator>
		<pubDate>Sat, 14 Mar 2009 03:23:25 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-143</guid>
		<description>My *hard* requirements for data analysis / visualization software:

1. 64-bit GNU/Linux, 2.6.25 kernel or later. I don&#039;t much care which distro, although I personally use openSUSE 11.1. Ubuntu Intrepid Ibex or later, or Fedora 9 or later, will also work. Hardy Heron&#039;s kernel is too old. If Gentoo ever gets their release engineering act together, I&#039;d consider it. CentOS / RHEL 5 kernel is too old. Debian Lenny might work, but I&#039;ve never used it and Ubuntu seems to be more tuned to a workstation than Debian, which is mostly a server distro.

2. R 2.8.1-patched or later. This has access to all the repositories; older versions don&#039;t know about RForge. 2.9.0 is almost ready for release; I&#039;ll be alpha testing it daily.

3. *All* of the CRAN task views! That includes the dependencies. For example, if you have Rgraphviz, you need graphviz and graphviz-devel (on openSUSE).

4. GGobi 2.1.8 or later.

5. PostgreSQL 8.3.5 -- sorry, dolphins. You can have MySQL but I won&#039;t use it. pgadmin3 is also required.

6. LyX 1.6.1 or later. This integrates with R for publication-quality graphics, &quot;literate programming&quot; and &quot;reproducible research&quot;.

That&#039;s what I&#039;ve got on my workstation, and that&#039;s what I&#039;d expect as a minimum for software.

I can give you openSUSE 11.1 build scripts; the standard openSUSE 11.1 and even 11.2 (&quot;Factory&quot;) packages are behind the state of the art for R, GGobi and LyX, so I build them from upstream source. I&#039;ve got most of this working on two different machines, and the scripts are up on GitHub, but I don&#039;t have an Amazon AMI to test with yet. I could probably build an image here locally with the tools openSUSE has (Kiwi, Xen).

Bonus points if there&#039;s a way to build ATLAS (Automatically Tuned Linear Algebra Software) on a virtual machine. :)</description>
		<content:encoded><![CDATA[<p>My <strong>hard</strong> requirements for data analysis / visualization software:</p>

<p>1. 64-bit GNU/Linux, 2.6.25 kernel or later. I don&#8217;t much care which distro, although I personally use openSUSE 11.1. Ubuntu Intrepid Ibex or later, or Fedora 9 or later, will also work. Hardy Heron&#8217;s kernel is too old. If Gentoo ever gets their release engineering act together, I&#8217;d consider it. CentOS / <span class="caps">RHEL</span> 5 kernel is too old. Debian Lenny might work, but I&#8217;ve never used it and Ubuntu seems to be more tuned to a workstation than Debian, which is mostly a server distro.</p>

<p>2. R 2.8.1-patched or later. This has access to all the repositories; older versions don&#8217;t know about RForge. 2.9.0 is almost ready for release; I&#8217;ll be alpha testing it daily.</p>

<p>3. <strong>All</strong> of the <span class="caps">CRAN </span>task views! That includes the dependencies. For example, if you have Rgraphviz, you need graphviz and graphviz-devel (on openSUSE).</p>

<p>4. GGobi 2.1.8 or later.</p>

<p>5. PostgreSQL 8.3.5 &#8212; sorry, dolphins. You can have MySQL but I won&#8217;t use it. pgadmin3 is also required.</p>

<p>6. LyX 1.6.1 or later. This integrates with R for publication-quality graphics, &#8220;literate programming&#8221; and &#8220;reproducible research&#8221;.</p>

<p>That&#8217;s what I&#8217;ve got on my workstation, and that&#8217;s what I&#8217;d expect as a minimum for software.</p>

<p>I can give you openSUSE 11.1 build scripts; the standard openSUSE 11.1 and even 11.2 (&#8220;Factory&#8221;) packages are behind the state of the art for R, GGobi and LyX, so I build them from upstream source. I&#8217;ve got most of this working on two different machines, and the scripts are up on GitHub, but I don&#8217;t have an Amazon <span class="caps">AMI </span>to test with yet. I could probably build an image here locally with the tools openSUSE has (Kiwi, Xen).</p>

<p>Bonus points if there&#8217;s a way to build <span class="caps">ATLAS</span> (Automatically Tuned Linear Algebra Software) on a virtual machine. :)</p>]]></content:encoded>
	</item>
	<item>
		<title>By: infochimps Amazon Machine Image for data analysis and viz</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-145</link>
		<dc:creator>infochimps Amazon Machine Image for data analysis and viz</dc:creator>
		<pubDate>Sat, 14 Feb 2009 14:09:37 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-145</guid>
		<description>[...] initial announcement, Hacking through the Amazon with a shiny new MachetEC2, says  &#8220;MachetEC2 is an effort by a group of Infochimps to create an AMI for data processing, [...]</description>
		<content:encoded><![CDATA[<p>[...] initial announcement, Hacking through the Amazon with a shiny new MachetEC2, says  &#8220;MachetEC2 is an effort by a group of Infochimps to create an <span class="caps">AMI </span>for data processing, [...]</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Start hacking: machetEC2 released! &#171; blog.infochimps.org - Organizing Huge Information Sources</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-144</link>
		<dc:creator>Start hacking: machetEC2 released! &#171; blog.infochimps.org - Organizing Huge Information Sources</dc:creator>
		<pubDate>Sat, 07 Feb 2009 01:04:57 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-144</guid>
		<description>[...] missing and what needs to be improved. We&#8217;ve incorporated many of the suggestions from our RFC post, but not all &#8212; either for reasons of time or (disk) space &#8212; have made it in to this [...]</description>
		<content:encoded><![CDATA[<p>[...] missing and what needs to be improved. We&#8217;ve incorporated many of the suggestions from our <span class="caps">RFC </span>post, but not all &#8212; either for reasons of time or (disk) space &#8212; have made it in to this [...]</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Stephen</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-142</link>
		<dc:creator>Stephen</dc:creator>
		<pubDate>Fri, 06 Feb 2009 01:51:38 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-142</guid>
		<description>It would be nice if it had Condor or the Globus Toolkit, and also some kind of script to make it easy to add these nodes to a Condor cluster.</description>
		<content:encoded><![CDATA[<p>It would be nice if it had Condor or the Globus Toolkit, and also some kind of script to make it easy to add these nodes to a Condor cluster.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Philip (flip) Kromer</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-140</link>
		<dc:creator>Philip (flip) Kromer</dc:creator>
		<pubDate>Fri, 30 Jan 2009 18:54:05 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-140</guid>
		<description>Ooh I also like that Dhruv -- the .icss (Infochimps Stupid Schema -- all the data types, notes, links, etc metadata off the infochimps page) files can be live updated at one&#039;s preference from there.</description>
		<content:encoded><![CDATA[<p>Ooh I also like that Dhruv &#8212; the .icss (Infochimps Stupid Schema &#8212; all the data types, notes, links, etc metadata off the infochimps page) files can be live updated at one&#8217;s preference from there.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Philip (flip) Kromer</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-139</link>
		<dc:creator>Philip (flip) Kromer</dc:creator>
		<pubDate>Fri, 30 Jan 2009 18:51:06 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-139</guid>
		<description>Cool idea on the spell-checking lists, Neal.

We can put in volume images that have &quot;smaller&quot; but related datasets -- so, the BSD etc/words, the panoply of spellchecking programs, the BNC corpus word frequency list, the gutenberg.org word lists and dictionaries.

NLTK&#039;s corpora will probably live on their own.</description>
		<content:encoded><![CDATA[<p>Cool idea on the spell-checking lists, Neal.</p>

<p>We can put in volume images that have &#8220;smaller&#8221; but related datasets &#8212; so, the <span class="caps">BSD </span>etc/words, the panoply of spellchecking programs, the <span class="caps">BNC </span>corpus word frequency list, the gutenberg.org word lists and dictionaries.</p>

<p><span class="caps">NLTK&#8217;</span>s corpora will probably live on their own.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Neal Richter</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-138</link>
		<dc:creator>Neal Richter</dc:creator>
		<pubDate>Fri, 30 Jan 2009 07:19:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-138</guid>
		<description>Software:  Solr, Lucene, Mahout, memcache, memcachedb, memcacheq

Packages:  Every spell checking package you can put in there (for the word lists!)

Data: Wikipedia categories dump, freebase, dmoz category dump.  NIST text data if you can get it.</description>
		<content:encoded><![CDATA[<p>Software:  Solr, Lucene, Mahout, memcache, memcachedb, memcacheq</p>

<p>Packages:  Every spell checking package you can put in there (for the word lists!)</p>

<p>Data: Wikipedia categories dump, freebase, dmoz category dump.  <span class="caps">NIST </span>text data if you can get it.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: dhruvbansal</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-137</link>
		<dc:creator>dhruvbansal</dc:creator>
		<pubDate>Thu, 29 Jan 2009 20:06:16 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-137</guid>
		<description>We&#039;ll post the build scripts we used for the image when we make the image public.

@Pete, your list of Python packages was indispensable!  And thanks for the info about startup scripts -- looks like a good way to manage fast-changing resources (like the Infinite Monkeywrench...).

Check http://machetec2.org for updates (right now it just points at this blog entry but soon will be its own wiki).</description>
		<content:encoded><![CDATA[<p>We&#8217;ll post the build scripts we used for the image when we make the image public.</p>

<p>@Pete, your list of Python packages was indispensable!  And thanks for the info about startup scripts &#8212; looks like a good way to manage fast-changing resources (like the Infinite Monkeywrench&#8230;).</p>

<p>Check <a href="http://machetec2.org" rel="nofollow">http://machetec2.org</a> for updates (right now it just points at this blog entry but soon will be its own wiki).</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Pete Skomoroch</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-136</link>
		<dc:creator>Pete Skomoroch</dc:creator>
		<pubDate>Thu, 29 Jan 2009 15:04:26 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-136</guid>
		<description>Unless you already have this built in, you might want to consider a startup boot script to optionally install updated versions of libraries and pull code from an infochimps repo.

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1403

If the list of installs grows fairly large and you are building things from source on new versions of the AMI, you can hit issues with unavailable repos or mirrors.  A good strategy there is to build those non-apt packages from an infochimps src repo just in case they move in the future.

I ended up using puppet for a lot of this package management stuff at work, but that might be a pain for public AMIs.  A simple AMI installation bash script often does the job.</description>
		<content:encoded><![CDATA[<p>Unless you already have this built in, you might want to consider a startup boot script to optionally install updated versions of libraries and pull code from an infochimps repo.</p>

<p><a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1403" rel="nofollow">http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1403</a></p>

<p>If the list of installs grows fairly large and you are building things from source on new versions of the AMI, you can hit issues with unavailable repos or mirrors.  A good strategy there is to build those non-apt packages from an infochimps src repo just in case they move in the future.</p>

<p>I ended up using puppet for a lot of this package management stuff at work, but that might be a pain for public <span class="caps">AMI</span>s.  A simple AMI installation bash script often does the job.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Pete Skomoroch</title>
		<link>http://blog.infochimps.org/2009/01/28/hacking-through-the-amazon-with-a-shiny-new-machetec2/comment-page-1/#comment-135</link>
		<dc:creator>Pete Skomoroch</dc:creator>
		<pubDate>Thu, 29 Jan 2009 14:41:07 +0000</pubDate>
		<guid isPermaLink="false">http://blog.infochimps.org/?p=129#comment-135</guid>
		<description>Hope that package list helps... I released a much more python-focused AMI last year based on that list that allows you to fire up an MPI cluster in a similar fashion to the hadoop cluster bash scripts.  That is useful for running existing MPI code like the parallel boost graph library, etc.

http://code.google.com/p/elasticwulf/

My image is way out of date at this point, and doesn&#039;t do things like automatically load EBS volumes. I&#039;ve been planning on updating it, but maybe I&#039;ll hijack the infochimps image if you publish build scripts.

One thing I would suggest is to make sure that you have a 64-bit version of the AMI, and that you use the corresponding 64-bit version of Python before building numpy/scipy - otherwise you will hit a 2 GB memory limit when processing data with scipy.</description>
		<content:encoded><![CDATA[<p>Hope that package list helps&#8230; I released a much more python-focused <span class="caps">AMI </span>last year based on that list that allows you to fire up an <span class="caps">MPI </span>cluster in a similar fashion to the hadoop cluster bash scripts.  That is useful for running existing <span class="caps">MPI </span>code like the parallel boost graph library, etc.</p>

<p><a href="http://code.google.com/p/elasticwulf/" rel="nofollow">http://code.google.com/p/elasticwulf/</a></p>

<p>My image is way out of date at this point, and doesn&#8217;t do things like automatically load <span class="caps">EBS </span>volumes. I&#8217;ve been planning on updating it, but maybe I&#8217;ll hijack the infochimps image if you publish build scripts.</p>

<p>One thing I would suggest is to make sure that you have a 64-bit version of the AMI, and that you use the corresponding 64-bit version of Python before building numpy/scipy &#8211; otherwise you will hit a 2 GB memory limit when processing data with scipy.</p>]]></content:encoded>
	</item>
</channel>
</rss>
