Re: ClusterTalk Podcast Episode One
What a great first episode! I look forward to enjoying more of this podcast in the future.
In the spirit of public documentation and transparency, and while not wanting to criticize the work of the podcast hosts, I thought I'd clarify a couple of points.
There are two major factors that make a traditional Hadoop storage environment inefficient: storing your data in a separate environment just for Hadoop processing, and the fact that traditional HDFS keeps 3 copies of the data in different locations for redundancy/protection. Those extra copies result in a 200% storage overhead in HDFS, compared to 20% or less on OneFS. On top of that, with OneFS you can maintain a single storage pool (data lake) of all of your data for consumption by legacy systems and simply add Hadoop access via the HDFS protocol.
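To make the overhead arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration, the 20% figure is the upper bound quoted above, and applying both percentages to a flat 8 PB is only an example, not a statement about how our clusters are actually sized.

```python
def replication_overhead_pct(copies: int) -> float:
    """Overhead of keeping N full copies: every copy beyond the first is pure overhead."""
    return (copies - 1) * 100.0


def raw_capacity_tb(usable_tb: float, overhead_pct: float) -> float:
    """Raw capacity needed to hold usable_tb of data at a given protection overhead."""
    return usable_tb * (1 + overhead_pct / 100.0)


# Illustrative input only: 8 PB of data expressed as 8000 TB (decimal units).
usable_tb = 8000.0

hdfs_overhead = replication_overhead_pct(3)   # 3 copies, as in traditional HDFS
onefs_overhead = 20.0                         # upper bound quoted in this post

hdfs_raw_tb = raw_capacity_tb(usable_tb, hdfs_overhead)
onefs_raw_tb = raw_capacity_tb(usable_tb, onefs_overhead)

print(f"HDFS, 3 copies: {hdfs_overhead:.0f}% overhead, {hdfs_raw_tb:.0f} TB raw")
print(f"OneFS, <=20%:   {onefs_overhead:.0f}% overhead, {onefs_raw_tb:.0f} TB raw or less")
```

In other words, under these assumptions 8 PB of data needs roughly 24 PB of raw disk with three-way replication versus about 9.6 PB or less at a 20% overhead, and that is before counting the second copy you would otherwise keep outside Hadoop.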
In our testing we did not run, nor do we ever intend to run, a single Hadoop query across all 8+ PB of unstructured data on Isilon. As was correctly pointed out, the 8+ PB data lake is made up of data collected from and on behalf of our entire enterprise customer base, and the objective is to be able to run a Hadoop query for any of these individual customers without first relocating the data to a traditional Hadoop infrastructure. The largest query we ran in our testing and reported in the white paper was 65 TB. So the goal is to be able to query any part of the 8+ PB at any time, but not necessarily all of it in one go.
*Source: I'm the guy from Adobe that did the work (with the help of so many others at Adobe and the EMC Federation) and helped create the white paper.