Welcome to the premiere episode of the EMC Isilon ClusterTalk podcast. In this pilot episode, Scott Pinzon and Chris Adiletta discuss innovations in Big Data, such as putting every car on the Internet and a body cam on every cop. Also: Cool Commands for OneFS users; a cutting-edge Hadoop solution from Adobe; insights from the large, active EMC Isilon user community; and happy talk about Exploding Kittens.
This episode is the first in a planned monthly podcast series. If you like what you hear, please rate or comment on the discussion here, find us and subscribe in iTunes or your favorite podcast app, and send us your thoughts. We hope to bring you a great, informative listening experience, and with your feedback we can continue to improve the content.
Leave your thoughts below, comment on the podcast posting in iTunes, or email us directly at ClusterTalk@emc.com.
What a great first episode, I look forward to enjoying more of this podcast in the future!
In the spirit of public documentation and transparency, and without wanting to criticize the work of the podcast hosts, I thought I'd clarify a couple of points.
There are two major factors that make a traditional Hadoop storage environment inefficient: you have to copy your data into a separate environment just for Hadoop processing, and traditional HDFS stores three copies of each block for redundancy/protection. Triple replication results in a 200% storage overhead in HDFS, compared to roughly 20% or less on OneFS. On top of that, OneFS lets you maintain a single storage pool (a data lake) of all of your data for consumption by legacy systems, and simply add Hadoop access via the HDFS protocol.
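To make the overhead comparison concrete, here is a back-of-the-envelope sketch. The 200% figure follows directly from HDFS's default three-way replication; the ~20% figure for OneFS is the illustrative number quoted above, and the actual overhead depends on the protection level configured on the cluster.

```python
# Rough comparison of raw capacity needed for the same usable data:
# HDFS default triple replication vs. OneFS-style erasure-coded protection.

def raw_capacity_needed(usable_tb, overhead_fraction):
    """Raw disk space required to store `usable_tb` of usable data."""
    return usable_tb * (1 + overhead_fraction)

usable = 100          # TB of actual data (illustrative)

hdfs_overhead = 2.0   # 3 total copies => 2 extra copies => 200% overhead
onefs_overhead = 0.2  # ~20% or less, per the figure quoted above

print(raw_capacity_needed(usable, hdfs_overhead))   # 300.0 TB raw for HDFS
print(raw_capacity_needed(usable, onefs_overhead))  # 120.0 TB raw for OneFS
```

In other words, for every 100 TB of data, triple-replicated HDFS consumes about 300 TB of raw disk, while the same data on OneFS would consume around 120 TB, and that single pool remains accessible to non-Hadoop workloads.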
In our testing we did not run, nor do we ever intend to run, a query of all 8 PB of unstructured data on Isilon in a single Hadoop job. As was correctly pointed out, the 8+ PB data lake is made up of data collected from and on behalf of our entire enterprise customer base, and the objective is to be able to run a Hadoop query for any individual customer without first relocating the data to a traditional Hadoop infrastructure. The largest query we ran in our testing and reported in the white paper covered 65 TB. So the goal is to be able to query any of the 8 PB at any time, but not necessarily all of it in one go.
*Source: I'm the guy from Adobe that did the work (with the help of so many others at Adobe and the EMC Federation) and helped create the white paper.
Thanks for the comment and for being a part of our first episode! We hope you are still listening! In episode 2 we went back to revisit your comments on our first story. We depend on people like you to keep us straight!