Welcome to this EMC Ask the Expert session. On this occasion we'll answer questions on EMC Big Data Solutions such as Isilon, ViPR, and the ECS Appliance.
EMC has always been about data; storage is just the means to keep, access, protect and use it. Big Data is the latest data management challenge. That’s why EMC was so excited to be at Hadoop World and showcase our storage and data management solutions for Big Data. EMC does not just tackle storage problems, we solve data management challenges. Our experts are getting ready to take all of your questions on this topic.
Here are your Subject Matter Experts:
George Hamilton is a senior product marketing manager for EMC ViPR Global Data Services and the EMC Centera and Atmos object storage platforms. George has nearly 20 years of technology industry experience. Prior to joining EMC, he was an industry analyst and research director for Yankee Group, covering cloud computing and services, IT infrastructure, and IT management software. Connect with George on Twitter.
Ryan M. Peterson, MBA, is an internationally recognized industry expert with repeated success in the design, development, and delivery of ground-breaking, high-performance technology solutions. He is a technology thought leader and pioneer with a "can do" attitude who finds workable, technologically advanced solutions to complex issues. Ryan currently directs EMC Isilon's Solutions Architecture organization, focusing on integration with best-of-breed applications and technologies. He is also recognized as a thought leader in Big Data analytics applications such as Hadoop and enjoys discussing the future of technology and its positive impact on the world. Connect with Ryan on Twitter.
This event will take place from October 27 to November 7, 2014.
Share this event on Twitter:
>> Join our Ask the Expert: Store Everything, Analyze Everything, and Build What You Need with EMC Hadoop Storage Solutions http://bit.ly/1CYVHrg #EMCATE <<
Welcome everyone, this ATE session has begun. Our experts are now ready to answer any question you post on this thread. Enjoy!
I understand that one of the questions consistently asked at the Hadoop World booth was "Which Hadoop distribution should I choose?"
How do our Big Data experts respond to that?
Which Hadoop distribution should you use? In the case of ViPR HDFS, EMC gives you the option to choose the Hadoop distribution that best fits your needs. ViPR Services is an object-based unstructured storage engine. ViPR Services supports access to the underlying data via Object APIs such as S3, OpenStack Swift and EMC Atmos. It also provides an HDFS interface to an object bucket. ViPR presents an HDFS-compatible file system. ViPR HDFS provides a client library (ViPR-HDFS Client) that is installed on all the data nodes that run MR jobs on the customer’s Hadoop cluster. As such, the customer can use the distribution of their choice.
When a task running on a data node needs to read a file, the request goes to the ViPR-HDFS client (the customer points to viprfs:// as the data source), and the ViPR client communicates with the HDFS head on the ViPR data node. The ViPR client passes in an authN token that identifies the user to the HDFS head.
The HDFS head on the ViPR data node receives requests from the ViPR-HDFS client. It then verifies the user's identity by authenticating against the KDC, and once authN and authZ succeed, it talks to the ViPR Services engine and the controller process running on the node to fetch the requested data.
Bottom line, the goal of ViPR HDFS is to extend analytic capabilities to additional data sources, for example, a large, PB-scale archive for metadata querying, etc. But you can use your existing Hadoop distribution.
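To make the read path above concrete, here is a minimal sketch in Python. All class and function names are illustrative stand-ins, not the real ViPR or Hadoop APIs; the point is only the sequence: client presents a token, the HDFS head authenticates it against the KDC, and only then fetches the data.

```python
# Hypothetical sketch of the ViPR HDFS read path described above.
# Every name here is made up for illustration.

class KDC:
    """Stand-in key distribution center: maps tokens to user identities."""
    def __init__(self):
        self._tokens = {}

    def issue_token(self, user):
        token = f"token-for-{user}"
        self._tokens[token] = user
        return token

    def authenticate(self, token):
        return self._tokens.get(token)  # None if the token is unknown


class ViprHdfsHead:
    """Stand-in HDFS head running on a ViPR data node."""
    def __init__(self, kdc, object_store):
        self.kdc = kdc
        self.object_store = object_store  # object bucket exposed as viprfs://

    def read(self, token, path):
        user = self.kdc.authenticate(token)  # authN against the KDC
        if user is None:
            raise PermissionError("unknown token")
        return self.object_store[path]       # fetch via the ViPR Services engine


# Usage: a task on a compute node reads viprfs://bucket/file via the client library
kdc = KDC()
head = ViprHdfsHead(kdc, {"viprfs://bucket/logs.txt": b"line1\nline2"})
token = kdc.issue_token("alice")
data = head.read(token, "viprfs://bucket/logs.txt")
```

In the real system the token verification and data fetch happen on the ViPR data node, not in the client process, but the ordering of the steps is the same.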
Thanks for the opportunity. I have more than one question; hope this is not a problem.
Sorry for the multiple questions. I understand I may need to learn more about these technologies, but I find this an excellent chance to get replies from the experts. Replies may include references to external resources. Thanks again.
I will let the Experts reply to your questions, but let me recommend the latest October blog posts on the Isilon Community to you about OneFS and Hadoop, and Splunk as an alternative to Hadoop.
Isilon has multiple benefits for Big Data. Here is a list; although not exhaustive, it should give you a good idea.
Distributed NameNode operations (every one of Isilon's nodes is active, providing N-to-N redundancy and better scale than HDFS).
Erasure Coding instead of 3X replicas. OneFS doesn't run on top of another file system the way HDFS does (usually EXT3). As a result, HDFS needs to run defragmentation on EXT3 and wastes space formatting and structuring the data blocks under HDFS. This leads to a 4:1 ratio of DAS required. For example, a customer who needs 10PB of data storage will have to purchase 40PB of DAS (and all of the racks that requires) if using HDFS. If using OneFS, they will need 10PB.
Deduplication at the 8K block level. This usually increases effective usable capacity to greater than 100% of raw.
Edit capabilities (HDFS only allows Read, Write, Append, Delete)
Maximum usable cluster size of 52 PB vs. the HDFS maximum of about 25 PB.
Simultaneous Native Multi-protocol support of SMB1, 2, and SMB3-MC, NFSv3, NFSv4, HTTP, FTP, sFTP, WEBDAV, SWIFT, HDFSv1, HDFSv2.x
Simultaneous support of your favorite distribution (Cloudera, Hortonworks, Pivotal). This feature also allows for portability, so you can more easily switch from one vendor to another. I also feel most customers will need more than one distribution as the features of each start to diverge.
Multi-tenancy for Hadoop installations (Multiple Cloudera clusters) - Share the same data, set xACLs to segregate data in the same container, or use Access Zones to create a new "volume" and completely segregate data.
Block-based replication. HDFS requires a scheduled job called distcp that performs full-file copies, whereas OneFS can replicate on the fly at the 8K block level (i.e., snap-replicate).
Independent scaling of compute and storage, so you don't have to build new data centers just to house your Hadoop compute nodes.
Governance, Security, and Compliance. OneFS offers SEC 17a-4 WORM compliance, where the root user can be removed and files can be committed to an immutable state. With Auditing and Self-Encrypting Disks, you avoid the performance penalty (often 30%) of software-level encryption.
Cost: As a result of Deduplication, reduction of copies, and independent scaling, we are often significantly less expensive.
Performance: IDC did a lab validation showing that Isilon performs much better than HDFS. Some of this is a result of the 3X replicas: as you write a file to HDFS, it must write all three copies before the file is considered protected, which makes writes very slow. Isilon delivers roughly 300% better write performance as a result. For reads, Isilon breaks files up across hundreds of disks, so read performance is about 120% that of HDFS, which writes in 64MB or greater chunks. Isilon also handles small files (128K to 64MB) much better than HDFS. Finally, the Java-based NFS re-export of HDFS, which is the way to get NFS access to an HDFS cluster, performs terribly; Isilon is 2,000% faster on NFS reads as a result, and much faster on writes as well.
As I said, there are hundreds of differentiators, from features to cost to performance; this is just the short list I discuss with customers when they want to know what we have to offer.
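The 4:1 DAS ratio and the replication cost mentioned in the list can be sketched with some back-of-the-envelope arithmetic. The overhead factors below are illustrative assumptions for the sake of the math, not vendor-published figures.

```python
# Back-of-the-envelope sketch of the DAS-ratio claim above.
# The fs_overhead and parity layout are illustrative assumptions.

def hdfs_raw_needed(usable_pb, replicas=3, fs_overhead=1.33):
    # Three full copies of every block, plus extra space for the
    # underlying file system (formatting, temp space, fragmentation).
    return usable_pb * replicas * fs_overhead

def erasure_coded_raw_needed(usable_pb, data_blocks=10, parity_blocks=2):
    # Parity instead of copies: overhead is (data + parity) / data.
    return usable_pb * (data_blocks + parity_blocks) / data_blocks

print(hdfs_raw_needed(10))           # roughly 40 PB: the 4:1 ratio cited above
print(erasure_coded_raw_needed(10))  # 12 PB with an n+2-style parity layout
```

The exact numbers depend on the protection level chosen, but the gap between full replication and parity-based protection is the point.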
Please let me know if you have any questions.
Hello Ryan Peterson
Honestly, those are the answers I was looking for. You really have no idea how much you helped me with your answer. Thank you for the thorough answer, and the effort and time you took to provide this answer. Actually, I have more questions, hope you don't mind.
Sorry for the questions. I am in academia, and I am starting my research career in "File Systems". A topic I find very useful actually. Sorry for any inconvenience and thanks for your generosity.
Here is a link to the Isilon community page where you can learn all about Isilon Erasure Codes and has other Isilon documentation: https://community.emc.com/community/products/isilon
Basically speaking, Isilon doesn't try to protect the underlying disks in a system; instead it manages data protection by distributing blocks and parity blocks across the cluster. Although it's a little more complicated than this, imagine you have three Isilon nodes: two hold the data and the third holds the parity. This allows the loss of physical disks, or even entire nodes, without loss of data. You will see terminology such as n+1 or n+3:1 thrown around when talking to Isilon folks. What they are saying is that at each file, directory, or system level (the single volume), you can set the protection level differently. n+3:1, for example, tells the system to write enough extra parity for those files to simultaneously lose three disks in the cluster or a single node. n+3:2 would mean 3 disks or 2 nodes; n+1 would mean 1 disk or 1 node. We typically suggest n+2:1 as a default. Not to overly confuse anything, but technically we also have a data protection schema that allows for data mirroring, similar to the way Hadoop does replicas, but we seldom utilize it.
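The simplest version of that parity idea can be shown in a few lines. Note this is only single-parity (the n+1 style case); real OneFS protection uses Reed-Solomon erasure codes to survive multiple simultaneous failures, but the recovery principle is the same.

```python
# Minimal single-parity (n+1 style) sketch of the idea described above.
# XOR parity only survives one loss; OneFS uses Reed-Solomon codes
# for the higher n+2/n+3 protection levels.

def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"node-A-data!", b"node-B-data!"]  # data blocks on two nodes
parity = xor_blocks(data)                  # parity block on a third node

# Lose node A: rebuild its block from node B plus the parity block
recovered = xor_blocks([data[1], parity])
assert recovered == data[0]
```

Because parity is computed rather than copied, protection costs a fraction of the capacity that full replicas do, which is the contrast with Hadoop's 3X replica scheme drawn above.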
For OpenStack SWIFT, you are talking about an object API that allows access to data in the form of metadata and data wrapped together, as George has discussed in his response. Isilon is adding access to the underlying file data via SWIFT in the upcoming release slated for next week. This will allow you to use OpenStack SWIFT APIs to get to the same data you could reach using SMB, NFS, or otherwise. Think of SWIFT as using GET() and PUT() against the data.
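At the HTTP level, those GET()/PUT() calls follow the Swift URL shape /v1/{account}/{container}/{object} with an X-Auth-Token header. Here is a sketch that only builds the requests (no network call); the endpoint, account, and container names are made up for illustration.

```python
# Sketch of Swift-style object access at the HTTP level, following the
# OpenStack Swift path convention: /v1/{account}/{container}/{object}.
# Account/container/object names below are hypothetical.

def swift_put(account, container, obj, body, token):
    return {
        "method": "PUT",
        "path": f"/v1/{account}/{container}/{obj}",
        "headers": {"X-Auth-Token": token},
        "body": body,
    }

def swift_get(account, container, obj, token):
    return {
        "method": "GET",
        "path": f"/v1/{account}/{container}/{obj}",
        "headers": {"X-Auth-Token": token},
    }

put_req = swift_put("AUTH_demo", "logs", "2014-10-27.log", b"...", "tok123")
get_req = swift_get("AUTH_demo", "logs", "2014-10-27.log", "tok123")
```

The appeal of exposing file data this way is exactly what's described above: the same bytes reachable over SMB or NFS become addressable as objects through simple HTTP verbs.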
I hope that helps. Please continue the questions, they are quite good ones!
Hi Haithem. Thanks for your question. Let me see if I can address the object portion.
Regardless of whether a file is stored as an object, file, or block, it is ultimately stored as blocks of data on disk; file/NAS storage and object storage are simply abstractions above that. As far as handling large files, that is precisely what object storage is designed for. Rather than using a file system with a hierarchical structure, an object store packages a file's metadata and raw data together as a single object, stamps it with a unique identifier, and places it in a non-hierarchical bucket.

It seems as though you are referencing Content Addressed Storage (CAS). Centera is an example of CAS. With Centera, the application requests to create a new file and the app server sends the file to Centera. Centera performs the content address calculation using a proprietary hash and sends the address back to the application, whose database stores the content address for future reference. The content address is a unique digital fingerprint that guarantees content authenticity and immutability. When an application needs to access the file, it only needs to know the content address. The authorization data is not stored within the object; that is governed by the application and the user's privileges at the application layer.
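The content-addressing idea can be sketched briefly. Centera's hash algorithm is proprietary, so SHA-256 stands in here purely to show the principle: the address is derived from the content itself, so identical content yields the same address and any change to the content changes the address.

```python
# Sketch of content addressing as described above. SHA-256 is a
# stand-in for Centera's proprietary hash; the store dict is a
# stand-in for the CAS back end.
import hashlib

store = {}  # content address -> object data

def cas_write(data: bytes) -> str:
    address = hashlib.sha256(data).hexdigest()
    store[address] = data  # idempotent: identical content lands in the same slot
    return address         # the application keeps this address in its database

def cas_read(address: str) -> bytes:
    data = store[address]
    # Authenticity check: recompute the fingerprint on read
    assert hashlib.sha256(data).hexdigest() == address
    return data

addr = cas_write(b"invoice-2014.pdf contents")
assert cas_read(addr) == b"invoice-2014.pdf contents"
assert cas_write(b"invoice-2014.pdf contents") == addr  # same content, same address
```

This also shows why CAS gives deduplication for free: writing the same content twice produces the same address, so only one copy is kept.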
Other object platforms work similarly but use different methods of creating the unique object ID which is stored in an index.
As far as security, each operation is individually authenticated. So, if a user is not authenticated, they will not have permission to access a file. Again, this is done at the application and access control layer.
Access to object storage is via an API, most often a RESTful API such as Amazon S3, EMC Atmos, or OpenStack Swift.
For a more detailed explanation, here are a few resources for you: