Welcome to the Ask the Expert conversation. On this occasion we will be covering the recently announced EMC Business Data Lake. Our experts from the EMC Big Data Solutions Management team are here to answer your questions around the architecture, supported configurations of Generation 1.0, and best practices.
Meet Your Subject Matter Experts:
Sr. Director of Product Management for EMC's Big Data Solutions Team.
14 years prior to EMC at various start-ups, most recently at Digital Reef, a Big Data company delivering solutions for eDiscovery and Storage Management. Earlier start-ups were in grid/high-performance computing, enterprise software, data quality, and CAD. Some of the product areas Ted has experience in include:
Big Data/Unstructured Content, Distributed Computing, Strategic Planning & Growth, Bringing Products to Market, Customer Acquisition and Relationships.
Product Manager, Big Data Solutions - EMC
Matthew joined EMC in 2012 as the product manager for VPLEX Virtual Edition. In June of 2014 he joined the Big Data Solutions team. He has a B.S. in Computer Science and Math from Franklin Pierce University and an M.S. in Artificial Intelligence from The University of Georgia. Matthew has a strong technical background in computer science and analytics such as machine learning, evolutionary algorithms, neural networks and computer vision. He's had a lot of exposure to storage and compute virtualization in his current and previous roles.
This discussion takes place from March 23rd - April 6th. Get ready by bookmarking this page or signing up for e-mail notifications.
Share this event on Twitter or LinkedIn:
>> Ask the Expert, EMC Business Data Lake: Fully Engineered, Simple to Configure & Run at Scale http://bit.ly/1HwvbYc #EMCATE <<
Follow @DellEMCBigData: www.twitter.com/dellemcbigdata
Follow EMC Big Data blog: http://bigdatablog.emc.com
Which applications do you regularly use for big data analytics for which you’d see high value in a validated, simplified deployment onto the Data Lake?
Please describe the use case and any pain points in ease of deployment and usage today
Welcome everyone, this ATE discussion has begun. Feel free to post your questions and our SMEs will be sure to answer them before the end of our event. Let's keep this discussion respectful and informative. Thanks!
What value do you see in pre-filtering the data lake before the data is pooled? When dealing with massive lists of repeated (but not necessarily redundant) data - such as many sensors reporting the same data many times - is deduplication a possibility, to retain the uniqueness of the points in time while eliminating the repetition?
Deduplication needs to be addressed at some point to prevent bloating the Lake with redundant raw data entries.
That said, it needs to be done intelligently so that even though the raw data is not duplicated, relevant unique metadata surrounding that raw data is persisted to support subsequent analytics and search.
In your example, the sensor's raw data may be the same, but just as important is "which sensor" captured the data, "when," and from "where" - as examples.
Arguably this is similar to unstructured content in an enterprise: the content store, URL, and ACLs are sometimes as interesting as the content itself (re: data governance).
A Data Catalogue can be defined with an entry for each item in the Lake, each carrying its own specific metadata, but also with a many-to-one relationship pointing to a single raw data artifact.
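To make the catalogue idea concrete, here is a minimal sketch (hypothetical names and structures, not a Business Data Lake API): raw payloads are stored once, keyed by a content hash, while each sensor reading keeps its own metadata entry pointing at the shared artifact.

```python
import hashlib

class DataCatalogue:
    """Hypothetical sketch: many metadata entries, one raw artifact."""

    def __init__(self):
        self.artifacts = {}  # content hash -> raw payload (stored once)
        self.entries = []    # per-item metadata, many-to-one to artifacts

    def ingest(self, payload: bytes, sensor_id: str, timestamp: str, location: str):
        key = hashlib.sha256(payload).hexdigest()
        # Store the raw data only if this exact payload has not been seen before.
        self.artifacts.setdefault(key, payload)
        # Always persist the unique metadata: which sensor, when, from where.
        self.entries.append({"artifact": key, "sensor": sensor_id,
                             "time": timestamp, "location": location})

cat = DataCatalogue()
cat.ingest(b"temp=21.5C", "sensor-A", "2015-03-23T10:00Z", "rack-1")
cat.ingest(b"temp=21.5C", "sensor-B", "2015-03-23T10:00Z", "rack-2")
print(len(cat.artifacts), len(cat.entries))  # 1 raw artifact, 2 catalogue entries
```

Two identical readings from different sensors collapse to one stored artifact, yet both "which/when/where" records survive for later analytics and search.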
Thanks, Ted! In your opinion/experience, is that deduplication effort best handled as an input "filter" at the point the data is stored to the lake, or as a periodic "skim" run on a schedule to search and optimize existing data, or possibly both? Is there a substantial processing cost associated with either approach that would recommend one in certain applications and the other elsewhere?
Excellent suggestion. The dedupe algorithm should work equally well at ingest and on data at rest.
I like the "configurable approach" very much as the "knowledge of the characteristics of the data" should guide the most suitable best practice configuration.
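As a rough illustration of that configurable approach (assumed helper names, not a product API), the same content-hashing routine can run either as an inline filter at ingest or as a scheduled "skim" over data already at rest:

```python
import hashlib

def content_key(payload: bytes) -> str:
    """Key a payload by its content hash so duplicates collide."""
    return hashlib.sha256(payload).hexdigest()

def ingest_filter(store: dict, payload: bytes) -> bool:
    """Inline dedup at ingest: returns True if the payload was new."""
    key = content_key(payload)
    if key in store:
        return False
    store[key] = payload
    return True

def skim(records: list) -> dict:
    """Scheduled dedup at rest: collapse an existing batch to unique payloads."""
    store = {}
    for payload in records:
        store.setdefault(content_key(payload), payload)
    return store

store = {}
print([ingest_filter(store, p) for p in [b"a", b"b", b"a"]])  # [True, True, False]
print(len(skim([b"a", b"b", b"a"])))                          # 2
```

Which mode fits best depends, as noted above, on knowledge of the data's characteristics: inline filtering pays a small cost on every write, while a periodic skim defers that cost to a batch window.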
Have you considered virtualizing your Hadoop environment? How about in conjunction with a shared storage model? What benefits or drawbacks have you seen?
It would be great to have a section on the Business Data Lake on ECN that explains how to build the app. I saw that Cloudera was on stage as a partner, but it would be nice to see some use case examples of how they did it for an EMC Business Data Lake customer.