March 10th, 2015 13:00

Ask the Expert, EMC Business Data Lake: Fully Engineered, Simple to Configure and Run at Scale

Welcome to the Ask the Expert conversation. On this occasion we will be covering the recently announced EMC Business Data Lake. Our experts from the EMC Big Data Solutions Management team are here to answer your questions around the architecture, supported configurations of Generation 1.0, and best practices.

 

Meet Your Subject Matter Experts:

 

Ted Bardasz

Sr. Director of Product Management for EMC's Big Data Solutions Team.

Before joining EMC, Ted spent 14 years at various start-ups, most recently at Digital Reef, a Big Data company delivering solutions for eDiscovery and Storage Management. His earlier start-ups were in grid/high performance computing, enterprise software, data quality, and CAD-related fields. Some of the areas Ted has experience with are:

Big Data/Unstructured Content, Distributed Computing, Strategic Planning & Growth, Bringing Products to Market, Customer Acquisition and Relationships.

 


Product Manager, Big Data Solutions - EMC

Matthew joined EMC in 2012 as the product manager for VPLEX Virtual Edition. In June of 2014 he joined the Big Data Solutions team. He has a B.S. in Computer Science and Math from Franklin Pierce University and an M.S. in Artificial Intelligence from The University of Georgia. Matthew has a strong technical background in computer science and analytics such as machine learning, evolutionary algorithms, neural networks and computer vision. He's had a lot of exposure to storage and compute virtualization in his current and previous roles.

 

This discussion takes place from March 23rd to April 6th. Get ready by bookmarking this page or signing up for e-mail notifications.

 

Share this event on Twitter or LinkedIn:

>> Ask the Expert, EMC Business Data Lake: Fully Engineered, Simple to Configure & Run at Scale http://bit.ly/1HwvbYc #EMCATE <<

 

Follow @DellEMCBigData: www.twitter.com/dellemcbigdata

Follow EMC Big Data blog: http://bigdatablog.emc.com

March 23rd, 2015 09:00

Welcome everyone, this ATE discussion has begun. Feel free to post your questions and our SMEs will be sure to answer them before the end of our event. Let's keep this discussion respectful and informative. Thanks!

5 Practitioner

 • 

274.2K Posts

March 23rd, 2015 09:00

Which applications do you regularly use for big data analytics for which you’d see high value in a validated, simplified deployment onto the Data Lake?

Please describe the use case and any pain points in ease of deployment and usage today.

5 Practitioner

 • 

274.2K Posts

March 23rd, 2015 10:00

Deduplication needs to be addressed at some point to prevent the Lake from bloating with redundant raw data entries.

That said, it needs to be done intelligently so that even though the raw data is not duplicated, relevant unique metadata surrounding that raw data is persisted to support subsequent analytics and search.

In your sample, the sensor raw data may be the same, but just as important is, for example, which sensor captured the data, when, and from where.

This is arguably similar to unstructured content in an enterprise. The content store, URL, and ACLs are sometimes as interesting as the content itself (re: data governance).

A Data Catalogue can be defined to have entries for each item in the Lake, each with its own specific metadata, but also with a many-to-one relationship, pointing to a single, raw data artifact.
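
The catalogue idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the EMC Business Data Lake: the class name `DataLakeCatalogue`, its fields, and the choice of SHA-256 as a content fingerprint are all assumptions made for the example. It shows many catalogue entries, each with its own metadata, pointing to a single stored raw artifact.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DataLakeCatalogue:
    """Hypothetical in-memory catalogue: many entries point to one raw artifact."""
    artifacts: dict = field(default_factory=dict)  # content digest -> raw bytes
    entries: list = field(default_factory=list)    # one metadata record per item

    def ingest(self, raw: bytes, sensor_id: str, captured_at: str, location: str) -> str:
        # Assumption: identical bytes mean identical raw data, so a SHA-256
        # digest serves as the artifact key.
        digest = hashlib.sha256(raw).hexdigest()
        # Store the raw payload only the first time it is seen.
        self.artifacts.setdefault(digest, raw)
        # Always persist the unique metadata, linked to the shared artifact.
        self.entries.append({
            "sensor_id": sensor_id,
            "captured_at": captured_at,
            "location": location,
            "artifact": digest,
        })
        return digest


catalogue = DataLakeCatalogue()
catalogue.ingest(b"temp=21.5", "sensor-a", "2015-03-23T10:00", "hall-1")
catalogue.ingest(b"temp=21.5", "sensor-b", "2015-03-23T10:00", "hall-2")
catalogue.ingest(b"temp=22.0", "sensor-a", "2015-03-23T10:05", "hall-1")
```

After these three calls the catalogue holds three entries but only two raw artifacts, so the "which sensor", "when", and "where" details survive even though the repeated reading is stored once.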

5 Practitioner

 • 

274.2K Posts

March 23rd, 2015 10:00

What value do you see in pre-filtering data before it is pooled in the data lake? When dealing with massive lists of repeated (but not necessarily redundant) data, such as many sensors reporting the same reading many times, can deduplication retain the uniqueness of the points in time while eliminating the repetition?

5 Practitioner

 • 

274.2K Posts

March 23rd, 2015 11:00

Thanks, Ted!  In your opinion/experience, is that deduplication effort best handled as an input "filter" at the point the data is stored to the lake, or as a periodic "skim" done on a schedule to search and optimize existing data, or possibly both?  Is there a substantial processing charge associated with either of these approaches that can be mapped to recommend one approach in certain applications and another approach in other applications?

5 Practitioner

 • 

274.2K Posts

March 23rd, 2015 11:00

Excellent suggestion. The dedupe algorithm should work equally well at ingest as when the data is at rest.

I like the "configurable approach" very much as the "knowledge of the characteristics of the data" should guide the most suitable best practice configuration.

Ted Bardasz
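
As a rough illustration of the two options discussed in this exchange, the Python sketch below applies the same fingerprint-based rule either as an input filter at ingest or as a periodic skim over data at rest. Everything here is hypothetical: the function names `ingest` and `skim`, the `dedupe_on_ingest` switch, and the list-of-dicts "lake" are assumptions for the example, not a description of any EMC product.

```python
import hashlib


def fingerprint(raw: bytes) -> str:
    """Content fingerprint; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(raw).hexdigest()


def ingest(lake: list, raw: bytes, meta: dict, dedupe_on_ingest: bool) -> None:
    """Append a record; with the filter on, a repeated payload keeps only its key."""
    key = fingerprint(raw)
    already_stored = any(r["key"] == key for r in lake)
    payload = None if (dedupe_on_ingest and already_stored) else raw
    lake.append({"key": key, "raw": payload, "meta": meta})


def skim(lake: list) -> int:
    """Scheduled pass over data at rest; returns how many payload copies were dropped."""
    seen = set()
    dropped = 0
    for record in lake:
        if record["key"] in seen:
            if record["raw"] is not None:
                record["raw"] = None  # keep metadata, drop the duplicate payload
                dropped += 1
        else:
            seen.add(record["key"])
    return dropped


readings = [
    (b"21.5C", {"sensor": "s1"}),
    (b"21.5C", {"sensor": "s2"}),
    (b"22.0C", {"sensor": "s1"}),
    (b"21.5C", {"sensor": "s3"}),
]

filtered_lake, skimmed_lake = [], []
for raw, meta in readings:
    ingest(filtered_lake, raw, meta, dedupe_on_ingest=True)
    ingest(skimmed_lake, raw, meta, dedupe_on_ingest=False)
dropped = skim(skimmed_lake)
```

Both lakes end up in the same state. The difference is where the cost lands: the filter pays a lookup on every write, while the skim defers the work to a scheduled pass, which is why knowledge of the data's characteristics would guide the configuration.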

5 Practitioner

 • 

274.2K Posts

March 26th, 2015 11:00

Have you considered virtualizing your Hadoop environment? How about in conjunction with a shared storage model? What benefits or drawbacks have you seen?

82 Posts

March 27th, 2015 03:00

Good one

17 Posts

April 1st, 2015 05:00

Hi

It would be perfect to have a section on the Business Data Lake on ECN that explains how to make the app. I saw that Cloudera was on stage as a partner, but it would be nice to see some use case examples of how they did it for an EMC Business Data Lake customer.

159 Posts

April 4th, 2015 02:00

We most commonly see sensor and statistics data from engineering companies. Seismic and nautical data, such as live sea sounds used to analyze the movement of ships and their impact on the sea, is one big lake I have seen recently, since that data is also kept to analyze the movements of foreign ships . . .

14.3K Posts

April 4th, 2015 09:00

When it comes to data protection for data lakes today, and the current trends, are we talking about:

a) data protection (and archiving) at the level of the data lake components

b) data protection (and archiving) at the level of the data lake manager

c) hybrid

d) something else


I believe the latest mantra is that we no longer talk about backup, but rather about management of copies. Still, I miss the plugin or product that enables integration with this new, modern view (not to mention legacy environments and compliance management).

97 Posts

April 5th, 2015 13:00

Is a tiered approach valid for Data Lakes? For example, Tier 1 on Isilon S or X Series, Tier 2 on NL or HD Series, and Tier 3 on LTFS (Linear Tape File System). Is there any integration or solution for this type of approach?

51 Posts

April 7th, 2015 00:00

Regarding Data Protection,

Does EMC have any specific product to use with the data lake? We have great products, but I am not sure they will fit this role.

Regards.

5 Practitioner

 • 

274.2K Posts

April 7th, 2015 09:00

Hello Petter,

This is a great suggestion.  Please follow our community Everything Big Data and we will plan on posting an example of making an app.

Everything Big Data at EMC

Thanks,

Mona Patel

April 7th, 2015 10:00

This Ask the Expert has concluded. Many thanks to those who participated with their questions, and special thanks to our SMEs who lent their time to answer them.
