The Business Data Lake from a Data Scientist Perspective

The Business Data Lake (BDL) is positioned as the one-stop shop for all of the organization’s (big) data storage and analytics requirements. It is intended to address the three V’s of Big Data analytics – Volume, Variety and Velocity – by providing vast storage and by ingesting data as streams, mini-batches and batches, whether structured, semi-structured or unstructured. It fundamentally shifts the paradigm in business data storage and analytics by consolidating the multiple silos of data found in organizations today.

To a data scientist, all of this sounds fantastic. It gives us access to an environment that stores all of the data, in all shapes and forms, on a centrally managed, distributed, computationally capable and coherent platform.

We expect this new environment to eliminate a number of bottlenecks we had to overcome in all of our past projects:

  • Gaining access to the data – this step seems simple but can be the biggest blocker in many projects.
  • Getting the data to the right compute (or vice versa) – is it already in a Big Data platform? Do we have the permission and resources to run our heavy-duty data science jobs there? Can we easily integrate tools and technologies as we experiment on the fly?
  • Assimilating the model into the business process (productization) – once we have an awesome machine-learning model, how soon can we provide it to the business?

The data-driven enterprise is definitely marching in a promising direction with the business data lake being used as a centralized tool to obtain insights required for business critical decisions. Sounds great, right?

Mostly right, but there are some caveats and challenges that require our attention along the way. Going through the data science project lifecycle, I will try to highlight a few of these.

A data science engagement starts with stating the business problem to be resolved. It consists mostly of meetings with the customer or business unit, trying to understand the challenges they are facing (at least those that can be addressed using a data-driven approach). Then, after some long discussions, we apply a methodical approach to prioritize these challenges. A few example criteria we can use for this task include:

  • Return on investment
  • Implementation complexity. This is the estimated difficulty of achieving a viable solution. Applying linear regression to predict next quarter’s support requirements is not of the same complexity as developing a content-based optimization algorithm for storage devices, even though the latter would probably yield higher revenue for a company.
  • The customer’s readiness for addressing the challenge. Are there data and domain Subject Matter Experts (SMEs) available to guide the technical development of the solution? Is there executive-level support? Is the data available?
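
One way to make such a prioritization concrete is a simple weighted score. The sketch below is illustrative only: the 1 to 5 scales, the weights and the scores given to the two example challenges are assumptions made up for this example, not figures from a real engagement.

```python
from dataclasses import dataclass

# Hypothetical weights, assumed for illustration; tune them with the business.
WEIGHTS = {"roi": 0.5, "ease": 0.2, "readiness": 0.3}


@dataclass
class Challenge:
    name: str
    roi: int         # 1 (low return) .. 5 (high return)
    complexity: int  # 1 (easy) .. 5 (hard)
    readiness: int   # 1 (not ready) .. 5 (SMEs, sponsorship and data in place)


def priority_score(c: Challenge) -> float:
    # Complexity counts against a challenge, so invert it on the 1..5 scale.
    ease = 6 - c.complexity
    return (WEIGHTS["roi"] * c.roi
            + WEIGHTS["ease"] * ease
            + WEIGHTS["readiness"] * c.readiness)


def prioritize(challenges):
    # Highest score first.
    return sorted(challenges, key=priority_score, reverse=True)


# Made-up scores for the two examples mentioned in the criteria above.
backlog = [
    Challenge("Content-based optimization for storage devices",
              roi=5, complexity=5, readiness=2),
    Challenge("Forecast next quarter's support requirements",
              roi=3, complexity=1, readiness=5),
]

for c in prioritize(backlog):
    print(f"{priority_score(c):.2f}  {c.name}")
```

With these assumed numbers the simpler forecasting problem ranks first despite its lower return, because it is easy to implement and the customer is ready for it. Different weights would tell a different story, which is exactly the discussion to have with the business.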

We intentionally leave the data availability question to last, as this is where the BDL can come in handy. Answering the question of data availability is very challenging in this very initial stage, especially for problems involving multiple data sources and several SMEs. Having a single, easily accessible environment that hosts all of the data can dramatically reduce the time required to answer the data readiness question (at least at a high level – we’ll address this issue more in depth shortly) compared to answering that question when operating in a siloed environment.

Once the challenges are mapped and prioritized, we conduct some exploratory data analysis. This stage is where the data scientist starts playing around with the data (and the DBA or operations team starts getting anxious) to identify all of the relevant data sources, explore the content and quality of the data and generally try to understand if the business problem can be effectively solved using the data at hand. Here, the BDL can either be an extremely effective tool or the entire project’s downfall.

Beware the Data Swamp

A single repository for all business data, without proper mechanisms to assure data quality and maintain a coherent business glossary and metadata, risks turning into a data swamp, according to Gartner. And in a swamp, it is difficult to see and evaluate the availability of the informative data required to solve the business problem.

On the other hand, having these mechanisms in place can provide a clear view into the available sources of information. Data set descriptions, data dictionaries, business glossary specifications, the business processes reflected in the data, sample queries and examples of reports generated from the data all help us assess the efficacy of a possible data-driven solution, or take us quickly back to the prioritized list of business problems to pick another one.
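
As a rough illustration of how a data dictionary speeds up this assessment, the sketch below checks how completely each documented field is populated in a sample of records. The field names, descriptions and sample records are hypothetical, and a real check would cover far more than fill rates.

```python
def profile(records, data_dictionary):
    """Return, per documented field, the share of records with a usable value."""
    total = len(records)
    report = {}
    for field in data_dictionary:
        # Treat missing keys, None and empty strings as "not filled".
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


# Hypothetical data dictionary: field name -> business description.
data_dictionary = {
    "customer_id": "Unique customer identifier",
    "region": "Sales region of the customer",
    "ticket_count": "Support tickets opened in the last quarter",
}

# Hypothetical sample pulled from the lake.
sample = [
    {"customer_id": 1, "region": "EMEA", "ticket_count": 4},
    {"customer_id": 2, "region": "", "ticket_count": None},
    {"customer_id": 3, "region": "APJ"},
]

for field, rate in profile(sample, data_dictionary).items():
    print(f"{field}: {rate:.0%} filled")
```

A field that is documented but mostly empty is an early signal that the business problem depending on it may have to go back to the queue.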

Some call this process “fail fast.” But are we failing if we can efficiently establish that the current data sets cannot support an effective data-driven solution? I think not. During the exploratory phase we might identify that additional data is required to develop an effective solution. We then add the problem to a queue of high-value business problems and encourage the business to start collecting the data needed to answer them, thus promoting data science and advanced analytics development.

Once we do identify a business problem that can be solved using available data, we begin data preparation. This stage gets the data ready for the data scientist’s preferred activity – modeling. Unfortunately, we spend too little time on modeling itself. Instead, we spend our time identifying, collecting, consolidating, combining, cleaning and aggregating data as part of the data preparation stage and, in fact, it is usually the most time-consuming phase of the entire engagement. This is where, done right, the business data lake can really shine, being a single environment in which to conduct all of these data activities: overcoming silos of data, combining structured with unstructured data and streaming with batch data, all with a high level of flexibility and modularity.

We are very conscious of what happens in the corporate IT environment: once data is loaded into the BDL and declared business critical, a wall of regulations goes up around it. No one can smell, feel or touch the BDL environment, especially those havoc-raising data scientists with their “useless” data exploration workloads.

Data preparation is a very intensive yet extremely important phase: no good models can be developed using bad data. Statistical and machine learning models are only as good as the data we use to fit them. Hence, it is absolutely essential that the BDL accommodate these workloads, preferably in a sandbox environment that has a view of all of the data in the BDL.
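
To give a feel for the consolidating, cleaning and aggregating work described above, here is a minimal sketch that combines a batch table of customers with a feed of support events. The record layouts and field names are hypothetical, and plain Python stands in for whatever engine the lake actually provides.

```python
from collections import defaultdict


def prepare(customers, events):
    """Combine batch customer records with event records, one row per customer."""
    totals = defaultdict(float)
    for e in events:
        cid = e.get("customer_id")
        try:
            minutes = float(e.get("minutes"))
        except (TypeError, ValueError):
            continue  # Clean: skip events with a missing or unparseable duration.
        if cid is None:
            continue  # Clean: skip events that cannot be tied to a customer.
        totals[cid] += minutes  # Aggregate: total support minutes per customer.
    # Consolidate: attach the aggregate to every customer, defaulting to zero.
    return [
        {**c, "support_minutes": totals.get(c["customer_id"], 0.0)}
        for c in customers
    ]


# Hypothetical inputs: a batch extract and a handful of streamed events.
customers = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APJ"},
]
events = [
    {"customer_id": 1, "minutes": "12.5"},
    {"customer_id": 1, "minutes": 7},
    {"customer_id": 2, "minutes": "n/a"},
    {"minutes": 3},
]

for row in prepare(customers, events):
    print(row)
```

Even in this toy case half of the events are discarded or repaired before anything resembling modeling can start, which is why this phase deserves a sandbox of its own.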

A Platform for Collaboration

Next, in the modeling phase, we usually create key performance indicators (KPIs) or features from the data that we then use to build and evaluate models. The BDL must be flexible and powerful enough to support the modeling process. Modeling is extremely computationally intensive; in most cases it involves iterating over model fitting and evaluation until we obtain the required results. Doing that on high-volume, multi-format data is challenging.
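
The fit-and-evaluate loop can be sketched in a few lines. The example below tries one candidate feature at a time, fits a one-variable least-squares line on training data, scores it on held-out data and stops once the error is good enough. The feature names, the toy numbers and the error threshold are all assumptions for illustration; real projects iterate over far richer models and data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    a, b = model
    return mean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def select_feature(train, holdout, target, candidates, good_enough):
    """Fit and evaluate one candidate at a time; stop early when good enough."""
    best = None
    for name in candidates:
        model = fit_line(train[name], train[target])
        err = mse(model, holdout[name], holdout[target])
        if best is None or err < best[1]:
            best = (name, err)
        if err <= good_enough:
            break
    return best


# Toy, made-up data: predict support hours from one of two candidate features.
train = {
    "office_floor": [3, 1, 4, 1],
    "open_tickets": [1, 2, 3, 4],
    "support_hours": [2.1, 3.9, 6.2, 7.8],
}
holdout = {
    "office_floor": [2, 5],
    "open_tickets": [5, 6],
    "support_hours": [10.1, 11.9],
}

name, err = select_feature(train, holdout, "support_hours",
                           ["office_floor", "open_tickets"], good_enough=0.1)
print(f"best feature: {name}, holdout MSE: {err:.4f}")
```

Each pass through the loop is a full fit and evaluation, so the cost grows with every candidate feature, model family and parameter setting, which is why the platform's compute matters as much as its storage.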

Done right, the BDL can also serve as a platform for collaboration between data scientists, data engineers and architects, who share data, KPIs, models and code building blocks that can be used to add sophistication and accuracy to the entire modeling process.

The last stage before going back to business problem definition and refinement in this ongoing cycle focuses on insights and deliverables. This stage mostly involves presenting results and delivering the products of the analysis. Presenting the results might not change dramatically with the availability of the BDL, as the quality of PowerPoint presentations is probably not affected by it. Implementing the solution in the production environment, on the other hand, does change.

If a BDL is designed such that the development or sandbox environment is very similar to the production environment and the data is being shared across the development and production environments, migrating the code from development to production is like flipping a switch and should be easily accomplished.

To summarize, the BDL, done right, will definitely benefit data science and data science products in the organization. While we build these platforms and others, we should focus on some key guiding principles, the most important of which is collaboration. We should make sure that any platform being built can host domain experts, data engineers, business intelligence experts and data scientists, and that it promotes collaboration while letting each focus on their own area of expertise.

For more about data lakes, read Creating New Business Value around the Business Data Lake.

About the Author: Oshry Ben-Harush