Data Lakehouse

Article by Ankit Sharma, Big Data and Cloud Senior Solution Architect

Overview

In this post we look at the strengths and weaknesses of the data warehouse and the data lake, and at how the lakehouse aims to combine the best of both approaches.

Data Warehouse

Source — Databricks

A data warehouse is a central repository of information that can be analyzed to make more informed business decisions. It consolidates data from many heterogeneous sources into a single store, where the data is transformed to suit the decision support systems that feed business decision making. Modern data warehousing has evolved into a multi-platform, distributed approach to meet enterprise needs, and massively parallel processing (MPP) architectures help organizations handle large data sets.

Data warehouses are great for structured data, but in the world of IoT many enterprises have to deal with unstructured and semi-structured data of high variety, velocity and volume. Warehouses are also limited in flexibility: they are built around SQL-like tools and traditionally offer no data science or machine learning capabilities. Because they enforce schema on write, schema evolution is cumbersome, and storage cost is a further drawback. Modern cloud-based EDWs address most of these challenges, but some remain, the most obvious being the lack of support for unstructured data.
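
To make the schema-on-write point concrete, here is a minimal sketch in plain Python. The `SALES_SCHEMA` table definition, the `write_row` helper and the sample rows are all hypothetical and only illustrate the idea; a real warehouse enforces this inside its storage engine.

```python
# Hypothetical warehouse table schema: column name -> required Python type.
SALES_SCHEMA = {"order_id": int, "amount": float, "region": str}

def write_row(table, row, schema=SALES_SCHEMA):
    """Append a row only if it matches the schema exactly (schema on write)."""
    if set(row) != set(schema):
        raise ValueError(f"columns {sorted(row)} do not match schema {sorted(schema)}")
    for column, expected_type in schema.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    table.append(row)

table = []
write_row(table, {"order_id": 1, "amount": 19.99, "region": "EU"})

# A new upstream field (schema evolution) is rejected until the table is altered.
try:
    write_row(table, {"order_id": 2, "amount": 5.0, "region": "US", "channel": "web"})
    rejected = False
except ValueError:
    rejected = True
```

The second write fails even though the extra `channel` field is harmless, which is the rigidity the paragraph above describes.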

Data Lake

Source — Databricks

The idea of a data lake is essentially a dumping ground for raw data in a variety of formats, kept in a lower-cost storage environment. ‘Schema on read’ provides high agility and makes data capture easy, and support for unstructured and semi-structured data means enterprises do not have to reject any incoming data. Compatibility with data science and machine learning tools gives data scientists more power to extract insights and build models from that data.
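
As a rough illustration of schema on read, the sketch below keeps raw JSON lines untouched and applies structure only when the data is read. The device events, field names and the `read_with_schema` helper are made up for this example.

```python
import json

# Raw events land in the lake as-is; nothing is rejected at write time.
raw_lines = [
    '{"device": "t-1", "temp_c": 21.5}',
    '{"device": "t-2", "temp_c": "22.1", "firmware": "1.4"}',  # string value, extra field
    '{"device": "t-3"}',                                        # missing reading
]

def read_with_schema(lines):
    """Apply a schema while reading: pick fields, cast types, tolerate gaps."""
    for line in lines:
        record = json.loads(line)
        temp = record.get("temp_c")
        yield {
            "device": str(record["device"]),
            "temp_c": float(temp) if temp is not None else None,
        }

readings = list(read_with_schema(raw_lines))
```

All three records are captured; the cost is that every reader has to decide how to handle odd types and missing values.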

But while it stores data of great variety and volume, a data lake lacks some critical features:

  • It does not support transactions (ACID properties).
  • It lacks data quality enforcement and has poor support for BI.
  • Lack of consistency makes it hard to mix appends and reads.
  • With multiple sources dumping huge volumes of data into it, ungoverned and uncatalogued data can turn the data lake into a data swamp.
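
The consistency problem in the list above can be sketched with ordinary files. This is a simplified, single-process simulation: a temporary directory stands in for lake storage, and the file names and the `write_part` and `read_all` helpers are hypothetical.

```python
import os
import tempfile

lake_dir = tempfile.mkdtemp()  # stands in for an object-store prefix

def write_part(name, rows):
    with open(os.path.join(lake_dir, name), "w") as f:
        f.write("\n".join(rows))

def read_all():
    """A naive reader: every file currently in the directory counts as data."""
    rows = []
    for name in sorted(os.listdir(lake_dir)):
        with open(os.path.join(lake_dir, name)) as f:
            rows.extend(f.read().splitlines())
    return rows

# One logical batch of four rows, written as two files.
write_part("batch1-part0.csv", ["a,1", "b,2"])
rows_seen_mid_write = read_all()   # a concurrent reader arrives here
write_part("batch1-part1.csv", ["c,3", "d,4"])
rows_seen_after = read_all()
```

The reader that arrives mid-write sees half of the batch, with nothing to tell it that the data is incomplete.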

Although a data lake is good for data science and machine learning because it supports deep analytics, it falls short on operational use cases, where data must be structured to produce key metrics and reports. Hence, to get business intelligence, enterprises need to load data from the data lake into a data warehouse.

Enterprises still need the BI and reporting capabilities of a warehouse, so they should be able to bring data together in one place, in an open format, easily and cost-effectively.

Data Lakehouse

Source — Databricks

A data lakehouse is a hybrid concept that offers the key features of both a data lake and a data warehouse. It enables a new system design: implementing data structures and data management features similar to those of a warehouse directly on the kind of low-cost storage used for data lakes. This approach can save considerable operational cost, because it eliminates the ETL/ELT workloads that move data from the data lake to the data warehouse; the query engine now queries the data lake directly. Beyond this, a lakehouse offers several other key features:

  • Transaction Support: Support for ACID transactions, since multiple data pipelines may read and write data concurrently and must do so without compromising data integrity.
  • Schema Enforcement and Governance: A lakehouse should support schema enforcement and evolution. It should also provide robust governance and auditing mechanisms.
  • BI tools directly over source data: A lakehouse enables BI tools to work directly on the source data, which reduces the time taken from the raw zone to visualization.
  • Compute and Storage Decoupling: Compute and storage use separate clusters, so the system can scale to more concurrent users and larger data sizes.
  • Openness: It uses open, standardized storage formats and provides APIs, so a variety of tools and engines can access the data directly.
  • Support for diverse workloads: A lakehouse should support SQL and analytics alongside data science and machine learning.
  • Support for diverse data types: A lakehouse should provide access to a broad range of data, including files, video, audio, system logs, semi-structured data and text.
  • End-to-End Streaming: A lakehouse should support streaming analytics, which facilitates real-time reporting.
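
One common way to get transaction support and schema enforcement on top of plain files is a commit log kept next to the data. The sketch below is a toy illustration of that general idea, not the implementation of any particular lakehouse product; the `_log` directory layout, the `SCHEMA` definition and the `commit` and `read_table` helpers are assumptions made for the example.

```python
import json
import os
import tempfile

table_dir = tempfile.mkdtemp()          # stands in for low-cost object storage
log_dir = os.path.join(table_dir, "_log")
os.mkdir(log_dir)

SCHEMA = {"id": int, "name": str}       # hypothetical table schema

def commit(version, file_name, rows):
    """Validate rows, write a data file, then publish it through a log entry."""
    for row in rows:                    # schema enforcement before anything is written
        if set(row) != set(SCHEMA) or not all(
            isinstance(row[c], t) for c, t in SCHEMA.items()
        ):
            raise ValueError(f"row {row} violates schema")
    with open(os.path.join(table_dir, file_name), "w") as f:
        json.dump(rows, f)
    # O_EXCL makes creation fail if another writer already claimed this version.
    # A production system would also need the entry's content to appear
    # atomically; this sketch ignores that.
    log_path = os.path.join(log_dir, f"{version:08d}.json")
    fd = os.open(log_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w") as f:
        json.dump({"add": file_name}, f)

def read_table():
    """Readers trust only the log, so unpublished files stay invisible."""
    rows = []
    for entry in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, entry)) as f:
            file_name = json.load(f)["add"]
        with open(os.path.join(table_dir, file_name)) as f:
            rows.extend(json.load(f))
    return rows

commit(0, "part-0.json", [{"id": 1, "name": "a"}])

# A half-finished write: the data file exists but no log entry was published.
with open(os.path.join(table_dir, "part-1.json"), "w") as f:
    json.dump([{"id": 2, "name": "b"}], f)

visible = read_table()
```

Because readers follow the log rather than listing the directory, the half-finished `part-1.json` is never returned, and a second writer that tries to reuse version 0 fails with `FileExistsError` instead of silently overwriting the first commit.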

Final Thoughts


Abzooba is an AI and Data Company. BD&C Practice is one of the fastest growing groups in Abzooba, helping several Fortune 500 clients in their cognitive journey.
