Designing an Agile Data Lake

Article by Gautam Kumar, Big Data & Cloud Practice Lead

Purpose

Data Lake

Introduction

Difference between Data Lake and Data Warehouse

Functional Components of a Data Lake

  • Ingestion service: Ingests data from varied sources.
  • Storage service: Provides cheap storage for all forms of data.
  • Cataloging service: Creates structure on top of the stored data.
  • Data Processing service: Processes the data as per the use case.
  • Security service: Implements RBAC, encryption, and other compliance requirements.
  • Data Access service: Service layer that exposes data to consumers.
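The services above can be sketched as one minimal, in-memory pipeline. This is only an illustration of how the components relate (all class, method, and role names here are hypothetical), not a real implementation of any of the services:

```python
from dataclasses import dataclass, field

@dataclass
class DataLake:
    storage: dict = field(default_factory=dict)  # Storage service: raw objects by path
    catalog: dict = field(default_factory=dict)  # Cataloging service: schema per dataset
    acl: dict = field(default_factory=dict)      # Security service: role -> readable datasets

    def ingest(self, source: str, records: list) -> None:
        # Ingestion service: land raw data as-is under a source-specific prefix.
        self.storage.setdefault(f"raw/{source}", []).extend(records)

    def register(self, source: str, schema: dict) -> None:
        # Cataloging service: attach structure on top of the stored raw data.
        self.catalog[f"raw/{source}"] = schema

    def query(self, role: str, source: str, predicate) -> list:
        # Security service: RBAC check before any data is exposed.
        if f"raw/{source}" not in self.acl.get(role, set()):
            raise PermissionError(f"{role} cannot read {source}")
        # Processing + access services: filter and expose to the consumer.
        return [r for r in self.storage[f"raw/{source}"] if predicate(r)]

lake = DataLake()
lake.ingest("orders", [{"id": 1, "amount": 120}, {"id": 2, "amount": 40}])
lake.register("orders", {"id": "int", "amount": "int"})
lake.acl["analyst"] = {"raw/orders"}
print(lake.query("analyst", "orders", lambda r: r["amount"] > 50))
```

In a real lake each of these one-liners is a separate managed service (e.g. object storage, a metastore, an IAM system); the sketch only shows the order in which they touch the data.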

Reference Architecture of a Data Lake

The architecture of a Data Lake

Our Experiences with various Data Lake Implementations over the years

Data Lake with static HDFS cluster

Data Lake with Dynamic cluster

Data Lake with Serverless Compute

Design considerations of a Cloud-Native Data Lake Implementation

  • Segregation of storage and compute
  • Scalability
  • Polyglot Persistence
  • Serverless Compute
  • Central Security mechanism
  • Configurability to ensure the platform stays cloud-agnostic

Based on the above-mentioned principles, what we get is a purely cloud-native implementation that is not tied to any particular cloud vendor's platform.

Data Virtualization

Prevalent challenges with Data Lake implementations and how enabling DV solves them

This is where Data Virtualization comes into the picture and promises to solve the above issues. So, let’s understand what Data Virtualization is (referred to as DV from here on). DV is the ability to view, access, and analyze data without needing to know its location. It can integrate data sources across multiple data types and locations into a single logical view, without any physical data movement. It also offers auxiliary services such as cataloging, optimizers, and governance.
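The core idea, two independent sources exposed through one logical view with no physical data movement, can be sketched with two in-memory SQLite databases standing in for real remote systems (the source names and query are illustrative):

```python
import sqlite3

# Two independent "sources"; in practice these would be remote systems.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                    [(1, 99.0), (1, 25.0), (2, 10.0)])

def federated_spend():
    # The "federated query engine": push each sub-query down to its own
    # source, then join the partial results in the virtualization layer.
    names = dict(crm.execute("SELECT id, name FROM customers"))
    totals = dict(billing.execute(
        "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id"))
    return {names[cid]: totals.get(cid, 0.0) for cid in names}

print(federated_spend())  # {'Ada': 124.0, 'Lin': 10.0}
```

The consumer sees one logical answer; neither source's data was copied into the other, which is exactly what DV tools generalize across heterogeneous databases, files, and APIs.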

Benefits of Data Virtualization:

  • A unified view of data and logical centralization of data sources
  • Fewer errors, as it cuts down the number of data hops
  • Location flexibility, since physical data movement is not needed
  • Federated queries across varied data sources, with compute optimization
  • Supports urgency of insights and faster time to market

Functional components of a DV system:

  • Catalog service: To provide a unified view of all data sources
  • Federated and Distributed Query Engine: To query the supported data sources in a federated manner
  • Caching and Optimization: To use the fastest compute available and reduce latency
  • Security and Governance Layer: To provide role-based access to the data sources
  • Data distribution/access layer: To expose the data to downstream systems

Some of the tools/technologies that enable DV

Agile Data Lake

Introduction

How to bring Agility to a Data Lake

Reference Architecture of an Agile Data Lake

The architecture of an Agile Data Lake

The other services of the data lake will function as-is. The addition of the DV layer simply ensures that if end users need urgent insights from a new data source, it does not have to go through the full ingestion and ETL process; instead, just configuring the new data source's connector in the DV tool enables querying of the required data.
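That "connector configuration instead of ETL" step can be sketched as a small registry: adding a source is one registration call, after which it is immediately queryable alongside the lake (the `DVLayer` class, source names, and callables here are hypothetical stand-ins for a real DV tool's connectors):

```python
class DVLayer:
    """Toy stand-in for a data-virtualization layer's connector registry."""

    def __init__(self):
        self.connectors = {}

    def add_connector(self, name, fetch):
        # 'fetch' stands in for a real connector (JDBC URL, REST client, etc.).
        self.connectors[name] = fetch

    def query(self, name):
        # No ingestion or ETL: the source is queried in place via its connector.
        if name not in self.connectors:
            raise KeyError(f"no connector registered for {name}")
        return self.connectors[name]()

dv = DVLayer()
# An existing lake dataset and a brand-new source, both queryable immediately.
dv.add_connector("lake.orders", lambda: [{"id": 1, "amount": 120}])
dv.add_connector("new_crm.leads", lambda: [{"lead": "Acme", "stage": "demo"}])
print(dv.query("new_crm.leads"))
```

The new source never passed through the lake's ingestion pipeline; if the insights later prove durable, the same data can still be ingested and cataloged properly.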

Conclusion

BigData & Cloud Practice

Abzooba is an AI and Data Company. BD&C Practice is one of the fastest-growing groups in Abzooba, helping several Fortune 500 clients in their cognitive journey.