Designing an Agile Data Lake
Article by Gautam Kumar, Big Data & Cloud Practice Lead
In this article, I would like to familiarize readers with some basic Data Lake concepts and take them through the various flavors of data lake implementation seen across the industry. I will also take a deep dive into Data Virtualization and show how a judicious mix of virtualization and data lake components delivers the required agility.
A Data Lake is defined as a single version of truth for all types of data (structured, semi-structured, and unstructured) across the functions of an enterprise. It holds unprocessed and processed data in layers such as Raw, Bronze, Silver, and Gold, where Gold is the most pristine form of the data. A Data Lake is not just about data storage and processing; it also needs proper security mechanisms and data cataloging features. The data access layer for end-users is equally important, as it serves reporting, visualization, advanced analytics, and machine learning use-cases.
Difference between Data Lake and Data Warehouse
Both a Data Lake and a Data Warehouse exist to store and process data, but there are some key differences between them.
Functional Components of a Data Lake
Below are some of the main pillars of a Data Lake Implementation which are self-explanatory from their names.
- Ingestion service: Ingesting the data from varied sources.
- Storage service: Cheap Storage mechanism for all forms of data.
- Cataloging service: Creating structure on top of stored data.
- Data Processing service: Processing the data as per the use-case.
- Security service: Implementing role-based access control (RBAC), encryption, and other compliance requirements.
- Data Access service: Service layer to expose data to consumers.
Reference Architecture of a Data Lake
Here is what a typical Data Lake implementation looks like.
Our Experiences with various Data Lake Implementations over the years
In this section, I would like to take you through the various data lake implementations I have seen over the years and how they evolved with time.
Data Lake with static HDFS cluster
Initially, we implemented data lakes on static HDFS clusters, either on-premises or on cloud VMs such as AWS EC2. Hadoop distributions like Cloudera, Hortonworks, and MapR were installed on these VMs, and services were then enabled on the cluster as each use-case required. These clusters supported workloads in a multi-tenant setup. Scalability and resource contention were the main issues with this setup. And because storage and compute were co-located, the cluster had to run 24x7 even when no processing was needed.
Data Lake with Dynamic cluster
To overcome the issues we faced with the static setup, we moved to data lake implementations with dynamic compute clusters. The main principle behind such a setup is the segregation of storage and compute services. This was made possible by using services like S3, Azure Blob Storage, and GCS for storage and EMR, HDInsight, and Dataproc for processing. It also enabled auto-scaling of the processing cluster as needed, without any data redistribution. With storage segregated, compute clusters need to run only while workloads are running. This reduced processing costs and offset some of the extra cost incurred by using cloud services. Although this solved most of the earlier issues, managing the clusters was still an overhead, and it becomes especially painful if the design requires a separate cluster for each workload, which can mean literally hundreds of clusters spinning up and down every hour during peak periods.
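To make the cost argument concrete, here is a minimal back-of-the-envelope sketch in Python. All figures (node count, hourly rate, busy hours) are hypothetical placeholders, not numbers from a real deployment or a cloud price list, and storage cost is ignored on both sides.

```python
HOURS_PER_DAY = 24


def daily_compute_cost(nodes: int, rate_per_node_hour: float, hours_running: float) -> float:
    """Compute cost for one day: nodes x hourly rate x hours the cluster is up."""
    if not 0 <= hours_running <= HOURS_PER_DAY:
        raise ValueError("hours_running must be between 0 and 24")
    return nodes * rate_per_node_hour * hours_running


# Hypothetical workload: 10 nodes at 0.50 per node-hour, 6 busy hours per day.
NODES = 10
RATE = 0.50
BUSY_HOURS = 6

# Static cluster: storage and compute co-located, so it is always on.
static_cost = daily_compute_cost(NODES, RATE, HOURS_PER_DAY)
# Dynamic cluster: compute is up only while workloads run.
dynamic_cost = daily_compute_cost(NODES, RATE, BUSY_HOURS)
savings_fraction = 1 - dynamic_cost / static_cost

print(f"static:  {static_cost:.2f} per day")
print(f"dynamic: {dynamic_cost:.2f} per day")
print(f"compute saved: {savings_fraction:.0%}")
```

With these assumed inputs the dynamic cluster costs a quarter of the static one. The real ratio depends entirely on how many hours the workloads actually run and on any premium charged by the managed service.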
Data Lake with Serverless Compute
To remove the cluster management overhead, another variant of the data lake implementation emerged, one that uses serverless services for compute. Serverless does mean less control over the execution environment, but if it suits the use-case, it is an attractive proposition because it frees you from infrastructure management effort.
Design considerations of a Cloud-Native Data Lake Implementation
Below are the design principles that need to be considered when designing a cloud-native Data Lake implementation:
- Segregation of storage and compute
- Polyglot Persistence
- Serverless Compute
- Central Security mechanism
- Configurability to ensure cloud agnosticism
Based on the above-mentioned principles, what we get is a purely cloud-native implementation that is not dependent on any particular cloud vendor platform.
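As an illustration of the configurability principle, the sketch below keeps all cloud-specific detail in configuration so that pipeline code never hard-codes a vendor. The names `StorageConfig`, `dataset_uri`, and the provider keys are hypothetical, invented for this example. The URI templates follow the style commonly used by Hadoop-compatible storage connectors and should be checked against the connector actually in use.

```python
from dataclasses import dataclass

# URI templates in the style commonly used by Hadoop-compatible connectors.
# Provider keys and config fields are hypothetical names for this sketch.
URI_TEMPLATES = {
    "aws": "s3://{container}/{path}",
    "gcp": "gs://{container}/{path}",
    "azure": "abfss://{container}@{account}.dfs.core.windows.net/{path}",
}


@dataclass(frozen=True)
class StorageConfig:
    provider: str       # "aws", "gcp", or "azure"
    container: str      # bucket or container name
    account: str = ""   # only used by the azure template


def dataset_uri(cfg: StorageConfig, layer: str, dataset: str) -> str:
    """Build the storage URI for a dataset in a given layer (e.g. raw, gold)."""
    try:
        template = URI_TEMPLATES[cfg.provider]
    except KeyError:
        raise ValueError(f"unsupported provider: {cfg.provider}") from None
    return template.format(
        container=cfg.container, account=cfg.account, path=f"{layer}/{dataset}"
    )


# Pipeline code calls dataset_uri(); only the config differs per cloud.
print(dataset_uri(StorageConfig("aws", "acme-lake"), "gold", "sales"))
print(dataset_uri(StorageConfig("azure", "lake", "acmestore"), "gold", "sales"))
```

Moving to another cloud then means changing the `StorageConfig` values (and the credentials behind them), not the processing code.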
Prevalent challenges with Data Lake implementation and how enabling DV solves them
Data Lakes have been extremely helpful in supporting the ever-important use-cases of analytics, reporting, AI/ML, and visualization. However, a few problems have constantly dogged them: the struggle to keep datasets in sync with their data sources; duplication of data to support different views of the same data; and, last but probably biggest, the need to physically move data from the source system to the Data Lake, which results in latency (possibly days or weeks), security challenges (GDPR and HIPAA regulations), and governance issues.
This is where Data Virtualization comes into the picture and promises to solve the above issues. So, let’s understand what Data Virtualization (referred to as DV from here onwards) is. DV is the ability to view, access, and analyze data without needing to know its location. It can integrate data sources across multiple data types and locations, turning them into a single logical view without any physical data movement. It also provides auxiliary services such as cataloging, optimizers, and governance.
Benefits of Data Virtualization:
- A unified view of data and logical centralization of data sources
- Reduced errors, as it cuts down the number of data hops
- Location flexibility, as physical data movement is not needed
- Federated queries across varied data sources, with compute optimization
- Supports urgency of insights and faster time to market
Functional components of a DV system:
- Connectors: To connect to various data sources
- Catalog service: To provide a unified view of all data sources
- Federated and Distributed Query Engine: To query the supported data sources in a federated manner
- Caching and Optimization: To use the fastest compute available and reduce latency
- Security and Governance Layer: To have Role-based access to the data sources
- Data distribution/access layer: To expose the data to downstream systems
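The components listed above can be sketched in miniature. This is a toy illustration under stated assumptions, not how any particular DV product works: an in-memory SQLite database and a plain Python list stand in for two real source systems, `CATALOG`, `scan`, and `total_spend_by_customer` are hypothetical names, and caching, optimization, and security are left out entirely.

```python
import sqlite3
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]
Connector = Callable[[], Iterable[Row]]

# Source 1: a relational database (in-memory SQLite stands in for a real one).
_db = sqlite3.connect(":memory:")
_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
_db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])


def customers_connector() -> Iterable[Row]:
    cur = _db.execute("SELECT id, name FROM customers")
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, r)) for r in cur.fetchall()]


# Source 2: records from some other system (a plain list stands in for an API).
_ORDERS = [
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 25.0},
]


def orders_connector() -> Iterable[Row]:
    return list(_ORDERS)


# Catalog: logical table name -> connector. Nothing is copied at registration.
CATALOG: Dict[str, Connector] = {
    "customers": customers_connector,
    "orders": orders_connector,
}


def scan(table: str) -> List[Row]:
    """Read a logical table from wherever it lives, at query time."""
    return list(CATALOG[table]())


def total_spend_by_customer() -> Dict[str, float]:
    """A federated query: join two sources and aggregate, in memory."""
    names = {row["id"]: row["name"] for row in scan("customers")}
    totals: Dict[str, float] = {}
    for order in scan("orders"):
        name = names[order["customer_id"]]
        totals[name] = totals.get(name, 0.0) + order["amount"]
    return totals


print(total_spend_by_customer())
```

Making a new source queryable in this sketch takes one more entry in `CATALOG`; no data is ingested or duplicated. A real engine would additionally push filters down to each source and cache results rather than scanning every table in full.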
Some of the tools/technologies that enable DV
There are many DV tools available in the market. Some of the most prevalent ones I have seen across clients are Denodo (market leader in DV technology), Dremio (open-source data lake engine that also supports DV), Presto (federated query engine that, when integrated with other auxiliary services, provides the DV experience), FraXses, and IBM Cloud Pak.
Agile Data Lake
An Agile Data Lake is a setup that has all the advantages of a data lake and also brings the level of agility in data access needed to support urgent business needs and time to market. That agility may be missing in a general data lake implementation, which is constrained by the requirement to physically move data.
How to bring Agility to a Data Lake
Data virtualization is an extremely useful tool to have in the architectural toolkit, but it is not a panacea. DV can be really helpful where urgent insights are needed or where there are restrictions on moving data from the source (for example, under GDPR). It cannot serve every use-case, though, as some will require physically moving the data into the data lake and applying the processing needed to transform it. So, a judicious mix of physical consolidation and DV can deliver the level of agility needed for a modern, agile Data Lake.
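One way to picture this mix is as a per-use-case routing decision. The sketch below is only an illustration: the `UseCase` fields, the example cases, and especially the order in which the rules are checked are assumptions made for this example, not a prescribed policy.

```python
from dataclasses import dataclass
from enum import Enum


class AccessMode(Enum):
    VIRTUALIZE = "virtualize"   # query in place through the DV layer
    INGEST = "ingest"           # physically move into the lake and transform


@dataclass(frozen=True)
class UseCase:
    name: str
    urgent: bool                 # insight needed before a pipeline could be built
    movement_restricted: bool    # e.g. regulation prevents copying the data
    needs_heavy_transform: bool  # requires processing inside the lake


def choose_access_mode(uc: UseCase) -> AccessMode:
    # Rule order is an assumption of this sketch, not a prescription.
    if uc.movement_restricted:
        return AccessMode.VIRTUALIZE
    if uc.needs_heavy_transform:
        return AccessMode.INGEST
    if uc.urgent:
        return AccessMode.VIRTUALIZE
    return AccessMode.INGEST


cases = [
    UseCase("ad-hoc churn check on a new CRM feed", True, False, False),
    UseCase("EU customer records under GDPR", False, True, False),
    UseCase("nightly sales aggregation", False, False, True),
]
for uc in cases:
    print(f"{uc.name}: {choose_access_mode(uc).value}")
```

In practice the decision would weigh more factors, such as query volume and load on the source system, and a source that starts out virtualized may later be ingested once its use-case stabilizes.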
Reference Architecture of an Agile Data Lake
Here is what an Agile Data Lake should look like.
The other services of the data lake will function as they are. The addition of the DV layer ensures that if end-users need urgent insights from a new data source, the source does not have to go through the full ingestion and ETL process; configuring a connector for the new data source in the DV tool is enough to enable querying of the required data.
Any enterprise or business function looking to implement a data lake should certainly explore adding a DV tool to its toolkit, as it can be very helpful in creating a truly agile and evolved data lake design. Please feel free to reach out to me or the Abzooba Big Data and Cloud team if you are interested in further discussions.