Architectural Best Practices for Cloud-Native Data Lake Solutions

Article by Gautam Kumar, Big Data & Cloud Practice Lead


This paper is written for professionals in technology roles, and for anyone in IT interested in learning more about the architecture and implementation of cloud-native solutions. We hope it assists in evaluating the measures used while designing big data / data lake solutions natively on the cloud. It may also serve as a reference for companies that wish to compare their current solutions with competing data lake offerings on the market. Given its target audience, the paper often uses terms of art that, while familiar to many, may be new to some readers. Wherever possible, acronyms are explained, but the paper is not meant as a tutorial or authoritative source on the details of topics like virtualization, cataloging, encryption, or the architecture of data lake / big data systems.


We have divided the architectural best practices into four broad categories. The points are based largely on our experience and the learnings from the various cloud-native big data implementations we have delivered over the years, and which we still apply in our work with clients to create best-in-class architectures for them. The categories are:

  • Design and Tech Choices
  • Cost Optimization
  • Security
  • Operational Execution and Reliability

After reading this paper, you will understand the best practices and strategies to use when designing cloud architectures. This paper doesn’t provide deep implementation details or architectural patterns; however, it does include references to appropriate resources for this information. We hope these best practices benefit the wider community and also invite further discussion on the why and how of these implementations.

Design and Tech Choices

Iterative Design

The pace at which cloud vendors bring out new services and upgrade existing ones is phenomenal, and that is why an iterative design becomes so imperative. For example, in 2020 alone AWS has so far made more than 700 announcements of new services or new capabilities added to existing ones.

This also requires the design to be modular in nature, so that each of the components can be plugged in and out as needed.

Data Virtualization

Data lakes have been extremely helpful in supporting the ever-important use cases of analytics, reporting, AI/ML, and visualization. Yet a few problems have mired data lakes constantly: the struggle to keep datasets in sync with their data sources; duplication of data to support different views of it; and, probably the biggest challenge, the need to physically move data from the source system into the data lake, resulting in latency (possibly days or weeks), security challenges (GDPR, HIPAA regulations), and governance issues.

This is where Data Virtualization (DV) comes into the picture and promises to solve the above issues. DV is the ability to view, access, and analyze data without needing to know its location. It can integrate data sources across multiple data types and locations into a single logical view without any physical data movement, and it typically comes with auxiliary services like cataloging, optimizers, and governance.

Benefits of Data Virtualization:

  • A unified view of data and logical centralization of data sources
  • Reduced errors, as it cuts down the number of data hops
  • Location flexibility, as physical data movement is not needed
  • Federated queries among varied data sources, with compute optimization
  • Supports urgency of insights and time to market

Functional components of a DV system:

  • Connectors: To connect to various data sources
  • Catalog service: To provide a unified view of all data sources
  • Federated and Distributed Query Engine: To query the supported data sources in a federated manner
  • Caching and Optimization: To use the fastest compute available and reduce latency
  • Security and Governance Layer: To have Role-based access to the data sources
  • Data distribution/access layer: To expose the data to downstream systems
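The components above can be illustrated with a deliberately tiny sketch: two in-memory "sources" behind one logical view, with a dict standing in for the catalog service. The class names, source names, and schema are our own, for illustration only; a real DV engine (Denodo, Dremio, Presto) does this at far larger scale, with query pushdown, caching, and governance.

```python
class DictSource:
    """Stand-in for a connector to one underlying data store."""
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows          # list of dicts, already in a common schema

    def scan(self):
        return iter(self.rows)


class FederatedView:
    """A minimal federated query layer: unions rows from every
    registered source and filters them, without copying data upfront."""
    def __init__(self):
        self.catalog = {}         # the "catalog service": name -> source

    def register(self, source):
        self.catalog[source.name] = source

    def query(self, predicate):
        for source in self.catalog.values():
            for row in source.scan():
                if predicate(row):
                    # tag each row with where it came from
                    yield {**row, "_source": source.name}


# Hypothetical sources: a CRM database and a clickstream store.
view = FederatedView()
view.register(DictSource("crm", [{"customer": "acme", "tier": "gold"}]))
view.register(DictSource("clicks", [{"customer": "acme", "page": "/home"}]))

# One logical query spanning both sources, no data movement beforehand.
acme_rows = list(view.query(lambda r: r["customer"] == "acme"))
```

The point of the sketch is the shape of the layers (connector, catalog, federated query), not the implementation, which a production engine handles very differently.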


Some of the tools/technologies that enable DV

There is an ample number of DV tools available on the market, but the most prevalent ones we have seen across clients are Denodo (the market leader in DV technology), Dremio (an open-source data lake engine that also supports DV), Presto (a federated query engine which, when integrated with other auxiliary services, provides the DV experience), FraXses, and IBM Cloud Pak.

Design for Scalability

  • Ability to support an unprecedented explosion in data volume
  • Ability to handle all types of data: structured, semi-structured, and unstructured
  • A low-cost, highly scalable infrastructure to store all data versions: raw, intermediate, and processed
  • Ability to ingest data at a high rate from various data sources and endpoints
  • Ability to provide a unified view of all data, such that together they yield new business insights like the 360-degree view of the customer
  • Ability to support multiple consumption patterns

As we can see, most of these needs require the data lake system to be scalable, so that compute can be optimized over time: compute clusters need not run 24×7, but only when processing power is actually required in the system. This helps control the overall operations cost. Below are the guiding principles that will ensure a scalable data lake solution.

  • Segregate Compute and Storage
  • Dynamic compute with Auto-Scaling capability
  • Containerization/ Distributed Processing/ Stateless Application
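The "dynamic compute with auto-scaling" principle can be sketched as a simple scaling decision: size a pool of stateless workers from the current backlog, scaling to zero when idle. The thresholds and function name here are illustrative assumptions, not a real cloud API; in practice this logic lives in an autoscaler (e.g., EMR managed scaling or a Kubernetes HPA).

```python
def desired_workers(pending_tasks, tasks_per_worker=100,
                    min_workers=0, max_workers=50):
    """Return how many stateless workers the backlog calls for.

    Scales to zero when idle (compute is only paid for when needed)
    and caps growth so a spike cannot exhaust the budget.
    """
    if pending_tasks <= 0:
        return min_workers
    needed = -(-pending_tasks // tasks_per_worker)   # ceiling division
    return max(min_workers, min(needed, max_workers))
```

Because workers are stateless and storage is segregated from compute, any worker count is safe: scaling in or out never loses data.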

Polyglot Persistence

For Data Lake systems the persistence service needs to be chosen as per the access pattern, nature of data, and the use-case requirements like high throughput, global availability, transactional, Analytical, etc.

For example, relationship data can go to a graph database, analytics data to HDFS, transactional structured data to an RDBMS, and key/value data and logs to a NoSQL columnar database.

The usage pattern and the use-case requirements also play a major role in determining the persistent store. Heavy analytical queries should run against a data warehousing system; a high-throughput requirement can be handled by a NoSQL DB like AWS DynamoDB or Azure Cosmos DB; and unstructured data that needs processing should be stored in HDFS or an object store like S3/Blob.
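The mapping just described can be made explicit as a small routing table. The workload labels are our own shorthand for the access patterns above; the actual mapping for a given system should come from measured usage, not this sketch.

```python
# Illustrative polyglot-persistence routing table: workload -> store.
STORE_FOR_WORKLOAD = {
    "relationships":      "graph database",
    "heavy analytics":    "data warehouse",
    "high-throughput kv": "NoSQL database (e.g. DynamoDB / Cosmos DB)",
    "unstructured":       "object store (e.g. S3/Blob) or HDFS",
    "transactional":      "RDBMS",
}


def choose_store(workload):
    """Pick a persistence service for a known workload type."""
    try:
        return STORE_FOR_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"no persistence choice defined for {workload!r}")
```

Keeping the choice in one place also makes it easy to revisit as access patterns change, in line with the iterative-design principle above.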


Integrated Data Catalog

Modern data catalog solutions like Alation come with machine learning capabilities as well.
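At its core, an integrated catalog lets consumers discover data without knowing where it physically lives. The toy sketch below uses made-up dataset names and metadata fields to show the idea; real catalogs (AWS Glue Data Catalog, Alation, etc.) add lineage, profiling, and ML-driven curation on top.

```python
# Hypothetical catalog entries: name, lake zone, and searchable tags.
catalog = [
    {"dataset": "sales_raw",  "zone": "raw",       "tags": {"sales", "pii"}},
    {"dataset": "sales_agg",  "zone": "processed", "tags": {"sales"}},
    {"dataset": "web_clicks", "zone": "raw",       "tags": {"clickstream"}},
]


def find_datasets(tag):
    """Return dataset names carrying the given tag, sorted for stable output."""
    return sorted(e["dataset"] for e in catalog if tag in e["tags"])
```

A tag like "pii" also gives the governance layer a single place to hang access policies, which is why catalog and security layers are usually integrated.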


Cloud Agnostic design

Being cloud-agnostic means that you’re capable of switching to a different public cloud provider should the need arise, with minimal hiccups and disruption to your business.

If you avoid cloud vendor-specific services, build on broadly available services like Docker, Kubernetes, EMR, Databricks, Elasticsearch, and serverless functions, and make your code completely configurable, so that only configuration changes related to the cloud provider are needed to run on a specific platform, you end up with a very favorable design.
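One common way to realize the configuration-driven approach is to have the application depend only on an abstract interface, with a config key selecting the provider adapter. The sketch below is a minimal illustration with a stub in-memory adapter; the class names are ours, and real S3/Blob adapters would wrap the respective SDKs.

```python
class ObjectStore:
    """Abstract object storage; application code depends only on this."""
    def put(self, key, data):
        raise NotImplementedError

    def get(self, key):
        raise NotImplementedError


class InMemoryStore(ObjectStore):
    """Stand-in adapter; an S3Store or BlobStore would wrap the real SDK."""
    def __init__(self):
        self.blobs = {}

    def put(self, key, data):
        self.blobs[key] = data

    def get(self, key):
        return self.blobs[key]


# Registry of adapters; in a real system, "s3" / "blob" / "gcs" would be here.
ADAPTERS = {"memory": InMemoryStore}


def make_store(config):
    """Pick the storage backend from configuration, not from code."""
    return ADAPTERS[config["provider"]]()


store = make_store({"provider": "memory"})
store.put("raw/2020/01/events.json", b"{}")
```

Switching clouds then means shipping a new adapter and changing one config value, with no changes to the pipeline code itself.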

Cost Optimization

Reserved Instances

Optimize over time

Billing Alerts

Spot Instance blocks allow you to run Spot capacity for a defined duration without being interrupted while your job completes.
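The cost case for Spot capacity is simple arithmetic, sketched below. The prices and job length are hypothetical placeholders; actual Spot discounts vary by instance type, region, and time.

```python
def spot_savings(on_demand_price, spot_price, hours):
    """Return (total amount saved, percent discount) for a job of given length."""
    saved = (on_demand_price - spot_price) * hours
    pct = 100.0 * (on_demand_price - spot_price) / on_demand_price
    return saved, pct


# Made-up example: $1.00/hr on demand vs $0.25/hr Spot, 8-hour batch job.
saved, pct = spot_savings(on_demand_price=1.00, spot_price=0.25, hours=8)
```

For interruptible batch workloads, this is often the single largest cost-optimization lever, which is why the design principles above favor stateless, restartable compute.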

Resource tagging

Resource tracking and utilization monitoring

Advisor Services


Security

Central Security Controls


Least Privilege Principle

Infrastructure as Code


Security Vaults


Security Leakages check

End to End Encryption Service


Operational Execution and Reliability

Interactive Development Environment

HA and DR

Every business seeks data solutions that can address its operational requirements. These requirements often translate to specific values of the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). The RTO indicates how long the business can endure database and application outages, and the RPO determines how much data loss is tolerable. For example, an RTO of one hour tells us that, in the unfortunate event of an application outage, the recovery plans should aim to bring the application back online within one hour. Likewise, an RPO of zero indicates that, should there be any minor or major issue impacting the application, there should be no data loss after the application is brought back online. The combination of RTO and RPO requirements dictates what solution should be adopted. Typically, applications with RPO and RTO values close to zero need a high availability (HA) solution, whereas disaster recovery (DR) solutions can serve those with higher values. In many cases, HA and DR solutions are also mixed to address more complex requirements.
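The RPO reasoning above can be captured in one line: with periodic backups, the worst-case data loss is one full backup interval, so the interval must not exceed the RPO. The sketch below is illustrative; the figures in the usage are made up.

```python
def meets_rpo(backup_interval_minutes, rpo_minutes):
    """Periodic backups can lose up to one full interval of data,
    so the interval must be no larger than the tolerable loss (RPO)."""
    return backup_interval_minutes <= rpo_minutes


# Hourly backups satisfy a 4-hour RPO but not a 15-minute one.
hourly_vs_4h = meets_rpo(60, 240)
hourly_vs_15m = meets_rpo(60, 15)
# An RPO of zero fails for any positive interval, which is why it
# forces synchronous replication, i.e. an HA solution rather than DR.
any_backup_vs_zero = meets_rpo(1, 0)
```

The same shape of check applies to RTO against measured restore times when sizing a DR plan.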

Availability Zones are designed to provide separate failure domains while keeping workloads in relative proximity for low latency communications. Availability Zones are a good solution for synchronous replication of your databases using Mirroring, Always On Availability Groups, Basic Availability Groups, or Failover Cluster Instances. This is one of the main differences between most on-premises deployments and cloud deployments.

For critical components, perform multi-AZ and multi-region deployments to maintain system availability.


Chaos Engineering

It helps us answer the question, “How much confidence can we have in the complex, distributed systems that we put into production?” We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses can take the form of improper fallback settings when a service is unavailable, retry storms from improperly tuned timeouts, outages when a downstream dependency receives too much traffic, or cascading failures when a single point of failure crashes. We must address the most significant weaknesses proactively, before they affect our customers in production.
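The core move of chaos engineering is to inject failures deliberately and verify that the fallback holds. The miniature sketch below does this in-process with a seeded random generator; the names and failure rate are our own illustration, while tools like Chaos Monkey apply the same idea against real infrastructure.

```python
import random


def chaotic(func, failure_rate, rng):
    """Wrap func so it randomly raises, simulating a flaky dependency."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def with_fallback(func, fallback):
    """The resilience pattern under test: degrade to a default on failure."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RuntimeError:
            return fallback
    return wrapper


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky_lookup = chaotic(lambda key: {"a": 1}[key], failure_rate=0.5, rng=rng)
safe_lookup = with_fallback(flaky_lookup, fallback=None)

# Run the experiment: despite ~50% injected failures, every call returns.
results = [safe_lookup("a") for _ in range(100)]
```

The experiment passes if no injected failure escapes the fallback; a retry storm or cascading failure would show up here as an uncaught exception or an unbounded latency, exactly the weaknesses the section describes.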


Central Monitoring Mechanism


Automatic Ticket Generation


We hope that the discussion of these best practices sheds some additional light on the many considerations involved in implementing cloud-based data lake solutions for big data applications. We also hope it provides a new set of questions to consider when evaluating your own solutions. Abzooba's Big Data and Cloud Practice will continue to innovate and build upon this foundation with more features and learnings that raise the bar in data lake implementations across clients.



Abzooba is an AI and data company. The BD&C Practice is one of the fastest-growing groups in Abzooba, helping several Fortune 500 clients in their cognitive journey.
