Architectural Best Practices for Cloud-Native Data Lake Solutions

BigData & Cloud Practice
20 min read · Jan 12, 2021

Article by Gautam Kumar, Big Data & Cloud Practice Lead

Abstract

This whitepaper focuses on architectural best practices to attain the most value for the least cost with the required level of performance and reliability when building cloud-native Data-Lake/big-data solutions. In particular, this whitepaper explains how you can maintain a robust and best-in-class design, minimize your costs, maximize the availability of your system, optimize your infrastructure for maximum performance, and tighten it for security compliance, while also enabling operational excellence for ongoing maintenance. The flexibility of cloud services, combined with the power of Big Data solutions, can provide expanded capabilities for those who seek innovative approaches to optimize their applications and transform their businesses.

This paper is written for professionals in technology roles and IT practitioners interested in learning more about the architecture and implementation of cloud-native solutions. We hope that this paper assists in evaluating the measures used while designing big data / Data Lake solutions natively on the cloud. It may also act as a reference for companies that wish to compare their current and/or competing Data Lake offerings on the market. Given its target audience, the paper often uses terms of art that, while familiar to many, may be new to some readers. Wherever possible, acronyms are explained, but the paper is not meant as a tutorial or authoritative source for the details of topics like virtualization, cataloging, encryption, or the architecture of Data Lake / Big Data systems.

Introduction

This whitepaper provides architectural guidance and advice on technical design patterns and how they are applied in the context of Big Data and cloud platforms. It is intended for those in technology roles, such as chief technology officers (CTOs), IT heads, architects, developers, and operations team members. It helps you understand the pros and cons of decisions you make while building Data Lake/Big Data systems on the cloud. By applying the best practices below, you will learn to design and operate reliable, secure, efficient, and cost-effective systems in the cloud, and you will have a way to consistently measure your architectures against best practices and identify areas for improvement.

We have divided the architectural best practices into four broad categories. The points are based largely on our experience and the learnings gathered from the various cloud-native big data implementations we have delivered over the years, which we continue to apply in our work with clients to create best-in-class architectures for them. The categories of architectural best practices are:

  • Design and Tech Choices
  • Cost Optimization
  • Security
  • Operational Execution and Reliability

After reading this paper, you will understand the best practices and strategies to use when designing cloud architectures. This paper does not provide deep implementation details or architectural patterns; however, it does include references to appropriate resources for this information. We hope that these best practices will benefit the wider community and also spark further discussion on the why and how of these implementations.

Design and Tech Choices

Iterative Design

One of the important aspects of designing a cloud-native application is to keep the design iterative in nature: evaluate it at regular intervals, say every quarter, and compare every component with the services currently available, so that the architecture stays updated with the latest and greatest.

This also requires the design to be modular, so that each component can be plugged in and out as needed.

The pace at which cloud vendors are bringing out new services and upgrading existing ones is phenomenal, and that is the reason an iterative design becomes so imperative. For example, in 2020 alone AWS made more than 700 announcements of new services or new capabilities added to existing services.

https://aws.amazon.com/about-aws/whats-new/2020/

Data Virtualization

A judicious mix of physical consolidation and Data Virtualization (DV) can deliver the agility needed for modern and agile Data Lake solutions.

While Data Lakes have been extremely helpful in supporting the ever-important use cases of Analytics, Reporting, AI/ML, and Visualization, a few problems have dogged them constantly: the struggle to keep datasets in sync with their data sources; the duplication of data to support different views of the data; and, probably the biggest challenge, the need to physically move data from the source system to the Data Lake, which introduces latency (possibly days or weeks), security challenges (GDPR and HIPAA regulations), and governance issues.

This is where Data Virtualization comes into the picture and promises to solve the above issues. DV is the ability to view, access, and analyze data without needing to know its location. It can integrate data sources across multiple data types and locations, turning them into a single logical view without any physical data movement. It also provides auxiliary services such as cataloging, query optimization, and governance.

Benefits of Data Virtualization:

  • A unified view of data and logical centralization of data sources
  • Reduced Errors as it cuts down the data hops
  • Location flexibility, as no physical data movement is needed
  • Federated queries across varied data sources and compute optimization
  • Supports Urgency of Insights and Time to Market

Functional components of a DV system:

  • Connectors: To connect to various data sources
  • Catalog service: To provide a unified view of all data sources
  • Federated and Distributed Query Engine: To query the supported data sources in a federated manner
  • Caching and Optimization: To use the fastest compute available and reduce latency
  • Security and Governance Layer: To have Role-based access to the data sources
  • Data distribution/access layer: To expose the data to downstream systems

Reference: https://ansera.si/en/tibco-data-virtualization/

Some of the tools/technologies that enable DV

There is an ample number of DV tools available in the market, but some of the most prevalent ones we have seen across clients are Denodo (the market leader in DV technology), Dremio (an open-source Data Lake engine that also supports DV), Presto (a federated query engine which, when integrated with other auxiliary services, provides the DV experience), FraXses, and IBM Cloud Pak.
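To make the federated-query idea concrete, below is a minimal sketch using the open-source Presto Python client. The coordinator host, catalog, schema, and table names are hypothetical placeholders; the point is that one SQL statement joins an RDBMS source with data-lake storage without physically copying either dataset.

```python
# Minimal federated-query sketch using the presto-python-client package.
# Host, catalogs, schemas, and table names are hypothetical placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",   # assumed Presto/Trino coordinator
    port=8080,
    user="analyst",
    catalog="hive",
    schema="lake",
)

cur = conn.cursor()
# A single query spans an RDBMS catalog and a data-lake catalog.
cur.execute("""
    SELECT c.customer_id, c.segment, SUM(o.amount) AS lifetime_value
    FROM postgresql.crm.customers AS c
    JOIN hive.lake.orders        AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""")
for row in cur.fetchall():
    print(row)
```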

Design for Scalability

Some of the basic needs of cloud-native Data Lake systems are:

  • Ability to support an unprecedented explosion in data volume
  • Ability to handle all types of data: structured, unstructured, and semi-structured
  • Need for a low-cost and highly scalable infrastructure to store all data versions like Raw, Intermediate, and Processed
  • Need the ability to Ingest Data at a high rate from various data sources and endpoints
  • Need the ability to have a unified view of all data, such that they together provide new business insights like the 360-degree view of the customer
  • Ability to support multiple consumption patterns

As we can see, most of these needs require the Data Lake system to be scalable, so that compute can be optimized over time and the compute clusters need not run 24x7 but only when processing power is actually required in the system; this helps to control the overall operational cost. Below are the guiding principles that ensure a scalable Data Lake solution (a minimal provisioning sketch follows the list).

  • Segregate Compute and Storage
  • Dynamic compute with Auto-Scaling capability
  • Containerization/ Distributed Processing/ Stateless Application
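As one possible illustration of the first two principles, the sketch below (using boto3; cluster name, subnet, roles, and log bucket are placeholders) launches a transient Spark cluster whose data stays in S3 and whose compute scales between bounds instead of running at fixed size around the clock.

```python
# Minimal sketch: a transient EMR cluster with compute/storage segregation
# (data stays in S3) and managed auto-scaling. Names, subnet, and log bucket
# are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="datalake-batch-compute",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://example-datalake-logs/emr/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # cluster terminates when idle
        "Ec2SubnetId": "subnet-0123456789abcdef0",
    },
    # Compute scales between bounds rather than running a fixed-size cluster 24x7.
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 3,
            "MaximumCapacityUnits": 20,
        }
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```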

Polyglot Persistence

Polyglot persistence is the concept of using different data storage technologies to handle different data storage needs within a given enterprise and even software application.

For Data Lake systems the persistence service needs to be chosen as per the access pattern, nature of data, and the use-case requirements like high throughput, global availability, transactional, Analytical, etc.

For example, we can have relationship data going to a graph database, analytics data to HDFS, transactional structured data to an RDBMS, and key/value data and logs to a NoSQL columnar database.

The usage pattern and the use-case requirements also play a major role in determining the persistence store used. For example, heavy analytical queries should be run against a data warehousing system, a high-throughput requirement can be handled by a NoSQL database like AWS DynamoDB or Azure Cosmos DB, and unstructured data that needs processing should be stored in HDFS or an object store like S3/Blob, as illustrated in the sketch below.
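A minimal sketch of this routing, assuming a DynamoDB table and an S3 bucket whose names are placeholders: the same ingestion flow sends each record to the store that matches its access pattern.

```python
# Minimal polyglot-persistence sketch: route each record to the store that
# matches its access pattern. Table, bucket, and field names are placeholders.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

profiles = dynamodb.Table("customer-profiles")   # low-latency key/value lookups

def persist(record: dict) -> None:
    if record["type"] == "profile":
        # High-throughput point reads/writes -> NoSQL key/value store
        profiles.put_item(Item=record["payload"])
    else:
        # Raw/unstructured events -> cheap object storage for later batch analytics
        key = f"raw/events/{record['event_id']}.json"
        s3.put_object(
            Bucket="example-datalake-raw",
            Key=key,
            Body=json.dumps(record["payload"]).encode("utf-8"),
        )

persist({"type": "profile", "payload": {"customer_id": "C-100", "tier": "gold"}})
```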


Integrated Data Catalog

For most data lake / Big Data solutions, the data is stored across multiple datastores depending on the nature of the data, its usage pattern, and the use-case requirements. For the consumers of the data, it becomes very important to have a single unified view across all layers, one that indexes your data by source, so that the data can be analyzed seamlessly by end-users and everyone in the organization can find the data they need to collaborate. This is where an Integrated Data Catalog becomes a very good option; instead of going for cloud-specific services like the Glue Catalog, we can opt for third-party data catalog tools. The two most common data catalog tools we see across clients are Collibra and Alation.

Modern data catalog solutions like Alation come with machine learning capabilities as well.
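Whichever catalog is chosen, much of its value comes from being searchable programmatically as well as through a UI. As a minimal illustration (using the Glue Data Catalog API mentioned above; the database name is a placeholder, and third-party catalogs such as Alation or Collibra expose comparable REST APIs), a consumer can list registered datasets and their locations:

```python
# Minimal sketch: discover datasets registered in a central catalog.
# Database name is a hypothetical placeholder.
import boto3

glue = boto3.client("glue")

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="datalake_curated"):
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print(table["Name"], "->", location)
```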

Image Courtesy: https://www.attivio.com/blog/post/data-catalog-modern-data-architecture

Cloud Agnostic design

One of the most important considerations while implementing a cloud-native solution is how committed you are to a particular cloud vendor. Are you using cloud functions or services that are specific to a platform vendor and that can create the risk of vendor lock-in? If in the future you have to migrate your workloads from one cloud vendor to another because of an organization-level strategy re-alignment, you may have to refactor a good percentage of your code to make it compatible with the new cloud platform.

On the other hand, if you favor portable building blocks such as Docker, Kubernetes, managed Spark platforms (EMR, Databricks), Elasticsearch, and serverless functions, and make your code completely configurable so that configuration changes related to the cloud provider are enough to run it on any specific cloud platform, you end up with a very favorable design.

Being cloud-agnostic means that you’re capable of switching tracks to a different public cloud provider should the need arise, with minimal hiccups and disruption to your business.
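One minimal way to sketch this configuration-driven approach is an object-store abstraction that application code depends on, with the provider wired in from configuration. Bucket/container names and the connection string below are hypothetical placeholders, and the Azure branch assumes the azure-storage-blob package.

```python
# Configuration-driven storage abstraction: application code depends on
# ObjectStore, and a config value decides which provider is plugged in.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

class S3Store(ObjectStore):
    def __init__(self, bucket: str):
        import boto3
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

class AzureBlobStore(ObjectStore):
    def __init__(self, container: str, connection_string: str):
        from azure.storage.blob import BlobServiceClient  # assumes azure-storage-blob
        self._container = BlobServiceClient.from_connection_string(
            connection_string
        ).get_container_client(container)

    def put(self, key: str, data: bytes) -> None:
        self._container.upload_blob(name=key, data=data, overwrite=True)

def build_store(config: dict) -> ObjectStore:
    # Switching provider is a configuration change, not a code change.
    if config["provider"] == "aws":
        return S3Store(config["bucket"])
    return AzureBlobStore(config["container"], config["connection_string"])

store = build_store({"provider": "aws", "bucket": "example-datalake-raw"})
store.put("landing/sample.csv", b"id,value\n1,42\n")
```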

Cost Optimization

Reserved Instances

If an organization is committed to a cloud provider for its needs, it is a better choice to go for reserved instances. Every cloud provider offers various plans based on upfront payment and time commitment, and the cost saving can be 50% to 75% depending on the plan selected. On the other hand, reserving excess capacity should be avoided. To make sure that only the required capacity is reserved, an organization must forecast its requirements in terms of instance class, instance count, and amount of usage. An organization must also compare on-demand cost against reserved cost: there are cases where, if the daily usage hours are low, on-demand can give a better cost reduction than reserved instances. If an organization is not in a position to forecast, it can use convertible/flexible RIs (Reserved Instances), wherein cloud providers allow the reserved capacity to be converted from one instance class to another.

https://azure.microsoft.com/en-us/pricing/reserved-vm-instances/
https://aws.amazon.com/aws-cost-management/aws-cost-optimization/reserved-instances/
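As a back-of-the-envelope illustration of the on-demand versus reserved comparison described above, the sketch below uses made-up hourly rates (not published pricing) to show why low daily usage favors on-demand while near-continuous usage favors reservations.

```python
# Hypothetical break-even check: reserved pricing only pays off when the
# instance actually runs for enough hours. All rates below are made up.
ON_DEMAND_PER_HOUR = 0.20           # $/hour, hypothetical
RESERVED_EFFECTIVE_PER_HOUR = 0.12  # $/hour amortized over a 1-year commitment

HOURS_PER_MONTH = 730
reserved_monthly = RESERVED_EFFECTIVE_PER_HOUR * HOURS_PER_MONTH  # paid regardless of usage

for daily_hours in (4, 8, 16, 24):
    on_demand_monthly = ON_DEMAND_PER_HOUR * daily_hours * 30
    better = "reserved" if reserved_monthly < on_demand_monthly else "on-demand"
    print(f"{daily_hours:>2} h/day: on-demand ${on_demand_monthly:7.2f} "
          f"vs reserved ${reserved_monthly:7.2f} -> {better}")
```

With these hypothetical rates, 4 or 8 hours a day favors on-demand, while 16 or 24 hours a day favors the reservation; the same calculation, with real rates, supports the forecasting exercise described above.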

Optimize over time

In the traditional world, an exercise of capacity planning is carried out to determine the size of infrastructure required. As building and/or extending the infrastructure is very time consuming, generally this exercise considers workloads for a few years. Thus organizations would end up paying for the excess capacity. With the cloud, adding and removing infrastructure capacity can be done in a few minutes so it is advisable to start with the initial capacity required and then scale the infrastructure as needed. An organization must review its infrastructure usage at a regular frequency and determine the peak hours, idle hours, and moderate hours of consumption. Also, a matrix of resource utilization for various hours should be maintained. Once this matrix is created, an organization should look to either change the instance class and size or add more instances using the scaling feature provided by cloud providers. Initially, the scaling should be carried out in a controlled manner, manually at the start and once it gets stable, the auto-scaling rules can be configured to add and remove the capacity. This process should be repeated regularly as cloud providers periodically release new, optimized features on current services or release a new service that can give the required performance with the reduced cost.
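One way to build the utilization matrix described above is to pull metrics programmatically. A minimal sketch (the instance ID is a placeholder) using CloudWatch to flag under-utilized hours:

```python
# Minimal sketch: pull average hourly CPU utilization for one instance over the
# last week as input to a right-sizing/scaling decision. Instance ID is a placeholder.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,                 # one data point per hour
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    flag = "UNDER-UTILIZED" if point["Average"] < 10 else ""
    print(point["Timestamp"], f"{point['Average']:.1f}%", flag)
```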

Billing Alerts

Every cloud provider offers a billing and cost-exploration service to track the overall monthly cost and provides various options to break the cost up. This helps in understanding which service, which application, and which user is utilizing how much of the resources and how much each contributes to the monthly cost. Along with this, each cloud provider lets you create a monthly budget and send alerts when a configured threshold is breached. Along with cost, they also offer tracking of RI usage and RI coverage. All of this helps to ensure that the actual cost does not overshoot the defined budget. It is recommended to use these services for better cost tracking.

https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/budgets-managing-costs.html
https://docs.microsoft.com/en-us/azure/cost-management-billing/costs/cost-mgt-best-practices
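A minimal sketch of such a budget with an alert at 80% of the limit, using the AWS Budgets API (the account ID, amount, and e-mail address are hypothetical placeholders):

```python
# Minimal sketch: a monthly cost budget with an alert at 80% of the limit.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "datalake-monthly",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "dataops@example.com"}
            ],
        }
    ],
)
```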

Spot Instances

Spot instances are auction-based and, at the right price, can provide cost savings of up to 90%. Because spot instances can be reclaimed at any time, it is advisable to run only dev and test workloads on them, and even then in a hybrid fashion where some instances are on-demand/RI and some are spot. AWS also provides Spot Blocks, wherein a spot instance can be reserved for up to 6 hours, so small, time-bound workloads can be set up on spot instances; with Spot Blocks, cost savings of up to 50% can be achieved.

https://aws.amazon.com/about-aws/whats-new/2016/06/new-amazon-ec2-spot-console-now-supports-spot-fleet-and-spot-blocks/
https://azure.microsoft.com/en-us/pricing/spot/#documentation
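A minimal sketch of requesting a spot-priced worker for an interruption-tolerant dev/test job (AMI, subnet, and instance type are hypothetical placeholders):

```python
# Minimal sketch: request a spot-priced instance for an interruptible workload.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",  # reclaimable; suits interruption-tolerant jobs
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```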

Resource tagging

In a cloud environment, keeping track of each resource is critical to achieving cost goals. A well-designed tagging standard helps in segregating cloud resources per the information supplied in the tags. Resource tags become available in cost exploration as filtering/slicing columns, and thus a well-thought-through tagging mechanism can answer many important questions, such as who created the resource, when it was created, and for which requirement/application it was created, among others. There is no restriction on supplying additional information that can help in breaking up the cost. As a standard practice, tags for resource name, cost center, creation date, and business unit should be attached to each resource, as in the sketch below.
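A minimal sketch of attaching that standard tag set (the instance ID and tag values are hypothetical placeholders):

```python
# Minimal sketch: attach the standard tag set to a resource so cost exploration
# can slice spend by application, cost center, and business unit.
import boto3

ec2 = boto3.client("ec2")

ec2.create_tags(
    Resources=["i-0123456789abcdef0"],
    Tags=[
        {"Key": "Name", "Value": "datalake-ingest-worker"},
        {"Key": "cost-center", "Value": "CC-1042"},
        {"Key": "business-unit", "Value": "analytics"},
        {"Key": "created-on", "Value": "2021-01-12"},
        {"Key": "application", "Value": "orders-pipeline"},
    ],
)
```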

Resource tracking and utilization monitoring

One of the key aspects of cloud cost is tracking the resources created and their utilization. There is a good chance that many cloud resources are not being used, or are used at very low capacity. This can happen when someone creates a resource in a different region and then forgets to delete it, or creates a larger instance at the start to avoid issues and never reviews it in time. It is essential to identify such resources and either delete them or reduce their capacity. One way is to create an in-house script that produces a daily report of each resource and its utilization; another is to restrict users to creating resources in certain regions using IAM roles; and a final way is to use the cloud-provided advisory services to track and monitor the resources.

Advisor Services

An advisor service is offered by every cloud provider. This service scans all the resources and workloads for an account and publishes a comprehensive summary that is useful not only for reducing cost but also for identifying security gaps in applications, improving performance, and achieving fault tolerance. In terms of cost, it provides recommendations on idle resources, underutilized resources, unutilized resources, reserved instance lease expiration, and reserved instance optimization. It is advisable to run through the advisor report on a daily/weekly basis and create a thorough plan/strategy for implementing its recommendations as soon as possible.

https://aws.amazon.com/premiumsupport/technology/trusted-advisor/best-practice-checklist/
https://azure.microsoft.com/en-us/services/advisor/#documentation

Security

Security is a critical pillar in any IT landscape, and it takes center stage in public cloud implementations. This is the reason most cloud providers offer various services/methods to safeguard IT resources across all seven layers of the OSI model. There are many whitepapers and blogs published by AWS and Azure which are a very good read to understand cloud security models and which parameters/strategies to use to design a security model suitable for you; some of them are Intro to AWS Security, AWS Security Best Practices, the Azure shared responsibility model, and the Azure security documentation. In this whitepaper, we will restrict our discussion to the security aspects of a data lake.

Central Security Controls

A data lake is created not just by the data but by a coherent strategy of data governance, data cataloging and search, and data authentication and authorization. These are separate entities which, when stitched together, give a holistic view of the data lake. Because these components are separate systems, it becomes difficult to keep security controls, especially object- and data-level authentication and authorization, in sync between them. To overcome this and to simplify security, most cloud providers offer a data lake service that works as a centralized hub for data governance, cataloging and search, and ACLs. AWS Lake Formation and Azure Data Lake are a few examples of such services.
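As a minimal sketch of central control (assuming AWS Lake Formation as the hub; the role ARN, database, and table names are hypothetical placeholders), access is granted once at the catalog level rather than configured separately in every engine that reads the table:

```python
# Minimal sketch: grant table-level, read-only access centrally through
# Lake Formation instead of per-engine ACLs.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst-role"
    },
    Resource={
        "Table": {"DatabaseName": "datalake_curated", "Name": "orders"}
    },
    Permissions=["SELECT"],          # read-only access for the analyst role
    PermissionsWithGrantOption=[],   # analysts cannot re-grant this access
)
```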

Image Courtesy: https://azure.microsoft.com/en-gb/services/security-center/

Least Privilege Principle

The Least Privilege Principle is one of the best practices to follow in designing any security model: grant only the permissions required to carry out a task. With open access, there is a chance that users will access or create resources they are not entitled to, which increases the monthly cost and adds the effort of auditing user access and actions. Starting restrictive is more secure and easier to audit and maintain than starting with lenient access and then finding ways to tighten it. A detailed description is available in the cloud security best-practices whitepapers referenced earlier in this section.
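A minimal sketch of the principle in practice (bucket name and prefix are hypothetical placeholders): a policy scoped to read-only access on a single prefix, rather than blanket storage access.

```python
# Minimal sketch: least-privilege policy granting read-only access to one prefix.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-datalake-curated/reports/*",
        }
    ],
}

iam.create_policy(
    PolicyName="reports-readonly",
    PolicyDocument=json.dumps(policy_document),
)
```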

Infrastructure as a code

With the increase in data and the size of data lake processing, Infrastructure as Code (IaC) is becoming a popular way of building cloud resources. During the planning/design phase, the organization's central admin team creates a blueprint of the data lake and publishes a set of standards to be followed in creating cloud resources. However, as the application count grows, it becomes very difficult to track whether all of them are following the standards laid down by the central admin team. In such a scenario, Infrastructure as Code becomes a vital mechanism to ensure that all the resources being created follow the published guidelines. The central admin team can create and publish a template which all application teams must follow when creating their resources; the template can be a Terraform script, a CloudFormation template (on AWS), or a Resource Manager template (on Azure). Another advantage of Infrastructure as Code is that when an application is decommissioned, a single delete action removes all the resources used by the application, leaving very little chance of unused resources adding up in the monthly bill. A minimal sketch follows.
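The sketch below uses CloudFormation through boto3 purely as one illustration; a Terraform module would serve the same purpose. Stack and bucket names are hypothetical placeholders, and the template bakes in a standard (default encryption) that every application then inherits.

```python
# Minimal sketch: provision a data-lake bucket from a template published by the
# central admin team so every application inherits the same standards.
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "RawZoneBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "BucketName": "example-app-raw-zone",
                "BucketEncryption": {
                    "ServerSideEncryptionConfiguration": [
                        {"ServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
                    ]
                },
            },
        }
    },
}

cloudformation = boto3.client("cloudformation")
cloudformation.create_stack(
    StackName="example-app-datalake",
    TemplateBody=json.dumps(template),
    Tags=[{"Key": "application", "Value": "example-app"}],
)
# Decommissioning later is a single call that removes everything the stack created:
# cloudformation.delete_stack(StackName="example-app-datalake")
```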

Image Courtesy: https://docs.microsoft.com/en-us/azure/devops/learn/what-is-infrastructure-as-code

Security Vaults

Nowadays most organizations are building a centralized IT/admin team that manages the credentials, secrets, and certificates of the entire organization. To protect systems and their data from various threats, these secrets and certificates need to be protected, rotated, and audited. Protection can be achieved through secure storage, encryption, and authentication/authorization mechanisms for accessing the secrets. Once secrets are set, they need to be rotated at regular intervals or on demand, with a check that the new secret works as expected after every rotation. There should also be proper auditing in place to understand when a secret was last accessed, how many times it was accessed, and by whom. Organizations can build their own storage, rotation, and auditing scripts to achieve this, but this becomes difficult to manage as the infrastructure grows. In this case, third-party solutions can be very helpful. Third-party and cloud-provided secret vaults are available which provide, out of the box, secure storage, encryption, access control lists, rotation and testing, auditing, and revocation. AWS Secrets Manager and Azure Key Vault are a few examples of cloud-provided secret vaults.
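A minimal sketch of the consumption side (the secret name is a hypothetical placeholder): the application fetches the credential from the vault at runtime instead of hard-coding it, while rotation and auditing remain with the vault.

```python
# Minimal sketch: fetch a database credential from a central vault at runtime.
import json
import boto3

secrets = boto3.client("secretsmanager")

secret = secrets.get_secret_value(SecretId="datalake/warehouse/credentials")
credentials = json.loads(secret["SecretString"])

# The application only ever sees the current value; rotation and audit trails
# are handled by the vault itself.
print(credentials["username"])
```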

Image Courtesy: https://www.alamy.com/cloud-security-concept-with-3d-rendering-bank-vault-on-circuit-cloud-image256845305.html

Security Leakage Checks

A secrets vault solves most of the problems of securing credentials and managing them. However, there is still the risk of human error: a developer may end up storing secrets in the code and publishing them to a code repository. To avoid such leaks, the code repository needs to be checked periodically for exposed secrets. Tools like TruffleHog can be very useful for this kind of requirement. TruffleHog scans the entire repository, including all branches and tags, and raises an alert when it finds a suspicious string that could be a secret. It can be integrated with CI/CD to catch security leaks in the code at check-in time.
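The sketch below is not TruffleHog itself, only a highly simplified illustration of the kind of pattern scan such tools run (the real tools also walk git history and apply entropy checks); the patterns and paths are illustrative.

```python
# Simplified illustration of a secret scan over a working copy.
import re
from pathlib import Path

SUSPICIOUS_PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "Hard-coded password": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
}

def scan_repository(root: str) -> None:
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for label, pattern in SUSPICIOUS_PATTERNS.items():
            for match in pattern.finditer(text):
                print(f"{path}: possible {label}: {match.group(0)[:12]}...")

scan_repository(".")
```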

End to End Encryption Service

In a data lake, data protection, both at rest and in transit, is a basic ask. Common encryption methods convert the data into binary format and then encrypt it. Organizations nowadays strive to encrypt the data from its origin and in its original format rather than converting it into binary format for encryption. This is where FPE (Format-Preserving Encryption) and end-to-end encryption tools like Voltage can be very useful.


Operational Execution and Reliability

Interactive Development Environment

One of the major issues faced by Big Data and cloud developers is that local IDE tools cannot help them much with development of, for example, Spark-based programs. Even if they write the code on their local system, they cannot run it locally because the code needs to run in a cluster environment, and creating a CI/CD pipeline just to run it may not be feasible in many cases. Under such scenarios, our recommendation is to use a notebook-based (Jupyter/Zeppelin) development environment for Spark programs, integrated with native-cloud/custom CI/CD pipelines. Notebooks are interactive in nature and help with unit-level testing, while the CI/CD pipeline helps with overall integration testing.
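A notebook-style sketch of what that unit-level testing can look like (column names and values are illustrative): the transformation is kept as a plain function so it can be checked interactively on a small in-memory DataFrame before the CI/CD pipeline runs it against real cluster data.

```python
# Notebook-style sketch: unit-level check of a Spark transformation on sample data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-dev").getOrCreate()

def add_order_value(df):
    return df.withColumn("order_value", F.col("quantity") * F.col("unit_price"))

# Next cell: quick check on in-memory data before integration testing.
sample = spark.createDataFrame(
    [("o-1", 2, 10.0), ("o-2", 5, 3.0)],
    ["order_id", "quantity", "unit_price"],
)
result = add_order_value(sample).collect()
assert result[0]["order_value"] == 20.0
assert result[1]["order_value"] == 15.0
```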

HA and DR

Design for high availability and get the right disaster recovery strategy as per the criticality of the system

Every business seeks data solutions that can address its operational requirements. These requirements often translate to specific values of the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). The RTO indicates how long the business can endure database and application outages, and the RPO determines how much data loss is tolerable. For example, an RTO of one hour tells us that, in the unfortunate event of an application outage, the recovery plans should aim to bring the application back online within one hour. Likewise, an RPO of zero indicates that, should there be any minor or major issues impacting the application, there should be no data loss after the application is brought back online. The combination of RTO and RPO requirements dictates what solution should be adopted. Typically, applications with RPO and RTO values close to zero need to use a high availability (HA) solution, whereas disaster recovery (DR) solutions could be used for those with higher values. In many cases, HA and DR solutions can also be mixed to address more complex requirements.

Availability Zones are designed to provide separate failure domains while keeping workloads in relative proximity for low latency communications. Availability Zones are a good solution for synchronous replication of your databases using Mirroring, Always On Availability Groups, Basic Availability Groups, or Failover Cluster Instances. This is one of the main differences between most on-premises deployments and cloud deployments.

For critical components, perform multi-AZ and multi-region deployments to maintain system availability.

Image Courtesy: https://www.mssqltips.com/sqlservertip/6397/azure-vm-high-availability-and-disaster-recovery-options-for-sql-server/

Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production.

It helps us answer the question: how much confidence can we have in the distributed and complex systems that we put into production? We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses can take the form of improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; and so on. We must address the most significant weaknesses proactively, before they affect our customers in production.
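The toy sketch below is not a full chaos platform, just an illustration of the underlying idea: wrap a downstream call so that, in a controlled experiment, a fraction of requests fail or slow down, and verify that the caller's fallback degrades gracefully instead of cascading. The function and failure rate are illustrative.

```python
# Toy fault-injection sketch: inject failures and latency into a dependency call
# and observe whether the caller's fallback behaves as intended.
import random
import time

def chaos(failure_rate: float = 0.2, extra_latency_s: float = 0.5):
    def decorator(func):
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("chaos: injected dependency failure")
            time.sleep(random.uniform(0, extra_latency_s))  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.3)
def fetch_reference_data():
    return {"status": "ok"}

def fetch_with_fallback():
    try:
        return fetch_reference_data()
    except ConnectionError:
        return {"status": "cached"}   # degrade gracefully instead of cascading

print([fetch_with_fallback()["status"] for _ in range(5)])
```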

Image courtesy: https://www.lynda.com/Developer-tutorials/DevOps-Foundations-Chaos-Engineering/5028636-2.html

Central Monitoring Mechanism

Most Data Lake implementations have multiple components in a plug-and-play manner. These components may have their own monitoring patterns, but as a best practice there has to be a central monitoring framework covering all the components, so that system users get a single-pane view of all monitoring logs and metrics of the Data Lake system. Services like Amazon CloudWatch and Azure Monitor can be configured to enable this central monitoring mechanism, as in the sketch below.
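A minimal sketch of the pattern (namespace, metric, and dimension names are hypothetical placeholders): every pipeline component pushes its metrics into one shared namespace so dashboards and alarms live in a single pane.

```python
# Minimal sketch: publish pipeline metrics into a shared namespace for
# central dashboards and alarms.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="DataLake/Pipelines",
    MetricData=[
        {
            "MetricName": "RecordsProcessed",
            "Dimensions": [{"Name": "Pipeline", "Value": "orders-ingest"}],
            "Value": 182340,
            "Unit": "Count",
        },
        {
            "MetricName": "RunDurationSeconds",
            "Dimensions": [{"Name": "Pipeline", "Value": "orders-ingest"}],
            "Value": 612,
            "Unit": "Seconds",
        },
    ],
)
```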

Image Courtesy: https://www.networkmanagementsoftware.com/solarwinds-security-and-event-manager-review/

Automatic Ticket Generation

Whenever a data pipeline fails, most ETL flows have some sort of notification service configured that emails the stakeholders on the success or failure of particular pipelines. One issue with such a design is tracking the failure cases through email: in most cases, low-priority failures remain untracked or slip through the cracks unless there is a very strong process methodology in place. One good practice we see emerging among clients is integrating a ticketing system with the data pipeline and ETL workflows. Any failure results in a ticket opened in the ticketing system, which takes care of notification, tracking, and SLAs for the failure tickets. This has been very helpful for high-priority production workloads.
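A minimal sketch of that failure-to-ticket integration, assuming a Jira-style REST API; the base URL, project key, and credentials are hypothetical placeholders, and other ticketing systems expose comparable endpoints.

```python
# Minimal sketch: open a ticket automatically when a pipeline run fails.
import requests

JIRA_URL = "https://example.atlassian.net"
AUTH = ("pipeline-bot@example.com", "api-token-placeholder")

def open_failure_ticket(pipeline: str, error: str) -> str:
    payload = {
        "fields": {
            "project": {"key": "DATAOPS"},
            "summary": f"[{pipeline}] pipeline run failed",
            "description": error,
            "issuetype": {"name": "Bug"},
        }
    }
    response = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH)
    response.raise_for_status()
    return response.json()["key"]

try:
    raise RuntimeError("orders-ingest: source file missing for 2021-01-12")
except RuntimeError as exc:
    ticket = open_failure_ticket("orders-ingest", str(exc))
    print("Opened", ticket)
```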

Conclusion

In this whitepaper, we described several best practices for designing, building, and deploying cloud-native data lake implementations. Some of the terms used here may be specific to a particular cloud platform vendor like AWS or Azure, but those are used for representation purposes only and do not indicate any affinity towards a particular vendor. Each best practice and its associated trade-offs may be embraced according to particular business requirements. The four pillars of architectural best practices (design and tech choices, cost optimization, security, and operational execution) can be explored as applicable to data workloads and the cloud/custom-built services supporting the requirements.

We hope that the discussion of these best practices sheds some additional light on the large number of capabilities available to big data applications implemented as cloud-based data lake solutions, and that it provides a new set of questions to consider when evaluating your own solutions. Abzooba's Big Data and Cloud Practice will continue to innovate and build upon this foundation with more features and learnings that raise the bar across Data Lake implementations for our clients.


BigData & Cloud Practice

Abzooba is an AI and data company. The BD&C Practice is one of the fastest-growing groups in Abzooba, helping several Fortune 500 clients in their cognitive journey.