Delta Lake learnings & challenges

BigData & Cloud Practice
4 min read · Nov 26, 2020

article by Timmanna Channal, Big Data & Cloud Developer

Purpose

Existing data lakes don’t support ACID operations. Tools like Apache Hudi and Apache Iceberg add ACID capabilities, but they don’t offer fully Spark-supporting APIs. Hence Delta Lake was introduced: a sub-project started at Databricks (the company founded by the creators of Apache Spark) and later open sourced. In this blog we introduce Delta Lake and its features.

Introduction to Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Delta Lake brings reliability to the data stored in the data lake.

Some of the key features of the tool are listed below; a short PySpark sketch after the list illustrates a few of them (time travel, schema evolution, and updates/deletes).

  • ACID Transactions
    Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers must go through a tedious process to ensure data integrity due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes and provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling
    In big data, even the metadata itself can be “big data”. Delta Lake treats metadata just like data, leveraging Spark’s distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning)
    Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format
    All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink
    A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement
    Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution
    Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need to update the table schema using DDL.
  • Audit History
    Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes
    Delta Lake supports Scala/Java/Python APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API
    Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
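
To make a few of these features concrete, here is a minimal PySpark sketch covering time travel, schema evolution via the mergeSchema write option, and updates/deletes through the DeltaTable API. It assumes a pyspark shell started with the delta-core package (as shown in the next section), so a spark session is already available; the path /tmp/delta/events is just an example.

from pyspark.sql.functions import lit
from delta.tables import DeltaTable  # available once delta-core is on the classpath

path = "/tmp/delta/events"  # example path, adjust as needed

# Version 0: initial write
spark.range(0, 5).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("overwrite").save(path)

# Version 1: append rows with an extra column and let Delta merge the schema
spark.range(5, 10).withColumnRenamed("id", "event_id") \
    .withColumn("source", lit("backfill")) \
    .write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)

# Time travel: read the table as of the earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Updates and deletes via the DeltaTable API
events = DeltaTable.forPath(spark, path)
events.delete("event_id < 2")                           # delete by predicate
events.update("event_id = 7", {"source": "'manual'"})   # update with an expression
events.history().show()                                 # audit trail of every change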

Getting started with Delta Lake

Delta Lake ships as a jar file hosted on the Maven repository and can be pulled in with the --packages option:

bin/pyspark --packages io.delta:delta-core_2.12:0.5.0
bin/spark-shell --packages io.delta:delta-core_2.12:0.5.0
bin/spark-submit --packages io.delta:delta-core_2.12:0.5.0
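
Alternatively, a standalone PySpark script can pull the same package in when it builds its session; this is only a sketch and the app name is illustrative.

from pyspark.sql import SparkSession

# The package is downloaded when the session starts; the coordinates must
# match the commands above.
spark = (SparkSession.builder
         .appName("delta-quickstart")
         .config("spark.jars.packages", "io.delta:delta-core_2.12:0.5.0")
         .getOrCreate())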

To convert an existing Spark application to the Delta format, the following changes must be made.
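
A minimal sketch of those changes (the paths are illustrative and the DataFrame is a stand-in for your own): writes and reads that used the parquet format simply switch to delta, and an existing Parquet directory can be converted in place with DeltaTable.convertToDelta.

from delta.tables import DeltaTable  # needs the delta-core package on the classpath

df = spark.range(0, 10).toDF("id")   # stand-in for an existing DataFrame

# Before: df.write.format("parquet").save(path)
# After:  the only change is the format string
df.write.format("delta").mode("overwrite").save("/tmp/delta/demo")

# Reading back is the same one-word change
spark.read.format("delta").load("/tmp/delta/demo").show()

# An existing Parquet directory can also be converted in place (illustrative path)
# DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/demo`")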

Details of metadata in Delta Lake

Delta Lake stores the metadata of the data in a _delta_log folder. For each new transaction a new commit file is created under the log folder, and after every 10 commits a compaction runs in the background and checkpoints a compacted commit file to disk, which is then used when reading the latest state of the table.
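
A quick way to see this (a sketch using an example local path) is to write a couple of commits and list the log folder.

import os

path = "/tmp/delta/log_demo"  # example local path

# Two commits: an initial write and an append
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
spark.range(5, 10).write.format("delta").mode("append").save(path)

# Each commit is a JSON file; after every 10 commits a *.checkpoint.parquet
# file and a _last_checkpoint pointer also appear in this folder.
print(sorted(os.listdir(os.path.join(path, "_delta_log"))))
# e.g. ['00000000000000000000.json', '00000000000000000001.json']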

Reading Delta Lake data

Some of the ways in which we can read Delta data are listed below; the sketch after the list shows the manifest generation that Presto and Athena rely on.

  1. Spark API
  2. Presto
  3. AWS Athena
  4. Hive
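
For Presto and Athena, Delta Lake (0.5.0 and later) can generate a symlink manifest that those engines query as an external table, while Hive goes through the separate Delta Hive connector. A hedged sketch of the manifest generation, with an illustrative path:

from delta.tables import DeltaTable

demo_table = DeltaTable.forPath(spark, "/tmp/delta/demo")  # example table path

# Writes _symlink_format_manifest/ under the table directory; Presto/Athena
# external tables are then defined over that manifest location.
demo_table.generate("symlink_format_manifest")

Note that the manifest has to be regenerated after the table changes, and Hive additionally needs an external table registered in its metastore, as discussed in the limitations below.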

Disadvantages

Features that are not supported by Delta Lake as of now are listed below.

  1. There are no pre-built connectors available.
  2. We can’t read the data directly from BI tools; we have to build our own connectors.
  3. To read the data from Hive, we must use the connector and first create an external table in the Hive metastore. Using the connector we can read the data, but we can’t write it. Because the metadata has to be read first, read/write operations are slower.

Conclusion

Delta Lake addresses the problems and challenges of data lakes and simplifies how you build them. As it is an open-source project, developers can get hands-on with it and can also develop new connectors.
