An Introduction to Big Data Formats
article by Dilip Khandelwal, Big Data & Cloud Developer
Purpose
The goal of this blog is to introduce the popular big data file formats Avro, Parquet, and ORC. We aim to understand their benefits and disadvantages as well as the context in which they were developed. Choosing the right big data format is essential to achieving optimal performance and desired business outcomes.
Introduction
The big data world predominantly relies on three file formats optimized for storing big data: Avro, Parquet, and Optimized Row-Columnar (ORC).
Similarities and differences between the Avro, ORC, and Parquet formats
Similarities:
- All three formats store data in a machine-readable binary format, which means the data cannot be read directly by humans, unlike CSV and JSON, which are human-readable.
- Datasets can be split across multiple disks, enabling large-scale parallel data processing and considerably increasing processing speed.
- They are self-describing formats: the schema travels with the data, so a copy of a Parquet (or Avro or ORC) file can be moved to another machine without any loss of interpretability.
- They are on-the-wire formats: they can easily be used to pass data between nodes in a cluster.
Differences:
- Parquet and ORC store data in a columnar format, meaning the data is laid out for fast retrieval of individual columns. This is ideal for read-heavy analytical workloads, i.e. queries that use only a few columns for analysis or that perform complex aggregations (see the sketch after this list).
- Avro is a row-based format, which makes it better suited to write-heavy workloads and to queries that need to read or write most or all of the columns in each row.
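To make the read-heavy case concrete, here is a minimal sketch using the pyarrow library to read just two columns from a Parquet file. The file name and column names (transactions.parquet, product_id, sale_amount) are illustrative, not part of any particular dataset.

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Only the two columns the query needs are read from disk;
# the other columns in the file are never scanned.
table = pq.read_table("transactions.parquet", columns=["product_id", "sale_amount"])
print("total sales:", pc.sum(table["sale_amount"]).as_py())
```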
Choose the right Big Data Format
To choose the right format, it helps to apply a simple evaluation framework with four key considerations:
- Row or Column Store (R)
- Compression (C)
- Schema Evolution (E)
- Splitability (S)
Row vs. Column Store
The most important consideration when selecting a big data format is whether a row or column-based format is best suited to your objectives. At the highest level, column-based storage is most useful when performing analytics queries that require only a subset of columns examined over substantial data sets. If your queries require access to all or most of the columns of each row of data, row-based storage will be better suited to your needs.
To demonstrate the differences between row and column-based data, consider this table of primary transaction data. For each transaction, we have the customer name, the product ID, sale amount, and the date.
Row-based storage is the simplest representation of the data table and is used in many applications, from web log files to highly structured database systems like MySQL and Oracle.
In a database, this data would be stored by row, as follows:
Emma,Prod1,100.00,2018-04-02;Liam,Prod2,79.99,2018-04-02;Noah,Prod3,19.99,2018-04-01;Olivia,Prod2,79.99,2018-04-03
To process this data, a computer would read this data from left to right, starting at the first row and then reading each subsequent row.
Column-based data formats store data by column. Using our transaction data as an example, in a columnar database this data would be stored as follows:
Emma,Liam,Noah,Olivia;Prod1,Prod2,Prod3,Prod2;100.00,79.99,19.99,79.99;2018-04-02,2018-04-02,2018-04-01,2018-04-03
In columnar formats, data is stored sequentially by column, from top to bottom, not by row, left to right. Having data grouped by column makes it easier to focus computation on specific columns of data.
In a row-based format, a query that needs only one or two columns still forces the computer to read a lot of unnecessary data across the whole data set. That requires more time and incurs higher compute costs.
By contrast, the column-based representation lets the computer skip straight to the relevant columns and read only the values it needs, bypassing the rest of each row entirely. This makes both computation and compression more efficient, as the sketch below illustrates.
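Here is a minimal sketch in plain Python (no libraries) of the same transaction table in both layouts. It only illustrates why a per-column computation touches less data in the columnar layout; it is not how the formats are implemented on disk.

```python
# Row-based: each record is stored (and read) as a whole.
rows = [
    ("Emma",   "Prod1", 100.00, "2018-04-02"),
    ("Liam",   "Prod2",  79.99, "2018-04-02"),
    ("Noah",   "Prod3",  19.99, "2018-04-01"),
    ("Olivia", "Prod2",  79.99, "2018-04-03"),
]
total_row_store = sum(row[2] for row in rows)  # must walk every full row

# Column-based: each column is stored contiguously and can be read on its own.
columns = {
    "customer":    ["Emma", "Liam", "Noah", "Olivia"],
    "product_id":  ["Prod1", "Prod2", "Prod3", "Prod2"],
    "sale_amount": [100.00, 79.99, 19.99, 79.99],
    "sale_date":   ["2018-04-02", "2018-04-02", "2018-04-01", "2018-04-03"],
}
total_col_store = sum(columns["sale_amount"])  # only one column is touched

assert total_row_store == total_col_store
```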
Compression
Data compression reduces the amount of information needed to store or transmit a given set of data. It reduces the resources required to store and transmit data, typically saving time and money. Compression works by encoding frequently repeating data more compactly, and it is applied at the source, before the data is stored and/or transmitted. Columnar formats have an advantage here: because similar values are stored next to each other, they typically compress better than row-based formats.
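As a rough illustration, the sketch below writes the same table with different Parquet compression codecs using pyarrow and compares the resulting file sizes. The table contents and file names are made up, and the exact savings will vary with the data.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# A deliberately repetitive table, so the codecs have something to compress.
table = pa.table({
    "product_id": ["Prod1", "Prod2", "Prod3", "Prod2"] * 10_000,
    "sale_amount": [100.00, 79.99, 19.99, 79.99] * 10_000,
})

for codec in ["NONE", "SNAPPY", "GZIP"]:
    path = f"sales_{codec.lower()}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```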
Schema Evolution
Schema, in the context of a dataset, refers to the column names and types. As a project matures, there may be a need to add new columns or alter existing ones, thus changing the dataset's schema. All three formats offer some level of schema evolution support, although Avro is far superior in this regard compared to the other two (see the sketch below).
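As a minimal sketch of what schema evolution looks like in practice, the example below uses the fastavro library: data is written with an old two-field schema and read back with a newer schema that adds a column with a default value. The record and field names are illustrative.

```python
from io import BytesIO
import fastavro

# Writer schema: the original two-field record.
writer_schema = {
    "type": "record", "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "sale_amount", "type": "double"},
    ],
}

# Reader schema: a later version that adds a column with a default value.
reader_schema = {
    "type": "record", "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "sale_amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

buf = BytesIO()
fastavro.writer(buf, writer_schema, [{"customer": "Emma", "sale_amount": 100.0}])
buf.seek(0)

# Old data read with the new schema: the missing field is filled from the default.
for record in fastavro.reader(buf, reader_schema=reader_schema):
    print(record)  # {'customer': 'Emma', 'sale_amount': 100.0, 'currency': 'USD'}
```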
Splitability
Datasets are commonly composed of hundreds to thousands of files, each of which may contain thousands to millions of records or more. Furthermore, these file-based chunks of data are often being generated continuously. Processing such datasets efficiently usually requires breaking the job up into parts that can be farmed out to separate processors. In fact, large-scale parallelization of processing is key to performance. A splittable format allows even a single large file to be divided at well-defined boundaries so that each part can be processed independently, as the row-group example below illustrates.
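For example, a Parquet file is internally divided into row groups that can be read independently, which is exactly the property a parallel engine exploits. A minimal sketch with pyarrow (the file name is illustrative):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("transactions.parquet")
print("row groups:", pf.num_row_groups)

# Each row group is an independent chunk; separate workers could each take one.
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i)
    print(f"row group {i}: {chunk.num_rows} rows")
```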
Understanding the Formats
Let us now explore the three file formats in more detail.
APACHE PARQUET: A COLUMN-BASED FORMAT
Launched in 2013, Parquet was developed by Cloudera and Twitter (and inspired by Google’s Dremel query system) to serve as an optimized columnar data store on Hadoop. Because data is stored by column, it can be highly compressed and is splittable.
The column metadata for a Parquet file is stored at the end of the file, which allows for fast, one-pass writing. Metadata can include information such as data types, the compression/encoding scheme used (if any), statistics, element names, and more.
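A minimal sketch of inspecting that footer metadata with pyarrow (the file name is illustrative):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("transactions.parquet").metadata
print(meta)                                     # rows, row groups, columns, format version
print(meta.row_group(0).column(0).statistics)   # per-column min/max/null counts, if present
```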
APACHE AVRO: A ROW-BASED FORMAT
Apache Avro was released by the Hadoop working group in 2009. It is a row-based format that is highly splittable. The data definition (schema) is stored in JSON format while the data itself is stored in binary format, minimizing file size and maximizing efficiency. Avro features robust support for schema evolution, handling added fields, missing fields, and fields that have changed.
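A minimal sketch of this split between a JSON schema and binary data, using the fastavro library (the schema, records, and file name are illustrative):

```python
import fastavro

# The data definition is plain JSON-style metadata...
schema = {
    "type": "record", "name": "Transaction",
    "fields": [
        {"name": "customer", "type": "string"},
        {"name": "product_id", "type": "string"},
        {"name": "sale_amount", "type": "double"},
    ],
}
records = [{"customer": "Emma", "product_id": "Prod1", "sale_amount": 100.0}]

# ...while the records themselves are serialized as compact binary.
with open("transactions.avro", "wb") as out:
    fastavro.writer(out, schema, records)

with open("transactions.avro", "rb") as fo:
    print(list(fastavro.reader(fo)))  # the schema travels with the file
```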
APACHE ORC: A ROW-COLUMNAR FORMAT
Optimized Row Columnar (ORC) format was first developed at Hortonworks to optimize storage and performance in Hive, a data warehouse for summarization, query and analysis that lives on top of Hadoop.
This row-columnar format is highly efficient for compression and storage. It supports parallel processing across a cluster, and its columnar layout lets readers skip unneeded columns for faster processing and decompression. Even without compression, ORC files store data more efficiently than compressed text files. Like Parquet, ORC is a good option for read-heavy workloads.
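A minimal sketch of writing and reading ORC with pyarrow, which exposes ORC alongside Parquet (the table and file name are illustrative, and ORC support requires a reasonably recent pyarrow version):

```python
import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "customer": ["Emma", "Liam", "Noah", "Olivia"],
    "sale_amount": [100.00, 79.99, 19.99, 79.99],
})
orc.write_table(table, "transactions.orc")

# Like Parquet, ORC lets a reader pull only the columns a query needs.
amounts = orc.read_table("transactions.orc", columns=["sale_amount"])
print(amounts)
```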
Conclusion
In this blog we’ve presented a helpful framework for evaluating the big data formats Avro, Parquet, and ORC, given an overview of how each format was developed, and summarized their respective strengths.