POV on Presto

BigData & Cloud Practice
Nov 26, 2020

Article by Timmanna Channal & Rahul Lalwani

Introduction

Presto (or PrestoDB) is an open-source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. It supports both non-relational sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata.

Presto Concepts

Server Types:

There are two types of Presto servers: coordinators and workers.

Coordinators: The Presto coordinator is the server that is responsible for parsing statements, planning queries, and managing Presto worker nodes. It is the “brain” of a Presto installation and is also the node to which a client connects to submit statements for execution. Coordinators communicate with workers and clients using a REST API.
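
Because the coordinator exposes a REST API, any client that speaks the Presto protocol can submit statements to it. As a minimal sketch, the snippet below uses the presto-python-client package to connect to a coordinator and submit a statement; the host, port, and user shown are placeholder values.

```python
# Minimal sketch: submitting a statement to the coordinator over its REST API,
# using the presto-python-client package. Host, port, and user are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host='localhost',   # the coordinator, not a worker
    port=8080,
    user='analyst',     # placeholder user name
    catalog='system',   # built-in catalog available on every Presto server
    schema='runtime',
)
cur = conn.cursor()

# The coordinator parses, plans, and schedules this statement;
# system.runtime.nodes lists the coordinator and workers it knows about.
cur.execute('SELECT node_id, http_uri, coordinator FROM system.runtime.nodes')
for node_id, http_uri, is_coordinator in cur.fetchall():
    print(node_id, http_uri, 'coordinator' if is_coordinator else 'worker')
```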

Workers: A Presto worker is a server that is responsible for executing tasks and processing data. Worker nodes fetch data from connectors and exchange intermediate data with each other. The coordinator is responsible for fetching results from the workers and returning the final results to the client. Workers communicate with other workers and Presto coordinators using a REST API.

Connectors: A connector adapts Presto to a data source such as Hive or a relational database. Every catalog is associated with a specific connector, and every catalog configuration file contains a mandatory connector.name property, which the catalog manager uses to create a connector for that catalog. It is possible to have more than one catalog use the same connector to access two different instances of the same kind of database. For example, if we have two Hive clusters, we can configure two catalogs in a single Presto cluster that both use the Hive connector, allowing us to query data from both Hive clusters (even within the same SQL query).
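
As a minimal sketch of that two-cluster setup, each catalog gets its own properties file under etc/catalog/, and both files point at the Hive connector. The file names and metastore hosts below are illustrative, and the exact connector.name value (hive-hadoop2 here) can vary with the Presto version.

```
# etc/catalog/hive_east.properties  (file name and hosts are illustrative)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-east.example.com:9083

# etc/catalog/hive_west.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-west.example.com:9083
```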

Catalog: A Presto catalog contains schemas and references a data source via a connector. When we run a SQL statement in Presto, we are running it against one or more catalogs. When addressing a table in Presto, the fully-qualified table name is always rooted in a catalog. For example, a fully-qualified table name of hive.test_data.test would refer to the test table in the test_data schema in the hive catalog. Catalogs are defined in properties files stored in the Presto configuration directory.
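
Continuing the connection sketch above, a fully-qualified name lets a statement address any configured catalog regardless of the session defaults; hive.test_data.test is the example name from this paragraph and assumes such a table actually exists.

```python
# Continuing the connection sketch above: a fully-qualified name
# <catalog>.<schema>.<table> addresses a table in any configured catalog,
# regardless of the session's default catalog and schema.
cur.execute('SELECT count(*) FROM hive.test_data.test')
print(cur.fetchone()[0])
```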

Schema: Schemas are a way to organize tables. Together, a catalog and schema define a set of tables that can be queried.

Table: A table is a set of unordered rows which are organized into named columns with types. This is the same as in any relational database. The mapping from source data to tables is defined by the connector.

Statements and Queries: A statement is simply the textual representation of a SQL statement. When a statement is parsed, Presto creates a query along with a query plan that is then distributed across a series of Presto workers.
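
A quick way to see the distinction is to ask the coordinator for the plan it builds from a piece of SQL text. EXPLAIN (TYPE DISTRIBUTED) is standard Presto syntax; the cursor and the hive.test_data.test table are carried over from the earlier sketches.

```python
# The string passed to execute() is the statement; the distributed plan that
# Presto derives from it (split into fragments for the workers) is the query side.
cur.execute('EXPLAIN (TYPE DISTRIBUTED) SELECT count(*) FROM hive.test_data.test')
for row in cur.fetchall():
    print(row[0])
```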

Presto Architecture

[Figure: Presto architecture]

The coordinator is made up of three components: the Parser, the Planner, and the Scheduler. Connectors provide a metadata API for the parser, a data-location API for the scheduler, and a data-stream API for the workers, which together allow queries to run over multiple data sources.

  • Parser: The Parser parses the queries submitted by clients and detects any syntax errors.
  • Planner: The Planner builds the execution plan for a query and passes it to the Scheduler for execution.
  • Scheduler: The Scheduler is responsible for launching tasks on worker nodes and keeping track of each worker node. It defines a (usually parallel) execution pipeline, assigns tasks to the worker nodes closest to the data associated with each task, and monitors task progress.

Who Uses Presto?

Presto was developed at Facebook, which runs it in production, as do Airbnb, Dropbox, Netflix, Uber, AWS (EMR and Amazon Athena), and other big names. Facebook uses Presto daily to run more than 30,000 queries that in total scan over a petabyte per day across several internal data stores, including its 300PB data warehouse.

Capabilities Of Presto:

  1. It allows querying data where it is stored; there is no need to move data into a separate analytics system.
  2. A single query can read data from multiple sources. This federation is Presto's main advantage: it is highly pluggable (see the sketch after this list).
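
As a sketch of that federation, the query below joins a Hive table with a MySQL table in one statement. The mysql catalog, the crm schema, and both table names are hypothetical; they assume a MySQL connector has been configured alongside the Hive catalog.

```python
# Hypothetical federated query: read and join data in place across two catalogs.
# The 'hive' and 'mysql' catalogs and all table names below are assumptions.
import prestodb

conn = prestodb.dbapi.connect(
    host='localhost', port=8080, user='analyst',  # placeholder coordinator
    catalog='hive', schema='test_data',
)
cur = conn.cursor()
cur.execute("""
    SELECT o.order_id, o.amount, c.customer_name
    FROM hive.test_data.orders AS o
    JOIN mysql.crm.customers AS c
      ON o.customer_id = c.customer_id
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```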
