Running Presto on Amazon Web Services

BigData & Cloud Practice
Dec 9, 2020

Article by Hakim Pocketwalla, Big Data & Cloud Developer

Introduction

Presto is an open source, distributed query engine for running fast, interactive, analytical queries against datasets of any size. Presto is SQL compliant and supports many data sources.

Presto is a distributed system that runs on a cluster of machines. A full installation includes a coordinator and multiple workers.
Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.

Presto Architecture

Presto Requirements

Presto has a few basic requirements that we need to meet:

  • Linux or Mac OS X
  • Java 8, 64-bit
  • Python 2.4+
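A quick way to check these prerequisites on a machine is sketched below. The helper name `is_supported_os` is made up for illustration; the script only reports what it finds and does not verify that Java is 64-bit.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for the operating systems listed above.
is_supported_os() {
  case "$1" in
    Linux|Darwin) return 0 ;;   # Darwin is what `uname -s` reports on Mac OS X
    *) return 1 ;;
  esac
}

if is_supported_os "$(uname -s)"; then
  echo "OS: $(uname -s) (supported)"
else
  echo "OS: $(uname -s) (not in the supported list)"
fi

# Print the Java version if present; `java -version` writes to stderr.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java: not found (Java 8, 64-bit is required)"
fi

# Print the Python version if present; Python 2 also prints it on stderr.
if command -v python >/dev/null 2>&1; then
  python --version 2>&1
else
  echo "python: not found (Python 2.4+ is required)"
fi
```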

Installing and using Presto

We can download Presto as a tarball package from its official website: https://prestosql.io/download.html
Extracting the tarball produces a single top-level directory.

We now need to create an "etc" directory inside the installation directory.
The "etc" directory will hold the following configuration:

  • Node Properties: environmental configuration specific to each node
  • JVM Config: command line options for the Java Virtual Machine
  • Config Properties: configuration for the Presto server
  • Catalog Properties: configuration for Connectors (data sources)

More details on these configurations can be found in the Presto deployment guide:
https://prestodb.io/docs/current/installation/deployment.html
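As a rough illustration, the "etc" directory for a single-node test setup could be created as follows. This is a minimal sketch, not a production configuration: the `PRESTO_HOME` variable, the node id, the data directory, and the memory setting are all placeholder assumptions that should be adapted using the deployment guide.

```shell
#!/bin/sh
# Assumption: PRESTO_HOME points at the extracted installation directory.
PRESTO_HOME="${PRESTO_HOME:-./presto-server}"
mkdir -p "$PRESTO_HOME/etc/catalog"

# Node Properties: environment name, unique node id, data directory.
cat > "$PRESTO_HOME/etc/node.properties" <<'EOF'
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
EOF

# JVM Config: one command line option per line.
cat > "$PRESTO_HOME/etc/jvm.config" <<'EOF'
-server
-Xmx16G
-XX:+UseG1GC
EOF

# Config Properties: this node acts as both coordinator and worker.
cat > "$PRESTO_HOME/etc/config.properties" <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
EOF

# Catalog Properties: the JMX connector needs no external data source.
cat > "$PRESTO_HOME/etc/catalog/jmx.properties" <<'EOF'
connector.name=jmx
EOF

ls -R "$PRESTO_HOME/etc"
```

Each file maps to one item in the list above; a multi-node cluster would use a different config.properties on the workers.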

Running Presto on Amazon Web Services

As mentioned earlier, Presto is a distributed system and therefore needs a cluster to run on.
Fortunately, Amazon Web Services offers Presto preinstalled on EMR clusters, with the basic configuration already done.
We do not need to create any of the configuration files mentioned in the previous section. AWS provides a complete working setup, configured across all nodes of the cluster.
At the time of writing, EMR version 5.30.1 ships Presto version 0.232 and EMR version 6.0.0 ships Presto version 0.230.

We can check our configurations at the path:
/etc/presto/conf/

The Presto installation directory can be found at:
/usr/lib/presto/

From here we can navigate into the bin folder and start the Presto CLI using the prepackaged executable provided there.

Note: On EMR, Presto is set up to run on port 8889 rather than the default port 8080.
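The port matters when connecting with the CLI. The sketch below assumes the CLI launcher is available as `presto-cli` on the PATH of the master node; if it is not, the executable in the bin folder mentioned above can be used instead. The wrapper name `run_presto_query` is hypothetical.

```shell
#!/bin/sh
# Port used by Presto on EMR, as noted above (the stock default is 8080).
PRESTO_PORT=8889

# Hypothetical wrapper: run one SQL statement through the Presto CLI.
# Assumes the `presto-cli` launcher is on the PATH of the master node.
run_presto_query() {
  presto-cli --server "localhost:${PRESTO_PORT}" --execute "$1"
}

# Only attempt the queries where the CLI is actually installed.
if command -v presto-cli >/dev/null 2>&1; then
  # List the coordinator and workers known to the cluster.
  run_presto_query "SELECT node_id, http_uri, coordinator, state FROM system.runtime.nodes"
  # List the catalogs (configured connectors).
  run_presto_query "SHOW CATALOGS"
else
  echo "presto-cli not found; run this on the EMR master node"
fi
```

The first query is a convenient way to confirm that the coordinator and all workers have registered with the cluster.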

Configuring external connectors with Presto on EMR

To configure external connectors that are supported by Presto, we need to create their respective configuration files at the following path:
/etc/presto/conf/catalog/

The naming convention for a properties file is:
"<supported connector name>.properties"
The file name without the .properties extension becomes the catalog name used in queries, and the file itself holds the additional configuration required for that particular connector.
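To see which catalogs are already configured on a node, we can list the properties files together with the connector each one names. This is a small sketch; the helper `catalog_name` and the `CATALOG_DIR` variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: derive the catalog name from a properties file path.
catalog_name() {
  basename "$1" .properties
}

# Assumption: the EMR catalog directory, unless overridden by the caller.
CATALOG_DIR="${CATALOG_DIR:-/etc/presto/conf/catalog}"

# Print each configured catalog with the connector it uses.
for f in "$CATALOG_DIR"/*.properties; do
  [ -f "$f" ] || continue   # directory missing or empty: nothing to list
  connector=$(sed -n 's/^connector\.name=//p' "$f")
  echo "$(catalog_name "$f") -> $connector"
done
```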

Suppose we want to configure MySQL as a data source for Presto. Let us look at how we can do it:

  • First we navigate to the directory: /etc/presto/conf/catalog/
  • Then we create a new file with the name: "mysql.properties"
  • Inside the "mysql.properties" file, we add the following configurations:

connector.name=mysql
connection-url=jdbc:mysql://host:port
connection-user=mysql_username
connection-password=mysql_password

  • We now need to copy this file to the same location on each of the worker nodes.
  • Once the file is copied, we need to stop and start the Presto service on each node of the cluster. We can do so using the commands:
    sudo systemctl stop presto-server
    sudo systemctl start presto-server
  • We can now start the Presto CLI and begin using MySQL with Presto.
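The steps above can be sketched as a pair of shell helpers. Everything named here is an assumption for illustration: the function names, the host names, the credentials, and the use of passwordless SSH and sudo on the worker nodes. 3306 is MySQL's default port.

```shell
#!/bin/sh
# Hypothetical helper: write a MySQL catalog file into the given directory.
# Arguments: directory, MySQL host, port, user, password (all placeholders).
write_mysql_catalog() {
  cat > "$1/mysql.properties" <<EOF
connector.name=mysql
connection-url=jdbc:mysql://$2:$3
connection-user=$4
connection-password=$5
EOF
}

# Hypothetical helper: copy the file to one worker and restart Presto there.
# Assumes passwordless SSH and sudo for the login user on that host.
push_catalog_to_worker() {
  scp "$1/mysql.properties" "$2:/tmp/mysql.properties"
  ssh "$2" "sudo cp /tmp/mysql.properties /etc/presto/conf/catalog/ \
    && sudo systemctl stop presto-server \
    && sudo systemctl start presto-server"
}

# Example invocation (host names and credentials are made up):
# write_mysql_catalog /etc/presto/conf/catalog db.example.com 3306 mysql_username mysql_password
# for host in worker-1 worker-2; do
#   push_catalog_to_worker /etc/presto/conf/catalog "$host"
# done
# presto-cli --server localhost:8889 --execute "SHOW SCHEMAS FROM mysql"
```

The Presto service on the node where the file was first created also needs the same stop and start. The final query is one way to confirm that the new catalog is visible.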

In a similar fashion, we can configure as many of the supported Presto connectors as necessary.
More information on the connectors can be found in the official Presto documentation:
https://prestodb.io/docs/current/index.html


Abzooba is an AI and Data company. The BD&C Practice is one of the fastest-growing groups at Abzooba, helping several Fortune 500 clients in their cognitive journey.