Establish a connection between Azure Data Lake Storage Gen2 and Azure Databricks (Python)

BigData & Cloud Practice
Dec 9, 2020


Article by Shruti Bhawsar, Big Data & Cloud Software Engineer

This article shows how to mount an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal and OAuth 2.0.

Azure Data Lake Storage Gen2 (also known as ADLS Gen2) is a next-generation solution for big data analytics. Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. It is designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, and it lets you manage massive amounts of data with ease. Databricks, on the other hand, is a unified data platform: an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring that data through machine learning models.

Mounting a data lake to Databricks can become confusing when you have to piece the procedure together from multiple sites, because it involves several configurations. This article brings everything together in one place to make it simple for you.

Prerequisites:

  1. Create a Data Lake Storage Gen2 storage account
  2. Create a Key Vault
  3. Create a Databricks workspace and a cluster

Once the above setup is ready, follow the steps below:

Step 1: Create an Azure AD application and service principal, and grant it the necessary permissions.

When you have code that needs to access or modify resources, you can create an identity for the app. This identity is known as a service principal. You can then assign the required permissions to the service principal. We will create a service principal for the Data Lake.

  1. In the Azure portal, under your subscription, search for App registrations and register an application by filling in the required information.
Fig: Search for App Registration
Fig — Register an application

2. After registering your application, you will see a window like the screenshot below. Copy the following IDs from this window into a notepad:

  • Application (client) ID
  • Directory (tenant) ID
Fig — Registered app window

3. On the same window, click Certificates & secrets on the left side. Then create a new client secret using the New client secret button. Fill in the necessary fields and click Add. Refer to the screenshot below.

Fig — Add a client secret

4. After adding the client secret, a client secret key (value) is generated. Copy it to a notepad.

Please note: you can't view this key again once you leave this window, so make sure to note it down.

Fig — Client Secret Key

5. Now, we will grant the necessary permissions to this application on the Data Lake account. Go to your Data Lake storage account and follow these steps:

  • Select Access control (IAM) on the left side.
  • Select Add a role assignment.
  • Select "Storage Blob Data Contributor" in the Role field.
  • Search for your registered application and select it.
  • Save the role assignment.
  • You can see the role under the Role assignments tab.
Fig — Add a role assignment
Fig — Click on add a Role Assignment
Fig — Add a role assignment

Step 2: Create a secret in Key Vault for the client secret key generated in the above step.

Storing the secret key in Key Vault is standard practice: it keeps the key from being exposed or misused and closes a security gap for the application.

  1. Go to your Key Vault and click the Secrets tab on the left side of the window.
  2. Click on Generate/Import.
  3. Provide a name for your secret.
  4. Copy the client secret key generated in the above step and paste it into the Value field here. Then click Create.
  5. The secret is created with the Enabled status.
Fig — KeyVault
Fig — Create a Secret for your Client Secret Key
Fig — Secret is created

6. Now, in the same window, select the Properties tab on the left side. In the Properties window, copy the DNS Name and Resource ID into a notepad. These will be used while creating the secret scope in Databricks.

Fig — KeyVault — Properties Window

Step 3: Create an Azure Key Vault-backed secret scope in the Databricks workspace.

A secret scope is a collection of secrets identified by a name. Sometimes accessing data requires that you authenticate to external data sources, for example through JDBC. Instead of entering your credentials directly into a notebook, use Databricks secrets to store your credentials and reference them in notebooks and jobs.

Through this secret scope, we will access the keys stored in Key Vault, so our keys stay protected and data access from Databricks remains secure.

Follow the steps below to create the secret scope:

  1. Launch your Databricks workspace.
  2. Go to https://<databricks-instance>#secrets/createScope. This URL is case sensitive; the "S" in createScope must be uppercase. See the screenshot below.
Fig — Secret Scope URL formation

3. You will see a window like the screenshot below. Fill in the required information:

  • Provide a scope name.
  • Paste the DNS Name and Resource ID copied in the previous step into the respective fields.

4. Then click the Create button.

Fig — Create Secret Scope
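
Before moving on to Step 4, you can sanity-check the scope from a notebook cell. This is a minimal sketch assuming the example names used later in this article (scope scopekeyvault, secret keyvaultsecretfordatabricks); note that dbutils.secrets redacts the secret value in notebook output.

# Confirm the Key Vault-backed scope is visible to the workspace
print(dbutils.secrets.listScopes())

# List the secret keys available inside the scope
print(dbutils.secrets.list("scopekeyvault"))

# Fetch the secret; its value is redacted when displayed, but it can be
# passed to configuration settings such as the mount configs in Step 4
client_secret = dbutils.secrets.get(scope="scopekeyvault", key="keyvaultsecretfordatabricks")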

Step 4: Mount the Azure Data Lake Storage Gen2 filesystem

  1. You should be ready with the following credentials:
  • Application Id : a1c3574e-ece9*******************
  • Directory Id : fd41ee0d-0d97-********************
  • Key Vault secret name (service-credential-key-name) : keyvaultsecretfordatabricks
  • Databricks secret scope name (scope-name) : scopekeyvault
  • Data Lake storage account name : datalakegen02storage
  • Data Lake file system name (the container name that you want to mount to the Databricks file system) : demo

2. To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside it, run the following commands in sequence:

  • Create a directory in the Databricks file system where you will mount your Data Lake container.
Syntax :
dbutils.fs.mkdirs("/mnt/<mount-name>")
Code :
dbutils.fs.mkdirs("/mnt/datalakemount")
  • Check whether the directory was created. Initially, the directory will be empty.
Syntax :
display(dbutils.fs.ls("/mnt/<mount-name>"))
Code :
display(dbutils.fs.ls("/mnt/datalakemount"))
  • Put your credentials into the following commands to configure the mount. See the screenshots below for reference.
Syntax :
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

dbutils.fs.mount(
    source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
  • Check whether the mount succeeded by listing the contents of the mount point. The listing should match the files in the Data Lake file system.
Syntax :
display(dbutils.fs.ls("/mnt/<mount-name>/"))

Now you can access the Data Lake file system just as you would access DBFS.
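
For instance, here is a quick read through the mount point; the file name demofile.csv is a hypothetical placeholder, assuming the container holds a CSV file:

# Read a CSV file from the mounted container into a Spark DataFrame
df = spark.read.csv("/mnt/datalakemount/demofile.csv", header=True, inferSchema=True)
display(df)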

  • Once your work is done, unmount the mount point as follows:
Syntax :
dbutils.fs.unmount("/mnt/<mount-name>")
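
You can also list the workspace's current mount points at any time, for example to confirm that /mnt/datalakemount is gone after unmounting:

# List all current mount points and their sources
for mount in dbutils.fs.mounts():
    print(mount.mountPoint, "->", mount.source)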

Please refer to the screenshots below:

Fig — Code Part 1
Fig — Code Part 2
Fig — Code Part 3
Fig — Code Part 4
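
Putting it all together, here is a consolidated sketch of Step 4 that uses the example values listed above. The application ID and directory ID are placeholders that you must replace with the values copied from the Azure portal.

# Credentials and names collected in the earlier steps (the IDs are placeholders)
application_id = "<application-id>"
directory_id = "<directory-id>"
scope_name = "scopekeyvault"
key_name = "keyvaultsecretfordatabricks"
storage_account = "datalakegen02storage"
file_system = "demo"                      # the ADLS Gen2 container to mount
mount_point = "/mnt/datalakemount"

# OAuth configuration backed by the Key Vault secret
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": application_id,
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope=scope_name, key=key_name),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + directory_id + "/oauth2/token"}

# Unmount first if the mount point already exists from an earlier run
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

# Mount the container and verify by listing its contents
dbutils.fs.mount(
    source = "abfss://" + file_system + "@" + storage_account + ".dfs.core.windows.net/",
    mount_point = mount_point,
    extra_configs = configs)
display(dbutils.fs.ls(mount_point))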


Hope this helped you!

Happy Learning :D

