Establish a connection between Azure Data Lake Storage Gen2 and Azure Databricks (Python)

BigData & Cloud Practice
Dec 9, 2020


Article by Shruti Bhawsar, Big Data & Cloud Software Engineer

This article shows how to mount an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal and OAuth 2.0.

Azure Data Lake Storage Gen2 (also known as ADLS Gen2) is a next-generation solution for big data analytics. Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. It is designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, and it lets you manage massive amounts of data with ease. Databricks, on the other hand, is a unified data platform: an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring that data through machine learning models.

Mounting a data lake to Databricks can become confusing when you have to piece the procedure together from multiple sites, because it involves several configurations. This article brings everything together in one place to make it simple for you.

Prerequisites:

  1. Create a Data Lake Storage Gen2 storage account
  2. Create a Key Vault
  3. Create a Databricks workspace and a cluster

Once the above setup is ready, follow the steps below:

Step 1: Create an Azure AD application and service principal, and grant it the necessary permissions.

When you have code that needs to access or modify resources, you can create an identity for the app. This identity is known as a service principal. You can then assign the required permissions to the service principal. We will create a service principal for the Data Lake.

  1. In the Azure portal, under your subscription, search for App registrations and register an application by filling in the required information.
Fig: Search for App Registration
Fig — Register an application

2. After registering your application, you will see a window like the screenshot below. Copy the following IDs from this window into a notepad:

  • Application (client) ID
  • Directory (tenant) ID
Fig — Registered app window

3. On the same window, click Certificates & secrets on the left side. Then create a new client secret using the New client secret button. Fill in the necessary fields and click Add. Refer to the screenshot below.

Fig — Add a client secret

4. After adding the client secret, a client secret key (value) is generated. Copy it to a notepad.

Please note: you can't view this key again once you leave this window, so make sure to note it down.

Fig — Client Secret Key

5. Now, we will grant the necessary permissions to this application on the Data Lake account. Go to your Data Lake storage account and follow these steps:

  • Select Access control (IAM) on the left side.
  • Select Add a role assignment.
  • Select "Storage Blob Data Contributor" in the Role field.
  • Search for your registered application and select it.
  • Save the role assignment.
  • You can see the role under the Role assignments tab.
Fig — Add a role assignment
Fig — Click on add a Role Assignment
Fig — Add a role assignment

Step 2: Create a secret in Key Vault for the client secret key generated in the above step.

Storing the secret key in Key Vault is standard practice: it keeps the key from being exposed or misused and closes a security gap for the application.

  1. Go to your Key Vault and click the Secrets tab on the left side of the window.
  2. Click on Generate/Import.
  3. Provide a name for your secret.
  4. Copy the client secret key generated in the above step and paste it into the Value field here. Then click Create.
  5. The secret is created with the Enabled status.
Fig — KeyVault
Fig — Create a Secret for your Client Secret Key
Fig — Secret is created

6. Now, in the same window, select the Properties tab on the left side. In the Properties window, copy the DNS Name and Resource ID into a notepad. These will be used while creating the secret scope in Databricks.

Fig — KeyVault — Properties Window

Step 3: Create an Azure Key Vault-backed secret scope in the Databricks workspace.

A secret scope is a collection of secrets identified by a name. Sometimes accessing data requires that you authenticate to external data sources, for example through JDBC. Instead of entering your credentials directly into a notebook, use Databricks secrets to store your credentials and reference them in notebooks and jobs.

Through this secret scope, we will access the keys stored in Key Vault, so our keys stay protected and data access from Databricks remains secure.

Follow the steps below to create the secret scope:

  1. Launch your Databricks workspace.
  2. Go to https://<databricks-instance>#secrets/createScope. This URL is case sensitive; the "S" in createScope must be uppercase. See the screenshot below.
Fig — Secret Scope URL formation

3. You will see a window like the screenshot below. Fill in the required information:

  • Provide a scope name.
  • Paste the DNS Name and Resource ID copied in the previous step into the respective fields.

4. Then click the Create button.

Fig — Create Secret Scope
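
Before moving on to Step 4, you can sanity-check the scope from a notebook cell. This is a minimal sketch assuming the example names used later in this article (scope scopekeyvault, secret keyvaultsecretfordatabricks); note that dbutils.secrets redacts the secret value in notebook output.

# Confirm the Key Vault-backed scope is visible to the workspace
print(dbutils.secrets.listScopes())

# List the secret keys available inside the scope
print(dbutils.secrets.list("scopekeyvault"))

# Fetch the secret; its value is redacted when displayed, but it can be
# passed to configuration settings such as the mount configs in Step 4
client_secret = dbutils.secrets.get(scope="scopekeyvault", key="keyvaultsecretfordatabricks")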

Step 4: Mount the Azure Data Lake Storage Gen2 filesystem

  1. You should be ready with the following credentials:
  • Application Id : a1c3574e-ece9*******************
  • Directory Id : fd41ee0d-0d97-********************
  • Key Vault secret name (service-credential-key-name) : keyvaultsecretfordatabricks
  • Databricks secret scope name (scope-name) : scopekeyvault
  • Data Lake storage account name : datalakegen02storage
  • Data Lake file system name (the container name that you want to mount to the Databricks file system) : demo

2. To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside it, run the following commands in sequence:

  • Create a directory in the Databricks file system where you will mount your Data Lake container.
Syntax :
dbutils.fs.mkdirs("/mnt/<mount-name>")
Code :
dbutils.fs.mkdirs("/mnt/datalakemount")
  • Check whether the directory was created. Initially, the directory will be empty.
Syntax :
display(dbutils.fs.ls("/mnt/<mount-name>"))
Code :
display(dbutils.fs.ls("/mnt/datalakemount"))
  • Put your credentials into the following commands to configure the mount. See the screenshots below for reference.
Syntax :
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

dbutils.fs.mount(
    source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
  • Check whether the mount succeeded by listing the contents of the mount point. The listing should match the files in the Data Lake file system.
Syntax :
display(dbutils.fs.ls("/mnt/<mount-name>/"))

Now you can access the Data Lake file system just as you would access DBFS.
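
For instance, here is a quick read through the mount point; the file name demofile.csv is a hypothetical placeholder, assuming the container holds a CSV file:

# Read a CSV file from the mounted container into a Spark DataFrame
df = spark.read.csv("/mnt/datalakemount/demofile.csv", header=True, inferSchema=True)
display(df)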

  • Once your work is done, unmount the mount point as follows:
Syntax :
dbutils.fs.unmount("/mnt/<mount-name>")
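
You can also list the workspace's current mount points at any time, for example to confirm that /mnt/datalakemount is gone after unmounting:

# List all current mount points and their sources
for mount in dbutils.fs.mounts():
    print(mount.mountPoint, "->", mount.source)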

Please refer to the screenshots below:

Fig — Code Part 1
Fig — Code Part 2
Fig — Code Part 3
Fig — Code Part 4
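
Putting it all together, here is a consolidated sketch of Step 4 that uses the example values listed above. The application ID and directory ID are placeholders that you must replace with the values copied from the Azure portal.

# Credentials and names collected in the earlier steps (the IDs are placeholders)
application_id = "<application-id>"
directory_id = "<directory-id>"
scope_name = "scopekeyvault"
key_name = "keyvaultsecretfordatabricks"
storage_account = "datalakegen02storage"
file_system = "demo"                      # the ADLS Gen2 container to mount
mount_point = "/mnt/datalakemount"

# OAuth configuration backed by the Key Vault secret
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": application_id,
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope=scope_name, key=key_name),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + directory_id + "/oauth2/token"}

# Unmount first if the mount point already exists from an earlier run
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

# Mount the container and verify by listing its contents
dbutils.fs.mount(
    source = "abfss://" + file_system + "@" + storage_account + ".dfs.core.windows.net/",
    mount_point = mount_point,
    extra_configs = configs)
display(dbutils.fs.ls(mount_point))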


Hope this helped you!

Happy Learning :D

