Building a data mining solution

BigData & Cloud Practice
Jan 12, 2021

Article by Narender Kumar, Big Data & Cloud Lead Developer

Overview

Let’s assume we have a huge amount of sales data in our data lake. Finding one specific slice of it can be very difficult, for example downloading the records where milk sales exceeded 500 litres in January 2019.

In this blog, we will learn how to build a scalable solution with a simple GUI that can dig into the huge volume of data stored in the data lake and return just the specific data we need.

We can develop the GUI as a webpage that takes inputs such as product type, sales limit, date, etc., and returns a URL that we can click to download the matching data.

We implemented this solution using Azure services, but similar services from any other cloud provider or open-source stack can be used as well.

The Architecture

Process

1. Metadata

First, we need to create metadata about the data we have. We need to gather the information that can be used to map the items in the GUI. We can create a few Hive tables to hold this metadata.
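As a rough illustration, a metadata row might describe one dataset in the data lake using the fields the GUI filters on. The sketch below is only an assumption about what such a row could look like: the class name, the field names (product_type, sale_month, total_quantity_ltr, lake_path), the sample values, and the reading of "sales higher than 500 ltr" as a monthly total are all hypothetical and are not the schema we actually used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalesMetadata:
    """One metadata row describing a dataset in the data lake (hypothetical schema)."""
    dataset_id: str
    product_type: str
    sale_month: date           # first day of the month the dataset covers
    total_quantity_ltr: float  # total quantity sold in that month, in litres
    lake_path: str             # where the underlying files live in the data lake


def matches(row: SalesMetadata, product_type: str, min_quantity: float,
            year: int, month: int) -> bool:
    """Return True if the row satisfies the search criteria entered in the GUI."""
    return (
        row.product_type == product_type
        and row.total_quantity_ltr > min_quantity
        and row.sale_month.year == year
        and row.sale_month.month == month
    )


# Example rows; the values and paths are made up for illustration.
rows = [
    SalesMetadata("d1", "milk", date(2019, 1, 1), 620.0, "/sales/milk/2019/01"),
    SalesMetadata("d2", "milk", date(2019, 1, 1), 410.0, "/sales/milk/2019/01-b"),
    SalesMetadata("d3", "butter", date(2019, 1, 1), 900.0, "/sales/butter/2019/01"),
]
hits = [r.lake_path for r in rows if matches(r, "milk", 500, 2019, 1)]
```

Keeping the data lake path in each row is what lets a later step locate the files to copy once a user has picked a result.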

2. Indexes of metadata

We need to index the metadata so that we can query it quickly and present the results in the GUI.

We can push the metadata to Azure SQL Database and create indexes on top of it using Azure Search. Azure Search provides a REST API that web services can use to query the metadata.
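To give a feel for the REST API, here is a minimal sketch that builds a query request against an Azure Search index without sending it. The service name, index name, key placeholder, and field names in the filter are assumptions made up for this example; the URL shape, the api-key header, and the search, filter, and top body fields follow the Search Documents operation as we understand it.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"  # a published Azure Search REST API version


def build_search_request(service: str, index: str, api_key: str,
                         filter_expr: str, top: int = 50) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Search Documents operation."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/docs/search?api-version={API_VERSION}")
    # "*" matches every document; the OData filter then narrows the results.
    body = {"search": "*", "filter": filter_expr, "top": top}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Hypothetical service, index, key, and field names.
request = build_search_request(
    "my-search-service", "sales-metadata", "<query-key>",
    "product_type eq 'milk' and total_quantity_ltr gt 500",
)
# Sending it would be urllib.request.urlopen(request), followed by json.load() on the response.
```

Building the request separately from sending it keeps the query logic easy to test without a live search service.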

3. GUI and web jobs

We can create a webpage that takes the user’s search criteria and provides downloadable links in return. The web service takes the user inputs and queries the indexes in Azure Search using the REST API. Based on the results from Azure Search, we can run a job that copies the relevant data from the data lake to Azure Blob storage. Azure Active Directory can be used for authentication.
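The flow described above can be sketched roughly as follows: turn the user inputs into a filter, query the index, export each match, and return the links. This is only an outline under assumptions. The function names, the field names in the filter, and the example blob URL are hypothetical, and the search call and export job are passed in as plain callables so that stand-ins can replace the real Azure services.

```python
from typing import Callable, Dict, List


def build_filter(product_type: str, min_quantity: float, year: int, month: int) -> str:
    """Turn GUI inputs into an OData filter string for the metadata index."""
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    safe_product = product_type.replace("'", "''")  # OData escapes ' by doubling it
    return (
        f"product_type eq '{safe_product}'"
        f" and total_quantity_ltr gt {min_quantity}"
        f" and sale_month ge {year:04d}-{month:02d}-01T00:00:00Z"
        f" and sale_month lt {next_year:04d}-{next_month:02d}-01T00:00:00Z"
    )


def handle_search(criteria: Dict, search: Callable[[str], List[Dict]],
                  export_to_blob: Callable[[str], str]) -> List[str]:
    """Query the index, export each matching dataset, and return download links."""
    filter_expr = build_filter(**criteria)
    return [export_to_blob(doc["lake_path"]) for doc in search(filter_expr)]


# Stand-ins for the real Azure Search call and the export job.
def fake_search(filter_expr: str) -> List[Dict]:
    return [{"lake_path": "/sales/milk/2019/01"}]


def fake_export(lake_path: str) -> str:
    return "https://example.blob.core.windows.net/exports" + lake_path + ".csv"


links = handle_search(
    {"product_type": "milk", "min_quantity": 500, "year": 2019, "month": 1},
    fake_search, fake_export,
)
```

In a real deployment the export step would likely run as a background job, since copying large datasets can take longer than a web request should wait.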

4. Usage Insights

We can use Azure Application Insights and Log Analytics to build application usage dashboards that show how our application is being used and which data is searched for most frequently.
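The kind of aggregation behind a "most frequent searches" dashboard can be illustrated in plain Python. In practice this would be a query over the telemetry collected by Application Insights rather than application code; the event shape and field names below are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_searches(events: Iterable[Dict], n: int = 3) -> List[Tuple[str, int]]:
    """Count search events per product type and return the n most frequent."""
    counts = Counter(e["product_type"] for e in events if e["name"] == "search")
    return counts.most_common(n)


# Hypothetical events as the web application might log them.
events = [
    {"name": "search", "user": "u1", "product_type": "milk"},
    {"name": "search", "user": "u2", "product_type": "butter"},
    {"name": "download", "user": "u1", "product_type": "milk"},
    {"name": "search", "user": "u1", "product_type": "milk"},
]
result = top_searches(events)
```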

Technologies used

1. Azure Data Lake

Azure Data Lake is a scalable data store used for analytics applications. We used it as an external store for Hive on HDInsight.

2. Azure SQL Database

Azure SQL Database is a managed cloud database. We used it to store the metadata useful for search criteria by the user. We also used it as a backend database for the web application.

3. Azure Search

Azure Search provides indexing and querying capabilities for data uploaded to Microsoft servers. We used it to build indexes of metadata stored in the Azure SQL Database. These indexes were used by the UI to query metadata with minimal latency.

4. Azure Web Apps

Azure Web Apps is used to create and deploy mission-critical web applications that can scale with the business.

We used it to deploy the front end of the web application.

5. Azure AD

Azure Active Directory (Azure AD) is Microsoft’s enterprise cloud-based identity and access management (IAM) solution. We used it for user authentication.

6. Azure Blob

Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data. We used it to store the data each user requested and to provide a download link for it.

7. Azure Application Insights

Azure Application Insights is an Application Performance Management (APM) service. We used it to get insights into application usage, such as which searches are performed most frequently and which users use the application most.

8. Azure Log Analytics

We used Azure Log Analytics to collect and visualize the logs of the web application.

We used Azure services for this requirement, but comparable services are available from other cloud providers as well.

Conclusion

With this architecture, we got a solution with a GUI that any user can operate without a technical background. In our case, it has greatly helped our data scientists improve their efficiency and save time. They can now spend their valuable time analyzing the specific data they need instead of digging through the data lake to find it.


Abzooba is an AI and Data Company. BD&C Practice is one of the fastest growing groups in Abzooba, helping several Fortune 500 clients in their cognitive journey.