Building a data mining solution
Article by Narender Kumar, Big Data & Cloud Lead Developer
Let’s assume we have a huge amount of sales data in our data lake. Finding a specific slice of that data can be very difficult, for example downloading the data where milk sales were higher than 500 litres in January 2019.
In this blog, we will learn how to build a scalable solution with a simple GUI that can be used to dig into the huge data stored in the data lake and return the specific data we need.
We can develop the GUI as a webpage that takes inputs such as product type, sales limit, and date, and returns a URL that we can click to download the matching data.
We implemented this solution using Azure services, but similar services from any other cloud provider, or open-source equivalents, can be used as well.
1. Metadata
First, we need to create metadata about the data we have. We need to gather the information that can be used to map the items in the GUI, such as product type, date, and sales figures. We can create a few Hive tables to hold this metadata.
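As a rough illustration of what such metadata could look like, the sketch below aggregates raw sales records into one row per product and month. The field names (product, date, quantity, path) and the sample records are assumptions made for this example; the real Hive tables may be organized differently.

```python
from collections import defaultdict

# Hypothetical raw sales records; field names and values are illustrative only.
sales = [
    {"product": "milk", "date": "2019-01-05", "quantity": 300, "path": "sales/2019/01/milk.csv"},
    {"product": "milk", "date": "2019-01-20", "quantity": 250, "path": "sales/2019/01/milk.csv"},
    {"product": "bread", "date": "2019-01-07", "quantity": 40, "path": "sales/2019/01/bread.csv"},
    {"product": "milk", "date": "2019-02-01", "quantity": 100, "path": "sales/2019/02/milk.csv"},
]


def build_metadata(records):
    """Aggregate records into one metadata row per (product, month)."""
    totals = defaultdict(lambda: {"total_quantity": 0, "paths": set()})
    for rec in records:
        month = rec["date"][:7]  # keep the "YYYY-MM" part of the date
        entry = totals[(rec["product"], month)]
        entry["total_quantity"] += rec["quantity"]
        entry["paths"].add(rec["path"])  # remember where the raw data lives
    return [
        {
            "product": product,
            "month": month,
            "total_quantity": entry["total_quantity"],
            "paths": sorted(entry["paths"]),
        }
        for (product, month), entry in sorted(totals.items())
    ]


metadata = build_metadata(sales)
```

Each metadata row is small and points back to the data lake paths it summarizes, so the GUI can search the metadata without touching the raw data.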
2. Indexes of metadata
We need to create indexes on the metadata so that we can query it quickly and present the results in the GUI.
We can push the metadata to an Azure SQL Database and create indexes on top of it using Azure Search. Azure Search provides a REST API that web services can use to query the metadata.
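To give an idea of what a metadata query against the Azure Search REST API might look like, here is a minimal sketch that only builds the request for the "Search Documents" endpoint and does not send it. The service name, index name, and field names (product, month, total_quantity, paths) are assumptions for this example, and the fields would need to be marked as filterable in the index.

```python
import json
from urllib.parse import quote

API_VERSION = "2020-06-30"  # one generally available version of the REST API


def odata_string(value):
    """Quote a string literal for an OData filter (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def build_search_request(service, index, api_key, product, month, min_quantity):
    """Return (url, headers, body) for a POST to the Search Documents endpoint."""
    if not isinstance(min_quantity, (int, float)):
        raise TypeError("min_quantity must be a number")
    odata_filter = (
        f"product eq {odata_string(product)} "
        f"and month eq {odata_string(month)} "
        f"and total_quantity gt {min_quantity}"
    )
    url = (
        f"https://{service}.search.windows.net/indexes/{quote(index)}"
        f"/docs/search?api-version={API_VERSION}"
    )
    headers = {"Content-Type": "application/json", "api-key": api_key}
    body = {
        "search": "*",  # match all documents, then narrow down with the filter
        "filter": odata_filter,
        "select": "product,month,total_quantity,paths",
        "top": 50,
    }
    return url, headers, json.dumps(body)


# Hypothetical service, index, and key, used only to show the call shape.
url, headers, body = build_search_request(
    "my-service", "sales-metadata", "<query-key>", "milk", "2019-01", 500
)
```

The web service would send this request with any HTTP client and read the matching metadata rows from the "value" array of the JSON response.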
3. GUI and web jobs
We can create a webpage that takes the user’s search criteria and returns downloadable links. The web service takes the user inputs and queries the indexes in Azure Search using the REST API. Based on the results from Azure Search, we can run a job that copies the relevant data from the data lake to an Azure blob. Azure Active Directory can be used for authentication.
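The job step described above can be sketched as follows. This is only an outline under stated assumptions: the search results are taken to carry a list of data lake paths in a hypothetical "paths" field, and the actual copy to Azure Blob storage is passed in as a function, so the stand-in used here performs no real storage calls.

```python
from typing import Callable, Iterable, List


def run_export_job(matches: Iterable[dict], copy_to_blob: Callable[[str], str]) -> List[str]:
    """Copy every data lake path named in the search results to blob storage.

    copy_to_blob receives a data lake path and returns a download URL.
    Each distinct path is copied once, in the order it first appears.
    """
    links = []
    seen = set()
    for match in matches:
        for path in match["paths"]:
            if path in seen:
                continue  # the same file can appear in several results
            seen.add(path)
            links.append(copy_to_blob(path))
    return links


def fake_copy_to_blob(path: str) -> str:
    """Stand-in for the real copy step; the account and container are made up."""
    return "https://example.blob.core.windows.net/exports/" + path.replace("/", "_")


links = run_export_job(
    [
        {"paths": ["sales/2019/01/milk.csv", "sales/2019/01/bread.csv"]},
        {"paths": ["sales/2019/01/milk.csv"]},
    ],
    fake_copy_to_blob,
)
```

Passing the copy step in as a function keeps the job logic independent of the storage client, which also makes it easy to test without cloud access.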
4. Usage Insights
We can use Azure Application Insights and Azure Log Analytics to build application usage dashboards that show how our application is being used and which data is searched for most frequently.
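The kind of aggregation behind a "most frequent searches" dashboard can be illustrated with a few lines of Python. The event records and their fields below are hypothetical; in practice this counting would be done by the query and dashboard features of the monitoring services themselves.

```python
from collections import Counter

# Hypothetical search events as they might be logged by the web application.
events = [
    {"user": "asha", "product": "milk", "month": "2019-01"},
    {"user": "ravi", "product": "milk", "month": "2019-01"},
    {"user": "asha", "product": "bread", "month": "2019-01"},
    {"user": "asha", "product": "milk", "month": "2019-01"},
]


def top_searches(search_events, n=3):
    """Return the n most frequent (product, month) search criteria with counts."""
    counts = Counter((e["product"], e["month"]) for e in search_events)
    return counts.most_common(n)


def top_users(search_events, n=3):
    """Return the n users who ran the most searches, with counts."""
    return Counter(e["user"] for e in search_events).most_common(n)


most_searched = top_searches(events, 1)
most_active = top_users(events, 1)
```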
1. Azure Data Lake
Azure Data Lake is a scalable data store used for analytics applications. We used it as an external store for Hive on HDInsight.
2. Azure SQL Database
Azure SQL Database is a managed cloud database. We used it to store the metadata that users search on. We also used it as the backend database for the web application.
3. Azure Search
Azure Search provides indexing and querying capabilities for data uploaded to Microsoft servers. We used it to build indexes of metadata stored in the Azure SQL Database. These indexes were used by the UI to query metadata with minimal latency.
4. Azure Web Apps
Azure Web Apps is used to create and deploy mission-critical web applications that can scale with the business.
We used it to deploy the front end of the web application.
5. Azure AD
Azure Active Directory (Azure AD) is Microsoft’s enterprise cloud-based identity and access management (IAM) solution. We used it for user authentication.
6. Azure Blob
Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data. We used Blob storage to hold the data each user requested and to provide a download link for it.
7. Azure Application Insights
Azure Application Insights is an Application Performance Management (APM) service. We used it to get insights into application usage, such as which searches are performed most frequently and which users use the application the most.
8. Azure Log Analytics
We used Azure Log Analytics to collect and visualize the logs of the web application.
We used Azure services for this requirement, but similar services are also available from other cloud providers.
With this architecture, we got a solution with a GUI that can be used by any user, even without a technical background. In our case, this solution has greatly helped our data scientists improve their efficiency and save time. They can now spend their valuable time analyzing the specific data they need instead of digging through the data lake to find it.