In this article, we will explore how to develop U-SQL jobs on Azure Data Lake Analytics to process data stored in Data Lake Storage. We will learn how to read data from files, apply logic using U-SQL, and write the output back to the storage layer.
Azure Data Lake Analytics Jobs
Azure Data Lake Storage can host data volumes ranging from modest to big-data scale. To process this data, we create jobs in the Azure Data Lake Analytics service. These jobs read data from the storage account, apply U-SQL logic, and write the output back to the storage layer.
Let’s start by creating a job that reads data from a file stored in the data lake storage account, processes it, and writes the output to a new file.
Step 1: Prepare the Source Data
Before creating the job, we need the source data in a file stored in the data lake storage account. The file can have any schema and any volume of data. For example, we can use a CSV file containing a list of customers and their attributes.
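As a minimal illustration, assume a hypothetical file named customers.csv with the layout below; the column names and values are purely illustrative and not taken from any real dataset.

    CustomerId,Name,Gender,City
    1,Anna,F,Seattle
    2,Brian,M,Denver
    3,Carla,F,Austin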
Step 2: Create a U-SQL Job
Navigate to the Azure Data Lake Analytics account and click on the “New Job” button. This will open the job editor screen where we can write our U-SQL script.
In the job script, we first use the EXTRACT statement to read the data from the source file, specifying the column names and data types based on the file's schema. Note that SQL Server data types do not apply here; U-SQL uses C# data types such as string, int, DateTime, and float.
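A minimal sketch of the EXTRACT step, assuming the hypothetical customers.csv file and an input folder path of /input (both are assumptions for this example):

    // Read the source CSV from the data lake store.
    // The path and the column schema are assumptions for this illustration.
    @customers =
        EXTRACT CustomerId int,
                Name       string,
                Gender     string,
                City       string
        FROM "/input/customers.csv"
        USING Extractors.Csv(skipFirstNRows: 1);   // skip the header row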
Next, we can apply transformations or filters to the data using U-SQL expressions. For example, we can keep only the female customers targeted by a specific marketing campaign.
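Continuing the sketch, a simple filter over the rowset extracted above (the Gender column and its values are assumptions from the illustrative file):

    // Keep only female customers.
    // U-SQL expressions use C# syntax, so string comparison uses ==.
    @filtered =
        SELECT CustomerId,
               Name,
               City
        FROM @customers
        WHERE Gender == "F";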
Finally, we write the output to a CSV file in the data lake storage account using the built-in CSV outputter.
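The closing OUTPUT step of the sketch, writing the filtered rowset to an assumed output path as a CSV file with a header row:

    // Write the result back to the data lake store as CSV.
    // The output path is an assumption for this illustration.
    OUTPUT @filtered
    TO "/output/female_customers.csv"
    USING Outputters.Csv(outputHeader: true);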
Step 3: Submit and Monitor the Job
Once the script is ready, we can submit the job for execution. Before submitting, we can adjust the AU (Analytics Units) allocation based on the expected data volume and processing requirements. A higher AU allocation can process the data faster, but it also increases the cost of the job.
During the job execution, we can monitor the job details, including the status, time spent, and estimated cost. We can also view the job graph to visualize the steps performed in the job.
Performance Analysis
After the job completes, we can analyze its performance using the AU analysis and diagnostics features provided by Azure Data Lake Analytics.
The AU analysis tab provides insights into the AU allocation and suggests options for optimizing cost and performance. For example, it may suggest using the “Fast” option to achieve an 80% cost reduction.
The diagnostics tab allows us to dive deeper into the job metrics and identify any issues or areas for improvement. It provides information on the efficiency percentage, over-allocated AUs, and other performance-related details.
Conclusion
Azure Data Lake Analytics is a powerful service for processing data stored in Data Lake Storage. With U-SQL, we can develop jobs to read, process, and write data efficiently. By analyzing the job metrics and fine-tuning the job configuration, we can strike the right balance between performance and cost.
By following the steps outlined in this article, you can start leveraging Azure Data Lake Analytics to process your data at scale.