Azure Data Factory (ADF) is a powerful tool that helps Azure data engineers and organizations meet their data movement and transformation needs. Whether you are new to ADF or looking to deepen your knowledge, this guide provides the insights you need to work with ADF effectively.
Building Blocks of Azure Data Factory
ADF consists of several key components or building blocks:
- Pipelines: Pipelines define the workflow and orchestration of data movement and transformation activities.
- Activities: Activities represent the individual tasks within a pipeline, such as copying data from one source to another or transforming data.
- Datasets: Datasets define the structure and location of the data used in activities.
- Linked Services: Linked Services establish connections to external data sources or compute services.
- Data Flows: Data Flows provide a visual interface for building data transformation logic without writing code.
- Integration Runtimes: Integration Runtimes are the compute infrastructure used by ADF for data integration and transformations.
If you are new to ADF, you can follow this link to learn how to create your first Data Factory pipeline.
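To make the relationship between these building blocks concrete, here is a minimal sketch of a pipeline definition, written as a Python dictionary that mirrors the JSON ADF stores behind the authoring UI. The pipeline, activity, and dataset names are hypothetical placeholders, not objects from this article.

```python
import json

# Sketch of a minimal pipeline definition, mirroring ADF's JSON format.
# All names below are hypothetical placeholders.
pipeline = {
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",   # an Activity inside the Pipeline
                "type": "Copy",
                "inputs": [                    # source Dataset (backed by a Linked Service)
                    {"referenceName": "BlobInputDataset", "type": "DatasetReference"}
                ],
                "outputs": [                   # sink Dataset (backed by a Linked Service)
                    {"referenceName": "SqlOutputDataset", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "AzureSqlSink"}
                }
            }
        ]
    }
}

print(json.dumps(pipeline, indent=2))
```

The pipeline orchestrates the activity, the activity references datasets, each dataset points at a linked service, and the whole thing executes on an integration runtime, which is how the building blocks fit together.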
Activities in ADF Data Pipeline
When working with ADF, you will encounter various activities in your data pipelines. For example, you may need to move data from Azure Blob Storage to Azure SQL Database. In this scenario, you would create the Linked Services, Datasets, and Activities involved, such as the Copy activity or the Get Metadata activity. It is important to understand how to create and configure these objects to achieve the desired data movement or transformation.
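As a sketch of what those supporting objects look like, the dictionaries below mirror the JSON for a Blob Storage linked service and the two datasets a Copy activity would reference in this Blob-to-SQL scenario. The connection string, container, file, and table names are hypothetical placeholders.

```python
# Hypothetical linked service and dataset definitions for a Blob -> Azure SQL copy,
# mirroring the JSON ADF generates. Replace placeholder names and secrets with your own.
blob_linked_service = {
    "name": "BlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "<storage-connection-string>"  # in practice, keep this in Key Vault (covered later)
        }
    }
}

csv_dataset = {
    "name": "BlobInputDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {"type": "AzureBlobStorageLocation", "container": "input", "fileName": "sales.csv"},
            "columnDelimiter": ",",
            "firstRowAsHeader": True
        }
    }
}

sql_dataset = {
    "name": "SqlOutputDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {"referenceName": "AzureSqlLinkedService", "type": "LinkedServiceReference"},
        "typeProperties": {"schema": "dbo", "table": "Sales"}
    }
}
```

The Copy activity from the earlier sketch simply references these datasets by name in its inputs and outputs.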
Integration Runtime in Azure Data Factory
The Integration Runtime is a crucial component of ADF that provides the compute infrastructure for data integration and transformations. There are three types of Integration Runtimes:
- Azure Integration Runtime: Connects to data stores and compute services that have publicly accessible endpoints.
- Self-hosted Integration Runtime: Connects to data stores and compute services inside private networks, such as on-premises sources behind a firewall; you install and manage it on your own machines.
- Azure-SSIS Integration Runtime: Lets you lift and shift existing SSIS workloads by natively executing SSIS packages, and can be scaled up or out based on workload requirements.
To learn more about SSIS Integration Runtime, you can refer to this link.
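As an illustration of how a non-default runtime is wired up, here is a minimal sketch of a self-hosted integration runtime definition and a linked service that routes its connections through it via connectVia. The runtime name, linked service name, and connection string are hypothetical placeholders.

```python
# Hypothetical self-hosted integration runtime definition (JSON mirrored as a Python dict).
self_hosted_ir = {
    "name": "OnPremisesIR",
    "properties": {
        "type": "SelfHosted",
        "description": "Runs on a VM inside the corporate network to reach on-premises SQL Server"
    }
}

# A linked service opts in to that runtime through its connectVia property.
on_prem_sql_linked_service = {
    "name": "OnPremSqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "connectVia": {"referenceName": "OnPremisesIR", "type": "IntegrationRuntimeReference"},
        "typeProperties": {"connectionString": "<on-premises-sql-connection-string>"}
    }
}
```

If a linked service has no connectVia reference, ADF falls back to the default Azure Integration Runtime.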
Custom Activities in Azure Data Factory
Azure Data Factory allows you to execute custom code logic using Custom Activities. These activities run your code on an Azure Batch pool of virtual machines; for example, you can execute Python scripts or C# programs for data transformation or movement. For lightweight workloads, Custom Activities can be a cost-effective alternative to Azure Databricks. To see an example of how an Azure Custom Activity is used, check out this link.
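As a sketch, a Custom activity definition might look like the dictionary below: it points at a Batch account through a linked service and tells the Batch node which command to run. The linked service names, folder path, and script name are hypothetical.

```python
# Hypothetical Custom activity definition (JSON mirrored as a Python dict).
# It assumes an Azure Batch linked service and a storage linked service that
# holds the script to execute; all names here are placeholders.
custom_activity = {
    "name": "RunPythonTransform",
    "type": "Custom",
    "linkedServiceName": {                  # the Azure Batch pool that executes the code
        "referenceName": "AzureBatchLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "command": "python transform.py",   # command executed on the Batch node
        "resourceLinkedService": {          # storage account containing the script
            "referenceName": "BlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "folderPath": "scripts/customactivity"
    }
}
```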
Execute Pipeline Activity
The Execute Pipeline activity enables one pipeline to invoke another pipeline within ADF or Synapse. It is important to understand how to pass parameters from the parent (calling) pipeline to the invoked (child) pipeline so that your data workflows run seamlessly.
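A sketch of what that looks like: the Execute Pipeline activity below invokes a hypothetical child pipeline and forwards one of the parent's parameters to it.

```python
# Hypothetical Execute Pipeline activity (JSON mirrored as a Python dict).
# "ChildPipeline" and the "fileName" parameter are placeholder names.
execute_pipeline_activity = {
    "name": "InvokeChildPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {"referenceName": "ChildPipeline", "type": "PipelineReference"},
        "waitOnCompletion": True,   # block until the child pipeline finishes
        "parameters": {
            # pass the parent's fileName parameter through to the child
            "fileName": "@pipeline().parameters.fileName"
        }
    }
}
```

The child pipeline must declare a matching parameter (fileName here) for the value to be received.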
Lookup Activity in Azure Data Factory
The Lookup activity allows you to retrieve a dataset from any of the data sources supported by ADF, typically to drive the behavior of subsequent activities; it is also commonly used as a workaround for running UPDATE and DELETE statements against Azure SQL Database. However, there are limitations to be aware of: the activity returns at most 5,000 rows, the output size is capped at 4 MB, and the activity's timeout duration is also limited. Understanding these limits will help you design your data workflows around them.
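As a sketch, the Lookup activity below reads a single watermark value, and the expression in the comment shows how a later activity in the same pipeline could consume it; the dataset, query, and column names are hypothetical.

```python
# Hypothetical Lookup activity (JSON mirrored as a Python dict).
lookup_activity = {
    "name": "LookupWatermark",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT MAX(ModifiedDate) AS Watermark FROM dbo.Sales"
        },
        "dataset": {"referenceName": "SqlOutputDataset", "type": "DatasetReference"},
        "firstRowOnly": True   # return a single row instead of a (capped) row set
    }
}

# A downstream activity can reference the result with an expression such as:
#   @activity('LookupWatermark').output.firstRow.Watermark
```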
Key Vault Integration with Azure Data Factory
Azure Key Vault is a cloud service that securely stores and manages secrets, such as API keys, passwords, certificates, or cryptographic keys. In ADF, you can leverage Key Vault to securely access sensitive information, such as database passwords or access keys, directly from your pipeline.
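As a sketch, the linked service below pulls its connection string from Key Vault instead of embedding it in the definition; the Key Vault linked service name and secret name are hypothetical.

```python
# Hypothetical Azure SQL linked service that resolves its connection string
# from an Azure Key Vault secret at runtime (JSON mirrored as a Python dict).
sql_linked_service = {
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {   # linked service pointing at the vault itself
                    "referenceName": "KeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "SqlConnectionString"
            }
        }
    }
}
```

For this reference to resolve, the data factory's managed identity needs permission to read secrets from the vault.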
Error Handling in Azure Data Factory
To avoid failing an entire pipeline because of issues with specific records from the source, ADF provides fault tolerance options in the Copy activity. By enabling them, you can skip incompatible rows and redirect them to a separate destination for later inspection, while the valid rows are copied to the intended destination.
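As a sketch of the relevant settings, the Copy activity fragment below skips incompatible rows and redirects them to a log folder in Blob Storage; the linked service name and path are hypothetical.

```python
# Hypothetical Copy activity typeProperties fragment showing fault tolerance
# settings (JSON mirrored as a Python dict).
copy_type_properties = {
    "source": {"type": "DelimitedTextSource"},
    "sink": {"type": "AzureSqlSink"},
    "enableSkipIncompatibleRow": True,   # don't fail the run on bad rows
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "BlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "path": "copyactivity-errors"    # container/folder receiving the skipped rows
    }
}
```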
Scheduling and Triggers in Azure Data Factory
By creating triggers in Azure Data Factory, you can schedule the execution of your pipelines. There are three types of triggers (a sample definition follows the list):
- Schedule Trigger: Invokes a pipeline on a specific schedule.
- Tumbling Window Trigger: Operates on a periodic interval while retaining a state.
- Event-based Trigger: Responds to events, such as a file being created in or deleted from Azure Blob Storage.
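As an example of the first type, here is a sketch of a schedule trigger that runs a pipeline once a day; the trigger name, start time, and pipeline reference are hypothetical.

```python
# Hypothetical schedule trigger definition (JSON mirrored as a Python dict).
daily_trigger = {
    "name": "DailyLoadTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",   # run once per day
                "interval": 1,
                "startTime": "2024-01-01T02:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "CopyBlobToSqlPipeline", "type": "PipelineReference"}}
        ]
    }
}
```

A trigger does nothing until it is published and started; once started, every run it fires shows up in the pipeline's monitoring view.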
Mapping Data Flow in Azure Data Factory
Mapping Data Flow is a visual data transformation feature in Azure Data Factory that allows you to perform complex data transformations without writing code. You design the transformation logic in a graphical interface, and ADF executes it on managed Apache Spark clusters. To learn more about Mapping Data Flow and how to implement it, you can refer to this link.
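Although the data flow itself is designed visually, it still runs as an activity inside a pipeline. A sketch of that wiring, with a hypothetical data flow name and compute size, looks like this:

```python
# Hypothetical Data Flow activity that runs a mapping data flow from a pipeline
# (JSON mirrored as a Python dict). The data flow name and compute settings are placeholders.
data_flow_activity = {
    "name": "TransformSalesData",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataflow": {"referenceName": "SalesDataFlow", "type": "DataFlowReference"},
        "compute": {
            "computeType": "General",   # type of Spark cluster used for the run
            "coreCount": 8
        }
    }
}
```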
Moving CSV Files with Different Schemas to SQL Database
When you need to load CSV files that have different schemas into a common destination table, Azure Data Factory offers several ways to handle the scenario. One approach is explained in detail in this link.
By understanding these concepts and scenarios, you will be well-prepared for Azure Data Factory interviews and have a solid foundation for working with ADF in real-world projects.