When designing an Azure Data Lake Storage Gen2 account, there are several decisions to make up front. This article explores best practices for designing an Azure Data Lake Storage Gen2 account, covering data lake layers, design considerations for zones, directories, and files, and the available security options.
Data Lake Layers
The data lake consists of various layers, including environments, storage accounts, file systems, zones, directories, and files. Each layer plays a crucial role in the overall design and architecture of the data lake.
Environment
The environment is the top-level layer of the data lake. It typically includes DEV, QA, and PROD environments, each of which may require one or more ADLS Gen2 storage accounts. Provisioning these accounts can be orchestrated with Azure DevOps pipelines.
Storage Account
When creating an Azure Data Lake Storage account, several properties need to be configured (a provisioning sketch follows the list):
- Performance Tier: Choose between standard storage accounts, which provide bulk storage at a lower cost, and premium storage accounts, which offer consistent, low-latency performance.
- Account Kind: Select between general-purpose v2 and premium block blob accounts, depending on your storage needs; Data Lake Storage Gen2 requires the hierarchical namespace to be enabled.
- Replication: Choose a replication strategy (LRS, ZRS, GRS, or RA-GRS) that matches your durability requirements.
- Point in Time Restore: Enable point-in-time restore to restore containers to an earlier state.
- Secure Transfer Required: Enhance the security of your storage account by only allowing requests through a secure connection.
- Allow Public Access: Decide whether to allow anonymous access to blobs within the storage account.
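Below is a minimal provisioning sketch using the azure-mgmt-storage and azure-identity packages; it creates one ADLS Gen2 account per environment and sets the properties listed above. The subscription ID, resource group, location, and account names are placeholders, and the parameter names assume a recent version of the SDK.

```python
# Hypothetical provisioning sketch: one ADLS Gen2 account per environment.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "rg-datalake"          # placeholder
LOCATION = "westeurope"                 # placeholder

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

for env in ("dev", "qa", "prod"):
    poller = client.storage_accounts.begin_create(
        RESOURCE_GROUP,
        f"stdatalake{env}",                          # account names must be globally unique
        StorageAccountCreateParameters(
            location=LOCATION,
            sku=Sku(name="Standard_LRS"),            # performance tier + replication
            kind="StorageV2",                        # general-purpose v2 account
            is_hns_enabled=True,                     # hierarchical namespace = ADLS Gen2
            enable_https_traffic_only=True,          # secure transfer required
            allow_blob_public_access=False,          # disallow anonymous access
        ),
    )
    account = poller.result()
    print(account.name, account.provisioning_state)
```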
Zones, Directories & Files
At the folder and file layer, storage account containers define zones, directories, and files. It is recommended to follow a consistent, date-partitioned folder structure so that analytical queries can prune data efficiently. Each source system should be granted write permissions at the DataSource folder level, so that permissions are inherited (via default ACLs) as new daily folders and files are created.
Here is an example of the recommended folder structure:
\Raw\DataSource\Entity\YYYY\MM\DD\File.extension
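The sketch below shows how such a date-partitioned path could be created with the azure-storage-file-datalake package. The account URL, container name, and file names are placeholders.

```python
# Sketch: create today's partition under Raw\DataSource\Entity\YYYY\MM\DD.
from datetime import datetime, timezone
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("raw")                  # container = zone

now = datetime.now(timezone.utc)
path = f"DataSource/Entity/{now:%Y/%m/%d}"                   # daily folder
directory = fs.create_directory(path)

# Upload today's extract into the daily folder (local file name is a placeholder).
file_client = directory.create_file("File.extension")
with open("local_extract.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```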
For sensitive sub-zones in the raw layer, it is advisable to separate them into their own top-level folders. This allows separate lifecycle management policies to be applied based on prefix matching, as in the sketch below.
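As an illustration, the following sketch applies a lifecycle rule scoped to a sensitive prefix via the azure-mgmt-storage management policies API. The resource group, account name, prefix, and retention values are placeholders, and the rule is expressed as a plain dict mirroring the ManagementPolicy model of a recent SDK version.

```python
# Hypothetical lifecycle rule that targets a sensitive sub-zone by prefix.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

policy = {
    "policy": {
        "rules": [
            {
                "name": "raw-sensitive-retention",
                "enabled": True,
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blob_types": ["blockBlob"],
                        "prefix_match": ["raw/sensitive/"],   # top-level sensitive folder
                    },
                    "actions": {
                        "base_blob": {
                            "tier_to_cool": {"days_after_modification_greater_than": 30},
                            "delete": {"days_after_modification_greater_than": 365},
                        }
                    },
                },
            }
        ]
    }
}

client.management_policies.create_or_update(
    "rg-datalake", "stdatalakeprod", "default", policy
)
```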
Security
Security is a critical aspect of designing a data lake. Here are some security features to consider:
RBAC (Role-Based Access Control)
RBAC provides both control plane and data plane permissions. Control plane permissions grant security principals rights to manage the Azure resource itself, while data plane roles (such as Storage Blob Data Reader or Contributor) grant access to the data, scoped at the storage account or container level. It is recommended to use a combination of coarse-grained RBAC and fine-grained ACLs for effective security.
ACLs (Access Control Lists)
ACLs control access to individual files and folders within the data lake. It is advised to assign security principals an RBAC Reader role at the storage account/container level and then apply restrictive, selective ACLs at the file and folder level, as in the sketch below.
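The following sketch grants a source system's service principal write access at the DataSource folder and makes that permission inherit to new child folders and files via default ACLs. The object ID and paths are placeholders, and the coarse-grained RBAC Reader assignment is assumed to have been made separately (portal, CLI, or the management SDK).

```python
# Sketch: access ACL + default ACL at the DataSource folder.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

SOURCE_SYSTEM_OID = "<service-principal-object-id>"   # placeholder

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw").get_directory_client("DataSource")

# Access ACL applies to this folder; the "default:" entry is inherited by new children.
acl = (
    "user::rwx,group::r-x,other::---,"
    f"user:{SOURCE_SYSTEM_OID}:rwx,"
    f"default:user:{SOURCE_SYSTEM_OID}:rwx"
)
directory.set_access_control(acl=acl)

# Optionally push the same ACL onto anything that already exists underneath.
directory.update_access_control_recursive(acl=acl)
```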
Shared Access Signature (SAS)
SAS provides limited, time-bound access to containers without sharing account keys. It is useful for granting temporary access to your storage account and for managing different levels of access for users within or outside of your organization.
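As a sketch, the snippet below issues a read/list SAS on the raw container that is valid for 24 hours, using the azure-storage-blob package. The account name, key, and container are placeholders; in practice a user delegation SAS is preferable to an account-key SAS.

```python
# Sketch: time-limited, read-only SAS on the "raw" container.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

sas_token = generate_container_sas(
    account_name="<account>",                     # placeholder
    container_name="raw",
    account_key="<account-key>",                  # placeholder; prefer a user delegation SAS
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=24),
)

url = f"https://<account>.blob.core.windows.net/raw?{sas_token}"
print(url)
```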
Data Encryption
Data stored in Azure Data Lake Storage Gen2 is automatically encrypted at rest, and is encrypted in transit when secure transfer is required. It is recommended to let the service manage the encryption keys unless there is a specific need for customer-managed keys.
Network Transport
Network rules can be configured to limit access to your storage account to specified IP addresses or subnets, as in the sketch below. Private endpoints can also be created so that all traffic between your VNet and the storage account flows over Azure Private Link.
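The sketch below denies public access by default and allows only a specific IP range and subnet. The resource group, account name, IP range, and subnet resource ID are placeholders, and the model names assume a recent azure-mgmt-storage version.

```python
# Hypothetical network rule update: deny by default, allow one IP range and one subnet.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    StorageAccountUpdateParameters, NetworkRuleSet, IPRule, VirtualNetworkRule,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.storage_accounts.update(
    "rg-datalake",                                 # placeholder resource group
    "stdatalakeprod",                              # placeholder account
    StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",
            ip_rules=[IPRule(ip_address_or_range="203.0.113.0/24")],
            virtual_network_rules=[
                VirtualNetworkRule(
                    virtual_network_resource_id=(
                        "/subscriptions/<subscription-id>/resourceGroups/rg-network"
                        "/providers/Microsoft.Network/virtualNetworks/vnet-analytics"
                        "/subnets/snet-compute"
                    )
                )
            ],
        )
    ),
)
```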
By following these best practices, you can ensure a well-designed and secure Azure Data Lake Storage Gen2 account that meets your specific requirements.
Article Last Updated: 2021-04-21