Azure Data Factory allows businesses to organize raw data into meaningful data stores and data lakes, thus enabling businesses to make better decisions. In this article, we will take a look at the four steps typically carried out by Azure Data Factory’s pipelines. We will also look at some high-level concepts.
Introduction
Raw data by itself, lacking context and meaning, is not a source of actionable insights, no matter how many petabytes of data you may have collected and stored. This type of unorganized data is often stored in a variety of storage systems, including relational and non-relational databases, but without context, it’s not useful for analysts or data scientists.
Background
For big data to be useful, it requires services that can orchestrate and operationalize processes, thus turning unorganized data into business insights that are actionable. Azure Data Factory was built to enable businesses to transform raw data into actionable business insights. It does this by carrying out complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.
Imagine a car rental company that has collected petabytes of car rental logs which it holds in a cloud data store. The company would like to use this data to gain insights into customer demographics, preferences and usage behaviour. With these insights, the company could more effectively up-sell and cross-sell to its customers, as well as improve customer experience and develop new features, thus driving business growth.
In order to analyse the car rental logs held in the cloud data store, the company needs to include contextual data such as customer information, vehicle information, and advertising and marketing information. However, this contextual information is stored in an on-premises database. Therefore, in order to make use of the car rental logs, the company will have to combine the data in the on-premises database with the log data it holds in the cloud data store.
In order to extract insights from its data, the company would probably have to process the joined data by using a Spark cluster in the cloud and then publish the transformed data into a cloud data warehouse, such as Azure SQL Data Warehouse, so that a report can easily be built on top of it. The workflow would also need to be automated, as well as monitored and managed, on a daily schedule. This is not an unusual scenario for a company in these days of big data, and Azure Data Factory has been designed to solve just such scenarios. Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
What this means is that you can use Azure Data Factory to create and schedule pipelines (data-driven workflows) that can take in data from different data stores. Azure Data Factory can also process and transform data using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning. You can also publish output data to data stores such as Azure SQL Data Warehouse, which can then be consumed by business intelligence (BI) applications. In short, Azure Data Factory lets businesses organize raw data into meaningful data stores and data lakes so they can make better decisions.
Pipeline Steps
Azure Data Factory’s pipelines typically carry out the following four steps:
Connect and Collect
When building an information production system, the first step is to connect to all the required sources of data. This data can be structured, unstructured or semi-structured; it can be located on-premises or in the cloud; and it arrives at different speeds and intervals. You also need to connect to the required sources of processing, such as databases, file shares, software-as-a-service (SaaS) applications and FTP web services. Once you have connected to all the sources of both data and processing, you need to move the data to a centralized location so it can be processed.

It is entirely possible for a company to do all of this by building custom data movement components or by writing custom services. However, such systems are expensive and difficult to integrate and maintain, whereas a fully managed service offers a higher level of monitoring, alerting and control. Instead of building custom data movement components, Azure Data Factory allows you to move data from both on-premises and cloud data stores to a centralized data store simply by using its Copy Activity (described below).
Copy Activity performs the following steps (a code sketch follows the list):
- It reads data from a source data store.
- It performs serialization/deserialization, compression/decompression, column mapping, and other processing on the data.
- It writes the data to the destination data store.
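To make this concrete, here is a minimal sketch of a Copy Activity built with the azure-mgmt-datafactory Python SDK (a recent version is assumed). The dataset names "RentalLogsInput" and "RentalLogsStaging" are placeholders for datasets you would define separately.

```python
# Minimal sketch of a Copy Activity, assuming the azure-mgmt-datafactory package.
# "RentalLogsInput" and "RentalLogsStaging" are placeholder dataset names.
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
)

copy_activity = CopyActivity(
    name="CopyRentalLogs",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="RentalLogsInput")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="RentalLogsStaging")],
    source=BlobSource(),  # reads from the source data store
    sink=BlobSink(),      # writes to the destination data store
)
```

The activity object is only a definition; it takes effect once it is added to a pipeline and deployed, as shown later in this article.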
Transform and Enrich
Once the data is in a centralized data store, it can be processed or transformed using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. The transformed data can then be produced according to a controllable and maintainable schedule.
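As an illustration, the sketch below defines a Hive-based transformation step with the same Python SDK. The linked service names ("HDInsightLinkedService", "StorageLinkedService") and the script path are placeholder values, not part of any real factory.

```python
# Sketch of a transformation activity that runs a Hive script on an HDInsight
# cluster. All names and paths are placeholders.
from azure.mgmt.datafactory.models import (
    HDInsightHiveActivity,
    LinkedServiceReference,
)

hive_activity = HDInsightHiveActivity(
    name="PartitionRentalLogs",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="HDInsightLinkedService"),
    script_path="scripts/partition_logs.hql",  # Hive script held in blob storage
    script_linked_service=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="StorageLinkedService"),
    defines={"inputPath": "rentallogs/raw",
             "outputPath": "rentallogs/partitioned"},
)
```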
Publish
Once the data has been refined, it can be loaded into an analytics engine such as Azure SQL Data Warehouse, Azure SQL Database, or Azure Cosmos DB. You can then point whichever business intelligence tool you use at that analytics engine.
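A sketch of the publish step might look like the following: a second Copy Activity that loads the transformed data into an Azure SQL Data Warehouse table. Both dataset names are placeholders.

```python
# Sketch of publishing refined data into Azure SQL Data Warehouse.
# "RentalLogsTransformed" and "RentalInsightsTable" are placeholder dataset names.
from azure.mgmt.datafactory.models import (
    BlobSource,
    CopyActivity,
    DatasetReference,
    SqlDWSink,
)

publish_activity = CopyActivity(
    name="PublishToWarehouse",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="RentalLogsTransformed")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="RentalInsightsTable")],
    source=BlobSource(),
    sink=SqlDWSink(),  # sink type for an Azure SQL Data Warehouse dataset
)
```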
Monitor
Once you’ve built your data pipeline and refined the data, the activities and pipelines need to be monitored for success and failure rates. Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell and Log Analytics.
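Programmatic monitoring might look like the sketch below, which queries the status of a pipeline run and its activity runs through the Python SDK. The subscription, resource group, factory name and run ID are placeholders.

```python
# Sketch of monitoring a pipeline run and its activity runs via the SDK.
# All identifiers in angle brackets and the resource names are placeholders.
from datetime import datetime, timedelta

from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>",
    client_secret="<client-secret>")
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Status of a single pipeline run.
pipeline_run = adf_client.pipeline_runs.get(
    "myResourceGroup", "myDataFactory", "<run-id>")
print("Pipeline run status:", pipeline_run.status)

# Activity runs belonging to that pipeline run over the last day.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1))
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    "myResourceGroup", "myDataFactory", pipeline_run.run_id, filters)
for run in activity_runs.value:
    print(run.activity_name, run.status)
```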
High-Level Concepts
There are four key components in an Azure Data Factory. Together, these components provide the platform on which you can build data-driven workflows. An Azure subscription might be made up of one or more data factories.
Pipeline
A logical grouping of activities that performs a task is known as a pipeline. A data factory can have one or more pipelines. For instance, a pipeline could consist of a group of activities that takes data from an Azure blob, and then runs a Hive query on an HDInsight cluster in order to partition the data. Using a pipeline means that you can manage the activities as a set, rather than individually. The activities can run sequentially or independently in parallel, depending on your needs.
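Continuing the earlier sketches, the snippet below groups the copy and Hive activities into a single pipeline and makes the Hive step depend on the copy step, so the two run sequentially; activities with no dependency between them would run in parallel. It assumes the copy_activity, hive_activity and adf_client objects from the previous snippets.

```python
# Sketch of grouping activities into a pipeline. Assumes copy_activity,
# hive_activity and adf_client from the earlier snippets; names are placeholders.
from azure.mgmt.datafactory.models import ActivityDependency, PipelineResource

# Run the Hive step only after the copy step succeeds.
hive_activity.depends_on = [
    ActivityDependency(activity="CopyRentalLogs",
                       dependency_conditions=["Succeeded"])
]

pipeline = PipelineResource(activities=[copy_activity, hive_activity])
adf_client.pipelines.create_or_update(
    "myResourceGroup", "myDataFactory", "RentalLogsPipeline", pipeline)
```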
Activity
An activity is a processing step in a pipeline. Azure Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.
Datasets
Datasets represent data structures within the data stores. They point to the data you want to use as inputs or outputs in your activities.
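For example, a dataset that points at a folder of rental-log files in blob storage might be defined as in the sketch below. It assumes a linked service named "StorageLinkedService" (sketched in the next section); the paths and names are placeholders.

```python
# Sketch of a dataset pointing at a blob folder. "StorageLinkedService" and the
# paths are placeholders; the linked service is sketched in the next section.
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    DatasetResource,
    LinkedServiceReference,
)

rental_logs_dataset = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="StorageLinkedService"),
    folder_path="rentallogs/raw",
    file_name="logs.csv",
))
# adf_client.datasets.create_or_update(
#     "myResourceGroup", "myDataFactory", "RentalLogsInput", rental_logs_dataset)
```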
Linked Services
Linked services are similar to connection strings. They define the connection information which the Data Factory needs in order to be able to connect to external resources.
Linked services have two purposes:
- They are used to represent a data store, such as an on-premises SQL Server database, an Oracle database, a file share, or an Azure Blob storage account.
- They are also used to represent a compute resource which can host the execution of an activity (a sketch of a storage linked service follows the list).
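A blob storage linked service, for example, could be registered as in the sketch below; the connection string and resource names are placeholders.

```python
# Sketch of a linked service holding the connection information for an Azure Blob
# storage account. The connection string is a placeholder.
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,
    LinkedServiceResource,
)

storage_linked_service = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string=(
            "DefaultEndpointsProtocol=https;AccountName=<account>;"
            "AccountKey=<key>;EndpointSuffix=core.windows.net")))
# adf_client.linked_services.create_or_update(
#     "myResourceGroup", "myDataFactory", "StorageLinkedService",
#     storage_linked_service)
```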
Triggers
A trigger represents the unit of processing that determines when a pipeline execution needs to be kicked off. There are several different types of trigger, each corresponding to a different type of event.
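A daily schedule trigger attached to the pipeline from the earlier sketch might look like this; the pipeline name, dates and resource names are placeholders, and the exact model names may vary slightly between SDK versions.

```python
# Sketch of a schedule trigger that runs a pipeline once a day.
# "RentalLogsPipeline" and the dates are placeholders.
from datetime import datetime

from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

daily_trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day", interval=1,
        start_time=datetime(2018, 10, 21), time_zone="UTC"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="RentalLogsPipeline"),
        parameters={})],
))
# adf_client.triggers.create_or_update(
#     "myResourceGroup", "myDataFactory", "DailyTrigger", daily_trigger)
```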
Pipeline Runs
A pipeline run is an instance of a pipeline execution. Pipelines define parameters, and a pipeline run is instantiated by passing arguments to those parameters. You can pass the arguments within the trigger definition, or manually.
Parameters
Parameters are key-value pairs of read-only configuration that are defined in the pipeline. Activities within the pipeline consume the parameter values. Datasets and linked services are both strongly typed parameters.
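The sketch below declares a pipeline parameter and shows how an argument for it would be supplied when a run is created manually; the parameter name, value and resource names are placeholders.

```python
# Sketch of a pipeline parameter and of passing an argument for it at run time.
# "outputPath" and the resource names are placeholders.
from azure.mgmt.datafactory.models import ParameterSpecification, PipelineResource

parameterised_pipeline = PipelineResource(
    parameters={"outputPath": ParameterSpecification(type="String")},
    # Activities would go here; an activity property can consume the value
    # through the expression "@pipeline().parameters.outputPath".
)

# Creating a run manually, passing an argument for the parameter
# (assumes the adf_client from the monitoring sketch):
# run = adf_client.pipelines.create_run(
#     "myResourceGroup", "myDataFactory", "RentalLogsPipeline",
#     parameters={"outputPath": "rentallogs/curated"})
```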
Control Flow
Control flow is how the pipeline activities are organised. This can include putting the activities in a sequence, branching, as well as defining parameters in the pipeline and passing arguments to these parameters.
Supported Regions
Azure Data Factory is available in a limited set of Azure regions; consult the Azure documentation for the current list (this article was written in October 2018).
History
- 20th October, 2018: Version 1