Azure Data Factory Interview Questions and Answers
Azure Data Factory doesn’t store any data itself; it lets you create workflows that orchestrate
the movement of data between supported data stores and data processing services. You
can monitor and manage your workflows using both programmatic and UI mechanisms. Apart
from that, it is one of the best tools available today for ETL processes, with an easy-to-use interface. This
shows the need for Azure Data Factory.
Azure Data Factory is a cloud-based data integration service offered by Microsoft that lets
you create data-driven workflows for orchestrating and automating data movement and data
transformation in the cloud. Data Factory also lets you create and run data pipelines
that move and transform data, and run those pipelines on a specified schedule.
Integration runtime is the compute infrastructure used by Azure Data Factory
to provide data integration capabilities across different network environments.
Types of Integration Runtimes:
Azure Integration Runtime – It can copy data between cloud data stores and
dispatch activities to a variety of compute services such as Azure SQL Database and Azure
HDInsight.
Self-Hosted Integration Runtime – It is software with essentially the same code as
the Azure Integration Runtime, but it is installed on on-premises systems or on
virtual machines inside virtual networks.
Azure-SSIS Integration Runtime – It helps execute SSIS packages in a
managed environment. So when we lift and shift SSIS packages to Data
Factory, we use the Azure-SSIS Integration Runtime.
4) How much is the limit on the number of integration
runtimes?
There’s no specific limit on the number of integration runtime instances. But there is a
per-subscription limit on the number of VM cores that the integration runtime can use for
SSIS package execution.
6) What is the key difference between the Dataset and Linked Service in
Azure Data Factory?
A dataset specifies a reference to the data in the data store described by the linked service. When we load data
from a SQL Server instance, for example, the dataset indicates the name of the table that contains
the target data, or the query that returns data from different tables.
A linked service specifies a definition of the connection information used to connect to the data stores.
For illustration, a linked service for a SQL Server instance contains the name of the SQL Server instance and the
credentials used to connect to that instance.
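The relationship above can be sketched as two JSON definitions built in Python. This is a minimal, hedged sketch: the names (`MySqlServerLinkedService`, `CarsTableDataset`, `dbo.Cars`) and the connection string are placeholders, and the shapes follow the Data Factory JSON schema for an Azure SQL linked service and a table dataset.

```python
import json

# Linked service: holds the connection information (a hypothetical
# connection string; a real secret would come from Azure Key Vault,
# not be inlined like this).
linked_service = {
    "name": "MySqlServerLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=myserver;Database=mydb;User ID=admin;"
        },
    },
}

# Dataset: points at one specific table *through* the linked service.
dataset = {
    "name": "CarsTableDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "MySqlServerLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"tableName": "dbo.Cars"},
    },
}

# The dataset references the linked service by name; it never carries
# the connection details itself.
assert (
    dataset["properties"]["linkedServiceName"]["referenceName"]
    == linked_service["name"]
)
print(json.dumps(dataset, indent=2))
```

Keeping the connection details only in the linked service means many datasets can share one connection definition.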
8) What are the different rich cross-platform SDKs for advanced users in
Azure Data Factory?
Azure Data Factory V2 provides a rich set of SDKs that we can use to author, manage,
and monitor pipelines from our favourite IDE. Some popular cross-platform SDKs
for advanced users in Azure Data Factory are as follows:
Python SDK
C# SDK
PowerShell CLI
Users can also use the documented REST APIs to interface with Azure Data
Factory V2.
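As a sketch of what the REST route looks like, the snippet below builds the endpoint for triggering a pipeline run. All identifiers are placeholders, and `2018-06-01` is assumed here as the commonly documented Data Factory V2 API version; a real call would also need an Azure AD bearer token.

```python
# Build the Data Factory V2 REST endpoint for creating a pipeline run.
# Every identifier below is a placeholder, not a real resource.
subscription_id = "00000000-0000-0000-0000-000000000000"
resource_group = "my-resource-group"
factory_name = "my-data-factory"
pipeline_name = "CopyCarsPipeline"

url = (
    "https://management.azure.com"
    f"/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}"
    "/providers/Microsoft.DataFactory"
    f"/factories/{factory_name}"
    f"/pipelines/{pipeline_name}/createRun"
    "?api-version=2018-06-01"
)

# With the `requests` library, the call would then look roughly like:
# requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
print(url)
```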
9) What is the difference between Azure Data Lake and Azure Data
Warehouse?
Azure Data Lake:
It is a capable way of storing data of any type and size, mostly raw data.
It uses the ELT (Extract, Load and Transform) process.
It is an ideal platform for doing in-depth analysis.
Azure Data Warehouse:
It acts as a repository for already filtered and processed data.
It uses the ETL (Extract, Transform and Load) process.
It is the best platform for operational users and reporting.
Intermediate ADF Interview Questions
10) What is Blob Storage in Azure?
It helps store large amounts of unstructured data such as text, images, or binary data. It
can be used to expose data publicly to the world. Blob storage is most commonly used for
streaming audio or video, storing data for backup and disaster recovery, storing data for
analysis, etc. You can also create data lakes using Blob storage to perform analytics.
11) Difference between Data Lake Storage and Blob Storage.
Azure Data Lake Storage:
It is an optimized storage solution for big data analytics workloads.
It follows a hierarchical file system.
It can be used to store batch, interactive, and streaming analytics data, as well as machine learning data.
Azure Blob Storage:
It is general-purpose storage suitable for a wide variety of scenarios.
It follows an object store with a flat namespace.
We can use it to store text files, binary data, media for streaming, and general-purpose data.
12) What are the steps to create an ETL process in Azure Data
Factory?
There are straightforward steps to create an ETL process:
We need to create a linked service for the source data store, which is a SQL Server
database.
Let’s assume that we have a car dataset.
For this car dataset, we can create a linked service for the destination data
store, which is Azure Data Lake Storage.
Now create datasets for the source and the destination.
Create a pipeline with a Copy activity, and schedule it with a trigger.
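The steps above can be sketched as a single pipeline definition. This is a hedged outline following the Data Factory pipeline JSON schema; the dataset names (`CarsSqlDataset`, `CarsDataLakeDataset`) are hypothetical and assume the source and destination linked services from the earlier steps already exist.

```python
import json

# Final step: a pipeline with one Copy activity that reads the car data
# from the SQL dataset and writes it to the Data Lake dataset.
pipeline = {
    "name": "CopyCarsPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyCarsToDataLake",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "CarsSqlDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "CarsDataLakeDataset", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    # Source/sink types depend on the dataset formats chosen.
                    "source": {"type": "AzureSqlSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ]
    },
}
print(json.dumps(pipeline, indent=2))
```

The pipeline only wires together references: the datasets describe the data's shape and location, the linked services hold the connections, and the Copy activity performs the movement.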
13) What is the difference between Azure HDInsight and Azure Data
Lake Analytics?
Azure HDInsight:
Processing data in it requires configuring the cluster with predefined nodes. Further, by using
languages like Pig or Hive, we can process the data.
Users can easily configure HDInsight clusters at their convenience. Users can also use Spark,
Kafka, etc. without restrictions.
Azure Data Lake Analytics:
It is all about passing the queries written for data processing. Data Lake Analytics then creates
the compute nodes needed to process the data set.
It does not give that much flexibility in terms of configuration and customization, but Azure
manages it automatically for its users.
15) What are the key differences between the Mapping data flow and
Wrangling data flow transformation activities in Azure Data Factory?
In Azure Data Factory, the main difference between the Mapping data flow and the Wrangling
data flow transformation activities is as follows:
The Mapping data flow activity is a visually designed data transformation activity that
lets users design graphical data transformation logic. It does
not require the users to be expert developers. It is executed as an activity within the
ADF pipeline on a fully managed, scaled-out Spark cluster.
On the other hand, the Wrangling data flow activity is a code-free data preparation activity.
It is integrated with Power Query Online to make the Power Query M functions available for
data wrangling using Spark execution.
17) Can we pass parameters to a pipeline run?
Yes, definitely, we can very easily pass parameters to a pipeline run. Parameters are first-
class, top-level concepts in Azure Data Factory. We can define parameters at the pipeline
level, and then pass the arguments when we run the pipeline.
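As a sketch, the snippet below shows a pipeline that declares a parameter and the argument payload a run would supply. The pipeline name, parameter name, and values are illustrative only.

```python
# A pipeline declaring one string parameter with a default value.
# Activities inside the pipeline would reference it via the expression
# @pipeline().parameters.windowDate.
pipeline = {
    "name": "LoadByDatePipeline",
    "properties": {
        "parameters": {
            "windowDate": {"type": "string", "defaultValue": "2024-01-01"}
        },
        "activities": [],  # omitted for brevity
    },
}

# Arguments passed when the pipeline run is created, e.g. as the body
# of a createRun REST call.
run_arguments = {"windowDate": "2024-06-30"}

# Every supplied argument must match a parameter declared on the pipeline.
declared = set(pipeline["properties"]["parameters"])
assert set(run_arguments) <= declared
print(run_arguments)
```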
25) What has changed from private preview to limited public preview
in regard to data flows?
You’ll no longer have to bring your own Azure Databricks clusters;
Data Factory will manage cluster creation and teardown.
Blob storage datasets and Azure Data Lake Storage Gen2 datasets
are separated into delimited text and Apache Parquet datasets.
You can still use Data Lake Storage Gen2
and Blob storage to store those files. Use the appropriate linked service for
those storage engines.
26) How do I access the data using the other 80 Dataset types in Data
Factory?
The mapping data flow feature currently supports Azure SQL Database, Azure SQL Data
Warehouse, and delimited text files from Azure Blob Storage or Azure Data Lake Storage Gen2
natively for source and sink. You can use the Copy activity to stage data from any of the other
connectors, and then execute a Data Flow activity to transform the staged data.
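The stage-then-transform pattern described above can be sketched as a two-activity pipeline. This is a hedged outline: the dataset names are placeholders, and `ExecuteDataFlow` is assumed here as the JSON activity type for mapping data flows.

```python
# Stage-then-transform: a Copy activity lands data from any connector
# into supported storage, then a data flow transforms the staged copy.
pipeline = {
    "name": "StageThenTransformPipeline",
    "properties": {
        "activities": [
            {
                "name": "StageToDataLake",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "AnyConnectorDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "StagedParquetDataset", "type": "DatasetReference"}
                ],
            },
            {
                "name": "TransformStagedData",
                "type": "ExecuteDataFlow",  # runs a mapping data flow
                "dependsOn": [
                    {"activity": "StageToDataLake", "dependencyConditions": ["Succeeded"]}
                ],
            },
        ]
    },
}

# The data flow only starts once the staging copy has succeeded.
acts = {a["name"]: a for a in pipeline["properties"]["activities"]}
assert acts["TransformStagedData"]["dependsOn"][0]["activity"] == "StageToDataLake"
```

The `dependsOn` entry is what chains the two activities, so the transformation always sees a complete staged copy.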
29) What is the difference between the Dataset and Linked Service in
Data Factory?
Dataset: a reference to the data store that is described by the linked service.
Linked Service: a description of the connection string that is used to
connect to the data stores.
30) What is the difference between the mapping data flow and
wrangling data flow transformation?
Mapping Data Flow: It is a visually designed data transformation activity that lets
users design a graphical data transformation logic without needing an expert
developer.
Wrangling Data Flow: This is a code-free data preparation activity that integrates
with Power Query Online.
31) Data Factory supports two types of compute environments to execute the
transform activities. Mention them briefly.
Let’s go through the two types:
On-demand compute environment – a fully managed environment offered by Data Factory. The
compute (for example, an on-demand HDInsight cluster) is created just before a transform
activity runs and is removed when the activity completes.
Bring your own environment – you register your own existing compute (for example, an
HDInsight cluster) with Data Factory through a linked service and manage it yourself.
Relatedly, the Azure-SSIS Integration Runtime is a fully managed cluster of virtual machines hosted in Azure
and dedicated to running SSIS packages in Data Factory. We can easily scale up the SSIS nodes
by configuring the node size, or scale out by configuring the number of nodes in the virtual
machine cluster.
We can execute an ADF pipeline in three ways:
Debug mode
Manual execution using Trigger now
Adding a schedule, tumbling window, or event trigger
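The third option can be sketched as a trigger definition. This is a hedged example of a tumbling window trigger in the Data Factory JSON shape; the trigger name, pipeline name, and start time are placeholders.

```python
# A tumbling-window trigger that fires the referenced pipeline once per
# hour over fixed, non-overlapping windows starting from startTime.
trigger = {
    "name": "HourlyTumblingTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2024-01-01T00:00:00Z",
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "CopyCarsPipeline",
                "type": "PipelineReference",
            }
        },
    },
}

# One window per hour; each window processes exactly one slice of time.
assert trigger["properties"]["typeProperties"]["frequency"] == "Hour"
```

Unlike a simple schedule trigger, tumbling windows are stateful, so missed or failed windows can be re-run for their exact time slice.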
Yes, we can monitor and manage ADF pipelines using the following steps:
Click on Monitor & Manage on the Data Factory tab.
Click on the resource explorer.
Here, you will find pipelines, datasets, and linked services in a tree format.
39) What are the steps involved in the ETL process?
The ETL (Extract, Transform, Load) process follows four main steps:
Connect and Collect – moves data from on-premises and cloud source
data stores to a centralized data store.
Transform – lets users process the collected data using compute services such as
HDInsight Hadoop, Spark, etc.
Publish – helps in loading the data into Azure Data Warehouse, Azure SQL
Database, Azure Cosmos DB, etc.
Monitor – supports pipeline monitoring via Azure Monitor, API and
PowerShell, Log Analytics, and health panels on the Azure portal.
FAQs
Q. Is coding required for Azure Data Factory?
Ans: No, coding is not required. Azure Data Factory lets you create workflows very quickly. It
offers 90+ built-in connectors, and you can transform the data using
mapping data flow activities without programming skills or Spark cluster knowledge.
Q. Can we replace Synapse Pipelines with another ETL tool like Talend or SSIS?
Ans: We can use either Azure Data Factory or Synapse Pipelines, whose Data Integration &
Orchestration capabilities integrate our data and operationalize all our code development.
Q. ETL should always happen with Azure Data factory or Synapse Pipelines, or
can we use any other ETL tool in the market?
Ans: Along with Azure Data Factory and Synapse Pipelines, you can also use Azure Databricks.
Synapse Pipelines provide Data Integration & Orchestration to integrate your data and
operationalize all of your code development.
Q. If Azure Data Factory and Synapse Pipelines have the same functionality,
which one should we choose, and why?
Ans: If your requirement is only data movement and transformation, use Azure Data Factory.
For analytics capabilities, go with Synapse, because Azure Synapse Analytics is an umbrella
service that provides an analytical workspace along with other services.
Q. Azure Data Bricks we can write transformation logic right then why we
require ADF?
Ans: Mapping data flows are visually designed data transformations in Azure Data Factory. Data
flows allow data engineers to develop data transformation logic without
writing code. The resulting data flows are executed as activities within Azure Data
Factory pipelines that use scaled-out Apache Spark clusters. Data
flow activities can be operationalized using existing Azure Data Factory scheduling, control flow,
and monitoring capabilities.