The Scientist - The Engineer - and The Warehouse
Introduction
Azure Databricks
Conclusion
The smaller organization
But let’s make it clear: the data engineer certainly knows about
machine learning and coding. They can make sense of algorithms
and scripts, but perhaps without a state-of-the-art statistical or
mathematical understanding of models and experimentation.
What they do have is the ability to build robust data pipelines
and deployments which can run models in production under
demanding conditions. Their work is rooted in a deep, and
often hard-won, knowledge of how modern software is deployed
and administered.
Yet modern companies also find that they want to store and
manage that raw data, and not only for machine learning which,
as we have said, looks beneath the standardized business model.
This is one reason that the data lake has been such a successful
architectural innovation in recent years. A data lake stores data,
often vast amounts, in its natural state. This data may be messy
and unstructured but provides raw material for data science.
While the data scientist may work with raw data in a data lake,
report designers and BI users are much more likely to work with
data which has already been cleaned and modeled to reflect
enterprise demands and standards. Nevertheless, the modern
data warehouse can be a useful source of knowledge and data.
See the sidebar: The data warehouse as a source for data science.
Secondly, the data warehouse serves data that has been through a process
of consolidation and cleaning, particularly when it comes to dimension data.
Names of departments, geographies, job titles, and every other dimension
have been reconciled and agreed upon. Although the data scientist will mostly
wish to use raw numeric or text data for analysis, this authoritative source of
master data can help them to ensure that their model is readily applicable in
business terms.
To see better how data scientists and data engineers work in this
environment, it will be useful to examine their workflow and tools
in more detail.
[Figure: Azure Databricks (prep only); Store: Azure Data Lake Storage]
If the project at hand is simply for research, this work may well be
done by a data scientist alone. However, if their research comes
up with compelling results which could be useful in production,
this phase will need to be revisited by the data engineer and
data scientist together, to deliver a process robust enough for
enterprise use.
Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform
widely used in machine learning for exploration and modeling. It
enables the data scientist and data engineer to write code in data
science notebooks using Python, R, Scala, or SQL while also
leveraging the power of distributed processing with automated
cluster management.
To enable this, you can load large volumes of data from Azure
Databricks directly into Azure Synapse Analytics using a
specialized and highly efficient Synapse Analytics connector. This
connector uses Azure Blob Storage and PolyBase (a Microsoft
data virtualization technology) in Synapse Analytics to transfer
large volumes of data between a Databricks cluster and a
Synapse Analytics instance. In some scenarios, source data
streams from a constantly updated system, as is often the case
in online retail or with the Internet of Things. In those cases you
can stream data directly into Azure Synapse Analytics using
Structured Streaming. This enables business users to work with
near-real-time data in Azure Synapse Analytics.
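The batch load described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes a Databricks notebook in which a Spark DataFrame already exists, and it uses the connector format name and option keys as they are commonly documented for the Databricks Synapse connector. The helper names (`synapse_write_options`, `write_to_synapse`) and every URL, path, and table name are hypothetical placeholders.

```python
from typing import Any, Dict


def synapse_write_options(jdbc_url: str, temp_dir: str, table: str) -> Dict[str, str]:
    """Build the option set for the Synapse connector.

    All argument values are placeholders supplied by the caller.
    """
    return {
        "url": jdbc_url,                                # JDBC URL of the Synapse instance
        "tempDir": temp_dir,                            # Blob Storage folder used for staging
        "forwardSparkAzureStorageCredentials": "true",  # let Synapse read the staged files
        "dbTable": table,                               # target table in Synapse
    }


def write_to_synapse(df: Any, options: Dict[str, str], mode: str = "append") -> None:
    """Write a Spark DataFrame to Synapse via the staging folder."""
    (
        df.write.format("com.databricks.spark.sqldw")
        .options(**options)
        .mode(mode)
        .save()
    )
```

In a notebook, this might be called as `write_to_synapse(df, synapse_write_options(url, staging_path, "dbo.Sales"))`, where `url`, `staging_path`, and the table name are your own values. Keeping the options in one function makes it easier to review credentials and staging locations in a single place.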
With the Azure Machine Learning service, you can use the
Score Model module to generate predictions using a trained
classification or regression model. The module’s scored dataset
output can then be loaded into Azure Synapse Analytics.
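To make the idea of a scored dataset concrete, here is a small illustrative sketch in plain Python. It does not call the Azure Machine Learning service. It only mimics the shape of the output: each input row is returned with prediction columns appended. The column names `Scored Labels` and `Scored Probabilities` are assumed to mirror the module's output, and `score_rows`, `toy_churn_model`, and the sample customer data are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


def score_rows(
    rows: List[Row],
    predict_proba: Callable[[Row], float],
    threshold: float = 0.5,
) -> List[Row]:
    """Return a copy of each row with prediction columns appended."""
    scored = []
    for row in rows:
        probability = predict_proba(row)
        out = dict(row)  # copy, so the input row is left untouched
        out["Scored Probabilities"] = probability
        out["Scored Labels"] = 1 if probability >= threshold else 0
        scored.append(out)
    return scored


def toy_churn_model(row: Row) -> float:
    """Hypothetical stand-in for a trained classification model."""
    return 0.9 if row["months_inactive"] >= 3 else 0.1


customers = [
    {"customer_id": 1, "months_inactive": 4},
    {"customer_id": 2, "months_inactive": 0},
]
scored = score_rows(customers, toy_churn_model)
```

Because the scored rows keep the original columns, they can be joined back to warehouse dimensions once loaded.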
Find related users lists the users who are related to each user in
the input dataset, depending on how many results you request.
All of these datasets can be loaded easily into Azure Synapse
Analytics.
However, many businesses don’t have the resources to employ both a data
scientist and a data engineer. Yet, in these days of big data, globalization, and
online commerce, even modestly sized teams may be handling challenging
issues. In such cases, there are four key recommendations which will enable you
to address the same concerns that we’ve discussed in this whitepaper.
Firstly, train existing IT team members in the basics of machine learning. They
do not need to become experts, but the more familiar an IT team becomes with
the data science methodology, how models work, and the data that is required,
the more they can effectively support machine learning. There are numerous
online courses available, reasonably priced and often free, which cover the
fundamentals, and can even be quite advanced.
Secondly, the data scientist will need to accept that they have quite a lot of work
to do in putting models into production. This work will require close cooperation
with IT to ensure that data pipelines, scripts, notebooks, and so on are ready
for the enterprise. The data scientist will also need to learn about the governance,
compliance, and security needs of the business. The whitepaper Seven Key
Principles of Cloud Security and Privacy is a good place to start.
Thirdly, the data scientist (singular) must take on some data engineering work.
That data scientist will be your first hire. But if you are serious about expanding
your machine learning footprint—and that is highly likely as your work
progresses—your second hire should be a data engineer, not another
data scientist.
Finally, your choice of tools and platforms will be central to your success.
Azure Synapse Analytics and the Azure Machine Learning service, with Power BI at
the front end, are simple to deploy and maintain, while scaling and growing with
your business very effectively. There is no better platform on which to start a
data science practice.