Principles of Data Wrangling Practical Techniques For Data Preparation 1st Edition Tye Rattenbury
Principles of Data Wrangling
Practical Techniques for Data Preparation
Let’s begin with the most important question: why should you read
this book? The answer is simple: you want more value from your
data. To put a little more meat on that statement, our objective in
writing this book is to help the variety of people who manage the
analysis or application of data in their organizations. The data might
or might not be “yours,” in the strict sense of ownership. But the
pains in extracting value from this data are.
We’re focused on two kinds of readers. First are people who manage
the analysis and application of data indirectly—the managers of
teams or directors of data projects. Second are people who work
with data directly—the analysts, engineers, architects, statisticians,
and scientists.
If you’re reading this book, you’re interested in extracting value from
data. We can categorize this value into two types along a temporal
dimension: near-term value and long-term value. In the near term,
you likely have a sizable list of questions that you want to answer
using your data. Some of these questions might be vague; for
example, “Are people really shifting toward interacting with us
through their mobile devices?” Other questions might be more
specific: “When will our customers’ interactions primarily originate
from mobile devices instead of from desktops or laptops?”
What is stopping you from answering these questions? The most
common answer we hear is “time.” You know the questions, you
know how to answer them, but you just don’t have enough hours in
the day to wrangle your data into the right form.
Beyond the list of known questions related to the near-term value of
your data is the optimism that your data has greater potential long-
term value. Can you use it to forecast important seasonal changes?
What about risks in your supply chain due to weather or geopolitical
shifts? Can you understand how the move to mobile is affecting your
customers’ purchasing patterns? Organizations generally hire data
scientists to take on these longer-term, exploratory analyses. But
even if you have the requisite skills to tackle these kinds of analyses,
you might still struggle to be allocated sufficient time and resources.
After all, exploratory analytics projects can take months, and often
contain a nontrivial risk of producing primarily negative or
ambiguous results.
As we’ve seen, the primary impediment to realizing both the short-
term and long-term value of your data is time: your limited time and
your organization’s limited time. In this book, we describe how
improving your data wrangling efforts can create the time required
to get more near-term and long-term value from your data. In
Chapters 1-3, we describe a workflow framework that links activities
focused on both kinds of value, and explain how data wrangling
factors into those activities and into the overall workflow framework.
We introduce the basic building blocks for a data wrangling project:
data flow, data wrangling activities, roles, and responsibilities. These
are all elements that you will want to consider, at a high level, when
embarking on a project that involves data wrangling. Our goal is to
provide some helpful guidance and tips on how to coordinate your
data wrangling efforts, both across multiple projects by making sure
your wrangling efforts are constructive as opposed to redundant or
conflicting, and within a single project by taking advantage of some
standard language and operations to increase productivity and
consistency.
There’s more to effective data wrangling than just clearly defined
workflows and processes; to most effectively wrangle your data, you
should also understand which transformation actions constitute data
wrangling, and, most important, how you can use those
transformations to produce the best datasets for your analytic
activities.
Those nitty-gritty transformations constitute our discussion in
Chapters 4-7. You can think of those chapters as a rough “how-to”
guide for data wrangling. That said, we do not intend this book to
provide a comprehensive tutorial on all possible data wrangling
methods. Instead, we want to give you a collection of techniques
that you can use when moving through the stages of the data
workflow framework.
As we introduce each of the key transformation and profiling
activities that comprise data wrangling, we will walk through a
theoretical data project involving a publicly available dataset
containing US campaign finance information. You can walk through
the project along with us in your data wrangling tool of choice.
Finally, we end by discussing roles and responsibilities in a data
wrangling project in Chapter 8, and exploring a selection of data
wrangling tools in Chapter 9.
Throughout the book, we ground our discussion in example data,
transformations of that data, and various visual and statistical views
of that data. Along those lines, we open with a story about
Facebook.
Chapter 2. A Data Workflow Framework
In the raw stage, the primary goal is to discover the data. When
examining raw data, you ask questions aimed at understanding what
your data looks like. For example:
Armed with an understanding of the data, you can then refine the
data for deeper exploration by removing unusable parts of the data,
reshaping poorly formatted elements, and establishing relationships
between multiple datasets. Assessing potential data quality issues is
also frequently a concern during the refined stage, because quality
issues might negatively affect any automated use of the data
downstream.
Finally, after you understand the data’s quality and potential
applications in automated systems, you can move the data to the
production stage. At this point, production-quality data can feed
automated products and services, or enter previously established
pipelines that drive regular reporting and analytics activities.
A minority of data projects will end in the raw or production stages.
The majority will end in the refined stage. Projects ending in the
refined stage will add indirect value by delivering insights and
models that drive better decisions. In some cases, these projects
might last multiple years. Google’s Project Oxygen is a great
example of a project that ended in the refined stage.1 Realizing that
managing people is a critical skill for a successful organization,
Google kicked off a multiyear study to assess the characteristics of a
good manager and then test how effective they could be at teaching
those characteristics. The results of the study indirectly influenced
employee behavior, but the study data itself was not incorporated
into a production pipeline.
The hand-off between IT shared services organizations and lines of
business traditionally occurs in the refined stage. In such an
environment, IT is responsible for Extract-Transform-Load (ETL)
operations. ETL moves data through the three data stages in a
centrally controlled manner. Lines of business own the data analysis
process, including everything from reporting and ad hoc research
tasks, to advanced modeling and forecasting, to data-driven
operational changes. This division of concerns and responsibilities
has two intended benefits: basic data governance due to centralized
data processing, and efficiency gains due to IT engineers reusing
broadly useful data transformations.
However, in practice, the perceived benefits of centrally transforming
data are often eclipsed by the reality of organizational inefficiencies
and bottlenecks. Most of these bottlenecks arise from line-of-
business analysts being dependent upon IT. In the age of agile
analytics and data-driven services, there is increasing pressure to
speed up the extraction of value from your data. Unsurprisingly, the
best plan of attack involves identifying and removing bottlenecks.
In our experience, there are two primary bottlenecks. The first
bottleneck is the time it takes to wrangle your data. Even when you
start from refined data, there are often nontrivial transformations
required to prepare your data for analysis. These transformations
can include removing unnecessary records, joining in additional
information, aggregating data, or pivoting datasets. We will discuss
each of these common transformation actions in more detail in later
chapters.
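To make these four transformation actions concrete, here is a minimal sketch in plain Python. The records, field names, and helper functions are all hypothetical, invented purely for illustration; they are not taken from the dataset used later in the book, and a dedicated wrangling tool would express the same steps differently.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
orders = [
    {"customer_id": 1, "channel": "mobile", "amount": 20.0},
    {"customer_id": 2, "channel": "desktop", "amount": 35.0},
    {"customer_id": 1, "channel": "desktop", "amount": 15.0},
    {"customer_id": 3, "channel": "mobile", "amount": 0.0},
]
# Hypothetical lookup table used for the join step.
regions = {1: "west", 2: "east", 3: "west"}


def remove_records(rows):
    """Remove unnecessary records (assumed here: zero-amount orders)."""
    return [r for r in rows if r["amount"] > 0]


def join_region(rows, lookup):
    """Join in additional information from a lookup keyed by customer_id."""
    return [{**r, "region": lookup.get(r["customer_id"])} for r in rows]


def aggregate_by_customer(rows):
    """Aggregate: total amount per customer."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer_id"]] += r["amount"]
    return dict(totals)


def pivot_by_channel(rows):
    """Pivot: one entry per customer, one key per channel."""
    table = defaultdict(lambda: defaultdict(float))
    for r in rows:
        table[r["customer_id"]][r["channel"]] += r["amount"]
    return {cust: dict(cols) for cust, cols in table.items()}


kept = remove_records(orders)
joined = join_region(kept, regions)
totals = aggregate_by_customer(joined)
pivoted = pivot_by_channel(joined)
```

Each step takes a list of records and returns a new structure, so the steps can be reordered or dropped depending on what a given analysis needs.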
The second bottleneck is the simple capacity mismatch that arises
when a large pool of analysts relies on a small pool of IT
professionals to prepare “refined” data for them. Removing this
bottleneck is more of an organizational challenge than anything else,
and it involves expanding the range of users who have access to raw
data and providing them with the requisite training and skills.
To help motivate these organizational changes, let’s step back and
consider the gross mechanics of successfully using data. The most
valuable uses of your data will be production uses that take the form
of automated reports or data-driven services and products. But
every production use of your data depends on hundreds or even
thousands of exploratory, ad hoc analyses. In other words, there is a
funnel of effort leading to direct, production value that begins with
exploratory analytics. And, as with any funnel, your conversion
rate will not be 100 percent. You’ll need as many people as possible
exploring your data and deriving insights in order to discover a
relatively small number of valuable applications of your data.
As Figure 2-1 demonstrates, a large number of raw data sources and
exploratory analyses are required to produce a single valuable
application of your data.
Creating Metadata
In most cases, the data that you are ingesting during the raw data
stage is known; that is, you know what you are going to get and
how to work with it. But what happens when your organization adds
a new data source? In other words, what do you do when your data
is partially or completely unknown? Ingesting unknown data triggers
two additional actions, both related to the creation of metadata. One
action is focused on understanding the characteristics of your data,
or describing your data. We refer to this action as generating generic
metadata. A second action is focused on using the characteristics of
your data to make a determination about your data’s value. This
action involves creating custom metadata.
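As a rough illustration of the two actions, the sketch below first derives generic metadata (the types observed in each field, how many values are missing, and how many distinct values appear) and then applies one possible custom rule on top of it. The function names, the sample records, and the `max_missing` threshold are assumptions made for this example only; what counts as "valuable" data depends on your own project.

```python
def generic_metadata(rows):
    """Describe each field: observed types, missing count, distinct count."""
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    summary = {}
    for name in names:
        values = [row.get(name) for row in rows]
        # Treat None and the empty string as missing (an assumption).
        present = [v for v in values if v is not None and v != ""]
        summary[name] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary


def custom_metadata(summary, n_rows, max_missing=0.5):
    """Hypothetical value judgment: a field is usable if it has a single
    observed type and no more than max_missing of its values are missing."""
    return {
        name: len(info["types"]) <= 1 and info["missing"] / n_rows <= max_missing
        for name, info in summary.items()
    }


# Hypothetical records, invented for illustration only.
rows = [
    {"donor": "A. Smith", "amount": 250, "state": "CA"},
    {"donor": "B. Jones", "amount": None, "state": "CA"},
    {"donor": "C. Lee", "amount": "100", "state": ""},
]
meta = generic_metadata(rows)
usable = custom_metadata(meta, len(rows))
```

In this sample the `amount` field mixes integers and strings, so the custom rule flags it as not yet usable even though only one of its values is missing.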
Another random document with
no related content on Scribd:
day, deafness; the fever increased; urine the same. On the twentieth
and following days, much delirium. On the thirtieth, copious
hemorrhage from the nose, and became more collected; deafness
continued, but less; the fever diminished; on the following days,
frequent hemorrhages, at short intervals. About the sixtieth, the
hemorrhages ceased, but violent pain of the hip-joint, and increase
of fever. Not long afterwards, pains of all the inferior parts; it then
became a rule, that either the fever and deafness increased, or, if
these abated and were lightened, the pains of the inferior parts were
increased. About the eightieth day, all the complaints gave way,
without leaving any behind; for the urine was of a good color, and
had a copious sediment, while the delirium became less. About the
hundredth day, disorder of the bowels, with copious and bilious
evacuations, and these continued for a considerable time, and again
assumed the dysenteric form with pain; but relief of all the other
complaints. On the whole, the fevers went off, and the deafness
ceased. On the hundred and twentieth day, had a complete crisis.
Ardent fever.
Explanation of the characters. It is probable that the bilious
discharge brought about the recovery on the hundred and twentieth
day.[733]
Case X.—In Abdera, Nicodemus was seized with fever from
venery and drinking. At the commencement he was troubled with
nausea and cardialgia; thirsty, tongue was parched; urine thin and
dark. On the second day, the fever exacerbated; he was troubled
with rigors and nausea; had no sleep; vomited yellow bile; urine the
same; passed a quiet night, and slept. On the third, a general
remission; amelioration; but about sunset felt again somewhat
uncomfortable; passed an uneasy night. On the fourth, rigor, much
fever, general pains; urine thin, with substances floating in it; again a
quiet night. On the fifth, all the symptoms remained, but there was an
amelioration. On the sixth, some general pains; substances floating
in the urine; very incoherent. On the seventh, better. On the eighth,
all the other symptoms abated. On the tenth, and following days,
there were pains, but all less; in this case throughout, the paroxysms
and pains were greater on the even days. On the twentieth, the urine
white and thick, but when allowed to stand had no sediment; much
sweat; seemed to be free from fever; but again in the evening he
became hot, with the same pains, rigor, thirst, slightly incoherent. On
the twenty-fourth, urine copious, white, with an abundant sediment; a
copious and warm sweat all over; apyrexia; the fever came to its
crisis.
Explanation of the characters. It is probable that the cure was
owing to the bilious evacuations and the sweats.[734]
Case XI.—In Thasus, a woman, of a melancholic turn of mind,
from some accidental cause of sorrow, while still going about,
became affected with loss of sleep, aversion to food, and had thirst
and nausea. She lived near the Pylades, upon the Plain. On the first,
at the commencement of night, frights, much talking; despondency,
slight fever; in the morning, frequent spasms, and when they ceased,
she was incoherent and talked obscurely; pains frequent, great, and
continued. On the second, in the same state; had no sleep; fever
more acute. On the third, the spasms left her; but coma, and
disposition to sleep, and again awake, started up, and could not
contain herself; much incoherence; acute fever; on that night a
copious sweat all over; apyrexia, slept, quite collected; had a crisis.
About the third day, the urine black, thin, substances floating in it
generally round, did not fall to the bottom; about the crisis a copious
menstruation.[735]
Case XII.—In Larissa,[736] a young unmarried woman was seized
with a fever of the acute and ardent type; insomnolency, thirst;
tongue sooty and dry; urine of a good color, but thin. On the second,
in an uneasy state, did not sleep. On the third, alvine discharges
copious, watery, and greenish, and on the following days passed
such with relief. On the fourth, passed a small quantity of thin urine,
having substances floating towards its surface, which did not
subside; was delirious towards night. On the sixth, a great
hemorrhage from the nose; a chill, with a copious and hot sweat all
over; apyrexia, had a crisis. In the fever, and when it had passed the
crisis, the menses took place for the first time, for she was a young
woman. Throughout she was oppressed with nausea, and rigors;
redness of the face; pain of the eyes; heaviness of the head; she
had no relapse, but the fever came to a crisis. The pains were on the
even days.[737]
Case XIII.—Apollonius, in Abdera, bore up (under the fever?) for
some time, without betaking himself to bed. His viscera were
enlarged, and for a considerable time there was a constant pain
about the liver, and then he became affected with jaundice; he was
flatulent, and of a whitish complexion. Having eaten beef, and drunk
unseasonably, he became a little heated at first, and betook himself
to bed, and having used large quantities of milk, that of goats and
sheep, and both boiled and raw, with a bad diet otherwise, great
mischief was occasioned by all these things; for the fever was
exacerbated, and of the food taken scarcely any portion worth
mentioning was passed from the bowels; the urine was thin and
scanty; no sleep; troublesome meteorism; much thirst; disposition to
coma; painful swelling of the right hypochondrium; extremities
altogether coldish; slight incoherence, forgetfulness of everything he
said; he was beside himself. About the fourteenth day after he
betook himself to bed, had a rigor, became heated, and was seized
with furious delirium; loud cries, much talking, again composed, and
then coma came on; afterwards the bowels disordered, with copious,
bilious, unmixed, and undigested stools; urine black, scanty, and
thin; much restlessness; alvine evacuations of varied characters,
either black, scanty, and verdigris-green, or fatty, undigested, and
acrid; and at times the dejections resembled milk. About the twenty-
fourth, enjoyed a calm; other matters in the same state; became
somewhat collected; remembered nothing that had happened since
he was confined to bed; immediately afterwards became delirious;
every symptom rapidly getting worse. About the thirtieth, acute fever;
stools copious and thin; was delirious; extremities cold; loss of
speech. On the thirty-fourth he died. In this case, as far as I saw, the
bowels were disordered; urine thin and black; disposition to coma;
insomnolency; extremities cold; delirious throughout. Phrenitis.[738]
Case XIV.—In Cyzicus,[739] a woman who had brought forth twin
daughters, after a difficult labor, and in whom the lochial discharge
was insufficient, at first was seized with an acute fever, attended with
chills; heaviness of the head and neck, with pain; insomnolency from
the commencement; she was silent, sullen, and disobedient; urine
thin, and devoid of color; thirst, nausea for the most part; bowels
irregularly disordered, and again constipated. On the sixth, towards
night, talked much incoherently; had no sleep. About the eleventh
day was seized with wild delirium, and again became collected; urine
black, thin, and again deficient, and of an oily appearance; copious,
thin, and disordered evacuations from the bowels. On the fourteenth,
frequent convulsions; extremities cold; not in anywise collected;
suppression of urine. On the sixteenth loss of speech. On the
seventeenth, she died. Phrenitis.
Explanation of the characters. It is probable that death was
caused, on the seventeenth day, by the affection of the brain
consequent upon her accouchement.[740]
Case XV.—In Thasus, the wife of Dealces, who was lodged upon
the Plain, from sorrow was seized with an acute fever, attended with
chills. From first to last she wrapped herself up in her bedclothes; still
silent, she fumbled, picked, bored, and gathered hairs (from them);
tears, and again laughter; no sleep; bowels irritable, but passed
nothing; when directed, drank a little; urine thin and scanty; to the
touch of the hand the fever was slight; coldness of the extremities.
On the ninth, talked much incoherently, and again became
composed and silent. On the fourteenth, breathing rare, large, at
intervals; and again hurried respiration. On the sixteenth, looseness
of the bowels from a stimulant clyster; afterwards she passed her
drink, nor could retain anything, for she was completely insensible;
skin parched and tense. On the twentieth, much talk, and again
became composed; loss of speech; respiration hurried. On the
twenty-first she died. Her respiration throughout was rare and large;
she was totally insensible; always wrapped up in her bedclothes;
either much talk, or complete silence throughout. Phrenitis.[741]
Case XVI.—In Melibœa,[742] a young man having become
heated by drinking and much venery, was confined to bed; he was
affected with rigors and nausea; insomnolency and absence of thirst.
On the first day much fæces passed from the bowels along with a
copious flux; and on the following days he passed many watery
stools of a green color; urine thin, scanty, and deficient in color;
respiration rare, large, at long intervals; softish distention of the
hypochondrium, of an oblong form, on both sides; continued
palpitation in the epigastric region throughout; passed urine of an oily
appearance. On the tenth, he had calm delirium, for he was naturally
of an orderly and quiet disposition; skin parched and tense;
dejections either copious and thin, or bilious and fatty. On the
fourteenth, all the symptoms were exacerbated; he became
delirious, and talked much incoherently. On the twentieth, wild
delirium, jactitation, passed no urine; small drinks were retained. On
the twenty-fourth he died. Phrenitis.[743]
ON INJURIES OF THE HEAD.
THE ARGUMENT.