Python Tools for Data Scientists Pocket Primer: A Quick Guide to Essential Python Libraries for Data Science
Ebook, 764 pages, 4 hours

About this ebook

This book, part of the best-selling Pocket Primer series, offers a comprehensive introduction to essential Python tools for data scientists. It begins with an overview of Python basics, followed by in-depth coverage of NumPy and Pandas, focusing on their features and applications. The text also addresses the critical tasks of writing regular expressions and performing data cleaning.
Further sections delve into data visualization techniques and the use of Sklearn and SciPy, providing practical knowledge and skills for handling complex data analysis tasks. This structured approach ensures that readers gain a complete understanding of the tools and techniques necessary for effective data science.
Designed to be accessible yet thorough, this book includes numerous code samples to reinforce learning. Companion files with source code are available for download, making it an invaluable resource for anyone looking to master Python for data science and enhance their data analysis capabilities.

Language: English
Release date: Aug 12, 2024
ISBN: 9781836643487

    Python Tools for Data Scientists Pocket Primer - Mercury Learning and Information

    PYTHON TOOLS FOR DATA SCIENTISTS

    Pocket Primer

    LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

    By purchasing or using this book and companion files (the Work), you agree that this license grants permission to use the contents contained herein, including the disc, but does not give you the right of ownership to any of the textual content in the book / disc or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

    MERCURY LEARNING AND INFORMATION (MLI or the Publisher) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (the software), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold as is without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

    The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

    The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or disc, and only at the discretion of the Publisher. The use of implied warranty and certain exclusions vary from state to state, and might not apply to the purchaser of this product.

    Companion files for this title are available by writing to the publisher at info@merclearning.com.

    PYTHON TOOLS FOR DATA SCIENTISTS

    Pocket Primer

    Oswald Campesato

    Copyright ©2023 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

    This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

    Publisher: David Pallai

    MERCURY LEARNING AND INFORMATION

    22841 Quicksilver Drive

    Dulles, VA 20166

    info@merclearning.com

    www.merclearning.com

    800-232-0223

    O. Campesato. Python Tools for Data Scientists Pocket Primer.

    ISBN: 978-1-68392-823-2

    The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

    Library of Congress Control Number: 2022943452

    22 23 24 3 2 1

    This book is printed on acid-free paper in the United States of America.

    Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

    All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files (figures and code listings) for this title are available by contacting info@merclearning.com. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

    I’d like to dedicate this book to my parents –

    may this bring joy and happiness into their lives.

    CONTENTS

    Preface

    Chapter 1: Introduction to Python

    Tools for Python

    easy_install and pip

    virtualenv

    Python Installation

    Setting the PATH Environment Variable (Windows Only)

    Launching Python on Your Machine

    The Python Interactive Interpreter

    Python Identifiers

    Lines, Indentations, and Multi-Lines

    Quotation and Comments in Python

    Saving Your Code in a Module

    Some Standard Modules in Python

    The help() and dir() Functions

    Compile Time and Runtime Code Checking

    Simple Data Types in Python

    Working with Numbers

    Working with Other Bases

    The chr() Function

    The round() Function in Python

    Formatting Numbers in Python

    Unicode and UTF-8

    Working with Unicode

    Listing 1.1: Unicode1.py

    Working with Strings

    Comparing Strings

    Listing 1.2: Compare.py

    Formatting Strings in Python

    Uninitialized Variables and the Value None in Python

    Slicing and Splicing Strings

    Testing for Digits and Alphabetic Characters

    Listing 1.3: CharTypes.py

    Search and Replace a String in Other Strings

    Listing 1.4: FindPos1.py

    Listing 1.5: Replace1.py

    Remove Leading and Trailing Characters

    Listing 1.6: Remove1.py

    Printing Text without NewLine Characters

    Text Alignment

    Working with Dates

    Listing 1.7: Datetime2.py

    Listing 1.8: datetime2.out

    Converting Strings to Dates

    Listing 1.9: String2Date.py

    Exception Handling in Python

    Listing 1.10: Exception1.py

    Handling User Input

    Listing 1.11: UserInput1.py

    Listing 1.12: UserInput2.py

    Listing 1.13: UserInput3.py

    Command-Line Arguments

    Listing 1.14: Hello.py

    Summary

    Chapter 2: Introduction to NumPy

    What is NumPy?

    Useful NumPy Features

    What are NumPy Arrays?

    Listing 2.1: nparray1.py

    Working with Loops

    Listing 2.2: loop1.py

    Appending Elements to Arrays (1)

    Listing 2.3: append1.py

    Appending Elements to Arrays (2)

    Listing 2.4: append2.py

    Multiplying Lists and Arrays

    Listing 2.5: multiply1.py

    Doubling the Elements in a List

    Listing 2.6: double_list1.py

    Lists and Exponents

    Listing 2.7: exponent_list1.py

    Arrays and Exponents

    Listing 2.8: exponent_array1.py

    Math Operations and Arrays

    Listing 2.9: mathops_array1.py

    Working with −1 Sub-ranges with Vectors

    Listing 2.10: npsubarray2.py

    Working with −1 Sub-ranges with Arrays

    Listing 2.11: np2darray2.py

    Other Useful NumPy Methods

    Arrays and Vector Operations

    Listing 2.12: array_vector.py

    NumPy and Dot Products (1)

    Listing 2.13: dotproduct1.py

    NumPy and Dot Products (2)

    Listing 2.14: dotproduct2.py

    NumPy and the Length of Vectors

    Listing 2.15: array_norm.py

    NumPy and Other Operations

    Listing 2.16: otherops.py

    NumPy and the reshape() Method

    Listing 2.17: numpy_reshape.py

    Calculating the Mean and Standard Deviation

    Listing 2.18: sample_mean_std.py

    Code Sample with Mean and Standard Deviation

    Listing 2.19: stat_values.py

    Trimmed Mean and Weighted Mean

    Working with Lines in the Plane (Optional)

    Plotting Randomized Points with NumPy and Matplotlib

    Listing 2.20: np_plot.py

    Plotting a Quadratic with NumPy and Matplotlib

    Listing 2.21: np_plot_quadratic.py

    What is Linear Regression?

    What is Multivariate Analysis?

    What about Non-Linear Datasets?

    The MSE (Mean Squared Error) Formula

    Other Error Types

    Non-Linear Least Squares

    Calculating the MSE Manually

    Find the Best-Fitting Line in NumPy

    Listing 2.22: find_best_fit.py

    Calculating MSE by Successive Approximation (1)

    Listing 2.23: plain_linreg1.py

    Calculating MSE by Successive Approximation (2)

    Listing 2.24: plain_linreg2.py

    Google Colaboratory

    Uploading CSV Files in Google Colaboratory

    Listing 2.25: upload_csv_file.ipynb

    Summary

    Chapter 3: Introduction to Pandas

    What is Pandas?

    Pandas Options and Settings

    Pandas Data Frames

    Data Frames and Data Cleaning Tasks

    Alternatives to Pandas

    A Pandas Data Frame with a NumPy Example

    Listing 3.1: pandas_df.py

    Describing a Pandas Data Frame

    Listing 3.2: pandas_df_describe.py

    Pandas Boolean Data Frames

    Listing 3.3: pandas_boolean_df.py

    Transposing a Pandas Data Frame

    Pandas Data Frames and Random Numbers

    Listing 3.4: pandas_random_df.py

    Listing 3.5: pandas_combine_df.py

    Reading CSV Files in Pandas

    Listing 3.6: sometext.txt

    Listing 3.7: read_csv_file.py

    The loc() and iloc() Methods in Pandas

    Converting Categorical Data to Numeric Data

    Listing 3.8: cat2numeric.py

    Listing 3.9: shirts.csv

    Listing 3.10: shirts.py

    Matching and Splitting Strings in Pandas

    Listing 3.11: shirts_str.py

    Converting Strings to Dates in Pandas

    Listing 3.12: string2date.py

    Merging and Splitting Columns in Pandas

    Listing 3.13: employees.csv

    Listing 3.14: emp_merge_split.py

    Combining Pandas Data Frames

    Listing 3.15: concat_frames.py

    Data Manipulation with Pandas Data Frames (1)

    Listing 3.16: pandas_quarterly_df1.py

    Data Manipulation with Pandas Data Frames (2)

    Listing 3.17: pandas_quarterly_df2.py

    Data Manipulation with Pandas Data Frames (3)

    Listing 3.18: pandas_quarterly_df3.py

    Pandas Data Frames and CSV Files

    Listing 3.19: weather_data.py

    Listing 3.20: people.csv

    Listing 3.21: people_pandas.py

    Managing Columns in Data Frames

    Switching Columns

    Appending Columns

    Deleting Columns

    Inserting Columns

    Scaling Numeric Columns

    Listing 3.22: numbers.csv

    Listing 3.23: scale_columns.py

    Managing Rows in Pandas

    Selecting a Range of Rows in Pandas

    Listing 3.24: duplicates.csv

    Listing 3.25: row_range.py

    Finding Duplicate Rows in Pandas

    Listing 3.26: duplicates.py

    Listing 3.27: drop_duplicates.py

    Inserting New Rows in Pandas

    Listing 3.28: emp_ages.csv

    Listing 3.29: insert_row.py

    Handling Missing Data in Pandas

    Listing 3.30: employees2.csv

    Listing 3.31: missing_values.py

    Multiple Types of Missing Values

    Listing 3.32: employees3.csv

    Listing 3.33: missing_multiple_types.py

    Test for Numeric Values in a Column

    Listing 3.34: test_for_numeric.py

    Replacing NaN Values in Pandas

    Listing 3.35: missing_fill_drop.py

    Sorting Data Frames in Pandas

    Listing 3.36: sort_df.py

    Working with groupby() in Pandas

    Listing 3.37: groupby1.py

    Working with apply() and mapapply() in Pandas

    Listing 3.38: apply1.py

    Listing 3.39: apply2.py

    Listing 3.40: mapapply1.py

    Listing 3.41: mapapply2.py

    Handling Outliers in Pandas

    Listing 3.42: outliers_zscores.py

    Pandas Data Frames and Scatterplots

    Listing 3.43: pandas_scatter_df.py

    Pandas Data Frames and Simple Statistics

    Listing 3.44: housing.csv

    Listing 3.45: housing_stats.py

    Aggregate Operations in Pandas Data Frames

    Listing 3.46: aggregate1.py

    Aggregate Operations with the titanic.csv Dataset

    Listing 3.47: aggregate2.py

    Save Data Frames as CSV Files and Zip Files

    Listing 3.48: save2csv.py

    Pandas Data Frames and Excel Spreadsheets

    Listing 3.49: write_people_xlsx.py

    Listing 3.50: read_people_xslx.py

    Working with JSON-based Data

    Python Dictionary and JSON

    Listing 3.51: dict2json.py

    Python, Pandas, and JSON

    Listing 3.52: pd_python_json.py

    Useful One-line Commands in Pandas

    What is Method Chaining?

    Pandas and Method Chaining

    Pandas Profiling

    Listing 3.53: titanic.csv

    Listing 3.54: profile_titanic.py

    Summary

    Chapter 4: Working with Sklearn and Scipy

    What is Sklearn?

    Sklearn Features

    The Digits Dataset in Sklearn

    Listing 4.1: load_digits1.py

    Listing 4.2: load_digits2.py

    Listing 4.3: sklearn_digits.py

    The train_test_split() Class in Sklearn

    Selecting Columns for X and y

    What is Feature Engineering?

    The Iris Dataset in Sklearn (1)

    Listing 4.4: sklearn_iris1.py

    Sklearn, Pandas, and the Iris Dataset

    Listing 4.5: pandas_iris.py

    The Iris Dataset in Sklearn (2)

    Listing 4.6: sklearn_iris2.py

    The Faces Dataset in Sklearn (Optional)

    Listing 4.7: sklearn_faces.py

    What is SciPy?

    Installing SciPy

    Permutations and Combinations in SciPy

    Listing 4.8: scipy_perms.py

    Listing 4.9: scipy_combinatorics.py

    Calculating Log Sums

    Listing 4.10: scipy_matrix_inv.py

    Calculating Polynomial Values

    Listing 4.11: scipy_poly.py

    Calculating the Determinant of a Square Matrix

    Listing 4.12: scipy_determinant.py

    Calculating the Inverse of a Matrix

    Listing 4.13: scipy_matrix_inv.py

    Calculating Eigenvalues and Eigenvectors

    Listing 4.14: scipy_eigen.py

    Calculating Integrals (Calculus)

    Listing 4.15: scipy_integrate.py

    Calculating Fourier Transforms

    Listing 4.16: scipy_fourier.py

    Flipping Images in SciPy

    Listing 4.17: scipy_flip_image.py

    Rotating Images in SciPy

    Listing 4.18: scipy_rotate_image.py

    Google Colaboratory

    Uploading CSV Files in Google Colaboratory

    Listing 4.19: upload_csv_file.ipynb

    Summary

    Chapter 5: Data Cleaning Tasks

    What is Data Cleaning?

    Data Cleaning for Personal Titles

    Data Cleaning in SQL

    Replace NULL with 0

    Replace NULL Values with the Average Value

    Listing 5.1: replace_null_values.sql

    Replace Multiple Values with a Single Value

    Listing 5.2: reduce_values.sql

    Handle Mismatched Attribute Values

    Listing 5.3: type_mismatch.sql

    Convert Strings to Date Values

    Listing 5.4: str_to_date.sql

    Data Cleaning from the Command Line (optional)

    Working with the sed Utility

    Listing 5.5: delimiter1.txt

    Listing 5.6: delimiter1.sh

    Working with Variable Column Counts

    Listing 5.7: variable_columns.csv

    Listing 5.8: variable_columns.sh

    Listing 5.9: variable_columns2.sh

    Truncating Rows in CSV Files

    Listing 5.10: variable_columns3.sh

    Generating Rows with Fixed Columns with the awk Utility

    Listing 5.11: FixedFieldCount1.sh

    Listing 5.12: employees.txt

    Listing 5.13: FixedFieldCount2.sh

    Converting Phone Numbers

    Listing 5.14: phone_numbers.txt

    Listing 5.15: phone_numbers.sh

    Converting Numeric Date Formats

    Listing 5.16: dates.txt

    Listing 5.17: dates.sh

    Listing 5.18: dates2.sh

    Converting Alphabetic Date Formats

    Listing 5.19: dates2.txt

    Listing 5.20: dates3.sh

    Working with Date and Time Date Formats

    Listing 5.21: date-times.txt

    Listing 5.22: date-times-padded.sh

    Working with Codes, Countries, and Cities

    Listing 5.23: country_codes.csv

    Listing 5.24: add_country_codes.sh

    Listing 5.25: countries_cities.csv

    Listing 5.26: split_countries_codes.sh

    Listing 5.27: countries_cities2.csv

    Listing 5.28: split_countries_codes2.sh

    Data Cleaning on a Kaggle Dataset

    Listing 5.29: convert_marketing.sh

    Summary

    Chapter 6: Data Visualization

    What is Data Visualization?

    Types of Data Visualization

    What is Matplotlib?

    Diagonal Lines in Matplotlib

    Listing 6.1: diagonallines.py

    A Colored Grid in Matplotlib

    Listing 6.2: plotgrid2.py

    Randomized Data Points in Matplotlib

    Listing 6.3: lin_plot_reg.py

    A Histogram in Matplotlib

    Listing 6.4: histogram1.py

    A Set of Line Segments in Matplotlib

    Listing 6.5: line_segments.py

    Plotting Multiple Lines in Matplotlib

    Listing 6.6: plt_array2.py

    Trigonometric Functions in Matplotlib

    Listing 6.7: sincos.py

    Display IQ Scores in Matplotlib

    Listing 6.8: iq_scores.py

    Plot a Best-Fitting Line in Matplotlib

    Listing 6.9: plot_best_fit.py

    The Iris Dataset in SkLearn

    Listing 6.10: sklearn_iris1.py

    SkLearn, Pandas, and the Iris Dataset

    Listing 6.11: pandas_iris.py

    Working with Seaborn

    Features of Seaborn

    Seaborn Built-in Datasets

    Listing 6.12: seaborn_tips.py

    The Iris Dataset in Seaborn

    Listing 6.13: seaborn_iris.py

    The Titanic Dataset in Seaborn

    Listing 6.14: seaborn_titanic_plot.py

    Extracting Data from the Titanic Dataset in Seaborn (1)

    Listing 6.15: seaborn_titanic.py

    Extracting Data from the Titanic Dataset in Seaborn (2)

    Listing 6.16: seaborn_titanic2.py

    Visualizing a Pandas Dataset in Seaborn

    Listing 6.17: pandas_seaborn.py

    Data Visualization in Pandas

    Listing 6.18: pandas_viz1.py

    What is Bokeh?

    Listing 6.19: bokeh_trig.py

    Summary

    Appendix A: Working with Data

    What are Datasets?

    Data Preprocessing

    Data Types

    Preparing Datasets

    Discrete Data vs. Continuous Data

    Binning Continuous Data

    Scaling Numeric Data via Normalization

    Scaling Numeric Data via Standardization

    What to Look for in Categorical Data

    Mapping Categorical Data to Numeric Values

    Working with Dates

    Working with Currency

    Missing Data, Anomalies, and Outliers

    Missing Data

    Anomalies and Outliers

    Outlier Detection

    What is Data Drift?

    What is Imbalanced Classification?

    What is SMOTE?

    SMOTE Extensions

    Analyzing Classifiers (Optional)

    What is LIME?

    What is ANOVA?

    The Bias-Variance Trade-Off

    Types of Bias in Data

    Summary

    Appendix B: Working with awk

    The awk Command

    Built-in Variables that Control awk

    How Does the awk Command Work?

    Aligning Text with the printf Statement

    Listing B.1: columns2.txt

    Listing B.2: AlignColumns1.sh

    Conditional Logic and Control Statements

    The while Statement

    A for loop in awk

    Listing B.3: Loop.sh

    A for loop with a break Statement

    The next and continue Statements

    Deleting Alternate Lines in Datasets

    Listing B.4: linepairs.csv

    Listing B.5: deletelines.sh

    Merging Lines in Datasets

    Listing B.6: columns.txt

    Listing B.7: ColumnCount1.sh

    Printing File Contents as a Single Line

    Joining Groups of Lines in a Text File

    Listing B.8: digits.txt

    Listing B.9: digits.sh

    Joining Alternate Lines in a Text File

    Listing B.10: columns2.txt

    Listing B.11: JoinLines.sh

    Listing B.12: JoinLines2.sh

    Listing B.13: JoinLines2.sh

    Matching with Meta Characters and Character Sets

    Listing B.14: Patterns1.sh

    Listing B.15: columns3.txt

    Listing B.16: MatchAlpha1.sh

    Printing Lines Using Conditional Logic

    Listing B.17: products.txt

    Splitting Filenames with awk

    Listing B.18: SplitFilename2.sh

    Working with Postfix Arithmetic Operators

    Listing B.19: mixednumbers.txt

    Listing B.20: AddSubtract1.sh

    Numeric Functions in awk

    One Line awk Commands

    Useful Short awk Scripts

    Listing B.21: data.txt

    Printing the Words in a Text String in awk

    Listing B.22: Fields2.sh

    Count Occurrences of a String in Specific Rows

    Listing B.23: data1.csv

    Listing B.24: data2.csv

    Listing B.25: checkrows.sh

    Printing a String in a Fixed Number of Columns

    Listing B.26: FixedFieldCount1.sh

    Printing a Dataset in a Fixed Number of Columns

    Listing B.27: VariableColumns.txt

    Listing B.28: Fields3.sh

    Aligning Columns in Datasets

    Listing B.29: mixed-data.csv

    Listing B.30: mixed-data.sh

    Aligning Columns and Multiple Rows in Datasets

    Listing B.31: mixed-data2.csv

    Listing B.32: aligned-data2.csv

    Listing B.33: mixed-data2.sh

    Removing a Column from a Text File

    Listing B.34: VariableColumns.txt

    Listing B.35: RemoveColumn.sh

    Subsets of Column-aligned Rows in Datasets

    Listing B.36: sub-rows-cols.txt

    Listing B.37: sub-rows-cols.sh

    Counting Word Frequency in Datasets

    Listing B.38: WordCounts1.sh

    Listing B.39: WordCounts2.sh

    Listing B.40: columns4.txt

    Displaying Only Pure Words in a Dataset

    Listing B.41: onlywords.sh

    Working with Multi-line Records in awk

    Listing B.42: employees.txt

    Listing B.43: employees.sh

    A Simple Use Case

    Listing B.44: quotes3.csv

    Listing B.45: delim1.sh

    Another Use Case

    Listing B.46: dates2.csv

    Listing B.47: string2date2.sh

    Summary

    Index

    PREFACE

    What is the Primary Value Proposition for this Book?

    This book is a fast-paced introduction that covers as much relevant information about Python tools for data scientists as can reasonably fit in a book of this size. If you are a novice, this book will give you a starting point from which you can decide which Python technologies you want to explore in greater detail.

    You will be exposed to features of NumPy and Pandas, how to write regular expressions, and how to perform data cleaning tasks. Some topics are presented in a cursory manner, for two main reasons. First, it’s important that you be exposed to these concepts. In some cases, you will find topics that might pique your interest, and hence motivate you to learn more about them through self-study; in other cases, you will probably be satisfied with a brief introduction. In other words, you decide whether to delve deeply into each of the topics in this book.

    Second, a full treatment of all the topics that are covered in this book would significantly increase its size, and few people are interested in reading technical tomes with 500 or more pages.

    However, it’s important for you to decide if this approach is suitable for your needs and learning style. If not, you can select one or more of the plethora of data analytics books that are available.

    The Target Audience

    This book is intended primarily for people who have worked with Python and are interested in learning about several important Python libraries. It is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. Consequently, this book uses standard English rather than colloquial expressions that might be confusing to those readers. Many people learn through different types of imitation, which include reading, writing, or hearing new material. This book takes these points into consideration to provide a comfortable and meaningful learning experience for the intended readers.

    What Will I Learn from This Book?

    The first chapter contains a quick tour of basic Python. Next, Chapter 2 introduces you to NumPy, followed by Chapter 3 for Pandas. Chapter 4 provides a high-level view of Sklearn, which is an extremely powerful Python library that is central to many machine learning tasks, and also introduces SciPy.

    Chapter 5 contains an assortment of data cleaning tasks that are solved via Python as well as the awk programming language. Chapter 6 delves into data visualization with Matplotlib, Seaborn, and Bokeh. Next, one appendix explores issues that can arise with data, followed by an appendix for awk.

    Why is an Appendix for awk Included in This Book?

    While many data cleaning tasks can be performed via Python, sometimes it’s much easier to perform data cleaning via awk. If you have not worked with awk, it’s a venerable Unix utility that was developed almost 50 years ago by Aho, Weinberger, and Kernighan (the last of whom is a coauthor of the famous K&R book for C).

    Incidentally, most of the Python code samples are short (usually less than one page and sometimes less than half a page), and if need be, you can easily and quickly copy/paste the code into a new Jupyter notebook. For the Python code samples that reference a CSV file, you do not need any additional code in the corresponding Jupyter notebook to access the CSV file. Moreover, the code samples execute quickly, so you won’t need to avail yourself of the free GPU that is provided in Google Colaboratory.

    If you do decide to use Google Colaboratory, you can easily copy/paste the Python code into a notebook, and also use the upload feature to upload existing Jupyter notebooks. Keep in mind the following point: if the Python code references a CSV file, make sure that you include the appropriate code snippet (as explained in Chapter 2) to access the CSV file in the corresponding Jupyter notebook in Google Colaboratory.
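
    As a rough illustration of this point, and not the snippet from the book itself, the following sketch shows one way a notebook might obtain a CSV file in either environment. The helper names running_in_colab() and read_csv_rows() are hypothetical. The files.upload() call from the google.colab package is available only inside Google Colaboratory, so the code falls back to reading a local file everywhere else.

```python
import csv
import io
from pathlib import Path


def running_in_colab() -> bool:
    """Return True only when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (present only inside Colaboratory)
        return True
    except ImportError:
        return False


def read_csv_rows(filename: str) -> list[dict]:
    """Read a CSV file into a list of dictionaries, one per data row.

    Hypothetical helper: a local file is read directly; in Colaboratory,
    a file that is not already present is requested through an upload.
    """
    path = Path(filename)
    if running_in_colab() and not path.exists():
        from google.colab import files

        # Opens the Colab file picker; the result maps each uploaded
        # file name to the raw bytes of that file.
        uploaded = files.upload()
        text = uploaded[path.name].decode("utf-8")
    else:
        text = path.read_text(encoding="utf-8")
    # The first line of the CSV file is treated as the header row.
    return list(csv.DictReader(io.StringIO(text)))
```

    In a local Jupyter notebook, calling read_csv_rows() with the name of a CSV file reads that file from the working directory. In Colaboratory, the same call prompts for an upload when the file is not already present in the session.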

    Do I Need to Learn the Theory Portions of this Book?

    The answer depends on the extent to which you plan to become involved in data analytics. For example, if you plan to study machine learning, then you will probably learn how to create and train a model, which is a task that is performed after data cleaning tasks. In general, you will probably need to learn everything that you encounter in this book if you are planning to become a machine learning engineer.

    Why Does This Book Include Sklearn Material?

    The amount of Sklearn material in this book is minimal because this book is not about machine learning. The Sklearn material is located in Chapter 4, where you will learn about some of the Sklearn built-in datasets. If you decide to delve into machine learning, you will have already been introduced to some aspects of Sklearn.

    Getting the Most from This Book

    Some programmers learn well from prose, while others learn well from sample code (and lots of it), which means that there’s no single style that can be used for everyone.

    Moreover, some programmers want to run the