Python Tools for Data Scientists Pocket Primer: A Quick Guide to Essential Python Libraries for Data Science
About this ebook
This book, part of the best-selling Pocket Primer series, offers a comprehensive introduction to essential Python tools for data scientists. It begins with an overview of Python basics, followed by in-depth coverage of NumPy and Pandas, focusing on their features and applications. The text also addresses the critical tasks of writing regular expressions and performing data cleaning.
Further sections delve into data visualization techniques and the use of Sklearn and SciPy, providing practical knowledge and skills for handling complex data analysis tasks. This structured approach ensures that readers gain a complete understanding of the tools and techniques necessary for effective data science.
Designed to be accessible yet thorough, this book includes numerous code samples to reinforce learning. Companion files with source code are available for download, making it an invaluable resource for anyone looking to master Python for data science and enhance their data analysis capabilities.
Python Tools for Data Scientists Pocket Primer - Mercury Learning and Information
PYTHON TOOLS
FOR
DATA SCIENTISTS
Pocket Primer
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book and companion files (the "Work"), you agree that this license grants permission to use the contents contained herein, including the disc, but does not give you the right of ownership to any of the textual content in the book / disc or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.
MERCURY LEARNING AND INFORMATION ("MLI" or "the Publisher") and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (the "software"), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold "as is" without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or disc, and only at the discretion of the Publisher. The use of "implied warranty" and certain "exclusions" vary from state to state, and might not apply to the purchaser of this product.
Companion files for this title are available by writing to the publisher at info@merclearning.com.
PYTHON TOOLS
FOR
DATA SCIENTISTS
Pocket Primer
Oswald Campesato
Copyright ©2023 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA 20166
info@merclearning.com
www.merclearning.com
800-232-0223
O. Campesato. Python Tools for Data Scientists Pocket Primer.
ISBN: 978-1-68392-823-2
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2022943452
22 23 24 3 2 1
This book is printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).
All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files (figures and code listings) for this title are available by contacting info@merclearning.com. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
I’d like to dedicate this book to my parents –
may this bring joy and happiness into their lives.
CONTENTS
Preface
Chapter 1: Introduction to Python
Tools for Python
easy_install and pip
virtualenv
Python Installation
Setting the PATH Environment Variable (Windows Only)
Launching Python on Your Machine
The Python Interactive Interpreter
Python Identifiers
Lines, Indentations, and Multi-Lines
Quotation and Comments in Python
Saving Your Code in a Module
Some Standard Modules in Python
The help() and dir() Functions
Compile Time and Runtime Code Checking
Simple Data Types in Python
Working with Numbers
Working with Other Bases
The chr() Function
The round() Function in Python
Formatting Numbers in Python
Unicode and UTF-8
Working with Unicode
Listing 1.1: Unicode1.py
Working with Strings
Comparing Strings
Listing 1.2: Compare.py
Formatting Strings in Python
Uninitialized Variables and the Value None in Python
Slicing and Splicing Strings
Testing for Digits and Alphabetic Characters
Listing 1.3: CharTypes.py
Search and Replace a String in Other Strings
Listing 1.4: FindPos1.py
Listing 1.5: Replace1.py
Remove Leading and Trailing Characters
Listing 1.6: Remove1.py
Printing Text without NewLine Characters
Text Alignment
Working with Dates
Listing 1.7: Datetime2.py
Listing 1.8: datetime2.out
Converting Strings to Dates
Listing 1.9: String2Date.py
Exception Handling in Python
Listing 1.10: Exception1.py
Handling User Input
Listing 1.11: UserInput1.py
Listing 1.12: UserInput2.py
Listing 1.13: UserInput3.py
Command-Line Arguments
Listing 1.14: Hello.py
Summary
Chapter 2: Introduction to NumPy
What is NumPy?
Useful NumPy Features
What are NumPy Arrays?
Listing 2.1: nparray1.py
Working with Loops
Listing 2.2: loop1.py
Appending Elements to Arrays (1)
Listing 2.3: append1.py
Appending Elements to Arrays (2)
Listing 2.4: append2.py
Multiplying Lists and Arrays
Listing 2.5: multiply1.py
Doubling the Elements in a List
Listing 2.6: double_list1.py
Lists and Exponents
Listing 2.7: exponent_list1.py
Arrays and Exponents
Listing 2.8: exponent_array1.py
Math Operations and Arrays
Listing 2.9: mathops_array1.py
Working with −1
Sub-ranges With Vectors
Listing 2.10: npsubarray2.py
Working with −1
Sub-ranges with Arrays
Listing 2.11: np2darray2.py
Other Useful NumPy Methods
Arrays and Vector Operations
Listing 2.12: array_vector.py
NumPy and Dot Products (1)
Listing 2.13: dotproduct1.py
NumPy and Dot Products (2)
Listing 2.14: dotproduct2.py
NumPy and the Length of Vectors
Listing 2.15: array_norm.py
NumPy and Other Operations
Listing 2.16: otherops.py
NumPy and the reshape() Method
Listing 2.17: numpy_reshape.py
Calculating the Mean and Standard Deviation
Listing 2.18: sample_mean_std.py
Code Sample with Mean and Standard Deviation
Listing 2.19: stat_values.py
Trimmed Mean and Weighted Mean
Working with Lines in the Plane (Optional)
Plotting Randomized Points with NumPy and Matplotlib
Listing 2.20: np_plot.py
Plotting a Quadratic with NumPy and Matplotlib
Listing 2.21: np_plot_quadratic.py
What is Linear Regression?
What is Multivariate Analysis?
What about Non-Linear Datasets?
The MSE (Mean Squared Error) Formula
Other Error Types
Non-Linear Least Squares
Calculating the MSE Manually
Find the Best-Fitting Line in NumPy
Listing 2.22: find_best_fit.py
Calculating MSE by Successive Approximation (1)
Listing 2.23: plain_linreg1.py
Calculating MSE by Successive Approximation (2)
Listing 2.24: plain_linreg2.py
Google Colaboratory
Uploading CSV Files in Google Colaboratory
Listing 2.25: upload_csv_file.ipynb
Summary
Chapter 3: Introduction to Pandas
What is Pandas?
Pandas Options and Settings
Pandas Data Frames
Data Frames and Data Cleaning Tasks
Alternatives to Pandas
A Pandas Data Frame with a NumPy Example
Listing 3.1: pandas_df.py
Describing a Pandas Data Frame
Listing 3.2: pandas_df_describe.py
Pandas Boolean Data Frames
Listing 3.3: pandas_boolean_df.py
Transposing a Pandas Data Frame
Pandas Data Frames and Random Numbers
Listing 3.4: pandas_random_df.py
Listing 3.5: pandas_combine_df.py
Reading CSV Files in Pandas
Listing 3.6: sometext.txt
Listing 3.7: read_csv_file.py
The loc() and iloc() Methods in Pandas
Converting Categorical Data to Numeric Data
Listing 3.8: cat2numeric.py
Listing 3.9: shirts.csv
Listing 3.10: shirts.py
Matching and Splitting Strings in Pandas
Listing 3.11: shirts_str.py
Converting Strings to Dates in Pandas
Listing 3.12: string2date.py
Merging and Splitting Columns in Pandas
Listing 3.13: employees.csv
Listing 3.14: emp_merge_split.py
Combining Pandas Data Frames
Listing 3.15: concat_frames.py
Data Manipulation with Pandas Data Frames (1)
Listing 3.16: pandas_quarterly_df1.py
Data Manipulation with Pandas Data Frames (2)
Listing 3.17: pandas_quarterly_df2.py
Data Manipulation with Pandas Data Frames (3)
Listing 3.18: pandas_quarterly_df3.py
Pandas Data Frames and CSV Files
Listing 3.19: weather_data.py
Listing 3.20: people.csv
Listing 3.21: people_pandas.py
Managing Columns in Data Frames
Switching Columns
Appending Columns
Deleting Columns
Inserting Columns
Scaling Numeric Columns
Listing 3.22: numbers.csv
Listing 3.23: scale_columns.py
Managing Rows in Pandas
Selecting a Range of Rows in Pandas
Listing 3.24: duplicates.csv
Listing 3.25: row_range.py
Finding Duplicate Rows in Pandas
Listing 3.26: duplicates.py
Listing 3.27: drop_duplicates.py
Inserting New Rows in Pandas
Listing 3.28: emp_ages.csv
Listing 3.29: insert_row.py
Handling Missing Data in Pandas
Listing 3.30: employees2.csv
Listing 3.31: missing_values.py
Multiple Types of Missing Values
Listing 3.32: employees3.csv
Listing 3.33: missing_multiple_types.py
Test for Numeric Values in a Column
Listing 3.34: test_for_numeric.py
Replacing NaN Values in Pandas
Listing 3.35: missing_fill_drop.py
Sorting Data Frames in Pandas
Listing 3.36: sort_df.py
Working with groupby() in Pandas
Listing 3.37: groupby1.py
Working with apply() and mapapply() in Pandas
Listing 3.38: apply1.py
Listing 3.39: apply2.py
Listing 3.40: mapapply1.py
Listing 3.41: mapapply2.py
Handling Outliers in Pandas
Listing 3.42: outliers_zscores.py
Pandas Data Frames and Scatterplots
Listing 3.43: pandas_scatter_df.py
Pandas Data Frames and Simple Statistics
Listing 3.44: housing.csv
Listing 3.45: housing_stats.py
Aggregate Operations in Pandas Data Frames
Listing 3.46: aggregate1.py
Aggregate Operations with the titanic.csv Dataset
Listing 3.47: aggregate2.py
Save Data Frames as CSV Files and Zip Files
Listing 3.48: save2csv.py
Pandas Data Frames and Excel Spreadsheets
Listing 3.49: write_people_xlsx.py
Listing 3.50: read_people_xslx.py
Working with JSON-based Data
Python Dictionary and JSON
Listing 3.51: dict2json.py
Python, Pandas, and JSON
Listing 3.52: pd_python_json.py
Useful One-line Commands in Pandas
What is Method Chaining?
Pandas and Method Chaining
Pandas Profiling
Listing 3.53: titanic.csv
Listing 3.54: profile_titanic.py
Summary
Chapter 4: Working with Sklearn and Scipy
What is Sklearn?
Sklearn Features
The Digits Dataset in Sklearn
Listing 4.1: load_digits1.py
Listing 4.2: load_digits2.py
Listing 4.3: sklearn_digits.py
The train_test_split() Class in Sklearn
Selecting Columns for X and y
What is Feature Engineering?
The Iris Dataset in Sklearn (1)
Listing 4.4: sklearn_iris1.py
Sklearn, Pandas, and the Iris Dataset
Listing 4.5: pandas_iris.py
The Iris Dataset in Sklearn (2)
Listing 4.6: sklearn_iris2.py
The Faces Dataset in Sklearn (Optional)
Listing 4.7: sklearn_faces.py
What is SciPy?
Installing SciPy
Permutations and Combinations in SciPy
Listing 4.8: scipy_perms.py
Listing 4.9: scipy_combinatorics.py
Calculating Log Sums
Listing 4.10: scipy_matrix_inv.py
Calculating Polynomial Values
Listing 4.11: scipy_poly.py
Calculating the Determinant of a Square Matrix
Listing 4.12: scipy_determinant.py
Calculating the Inverse of a Matrix
Listing 4.13: scipy_matrix_inv.py
Calculating Eigenvalues and Eigenvectors
Listing 4.14: scipy_eigen.py
Calculating Integrals (Calculus)
Listing 4.15: scipy_integrate.py
Calculating Fourier Transforms
Listing 4.16: scipy_fourier.py
Flipping Images in SciPy
Listing 4.17: scipy_flip_image.py
Rotating Images in SciPy
Listing 4.18: scipy_rotate_image.py
Google Colaboratory
Uploading CSV Files in Google Colaboratory
Listing 4.19: upload_csv_file.ipynb
Summary
Chapter 5: Data Cleaning Tasks
What is Data Cleaning?
Data Cleaning for Personal Titles
Data Cleaning in SQL
Replace NULL with 0
Replace NULL Values with the Average Value
Listing 5.1: replace_null_values.sql
Replace Multiple Values with a Single Value
Listing 5.2: reduce_values.sql
Handle Mismatched Attribute Values
Listing 5.3: type_mismatch.sql
Convert Strings to Date Values
Listing 5.4: str_to_date.sql
Data Cleaning from the Command Line (optional)
Working with the sed Utility
Listing 5.5: delimiter1.txt
Listing 5.6: delimiter1.sh
Working with Variable Column Counts
Listing 5.7: variable_columns.csv
Listing 5.8: variable_columns.sh
Listing 5.9: variable_columns2.sh
Truncating Rows in CSV Files
Listing 5.10: variable_columns3.sh
Generating Rows with Fixed Columns with the awk Utility
Listing 5.11: FixedFieldCount1.sh
Listing 5.12: employees.txt
Listing 5.13: FixedFieldCount2.sh
Converting Phone Numbers
Listing 5.14: phone_numbers.txt
Listing 5.15: phone_numbers.sh
Converting Numeric Date Formats
Listing 5.16: dates.txt
Listing 5.17: dates.sh
Listing 5.18: dates2.sh
Converting Alphabetic Date Formats
Listing 5.19: dates2.txt
Listing 5.20: dates3.sh
Working with Date and Time Date Formats
Listing 5.21: date-times.txt
Listing 5.22: date-times-padded.sh
Working with Codes, Countries, and Cities
Listing 5.23: country_codes.csv
Listing 5.24: add_country_codes.sh
Listing 5.25: countries_cities.csv
Listing 5.26: split_countries_codes.sh
Listing 5.27: countries_cities2.csv
Listing 5.28: split_countries_codes2.sh
Data Cleaning on a Kaggle Dataset
Listing 5.29: convert_marketing.sh
Summary
Chapter 6: Data Visualization
What is Data Visualization?
Types of Data Visualization
What is Matplotlib?
Diagonal Lines in Matplotlib
Listing 6.1: diagonallines.py
A Colored Grid in Matplotlib
Listing 6.2: plotgrid2.py
Randomized Data Points in Matplotlib
Listing 6.3: lin_plot_reg.py
A Histogram in Matplotlib
Listing 6.4: histogram1.py
A Set of Line Segments in Matplotlib
Listing 6.5: line_segments.py
Plotting Multiple Lines in Matplotlib
Listing 6.6: plt_array2.py
Trigonometric Functions in Matplotlib
Listing 6.7: sincos.py
Display IQ Scores in Matplotlib
Listing 6.8: iq_scores.py
Plot a Best-Fitting Line in Matplotlib
Listing 6.9: plot_best_fit.py
The Iris Dataset in SkLearn
Listing 6.10: sklearn_iris1.py
SkLearn, Pandas, and the Iris Dataset
Listing 6.11: pandas_iris.py
Working with Seaborn
Features of Seaborn
Seaborn Built-in Datasets
Listing 6.12: seaborn_tips.py
The Iris Dataset in Seaborn
Listing 6.13: seaborn_iris.py
The Titanic Dataset in Seaborn
Listing 6.14: seaborn_titanic_plot.py
Extracting Data from the Titanic Dataset in Seaborn (1)
Listing 6.15: seaborn_titanic.py
Extracting Data from the Titanic Dataset in Seaborn (2)
Listing 6.16: seaborn_titanic2.py
Visualizing a Pandas Dataset in Seaborn
Listing 6.17: pandas_seaborn.py
Data Visualization in Pandas
Listing 6.18: pandas_viz1.py
What is Bokeh?
Listing 6.19: bokeh_trig.py
Summary
Appendix A: Working with Data
What are Datasets?
Data Preprocessing
Data Types
Preparing Datasets
Discrete Data vs. Continuous Data
Binning
Continuous Data
Scaling Numeric Data via Normalization
Scaling Numeric Data via Standardization
What to Look for in Categorical Data
Mapping Categorical Data to Numeric Values
Working with Dates
Working with Currency
Missing Data, Anomalies, and Outliers
Missing Data
Anomalies and Outliers
Outlier Detection
What is Data Drift?
What is Imbalanced Classification?
What is SMOTE?
SMOTE Extensions
Analyzing Classifiers (Optional)
What is LIME?
What is ANOVA?
The Bias-Variance Trade-Off
Types of Bias in Data
Summary
Appendix B: Working with awk
The awk Command
Built-in Variables that Control awk
How Does the awk Command Work?
Aligning Text with the printf Statement
Listing B.1: columns2.txt
Listing B.2: AlignColumns1.sh
Conditional Logic and Control Statements
The while Statement
A for loop in awk
Listing B.3: Loop.sh
A for loop with a break Statement
The next and continue Statements
Deleting Alternate Lines in Datasets
Listing B.4: linepairs.csv
Listing B.5: deletelines.sh
Merging Lines in Datasets
Listing B.6: columns.txt
Listing B.7: ColumnCount1.sh
Printing File Contents as a Single Line
Joining Groups of Lines in a Text File
Listing B.8: digits.txt
Listing B.9: digits.sh
Joining Alternate Lines in a Text File
Listing B.10: columns2.txt
Listing B.11: JoinLines.sh
Listing B.12: JoinLines2.sh
Listing B.13: JoinLines2.sh
Matching with Meta Characters and Character Sets
Listing B.14: Patterns1.sh
Listing B.15: columns3.txt
Listing B.16: MatchAlpha1.sh
Printing Lines Using Conditional Logic
Listing B.17: products.txt
Splitting Filenames with awk
Listing B.18: SplitFilename2.sh
Working with Postfix Arithmetic Operators
Listing B.19: mixednumbers.txt
Listing B.20: AddSubtract1.sh
Numeric Functions in awk
One Line awk Commands
Useful Short awk Scripts
Listing B.21: data.txt
Printing the Words in a Text String in awk
Listing B.22: Fields2.sh
Count Occurrences of a String in Specific Rows
Listing B.23: data1.csv
Listing B.24: data2.csv
Listing B.25: checkrows.sh
Printing a String in a Fixed Number of Columns
Listing B.26: FixedFieldCount1.sh
Printing a Dataset in a Fixed Number of Columns
Listing B.27: VariableColumns.txt
Listing B.28: Fields3.sh
Aligning Columns in Datasets
Listing B.29: mixed-data.csv
Listing B.30: mixed-data.sh
Aligning Columns and Multiple Rows in Datasets
Listing B.31: mixed-data2.csv
Listing B.32: aligned-data2.csv
Listing B.33: mixed-data2.sh
Removing a Column from a Text File
Listing B.34: VariableColumns.txt
Listing B.35: RemoveColumn.sh
Subsets of Column-aligned Rows in Datasets
Listing B.36: sub-rows-cols.txt
Listing B.37: sub-rows-cols.sh
Counting Word Frequency in Datasets
Listing B.38: WordCounts1.sh
Listing B.39: WordCounts2.sh
Listing B.40: columns4.txt
Displaying Only "Pure" Words in a Dataset
Listing B.41: onlywords.sh
Working with Multi-line Records in awk
Listing B.42: employees.txt
Listing B.43: employees.sh
A Simple Use Case
Listing B.44: quotes3.csv
Listing B.45: delim1.sh
Another Use Case
Listing B.46: dates2.csv
Listing B.47: string2date2.sh
Summary
Index
PREFACE
What is the Primary Value Proposition for this Book?
This book contains a fast-paced introduction to as much relevant information about Python tools for data scientists as can reasonably be included in a book of this size. If you are a novice, this book will give you a starting point from which you can decide which Python technologies you want to explore in greater detail.
You will be exposed to features of NumPy and Pandas, how to write regular expressions, and how to perform data cleaning tasks. Some topics are presented in a cursory manner, for two main reasons. First, it's important that you be exposed to these concepts. In some cases, you will find topics that pique your interest and motivate you to learn more about them through self-study; in other cases, you will probably be satisfied with a brief introduction. In other words, you decide whether to delve deeply into each of the topics in this book.
Second, a full treatment of all the topics that are covered in this book would significantly increase its size, and few people are interested in reading technical tomes with 500 or more pages.
However, it’s important for you to decide if this approach is suitable for your needs and learning style. If not, you can select one or more of the plethora of data analytics books that are available.
The Target Audience
This book is intended primarily for people who have worked with Python and are interested in learning about several important Python libraries. It is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. Consequently, this book uses standard English rather than colloquial expressions that might be confusing to those readers. As you know, many people learn by different types of imitation, which include reading, writing, or hearing new material. This book takes these points into consideration to provide a comfortable and meaningful learning experience for the intended readers.
What Will I Learn from This Book?
The first chapter contains a quick tour of basic Python. Next, Chapter 2 introduces you to NumPy, followed by Chapter 3 for Pandas. Chapter 4 provides a high-level view of Sklearn, which is an extremely powerful Python library that is central to many machine learning tasks, along with SciPy.
Chapter 5 contains an assortment of data cleaning tasks that are solved via SQL and command-line utilities such as sed and awk. Chapter 6 delves into data visualization with Matplotlib, Seaborn, and Bokeh. Next, one appendix explores issues that can arise with data, followed by an appendix for awk.
Why is an Appendix for awk Included in This Book?
While many data cleaning tasks can be performed via Python, sometimes it's much easier to perform data cleaning via awk. If you have not worked with awk, it's a venerable Unix utility that was developed almost 50 years ago by Aho, Weinberger, and Kernighan (the last of whom is a coauthor of the famous K&R book for C).
Incidentally, most of the Python code samples are short (usually less than one page and sometimes less than half a page), and if need be, you can easily and quickly copy/paste the code into a new Jupyter notebook. For the Python code samples that reference a CSV file, you do not need any additional code in the corresponding Jupyter notebook to access the CSV file. Moreover, the code samples execute quickly, so you won’t need to avail yourself of the free GPU that is provided in Google Colaboratory.
If you do decide to use Google Colaboratory, you can easily copy/paste the Python code into a notebook, and also use the upload feature to upload existing Jupyter notebooks. Keep in mind the following point: if the Python code references a CSV file, make sure that you include the appropriate code snippet (as explained in Chapter 2) to access the CSV file in the corresponding Jupyter notebook in Google Colaboratory.
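As a rough illustration of what such a snippet can look like, the following sketch parses uploaded CSV bytes with the standard library. It is not the listing from the book: the function names, the file name sample.csv, and its columns are hypothetical, and the only Colaboratory-specific call is files.upload(), which returns a dictionary that maps each uploaded filename to its raw bytes.

```python
import csv
import io


def parse_uploaded_csv(uploaded):
    """Parse every uploaded CSV file into a list of row dictionaries.

    `uploaded` maps a filename to the raw bytes of that file, which is
    the shape of the dictionary returned by files.upload() in Colaboratory.
    """
    tables = {}
    for filename, raw_bytes in uploaded.items():
        text = raw_bytes.decode("utf-8")  # assumes UTF-8 encoded files
        reader = csv.DictReader(io.StringIO(text))
        tables[filename] = list(reader)
    return tables


def upload_in_colab():
    """Prompt for files in a Colaboratory notebook, then parse them."""
    # This import only succeeds inside Google Colaboratory.
    from google.colab import files
    return parse_uploaded_csv(files.upload())


# Hypothetical data standing in for an uploaded file, so that the
# sketch also runs outside Colaboratory.
sample = {"sample.csv": b"name,age\nJane,35\nDave,42\n"}
rows = parse_uploaded_csv(sample)["sample.csv"]
print(rows[0]["name"], rows[1]["age"])
```

Inside a Colaboratory notebook you would call upload_in_colab() instead of building the sample dictionary by hand. Outside Colaboratory, a CSV file that sits next to the notebook can be opened directly by name.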
Do I Need to Learn the Theory Portions of this Book?
Once again, the answer depends on the extent to which you plan to become involved in data analytics. For example, if you plan to study machine learning, then you will probably learn how to create and train a model, which is a task that is performed after data cleaning tasks. In general, you will probably need to learn everything that you encounter in this book if you are planning to become a machine learning engineer.
Why Does This Book Include Sklearn Material?
The amount of Sklearn material in this book is minimal because this book is not about machine learning. The Sklearn material is located in Chapter 4, where you will learn about some of the Sklearn built-in datasets. If you decide to delve into machine learning, you will have already been introduced to some aspects of Sklearn.
Getting the Most from This Book
Some programmers learn well from prose, while others learn well from sample code (and lots of it), which means that there's no single style that can be used for everyone.
Moreover, some programmers want to run the