
DS Lecture 15


Intro to Data Science

Instructor:
Rabia Tariq
Lecturer, Department of Computer Science
Email: rabiatariq964@gmail.com
Lecture Content
• Exploratory Data Analysis
• Data Science Process
Exploratory Data Analysis
• Exploratory data analysis (EDA) is used by data
scientists to analyze and investigate data sets and
summarize their main characteristics, often employing
data visualization methods.
• EDA helps determine how best to manipulate data
sources to get the answers you need, making it easier
for data scientists to discover patterns, spot anomalies,
test a hypothesis, or check assumptions.
• EDA is primarily used to see what data can reveal
beyond the formal modeling or hypothesis testing task,
and it provides a better understanding of data set
variables and the relationships between them.
Exploratory Data Analysis
• It can also help determine if the statistical techniques
you are considering for data analysis are appropriate.
• EDA was originally developed by the American
mathematician John Tukey in the 1970s.
Importance of Exploratory Data
Analysis in Data Science
• The main purpose of EDA is to help look at data before
making any assumptions. It can help identify obvious
errors, as well as better understand patterns within the
data, detect outliers or anomalous events, find interesting
relations among the variables.
• Data scientists can use exploratory analysis to ensure the
results they produce are valid and applicable to any
desired business outcomes and goals.
• EDA also helps stakeholders by confirming they are
asking the right questions.
• EDA can help answer questions about standard deviations,
categorical variables, and confidence intervals.
• Once EDA is complete and insights are drawn, its features
can then be used for more sophisticated data analysis or
modeling, including machine learning.
Types of Exploratory Data
Analysis
• Univariate non-graphical. This is the simplest form of data analysis, where the data being
analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes
or relationships. The main purpose of univariate analysis is to describe the data and find
patterns that exist within it.
• Univariate graphical. Non-graphical methods don’t provide a full picture of the data.
Graphical methods are therefore required. Common types of univariate graphics include:
• Stem-and-leaf plots, which show all data values and the shape of the distribution.
• Histograms, a bar plot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values.
• Box plots, which graphically depict the five-number summary of minimum, first
quartile, median, third quartile, and maximum.
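The five-number summary that a box plot depicts can also be computed directly. A minimal Python sketch using the standard library's `statistics` module; the data values are hypothetical:

```python
# Five-number summary underlying a box plot: min, Q1, median, Q3, max.
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 9)
```

The `method="inclusive"` option treats the data as the whole population, which matches the quartile convention most box-plot tools use.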
• Multivariate nongraphical: Multivariate data arises
from more than one variable. Multivariate non-
graphical EDA techniques generally show the
relationship between two or more variables of the data
through cross-tabulation or statistics.
• Multivariate graphical: Multivariate data uses
graphics to display relationships between two or more
sets of data. The most used graphic is a grouped bar
plot or bar chart with each group representing one
level of one of the variables and each bar within a
group representing the levels of the other variable.
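Cross-tabulation, the non-graphical technique mentioned above, can be sketched in a few lines of Python with the standard library. The survey records below are hypothetical:

```python
# Multivariate non-graphical EDA: a cross-tabulation (contingency table)
# counting how often each pair of categorical values occurs together.
from collections import Counter

records = [
    ("male", "yes"), ("male", "no"), ("female", "yes"),
    ("female", "yes"), ("male", "yes"), ("female", "no"),
]

crosstab = Counter(records)  # counts each (gender, response) pair
print(crosstab[("male", "yes")])    # 2
print(crosstab[("female", "yes")])  # 2
```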
Tools used for Exploratory Data
Analysis
Some of the most common tools used to create an EDA are:
1. R: An open-source programming language and free software environment for
statistical computing and graphics, supported by the R Foundation for Statistical
Computing. The R language is widely used among statisticians for statistical
analysis and data visualization.
2. Python: An interpreted, object-oriented programming language with dynamic
semantics. Its high-level, built-in data structures, combined with dynamic binding,
make it very attractive for rapid application development, and for use as a
scripting or glue language to connect existing components. Python and EDA are
often used together to spot missing values in a data set, which is vital for
deciding how to handle missing values in machine learning.
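Spotting missing values, the use case described above, can be done with plain Python before any modeling library is involved. The rows below are hypothetical, with `None` marking a missing entry:

```python
# Count missing values per field in a small hypothetical data set.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": None, "income": 48000},
]

missing = {
    field: sum(1 for row in rows if row[field] is None)
    for field in rows[0]
}
print(missing)  # {'age': 2, 'income': 1}
```

Once the counts are known, you can decide per column whether to drop rows, impute a value, or flag the gap for the modeling stage.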
Specific statistical functions and techniques you can perform
with EDA tools include:
• Clustering and dimension reduction techniques, which help
create graphical displays of high-dimensional data containing
many variables.
• Univariate visualization of each field in the raw dataset, with
summary statistics.
• Bivariate visualizations and summary statistics that allow you
to assess the relationship between each variable in the
dataset and the target variable you’re looking at.
• Multivariate visualizations, for mapping and understanding
interactions between different fields in the data.
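A common bivariate summary statistic for assessing the relationship between a variable and the target is the Pearson correlation coefficient. A self-contained sketch, with hypothetical data:

```python
# Pearson correlation between one feature and a target variable:
# covariance of the two, divided by the product of their spreads.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

feature = [1, 2, 3, 4, 5]
target = [2, 4, 6, 8, 10]  # perfectly linear in the feature
print(round(pearson(feature, target), 3))  # 1.0
```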
• K-means Clustering is a clustering method in
unsupervised learning where data points are assigned
to one of K groups (K being the number of clusters)
based on their distance from each group’s centroid.
The data points closest to a particular centroid are
clustered into the same category. K-means Clustering
is commonly used in market segmentation, pattern
recognition, and image compression.
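The assign-then-update loop at the heart of K-means can be sketched on one-dimensional data. This is a minimal illustration, not a production implementation; the points and starting centroids are hypothetical (real K-means typically initializes centroids randomly):

```python
# Minimal K-means on 1-D data: assign each point to the nearest centroid,
# move each centroid to the mean of its cluster, and repeat.
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean (keep it if empty).
        centroids = [sum(ms) / len(ms) if ms else centroids[c]
                     for c, ms in clusters.items()]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 12.0])
print(sorted(centroids))  # [1.5, 10.5]
```

The two centroids settle on the means of the two obvious groups in the data.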
• Predictive models, such as linear regression, use
statistics and data to predict outcomes.
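Linear regression, the predictive model named above, reduces to an ordinary least-squares fit in the simple one-variable case. A sketch with hypothetical data chosen to lie exactly on a line:

```python
# Ordinary least-squares fit of y = slope * x + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0

# Use the fitted model to predict an unseen outcome.
print(slope * 10 + intercept)  # 21.0
```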
Data Science Process
The data science process is a systematic approach to
solving a data problem. It provides a structured
framework for articulating your problem as a question,
deciding how to solve it, and then presenting the solution
to stakeholders.
• Another term for the data science process is the data
science life cycle. The terms can be used
interchangeably, and both describe a workflow
that begins with collecting data and ends with
deploying a model that will hopefully answer your
questions.
Steps of Data Science Process
The steps include:
• Framing the Problem
Understanding and framing the problem is the first step of the
data science life cycle. This framing will help you build an
effective model that will have a positive impact on your
organization.
• Collecting Data
The next step is to collect the right set of data. High-quality,
targeted data—and the mechanisms to collect them—are crucial
to obtaining meaningful results. Since much of the roughly 2.5
quintillion bytes of data created every day come in unstructured
formats, you’ll likely need to extract the data and export it into a
usable format, such as a CSV or JSON file.
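Exporting collected records to a usable format like CSV can be done with the standard library alone. A sketch with hypothetical records, using an in-memory buffer in place of a file on disk:

```python
# Export collected records to CSV, then read them back to verify.
import csv
import io

records = [
    {"user": "alice", "posts": 12},
    {"user": "bob", "posts": 7},
]

buffer = io.StringIO()  # stands in for an output file
writer = csv.DictWriter(buffer, fieldnames=["user", "posts"])
writer.writeheader()
writer.writerows(records)

rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[0])  # {'user': 'alice', 'posts': '12'}
```

Note that CSV stores everything as text, so the count comes back as the string `'12'`; restoring types is part of the cleaning step that follows.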
• Cleaning Data
Most of the data you collect during the collection phase
will be unstructured, irrelevant, and unfiltered. Bad data
produces bad results, so the accuracy and efficacy of your
analysis will depend heavily on the quality of your data.
Cleaning data eliminates duplicate and null values, corrupt
data, inconsistent data types, invalid entries, missing
data, and improper formatting.
This step is the most time-intensive process, but finding
and resolving flaws in your data is essential to building
effective models.
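Two of the cleaning operations listed above, dropping rows with null values and removing exact duplicates, can be sketched directly in Python. The raw records are hypothetical:

```python
# Cleaning pass: drop rows containing None and remove exact duplicates.
raw = [
    {"id": 1, "city": "Lahore"},
    {"id": 2, "city": None},      # missing value -> dropped
    {"id": 1, "city": "Lahore"},  # duplicate -> dropped
    {"id": 3, "city": "Karachi"},
]

seen = set()
clean = []
for row in raw:
    if any(v is None for v in row.values()):
        continue                        # drop rows with null values
    key = tuple(sorted(row.items()))    # hashable fingerprint of the row
    if key in seen:
        continue                        # drop duplicates
    seen.add(key)
    clean.append(row)

print(clean)  # [{'id': 1, 'city': 'Lahore'}, {'id': 3, 'city': 'Karachi'}]
```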
• Exploratory Data Analysis (EDA)
Now that you have a large amount of organized, high-
quality data, you can begin conducting an exploratory
data analysis (EDA). Effective EDA lets you uncover
valuable insights that will be useful in the next phase of
the data science lifecycle.
• Model Building and Deployment
Next, you’ll do the actual data modeling. This is where
you’ll use machine learning, statistical models, and
algorithms to extract high-value insights and predictions.
• Communicating Your Results
Lastly, you’ll communicate your findings to stakeholders.
Every data scientist needs to build their repertoire of
visualization skills to do this.
Your stakeholders are mainly interested in what your
results mean for their organization, and often won’t care
about the complex back-end work that was used to build
your model. Communicate your findings in a clear,
engaging way that highlights their value in strategic
business planning and operation.
Mining Social Network Graphs
• Mining a social network graph refers to the process of using graph
theory and data analysis techniques to understand the connections
and relationships within a social network.
• A social network graph is a visual representation of a social network,
in which nodes represent individuals or organizations, and edges
represent the relationships or connections between them.
• By mining a social network graph, one can identify key nodes and
connections within the network, understand the dynamics of the
network, and identify patterns and trends in the data.
• This can be useful for a wide range of applications, such as marketing,
social network analysis, and fraud detection. The data generated
through any social media platform is considered big data.
Importance of Mining Social Network Graphs in
Data Science

Mining social network graphs can be important in data science for several
reasons:
1. Understanding social connections:
• By analyzing the structure of a social network, it is possible to understand
how people are connected and identify patterns in those connections.
For example, one might study the connections between co-authors on
scientific papers to understand patterns of collaboration in a field.
2. Modeling the spread of information or disease:
• Social networks can be used to model the spread of information or
diseases through a population. By understanding the structure of the
network, researchers can predict how fast a piece of information or
disease might spread and identify key individuals who might play a role in
its spread.
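The spread process described above can be modeled as a breadth-first traversal: something shared by a seed person reaches their direct contacts in round one, their contacts' contacts in round two, and so on. A toy sketch over a hypothetical network:

```python
# Toy spread model: breadth-first search from a seed node, recording the
# round in which each person is reached.
from collections import deque

network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}

def rounds_to_reach(network, seed):
    reached = {seed: 0}
    queue = deque([seed])
    while queue:
        person = queue.popleft()
        for contact in network[person]:
            if contact not in reached:
                reached[contact] = reached[person] + 1
                queue.append(contact)
    return reached

print(rounds_to_reach(network, "A"))
# {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 3}
```

Nodes like B, which sit on the only path to D and E, are the "key individuals" the text mentions: removing them slows or stops the spread.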
3. Improving recommendation systems:
• Social network data can be used to improve recommendation
systems by incorporating the preferences and interests of an
individual's connections. For example, a movie recommendation
system might use data from a social network to recommend movies
that are popular among an individual's friends.
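The movie example above can be sketched as a simple vote count over a user's connections. All names and viewing histories here are hypothetical:

```python
# Friend-based recommendation: count how many of a user's friends watched
# each movie the user has not seen, then suggest the most popular.
from collections import Counter

watched = {
    "alice": ["Dune", "Up"],
    "bob":   ["Dune", "Tenet"],
    "carol": ["Up", "Dune"],
}
friends = {"dave": ["alice", "bob", "carol"]}
dave_watched = {"Up"}

votes = Counter(
    movie
    for friend in friends["dave"]
    for movie in watched[friend]
    if movie not in dave_watched
)
print(votes.most_common(1))  # [('Dune', 3)]
```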
4. Identifying communities and influence:
• Social network analysis can be used to identify communities within a
network, and to identify individuals who have a high level of
influence within those communities. This can be useful for marketing
and public relations efforts, as well as for identifying key stakeholders
in a particular domain.
Representing social network data
through a graph
• There are many ways to represent a social network graph, but the
most common representation is a graph where the nodes represent
individuals, and the edges represent relationships between them.
• Relationships can be of various types, such as friendship, family, co-
worker, etc.
• Some common tasks in mining a social network graph include
finding communities or clusters within the network, predicting
missing or future relationships, and identifying key influencers or
central nodes within the network.
• There are many different tools and techniques that can be used for
mining a social network graph, including machine learning algorithms,
graph theory, and network analysis.
Steps for mining a social network graph

Here are the general steps that can be followed when mining a social
network graph:
1. Gather data on the social network:
This can be done using APIs or web scraping techniques to access data
from social media platforms or other sources.
2. Construct the social network graph:
Use the data to create a visual representation of the network, with
nodes representing individuals or organizations and edges
representing the connections or relationships between them.
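Before any visualization, the graph itself is usually built as an adjacency structure: each node maps to the set of nodes it connects to. A sketch over a hypothetical edge list, storing each undirected edge in both directions:

```python
# Construct a social network graph as an adjacency list from edge data.
edges = [("ali", "sara"), ("ali", "omar"), ("sara", "omar"), ("omar", "zain")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

print(sorted(graph["omar"]))  # ['ali', 'sara', 'zain']
```

Tools such as Gephi or NodeXL, mentioned in the next step, accept essentially this structure as an edge list for visualization.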
3. Visualize the social network graph:
Use tools such as Gephi or NodeXL to visualize the network and identify
key nodes and connections.
4. Analyze the social network:
Use statistical and machine learning techniques to identify patterns and
trends in the data, and to forecast future events or behaviors within the
network.
5. Interpret and communicate the results:
Use the insights gained from the analysis to understand the structure
and dynamics of the social network, and to inform decision-making or
strategic planning.
Important terms for mining social
network graphs
• There are several key terms and concepts that are important to
understand when it comes to mining social network graphs. These
include:
1. Graph theory:
The study of graphs and the relationships between their nodes and
edges.
2. Node: A node is a point or vertex in a graph, representing an
individual or organization.
3. Edge: An edge is a line connecting two nodes in a graph, representing
a relationship or connection between them.
4. Network analysis: The process of using graph theory and data
analysis techniques to understand the structure and dynamics of a
network.
5. Centrality measures: Algorithms that measure the importance or
influence of a node within a network. Examples include degree
centrality, betweenness centrality, and eigenvector centrality. In a
social network graph, centrality measures can be used to identify
individuals or organizations that are highly connected or influential
within the network.
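Degree centrality, the simplest of the measures listed, is each node's number of connections divided by the maximum possible, n − 1. A sketch over a hypothetical graph:

```python
# Degree centrality: connections per node, normalized by (n - 1).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"A", "C"},
}

n = len(graph)
centrality = {node: len(neigh) / (n - 1) for node, neigh in graph.items()}
print(centrality["A"])  # 1.0 -- connected to every other node
```

Node A scores the maximum because it touches every other node; B, with a single connection, scores 1/3.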
6. Clustering algorithms: Algorithms that group nodes into
communities or clusters based on their connections in the network.
Clustering algorithms can be useful for a wide range of applications,
including social network analysis, customer segmentation, and
anomaly detection. By identifying clusters within a dataset, one can
gain valuable insights into the underlying structure and patterns of
the data.
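One of the simplest community-finding passes is to extract the connected components of the graph: groups of nodes reachable from one another. A breadth-first sketch over a hypothetical graph with two clearly separated groups (real community detection usually goes further and splits connected regions too):

```python
# Find communities as connected components via breadth-first search.
from collections import deque

graph = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},  # community 1
    "X": {"Y"}, "Y": {"X"},                   # community 2
}

def connected_components(graph):
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for neigh in graph[node]:
                if neigh not in seen:
                    seen.add(neigh)
                    queue.append(neigh)
        components.append(sorted(comp))
    return components

print(connected_components(graph))  # [['A', 'B', 'C'], ['X', 'Y']]
```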
7. Predictive modeling: The use of statistical and machine learning
techniques to forecast future events or behaviors within a network.
In the context of social network graphs, predictive modeling can be
used to forecast future connections or interactions between
individuals or organizations in the network.
Intro of Next Generation Data Scientists
1. Technical Skills
2. Slow Down and Proceed Methodically
3. Soft Skills
4. Apply the Scientific Method
5. Proceed with Ethics
6. Data Science Tools & Workflows
• Transformation Tools
• Querying & Processing
• IDEs & Workspaces
