
Artificial Intelligence - (Unit - 1)


CAPSTONE PROJECT

A Capstone Project is a project in which students independently research a
topic to develop a deep understanding of the subject matter.
It gives students an opportunity to integrate all of their
knowledge and demonstrate it through a comprehensive project.

So, without further ado, let's jump straight into some Capstone Project
ideas that will strengthen your base:

1. Stock Prices Predictor
2. Develop a Sentiment Analyzer
3. Movie Ticket Price Predictor
4. Student Results Predictor
5. Human Activity Recognition using Smartphone Dataset
6. Classifying humans and animals in a photo
Artificial Intelligence is perhaps the most transformative technology available today.

At a high level, every AI project follows six steps:

1) Problem Definition, i.e. understanding the problem

2) Data Gathering

3) Feature Definition

4) AI Model Construction

5) Evaluation & Refinements

6) Deployment
1. Understanding the Problem:

➢ Problem scoping is the first stage of the Capstone Project.
➢ In this stage of AI development, the problem to be solved is identified.
➢ Problem scoping is the process by which student designers "figure out" the
problem that they need to solve.
➢ Incorrect problem scoping can lead to failure of the project.
➢ It is then followed by designing, developing or building, and finally testing
the project.

Begin formulating your problem by asking yourself this simple question: is there a
pattern?

➢ The premise that underlies all Machine Learning disciplines is that there
needs to be a pattern.
➢ If there is no pattern, then the problem cannot be solved with AI
technology.
If it is believed that there is a pattern in the data, then AI development
techniques may be employed.
Applied uses of these techniques are typically geared towards answering
five types of questions, all of which fall under the umbrella
of predictive analysis:

1) Which category? (Classification)

2) How much or how many? (Regression)

3) Which group? (Clustering)

4) Is this unusual? (Anomaly Detection)

5) Which option should be taken? (Recommendation)
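
As a rough illustration, each question maps to a family of techniques; the sketch below pairs each with a common scikit-learn estimator (these choices are illustrative examples, not prescriptions):

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

question_to_estimator = {
    "Which category?": LogisticRegression(),      # Classification
    "How much or how many?": LinearRegression(),  # Regression
    "Which group?": KMeans(n_clusters=3),         # Clustering
    "Is this unusual?": IsolationForest(),        # Anomaly Detection
    # Recommendation has no single sklearn estimator; it is usually
    # built from nearest-neighbour or matrix-factorization techniques.
}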

It is important to determine which of these questions you're asking, and
how answering it helps you solve your problem.
2. Analytic Approach
Those who work in the domain of AI and Machine Learning solve
problems and answer questions through data every day. They build models to
predict outcomes or discover underlying patterns, all to gain insights leading to
actions that will improve future outcomes.
➢ After clearly stating a business problem, the data scientist can define the
analytic approach to solving it.
➢ Doing so involves expressing the problem in the context of statistical
and machine learning techniques, so that the data scientist can identify
techniques suitable for achieving the desired outcome.
➢ Selecting the right analytic approach depends on the question being
asked.
➢ Once the problem to be addressed is defined, the appropriate analytic
approach for the problem is selected in the context of the business
requirements.
➢ This is the second stage of the data science methodology.
➢ If the question is to determine probabilities of an action, then a
predictive model might be used.
➢ If the question is to show relationships, a descriptive approach may
be required.
➢ Statistical analysis applies to problems that require counts; if the
question requires a yes/no answer, then a classification approach to
predicting a response would be suitable.
3. Data Requirements:
If the problem that needs to be resolved is "a recipe", so to speak, and data is
"an ingredient", then the data scientist needs to identify:

1. which ingredients are required?
2. how to source or collect them?
3. how to understand or work with them?
4. how to prepare the data to meet the desired outcome?
➢ Prior to undertaking the data collection and data preparation stages of the
methodology, it is vital to define the data requirements for decision tree
classification. This includes identifying the necessary data content, formats and
sources for initial data collection.

➢ In this phase the data requirements are revised, and decisions are made as to
whether the collection requires more or less data. Once the data
ingredients are collected, the data scientist will have a good understanding of
what they will be working with.

➢ Techniques such as descriptive statistics and visualization can be applied to
the data set to assess the content, quality, and initial insights about the data.
Gaps in the data will be identified, and plans to either fill them or make
substitutions will have to be made.
4. Decomposing the Problem Through DT Framework

➢ Design Thinking is a design methodology that provides a solution-based
approach to solving problems.
➢ It is extremely useful in tackling complex problems that are ill-defined or
unknown.
➢ The five stages of Design Thinking are as follows:
Empathize, Define, Ideate, Prototype, and Test.
Real computational tasks are complicated. To accomplish them, you need to
break the problem down into smaller units before coding.
Problem Decomposition Steps

1. Understand the problem, then restate it in your own words.
❖ Know what the desired inputs and outputs are.
❖ Ask questions for clarification.

2. Break the problem down into a few large pieces. Write these down, either on
paper or as comments in a file.

3. Break complicated pieces down into smaller pieces. Keep doing this until all of
the pieces are small.

4. Code one small piece at a time.
1. Think about how to implement it.
2. Write the code/query.
3. Test it… on its own.
4. Fix problems, if any.
Example 1:
Imagine that you want to create your first app. This is a complex problem. How
would you decompose the task of creating an app?
To decompose this task, you would need to know the answer to a series of
smaller problems:

❖ what kind of app do you want to create?
❖ what will your app look like?
❖ who is the target audience for your app?
❖ what will the graphics look like?
❖ what audio will you include?
❖ what software will you use to build your app?
❖ how will the user navigate your app?
❖ how will you test your app?

This list has broken down the complex problem of creating an app into much
simpler problems that can now be worked out.
You may also be able to get other people to help you with different
individual parts of the app.
For example, you may have a friend who can create the graphics, while another
can test the app.
Example 2:

Calculate the volume of a bunch of books.

Data:

Length  Width  Height
1       2      3
2       4      3

1. Calculate the volume of a book [Function]
2. Run this calculation on all books [Loop]
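
A minimal Python sketch of this decomposition, with one small function and a loop over the data above:

def book_volume(length, width, height):
    """Calculate the volume of a single book."""
    return length * width * height

# The length/width/height rows from the data table above
books = [(1, 2, 3), (2, 4, 3)]

# Run the calculation on all books
for length, width, height in books:
    print(book_volume(length, width, height))  # prints 6, then 24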


5. Modeling Approach:
➢ Data Modeling focuses on developing models that are either descriptive or
predictive.
➢ An example of a descriptive model might examine things like: if a person did
this, then they're likely to prefer that.
➢ A predictive model tries to yield yes/no, or stop/go type outcomes. These
models are based on the analytic approach that was taken, either statistically
driven or machine learning driven.

➢ The data scientist will use a training set for predictive modelling.
➢ A training set is a set of historical data in which the outcomes are already known.
The training set acts like a gauge to determine if the model needs to be calibrated.
In this stage, the data scientist will play around with different algorithms to ensure
that the variables in play are actually required.
➢ The success of data compilation, preparation and modelling depends on the
understanding of the problem at hand, and the appropriate analytical
approach being taken. The data supports the answering of the question and, like
the quality of the ingredients in cooking, sets the stage for the outcome.
Constant refinement, adjustments and tweaking are necessary within each step
to ensure the outcome is solid. The framework is geared to do three
things:

➢ First, understand the question at hand.

➢ Second, select an analytic approach or method to solve the problem.

➢ Third, obtain, understand, prepare, and model the data.

➢ The end goal is to move the data scientist to a point where a data model can be
built to answer the question.
How to Validate Model Quality
Train-Test Split Evaluation
➢ The train-test split is a technique for evaluating the performance of a machine
learning algorithm.
➢ It can be used for classification or regression problems and can be used for
any supervised learning algorithm.
➢ The procedure involves taking a dataset and dividing it into two subsets. The
first subset is used to fit the model and is referred to as the training dataset. The
second subset is not used to train the model; instead, the input element of the
dataset is provided to the model, then predictions are made and compared to the
expected values. This second dataset is referred to as the test dataset.
Train Dataset: Used to fit the machine learning model.
Test Dataset: Used to evaluate the fit machine learning model.
➢ The objective is to estimate the performance of the machine learning model
on new data: data not used to train the model.
➢ This is how we expect to use the model in practice. Namely, to fit it on available
data with known inputs and outputs, then make predictions on new examples in
the future where we do not have the expected output or target values.
➢ The train-test procedure is appropriate when there is a sufficiently large
dataset available.
How to Configure the Train-Test Split
➢ The procedure has one main configuration parameter: the size of the train and test
sets. This is most commonly expressed as a proportion between 0 and 1 for either the
train or test dataset.
➢ For example, a training set size of 0.67 (67 percent) means that the remaining
0.33 (33 percent) is assigned to the test set.
➢ There is no optimal split percentage.
You must choose a split percentage that meets your project’s objectives with
considerations that include:
❖ Computational cost in training the model.
❖ Computational cost in evaluating the model.
❖ Training set representativeness.
❖ Test set representativeness.

Nevertheless, common split percentages include:


➢ Train: 80%, Test: 20%
➢ Train: 67%, Test: 33%
➢ Train: 50%, Test: 50%
➢ Now that we are familiar with the train-test split model evaluation procedure, let’s look at
how we can use this procedure in Python.
Train-Test Split

➢ Imagine you have a dataset with 5000 rows.


➢ The train_test_split function has an argument for test_size that you can use
to decide how many rows go to the training set and how many go to the test set.
➢ The larger the test set, the more reliable your measures of model quality
will be.
➢ At an extreme, you could imagine having only 1 row of data in the test set.
➢ If you compare alternative models, which one makes the best predictions on
a single data point will be mostly a matter of luck.
➢ You will typically keep about 20% as a test dataset.
➢ But even with 1000 rows in the test set, there's some random chance in
determining model scores.
➢ A model might do well on one set of 1000 rows, even if it would be
inaccurate on a different 1000 rows.
➢ The larger the test set, the less randomness (aka "noise") there is in our
measure of model quality, as the sketch below illustrates.
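
To see this noise concretely, here is a small sketch, assuming synthetic data and a plain linear model (both purely illustrative): the same model earns a slightly different score depending on which random 20% split is held out.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real 5000-row dataset
X, y = make_regression(n_samples=5000, n_features=10, noise=10.0, random_state=0)

for seed in range(3):
    # A different random 80/20 split each time
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))  # score varies slightly with the split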
Example 1: Training and Test Data in Python Machine Learning
➢ As we work with datasets, a machine learning model works in two stages. We usually
split the data around 20%-80% between the testing and training stages. Under supervised
learning, we split a dataset into training data and test data in Python ML.

a. Prerequisites for Train and Test Data
We will need the following Python libraries for this tutorial:
❖ Pandas
❖ Sklearn
We can install these with pip:
pip install pandas
pip install scikit-learn
We use pandas to import the dataset and sklearn to perform the splitting. You can
import these packages as:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split

Following is the process of creating train and test sets in Python ML. So, let's take a
dataset first.
Loading the Dataset
Let's load the forestfires dataset using pandas.
>>> data = pd.read_csv('forestfires.csv')
>>> data.head()

b. Splitting
Let's split this data into labels and features: using features (the data we use to
predict), we predict labels (the data we want to predict).
>>> y = data.temp
>>> x = data.drop('temp', axis=1)

temp is the label, holding the temperatures we want to predict in y; the drop()
function takes all the other columns into x. Then, we split the data.
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
>>> x_train.head()
>>> x_train.shape
(413, 12)
>>> x_test.head()

>>> x_test.shape
(104, 12)
The argument test_size=0.2 says that the test data should be 20% of the dataset and
the rest should be train data. From the shape outputs, you can see
that we have 104 rows in the test data and 413 in the training data.
Cross-Validation

➢ We already learned to use train_test_split to split the data so that you can
measure model quality on the test data.
➢ Cross-validation extends this approach to model scoring (or "model
validation").
➢ Compared to train_test_split, cross-validation gives you a more reliable
measure of your model's quality, though it takes longer to run.

The Cross-Validation Procedure

➢ In cross-validation, we run our modeling process on different subsets of the
data to get multiple measures of model quality.
➢ For example, we could have 5 folds or experiments. We divide the data
into 5 pieces, each being 20% of the full dataset.
➢ We run an experiment called experiment 1, which uses the first fold as a
holdout set and everything else as training data.
➢ This gives us a measure of model quality based on a 20% holdout set, much
as we got from using the simple train-test split.
➢ We then run a second experiment, where we hold out data from the second
fold (using everything except the 2nd fold for training the model). This gives us a
second estimate of model quality.
➢ We repeat this process, using every fold once as the holdout. Putting this
together, 100% of the data is used as a holdout at some point.
➢ Returning to our example above from train-test split, if we have 5000 rows
of data, we end up with a measure of model quality based on 5000 rows of
holdout (even if we don't use all 5000 rows simultaneously). A minimal code
sketch of this procedure follows.

Trade-offs Between Cross-Validation and Train-Test Split

➢ Cross-validation gives a more accurate measure of model quality, which is
especially important if you are making a lot of modeling decisions.
➢ However, it can take more time to run, because it trains one model for
each fold. So it is doing more total work.
➢ Given these trade-offs, when should you use each approach? On small
datasets, the extra computational burden of running cross-validation isn't a big
deal.
➢ These are also the problems where model quality scores would be least
reliable with a train-test split. So, if your dataset is smaller, you should run cross-
validation.
➢ For the same reasons, a simple train-test split is sufficient for larger datasets.
It will run faster, and you may have enough data that there's little need to re-
use some of it for holdout.
➢ There's no simple threshold for what constitutes a large vs. small dataset. If
your model takes a couple of minutes or less to run, it's probably worth
switching to cross-validation. If your model takes much longer to run, cross-
validation may slow down your workflow more than it's worth.
➢ Alternatively, you can run cross-validation and see if the scores for each
experiment seem close. If each experiment gives the same results, a train-test split
is probably sufficient.
Conclusion
➢ Using cross-validation gives us much better measures of model quality, with the
added benefit of cleaning up our code (no longer needing to keep track of separate
train and test sets). So, it's a good win.
➢ Activity 1: Convert the code for your ongoing project over from train-test split to
cross-validation. Make sure to remove all code that divides your dataset into
training and testing datasets. Leaving code you don't need any more would be
sloppy.
➢ Activity 2: Add or remove a predictor from your models. Compute the cross-
validation score using both sets of predictors, and compare the scores, as in the
sketch below.
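
A minimal sketch for Activity 2; the file and column names here are hypothetical placeholders for your own project's data:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv('your_dataset.csv')    # hypothetical file name
y = data['target']                        # hypothetical label column
X_full = data.drop('target', axis=1)
X_reduced = X_full.drop('some_predictor', axis=1)  # the predictor to remove

# Compare mean cross-validation scores for the two predictor sets
for name, X in (('full', X_full), ('reduced', X_reduced)):
    print(name, cross_val_score(LinearRegression(), X, y, cv=5).mean())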
Metrics of model quality by simple math and examples
➢ After you make predictions, you need to know if they are any good. There are
standard measures that we can use to summarize how good a set of predictions
actually is.
➢ Knowing how good a set of predictions is allows you to estimate
how good a given machine learning model of your problem is.
➢ You must estimate the quality of a set of predictions when training a machine
learning model.
➢ Performance metrics like classification accuracy and root mean squared
error can give you a clear, objective idea of how good a set of predictions is,
and in turn how good the model is that generated them.
This is important as it allows you to tell the difference and select among:
❖ Different transforms of the data used to train the same machine learning model.
❖ Different machine learning models trained on the same data.
❖ Different configurations for a machine learning model trained on the same data.
RMSE (Root Mean Squared Error)
MSE (Mean Squared Error)
MAPE (Mean Absolute Percentage Error)
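
For reference, with $y_i$ the true values, $\hat{y}_i$ the predictions, and $n$ the number of examples, these metrics are defined as:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\]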
➢ All the algorithms in machine learning rely on minimizing or maximizing a
function, which we call the "objective function".
➢ The group of functions that are minimized are called "loss functions". A loss
function is a measure of how well a prediction model is able to
predict the expected outcome.
➢ The most commonly used method of finding the minimum point of a function is
"gradient descent". Think of the loss function as an undulating mountain, and gradient
descent as sliding down the mountain to reach its bottommost point (see the sketch below).
Loss functions can be broadly categorized into 2 types: Classification and Regression
Loss.

Regression functions predict a quantity, and classification functions predict a label.
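
To make gradient descent concrete, here is a tiny sketch that fits a line y = w*x + b by repeatedly stepping down the MSE loss surface; the learning rate and iteration count are illustrative choices, and the data points are the ones reused in Example-2 further below.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)

w, b, lr = 0.0, 0.0, 0.01  # start at zero, small learning rate
for _ in range(10000):
    error = (w * x + b) - y
    # Gradients of MSE with respect to w and b
    w -= lr * 2 * np.mean(error * x)
    b -= lr * 2 * np.mean(error)

print(w, b)  # converges toward the least-squares line Y = 0.7X - 0.1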
Python Libraries:
A library is a collection of modules that together satisfy a
specific type of need.
Pandas Library:
➢ Pandas is the most popular library for data analysis and
manipulation.
➢ It supports reshaping of data into different forms.
➢ It has functionality to find and fill missing data.
➢ It supports data visualization by integrating with matplotlib and seaborn.
✓ We can analyze data in pandas with Series (1-D) and
DataFrame (2-D) structures.
✓ It is used in economics, stock prediction, big data, finance,
data science and data analytics.
Syntax:
import pandas (or) import pandas as pd
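
A quick, illustrative sketch of the pandas features mentioned above (the toy data is made up for demonstration):

import pandas as pd

df = pd.DataFrame({'price': [100, None, 120], 'city': ['A', 'B', 'A']})
df['price'] = df['price'].fillna(df['price'].mean())  # find and fill missing data
print(df.groupby('city')['price'].mean())             # reshape/summarize by group

s = pd.Series([1, 2, 3])  # a 1-D Series, alongside the 2-D DataFrame above
print(s)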
Scikit-Learn:
➢ It is a popular library used to perform machine learning.
➢ It is also used in statistical modeling, including classification,
regression, clustering and dimensionality reduction.
➢ PyTorch is another machine learning library, developed by
Facebook.
Syntax:
import sklearn as sk
NumPy:
✓ The NumPy library provides high-level math functionality to
create and manipulate numeric arrays.
✓ Data manipulation in the pandas library is performed using
NumPy.
✓ It is used in matrix processing, linear algebra, Fourier
transforms, 2-D arrays, etc.
Syntax:
import numpy (or) import numpy as np
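
A quick, illustrative sketch of the NumPy capabilities listed above:

import numpy as np

a = np.array([[1, 2], [3, 4]])  # a 2-D numeric array (matrix)
b = np.array([[5, 6], [7, 8]])

print(a @ b)                     # matrix processing: matrix multiplication
print(np.linalg.inv(a))          # linear algebra: matrix inverse
print(np.fft.fft([1, 0, 1, 0]))  # Fourier transform of a small signal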
When to use mean squared error
Use MSE when you are doing regression, believe that your target, conditioned on
the input, is normally distributed, and want large errors to be penalized significantly
(quadratically) more than small ones.
Example-1: You want to predict future house prices. The price is a continuous
value, and therefore we want to do regression. MSE can be used here as the loss
function.
Example-2: Consider the given data points: (1,1), (2,1), (3,2), (4,2), (5,4). Fitting a
least-squares regression line to these points gives:

Regression line equation: Y = 0.7X – 0.1


from sklearn.metrics import mean_squared_error

# Given values
Y_true = [1, 1, 2, 2, 4]  # Y_true = Y (original values)

# Values predicted by the regression line Y = 0.7X - 0.1
Y_pred = [0.6, 1.3, 2.0, 2.7, 3.4]  # Y_pred = Y'

# Calculation of Mean Squared Error (MSE)
print(mean_squared_error(Y_true, Y_pred))

Output: 0.22
Example 1 (RMSE)
Let us write Python code to find the RMSE value of our model. We will predict the
brain weight of users, training the model with linear regression on the
headbrain6.csv dataset.

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# reading the data
"""
here the directory of my code and the headbrain6.csv file is the same;
make sure both files are stored in the same folder or directory
"""
data = pd.read_csv('headbrain6.csv')
data.head()
x = data.iloc[:, 2:3].values  # head size
y = data.iloc[:, 3:4].values  # brain weight

# splitting the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/4, random_state=0)

# fitting simple linear regression to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# predict the test results
y_pred = regressor.predict(x_test)

# to see the relationship between the training data values
plt.scatter(x_train, y_train, c='red')
plt.show()

# to see the relationship between the predicted
# brain weight values using a scatter graph
plt.plot(x_test, y_pred)
plt.scatter(x_test, y_test, c='red')
plt.xlabel('head size')
plt.ylabel('brain weight')
plt.show()

# error in each value
for i in range(0, 60):
    print("Error in value number", i, (y_test[i] - y_pred[i]))
    time.sleep(1)  # pause so the values can be read as they print

# combined rmse value
mse = np.mean((y_test - y_pred) ** 2)
print("Final rmse value is =", np.sqrt(mse))
Output:

The RMSE value of our model comes out to be approximately 73, which is not bad. A
good model should have an RMSE value of less than 180. In case you have a higher RMSE
value, this would mean that you probably need to change your features, or you may need
to tweak your hyperparameters.
THANK YOU
