Data Mining Python Lab


U18IT608

DATA MINING USING PYTHON LABORATORY

A Learner Centric
LABORATORY MANUAL & RECORD BOOK
First Edition, December 2023

Class: B.Tech.(IT) VI-Semester


[For the Students admitted under URR18 Regulation]

Academic Year: ………………………… Semester: ……………

Student Details
Student Name:
Roll Number:
Semester/Branch/Section:

Laboratory Course Faculty:
1.
2.
3.
4.

Course Offered by
DEPARTMENT OF INFORMATION TECHNOLOGY
KAKATIYA INSTITUTE OF TECHNOLOGY & SCIENCE, WARANGAL
(An Autonomous Institute under Kakatiya University, Warangal)

DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE

This is to certify that it is a bonafide record of practical work done by Mr. / Kum.

………………………………………………………………………………………

bearing the Roll No. ………………………… of ………………… Class

………………… branch in the Data Mining using Python Laboratory

during the academic year ………………… under our supervision.

Course Faculty Head of the Department


(Name & Signature)
Date: ……………………
Date: ……………………

Examiner
(Signature with Date)
PREFACE

Dear students,

This lab manual is designed and developed as “A Learner Centric Laboratory Manual and
Record Book (LMRB)” for Outcome Based Education (OBE).

a) A well-defined learner centric continuous internal evaluation (CIE) will be followed in
this lab. It is expected to make students active, skilled learners who acquire several
competencies related to the laboratory programming tasks.
b) Hence, students are advised to love learning, follow the stipulated CIE and become active
learners.
c) Active learning will ensure students acquire the 21st century skills and competencies to be
successful in a job.

1. Learner Centric Lab Manual & Record Book (LMRB):


1.1. This Learner centric LMRB contains relevant information for all programming tasks

2. Videos on programming Tasks:


2.1. Lab course faculty will make videos on the essential background needed for the programs and upload
them to the CourseWeb portal well in advance

3. Requisite prior knowledge (K):


3.1. Student should come prepared with requisite prior knowledge on the programs to be developed
3.2. To gain requisite prior knowledge on the programs to be completed, the student should
3.2.1. watch the videos on lab experiments, which are posted in CourseWeb portal by your lab
course faculty
3.2.2. read the information given in this Learner centric LMRB
3.2.3. execute the At-Home sample programs related to the corresponding lab task, as given in this
learner centric LMRB

4. During lab session:


4.1. Before start of Programming Task
KNOWLEDGE (K): 10 Marks
1. Before the start of the Programming Task, the student will be tested on knowledge (K), i.e., whether the
student has the requisite prior knowledge of the programs to be developed.
2. The lab faculty will
a. Check whether the student executed the At-Home sample programs with
Additional Test cases?
b. Ask around 4-5 questions to test the student’s pre-requisite knowledge(K)
3. This component of 10 marks will be awarded based on student’s prior preparation on the
programming task to be completed. To score well, in this component, students are
expected to,
a. execute the At-Home SPs as HOMEWORK with Additional Test Cases
b. have complete Knowledge on the In- Lab EPs to be executed in the lab
4. The student will be permitted to do the programming task after completing the prerequisite
Knowledge (K) test
5. This Knowledge (K) Test is aimed at imparting / enhancing the following skills for the
students
a. Communication (oral) skills
b. Requisite prior knowledge

4.2. During Programming Task:


PARTICIPATION (P): 10 marks
a) Students should complete the given programming task(s), i.e., the In-lab EPs, in the first 2 periods
of the lab session.
b) Student will be observed for taking part in developing PDs for the In-lab EPs, from problem
analysis to debugging & testing
c) Hence, while doing programming task, every student should be proactive to earn 10
marks for participation
d) This PARTICIPATION (P) section is aimed at imparting /enhancing the following skills
for the students
a. Problem analysis (Logic Development)
b. Algorithm/pseudocode
c. Flowchart
d. Coding
e. Testing & Debugging
f. Test Cases

Note:
a) Student should complete the Programming Task in first 2 periods of the lab session
b) And use the last 30 – 45 minutes of lab time to do the following tasks
a. Complete the record write up (W:10 marks) and
b. attend viva-voce (V:10 marks)
c) LAB RECORD WRITE UP is to be completed in the respective lab session itself. The lab
course faculty will complete the evaluation in the lab session itself.
d) Lab record WRITE UP should not be carried to home for completion.

4.3. After completion of Programming Task:


4.3.1. WRITE UP (W): 10 marks
a) After completion of the task(s), the student has to complete the write up for the record
in the lab session itself.
b) Write up related to problem analysis, flowchart, algorithm & code is normally common
to all students, and hence attracts no marks for these items
c) Evaluation under this Write Up section is purely based on how student practices all steps
of program development for execution of the program
d) To score well in this section, prior preparation & lateral thinking are needed.
e) For In-lab EPs, the student should focus on developing a code which should be readable
(with proper Annotations and Indentation), maintainable, extendable, testable and
robust.
f) Copying from other student’s code is not allowed and attracts award of ZERO
marks in this section
g) This WRITE UP (W) section is aimed at imparting / enhancing the following skills for
the students
a. Coding skills
b. Innovation & lateral thinking (ILT) skills
c. Research skills (inferring & predicting)

4.3.2. VIVA-VOCE (V): 10 marks


a) After completing the write up, the student should go for viva-voce
b) To score well in this section, student should
a. come prepared to programming task(s) with prior knowledge (K),
b. participate (P) actively during the development and execution of Programming tasks
c. focus on answering the sample questions given under viva-voce (V)
c) This VIVA-VOCE (V) section is aimed at imparting / enhancing the following skills for
the students
a. Communication (oral) skills
b. Coding skills
c. Innovation & lateral thinking (ILT) skills
d. Research skills (inferring & predicting)

NOTE: FACULTY WILL COMPLETE THE STUDENT EVALUATION IN THE LAB SESSION
ITSELF SO STUDENT SHOULD COMPLETE THE WRITE UP IN THE LAB SESSION
ITSELF.
The lab course faculty will assess and evaluate the student in four quadrants i.e. K, P, W & V
during the lab slot itself, and award the marks after conduction of viva-voce. This evaluation
gives scope for the students to improve, in the upcoming weeks of programming tasks, by
demonstrating relevant skills and the competencies in K, P, W & V.

Bottom Line:
a) A well-defined learner centric continuous internal evaluation (CIE) will be followed in this
lab. It is expected to make students active, skilled learners who acquire several
competencies related to the programming tasks
b) Hence, students are advised to love learning, follow the stipulated CIE and become active
learners
c) Active learning will ensure students acquire the 21st century skills and competencies to be
successful in a job
INDEX
Institute vision and mission
Department vision and mission
Program Educational Objectives (PEOs)
Program Outcomes (POs)
Program Specific Outcomes (PSOs)
Instructions to the students
Rubrics for Continuous Internal Evaluation (CIE)
Make-up laboratory sessions
Laboratory programs Calendar
List of programs to be performed


INSTITUTE VISION & MISSION

Vision of Institute:
• To make our students technologically superior and ethically strong by providing quality
education with the help of our dedicated faculty and staff and thus improve the quality of
human life.

Mission of the Institute:


• To provide latest technical knowledge, analytical and practical skills, managerial competence
and interactive abilities to students, so that their employability is enhanced.
• To provide a strong human resource base for catering to the changing needs of the Industry
and Commerce.
• To inculcate a sense of brotherhood and national integrity.

DEPARTMENT VISION & MISSION

Vision of Department:
• To become a Centre of Excellence in the Information Technology discipline with effective
teaching and strong research environment that makes our students globally competitive with
strong ethical values and leadership abilities.

Mission of the Department:


• To impart technical knowledge to the students to turn out proficient and well-groomed
engineers.
• To motivate students to improve their skills by attending training programs and internships that lead
to the development of innovative projects in emerging technologies.
• To train our students for higher education, leadership in profession and adopt quality research.

Program - B.Tech. Information Technology

PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

Within the first few years after graduation, the Information Technology graduates will be able
to…
PEO1 To provide students with a sound foundation in Information Technology theory
and practices to analyze, formulate and solve engineering problems
PEO2 To develop an ability to design algorithms, implement programs and deploy
software.
PEO3 To develop Information Technology solutions with the changing needs of the
society for the career-related activities.

PROGRAM OUTCOMES (POs)

Program Outcomes Engineering graduates will be able to

PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3 Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and
environmental considerations.
PO4 Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
PO7 Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and
need for sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9 Individual and teamwork: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.

PROGRAM SPECIFIC OUTCOMES (PSOs)

PSO The Information Technology Engineering graduates will be able to

PSO1 Apply analytical and experimental problem-solving skills in the Information
Technology discipline.
PSO2 Use fundamental knowledge to investigate new and emerging technologies
leading to innovations in the field of Information Technology.
PSO3 Begin immediate professional practice as an Information Technology Engineer.

INSTRUCTIONS TO THE STUDENTS
1. This Learner Centric LABORATORY MANUAL & RECORD BOOK (LMRB) is essential for the
student and must be brought to every laboratory session.
2. This learner centric LMRB consists of At-Home Sample Programs (SPs) and In-lab Exercise
Problems (EPs)
a) At-Home Sample Programs (SPs): At-Home Sample Programs (SPs) are the HOMEWORK
programs to be completed, before attending the lab. You should execute these Sample Programs
(SPs) with the given sample test cases and check for the results. In addition, as a proof of
completion of the HOMEWORK, the student should execute the SPs, with other set of test cases
and record the answers in the space provided under Additional Test Cases
(i). You should design your own additional Test Cases and execute these SPs.
(ii). You should design the test cases, which challenge the robustness of the code. The
challenging test cases have the capacity to halt the program execution.
(iii). You should bring those challenging test cases to the notice of course faculty, so that the
code of SPs can suitably be modified to make the code robust.
b) In-Lab Exercise Problem (EPs): In-Lab Exercise Problems (EPs) are the problems to be coded
during the lab session. Student should complete all EPs in the Lab slot itself with necessary
write up in the space provided, by following the required Program Development Steps (PDS).
Therefore, students should:
(i). work on the At-Home SPs and execute them with Additional Test Cases before attending the
lab.
(ii). complete the In-Lab EPs in the lab session.
Prior preparation on EPs will help the student a lot in completing the EPs within the
stipulated lab session. It is always a good practice to attend the Lab with prior preparation.
c) For EP1 - All steps of PDS are mandatory: Algorithm/Pseudocode development is mandatory
only for the FIRST Exercise Problem (EP-1) of every Laboratory task.
d) For OTHER EPs (EP2 onwards): To save time during lab session, the student can skip "writing
algorithm" for other EPs. The student should focus and work on Problem Analysis (Logic
development), Coding, Testing & Debugging and execution with Test Cases.
e) Student should focus on developing code for the In-Lab EPs, which is readable (with proper
Annotations and Indentation), maintainable, extendable, testable and robust.
3. All the EPs must be completed within the stipulated time.
4. Students should demonstrate the required skills during ORAL VIVA-VOCE. It is not mandatory
to write the answers to the viva-voce questions of every lab task, but it is a good practice to write
the answers in the space provided after completion of the lab session.
5. An incomplete lab record will result in a reduction of marks.

Rubrics for Continuous Internal Evaluation (CIE)

Continuous Internal Evaluation (CIE) for Practical (Laboratory) Course shall carry 40% weightage.

CIE throughout the semester shall consist of the following for each experiment/lab.

CIE - Assessment for experiments done in every lab                            Weightage
Requisite Prior Coding Knowledge                           Knowledge (K)      10%
Participation as an individual while developing programs   Participation (P)  10%
Write-up for Record Work                                   Writing (W)        10%
Viva-voce (oral)                                           Viva-voce (V)      10%

Every laboratory session is evaluated for a total of 40 marks. The details have been listed below.

A. Before start of Programming Task


1. Requisite Prior Coding knowledge (K): 10 Marks
Student should come prepared to the lab session and is expected to answer the following, prior to the
start of the programming task:
i. Whether student worked on the given At-Home sample programs and executed with
Additional Test Cases - 5 marks
ii. A total of 3-5 questions shall be asked (for 5 marks) on whether student gained requisite prior
knowledge on the At-Home SPs and the In-Lab EPs to be completed in the lab session

These five (05) marks shall be awarded based on student's performance, as below:
% of questions answered satisfactorily : Marks awarded
80-100 : 5
60-80 : 3-4
30-60 : 2
0-30 : 1

Note: Faculty will check whether the student executed the At-Home SPs with Additional Test Cases
and affix signature.

B. During Programming Task


2. Participation (P): 10 Marks

Once the student is allowed to develop the program for the In-Lab EPs, marks will be awarded based on
his/her participation as an individual while developing the programs by following the PDS.

Marks shall be awarded as below:


After the completion of programming task(s)… : Marks awarded
Completed the EPs effectively without the assistance of faculty and answered all the
questions related to the programming tasks : 10
Completed the program effectively with partial assistance of faculty but able to answer
the questions related to programming tasks : 7-9
Completed the EPs only with full assistance of faculty but unable to answer the
questions on programming tasks : 3-6
Unable to complete the EPs even with assistance of the faculty : 0-2

C. Activities to be completed by student after completion of programming tasks:


After completion of programming tasks, the student has to complete the write up and attend the viva-
voce.
3. Write-up(W): 10 Marks

The student should complete the write-up, related to the program conducted, in the manual itself, in the
designated space for Record Work. The write-up must be on the following:
• Problem Analysis (Logic Development)
• Program execution with Test Cases

Marks shall be awarded as below:


After the completion of experiment… : Marks awarded
Completed the write-up in the laboratory scheduled time with good logic development
and programs are executed with good test cases : 10
Completed the write-up in the laboratory scheduled time with average logic
development but executed programs with good test cases : 7-9
Completed the write-up in the laboratory scheduled time with average logic
development and executed programs with normal/average test cases : 3-6
Write-up not completed : 0-2

4. Viva-voce(V): 10 Marks

After completing the write-up, the student should attend viva-voce to answer the following:

Interpretation of output: Viva-voce should not be limited to only the sample questions listed in VIVA-
VOCE questions at the end of program, but should go beyond to test the student's involvement in the
program development and also the technical competency.
(i). What did you learn from these programs based on objectives?
(ii). How will you apply knowledge gained, by performing these programs, in future?

Student should be asked to comment, on the following, specific to SPs and EPs:

(iii). Alternative Approach: Can you propose any "alternative logic/solution" to make the code
more effective? (Specific to specified EPs)
(iv). Maintainability of the code: Do you think that the code written by you is maintainable? Justify.
(v). Testability of the code: Do you think that your programs are testable? Justify.
(vi). Extensibility of the code: What are your ideas on extending the existing code with
additional features?
(vii). Readability of the code: Do you think that the code written by you is readable (easy to follow,
easy to understand)? Justify.
(viii). Robustness of the code: Is your code robust? Justify.
(ix). Any other ideas related to the specific SPs/EPs.
Marks will be awarded based on student's performance, as below:
Viva-Voce : Marks awarded
Reasonable conclusions drawn with good interpretation of results and answered 80-100%
of the viva-voce questions perfectly : 10
Reasonable conclusions drawn but answered 50-80% of the viva-voce questions : 7-9
Poor conclusions and interpretation of results with only 30-50% of viva-voce questions
answered : 3-6
Conclusions without interpretation of results and answered less than 30% of viva-voce
questions posed : 0-2

(Faculty I/c, Data Mining using Python Laboratory)

MAKE-UP LAB SESSIONS
1. Missing lab sessions due to holidays or unforeseen circumstances / disturbances will cause a
big loss to student learning.
2. To compensate for this loss, lab course faculty has to plan and conduct additional lab sessions,
called Make-up Lab Sessions, beyond working hours of the institute (or) on Saturdays /
Sundays, by giving prior information to students.
3. The lab course faculty has to ensure that Make-up Lab Sessions are arranged in the following
cases
i. to compensate for the lab sessions to be lost due to holidays
ii. to compensate for the lab sessions to be lost due to unforeseen circumstances
4. The dates for Make-up lab sessions for case (i), i.e., for the sessions which are expected to be lost
due to holidays, are to be announced at the very beginning of the semester and printed
in the Lab Programs Calendar.
5. The dates for Make-up lab sessions for case (ii), i.e., for the sessions lost
due to unforeseen circumstances, are to be announced, conducted and recorded as and when the
lab sessions get disturbed.

IMPORTANT NOTE:
a) Completing all stipulated programs is mandatory for the students to appear for
Laboratory End Semester Examination (ESE).
b) It is student's responsibility to complete all programs
c) If any student is absent for any laboratory session due to valid/genuine reasons, he/she
must complete the program within a week by seeking permission from the lab course
faculty.
d) Upon completion of the programs of lab sessions which were missed due to valid/genuine
reasons, student will be evaluated for only 50% of the maximum marks of the program and
the corresponding attendance will not be counted.
e) Students are allowed to utilize the laboratory beyond the working hours.

PDS

The students should follow the following steps known as "Program Development Steps" (PDS) to
develop and execute a given programming task.
1. Problem analysis (Logic Development)
2. Algorithm/Pseudo code
3. Flowchart
4. Coding
5. Testing & Debugging
6. Program execution with Test Cases

Program Development Steps (PDS)


Note: PDS should be followed for each and every programming task.
1. Problem Analysis (Logic Development): The problem given is analyzed for understanding and
selecting the steps to solve the problem. Under this, we do LOGIC DEVELOPMENT and write
the required FORMULAS to solve the problems. Also, we have to clearly identify the input(s) and
output(s).
2. Algorithm/Pseudocode development: The general description of the solution for the given
problem, called algorithm/pseudocode, is to be developed.
An algorithm is a formula or set of steps for solving a particular problem. To be an algorithm,
the set of rules must be unambiguous and have a clear stopping point. Algorithms can be expressed
in any language, from natural languages like English to programming languages. In short, an
algorithm can be viewed as a set of programming language independent statements.
3. Flowchart: For the above algorithm/pseudocode a flowchart is to be drawn.
4. Coding: Write the programming instructions using selected programming language to implement
the developed algorithm/pseudocode.
5. Testing and debugging: The program is to be tested for syntax and other errors.
6. Program Execution with test cases: Executing the programs with different types of inputs (called
test cases) and analyzing the output.
a. Test Cases:
i. Execute the programs with all possible test cases.
ii. Test cases should include all possible inputs which challenge the output of the
program you have written.
For Example: You are asked to write C-Code to find Factorial of a given integer. After testing &
debugging, during the execution of the program, you should imagine all the possible inputs. An
example is shown below.
Test case (i): 'Input any positive number' to find its factorial.
Test case (ii): 'Input any negative number' to find its factorial.
Test case (iii): 'Zero' to find its factorial.
So, you have to ensure that your code will give appropriate output for above possible test cases.
For Test case (i): The output should be its factorial
For Test case (ii): The output should display the following message "Factorials are only defined for
non-negative integers. Please input a non-negative integer"
For Test case (iii): The output should be '1'
That means your code should deliver appropriate output based on all possible inputs.
Defining a test case for the program is another skill to be mastered. Hence, the students are advised
to design appropriate test cases to test the efficiency of the code. If your code passes all possible test
cases, your code is said to be robust.
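
The three test cases above can be tried out with a short script. The following is a minimal sketch written in Python (the language of this laboratory) rather than C; the function name `factorial` and the exact wording of the error message are illustrative assumptions, not part of the manual.

```python
def factorial(n: int) -> int:
    """Return n! for a non-negative integer n."""
    # Test case (ii): a negative input is rejected with a clear message.
    if n < 0:
        raise ValueError(
            "Factorial is not defined for negative integers. "
            "Please input a non-negative integer."
        )
    result = 1
    # For n = 0 the loop body never runs, so the result stays 1 (test case iii).
    for k in range(2, n + 1):
        result *= k
    return result


if __name__ == "__main__":
    # One input per test case: positive, zero and negative.
    for value in (5, 0, -3):
        try:
            print(value, "->", factorial(value))
        except ValueError as err:
            print(value, "->", err)
```

Handling the negative input explicitly, instead of letting the loop silently return 1, is what makes the code pass test case (ii).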
LABORATORY PROGRAMS - CALENDAR

Week #    Date                        Title of the experiment
Week 1    18.12.2023 to 23.12.2023    1. Write a program to perform multidimensional data model using
                                      SQL queries (Star, snowflake and fact constellation schemes).
Week 2    25.12.2023 to 30.12.2023    2. Write a program to perform various OLAP operations.
Week 3    01.01.2024 to 06.01.2024    3. Introduction to Python programming, Basics of Python.
                                      4. Python operators, Functions and Strings.
Week 4    08.01.2024 to 13.01.2024    5. List Collection and Tuple Collection.
                                      6. Dictionary collection and set collection.
                                      7. Control Structures and Functions.
Week 5    14.01.2024 to 20.01.2024    SANKRANTHI VACATION
Week 6    22.01.2024 to 27.01.2024    8. Introduction to NumPy, Operations on NumPy Arrays.
Week 7    29.01.2024 to 03.02.2024    9. Introduction to Pandas, Getting and Cleaning Data.
Week 8    05.02.2024 to 10.02.2024    10. Introduction to Data Visualization.
                                      11. Basics of Visualization: Plots, Subplots and their Functionalities.
Week 9    12.02.2024 to 20.02.2024    MID SEMESTER EXAMINATION - 1
Week 10   21.02.2024 to 24.02.2024    No Laboratory due to MSE-I on Monday & Tuesday
Week 11   26.02.2024 to 02.03.2024    12. Plotting Data Distributions, Categorical and Time-Series Data.
Week 12   04.03.2024 to 09.03.2024    13. Generate association rules from frequent item sets.
Week 13   11.03.2024 to 16.03.2024    14. Regression and Classification: Linear regression and logistic
                                      regression.
Week 14   18.03.2024 to 23.03.2024    15. Implement Decision tree, random forest, k-Nearest Neighbor
                                      algorithms.
Week 15   25.03.2024 to 30.03.2024    16. Implement K-means and hierarchical clustering algorithms.
Week 16   01.04.2024 to 06.04.2024    Makeup Laboratory
Week 17   08.04.2024 to 20.04.2024    MID SEMESTER EXAMINATION - 2
Week 18   22.04.2024 to 30.04.2024    LABORATORY END SEMESTER EXAMINATION

LAB EXPERIMENTS CALENDAR – MAKE-UP SESSIONS
S. No.    Make-up Lab on (Date)    Time    Title of the experiment
Make-up lab sessions - for sessions lost due to holidays

1.

2.

3.

4.

Make-up lab sessions - for sessions lost due to unforeseen circumstances

1.

2.

3.

4.

LIST OF PROGRAMS & CIE
Exp. No.   Title of the experiment                                    Date of conduction   Marks awarded (40)   Signature of faculty
E1         1. Write a program to perform multidimensional data
           model using SQL queries (Star, snowflake and fact
           constellation schemes).
E2         2. Write a program to perform various OLAP operations.
E3         3. Introduction to Python programming, Basics of Python.
           4. Python operators, Functions and Strings.
E4         5. List Collection and Tuple Collection.
           6. Dictionary collection and set collection.
           7. Control Structures and Functions.
E5         8. Introduction to NumPy, Operations on NumPy Arrays.
E6         9. Introduction to Pandas, Getting and Cleaning Data.
E7         10. Introduction to Data Visualization.
           11. Basics of Visualization: Plots, Subplots and their
           Functionalities.
E8         12. Plotting Data Distributions, Categorical and
           Time-Series Data.
E9         13. Generate association rules from frequent item sets.
E10        14. Regression and Classification: Linear regression and
           logistic regression.
E11        15. Implement Decision tree, random forest, k-Nearest
           Neighbor algorithms.
E12        16. Implement K-means and hierarchical clustering
           algorithms.

Laboratory Task - 1

E1. 1. Write a program to perform multidimensional data model using SQL queries
(Star, snowflake and fact constellation schemes)

Objectives: This lab will help students
1. learn the use of the multidimensional model.
2. learn how to design star, snowflake and fact constellation schemas.
Outcomes: Upon completion of this lab, students will be able to
1. apply the star, snowflake and fact constellation schemas.
2. apply the SQL queries on multidimensional model.

CONCEPT AT A GLANCE

Data: It is a set of raw facts and figures, made up of numbers, letters and other
symbols, that has not yet been processed.

Ex: 101, Ashok, IT, 20000.00 etc.

Information: It is a collection of meaningful and relevant data items. When data is
processed, then the resulting values give the information.

Field or Column: To prepare information, data items are organized in the form of
fields.

Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and
non-volatile collection of data in support of management’s decision-making process. A data
warehouse is a relational database that is designed for query and analysis rather than for
transaction processing. It usually contains historical data derived from transaction data,
but it can include data from other sources.

A data warehouse is based on a multidimensional data model which views data in
the form of a data cube. A data cube, such as sales, allows data to be modelled and
viewed in multiple dimensions.

• Dimension tables contain descriptive properties of the dimension.
Ex: item (item_name, brand, type), or time (day, week, month, quarter, year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables.
In data warehousing, an n-D base cube is called a base cuboid. The top most 0-D cuboid,
which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
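
To make the lattice concrete, the sketch below enumerates every cuboid for a small set of dimensions using Python's `itertools.combinations`. The dimension names are borrowed from the examples in this manual, and the helper name `cuboid_lattice` is hypothetical. When no concept hierarchies are considered, n dimensions give 2^n cuboids in total.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by its size (0-D up to n-D)."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        # Each k-element subset of the dimensions is one k-D cuboid.
        lattice[k] = list(combinations(dimensions, k))
    return lattice


dims = ["item", "time", "location"]  # illustrative dimension names
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {cuboids}")

apex = lattice[0][0]          # empty grouping: highest level of summarization
base = lattice[len(dims)][0]  # grouping by all dimensions: the base cuboid
total = sum(len(c) for c in lattice.values())
print("apex:", apex, "| base:", base, "| total cuboids:", total)
```

Each tuple corresponds to one possible group-by over the fact data, which is why the empty tuple plays the role of the apex cuboid.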

Data cube: A Lattice of Cuboids

Conceptual Modeling of Data Warehouse

Modeling data warehouses: A data warehouse is modeled using one of the following
multidimensional models, each of which is described by dimensions and measures:
1. Star schema: A fact table in the middle connected to a set of dimension tables
2. Snowflake schema: A refinement of Star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape similar to Snowflake
3. Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of Star
schemas, therefore called galaxy schema or fact constellation

Example of Star Schema

Sales analysis modeling using Star schema w.r.t item, time, branch and location dimensions

Example of Snowflake Schema

Sales analysis modeling using Snowflake schema w.r.t item, time, branch and location dimensions.
In this example item and location tables are normalized.
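
The effect of normalizing a dimension can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table and column names (`location`, `city`, `citykey`) and the sample rows are hypothetical and are not taken from the schema figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
-- A star schema would keep city, state and country inside the location table.
-- The snowflake form moves them to a separate city table referenced by key.
CREATE TABLE city (citykey INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE location (lockey INTEGER PRIMARY KEY, street TEXT,
                       citykey INTEGER REFERENCES city(citykey));
INSERT INTO city VALUES (1, 'Warangal', 'Telangana', 'India');
INSERT INTO location VALUES (10, 'Main Road', 1), (11, 'Station Road', 1);
""")

# Reading a full location now needs one extra join compared with a star schema.
rows = conn.execute("""
    SELECT l.lockey, l.street, c.city, c.state, c.country
    FROM location l JOIN city c ON l.citykey = c.citykey
    ORDER BY l.lockey
""").fetchall()
for row in rows:
    print(row)
```

The city details are stored once and shared by both locations, which reduces redundancy at the cost of the additional join.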

Example of Fact constellation Schema

Sales analysis modeling using Fact constellation schema w.r.t item, time, branch and location
dimensions. In this example two fact tables, sales and shipping fact tables are sharing common
dimensions.

Demonstrate Star schema creation for sales data analysis.

Assume that sales data is analyzed with respect to the item, location, branch and time dimensions.
Create the tables and insert the sample data.

Table for time dimension:

SQL> CREATE TABLE time2017 (timekey NUMBER(6) PRIMARY KEY, month VARCHAR2(5),
     quarter VARCHAR2(3), year NUMBER(4));

Table created

Table for item dimension:

SQL> Create table item2017(itemkey number(6) primary key, itemname varchar2(20),
     brand varchar2(20), category varchar2(20));

Table Created

Table for location dimension:

SQL> Create table location2017(lockey number(6) primary key, street varchar2(20),
     city varchar2(20), state varchar2(20), country varchar2(20));

Table Created

Table for branch dimension:

SQL> Create table branch2017(branchkey number(6) primary key, brname varchar2(20),
     brtype varchar2(20));

Table Created

Fact Table for sales data analysis:

SQL> Create table salesfact(
     timekey number(6) references time2017(timekey),
     itemkey number(6) references item2017(itemkey),
     brkey number(6) references branch2017(branchkey),
     lockey number(6) references location2017(lockey),
     unitssold number(10,2), dollarssold number(10,2), avgsales number(10,2));

Table Created
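The same star layout can be tried without an Oracle installation. The following is a minimal sketch using Python's built-in sqlite3 module with a reduced schema (two dimensions only); the table names, column types and sample rows are assumptions made for illustration and are not the lab's Oracle tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two dimension tables and one fact table (a reduced star schema).
cur.execute("CREATE TABLE item_dim (itemkey INTEGER PRIMARY KEY, itemname TEXT, brand TEXT)")
cur.execute("CREATE TABLE location_dim (lockey INTEGER PRIMARY KEY, city TEXT, country TEXT)")
cur.execute("""CREATE TABLE sales_fact (
    itemkey INTEGER REFERENCES item_dim(itemkey),
    lockey INTEGER REFERENCES location_dim(lockey),
    unitssold INTEGER,
    dollarssold REAL)""")

# Hypothetical sample data.
cur.executemany("INSERT INTO item_dim VALUES (?, ?, ?)",
                [(1, "Laptop", "BrandA"), (2, "Phone", "BrandB")])
cur.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                [(10, "Warangal", "India"), (20, "Hyderabad", "India")])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                [(1, 10, 2, 1500.0), (2, 10, 5, 2000.0), (1, 20, 1, 750.0)])

# Join the fact table to a dimension table and aggregate a measure.
cur.execute("""SELECT l.city, SUM(f.dollarssold)
               FROM sales_fact f JOIN location_dim l ON f.lockey = l.lockey
               GROUP BY l.city ORDER BY l.city""")
rows = cur.fetchall()
print(rows)
conn.close()
```

The fact table holds only keys and measures; the descriptive attributes (city, brand) are reached through a join to the dimension tables.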

RECORD WORK

In-Lab Exercise Problems (EPs)

(a). In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
1. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
2. complete the In-Lab EPs in the lab session.

Prior preparation on the EPs will help students complete them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.

(b). For EP1- All steps of PDS are mandatory:


Algorithm/pseudocode development and drawing flowchart are mandatory only for
the FIRST Exercise Problem (EP-1) of every Laboratory task.

(c). Student should focus on developing code for the In-Lab EPs, which is
1. readable (with proper Annotations and Indentation)
2. maintainable
3. extendable
4. testable and
5. robust

RECORD WORK

EP1. Create Snowflake schema for sales analysis.
EP2. Insert the data into the dimension and fact tables created in EP1.
EP3. Create Star schema for hospital management and insert the data.

Viva-Voce Questions:

1. What is Data warehousing?
2. What are fact tables and dimension tables?
3. What is the difference between data mining and data warehousing?
4. What is an OLTP system and OLAP system?
5. What is Snowflake schema design in database?

Note: For Viva-voce questions, the students should demonstrate their knowledge and
skills through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task - 2

E2. 2. Write a program to perform various OLAP operations.

Objectives: This lab will develop students’ knowledge in/on
1. the use of OLAP operations on the given data.
Outcomes: Upon completion of this lab, students will be able to
1. apply multidimensional model to analyze the data.
2. apply OLAP operations on the given data.

CONCEPT AT A GLANCE

Group By Clause:

An aggregate function takes multiple rows of data returned by a query and aggregates them into a
single result row. Including the Group By clause limits the window of data processed by the
aggregate function: it produces one aggregated value for each distinct combination of values present
in the columns listed in the Group By clause. The maximum number of result rows is the product of
the number of distinct values of each column listed in the Group By clause; combinations that do
not occur in the data produce no row.
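The grouping behaviour can be sketched in plain Python. This is only an illustration of the idea, not how a database implements it; the `emp` rows and the helper `group_sum` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (deptno, job, sal)
emp = [
    (10, "CLERK", 1300), (10, "MANAGER", 2450),
    (20, "CLERK", 800), (20, "CLERK", 1100), (20, "ANALYST", 3000),
]

def group_sum(rows, key_indexes, value_index):
    """Sum one column for each distinct combination of the key columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        totals[key] += row[value_index]
    return dict(totals)

by_dept_job = group_sum(emp, (0, 1), 2)
for key, total in by_dept_job.items():
    print(key, total)
```

Here there are 2 distinct departments and 3 distinct jobs, so at most 6 groups are possible, but only the 4 combinations that occur in the data appear in the result.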

Rollup:

In addition to the regular aggregation results with the Group By clause, the Rollup extension
produces group subtotals from right to left and a grand total. If "n" is the number of columns listed
in the Rollup, there will be n+ 1 levels of subtotals.

Example:

Rollup (a, b, c) creates the following subtotals:

(a, b, c)
(a, b)
(a)
()

Query 1:

SQL>select deptno,job,sum(sal) from emp group by rollup(deptno,job);
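What a query like Query 1 computes can be sketched in plain Python. This is a minimal illustration under assumed data, not Oracle's implementation; the `emp` rows and the helper `rollup_sum` are hypothetical, and `None` stands in for the null shown in subtotal rows.

```python
from collections import defaultdict

# Hypothetical rows: (deptno, job, sal)
emp = [(10, "CLERK", 1300), (10, "MANAGER", 2450), (20, "CLERK", 1900)]

def rollup_sum(rows, n_keys):
    """Sum the last field for every right-to-left prefix of the key columns."""
    totals = defaultdict(int)
    for row in rows:
        for k in range(n_keys, -1, -1):
            # Keep the first k key columns; the rolled-up ones become None.
            key = tuple(row[:k]) + (None,) * (n_keys - k)
            totals[key] += row[-1]
    return dict(totals)

rollup_result = rollup_sum(emp, 2)
for key, total in rollup_result.items():
    print(key, total)
```

With two rollup columns there are three levels: the (deptno, job) detail rows, one subtotal per deptno, and the grand total.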

It is possible to do a partial rollup to reduce the number of subtotals calculated.

select deptno,job,sum(sal) from emp group by mgr , rollup(deptno,job);

Cube:

In addition to the subtotals generated by the Rollup extension, the Cube extension generates
subtotals for all combinations of the dimensions specified. If "n" is the number of columns listed in
the CUBE, there will be 2^n subtotal combinations, so the number of subtotals to be calculated
grows rapidly as dimensions are added.

CUBE (a, b, c) produces the following subtotals:

(a, b, c)
(a, b)
(a, c)
(a)
(b, c)
(b)
(c)
()

Query 2:

SQL>select deptno,job,sum(sal) from emp group by cube(deptno,job);

Query 3: Partial cube

SQL> select deptno, job, Avg(sal) from emp group by mgr, cube(deptno,job);

Query 4:

SQL>select deptno,job,mgr,max(sal) from emp group by cube(deptno,job,mgr);
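The difference from Rollup is that Cube aggregates over every subset of the listed columns, not only the prefixes. A minimal Python sketch of this, assuming hypothetical `emp` rows and a hypothetical helper `cube_sum`:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (deptno, job, sal)
emp = [(10, "CLERK", 1300), (10, "MANAGER", 2450), (20, "CLERK", 1900)]

def cube_sum(rows, n_keys):
    """Sum the last field for every subset of the first n_keys columns."""
    totals = defaultdict(int)
    for k in range(n_keys + 1):
        for kept in combinations(range(n_keys), k):
            for row in rows:
                # Columns outside `kept` become None, as in a subtotal row.
                key = tuple(row[i] if i in kept else None for i in range(n_keys))
                totals[key] += row[-1]
    return dict(totals)

cube_result = cube_sum(emp, 2)
for key, total in cube_result.items():
    print(key, total)
```

Compared with a rollup over the same two columns, the cube adds the per-job subtotals such as (None, "CLERK").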

Grouping Functions:

Grouping:

It can be quite easy to visually identify subtotals generated by rollups and cubes, but to do it
programmatically we need to detect the null values that mark them. This is where the Grouping function is
useful. It accepts a single column as a parameter and returns "1" if the column contains a null value
generated as part of a subtotal by a ROLLUP or CUBE operation, or "0" for any other value, including
stored null values.

Query 5:

SQL> select deptno, job, sum(sal), Grouping(deptno) id_dept, grouping(job) id_job
     from emp group by rollup(deptno,job);
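The need for such a flag shows up when the data itself contains nulls. The sketch below emulates a two-column rollup in plain Python and carries the two grouping flags in the key; the rows and the helper name `rollup_with_grouping` are hypothetical, and `None` plays the role of null.

```python
from collections import defaultdict

# Hypothetical rows: (deptno, job, sal); one job is a stored null (None).
emp = [(10, "CLERK", 1300), (10, None, 500), (20, "CLERK", 800)]

def rollup_with_grouping(rows):
    """Emulate GROUP BY ROLLUP(deptno, job) together with two grouping flags."""
    totals = defaultdict(int)
    for deptno, job, sal in rows:
        # Key layout: (deptno, job, flag for deptno, flag for job)
        totals[(deptno, job, 0, 0)] += sal    # detail level
        totals[(deptno, None, 0, 1)] += sal   # subtotal per deptno
        totals[(None, None, 1, 1)] += sal     # grand total
    return dict(totals)

flagged = rollup_with_grouping(emp)
for key, total in flagged.items():
    print(key, total)
```

The rows (10, None, 0, 0) and (10, None, 0, 1) look identical without the flags: the first is a stored null, the second is the department subtotal.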

Group_Id:

It is possible to write queries that return duplicate subtotals, which can be a little confusing. The
group_id function assigns the value "0" to the first set, and each subsequent duplicate set is assigned
a higher number.

Grouping sets:

Calculating all possible subtotals in a cube, especially one with many dimensions, can be quite an
intensive process. When only a few of the subtotals are needed, this represents a considerable amount of
wasted effort. If we only need a few of these levels of subtotaling, we can use the Grouping Sets expression
and specify exactly which are required.

Composite Columns:

Rollup and Cube consider each column independently when deciding which subtotals must be
calculated. For Rollup this means stepping back through the list to determine the groupings.
Composite columns allow columns to be grouped together with parentheses so they are treated as a single
unit when determining the necessary groupings. In the following Rollup, columns "a" and "b" have
been turned into a composite column by the additional parentheses. As a result, the group of "a" is no
longer calculated, as the column "a" is only present as part of the composite column in the statement.

ROLLUP ((a, b), c)

(a, b, c)
(a, b)
()

Not considered:

(a)

In a similar way, the possible combinations of the following Cube are reduced because references to
"a" or "b" individually are not considered as they are treated as a single column when the groupings
are determined.

CUBE ((a, b), c)

(a, b, c)
(a, b)
(c)
()

Not considered:

(a, c)
(a)
(b, c)
(b)
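The effect of composite columns on the list of groupings can be reproduced with a short Python sketch. A tuple stands for a composite column; the helper names `flatten`, `rollup_groupings` and `cube_groupings` are hypothetical.

```python
from itertools import combinations

def flatten(units):
    """Expand composite units (tuples) into one flat tuple of column names."""
    cols = []
    for unit in units:
        cols.extend(unit if isinstance(unit, tuple) else (unit,))
    return tuple(cols)

def rollup_groupings(units):
    """Prefixes of the unit list, longest first."""
    return [flatten(units[:k]) for k in range(len(units), -1, -1)]

def cube_groupings(units):
    """Every subset of the unit list, largest first."""
    result = []
    for k in range(len(units), -1, -1):
        result.extend(flatten(combo) for combo in combinations(units, k))
    return result

units = [("a", "b"), "c"]           # corresponds to ((a, b), c)
rollup_sets = rollup_groupings(units)
cube_sets = cube_groupings(units)
print(rollup_sets)
print(cube_sets)
```

Because ("a", "b") is handled as one unit, no grouping ever contains "a" without "b", which matches the lists above.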

Query 6:

SQL> select deptno, job, mgr, sum(sal), Grouping(deptno) id_dept, grouping(job) id_job
     from emp group by rollup(deptno,(job,mgr));

Concatenated Groupings

Concatenated groupings are defined by putting together multiple GROUPING SETS, CUBEs or
ROLLUPs separated by commas. The resulting groupings are the cross-product of all the groups
produced by the individual grouping sets.

Query 7:

SQL> select deptno, job, mgr, sum(sal), Grouping(deptno) id_dept, grouping(job) id_job
     from emp group by grouping sets(deptno,job), grouping sets(deptno,mgr);
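The cross-product rule can be checked with a few lines of Python. This sketch only lists which groupings result from concatenating the two grouping sets used in Query 7; the helper name `concatenate` is hypothetical.

```python
from itertools import product

def concatenate(*grouping_lists):
    """Cross-product of grouping lists; a repeated column is kept only once."""
    combined = []
    for parts in product(*grouping_lists):
        merged = []
        for part in parts:
            for col in part:
                if col not in merged:
                    merged.append(col)
        combined.append(tuple(merged))
    return combined

first = [("deptno",), ("job",)]     # grouping sets(deptno, job)
second = [("deptno",), ("mgr",)]    # grouping sets(deptno, mgr)
concatenated = concatenate(first, second)
print(concatenated)
```

Two sets of two groupings each give 2 x 2 = 4 groupings; the pair (deptno, deptno) collapses to a grouping on deptno alone.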

RECORD WORK

EP1. Write the queries for the following using the sales fact table:
a. Display average sales for the rollup combination of location and branch.
b. Display sum of dollars sold for the cube combination location, time, branch.
c. Display average of dollars sold for the cube combination location, (time, branch).
d. Demonstrate grouping sets on the sales table.
e. Demonstrate concatenated grouping sets on the sales schema.

Viva-Voce Questions:

1. What is rollup operation?
2. What is cube operation?
3. What is grouping function?
4. What are grouping sets?
5. What is grouping id?

Note: For Viva-voce questions, the students should demonstrate their knowledge and
skills through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task - 3

E3.
3. Introduction to Python programming, Basics of Python.
4. Python operators, Functions and Strings.

Objectives: This lab will develop students’ knowledge in/on
1. the use of Python programming and its basics.
2. the use of Python operators, functions and strings.
Outcomes: Upon completion of this lab, students will be able to
1. apply python programming and its basics on the data.
2. apply python operators, functions and strings on the given data.

CONCEPT AT A GLANCE

What is Python?

Python is a popular programming language. It was created by Guido van Rossum and
released in 1991. Python is a high-level, object-oriented programming language. It is also called a
general-purpose programming language, as it is used in almost every domain, as mentioned below:

1. Web Development
2. Software Development
3. Game Development
4. AI & ML
5. Data Analytics

Why Python?

Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc). Python has a
simple syntax similar to the English language.

Python has syntax that allows developers to write programs with fewer lines than some other
programming languages.

Python runs on an interpreter system, meaning that code can be executed as soon as it is
written. This means that prototyping can be very quick. Python can be treated in a procedural way,
an object-oriented way or a functional way.

Python Quick start

Python is an interpreted programming language; this means that as developers we write
Python (.py) files in a text editor and then put those files into the Python interpreter to be executed.

The way to run a python file is like this on the command line:

C:\Users\Name>python helloworld.py

where "helloworld.py" is the name of the python file.

Let's write our first Python file, called helloworld.py, which can be done in any text editor.

helloworld.py:

print("Hello, World!")

Save the file. Open the command line, navigate to the directory where we saved the file, and run:
C:\Users\Name>python helloworld.py

The output should read: Hello, World!

Python operators, Functions and Strings

Python Operators:

Python divides the operators in the following groups:

• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Identity operators
• Membership operators
• Bitwise operators

Arithmetic Operators

Arithmetic operators are used to perform mathematical operations like addition, subtraction,
multiplication, and division.

Operator  Description                                                          Syntax
+         Addition: adds two operands                                          x + y
-         Subtraction: subtracts two operands                                  x - y
*         Multiplication: multiplies two operands                              x * y
/         Division (float): divides the first operand by the second            x / y
//        Division (floor): divides the first operand by the second            x // y
%         Modulus: remainder when the first operand is divided by the second   x % y
**        Power: returns first raised to power second                          x ** y

# Examples of Arithmetic Operators
a = 9
b = 4

# Addition of numbers
add = a + b

# Subtraction of numbers
sub = a - b

# Multiplication of numbers
mul = a * b

# Division (float) of numbers
div1 = a / b

# Division (floor) of numbers
div2 = a // b

# Modulo of both numbers
mod = a % b

# Power
p = a ** b

# print results
print(add)
print(sub)
print(mul)
print(div1)
print(div2)
print(mod)
print(p)

Output

13
5
36
2.25
2
1
6561
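Floor division and modulus are tied together by the identity a == (a // b) * b + (a % b). The short sketch below checks this with the built-in divmod() function and also shows the behaviour for a negative operand, which often surprises beginners; the variable names are chosen for illustration only.

```python
a, b = 9, 4

# divmod() returns the floor quotient and the remainder in one call.
quotient, remainder = divmod(a, b)
print(quotient, remainder)

# The two results always recombine into the original number.
recombined = quotient * b + remainder
print(recombined == a)

# Floor division rounds toward negative infinity, not toward zero,
# and the remainder takes the sign of the divisor.
neg_quotient = -9 // 4
neg_remainder = -9 % 4
print(neg_quotient, neg_remainder)
```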

Comparison Operators

Comparison (relational) operators compare values. They return either True or False according to
the condition.

Operator  Description                                                                   Syntax
>         Greater than: True if the left operand is greater than the right              x > y
<         Less than: True if the left operand is less than the right                    x < y
==        Equal to: True if both operands are equal                                     x == y
!=        Not equal to: True if operands are not equal                                  x != y
>=        Greater than or equal to: True if the left operand is greater than or
          equal to the right                                                            x >= y
<=        Less than or equal to: True if the left operand is less than or equal to
          the right                                                                     x <= y

# Examples of Relational Operators
a = 13
b = 33

# a > b is False
print(a > b)

# a < b is True
print(a < b)

# a == b is False
print(a == b)

# a != b is True
print(a != b)

# a >= b is False
print(a >= b)

# a <= b is True
print(a <= b)

Output

False

True

False

True

False

True

Logical Operators

Logical operators perform Logical AND, Logical OR, and Logical NOT operations. They are used to
combine conditional statements.

Operator Description Syntax


and Logical AND: True if both the operands are true x and y
or Logical OR: True if either of the operands is true x or y
not Logical NOT: True if the operand is false not x

# Examples of Logical Operators
a = True
b = False

# a and b is False
print(a and b)

# a or b is True
print(a or b)

# not a is False
print(not a)

Output

False

True

False

Bitwise Operators

Bitwise operators act on bits and perform the bit-by-bit operations. These are used to operate on
binary numbers.

Operator  Description          Syntax
&         Bitwise AND          x & y
|         Bitwise OR           x | y
~         Bitwise NOT          ~x
^         Bitwise XOR          x ^ y
>>        Bitwise right shift  x >> n
<<        Bitwise left shift   x << n

# Examples of Bitwise operators
a = 10
b = 4

# Print bitwise AND operation
print(a & b)

# Print bitwise OR operation
print(a | b)

# Print bitwise NOT operation
print(~a)

# Print bitwise XOR operation
print(a ^ b)

# Print bitwise right shift operation
print(a >> 2)

# Print bitwise left shift operation
print(a << 2)

Output

0
14
-11
14
2
40
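These results are easier to follow when the operands are written in binary. The following sketch prints each value next to its 8-bit binary form; the dictionary name `results` and the 8-digit width are choices made for this illustration.

```python
a, b = 10, 4

results = {
    "a": a,
    "b": b,
    "a & b": a & b,
    "a | b": a | b,
    "a ^ b": a ^ b,
    "a >> 2": a >> 2,
    "a << 2": a << 2,
}

for label, value in results.items():
    # 08b pads the binary form to 8 digits so the bit columns line up.
    print(f"{label:7} {value:3d}  {value:08b}")
```

~a is left out of the table because a negative integer has no fixed-width bit pattern in Python; for any integer, ~a equals -a - 1, which is why ~10 prints -11.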

Assignment Operators

Assignment operators are used to assign values to variables.

Operator  Description                                                               Syntax   Same as
=         Assign value of right side of expression to left side operand             x = y + z
+=        Add AND: add right operand to left operand, then assign to left operand   a += b   a = a + b
-=        Subtract AND: subtract right operand from left operand, then assign to
          left operand                                                              a -= b   a = a - b
*=        Multiply AND: multiply right operand with left operand, then assign to
          left operand                                                              a *= b   a = a * b
/=        Divide AND: divide left operand by right operand, then assign to
          left operand                                                              a /= b   a = a / b
%=        Modulus AND: take modulus using left and right operands and assign the
          result to left operand                                                    a %= b   a = a % b
//=       Divide (floor) AND: divide left operand by right operand, then assign
          the floor value to left operand                                           a //= b  a = a // b
**=       Exponent AND: calculate exponent (raise power) value using operands and
          assign value to left operand                                              a **= b  a = a ** b
&=        Performs Bitwise AND on operands and assigns value to left operand        a &= b   a = a & b
|=        Performs Bitwise OR on operands and assigns value to left operand         a |= b   a = a | b
^=        Performs Bitwise XOR on operands and assigns value to left operand        a ^= b   a = a ^ b
>>=       Performs Bitwise right shift on operands and assigns value to left
          operand                                                                   a >>= b  a = a >> b
<<=       Performs Bitwise left shift on operands and assigns value to left
          operand                                                                   a <<= b  a = a << b

# Examples of Assignment Operators
a = 10

# Assign value
b = a
print(b)

# Add and assign value
b += a
print(b)

# Subtract and assign value
b -= a
print(b)

# Multiply and assign
b *= a
print(b)

# Bitwise left shift and assign
b <<= a
print(b)

Output

10
20
10
100
102400

Identity Operators

is and is not are the identity operators. Both are used to check whether two names refer to the same
object in memory. Two variables that are equal are not necessarily identical.

is       True if the operands are identical
is not   True if the operands are not identical

a = 10
b = 20
c = a

print(a is not b)
print(a is c)

Output

True

True
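The distinction between equal and identical is clearer with lists than with small integers. In the sketch below, two separately created lists have the same contents but are different objects; the variable names are chosen for illustration.

```python
x = [1, 2, 3]
y = [1, 2, 3]   # same contents, but a separate list object
z = x           # a second name for the same object as x

equal = x == y       # compares contents
identical = x is y   # compares object identity
alias = x is z

print(equal, identical, alias)

# A change made through one name is visible through the other alias only.
z.append(4)
print(x, y)
```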

Membership Operators

in and not in are the membership operators; used to test whether a value or variable is in a sequence.

in True if value is found in the sequence

not in True if value is not found in the sequence

# Python program to illustrate the 'in' and 'not in' operators
x = 24
y = 20
values = [10, 20, 30, 40, 50]

if x not in values:
    print("x is NOT present in given list")
else:
    print("x is present in given list")

if y in values:
    print("y is present in given list")
else:
    print("y is NOT present in given list")

Output

x is NOT present in given list

y is present in given list

Functions:

A function is a block of organized, reusable code that is used to perform a single, related
action. Functions provide better modularity for the application and a high degree of code reuse.

As we already know, Python gives us many built-in functions like print(), but we can also
create our own functions. These functions are called user-defined functions.

Creating a Function

In Python a function is defined using the def keyword:

Example

def my_function():
    print("Hello from a function")

Calling a Function

To call a function, use the function name followed by parentheses:

Example

def my_function():
    print("Hello from a function")

my_function()

Arguments

Information can be passed into functions as arguments.

Arguments are specified after the function name, inside the parentheses. We can add as many
arguments as we want, just separate them with a comma.

The following example has a function with one argument (fname). When the function is called,
we pass along a first name, which is used inside the function to print the full name:

Example

def my_function(fname):
    print(fname + " Refsnes")

my_function("Emil")
my_function("Tobias")
my_function("Linus")

Strings

Strings in python are surrounded by either single quotation marks, or double quotation marks. 'hello'
is the same as "hello".

We can display a string literal with the print() function:

Example

print("Hello")

Assign String to a Variable

Assigning a string to a variable is done with the variable name followed by an equal sign and the
string:

Example

a= "Hello"

print(a)

Multiline Strings

We can assign a multiline string to a variable by using three quotes:

Example

We can use three double quotes:

a= """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua."""

print(a)

Strings are Arrays

Like many other popular programming languages, strings in Python behave like arrays: a string is
an ordered sequence of Unicode characters.

However, Python does not have a character data type; a single character is simply a string with a
length of 1.

Square brackets can be used to access elements of the string.

Example

Get the character at position 1 (remember that the first character has the position 0):

a= "Hello,World!"

print(a[1])

Looping Through a String

Since strings are arrays, we can loop through the characters in a string, with a for loop.

Example

Loop through the letters in the word "banana":

for x in "banana":
    print(x)

String Length

To get the length of a string, use the len() function.

Example

The len() function returns the length of a string:

a= "Hello,World!"

print(len(a))

Slicing

We can return a range of characters by using the slice syntax.

Specify the start index and the end index, separated by a colon, to return a part of the string.

Example

Get the characters from position 2 to position 5 (not included):

b= "Hello,World!"

print(b[2:5])

Slice From the Start

By leaving out the start index, the range will start at the first character:

Example

Get the characters from the start to position 5 (not included):

b= "Hello,World!"

print(b[:5])

Slice To the End

By leaving out the end index, the range will go to the end:

Example

Get the characters from position 2, and all the way to the end:

b= "Hello,World!"

print(b[2:])

RECORD WORK

EP1. Write a function to find the factorial of a number.
EP2. Write a function to find the LCM of two numbers.
EP3. Write a function to find the GCD of two numbers.
EP4. Write a function to extract a substring from the given string.

Viva-Voce Questions:

1. What are the key features of Python?


2. Explain the ternary operator in Python.
3. How would we convert a string into lowercase?
4. What is the pass statement in Python?

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 4

E4.
5. List collection and tuple collection.
6. Dictionary collection and set collection.
7. Control structures and functions.

Objectives: This lab will develop students’ knowledge in/on
1. the use of List.
2. the use of Dictionary and set collection.
3. the use of control structures and functions.
Outcomes: Upon completion of this lab, students will be able to
1. apply the List on a given data.
2. apply Dictionary and set collection on a given data.
3. apply control structures and function on a given data.

CONCEPT AT A GLANCE

List collection and tuple collection.

Lists:

Lists are used to store multiple items in a single variable. Lists are created using square brackets:

Example

Create a List:

thislist=["apple", "banana", "cherry"]

print(thislist)

Output:

['apple', 'banana', 'cherry']

List Items

List items are ordered, changeable, and allow duplicate values.

List items are indexed, the first item has index [0], the second item has index [1] etc.

Ordered

When we say that lists are ordered, it means that the items have a defined order, and that order will
not change. If we add new items to a list, the new items will be placed at the end of the list.

Changeable

The list is changeable, meaning that we can change, add, and remove items in a list after it has been
created.
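A short sketch of the three kinds of change mentioned above, using the standard list operations (index assignment, append() and remove()); the replacement values are arbitrary.

```python
thislist = ["apple", "banana", "cherry"]

thislist[1] = "blackcurrant"   # change an item in place
thislist.append("orange")      # add an item at the end
thislist.remove("apple")       # remove the first matching item

print(thislist)
```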

Allow Duplicates

Since lists are indexed, lists can have items with the same value:

Example

Lists allow duplicate values:

thislist=["apple", "banana", "cherry", "apple", "cherry"]

print(thislist)

List Length

To determine how many items a list has, use the len() function:

Example

Print the number of items in the list:

thislist=["apple", "banana", "cherry"]

print(len(thislist))

List Items - Data Types

List items can be of any data type:

Example

String, int and boolean data types:

list1=["apple", "banana", "cherry"]

list2=[1, 5, 7, 9, 3]

list3 = [True, False, False]

A list can contain different data types:

Example

A list with strings, integers and boolean values:

list1 = ["abc", 34, True, 40, "male"]

Access Items

List items are indexed and we can access them by referring to the index number:

Example

Print the second item of the list:

thislist=["apple", "banana", "cherry"]

print(thislist[1])

Range of Indexes

We can specify a range of indexes by specifying where to start and where to end the range.

When specifying a range, the return value will be a new list with the specified items.

Example

Return the third, fourth, and fifth item:

thislist=["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"]

print(thislist[2:5])

Example

This example returns the items from the beginning to, but NOT including, "kiwi":

thislist=["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"]

print(thislist[:4])

Check if Item Exists

To determine if a specified item is present in a list use the in keyword:

Example

Check if "apple" is present in the list:

thislist=["apple", "banana", "cherry"]

if "apple" in thislist:
    print("Yes, 'apple' is in the fruits list")

type()

From Python's perspective, lists are defined as objects with the data type 'list':

<class 'list'>

Example

What is the data type of a list?

mylist=["apple", "banana", "cherry"]

print(type(mylist))

Tuple:

Tuples are used to store multiple items in a single variable.

Tuple is one of 4 built-in data types in Python used to store collections of data, the other 3 are List,
Set, and Dictionary, all with different qualities and usage.

A tuple is a collection which is ordered and unchangeable. Tuples are written with round brackets.

Example

Create a Tuple:

thistuple=("apple", "banana", "cherry")

print(thistuple)

Tuple Items

Tuple items are ordered, unchangeable, and allow duplicate values.

Tuple items are indexed, the first item has index [0], the second item has index [1] etc.

Ordered

When we say that tuples are ordered, it means that the items have a defined order, and that order
will not change.

Unchangeable

Tuples are unchangeable, meaning that we cannot change, add or remove items after the tuple has
been created.
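What "unchangeable" means in practice can be seen by attempting an assignment. In this sketch the assignment raises a TypeError, and a new tuple is built by concatenation instead; the variable names `message` and `newtuple` are chosen for illustration.

```python
thistuple = ("apple", "banana", "cherry")

try:
    thistuple[1] = "kiwi"      # tuples do not support item assignment
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# The original tuple is untouched; concatenation creates a new tuple.
newtuple = thistuple + ("kiwi",)
print(thistuple)
print(newtuple)
```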

Allow Duplicates

Since tuples are indexed, they can have items with the same value:

Example

Tuples allow duplicate values:

thistuple=("apple", "banana", "cherry", "apple", "cherry")

print(thistuple)

Tuple Length

To determine how many items a tuple has, use the len() function:

Example

Print the number of items in the tuple:

thistuple=("apple", "banana", "cherry")

print(len(thistuple))

type()

From Python's perspective, tuples are defined as objects with the data type 'tuple':

<class 'tuple'>

Example

What is the data type of a tuple?

mytuple=("apple", "banana", "cherry")

print(type(mytuple))

Dictionary collection and set collection.

Dictionary

Dictionaries are used to store data values in key: value pairs.

A dictionary is a collection which is ordered (as of Python 3.7), changeable and does not allow duplicate keys.

Dictionaries are written with curly brackets, and have keys and values:

Example

Create and print a dictionary:

thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

print(thisdict)

Dictionary Items

Dictionary items are ordered, changeable, and do not allow duplicate keys.

Dictionary items are presented in key:value pairs, and can be referred to by using the key name.

Example

Print the "brand" value of the dictionary:

thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

print(thisdict["brand"])

Ordered or Unordered?

As of Python version 3.7, dictionaries are ordered. In Python 3.6 and earlier, dictionaries are
unordered.

When we say that dictionaries are ordered, it means that the items have a defined order, and that
order will not change.

Unordered means that the items do not have a defined order, so we cannot refer to an item by using
an index.

Changeable

Dictionaries are changeable, meaning that we can change, add or remove items after the dictionary
has been created.

Duplicates Not Allowed

Dictionaries cannot have two items with the same key:

Example

A duplicate key overwrites the existing value:

thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964,
    "year": 2020
}

print(thisdict)

type()

From Python's perspective, dictionaries are defined as objects with the data type 'dict':

<class 'dict'>

Example

Print the data type of a dictionary:

thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

print(type(thisdict))

Accessing Items

We can access the items of a dictionary by referring to its key name, inside square brackets:

Example

Get the value of the "model" key:

thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

x = thisdict["model"]

There is also a method called get() that will give us the same result:

Example

Get the value of the "model" key:

x = thisdict.get("model")
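The two access styles differ when the key is missing: square brackets raise a KeyError, while get() returns None or a default supplied by the caller. A small sketch, in which the key "color" and the default "unknown" are made up for illustration:

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

model = thisdict.get("model")                      # existing key
color = thisdict.get("color")                      # missing key gives None
color_default = thisdict.get("color", "unknown")   # missing key with a default

try:
    thisdict["color"]                              # square brackets raise KeyError
    missing = False
except KeyError:
    missing = True

print(model, color, color_default, missing)
```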

Nested Dictionaries

A dictionary can contain dictionaries, this is called nested dictionaries.

Example

Create a dictionary that contains three dictionaries:

myfamily = {
    "child1": {
        "name": "Emil",
        "year": 2004
    },
    "child2": {
        "name": "Tobias",
        "year": 2007
    },
    "child3": {
        "name": "Linus",
        "year": 2011
    }
}

print(myfamily)

Output:

{'child1': {'name': 'Emil', 'year': 2004}, 'child2': {'name': 'Tobias', 'year': 2007}, 'child3': {'name': 'Linus',
'year': 2011}}

Sets

Sets are used to store multiple items in a single variable.

Set is one of 4 built-in data types in Python used to store collections of data, the other 3 are List, Tuple,
and Dictionary, all with different qualities and usage.

A set is a collection which is both unordered and unindexed.

Sets are written with curly brackets.

Example

Create a Set:

thisset={"apple", "banana", "cherry"}

print(thisset)

Set Items

Set items are unordered, unchangeable, and do not allow duplicate values.

Unordered

Unordered means that the items in a set do not have a defined order.

Set items can appear in a different order every time we use them, and cannot be referred to by index
or key.

Unchangeable

Set items are unchangeable, meaning that we cannot change an item after the set has been created; however, we can remove items and add new items.

Duplicates Not Allowed

Sets cannot have two items with the same value.

Example

Duplicate values will be ignored:

thisset={"apple", "banana", "cherry", "apple"}

print(thisset)
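The set properties described above can be checked in a few lines. The sketch uses the standard set methods add() and discard(); the added value "orange" is arbitrary.

```python
thisset = {"apple", "banana", "cherry", "apple"}

size = len(thisset)            # the duplicate "apple" is stored only once
print(size)

# Items cannot be changed in place, but they can be added and removed.
thisset.add("orange")
thisset.discard("banana")

# sorted() is used only to print in a predictable order.
print(sorted(thisset))
```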

Get the Length of a Set

To determine how many items a set has, use the len() function.

Example

Get the number of items in a set:

thisset={"apple", "banana", "cherry"}

print(len(thisset))

Access Items

We cannot access items in a set by referring to an index or a key.

But we can loop through the set items using a for loop, or ask if a specified value is present in a set,
by using the in keyword.

Example

Loop through the set, and print the values:

thisset={"apple", "banana", "cherry"}

for x in thisset:
    print(x)

type()

From Python's perspective, sets are defined as objects with the data type 'set':

<class 'set'>

Example

What is the data type of a set?

myset={"apple", "banana", "cherry"}

print(type(myset))

Join Two Sets

There are several ways to join two or more sets in Python.

We can use the union() method that returns a new set containing all items from both sets, or the
update() method that inserts all the items from one set into another:

Example

The union() method returns a new set with all items from both sets:

set1={"a", "b" , "c"}

set2={1, 2, 3}

set3=set1.union(set2)

print(set3)
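The text above also mentions update(), which differs from union() in that it changes the original set instead of returning a new one. A short sketch contrasting the two, reusing the same example values:

```python
set1 = {"a", "b", "c"}
set2 = {1, 2, 3}

set3 = set1.union(set2)  # returns a new set; set1 is left unchanged
set1.update(set2)        # modifies set1 in place and returns None

print(set1 == set3)
```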

Control structures and functions.

Python Conditions and If statements

Python supports the usual logical conditions from mathematics:

• Equals: a == b
• Not Equals: a != b
• Less than: a < b
• Less than or equal to: a <= b
• Greater than: a > b
• Greater than or equal to: a >= b

These conditions can be used in several ways, most commonly in "if statements" and loops.

An "if statement" is written by using the if keyword.

Example

If statement:

a= 33

b= 200

if b > a:
    print("b is greater than a")

In this example we use two variables, a and b, which are used as part of the if statement to test whether
b is greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so we print to
screen that "b is greater than a".

Indentation

Python relies on indentation (whitespace at the beginning of a line) to define scope in the code. Other
programming languages often use curly-brackets for this purpose.

Elif

The elif keyword is Python's way of saying "if the previous conditions were not true, then try this
condition".

Example

a= 33

b= 33

if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")

In this example a is equal to b, so the first condition is not true, but the elif condition is true, so we
print to screen that "a and b are equal".

Else

The else keyword catches anything which isn't caught by the preceding conditions.

Example

a= 200

b= 33

if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")

In this example a is greater than b, so the first condition is not true, also the elif condition is not true,
so we go to the else condition and print to screen that "a is greater than b".

We can also have an else without the elif:

Example

a= 200

b= 33

if b > a:
    print("b is greater than a")
else:
    print("b is not greater than a")

Loops in python

Python provides two loop statements, while and for, and either one can be nested inside another
loop. Both repeat a block of code; they differ in their syntax and in how the repetition is
controlled.

While Loop:

In Python, a while loop executes a block of statements repeatedly as long as a given condition is
true. When the condition becomes false, execution continues with the line immediately after the
loop.

Example

Print i as long as i is less than 6:

i= 1

while i < 6:
    print(i)
    i += 1

The break Statement

With the break statement we can stop the loop even if the while condition is true:

Example

Exit the loop when i is 3:

i= 1

while i < 6:
    print(i)
    if i == 3:
        break
    i += 1

The continue Statement

With the continue statement we can stop the current iteration, and continue with the next:

Example

Continue to the next iteration if i is 3:

i= 0

while i < 6:
    i += 1
    if i == 3:
        continue
    print(i)

The else Statement

With the else statement we can run a block of code once when the condition no longer is true:

Example

Print a message once the condition is false:

i= 1

while i < 6:
    print(i)
    i += 1
else:
    print("i is no longer less than 6")
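One detail worth seeing in code: the else block of a while loop is skipped when the loop is left through break. The sketch below wraps this in a hypothetical helper, first_multiple(), which is not part of the manual and exists only to demonstrate both paths.

```python
def first_multiple(limit, divisor):
    """Search 1..limit-1 for the first multiple of divisor (illustrative helper)."""
    i = 1
    while i < limit:
        if i % divisor == 0:
            result = f"found {i}"
            break            # leaving via break skips the else block
        i += 1
    else:
        # Runs only when the condition became false without a break.
        result = "none found"
    return result

print(first_multiple(6, 4))
print(first_multiple(3, 4))
```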

For Loop:

A for loop is used for iterating over a sequence or other iterable (a list, a tuple, a dictionary, a
set, or a string).

This is less like the for keyword in other programming languages, and works more like an iterator
method as found in other object-oriented programming languages.

With the for loop we can execute a set of statements, once for each item in a list, tuple, set etc.

Example

Print each fruit in a fruit list:

fruits=["apple", "banana", "cherry"]

for x in fruits:
    print(x)

Looping Through a String

Strings are iterable objects too; they contain a sequence of characters:

Example

Loop through the letters in the word "banana":

for x in "banana":
    print(x)

The range() Function

To loop through a set of code a specified number of times, we can use the range() function.

The range() function returns a sequence of numbers, starting from 0 by default, incrementing by 1
(by default), and ending before a specified number. For example, range(6) produces the values 0 to 5.

Example

Using the range() function:

for x in range(6):
    print(x)

The range() function defaults to increment the sequence by 1, however it is possible to specify the
increment value by adding a third parameter: range(2, 30, 3):

Example

Increment the sequence with 3 (default is 1):

for x in range(2, 30, 3):
    print(x)
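To inspect what range() generates without writing a loop, the sequence can be converted to a list. This sketch shows the one-, two- and three-argument forms; the variable names are illustrative.

```python
stop_only = list(range(6))          # start defaults to 0, step to 1
start_stop = list(range(2, 6))      # the stop value 6 is not included
with_step = list(range(2, 30, 3))   # every third number from 2, below 30

print(stop_only)
print(start_stop)
print(with_step)
```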

Nested Loops

A nested loop is a loop inside a loop.

The "inner loop" will be executed one time for each iteration of the "outer loop":

Example

Print each adjective for every fruit:

adj=["red", "big", "tasty"]

fruits=["apple", "banana", "cherry"]

for x in adj:
    for y in fruits:
        print(x, y)

RECORD WORK

In-Lab Exercise Problems (EPs)

(a). In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.

Prior preparation on the EPs will help students a lot in completing them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.

(b). For EP1, all steps of PDS are mandatory:

Algorithm/pseudocode development and flowchart drawing are mandatory only for
the FIRST Exercise Problem (EP-1) of every laboratory task.

(c). Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper Annotations and Indentation)
b. maintainable
c. extendable
d. testable and
e. robust

RECORD WORK

EP1. Write Python code to reverse a given list.
EP2. Write Python code to perform various operations on lists.
EP3. Write Python code to perform various operations on tuples.
EP4. Write Python code to perform various operations on dictionaries.
EP5. Write Python code to perform various operations on sets.

Viva-Voce Questions:

1. How does a set differ from a list?
2. Explain the Union, Intersection, Difference and Symmetric Difference operations on sets.
3. What is the shorthand if...else condition?
4. What is the difference between break and continue?

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 5

E5. 8. Introduction to NumPy, Operations on NumPy arrays

Objectives: This lab will develop students’ knowledge in/on

1. to learn the use of NumPy.
2. to learn how to use operations on NumPy arrays.
Outcomes: Upon completion of this lab, students will be able to
1. apply NumPy to the data.
2. apply operations on NumPy arrays.

CONCEPT AT A GLANCE

NumPy is a Python library.

NumPy is used for working with arrays.

NumPy is short for "Numerical Python".

Example

Create a NumPy ndarray Object

NumPy is used to work with arrays. The array object in NumPy is called ndarray.

We can create a NumPy ndarray object by using the array() function.

Create a NumPy array:

import numpy as np

arr=np.array([1, 2, 3, 4, 5])

print(arr)

print(type(arr))

OUTPUT

[1 2 3 4 5]

<class 'numpy.ndarray'>

Dimensions in Arrays

A dimension in arrays is one level of array depth (nested arrays).

0-D Arrays

0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.

Example

Create a 0-D array with value 42

import numpy as np

arr=np.array(42)

print(arr)

1-D Arrays

An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.

These are the most common and basic arrays.

Example

Create a 1-D array containing the values 1,2,3,4,5:

import numpy as np

arr=np.array([1, 2, 3, 4, 5])

print(arr)

2-D Arrays

An array that has 1-D arrays as its elements is called a 2-D array.

These are often used to represent matrices or 2nd-order tensors.

Example

Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:

import numpy as np

arr=np.array([[1, 2, 3],[4, 5, 6]])

print(arr)

3-D arrays

An array that has 2-D arrays (matrices) as its elements is called a 3-D array.

These are often used to represent a 3rd order tensor.

Example

Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and 4,5,6:

import numpy as np

arr=np.array([[[1, 2, 3],[4, 5, 6]],[[1, 2, 3],[4, 5, 6]]])

print(arr)
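The number of dimensions of any of the arrays above can be checked with the ndim attribute of ndarray. A brief sketch that reuses the four example arrays (the variable names a to d are illustrative):

```python
import numpy as np

a = np.array(42)                                  # 0-D
b = np.array([1, 2, 3, 4, 5])                     # 1-D
c = np.array([[1, 2, 3], [4, 5, 6]])              # 2-D
d = np.array([[[1, 2, 3], [4, 5, 6]],
              [[1, 2, 3], [4, 5, 6]]])            # 3-D

dims = [a.ndim, b.ndim, c.ndim, d.ndim]
print(dims)
```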

Access Array Elements

Array indexing is the same as accessing an array element.

We can access an array element by referring to its index number.

The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.

Example

Get the first element from the following array:

import numpy as np

arr=np.array([1, 2, 3, 4])

print(arr[0])

Access 2-D Arrays

To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.

Example

Access the 2nd element on 1st dim:

import numpy as np

arr=np.array([[1,2,3,4,5],[6,7,8,9,10]])

print('2nd element on 1st dim: ', arr[0, 1])

Access 3-D Arrays

To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.

Example

Access the third element of the second array of the first array:

import numpy as np

arr=np.array([[[1, 2, 3],[4, 5, 6]],[[7, 8, 9],[10, 11, 12]]])

print(arr[0, 1, 2])

Slicing arrays

Slicing in Python means taking elements from one given index up to, but not including, another
given index.

We pass a slice instead of an index, like this: [start:end].

We can also define the step, like this: [start:end:step].

If we don't pass start, it is taken as 0.

If we don't pass end, it is taken as the length of the array in that dimension.

If we don't pass step, it is taken as 1.

Example

Slice elements from index 1 to index 5 (not included) from the following array:

import numpy as np

arr=np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5])

Negative Slicing

Use the minus operator to refer to an index from the end:

Example

Slice from the index 3 from the end to index 1 from the end:

import numpy as np

arr=np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[-3:-1])

STEP

Use the step value to determine the step of the slicing:

Example

Return every other element from index 1 to index 5:

import numpy as np

arr=np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5:2])

Slicing 2-D Arrays

Example

From the second row, slice elements from index 1 to index 4 (not included):

import numpy as np

arr=np.array([[1, 2, 3, 4, 5],[6, 7, 8, 9, 10]])

print(arr[1, 1:4])
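Slices can be applied to both dimensions at once. The sketch below, using the same 2-D array, picks one row, one column, and a rectangular block; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

row_slice = arr[1, 1:4]    # second row, columns 1 to 3
column = arr[0:2, 2]       # both rows, column 2 only
block = arr[0:2, 1:4]      # both rows, columns 1 to 3 (a 2-D result)

print(row_slice)
print(column)
print(block)
```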

Shape of an Array

The shape of an array is the number of elements in each dimension.

Get the Shape of an Array.

NumPy arrays have an attribute called shape that returns a tuple whose entries give the number
of elements along each dimension.

Example

Print the shape of a 2-D array:

import numpy as np

arr=np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)

Joining NumPy Arrays

Joining means putting contents of two or more arrays in a single array.

In SQL we join tables based on a key, whereas in NumPy we join arrays by axes.

We pass a sequence of arrays that we want to join to the concatenate() function, along with the axis.
If axis is not explicitly passed, it is taken as 0.

Example

Join two arrays

import numpy as np

arr1=np.array([1, 2, 3])

arr2=np.array([4, 5, 6])

arr=np.concatenate((arr1,arr2))

print(arr)
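For 2-D arrays the axis argument decides whether rows or columns are joined. A small sketch with two illustrative 2 x 2 arrays:

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

stacked_rows = np.concatenate((arr1, arr2))          # axis=0 (default): 4 rows
stacked_cols = np.concatenate((arr1, arr2), axis=1)  # axis=1: 4 columns

print(stacked_rows)
print(stacked_cols)
```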

Splitting NumPy Arrays

Splitting is the reverse operation of joining.

Joining merges multiple arrays into one, and splitting breaks one array into multiple arrays.

We use array_split() for splitting arrays. We pass it the array we want to split and the number of
splits.

Example

Split the array in 3 parts:

import numpy as np

arr=np.array([1, 2, 3, 4, 5, 6])

newarr= np.array_split(arr, 3)

print(newarr)
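array_split() also copes with a length that does not divide evenly: the leading parts receive one extra element each. A minimal sketch, splitting six elements into four parts:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

parts = np.array_split(arr, 4)     # 6 elements cannot be divided evenly by 4
sizes = [len(p) for p in parts]    # the first parts are one element longer

print(parts)
print(sizes)
```

By contrast, np.split(arr, 4) would raise an error here because it requires an equal division.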

Searching Arrays

We can search an array for a certain value and return the indexes where it matches.

To search an array, use the where() function.

Example

Find the indexes where the value is 4:

import numpy as np

arr=np.array([1, 2, 3, 4, 5, 4, 4])

x= np.where(arr== 4)

print(x)
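When called with only a condition, where() returns a tuple containing one index array per dimension, so for a 1-D array the indexes are in the first entry. A sketch that unpacks the result and also searches with a different condition:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

x = np.where(arr == 4)
indexes = x[0]                              # index array for the only dimension
even_indexes = np.where(arr % 2 == 0)[0]    # positions holding even values

print(indexes)
print(even_indexes)
```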

Sorting Arrays

Sorting means putting elements in an ordered sequence.

Ordered sequence is any sequence that has an order corresponding to elements, like numeric or
alphabetical, ascending or descending.

NumPy provides a function called sort() that returns a sorted copy of a specified array.

Example

Sort the array:

import numpy as np

arr=np.array([3, 2, 0, 1])

print(np.sort(arr))
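np.sort() leaves the original array untouched, while the sort() method of the array itself sorts in place. A short sketch of the difference:

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

sorted_copy = np.sort(arr)        # new array; arr keeps its original order
original_order = arr.tolist()

arr.sort()                        # sorts arr itself and returns None

print(sorted_copy, original_order, arr)
```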

RECORD WORK

In-Lab Exercise Problems (EPs)

1. In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.

Prior preparation on the EPs will help students a lot in completing them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.

2. Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper Annotations and Indentation)
b. maintainable
c. extendable
d. testable and
e. robust

RECORD WORK

EP1. Write Python code to print column wise addition of a 2-D array.
EP2. Write Python code to print row wise addition of a 2-D array.
EP3. Write Python code to print diagonal elements of a 2-D array.

Viva-Voce Questions:

1. What is the reshape function?
2. How do we create an identity matrix?
3. List the functions used to find the min, max and average of array elements.
4. What is slicing?

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 6

E6. 9. Introduction to Pandas, Getting and Cleaning Data.

Objectives: This lab will develop students’ knowledge in/on

1. to learn the basics of pandas.
2. to learn how to clean the data in a dataset.
Outcomes: Upon completion of this lab, students will be able to
1. apply Python libraries.
2. apply techniques to clean the data in a dataset.

CONCEPT AT A GLANCE

Pandas is a Python library.

Pandas is used to analyze data.

Example

import pandas

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

Pandas as pd

Pandas is usually imported under the pd alias.

Create an alias with the as keyword while importing:

import pandas as pd

Now the Pandas package can be referred to as pd instead of pandas.

Example

import pandas as pd

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

Example

Create a simple Pandas Series from a list:

import pandas as pd

a=[1, 7, 2]

myvar=pd.Series(a)

print(myvar)

Labels

If nothing else is specified, the values are labeled with their index number. First value has index 0,
second value has index 1 etc.

This label can be used to access a specified value.

Example

Return the first value of the Series:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar[0])

Output:

1

Create Labels

With the index argument, we can name our own labels.

Example

Create our own labels:

import pandas as pd

a=[1, 7, 2]

myvar=pd.Series(a,index=["x", "y", "z"])

print(myvar)

output:

x 1

y 7

z 2

dtype: int64

What is a DataFrame?

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table
with rows and columns.

Example

Create a simple Pandas DataFrame:

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

#load data into a DataFrame object:

df = pd.DataFrame(data)

print(df)

Result

calories duration

0 420 50

1 380 40

2 390 45

Locate Row

As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified row(s).

Example

Return row 0:

#refer to the row index:

print(df.loc[0])

Result

calories 420

duration 50

Name: 0, dtype: int64
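The type of the result depends on how loc is indexed: a single label returns a Series, while a list of labels returns a DataFrame. A brief sketch with the same calories and duration data:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

one_row = df.loc[0]          # single label: the row comes back as a Series
two_rows = df.loc[[0, 1]]    # list of labels: the rows come back as a DataFrame

print(type(one_row).__name__, type(two_rows).__name__)
```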

Named Indexes

With the index argument, you can name your own indexes.

Example

Add a list of names to give each row a name:

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

Result

calories duration

day1 420 50

day2 380 40

day3 390 45

Read CSV Files

A simple way to store big data sets is to use CSV files (comma-separated values files).

CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

Example

Load the CSV into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())

Read JSON

Big data sets are often stored, or extracted as JSON.

JSON is plain text, but has the format of an object, and is well known in the world of programming,
including Pandas.

In our examples we will be using a JSON file called 'data.json'.

Example

Load the JSON file into a DataFrame:

import pandas as pd

df = pd.read_json('data.json')

print(df.to_string())

Viewing the Data

One of the most used methods for getting a quick overview of the DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

Example

Get a quick overview by printing the first 10 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10))

Pandas - Cleaning Data

Data Cleaning

Data cleaning means fixing bad data in your data set.

Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates

Data Set:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0

The data set contains some empty cells ("Date" in row 22, and "Calories" in row 18 and 28).

The data set contains wrong format ("Date" in row 26).

The data set contains wrong data ("Duration" in row 7).

The data set contains duplicates (row 11 and 12).

Empty Cells

Empty cells can potentially give you a wrong result when you analyze data.

Remove Rows

One way to deal with empty cells is to remove rows that contain empty cells.

This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.

Example

Return a new Data Frame with no empty cells:

import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())
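The same behaviour can be tried without data.csv. The sketch below builds a tiny DataFrame by hand (the values are illustrative, not the lab data set) and shows that dropna() returns a new DataFrame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Small hand-made frame with one empty cell, for illustration only.
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Calories": [409.1, np.nan, 300.0],
})

new_df = df.dropna()   # rows containing NaN are dropped in the returned copy

print(len(df), len(new_df))
```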

Replace Empty Values

Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty cells.

The fillna() method allows us to replace empty cells with a value:

Example

Replace NULL values with the number 130:

import pandas as pd

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True)
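Filling every empty cell with one number is rarely what we want, since each column has its own scale. A common alternative, sketched here on a small hand-made DataFrame (illustrative values, not the lab data set), is to fill a single column with that column's mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, 90, 103],
    "Calories": [409.1, np.nan, 300.0],
})

mean_calories = df["Calories"].mean()                   # NaN is ignored by mean()
df["Calories"] = df["Calories"].fillna(mean_calories)   # only this column changes

print(df)
```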

Data of Wrong Format

Cells with data of wrong format can make it difficult, or even impossible, to analyze data.

To fix it, you have two options: remove the rows, or convert all cells in the columns into the same
format.

Convert Into a Correct Format

In our Data Frame, we have two cells with the wrong format. Check out row 22 and 26, the 'Date'
column should be a string that represents a date:

Duration Date Pulse Maxpulse Calories


0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0

Let's try to convert all cells in the 'Date' column into dates.

Pandas has a to_datetime() function for this:

Example

Convert to date:

import pandas as pd

df = pd.read_csv('data.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())

Result:

Duration Date Pulse Maxpulse Calories


0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaT 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 '2020/12/26' 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
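After conversion, an empty date becomes NaT (Not a Time), which can then be removed like any other empty cell. A minimal sketch on a hand-made DataFrame; the three rows and the explicit format string are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020/12/21", None, "2020/12/23"],
    "Duration": [60, 45, 60],
})

# Convert the text dates; the missing entry becomes NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# Drop only the rows whose Date is empty.
df = df.dropna(subset=["Date"])

print(df)
```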

Wrong Data

"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if
someone registered "199" instead of "1.99".

Sometimes we can spot wrong data by looking at the data set, because we have an expectation of
what it should be.

If you take a look at our data set, you can see that in row 7, the duration is 450, but for all the other
rows the duration is between 30 and 60.

It doesn't have to be wrong, but taking into consideration that this is the data set of someone's
workout sessions, we can conclude that this person did not work out for 450 minutes.

Replacing Values

One way to fix wrong values is to replace them with something else.

In our example, it is most likely a typo, and the value should be "45" instead of "450", and we could
just insert "45" in row 7:

Example

Set "Duration" = 45 in row 7:

df.loc[7, 'Duration'] = 45
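For larger data sets, fixing rows one by one is impractical. A rule can be applied to every row at once with a Boolean mask inside loc. In this sketch the limit of 120 minutes and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 450, 45, 30]})

# Cap every duration above 120 at 120 (the threshold is an assumed rule).
df.loc[df["Duration"] > 120, "Duration"] = 120

print(df)
```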

Discovering Duplicates

Duplicate rows are rows that have been registered more than one time.

By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.

To discover duplicates, we can use the duplicated() method.

The duplicated() method returns a Boolean value for each row:

Example

Returns True for every row that is a duplicate, otherwise False:

print(df.duplicated())
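Once duplicates have been found, they can be removed with the drop_duplicates() method. A short sketch on a hand-made DataFrame (illustrative values) showing both steps:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Pulse": [100, 100, 97],
})

flags = df.duplicated()          # True marks a repeat of an earlier row
deduped = df.drop_duplicates()   # new DataFrame without the repeated rows

print(flags.tolist())
print(deduped)
```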

RECORD WORK

In-Lab Exercise Problems (EPs)

1. In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.

Prior preparation on the EPs will help students a lot in completing them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.

2. Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper Annotations and Indentation)
b. maintainable
c. extendable
d. testable and
e. robust

RECORD WORK

EP1. Write Python code to sort values in ascending and descending order.
EP2. Create a dataset with null values and write code to remove null values from the dataset.

Viva-Voce Questions:

1. What is a DataFrame?
2. How do we get information about a data set?
3. List the methods available to clean the data.
4. What is JSON?

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 7

E7. 10. Introduction to Data Visualization.
11. Basics of Visualization: Plots, Subplots and their Functionalities.

Objectives: This lab will develop students’ knowledge in/on

1. to learn the basics of data visualization.
2. to learn the basics of visualization such as plots, subplots and their functionalities.
Outcomes: Upon completion of this lab, students will be able to
1. apply data visualization to datasets.
2. apply basic visualization tools to analyze the data.

CONCEPT AT A GLANCE

Introduction to Data visualization

Data visualization is the discipline of trying to understand data by placing it in a visual context so
that patterns, trends and correlations that might not otherwise be detected can be exposed.

Python offers several capable graphing libraries that come packed with many different features.
Whether we want to create interactive, live or highly customized plots, Python has a suitable
library.

To get a little overview here are a few popular plotting libraries:

• Matplotlib: low level, provides lots of freedom

• Pandas Visualization: easy to use interface, built on Matplotlib

• Seaborn: high-level interface, great default styles

• ggplot: based on R’s ggplot2, uses Grammar of Graphics

• Plotly: can create interactive plots

What is Matplotlib?

Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
Matplotlib was created by John D. Hunter. Matplotlib is open source and we can use it freely.
Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and JavaScript
for platform compatibility.

Installation of Matplotlib

Install it using this command:

C:\Users\Your Name>pip install matplotlib

If this command fails, then use a python distribution that already has Matplotlib installed, like
Anaconda, Spyder etc.

Import Matplotlib

Once Matplotlib is installed, import it in your applications by adding the import module statement:

import matplotlib

Pyplot

Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported under the
plt alias:

import matplotlib.pyplot as plt

Now the Pyplot package can be referred to as plt.

Example

Draw a line in a diagram from position (0,0) to position (6,250):

import matplotlib.pyplot as plt
import numpy as np

# x and y coordinates of the two end points of the line
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])

plt.plot(xpoints, ypoints)
plt.show()

Basics of visualization: Plots, Subplots and their functionalities

Matplotlib Plotting

Plotting x and y points

The plot() function is used to draw points (markers) in a diagram. By default, the plot() function
draws a line from point to point. The function takes parameters for specifying points in the
diagram. Parameter 1 is an array containing the points on the x-axis. Parameter 2 is an array
containing the points on the y-axis. If we need to plot a line from (1, 3) to (8, 10), we have to pass
two arrays [1, 8] and [3, 10] to the plot function.

import matplotlib.pyplot as plt
import numpy as np

# Line from (1, 3) to (8, 10): first array holds the x values, second the y values
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])

plt.plot(xpoints, ypoints)
plt.show()

Matplotlib Subplots

Display Multiple Plots

With the subplot() function we can draw multiple plots in one figure.

The subplot() function takes three arguments that describe the layout of the figure. The layout is
organized in rows and columns, which are represented by the first and second arguments. The third
argument represents the index of the current plot.

plt.subplot(1, 2, 1)

#the figure has 1 row, 2 columns, and this plot is the first plot.

plt.subplot(1, 2, 2)

#the figure has 1 row, 2 columns, and this plot is the second plot.

import matplotlib.pyplot as plt
import numpy as np

#plot 1: first cell of a layout with 1 row and 2 columns
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)

#plot 2: second cell of the same layout
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)

plt.show()
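Matplotlib also provides plt.subplots() (note the plural), which creates the figure and all of its axes in a single call. The following is a minimal sketch of that style, not the manual's prescribed method. The titles and the output file name subplots_demo.png are assumptions made for illustration, and the Agg backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
y2 = np.array([10, 20, 30, 40])

# One row, two columns; axes is an array holding one Axes object per plot
fig, axes = plt.subplots(1, 2, figsize=(8, 4))

axes[0].plot(x, y1)
axes[0].set_title("Plot 1")

axes[1].plot(x, y2)
axes[1].set_title("Plot 2")

fig.tight_layout()                # avoid overlapping labels
fig.savefig("subplots_demo.png")  # hypothetical output file name
```

In a notebook, fig.savefig(...) can be replaced by plt.show(). Working with explicit Axes objects makes it clear which plot each call modifies.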

### Line Plot Graph

x = [1,2,3,4,6,8,9,10,12,15]

y = [10,20,30,40,60,80,90,100,102,150]

plt.figure(figsize=(8,5)) ## width,height

plt.title("Line Plot Graph",fontsize=15,color='red')

plt.xlabel("X Axis --->",fontsize=12,color='red')

plt.ylabel("Y Axis --->",fontsize=12,color='red')

plt.plot(x,y,color='green',lw=3,linestyle="dotted",label="Line Plot")

plt.legend(loc="best")

## linestyle - solid,dotted,dashed, lw= line width

plt.show()

### Scatter Plot Graph

x = [1,2,3,4,6,8,9,10,12,15]

y = [10,20,30,40,60,80,90,100,102,150]

plt.figure(figsize=(8,5)) ## width,height

plt.title("Scatter Plot Graph",fontsize=15,color='red')

plt.xlabel("X Axis --->",fontsize=12,color='red')

plt.ylabel("Y Axis --->",fontsize=12,color='red')

### marker options include o, *, d, v, ^, <, > (see https://matplotlib.org/stable/api/markers_api.html)

plt.scatter(x,y,color='green',label="Scatter Plot",s=150,marker="*")

plt.legend(loc="best")

plt.show()

### Bar Plot Graph

x = [1,2,3,4,6,8,9,10,12,15]

y = [10,20,30,40,60,80,90,100,102,150]

plt.figure(figsize=(8,5)) ## width,height

plt.title("Bar Plot Graph",fontsize=15,color='red')

plt.xlabel("X Axis --->",fontsize=12,color='red')

plt.ylabel("Y Axis --->",fontsize=12,color='red')

plt.bar(x,y,color=['green','orange'],label="Bar Plot",width=0.6)

plt.legend(loc="best")

plt.show()

RECORD WORK

In-Lab Exercise Problems (EPs)

1. In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.

Prior preparation on the EPs helps students complete them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.

2. Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper Annotations and Indentation)
b. maintainable
c. extendable
d. testable and
e. robust

RECORD WORK

EP1. Write Python code to visualize a sine plot.


EP2. Write Python code to visualize a cosine plot.
EP3. Write Python code to visualize a relplot (Seaborn's relational plot).

Viva-Voce Questions:

1. What is a scatter graph?


2. What is subplot?
3. What is a legend?

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 8

E8. 12. Plotting Data distributions, Categorical and Time-Series data

Objectives: This lab will develop students’ knowledge of

1. plotting data distributions for categorical data.
2. plotting data distributions for time-series data.

Outcomes: Upon completion of this lab, students will be able to
1. apply plotting to categorical data.
2. apply plotting to time-series data.

CONCEPT AT A GLANCE

Data Distribution

Real-world data sets are often very large, but gathering real-world data can be difficult, at
least at an early stage of a project.

How Can We Get Big Data Sets?

To create big data sets for testing, we use the Python module NumPy, which comes with a number
of methods for creating random data sets of any size.

Histogram

To visualize a data set, we can draw a histogram of its values. We will use the Python
module Matplotlib to draw the histogram.

Normal Data Distribution

A data distribution in which the values are concentrated around a central value and become less
frequent further away from it is known in probability theory as the normal data distribution, or the
Gaussian data distribution, after the mathematician Carl Friedrich Gauss who came up with the
formula of this data distribution.
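The ideas above can be sketched as follows: NumPy generates a large random data set drawn from a normal distribution, and Matplotlib draws its histogram. The mean of 5.0, the standard deviation of 1.0, the sample size, the seed and the file name normal_hist.png are all assumptions chosen for illustration. The Agg backend is used only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Random generator with a fixed seed so that the run is repeatable
rng = np.random.default_rng(42)

# 100,000 values concentrated around 5.0 with standard deviation 1.0
data = rng.normal(loc=5.0, scale=1.0, size=100_000)

# hist() returns the bar heights and the bin edges along with the drawn bars
counts, bin_edges, _ = plt.hist(data, bins=100)
plt.title("Normal Data Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.savefig("normal_hist.png")  # hypothetical output file name
```

The resulting histogram should show the familiar bell shape, with most values close to 5.0.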

What is a Time Series?

A time series is a sequence of observations recorded at regular time intervals.

Depending on the frequency of observations, a time series may typically be hourly, daily, weekly,
monthly, quarterly or annual. Sometimes you may also have second-wise or minute-wise time series,
such as the number of clicks or user visits every minute.
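A daily time series can be plotted by placing the dates on the x-axis. The sketch below uses only the standard datetime module together with Matplotlib. The visit counts are made-up values, and the file name time_series_demo.png is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
from datetime import date, timedelta

# Fourteen consecutive days: one observation per day (regular interval)
start = date(2023, 1, 1)
days = [start + timedelta(days=i) for i in range(14)]

# Made-up number of user visits for each day
visits = [120, 135, 128, 150, 160, 90, 85, 125, 140, 133, 155, 170, 95, 88]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(days, visits, marker="o")
ax.set_title("Daily Visits (illustrative data)")
ax.set_xlabel("Date")
ax.set_ylabel("Visits")
fig.autofmt_xdate()                  # rotate the date labels so they do not overlap
fig.savefig("time_series_demo.png")  # hypothetical output file name
```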

What is Categorical Data?

Categorical features can only take on a limited, and usually fixed, number of possible values. For
example, if a dataset is about information related to users, then you will typically find features like
country, gender, age group, etc. Alternatively, if the data you're working with is related to products,
you will find features like product type, manufacturer, seller and so on.

These are all categorical features in your dataset. These features are typically stored as text values
which represent various traits of the observations. For example, gender is described as Male (M) or
Female (F), and product type could be described as electronics, apparel, food, etc.
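A common way to plot a categorical feature is to count how often each category occurs and draw the counts as bars. The sketch below assumes a small made-up list of product types. The file name category_counts.png is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
from collections import Counter

# Made-up categorical observations (product type of eight sold items)
product_types = ["electronics", "apparel", "food", "food",
                 "electronics", "food", "apparel", "food"]

# Count the occurrences of each category
counts = Counter(product_types)
categories = list(counts.keys())
frequencies = list(counts.values())

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, frequencies, color="green", width=0.6)
ax.set_title("Items Sold per Product Type (illustrative data)")
ax.set_xlabel("Product type")
ax.set_ylabel("Count")
fig.savefig("category_counts.png")  # hypothetical output file name
```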

### Bar Plot Graph

x = [1,2,3,4,6,8,9,10,12,15]

y = [10,20,30,40,60,80,90,100,102,150]

plt.figure(figsize=(8,5)) ## width,height

plt.title("Bar Plot Graph",fontsize=15,color='red')

plt.xlabel("X Axis --->",fontsize=12,color='red')

plt.ylabel("Y Axis --->",fontsize=12,color='red')

plt.bar(x,y,color=['green','orange'],label="Bar Plot",width=0.6)

plt.plot(x,y,color='red',lw=1,linestyle="solid",label="Line Plot")

plt.legend(loc="best")

plt.show()

### Horizontal Bar Plot Graph

x = [1,2,3,4,6,8,9,10,12,15]

y = [10,20,30,40,60,80,90,100,102,150]

plt.figure(figsize=(8,5)) ## width,height

plt.title("Horizontal Bar Plot Graph",fontsize=15,color='red')

plt.ylabel("X Axis --->",fontsize=12,color='red')

plt.xlabel("Y Axis --->",fontsize=12,color='red')

plt.barh(x,y,color=['green','orange'],label="Bar Plot",height=0.6)

plt.legend(loc="best")

plt.show()

### Pie Chart

slices = [30,100,50,22,44,66,22,55]

names = ["A","B","C","D","E","F","G","H"]

cols = ["red","blue","orange","green","pink","violet","magenta","yellow"]

plt.figure(figsize=(6,6))

plt.pie(slices,labels=names,colors=cols,autopct="%0.2f%%",explode=(0.2,0,0,0.5,0,0,0,0))

plt.legend(loc=4)

plt.show()

RECORD WORK

EP1. Draw all the chart types covered in this lab for sales data analysis.

Viva-Voce Questions:

1. What is a normal distribution curve?


2. Give example for time series data.
3. What tools are used to analyze the data?
4. What is categorical data?

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 9

E9. 13. Generate association rules from frequent item sets.

Objectives: This lab will develop students’ knowledge of

1. the use of association rules on a dataset.
2. how to find frequent item sets in a dataset.

Outcomes: Upon completion of this lab, students will be able to
1. apply association rules to a dataset.
2. apply association mining to find frequent item sets in a dataset.

CONCEPT AT A GLANCE

Association mining searches for frequent items in a data set. Frequent mining usually finds the
interesting associations and correlations between item sets in transactional and relational
databases.

In short, frequent mining shows which items appear together in a transaction or relation.

Need of Association Mining:

Frequent mining is the generation of association rules from a transactional dataset. If two items
X and Y are frequently purchased together, then it is good to place them together in stores or to
offer a discount on one item on purchase of the other. This can increase sales. For example, it is
likely that a customer who buys milk and bread also buys butter.

So the association rule is ['milk'] ^ ['bread'] => ['butter']. The seller can therefore suggest that
the customer buy butter if he/she buys milk and bread.

Important Definitions:

Support: It is one of the measures of interestingness. It tells how useful a rule is, i.e. how often
the rule applies. A support of 5% means that 5% of all transactions in the database contain the items of the rule.

Support(A -> B) = Support_count(A ∪ B) / (total number of transactions)

Confidence: It measures the certainty of a rule. A confidence of 60% means that 60% of the
customers who purchased milk and bread also bought butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)

If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
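The two measures can be computed by hand on a small example. The sketch below uses five made-up transactions and the rule {milk, bread} -> {butter}. Support is expressed as a fraction of all transactions, which matches the percentage reading used in the text. The function names are assumptions made for illustration and do not belong to any library.

```python
# Five made-up transactions, each stored as a set of items
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    both = antecedent | consequent
    return support_count(both, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    both = antecedent | consequent
    return support_count(both, transactions) / support_count(antecedent, transactions)

A = {"milk", "bread"}
B = {"butter"}
rule_support = support(A, B, transactions)
rule_confidence = confidence(A, B, transactions)
print("Support:", rule_support)
print("Confidence:", round(rule_confidence, 2))
```

Here two of the five transactions contain milk, bread and butter, and three contain milk and bread, so the support is 2/5 and the confidence is 2/3.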

Program:

# Run once in a notebook cell: !pip install apyori fsspec
import pandas as pd
from apyori import apriori

store_data = pd.read_csv('/content/drive/MyDrive/store_data.csv', header=None)
num_records = len(store_data)
print(num_records)

# Build one list of item names per transaction, skipping empty (NaN) cells.
# Only the first four columns of the file are read.
records = []
for i in range(0, num_records):
    records.append([str(store_data.values[i, j]) for j in range(0, 4)
                    if str(store_data.values[i, j]) != 'nan'])
print(records)

association_rules = apriori(records, min_support=0.4, min_confidence=0.2)
association_results = list(association_rules)
print(len(association_results))

sup_list = []
conf_list = []
for item in association_results:
    # item[0] holds the item set, item[1] its support and
    # item[2] the list of statistics (base items, added items, confidence, lift)
    if len(list(item[2][0][1])) >= 2:
        print("Rule: " + str(list(item[2][0][1])))
        # second index of the inner list
        print("Support: " + str(item[1]))
        sup_list.append(item[1])
        # third index of the list located at 0th of the third index of the inner list
        conf_list.append(item[2][0][2])
        print("Confidence: " + str(item[2][0][2]))
        # print("Lift: " + str(item[2][0][3]))
        print("=====================================")

Output:

Rule: ['i2', 'i1']
Support: 0.6
Confidence: 0.6
=====================================
Rule: ['i3', 'i1']
Support: 0.6
Confidence: 0.6
=====================================
Rule: ['i4', 'i1']
Support: 0.4
Confidence: 0.4
=====================================
Rule: ['i2', 'i3']
Support: 0.6
Confidence: 0.6
=====================================
Rule: ['i3', 'i4']
Support: 0.4
Confidence: 0.4
=====================================
Rule: ['i2', 'i3', 'i1']
Support: 0.4
Confidence: 0.4
=====================================
Rule: ['i3', 'i4', 'i1']
Support: 0.4
Confidence: 0.4


RECORD WORK

EP1. Generate association rules from a medical dataset.


EP2. Generate association rules from a grocery dataset.

Viva-Voce Questions:

1. Define frequent itemset.


2. Define support.
3. Define confidence.
4. Define Association mining.

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 10

E10. 14. Regression and Classification: Linear regression and logistic regression.

Objectives: This lab will develop students’ knowledge of

1. regression and classification models.
2. how to use linear regression and logistic regression.

Outcomes: Upon completion of this lab, students will be able to
1. apply regression and classification models.
2. apply linear regression and logistic regression models.

CONCEPT AT A GLANCE

Regression

Regression analysis is a statistical process for estimating the relationship between a dependent
variable (also called the criterion variable) and one or more independent variables (predictors).
It explains how the criterion changes in relation to changes in selected predictors: the model
estimates the conditional expectation of the dependent variable, that is, its average value for
given values of the independent variables. Three major uses of regression analysis are
determining the strength of predictors, forecasting an effect, and trend forecasting.

Types of Linear Regression

Linear regression is of the following two types −

• Simple Linear Regression


• Multiple Linear Regression

Simple Linear Regression (SLR)

It is the most basic version of linear regression which predicts a response using a single feature. The
assumption in SLR is that the two variables are linearly related.

Multiple Linear Regression (MLR)

It is the extension of simple linear regression that predicts a response using two or more features.
Mathematically we can explain it as follows

Consider a dataset with n observations, p features (independent variables) and one response y
(dependent variable). The regression line for p features can be calculated as follows:

h(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + ... + b_p x_{ip}

Here, h(x_i) is the predicted response value for the i-th observation and b_0, b_1, b_2, ..., b_p
are the regression coefficients.

A multiple linear regression model also includes the error in the data, known as the residual
error e_i, which changes the calculation as follows:

y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + ... + b_p x_{ip} + e_i

We can also write the above equation as follows:

y_i = h(x_i) + e_i or e_i = y_i − h(x_i)
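The relation between the observed response, the predicted response and the residual error can be checked numerically. The sketch below fits a multiple linear regression with p = 2 features by ordinary least squares using NumPy. The data values are made up, and np.linalg.lstsq is used here as one possible way to obtain the coefficients; it is not necessarily how a given library fits the model.

```python
import numpy as np

# n = 5 observations with p = 2 features each (made-up values)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Observed responses: roughly 1 + 2*x1 + 3*x2 plus a small error
y = np.array([9.1, 7.9, 19.05, 17.95, 26.0])

# Add a column of ones so that b[0] plays the role of the intercept b_0
A = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate of the coefficients b_0, b_1, b_2
b, *_ = np.linalg.lstsq(A, y, rcond=None)

h = A @ b   # predicted responses h(x_i)
e = y - h   # residual errors e_i = y_i - h(x_i)

print("Coefficients:", np.round(b, 3))
print("Residuals:", np.round(e, 3))
```

By construction, adding the residuals back to the predictions reproduces the observed responses exactly, which is the relation y_i = h(x_i) + e_i.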

Introduction to Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of
a target variable. The nature of target or dependent variable is dichotomous, which means there
would be only two possible classes.

In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest
ML algorithms that can be used for various classification problems such as spam detection, Diabetes
prediction, cancer detection etc.
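For a single feature, logistic regression models P(Y=1) by passing a linear combination of the input through the sigmoid function. The sketch below illustrates this with hypothetical coefficients b0 = -4.0 and b1 = 1.0. In practice the coefficients are learned from data, and the 0.5 threshold used to turn the probability into a class label is a common convention, assumed here.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, b0, b1):
    """P(Y=1) for one feature value x under a logistic model."""
    return sigmoid(b0 + b1 * x)

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -4.0, 1.0

for x in [0, 2, 4, 6, 8]:
    p = predict_proba(x, b0, b1)
    label = 1 if p >= 0.5 else 0   # assumed decision threshold of 0.5
    print(f"x={x}  P(Y=1)={p:.3f}  predicted class={label}")
```

With these coefficients the probability is exactly 0.5 at x = 4, below 0.5 for smaller x and above 0.5 for larger x.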

Classification

There are two forms of data analysis that can be used for extracting models describing important
classes or to predict future data trends. These two forms are as follows

• Classification
• Prediction

Classification models predict categorical class labels; and prediction models predict continuous
valued functions.

What is classification?

The following are examples of cases where the data analysis task is classification:

• A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze whether a customer with a given profile
will buy a new computer.

In both of the above examples, a model or classifier is constructed to predict the categorical labels.
These labels are risky or safe for loan application data and yes or no for marketing data.

Linear Regression:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

data = pd.read_csv("/content/Salary_Data.csv")

data.head()

Output:

YearsExperience Salary

0 1.1 39343.0

1 1.3 46205.0

2 1.5 37731.0

3 2.0 43525.0

4 2.2 39891.0

x = np.array(data[['YearsExperience']]) ## feature

y = np.array(data['Salary']) ## target

from sklearn.model_selection import train_test_split

xtrain,xtest,ytrain,ytest = train_test_split(x,y,train_size=0.8,random_state=9014)

### Build the model

from sklearn.linear_model import LinearRegression

model = LinearRegression()

### Train the model

model.fit(xtrain,ytrain)

Output (as printed by an older scikit-learn release; recent releases print LinearRegression()):

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Prediction

ypred = model.predict(xtest)

ypred

Output:

array([ 56182.55053157, 100294.69682063, 44919.8748833 , 62752.44465973,

122820.04811718, 115311.59768499])

ytest

Output:

array([ 54445., 101302., 43525., 63218., 122391., 116969.])

xtest

Output:

array([[ 3.2],

[ 7.9],

[ 2. ],

[ 3.9],

[10.3],

[ 9.5]])

### Calculate R2 Score

from sklearn.metrics import r2_score

score = r2_score(ytest,ypred)

score

Output:

0.99842716176972

m = model.coef_

c = model.intercept_

print(m,c)

Output:

[9385.56304023] 26148.74880284306
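The slope m and intercept c printed above fully describe the fitted line, so the predictions can be reproduced by hand with y = m*x + c. The sketch below copies the two printed values and evaluates the line for some of the experience values in xtest. The helper name predict_salary is an assumption made for illustration.

```python
# Values copied from the printed output of model.coef_ and model.intercept_
m = 9385.56304023
c = 26148.74880284306

def predict_salary(years_experience):
    """Evaluate the fitted regression line y = m*x + c for one input value."""
    return m * years_experience + c

for years in [3.2, 7.9, 2.0]:
    print(years, round(predict_salary(years), 2))
```

The results agree with the corresponding entries of ypred shown earlier (for example, about 56182.55 for 3.2 years of experience), which confirms that model.predict() simply evaluates this line.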

### Draw line of regression for training samples

plt.figure(figsize=(10,6))

plt.scatter(xtrain,ytrain,color="red",label="Actual Samples")

plt.scatter(xtrain,model.predict(xtrain),color="blue",label="Predicted Samples")

plt.plot(xtrain,model.predict(xtrain),color="yellow",label="Line of Regression")

plt.legend()

plt.show()

### Draw line of regression for testing samples

plt.figure(figsize=(10,6))

plt.scatter(xtest,ytest,color="red",label="Actual Samples")

plt.scatter(xtest,model.predict(xtest),color="blue",label="Predicted Samples")

plt.plot(xtest,model.predict(xtest),color="yellow",label="Line of Regression")

plt.legend()

plt.show()

Logistic Regression:

import pandas as pd

import numpy as np

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

data = sns.load_dataset('titanic')

### check for null

data.isnull().sum()

Output:

survived 0

pclass 0

sex 0

age 177

sibsp 0

parch 0

fare 0

embarked 2

class 0

who 0

adult_male 0

deck 688

embark_town 2

alive 0

alone 0

dtype: int64

mean_age = round(data['age'].mean(),2)

data['age'] = data['age'].fillna(mean_age)

data.isnull().sum()

Output:

survived 0

pclass 0

sex 0

age 0

sibsp 0

parch 0

fare 0

embarked 2

class 0

who 0

adult_male 0

deck 688

embark_town 2

alive 0

alone 0

dtype: int64

data = data.drop(["deck","embark_town"],axis=1)

data = data.dropna()

data.isnull().sum()

y = np.array(data['survived']) ## target

x = data[['pclass','sex','age','sibsp','parch','embarked']].copy() ## copy, so the encoding below does not modify a view of data

### Label Encoding

x['sex'] = x['sex'].map({"male":0,"female":1})

x['embarked'] = x['embarked'].map({"S":0,"C":1,"Q":2})

### Split the data into training and testing

xtrain,xtest,ytrain,ytest = train_test_split(x,y,train_size=0.80,random_state=3)

## Build the model

model = LogisticRegression()

## Train the model

model.fit(xtrain,ytrain)

### Prediction

ypred = model.predict(xtest)

ypred

Output:

array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,

0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,

1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,

1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,

0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,

1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,

0, 0])

ytest

Output:

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,

1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,

0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,

1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,

1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,

0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,

1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,

0, 0])

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(ytest,ypred)

cm

Output:

array([[96, 14],

[27, 41]])

cm.diagonal().sum()/cm.sum()

Output:

0.7696629213483146

from sklearn.metrics import accuracy_score

a = accuracy_score(ytest,ypred)

a

Output:

0.7696629213483146

RECORD WORK

EP1. Predict salary of an employee based on years of experience using linear regression.
EP2. Classify the loan applicants using logistic regression.

Viva-Voce Questions:

1. Differentiate classification and prediction.


2. List the metrics used in classification.
3. Define linear regression model.
4. Differentiate simple linear regression and multiple linear regression.

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 11

E11. 15. Implement Decision tree, random forest, k-Nearest Neighbor algorithms.

Objectives: This lab will develop students’ knowledge of

1. the use of decision tree and random forest methods.
2. how to use the K-Nearest Neighbor algorithm in Python.

Outcomes: Upon completion of this lab, students will be able to
1. apply decision tree and random forest methods.
2. apply the K-Nearest Neighbor algorithm in Python.

CONCEPT AT A GLANCE

Decision tree Algorithm

In general, decision tree analysis is a predictive modeling tool that can be applied across many areas.
Decision trees can be constructed by an algorithmic approach that splits the dataset in different
ways based on different conditions. Decision trees are among the most powerful algorithms in
the category of supervised learning algorithms.

Random Forest Algorithm

Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and predicts the final output based on the majority vote of those predictions. A greater number
of trees in the forest generally leads to higher accuracy and reduces the risk of overfitting.
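The majority-vote step can be sketched in a few lines of plain Python. The example below does not train any trees. It assumes that five hypothetical trees have already produced a class label (0 or 1) for each of three samples and shows only how those votes are combined into the final output.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by most of the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Made-up predictions: one inner list per sample, one entry per tree
votes_per_sample = [
    [1, 0, 1, 1, 0],   # three of five trees vote for class 1
    [0, 0, 1, 0, 0],   # four of five trees vote for class 0
    [1, 1, 1, 0, 1],   # four of five trees vote for class 1
]

final_predictions = [majority_vote(votes) for votes in votes_per_sample]
print(final_predictions)
```

An odd number of trees is used here so that a two-class vote cannot end in a tie.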

K-Nearest Neighbor Algorithm

K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for
both classification as well as regression predictive problems. However, it is mainly used for
classification predictive problems in industry. The following two properties would define KNN well.

Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized
training phase; it stores all the training data and uses it at classification time.

Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it
doesn’t assume anything about the underlying data.
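Both properties are visible in a from-scratch sketch of KNN classification: there is no training step, and the prediction is made by looking up the k stored points closest to the query. The points, labels, the choice k = 3 and the use of Euclidean distance are assumptions made for illustration. The function name knn_predict is hypothetical and is not part of scikit-learn.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by a majority vote among its k nearest training points."""
    # Distance from the query to every stored training point
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Labels of the k closest points
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up two-dimensional training data with two classes
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.8, 1.6)))
print(knn_predict(train_points, train_labels, (6.2, 6.4)))
```

The first query lies among the class A points and the second among the class B points, so the two calls return "A" and "B" respectively.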

DecisionTree:

import math

!pip install xlsxwriter

import xlsxwriter
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve,auc

## df_final is assumed to be an already prepared DataFrame with a binary 'Defective' column

## Create the workbook that stores the metrics
book = xlsxwriter.Workbook("dt.xlsx")
sheet = book.add_worksheet()
r=0
sheet.write(r, 0, 'DecisionTree')
r=r+1
sheet.write(r, 0, 'Accuracy')
sheet.write(r, 1, 'Precision')
sheet.write(r, 2, 'Recall')
sheet.write(r, 3, 'F-measure')
sheet.write(r, 4,'Specificity')
sheet.write(r, 5,'GeometricMean')
sheet.write(r, 6,'AUC')

## Split the data, train the classifier and predict
x = df_final.drop(['Defective'],axis=1)
y = df_final.Defective
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
clf = DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
predictions = clf.predict(x_test)

## Confusion matrix: rows are actual classes, columns are predicted classes
c=confusion_matrix(y_test, predictions)
print("confusion_matrix:")
print(c)

mo=c[1][0] + c[1][1] + c[0][0] + c[0][1] ## all samples
mo1=c[0][1] + c[1][1] ## predicted positives
mo2=c[1][0] + c[1][1] ## actual positives
if mo!=0:
    acc = round(((c[1][1] + c[0][0]) / mo),2)
else:
    acc=0
if mo1!=0:
    pre = round((c[1][1] / mo1),2)
else:
    pre =0
if mo2!=0:
    rec = round((c[1][1] / mo2),2)
else:
    rec=0
print("Accuracy=",acc)
print("Precision=",pre)
print("Recall=",rec)

mo3=pre+rec
if mo3!=0:
    fm = round(((2 * pre * rec) / mo3),2)
else:
    fm=0
print("F-measure=",fm)

mo4=c[0][0]+c[0][1] ## actual negatives
if mo4!=0:
    sp = round((c[0][0] / mo4),2)
else:
    sp=0
print("Specificity=",sp)

gm = round((math.sqrt(rec * sp)),2)
print("Geometric Mean=",gm)

tp = c[1][1]
fn = c[1][0]
fp = c[0][1]
tn = c[0][0]

dt_fpr,dt_tpr,threshold = roc_curve(y_test,predictions)
auc_value = round((auc(dt_fpr,dt_tpr)),2) ## separate name, so the auc() function is not overwritten
print("AUC=",auc_value)

## Write the metrics to the next row of the sheet
col=0
r=r+1
sheet.write(r, col, acc)
sheet.write(r, col + 1, pre)
sheet.write(r, col + 2, rec)
sheet.write(r, col + 3, fm)
sheet.write(r, col + 4, sp)
sheet.write(r, col + 5, gm)
sheet.write(r, col + 6, auc_value)
book.close()

Output:

confusion_matrix:

[[252 25]

[ 26 262]]

Accuracy= 0.91

Precision= 0.91

Recall= 0.91

F-measure= 0.91

Specificity= 0.91

Geometric Mean= 0.91

AUC= 0.91
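The reported metrics can be verified directly from the confusion matrix printed above. The sketch below copies the four cell values and applies the standard definitions. It assumes the scikit-learn layout, in which rows are the actual classes and columns the predicted classes, with class 1 treated as the positive class.

```python
import math

# Cells copied from the printed confusion matrix [[252 25] [26 262]]
tn, fp = 252, 25   # actual class 0: predicted 0, predicted 1
fn, tp = 26, 262   # actual class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
f_measure = 2 * precision * recall / (precision + recall)
g_mean = math.sqrt(recall * specificity)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F-measure", f_measure),
                    ("Specificity", specificity), ("Geometric Mean", g_mean)]:
    print(f"{name}= {round(value, 2)}")
```

Each value rounds to 0.91, in agreement with the output listed above.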

RandomForest:

import math

!pip install xlsxwriter

import xlsxwriter

import pandas as pd

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix

from sklearn.metrics import roc_curve,auc

book = xlsxwriter.Workbook("rf.xlsx")

sheet = book.add_worksheet()

r=0

sheet.write(r, 0, 'RandomForest')

r=r+1

sheet.write(r, 0, 'Accuracy')

sheet.write(r, 1, 'Precision')

sheet.write(r, 2, 'Recall')

sheet.write(r, 3, 'F-measure')

sheet.write(r, 4,'Specificty')

sheet.write(r, 5,'GeometricMean')

sheet.write(r, 6,'AUC')

x = df_final.drop(['Defective'],axis=1)

y = df_final.Defective

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)

rfc = RandomForestClassifier(n_estimators=100)

rfc.fit(x_train, y_train)

predictions = rfc.predict(x_test)

c=confusion_matrix(y_test, predictions)

print("confusion_matrix:")

print(confusion_matrix(y_test,predictions))

mo=c[1][0] + c[1][1] + c[0][0] + c[0][1]

mo1=c[0][1] + c[1][1]

Page 180 of 201


# Denominator for recall: actual positives
mo2 = c[1][0] + c[1][1]
# Each ratio is guarded against a zero denominator
if mo != 0:
    acc = round((c[1][1] + c[0][0]) / mo, 2)
else:
    acc = 0
if mo1 != 0:
    pre = round(c[1][1] / mo1, 2)
else:
    pre = 0
if mo2 != 0:
    rec = round(c[1][1] / mo2, 2)
else:
    rec = 0
print("Accuracy=", acc)
print("Precision=", pre)
print("Recall=", rec)
mo3 = pre + rec
if mo3 != 0:
    fm = round((2 * pre * rec) / mo3, 2)
else:
    fm = 0
print("F-measure=", fm)
# Specificity = true negatives / actual negatives
mo4 = c[0][0] + c[0][1]
if mo4 != 0:
    sp = round(c[0][0] / mo4, 2)
else:
    sp = 0
print("Specificity=", sp)
gm = round(math.sqrt(rec * sp), 2)
print("Geometric Mean=", gm)
tp = c[1][1]

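fn = c[1][0]
fp = c[0][1]
tn = c[0][0]
# roc_curve receives hard class labels here, so the curve has a single operating point
rf_fpr, rf_tpr, thresholds = roc_curve(y_test, predictions)
# Store the score under a new name so the imported auc() function is not overwritten
auc_score = round(auc(rf_fpr, rf_tpr), 2)
print("AUC=", auc_score)
# Write the metric values into the row below the headers
col = 0
r = r + 1
sheet.write(r, col, acc)
sheet.write(r, col + 1, pre)
sheet.write(r, col + 2, rec)
sheet.write(r, col + 3, fm)
sheet.write(r, col + 4, sp)
sheet.write(r, col + 5, gm)
sheet.write(r, col + 6, auc_score)
book.close()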

Output:

confusion_matrix:

[[271 14]

[ 35 245]]

Accuracy= 0.91

Precision= 0.95

Recall= 0.88

F-measure= 0.91

Specificity= 0.95

Geometric Mean= 0.91

AUC= 0.91
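
Because `roc_curve` is given hard class predictions here rather than scores, the ROC curve has only three points: (0, 0), (FPR, TPR) and (1, 1). The area under it then reduces to (TPR + 1 - FPR) / 2, which is the mean of recall and specificity. The sketch below checks this against the confusion matrix printed above; the function name `auc_from_hard_labels` is hypothetical and used only for illustration.

```python
def auc_from_hard_labels(c):
    """Area under the three-point ROC curve implied by a 2x2 confusion matrix.

    Hypothetical helper; assumes the layout [[tn, fp], [fn, tp]].
    """
    tn, fp = c[0]
    fn, tp = c[1]
    tpr = tp / (tp + fn)          # recall
    fpr = fp / (fp + tn)          # 1 - specificity
    xs = [0.0, fpr, 1.0]
    ys = [0.0, tpr, 1.0]
    # Trapezoidal rule over the two line segments of the curve
    area = sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2 for i in range(2))
    return area, tpr, fpr

# Confusion matrix copied from the Random Forest output above
area, tpr, fpr = auc_from_hard_labels([[271, 14], [35, 245]])
print("AUC=", round(area, 2))
print("Mean of recall and specificity=", round((tpr + (1 - fpr)) / 2, 2))
```

Both lines print 0.91, matching the reported AUC. To obtain a ROC curve with more than one operating point, class probabilities such as `rfc.predict_proba(x_test)[:, 1]` can be passed to `roc_curve` instead of the predicted labels.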

K-Nearest Neighbor(KNN)

import math
!pip install xlsxwriter
import xlsxwriter
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
# Workbook that stores the evaluation metrics
book = xlsxwriter.Workbook("knn.xlsx")
sheet = book.add_worksheet()
r = 0
sheet.write(r, 0, 'K-Nearest Neighbor(KNN)')
r = r + 1
sheet.write(r, 0, 'Accuracy')
sheet.write(r, 1, 'Precision')
sheet.write(r, 2, 'Recall')
sheet.write(r, 3, 'F-measure')
sheet.write(r, 4, 'Specificity')
sheet.write(r, 5, 'GeometricMean')
sheet.write(r, 6, 'AUC')
# df_final is assumed to be the preprocessed DataFrame built earlier in this task;
# 'Defective' is the class label column
x = df_final.drop(['Defective'], axis=1)
y = df_final.Defective
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
predictions = knn.predict(x_test)
# Rows are actual classes, columns are predicted classes
c = confusion_matrix(y_test, predictions)
print("confusion_matrix:")
print(c)
# Denominators: all samples (mo) and predicted positives (mo1)
mo = c[1][0] + c[1][1] + c[0][0] + c[0][1]
mo1 = c[0][1] + c[1][1]

# Denominator for recall: actual positives
mo2 = c[1][0] + c[1][1]
# Each ratio is guarded against a zero denominator
if mo != 0:
    acc = round((c[1][1] + c[0][0]) / mo, 2)
else:
    acc = 0
if mo1 != 0:
    pre = round(c[1][1] / mo1, 2)
else:
    pre = 0
if mo2 != 0:
    rec = round(c[1][1] / mo2, 2)
else:
    rec = 0
print("Accuracy=", acc)
print("Precision=", pre)
print("Recall=", rec)
mo3 = pre + rec
if mo3 != 0:
    fm = round((2 * pre * rec) / mo3, 2)
else:
    fm = 0
print("F-measure=", fm)
# Specificity = true negatives / actual negatives
mo4 = c[0][0] + c[0][1]
if mo4 != 0:
    sp = round(c[0][0] / mo4, 2)
else:
    sp = 0
print("Specificity=", sp)
gm = round(math.sqrt(rec * sp), 2)
print("Geometric Mean=", gm)
tp = c[1][1]

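fn = c[1][0]
fp = c[0][1]
tn = c[0][0]
# roc_curve receives hard class labels here, so the curve has a single operating point
knn_fpr, knn_tpr, threshold = roc_curve(y_test, predictions)
# Store the score under a new name so the imported auc() function is not overwritten
auc_score = round(auc(knn_fpr, knn_tpr), 2)
print("AUC=", auc_score)
# Write the metric values into the row below the headers
col = 0
r = r + 1
sheet.write(r, col, acc)
sheet.write(r, col + 1, pre)
sheet.write(r, col + 2, rec)
sheet.write(r, col + 3, fm)
sheet.write(r, col + 4, sp)
sheet.write(r, col + 5, gm)
sheet.write(r, col + 6, auc_score)
book.close()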

Output:

confusion_matrix:

[[277 14]

[ 36 238]]

Accuracy= 0.91

Precision= 0.94

Recall= 0.87

F-measure= 0.9

Specificity= 0.95

Geometric Mean= 0.91

AUC= 0.91

RECORD WORK

In-Lab Exercise Problems (EPs)

1. In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.

Prior preparation on the EPs will help students complete them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.

2. Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper Annotations and Indentation)
b. maintainable
c. extendable
d. testable and
e. robust

RECORD WORK

EP1. Using weather dataset, forecast the “Play” using decision tree algorithm.
EP2. Using customer transaction history, predict the customer decision on product purchase.

Viva-Voce Questions:

1. How does KNN differ from the Naïve Bayes algorithm?
2. Differentiate between the decision tree and random forest algorithms.
3. Define a decision tree.
4. Give the advantages of the random forest algorithm.

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.

Laboratory Task – 12

E12. 16. Implement K-means and hierarchical clustering algorithms.

Objectives: This lab will develop students’ knowledge of

1. how to use the K-means algorithm.
2. how to use hierarchical clustering algorithms.
Outcomes: Upon completion of this lab, students will be able to
1. apply k-means algorithm.
2. apply hierarchical clustering algorithms.

CONCEPT AT A GLANCE

K-Means Algorithm

The K-means clustering algorithm computes centroids and iterates until the centroids stop changing,
which gives a locally optimal solution. It assumes that the number of clusters is already known.
It is also called a flat clustering algorithm.
The number of clusters to be formed from the data is represented by ‘K’ in K-means.

In this algorithm, data points are assigned to clusters so that the sum of the squared
distances between the data points and their cluster centroid is minimized. Less
variation within a cluster means that the data points in that cluster are more similar to one another.
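
The assign-and-update loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function name `kmeans`, the choice of the first K points as initial centroids, and the sample points are all made up for the example.

```python
import math

def kmeans(points, k, max_iter=100):
    """Basic K-means (Lloyd's algorithm) on a list of equal-length tuples.

    Assumption: the first k points are used as initial centroids so that
    the result is reproducible.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

# Made-up sample data with two well-separated groups
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

For this data the loop converges after the second pass, with the points alternating between cluster 0 and cluster 1.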

Hierarchical Clustering

Hierarchical clustering is another unsupervised learning algorithm that groups together
unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the
following two categories −

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is
treated as a single cluster, and pairs of clusters are then successively merged or agglomerated
(bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster and the process of clustering involves dividing (top-down
approach) the one big cluster into various small clusters.
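
The bottom-up (agglomerative) approach can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `agglomerative` and the sample points are hypothetical, and single linkage (the distance between two clusters is the distance between their closest members) is assumed for simplicity.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up clustering with single linkage; returns lists of point indices."""
    # Every point starts as its own cluster
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair into one cluster
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Made-up sample data: two tight pairs and one isolated point
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 5.3), (9.0, 1.0)]
result = agglomerative(points, 3)
print(result)
```

In practice a library routine would normally be used instead; scikit-learn provides `AgglomerativeClustering` in `sklearn.cluster`, and SciPy provides `scipy.cluster.hierarchy.dendrogram` for drawing the dendrogram.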

K-Means algorithm:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

df = pd.read_csv('/content/iris.csv')

df.head(10)

x = df.iloc[:, [0,1,2,3]].values

kmeans5 = KMeans(n_clusters=5)

y_kmeans5 = kmeans5.fit_predict(x)

print(y_kmeans5)

kmeans5.cluster_centers_

Output:

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 3 0 0 0 3 0 3 3 0 3 0 3 0 0 3 0 3 0 3 0 0
 0 0 0 0 0 3 3 3 3 0 3 0 0 0 3 3 3 0 3 3 3 3 3 0 3 3 4 0 2 4 4 2 3 2 4 2 4
 4 4 0 4 4 4 2 2 0 4 0 2 0 4 2 0 0 4 2 2 2 4 0 0 2 4 4 0 4 4 4 0 4 4 4 0 4
 4 0]
array([[6.20769231, 2.85384615, 4.74615385, 1.56410256],
       [5.006     , 3.418     , 1.464     , 0.244     ],
       [7.475     , 3.125     , 6.3       , 2.05      ],
       [5.508     , 2.6       , 3.908     , 1.204     ],
       [6.52916667, 3.05833333, 5.50833333, 2.1625    ]])

# Elbow method: record the within-cluster sum of squares (inertia) for K = 1 to 10
Error = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i).fit(x)
    Error.append(kmeans.inertia_)
plt.plot(range(1, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()
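
The `inertia_` value plotted by the elbow method is the sum of squared distances from each point to the centroid of its cluster. The sketch below computes that quantity by hand for a tiny made-up data set; the function name `inertia`, the sample points, the labels and the centroids are illustrative assumptions, not values from the iris data.

```python
def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    total = 0.0
    for p, lab in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
    return total

# Made-up data: two pairs of points
points = [(1.0, 2.0), (3.0, 2.0), (10.0, 10.0), (12.0, 10.0)]

# K = 2: each pair has its own centroid
error_k2 = inertia(points, [0, 0, 1, 1], [(2.0, 2.0), (11.0, 10.0)])

# K = 1: a single centroid at the mean of all four points
error_k1 = inertia(points, [0, 0, 0, 0], [(6.5, 6.0)])

print("K=1:", error_k1)
print("K=2:", error_k2)
```

The error falls sharply from K = 1 to K = 2 because the data really contains two groups. The "elbow" in the plot marks the value of K beyond which adding more clusters brings only small further reductions.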

kmeans3 = KMeans(n_clusters=3)

y_kmeans3 = kmeans3.fit_predict(x)

print(y_kmeans3)

kmeans3.cluster_centers_

Output:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
 2 1]
array([[5.006     , 3.418     , 1.464     , 0.244     ],
       [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [6.85      , 3.07368421, 5.74210526, 2.07105263]])

RECORD WORK

In-Lab Exercise Problems (EPs)

1. In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
a. work on the At-Home SPs and execute those SPs with additional test cases
before attending the lab.
b. complete the In-Lab EPs in the lab session.

Prior preparation on the EPs will help students complete them within the
stipulated lab session. It is always good practice to attend the lab with prior
preparation.

2. Students should focus on developing code for the In-Lab EPs that is
a. readable (with proper Annotations and Indentation)
b. maintainable
c. extendable
d. testable and
e. robust

RECORD WORK

EP1. Cluster the employees based on their salaries using k-means algorithm.

Viva-Voce Questions:

1. Define cluster.
2. List various partitioning algorithms.
3. Give the difference between agglomerative and divisive hierarchical clustering methods.
4. What is the K-Means algorithm?
5. Name the Python library module used for the K-Means method.

Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.
