Data Mining Python Lab

U18IT608
DATA MINING USING PYTHON

LABORATORY
A Learner Centric
LABORATORY MANUAL & RECORD BOOK
First Edition, December 2023
Class: B.Tech.(IT) VI-Semester

[For the Students admitted under URR18 Regulation]
Academic Year: ………………………… Semester: ……………
Student Details
Student Name:
Roll Number:
Semester/Branch/Section:
Laboratory Course Faculty:
1.
2.
3.
4.
Course Offered by
DEPARTMENT OF INFORMATION TECHNOLOGY
KAKATIYA INSTITUTE OF TECHNOLOGY & SCIENCE, WARANGAL
(An Autonomous Institute under Kakatiya University, Warangal)
DEPARTMENT OF INFORMATION TECHNOLOGY
CERTIFICATE
This is to certify that it is a bonafide record of practical work done by Mr. / Kum.
………………………………………………………………………………………
bearing the Roll No. ………………………… of ………………… Class
………………… branch in the Data Mining using Python Laboratory
during the academic year ………………… under our supervision.
Course Faculty Head of the Department

(Name & Signature)
Date: ……………………
Date: ……………………
Examiner
(Signature with Date)
PREFACE
Dear students,
This lab manual is designed and developed as “A Learner Centric Laboratory Manual and
Record Book (LMRB)” for Outcome Based Education (OBE).
a) A well-defined learner centric continuous internal evaluation (CIE) will be followed in
this lab. It is expected to make students active, skilled learners who acquire several
competencies related to the laboratory programming tasks.
b) Hence, students are advised to love learning, follow the stipulated CIE and become active
learners.
c) Active learning will ensure students acquire the 21st century skills and competencies to be
successful in a job.
1. Learner Centric Lab Manual & Record Book (LMRB):

1.1. This Learner centric LMRB contains relevant information for all programming tasks
2. Videos on programming Tasks:

2.1. Lab course faculty will make videos on the essential background needed for the programs and
upload them to the CourseWeb portal well in advance
3. Requisite prior knowledge (K):

3.1. Student should come prepared with requisite prior knowledge on the programs to be developed
3.2. To gain requisite prior knowledge on the programs to be completed, the student should
3.2.1. watch the videos on lab experiments, which are posted in CourseWeb portal by your lab
course faculty
3.2.2. read the information given in this Learner centric LMRB
3.2.3. execute the At-Home sample programs related to the corresponding lab task, as given in this
learner centric LMRB
4. During lab session:

4.1. Before start of Programming Task
KNOWLEDGE (K): 10 Marks
1. Before the start of the Programming Task, the student will be tested on knowledge (K), i.e.,
whether the student has the requisite prior knowledge on the programs to be developed.
2. The lab faculty will
a. Check whether the student executed the At-Home sample programs with
Additional Test cases?
b. Ask around 4-5 questions to test the student’s pre-requisite knowledge(K)
3. This component of 10 marks will be awarded based on student’s prior preparation on the
programming task to be completed. To score well, in this component, students are
expected to,
a. execute the At-Home SPs as HOMEWORK with Additional Test Cases
b. have complete Knowledge on the In- Lab EPs to be executed in the lab
4. The student will be permitted to do the programming task after the pre-requisite
Knowledge (K) test
5. This Knowledge (K) Test is aimed at imparting / enhancing the following skills for the
students
a. Communication (oral) skills
b. Requisite prior knowledge
4.2. During Programming Task:

PARTICIPATION (P): 10 marks
a) Students should complete the given programming task(s), i.e., the In-lab EPs, in the first
2 periods of the lab session.
b) Student will be observed for taking part in developing PDs for In-lab EPs, from problem
analysis to debugging & testing
c) Hence, while doing programming task, every student should be proactive to earn 10
marks for participation
d) This PARTICIPATION (P) section is aimed at imparting /enhancing the following skills
for the students
a. Problem analysis (Logic Development)
b. Algorithm/pseudocode
c. Flowchart
d. Coding
e. Testing & Debugging
f. Test Cases
Note:
a) Student should complete the Programming Task in first 2 periods of the lab session
b) And use the last 30 – 45 minutes of lab time to do the following tasks
a. Complete the record write up (W:10 marks) and
b. attend viva-voce (V:10 marks)
c) LAB RECORD WRITE UP is to be completed in the respective lab session itself. The lab
course faculty will complete the evaluation in the lab session itself.
d) Lab record WRITE UP should not be carried to home for completion.
4.3. After completion of Programming Task:

4.3.1. WRITE UP (W): 10 marks
a) After completion of the task(s), the student has to complete the write up for the record
in the lab session itself.
b) Write up related to problem analysis, flowchart, algorithm & code is normally common
to all students, and hence attracts no marks for these items
c) Evaluation under this Write Up section is purely based on how student practices all steps
of program development for execution of the program
d) To score well in this section, prior preparation & lateral thinking are needed.
e) For In-lab EPs, the student should focus on developing a code which should be readable
(with proper Annotations and Indentation), maintainable, extendable, testable and
robust.
f) Copying from other student’s code is not allowed and attracts award of ZERO
marks in this section
g) This WRITE UP (W) section is aimed at imparting / enhancing the following skills for
the students
a. Coding skills
b. Innovation & lateral thinking (ILT) skills
c. Research skills (inferring & predicting)
4.3.2. VIVA-VOCE (V): 10 marks

a) After completing the write up, the student should go for viva-voce
b) To score well in this section, student should
a. come prepared to programming task(s) with prior knowledge (K),
b. participate (P) actively during the development and execution of Programming tasks
c. focus on answering the sample questions given under viva-voce (V)
c) This VIVA-VOCE (V) section is aimed at imparting / enhancing the following skills for
the students
a. Communication (oral) skills
b. Coding skills
c. Innovation & lateral thinking (ILT) skills
d. Research skills (inferring & predicting)
NOTE: FACULTY WILL COMPLETE THE STUDENT EVALUATION IN THE LAB SESSION
ITSELF SO STUDENT SHOULD COMPLETE THE WRITE UP IN THE LAB SESSION
ITSELF.
The lab course faculty will assess and evaluate the student in four quadrants i.e. K, P, W & V
during the lab slot itself, and award the marks after conduction of viva-voce. This evaluation
gives scope for the students to improve, in the upcoming weeks of programming tasks, by
demonstrating relevant skills and the competencies in K, P, W & V.
Bottom Line:
a) A well-defined learner centric continuous internal evaluation (CIE) will be followed in this
lab. It is expected to make students active, skilled learners who acquire several
competencies related to the programming tasks
b) Hence, students are advised to love learning, follow the stipulated CIE and become active
learners
c) Active learning will ensure students acquire the 21st century skills and competencies to be
successful in a job
INDEX
Institute vision and mission
Department vision and mission
Program Educational Objectives (PEOs)
Program Outcomes (POs)
Program Specific Outcomes (PSOs)
Instructions to the students
Rubrics for Continuous Internal Evaluation (CIE)
Make-up laboratory sessions
Laboratory programs Calendar
List of programs to be performed

INSTITUTE VISION & MISSION
Vision of Institute:
• To make our students technologically superior and ethically strong by providing quality
education with the help of our dedicated faculty and staff and thus improve the quality of
human life.
Mission of the Institute:

• To provide latest technical knowledge, analytical and practical skills, managerial competence
and interactive abilities to students, so that their employability is enhanced.
• To provide a strong human resource base for catering to the changing needs of the Industry
and Commerce.
• To inculcate a sense of brotherhood and national integrity.
DEPARTMENT VISION & MISSION
Vision of Department:
• To become a Centre of Excellence in the Information Technology discipline with effective
teaching and strong research environment that makes our students globally competitive with
strong ethical values and leadership abilities.
Mission of the Department:

• To impart technical knowledge to the students to turn out proficient and well-groomed
engineers.
• Motivate students to improve skills by attending training programs and internships that leads
to develop innovative projects in emerging technologies.
• To train our students for higher education, leadership in profession and adopt quality research.
Program - B.Tech. Information Technology
PROGRAM EDUCATIONAL OBJECTIVES (PEOs)
Within first few years after graduation, the Information Technology graduates will be able
to…
PEO1 To provide students with a sound foundation in Information Technology theory
and practices to analyze, formulate and solve engineering problems
PEO2 To develop an ability to design algorithms, implement programs and deploy
software.
PEO3 To develop Information Technology solutions with the changing needs of the
society for the career-related activities.
PROGRAM OUTCOMES (POs)
Program Outcomes: Engineering graduates will be able to
PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3 Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and
environmental considerations.
PO4 Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis
of the information to provide valid conclusions.
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
PO7 Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and
need for sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9 Individual and teamwork: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.
PROGRAM SPECIFIC OUTCOMES (PSOs)
PSO: The Information Technology Engineering graduates will be able to
PSO1 Apply analytical and experimental problem-solving skills in the Information
Technology discipline.
PSO2 Use fundamental knowledge to investigate new and emerging technologies
leading to innovations in the field of Information Technology.
PSO3 Begin immediate professional practice as an Information Technology Engineer.
INSTRUCTIONS TO THE STUDENTS
1. This Learner Centric LABORATORY MANUAL & RECORD BOOK (LMRB) is essential for the
student and must be brought to every laboratory session.
2. This learner centric LMRB consists of At-Home Sample Programs (SPs) and In-lab Exercise
Problems (EPs)
a) At-Home Sample Programs (SPs): At-Home Sample Programs (SPs) are the HOMEWORK
programs to be completed, before attending the lab. You should execute these Sample Programs
(SPs) with the given sample test cases and check for the results. In addition, as a proof of
completion of the HOMEWORK, the student should execute the SPs, with other set of test cases
and record the answers in the space provided under Additional Test Cases
(i). You should design your own additional Test Cases and execute these SPs.
(ii). You should design the test cases, which challenge the robustness of the code. The
challenging test cases have the capacity to halt the program execution.
(iii). You should bring those challenging test cases to the notice of course faculty, so that the
code of SPs can suitably be modified to make the code robust.
b) In-Lab Exercise Problem (EPs): In-Lab Exercise Problems (EPs) are the problems to be coded
during the lab session. Student should complete all EPs in the Lab slot itself with necessary
write up in the space provided, by following the required Program Development Steps (PDS).
Therefore, students should:
(i). work on the At-Home SPs and execute them with Additional Test Cases before attending the
lab.
(ii). complete the In-Lab EPs in the lab session.
Prior preparation on EPs will help the student a lot in completing the EPs within the
stipulated lab session. It is always a good practice to attend the Lab with prior preparation.
c) For EP1 - All steps of PDS are mandatory: Algorithm/Pseudocode development is mandatory
only for the FIRST Exercise Problem (EP-1) of every Laboratory task.
d) For OTHER EPs (EP2 onwards): To save time during lab session, the student can skip "writing
algorithm" for other EPs. The student should focus and work on Problem Analysis (Logic
development), Coding, Testing & Debugging and execution with Test Cases.
e) Student should focus on developing code for the In-Lab EPs, which is readable (with proper
Annotations and Indentation), maintainable, extendable, testable and robust.
3. All the EPs must be completed within the stipulated time.
4. Students should demonstrate the required skills during ORAL VIVA-VOCE. It is not mandatory
to write the answers to the viva-voce questions of every lab task, but it is good practice to
write the answers in the place provided after completion of the lab session.
5. An incomplete lab record will result in a reduction of marks.
Rubrics for Continuous Internal Evaluation (CIE)
Continuous Internal Evaluation (CIE) for Practical (Laboratory) Course shall carry 40% weightage.
CIE throughout the semester shall consist of the following for each experiment/lab.
CIE - Assessment for experiments done in every lab                            Weightage
Requisite Prior Coding Knowledge                           Knowledge (K)      10%
Participation as an individual while developing programs   Participation (P)  10%
Write-up for Record Work                                   Writing (W)        10%
Viva-voce (oral)                                           Viva-voce (V)      10%
Every laboratory session is evaluated for a total of 40 marks. The details have been listed below.
A. Before start of Programming Task

1. Requisite Prior Coding knowledge (K): 10 Marks
Student should come prepared to the lab session and is expected to answer the following, prior to the
start of the programming task:
i. Whether student worked on the given At-Home sample programs and executed with
Additional Test Cases - 5 marks
ii. A total of 3-5 questions shall be asked (for 5 marks) on whether student gained requisite prior
knowledge on the At-Home SPs and the In-Lab EPs to be completed in the lab session
These Five (05) marks shall be awarded based on student's performance, as below:
% of questions answered satisfactorily Marks awarded
80-100 : 5
60-80 : 3-4
30-60 : 2
0-30 : 1
Note: Faculty will check whether the student executed the At-Home SPs with additional Test Cases
and affix signature.
B. During Programming Task

2. Participation (P): 10 Marks
Once the student is allowed to develop the program for the In-Lab EPs, marks will be awarded based on
his/her participation as an individual while developing the programs by following the PDS.
Marks shall be awarded as below:

After the completion of programming task(s)… : Marks Awarded
Completed the EPs effectively without the assistance of faculty and answered all the
questions related to the programming tasks : 10
Completed the program effectively with partial assistance of faculty but able to answer
the questions related to programming tasks : 7-9
Completed the EPs only with full assistance of faculty but unable to answer the
questions on programming tasks : 3-6
Unable to complete the EPs even with assistance of the faculty : 0-2
C. Activities to be completed by student after completion of programming tasks:

After completion of programming tasks, the student has to complete the write up and attend the viva-
voce.
3. Write-up(W): 10 Marks
The student should complete the write-up, related to the program conducted, in the manual itself, in the
designated space for Record Work. The write-up must be on the following:
• Problem Analysis (Logic Development)
• Program execution with Test Cases
Marks shall be awarded as below:

After the completion of experiment… : Marks awarded
Completed the write-up in the laboratory scheduled time with good Logic development
and programs are executed with good test cases : 10
Completed the write-up in the laboratory scheduled time with average logic
development but executed programs with good test cases : 7-9
Completed the write-up in the laboratory scheduled time with average Logic
development and executed programs with normal/average test cases : 3-6
Write-up not completed : 0-2
4. Viva-voce(V): 10 Marks
After completing the write-up, the student should attend viva-voce to answer the following:
Interpretation of output: Viva-voce should not be limited to only the sample questions listed in VIVA-
VOCE questions at the end of program, but should go beyond to test the student's involvement in the
program development and also the technical competency.
(i). What did you learn from these programs based on objectives?
(ii). How will you apply knowledge gained, by performing these programs, in future?
Student should be asked to comment, on the following, specific to SPs and EPs:
(iii). Alternative Approach: Can you propose any "alternative logic/solution" to make the code
more effective? (Specific to specified EPs)
(iv). Maintainability of the code: Do you think that the code written by you is maintainable? Justify.
(v). Testability of the code: Do you think that your programs are testable? Justify.
(vi). Extensibility of the code: What are your ideas on extending the existing code with
additional features?
(vii). Readability of the code: Do you think that the code written by you is readable (easy to follow,
easy to understand)? Justify.
(viii). Robustness of the code: Is your code robust? Justify.
(ix). Any other ideas related to the specific SPs/EPs.
Marks will be awarded based on student's performance, as below:
Viva-Voce : Marks awarded
Reasonable conclusions drawn with good interpretation of results and answered 80-
100% of the viva-voce questions perfectly : 10
Reasonable conclusions drawn but answered 50-80% of the viva-voce questions : 7-9
Poor conclusions and interpretation of results with only 30-50% of viva-voce questions
answered : 3-6
Conclusions without interpretation of results and answered less than 30% of viva-voce
questions posed : 0-2
(Faculty I/c, Data Mining using Python Laboratory)
MAKE-UP LAB SESSIONS
1. Missing lab sessions due to holidays or unforeseen circumstances / disturbances will cause a
big loss to student learning.
2. To compensate for this loss, lab course faculty has to plan and conduct additional lab sessions,
called Make-up Lab Sessions, beyond working hours of the institute (or) on Saturdays /
Sundays, by giving prior information to students.
3. The lab course faculty has to ensure that Make-up Lab Sessions are arranged in the following
cases
i. to compensate for the lab sessions to be lost due to holidays
ii. to compensate for the lab sessions to be lost due to unforeseen circumstances
4. The dates for Make-up lab sessions for case (i), i.e., for the sessions which are expected to
be lost due to holidays, are to be announced at the very beginning of the semester and printed
in the Lab Programs Calendar.
5. The dates for Make-up lab sessions for case (ii), i.e., for the sessions lost due to
unforeseen circumstances, are to be announced, conducted and recorded as and when the
lab sessions get disturbed.
IMPORTANT NOTE:
a) Completing all stipulated programs is mandatory for the students to appear for
Laboratory End Semester Examination (ESE).
b) It is student's responsibility to complete all programs
c) If any student is absent for any laboratory session due to valid/genuine reasons, he/she
must complete the program within a week time by seeking permission from the lab course
faculty.
d) Upon completion of the programs of lab sessions which were missed due to valid/genuine
reasons, student will be evaluated for only 50% of the maximum marks of the program and
the corresponding attendance will not be counted.
e) Students are allowed to utilize the laboratory sessions beyond the working hours.
PDS
The students should follow the following steps known as "Program Development Steps" (PDS) to
develop and execute a given programming task.
1. Problem analysis (Logic Development)
2. Algorithm/Pseudo code
3. Flowchart
4. Coding
5. Testing & Debugging
6. Program execution with Test Cases
Program Development Steps (PDS)

Note: PDS should be followed for each and every programming task.
1. Problem Analysis (Logic Development): The problem given is analyzed for understanding and
selecting the steps to solve the problem. Under this, we do LOGIC DEVELOPMENT and write
the required FORMULAS to solve the problems. Also, we have to clearly identify the input(s) and
output(s).
2. Algorithm/Pseudocode development: The general description of the solution for the given
problem, called algorithm/pseudocode, is to be developed.
An algorithm is a formula or set of steps for solving a particular problem. To be an algorithm,
the set of rules must be unambiguous and have a clear stopping point. Algorithms can be
expressed in any language, from natural languages like English to programming languages. In
short, an algorithm can be viewed as a set of programming-language-independent statements.
3. Flowchart: For the above algorithm/pseudocode a flowchart is to be drawn.
4. Coding: Write the programming instructions using selected programming language to implement
the developed algorithm/pseudocode.
5. Testing and debugging: The program is to be tested for syntax and other errors.
6. Program Execution with test cases: Executing the programs with different types of inputs (called
test cases) and analyzing the output.
a. Test Cases:
i. Execute the programs with all possible test cases.
ii. Test cases should include all possible inputs which challenge the output of the
program you have written.
For Example: You are asked to write C-Code to find Factorial of a given integer. After testing &
debugging, during the execution of the program, you should imagine all the possible inputs. An
example is shown below.
Test case (i): 'Input any positive number' to find its factorial.
Test case (ii): 'Input any negative number' to find its factorial.
Test case (iii): 'Zero' to find its factorial.
So, you have to ensure that your code will give appropriate output for above possible test cases.
For Test case (i): The output should be its factorial
For Test case (ii): The output should display the following message "Factorials are only defined for
non-negative integers. Please input a non-negative integer"
For Test case (iii): The output should be '1'
That means your code should deliver appropriate output based on all possible inputs.
Defining a test case for the program is another skill to be mastered. Hence, the students are advised
to design appropriate test cases to test the efficiency of the code. If your code passes all possible test
cases, your code is said to be robust.
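The three factorial test cases above can be sketched in Python, the language used in this lab (the example in the text mentions C code). The names `factorial` and `run_test_cases`, the sample inputs 5, -3 and 0, and the wording of the error message are illustrative assumptions, not part of the manual.

```python
def factorial(n: int) -> int:
    """Return n! for a non-negative integer n."""
    # Reject non-integers (bool is a subclass of int, so exclude it explicitly).
    if not isinstance(n, int) or isinstance(n, bool):
        raise TypeError("Input must be an integer")
    # Test case (ii): a negative input must produce a clear message.
    if n < 0:
        raise ValueError("Factorial is not defined for negative integers")
    result = 1  # Test case (iii): for n = 0 the loop is skipped and 1 is returned.
    for k in range(2, n + 1):
        result *= k
    return result


def run_test_cases() -> dict:
    """Run one input of each kind and collect the outcome or the error message."""
    outcomes = {}
    for label, value in [("positive", 5), ("negative", -3), ("zero", 0)]:
        try:
            outcomes[label] = factorial(value)
        except ValueError as err:
            outcomes[label] = f"Error: {err}"
    return outcomes


if __name__ == "__main__":
    for label, outcome in run_test_cases().items():
        print(label, "->", outcome)
```

A test case that makes the program stop with an unhandled error (for example a non-integer input) is exactly the kind of challenging case worth recording under Additional Test Cases.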
LABORATORY PROGRAMS - CALENDAR
Week #    Date                        Title of the experiment
Week 1    18.12.2023 to 23.12.2023    1. Write a program to perform multidimensional data model using
                                      SQL queries (Star, snowflake and fact constellation schemes).
Week 2    25.12.2023 to 30.12.2023    2. Write a program to perform various OLAP operations.
Week 3    01.01.2024 to 06.01.2024    3. Introduction to Python programming, Basics of Python.
                                      4. Python operators, Functions and Strings.
Week 4    08.01.2024 to 13.01.2024    5. List Collection and Tuple Collection.
                                      6. Dictionary collection and set collection.
                                      7. Control Structures and Functions.
Week 5    14.01.2024 to 20.01.2024    SANKRANTHI VACATION
Week 6    22.01.2024 to 27.01.2024    8. Introduction to NumPy, Operations on NumPy Arrays.
Week 7    29.01.2024 to 03.02.2024    9. Introduction to Pandas, Getting and Cleaning Data.
Week 8    05.02.2024 to 10.02.2024    10. Introduction to Data Visualization.
                                      11. Basics of Visualization: Plots, Subplots and their Functionalities.
Week 9    12.02.2024 to 20.02.2024    MID SEMESTER EXAMINATION - 1
Week 10   21.02.2024 to 24.02.2024    No Laboratory due to MSE-I on Monday & Tuesday
Week 11   26.02.2024 to 02.03.2024    12. Plotting Data Distributions, Categorical and Time-Series Data.
Week 12   04.03.2024 to 09.03.2024    13. Generate association rules from frequent item sets.
Week 13   11.03.2024 to 16.03.2024    14. Regression and Classification: Linear regression and logistic
                                      regression.
Week 14   18.03.2024 to 23.03.2024    15. Implement Decision tree, random forest, k-Nearest Neighbor
                                      algorithms.
Week 15   25.03.2024 to 30.03.2024    16. Implement K-means and hierarchical clustering algorithms.
Week 16   01.04.2024 to 06.04.2024    Makeup Laboratory
Week 17   08.04.2024 to 20.04.2024    MID SEMESTER EXAMINATION - 2
Week 18   22.04.2024 to 30.04.2024    LABORATORY END SEMESTER EXAMINATION
LAB EXPERIMENTS CALENDAR – MAKE-UP SESSIONS
S. No.    Make-up Lab on (Date)    Time    Title of the experiment
Make-up lab sessions - for sessions lost due to holidays
1.
2.
3.
4.
Make-up lab sessions - for sessions lost due to unforeseen circumstances
1.
2.
3.
4.
LIST OF PROGRAMS & CIE
Exp. No.  Title of the experiment  Date of conduction  Marks awarded (40)  Signature of faculty
E1        1. Write a program to perform multidimensional data model using SQL
          queries (Star, snowflake and fact constellation schemes).
E2        2. Write a program to perform various OLAP operations.
E3        3. Introduction to Python programming, Basics of Python.
          4. Python operators, Functions and Strings.
E4        5. List Collection and Tuple Collection.
          6. Dictionary collection and set collection.
          7. Control Structures and Functions.
E5        8. Introduction to NumPy, Operations on NumPy Arrays.
E6        9. Introduction to Pandas, Getting and Cleaning Data.
E7        10. Introduction to Data Visualization.
          11. Basics of Visualization: Plots, Subplots and their Functionalities.
E8        12. Plotting Data Distributions, Categorical and Time-Series Data.
E9        13. Generate association rules from frequent item sets.
E10       14. Regression and Classification: Linear regression and logistic regression.
E11       15. Implement Decision tree, random forest, k-Nearest Neighbor algorithms.
E12       16. Implement K-means and hierarchical clustering algorithms.
Laboratory Task - 1
E1. 1. Write a program to perform multidimensional data model using SQL queries
(Star, snowflake and fact constellation schemes)
Objectives: This lab will develop students’ knowledge of
1. the use of the multidimensional model.
2. the design of star, snowflake and fact constellation schemas.
Outcomes: Upon completion of this lab, students will be able to
1. apply the star, snowflake and fact constellation schemas.
2. apply the SQL queries on multidimensional model.
CONCEPT AT A GLANCE
Data: It is a set of facts and figures. It is the raw material: data items made up of numbers,
letters and other symbols.
Ex: 101, Ashok, IT, 20000.00 etc.
Information: It is a collection of meaningful and relevant data items. When data is
processed, the resulting values give the information.
Field or Column: To prepare information, data items are organized in the form of
fields.
Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and
non-volatile collection of data in support of management’s decision-making process. A data
warehouse is a relational database that is designed for query and analysis rather than for
transaction processing. It usually contains historical data derived from transaction data,
but it can include data from other sources.
A data warehouse is based on a multidimensional data model which views data in
the form of a data cube. A data cube, such as sales, allows data to be modelled and
viewed in multiple dimensions.
• Dimension tables contain descriptive properties of the dimension.
Ex: item (item_name, brand, type), or time (day, week, month, quarter, year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables
In data warehousing, an n-D base cube is called a base cuboid. The topmost 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
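The lattice of cuboids can be illustrated with a short Python sketch. It assumes four example dimensions (time, item, location, branch) and no concept hierarchies, so each cuboid is simply one subset of the dimensions and there are 2^n cuboids in total. The function name `cuboid_lattice` is hypothetical.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every cuboid (subset of dimensions) by its dimensionality.

    Key 0 holds the 0-D apex cuboid, key n holds the n-D base cuboid.
    """
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = [tuple(c) for c in combinations(dimensions, k)]
    return lattice


# Assumed example dimensions for a sales data cube.
dims = ["time", "item", "location", "branch"]
lattice = cuboid_lattice(dims)
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
print("Total cuboids:", total_cuboids)
```

With four dimensions the sketch lists 1 apex cuboid, 4 one-dimensional, 6 two-dimensional and 4 three-dimensional cuboids, and 1 base cuboid, i.e. 16 cuboids in all.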
Data cube: A Lattice of Cuboids
Conceptual Modeling of Data Warehouse
Modeling data warehouses: Data warehouse is modeled using one of the following
multidimensional models which is described by dimensions & measures
1. Star schema: A fact table in the middle connected to a set of dimension tables
2. Snowflake schema: A refinement of Star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape similar to Snowflake
3. Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of Star
schemas, therefore called galaxy schema or fact constellation
Example of Star Schema
Sales analysis modeling using Star schema w.r.t item, time, branch and location dimensions
Page 17 of 201
Example of Snowflake Schema
Sales analysis modeling using Snowflake schema w.r.t item, time, branch and location dimensions.
In this example item and location tables are normalized.
Example of Fact constellation Schema
Sales analysis modeling using Fact constellation schema w.r.t item, time, branch and location
dimensions. In this example two fact tables, sales and shipping fact tables are sharing common
dimensions.
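A fact constellation can be tried out quickly with Python's built-in `sqlite3` module. The sketch below is a reduced illustration, not the schema in the figure: the table names (`item_dim`, `time_dim`, `sales_fact`, `shipping_fact`), the columns and the sample rows are all assumptions, and SQLite types replace the Oracle types used elsewhere in this task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
-- Two dimension tables shared by both fact tables
CREATE TABLE item_dim (itemkey INTEGER PRIMARY KEY, itemname TEXT);
CREATE TABLE time_dim (timekey INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);

-- First fact table: sales
CREATE TABLE sales_fact (
    timekey   INTEGER REFERENCES time_dim(timekey),
    itemkey   INTEGER REFERENCES item_dim(itemkey),
    unitssold INTEGER);

-- Second fact table: shipping, reusing the same dimensions
CREATE TABLE shipping_fact (
    timekey      INTEGER REFERENCES time_dim(timekey),
    itemkey      INTEGER REFERENCES item_dim(itemkey),
    unitsshipped INTEGER);

-- Made-up sample rows
INSERT INTO item_dim VALUES (1, 'Laptop'), (2, 'Printer');
INSERT INTO time_dim VALUES (1, 'Q1', 2017);
INSERT INTO sales_fact VALUES (1, 1, 10), (1, 2, 4), (1, 1, 5);
INSERT INTO shipping_fact VALUES (1, 1, 12), (1, 2, 4);
""")

# Each fact table is summed separately per item, so rows of one fact table
# do not multiply the rows of the other.
rows = conn.execute("""
    SELECT i.itemname,
           (SELECT SUM(s.unitssold) FROM sales_fact s
             WHERE s.itemkey = i.itemkey) AS sold,
           (SELECT SUM(h.unitsshipped) FROM shipping_fact h
             WHERE h.itemkey = i.itemkey) AS shipped
    FROM item_dim i
    ORDER BY i.itemkey
""").fetchall()

for row in rows:
    print(row)
```

Because both fact tables point at the same `item_dim` rows, one query can compare units sold with units shipped for each item.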
Demonstrate Star schema creation for sales data analysis.
Assume sales data is analyzed w.r.t. the item, location, branch and time dimensions. Create the
tables and insert the sample data.
Table for time dimension:
SQL> Create table time2017(timekey number(6) primary key, month varchar2(5),
     quarter varchar2(3), year number(4));
Table created.
Table for item dimension:
SQL> Create table item2017(itemkey number(6) primary key, itemname varchar2(20),
     brand varchar2(20), category varchar2(20));
Table created.
Table for location dimension:
SQL> Create table location2017(lockey number(6) primary key, street varchar2(20),
     city varchar2(20), state varchar2(20), country varchar2(20));
Table created.
Table for branch dimension:
SQL> Create table branch2017(branchkey number(6) primary key, brname varchar2(20),
     brtype varchar2(20));
Table created.
Fact Table for sales data analysis:
SQL> Create table salesfact(timekey number(6) references time2017(timekey),
     itemkey number(6) references item2017(itemkey),
     brkey number(6) references branch2017(branchkey),
     lockey number(6) references location2017(lockey),
     unitssold number(10,2), dollarssold number(10,2), avgsales number(10,2));
Table created.
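As a rough illustration of how a star schema is queried, the following sketch builds a reduced version of the schema above with Python's built-in sqlite3 module. Using an in-memory SQLite database instead of the SQL> prompt is an assumption made only to keep the example self-contained; the simplified column types and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two dimension tables and one fact table, reduced from the schema above.
cur.execute("create table item2017(itemkey integer primary key, itemname text, brand text)")
cur.execute("create table location2017(lockey integer primary key, city text, country text)")
cur.execute(
    """create table salesfact(
           itemkey integer references item2017(itemkey),
           lockey integer references location2017(lockey),
           unitssold real,
           dollarssold real)"""
)

# Hypothetical sample rows.
cur.executemany("insert into item2017 values (?, ?, ?)",
                [(1, "Pen", "Acme"), (2, "Book", "Acme")])
cur.executemany("insert into location2017 values (?, ?, ?)",
                [(10, "Warangal", "India"), (20, "Hyderabad", "India")])
cur.executemany("insert into salesfact values (?, ?, ?, ?)",
                [(1, 10, 5.0, 50.0), (1, 20, 3.0, 30.0), (2, 10, 2.0, 80.0)])

# Join the fact table to a dimension table and total the dollars sold per city.
rows = cur.execute(
    """select l.city, sum(f.dollarssold)
         from salesfact f join location2017 l on f.lockey = l.lockey
        group by l.city
        order by l.city"""
).fetchall()
print(rows)
conn.close()
```

The fact table holds only keys and measures; descriptive attributes such as the city name come from the dimension table through the join.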
RECORD WORK
In-Lab Exercise Problems (EPs)
(a). In-Lab Exercise Problems (EPs): In-Lab Exercise Problems (EPs) are the problems to be
coded during the lab session. Students should complete all EPs in the lab slot itself, with the
necessary write-up in the space provided, by following the required Program
Development Steps (PDS).
Therefore, students should
1. work on the At-Home SPs and execute those SPs with Additional Test Cases
before attending the lab.
2. complete the In-Lab EPs in the lab session.
Prior preparation on EPs will help the student a lot in completing the EPs within the
stipulated lab session. It is always a good practice to attend the lab with prior
preparation.
(b). For EP1- All steps of PDS are mandatory:

Algorithm/pseudocode development and drawing flowchart are mandatory only for
the FIRST Exercise Problem (EP-1) of every Laboratory task.
(c). Student should focus on developing code for the In-Lab EPs, which is
1. readable (with proper Annotations and Indentation)
2. maintainable
3. extendable
4. testable and
5. robust
RECORD WORK
EP1. Create Snowflake schema for sales analysis.

EP2. Insert the data into dimensions and fact tables created in Question No.1.
EP3. Create Star schema for hospital management and insert the data.
Viva-Voce Questions:
1. What is Data warehousing?

2. What are fact tables and dimension tables?
3. What is the difference between data mining and data warehousing?
4. What is an OLTP system and OLAP system?
5. What is Snowflake schema design in database?
Note: For Viva-voce questions, the students should demonstrate their knowledge and
skills through oral communication. Writing answers to these Questions is Not mandatory.
However, it is good practice to have answers written after completion of the lab session for future reference.
Laboratory Task - 2
E2. 2. Write a program to perform various OLAP Operations

1. to learn how to use OLAP operations on the given data.
1. apply multidimensional model to analyze the data.
2. apply OLAP operations on the given data.
CONCEPT AT A GLANCE
Group By Clause:
An aggregate function takes multiple rows of data returned by a query and aggregates them into a
single result row. Including the Group By clause limits the window of data processed by the
aggregate function. It produces an aggregated value for each distinct combination of values present
in the columns listed in the Group By clause. The number of rows returned is at most the product
of the numbers of distinct values of the columns listed in the Group By clause, since only
combinations that actually occur in the data produce a row.
Rollup:
In addition to the regular aggregation results with the Group By clause, the Rollup extension
produces group subtotals from right to left and a grand total. If "n" is the number of columns listed
in the Rollup, there will be n+ 1 levels of subtotals.
Example:
Rollup (a, b, c) creates the following subtotals:
(a, b, c)
(a, b)
(a)
()
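The right-to-left pattern of Rollup can be sketched in Python. This is only an illustration of which groupings are produced, not of how the database computes them; the function name rollup_groupings is hypothetical.

```python
def rollup_groupings(columns):
    """Return the groupings produced by ROLLUP over the given columns."""
    # Drop one column at a time from the right: (a, b, c), (a, b), (a), ()
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]


groups = rollup_groupings(["a", "b", "c"])
for group in groups:
    print(group)
print(len(groups), "levels for", 3, "columns")
```

For n columns the list always has n + 1 entries, the last one being the empty grouping that corresponds to the grand total.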
Query 1:
SQL>select deptno,job,sum(sal) from emp group by rollup(deptno,job);
It is possible to do a partial rollup to reduce the number of subtotals calculated.
select deptno,job,sum(sal) from emp group by mgr , rollup(deptno,job);
Cube:
In addition to the subtotals generated by the Rollup extension, the Cube extension will generate
subtotals for all combinations of the dimensions specified. If "n" is the number of columns listed in
the CUBE, there will be 2^n subtotal combinations, so the number of subtotals that need to be
calculated grows rapidly as the number of dimensions increases.
CUBE (a, b, c) produces the following subtotals:
(a, b, c)
(a, b)
(a, c)
(a)
(b, c)
(b)
(c)
()
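The groupings of a Cube are all subsets of the listed columns, which can be enumerated with itertools.combinations. The sketch below only lists the groupings; the function name cube_groupings is hypothetical.

```python
from itertools import combinations


def cube_groupings(columns):
    """Return every grouping produced by CUBE over the given columns."""
    groupings = []
    # Take subsets of every size, from all columns down to the empty grouping.
    for size in range(len(columns), -1, -1):
        groupings.extend(combinations(columns, size))
    return groupings


cube = cube_groupings(["a", "b", "c"])
for grouping in cube:
    print(grouping)
print(len(cube))
```

With three columns the function returns 2^3 = 8 groupings, matching the list above.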
Query 2:
SQL>select deptno,job,sum(sal) from emp group by cube(deptno,job);
Query 3: Partial cube
SQL> select deptno, job, Avg(sal) from emp group by mgr, cube(deptno,job);
Query 4:
SQL>select deptno,job,mgr,max(sal) from emp group by cube(deptno,job,mgr);
Grouping Functions:
Grouping:
It can be quite easy to visually identify subtotals generated by rollups and cubes, but to do it
programmatically we need to detect the null values they generate. This is where the Grouping function is
useful. It accepts a single column as a parameter and returns "1" if the column contains a null value
generated as part of a subtotal by a ROLLUP or CUBE operation or "0" for any other value, including
stored null values.
Query 5:
SQL>select deptno,job,sum(sal), grouping(deptno) id_dept, grouping(job) id_job from emp
    group by rollup(deptno,job);
Group_Id:
It's possible to write queries that return duplicate subtotals, which can be a little confusing. The
group_id function assigns the value "0" to the first set, and all subsequent sets get assigned a higher
number.
Grouping sets:
Calculating all possible subtotals in a cube, especially one with many dimensions, can be quite an
intensive process. If only a few of the subtotals are needed, this represents a considerable amount
of wasted effort. In that case we can use the Grouping Sets expression and specify exactly which
levels of subtotaling are required.
Composite Columns:
Rollup and Cube consider each column independently when deciding which subtotals must be
calculated. For Rollup this means stepping back through the list to determine the groupings.
Composite columns allow columns to be grouped together with parentheses so they are treated as a
single unit when determining the necessary groupings. In the following Rollup, columns "a" and "b"
have been turned into a composite column by the additional parentheses. As a result, the group of
"a" is no longer calculated, as column "a" is only present as part of the composite column in the
statement.
ROLLUP ((a, b), c) produces:
(a, b, c)
(a, b)
()
Not considered:
(a)
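The effect of a composite column on Rollup can be sketched as follows. In this illustration a Python tuple stands for a composite column; the function name rollup_with_composites is hypothetical and the code only lists the groupings.

```python
def rollup_with_composites(columns):
    """List ROLLUP groupings; a tuple entry is treated as one composite column."""
    groupings = []
    for k in range(len(columns), -1, -1):
        flat = []
        for entry in columns[:k]:
            # The members of a composite column are kept or dropped together.
            flat.extend(entry if isinstance(entry, tuple) else (entry,))
        groupings.append(tuple(flat))
    return groupings


result = rollup_with_composites([("a", "b"), "c"])
print(result)
```

Because "a" and "b" always travel together, the grouping (a) never appears in the result.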
In a similar way, the possible combinations of the following Cube are reduced because references to
"a" or "b" individually are not considered as they are treated as a single column when the groupings
are determined.
CUBE ((a, b), c) produces:
(a, b, c)
(a, b)
(c)
()
Not considered:
(a, c)
(a)
(b, c)
(b)
Query 6:
SQL>select deptno,job,mgr, sum(sal), grouping(deptno) id_dept, grouping(job) id_job from emp
    group by rollup(deptno,(job,mgr));
Concatenated Groupings
Concatenated groupings are defined by putting together multiple GROUPING SETS, CUBEs or
ROLLUPs separated by commas. The resulting groupings are the cross-product of all the groups
produced by the individual grouping sets.
Query 7:
SQL>select deptno,job,mgr, sum(sal), grouping(deptno) id_dept, grouping(job) id_job from emp
    group by grouping sets(deptno,job), grouping sets(deptno,mgr);
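The cross-product behind concatenated groupings can be sketched with itertools.product. The example below uses the two grouping sets of Query 7; the function name concatenate_groupings is hypothetical and the code only lists the resulting groupings.

```python
from itertools import product


def concatenate_groupings(*grouping_lists):
    """Cross-product of several lists of groupings, as in concatenated groupings."""
    combined = []
    for combo in product(*grouping_lists):
        merged = []
        for grouping in combo:
            for column in grouping:
                # A column repeated across the parts is listed only once.
                if column not in merged:
                    merged.append(column)
        combined.append(tuple(merged))
    return combined


first = [("deptno",), ("job",)]    # grouping sets(deptno, job)
second = [("deptno",), ("mgr",)]   # grouping sets(deptno, mgr)
result = concatenate_groupings(first, second)
print(result)
```

Two grouping sets with two entries each give 2 x 2 = 4 groupings.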
RECORD WORK
EP1. Write the queries for the following using sales fact table:
a. Display average sales for the rollup combination of location and branch.
b. Display sum of dollars sold for the cube combination location, time, branch.
c. Display average of dollars sold for the cube combination location, (time, branch).
d. Demonstrate grouping sets on the sales table.
e. Demonstrate concatenated grouping sets on the sales schema.
Viva-Voce Questions:
1. What is rollup operation?

2. What is cube operation?
3. What is grouping function?
4. What is grouping sets?
5. What is grouping id?
Note: For Viva-voce questions, the students should demonstrate their knowledge and
skills through oral communication. Writing answers to these Questions is Not mandatory.
Laboratory Task - 3
3. Introduction to Python programming, Basics of Python.

E3.
4. Python operators, Functions and Strings.

1. to learn the use of python programming and its basics.
2. to learn how to use python operators, functions and strings.
1. apply python programming and its basics on the data.
2. apply python operators, functions and strings on the given data.
CONCEPT AT A GLANCE
What is Python?
Python is a popular programming language. It was created by Guido van Rossum and
released in 1991. Python is a high-level, object-oriented programming language. It is also called a
general-purpose programming language, as it is used in almost every domain, as mentioned below:
1. Web Development
2. Software Development
3. Game Development
4. AI & ML
5. Data Analytics
Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc). Python has a
simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer lines than some other
programming languages.
Python runs on an interpreter system, meaning that code can be executed as soon as it is
written. This means that prototyping can be very quick. Python can be treated in a procedural way,
an object-oriented way or a functional way.
Python Quick start
Python is an interpreted programming language; this means that as developers we write
Python (.py) files in a text editor and then put those files into the Python interpreter to be executed.
The way to run a Python file on the command line is like this:
C:\Users\Name>python helloworld.py
where "helloworld.py" is the name of the Python file.
Let's write our first Python file, called helloworld.py, which can be done in any text editor.
helloworld.py
print("Hello, World!")
Save the file. Open the command line, navigate to the directory where we saved the file, and run:
C:\Users\Name>python helloworld.py
The output should read: Hello, World!
Python operators, Functions and strings
Python Operators:
Python divides the operators in the following groups:
• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Identity operators
• Membership operators
• Bitwise operators
Arithmetic Operators
Arithmetic operators are used to perform mathematical operations like addition, subtraction,
multiplication, and division.
Operator   Description                                                      Syntax
+          Addition: adds two operands                                      x + y
-          Subtraction: subtracts two operands                              x - y
*          Multiplication: multiplies two operands                          x * y
/          Division (float): divides the first operand by the second        x / y
//         Division (floor): divides the first operand by the second        x // y
%          Modulus: returns the remainder when the first operand is         x % y
           divided by the second
**         Power: returns the first operand raised to the power of the      x ** y
           second
# Examples of Arithmetic Operators
a = 9
b = 4
# Addition of numbers
add = a + b
# Subtraction of numbers
sub = a - b
# Multiplication of numbers
mul = a * b
# Division (float) of numbers
div1 = a / b
# Division (floor) of numbers
div2 = a // b
# Modulo of both numbers
mod = a % b
# Power
p = a ** b
# print results
print(add)
print(sub)
print(mul)
print(div1)
print(div2)
print(mod)
print(p)
Output
13
5
36
2.25
2
1
6561
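The difference between float division and floor division is easiest to see with a negative operand. The following short sketch uses illustrative values that are not part of the example above.

```python
a = -9
b = 4

true_quotient = a / b     # float division keeps the fractional part
floor_quotient = a // b   # floor division rounds down, towards negative infinity
remainder = a % b         # the remainder takes the sign of the second operand

print(true_quotient)      # -2.25
print(floor_quotient)     # -3
print(remainder)          # 3

# The two results always satisfy: a == (a // b) * b + (a % b)
print(a == floor_quotient * b + remainder)  # True
```

Floor division therefore gives -3 rather than -2 here, because -3 is the largest integer that does not exceed -2.25.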
Comparison Operators
Comparison (relational) operators compare values. Each returns either True or False according to
the condition.
Operator   Description                                                      Syntax
>          Greater than: True if the left operand is greater than the right x > y
<          Less than: True if the left operand is less than the right       x < y
==         Equal to: True if both operands are equal                        x == y
!=         Not equal to: True if the operands are not equal                 x != y
>=         Greater than or equal to: True if the left operand is greater    x >= y
           than or equal to the right
<=         Less than or equal to: True if the left operand is less than     x <= y
           or equal to the right
# Examples of Relational Operators
a = 13
b = 33
# a > b is False
print(a > b)
# a < b is True
print(a < b)
# a == b is False
print(a == b)
# a != b is True
print(a != b)
# a >= b is False
print(a >= b)
# a <= b is True
print(a <= b)
Output
False
True
False
True
False
True
Logical Operators
Logical operators perform Logical AND, Logical OR, and Logical NOT operations. They are used to
combine conditional statements.

and Logical AND: True if both the operands are true x and y
or Logical OR: True if either of the operands is true x or y
not Logical NOT: True if the operand is false not x
# Examples of Logical Operators
a = True
b = False
# a and b is False
print(a and b)
# a or b is True
print(a or b)
# not a is False
print(not a)
Output
False
True
False
Bitwise Operators
Bitwise operators act on bits and perform bit-by-bit operations. They are used to operate on
binary numbers.
Operator   Description           Syntax
&          Bitwise AND           x & y
|          Bitwise OR            x | y
~          Bitwise NOT           ~x
^          Bitwise XOR           x ^ y
>>         Bitwise right shift   x >> n
<<         Bitwise left shift    x << n
# Examples of Bitwise Operators
a = 10
b = 4
# Print bitwise AND operation
print(a & b)
# Print bitwise OR operation
print(a | b)
# Print bitwise NOT operation
print(~a)
# Print bitwise XOR operation
print(a ^ b)
# Print bitwise right shift operation
print(a >> 2)
# Print bitwise left shift operation
print(a << 2)
Output
0
14
-11
14
2
40
Assignment Operators
Assignment operators are used to assign values to variables.
Operator   Description                                                        Syntax     Same as
=          Assign the value of the right-side expression to the left operand  x = y + z
+=         Add AND: add the right operand to the left operand, then           a += b     a = a + b
           assign to the left operand
-=         Subtract AND: subtract the right operand from the left operand,    a -= b     a = a - b
           then assign to the left operand
*=         Multiply AND: multiply the right operand with the left operand,    a *= b     a = a * b
           then assign to the left operand
/=         Divide AND: divide the left operand by the right operand, then     a /= b     a = a / b
           assign to the left operand
%=         Modulus AND: take the modulus using the left and right operands,   a %= b     a = a % b
           then assign the result to the left operand
//=        Divide (floor) AND: divide the left operand by the right operand,  a //= b    a = a // b
           then assign the floor value to the left operand
**=        Exponent AND: raise the left operand to the power of the right     a **= b    a = a ** b
           operand, then assign the value to the left operand
&=         Perform Bitwise AND on the operands and assign the value to the    a &= b     a = a & b
           left operand
|=         Perform Bitwise OR on the operands and assign the value to the     a |= b     a = a | b
           left operand
^=         Perform Bitwise XOR on the operands and assign the value to the    a ^= b     a = a ^ b
           left operand
>>=        Perform Bitwise right shift on the operands and assign the value   a >>= b    a = a >> b
           to the left operand
<<=        Perform Bitwise left shift on the operands and assign the value    a <<= b    a = a << b
           to the left operand
# Examples of Assignment Operators
a = 10
b = a
print(b)
# Add and assign value
b += a
print(b)
# Subtract and assign value
b -= a
print(b)
# Multiply and assign
b *= a
print(b)
# Bitwise left shift and assign
b <<= a
print(b)
Output
10
20
10
100
102400
Identity Operators
is and is not are the identity operators. Both are used to check whether two values are located in
the same part of memory. Two variables that are equal are not necessarily identical.
is        True if the operands are identical
is not    True if the operands are not identical
a = 10
b = 20
c = a
print(a is not b)
print(a is c)
Output
True
True
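The statement that equal variables need not be identical can be illustrated with two lists. The variable names below are chosen only for this sketch.

```python
first = [1, 2, 3]
second = [1, 2, 3]   # same contents, but a separate list object
alias = first        # a second name for the same list object

print(first == second)   # True: the values are equal
print(first is second)   # False: they are two different objects
print(first is alias)    # True: both names refer to one object
```

The == operator compares values, while is compares object identity.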
Membership Operators
in and not in are the membership operators; they are used to test whether a value or variable is in a sequence.
in        True if value is found in the sequence
not in    True if value is not found in the sequence
# Python program to illustrate the 'in' and 'not in' operators
x = 24
y = 20
# The name "numbers" is used so that the built-in list type is not shadowed
numbers = [10, 20, 30, 40, 50]
if x not in numbers:
    print("x is NOT present in given list")
else:
    print("x is present in given list")
if y in numbers:
    print("y is present in given list")
else:
    print("y is NOT present in given list")
Output
x is NOT present in given list
y is present in given list
Functions:
A function is a block of organized, reusable code that is used to perform a single, related
action. Functions provide better modularity for the application and a high degree of code reuse.
As we already know, Python gives us many built-in functions like print(), but we can also
create our own functions. These functions are called user-defined functions.
Creating a Function
In Python a function is defined using the def keyword:
Example
def my_function():
    print("Hello from a function")
Calling a Function
To call a function, use the function name followed by parentheses:
Example
def my_function():
    print("Hello from a function")
my_function()
Arguments
Information can be passed into functions as arguments.
Arguments are specified after the function name, inside the parentheses. We can add as many
arguments as we want, just separate them with a comma.
The following example has a function with one argument (fname). When the function is called,
we pass along a first name, which is used inside the function to print the full name:
Example
def my_function(fname):
    # The space before the surname keeps the two names apart in the output
    print(fname + " Refsnes")
my_function("Emil")
my_function("Tobias")
my_function("Linus")
Strings
Strings in python are surrounded by either single quotation marks, or double quotation marks. 'hello'
is the same as "hello".
We can display a string literal with the print() function:
Example
print("Hello")
Assign String to a Variable
Assigning a string to a variable is done with the variable name followed by an equal sign and the
string:
Example
a= "Hello"
print(a)
Multiline Strings
We can assign a multiline string to a variable by using three quotes:
Example
We can use three double quotes:
a= """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua."""
print(a)
Strings are Arrays
Like arrays in many other popular programming languages, strings in Python are sequences: each
element is a Unicode character that can be reached by its position.
However, Python does not have a character data type; a single character is simply a string with a
length of 1.
Square brackets can be used to access elements of the string.
Example
Get the character at position 1 (remember that the first character has the position 0):
a= "Hello,World!"
print(a[1])
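That an indexed character is itself a string of length 1 can be checked directly. The variable name ch is used only for this sketch.

```python
a = "Hello,World!"
ch = a[1]

print(ch)          # e
print(type(ch))    # <class 'str'>
print(len(ch))     # 1
```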
Looping Through a String
Since strings are arrays, we can loop through the characters in a string, with a for loop.
Example
Loop through the letters in the word "banana":
for x in "banana":
    print(x)
String Length
To get the length of a string, use the len() function.
Example
The len() function returns the length of a string:
a= "Hello,World!"
print(len(a))
Slicing
We can return a range of characters by using the slice syntax.
Specify the start index and the end index, separated by a colon, to return a part of the string.
Example
Get the characters from position 2 to position 5 (not included):
b= "Hello,World!"
print(b[2:5])
Slice From the Start
By leaving out the start index, the range will start at the first character:
Example
Get the characters from the start to position 5 (not included):
b= "Hello,World!"
print(b[:5])
Slice To the End
By leaving out the end index, the range will go to the end:
Example
Get the characters from position 2, and all the way to the end:
b= "Hello,World!"
print(b[2:])
RECORD WORK
EP1. Write a function to find factorial of the number.

EP2. Write a function to find LCM of two numbers.
EP3. Write a function to find GCD of two numbers.
EP4. Write a function to extract sub string from the given string.
Viva-Voce Questions:
1. What are the key features of Python?

2. Explain the ternary operator in Python.
3. How would we convert a string into lowercase?
4. What is the pass statement in Python?
Note: For Viva-voce questions, the students should demonstrate their knowledge and skills
through oral communication. Writing answers to these Questions is Not mandatory.
Laboratory Task – 4
5. List collection and tuple collection.

E4. 6. Dictionary collection and set collection.
7. Control structures and functions

1. to learn the use of List.
2. to learn use of Dictionary and set collection.
3. to learn use of control structures and function.
1. apply the List on a given data.
2. apply Dictionary and set collection on a given data.
3. apply control structures and function on a given data.
CONCEPT AT A GLANCE
List collection and tuple collection.
Lists:
Lists are used to store multiple items in a single variable. Lists are created using square brackets:
Example
Create a List:
thislist=["apple", "banana", "cherry"]
print(thislist)
Output:
['apple', 'banana', 'cherry']
List Items
List items are ordered, changeable, and allow duplicate values.
List items are indexed, the first item has index [0], the second item has index [1] etc.
Ordered
When we say that lists are ordered, it means that the items have a defined order, and that order will
not change. If we add new items to a list, the new items will be placed at the end of the list.
Changeable
The list is changeable, meaning that we can change, add, and remove items in a list after it has been
created.
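A short sketch of the three kinds of change mentioned above, using the fruit list from the earlier example; the replacement values are chosen only for illustration.

```python
thislist = ["apple", "banana", "cherry"]

thislist[1] = "mango"        # change an existing item
thislist.append("orange")    # add a new item at the end
thislist.remove("apple")     # remove an item by value

print(thislist)   # ['mango', 'cherry', 'orange']
```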
Allow Duplicates
Since lists are indexed, lists can have items with the same value:
Example
Lists allow duplicate values:
thislist=["apple", "banana", "cherry", "apple", "cherry"]
print(thislist)
List Length
To determine how many items a list has, use the len() function:
Example
Print the number of items in the list:
print(len(thislist))
List Items - Data Types
List items can be of any data type:
Example
String, int and boolean data types:
list1=["apple", "banana", "cherry"]
list2=[1, 5, 7, 9, 3]
list3 = [True, False, False]
A list can contain different data types:
Example
A list with strings, integers and boolean values:
list1 = ["abc", 34, True, 40, "male"]
Access Items
List items are indexed and we can access them by referring to the index number:
Example
Print the second item of the list:
print(thislist[1])
Range of Indexes
We can specify a range of indexes by specifying where to start and where to end the range.
When specifying a range, the return value will be a new list with the specified items.
Example
Return the third, fourth, and fifth item:
thislist=["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"]
print(thislist[2:5])
Example
This example returns the items from the beginning to, but NOT including, "kiwi":
thislist=["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"]
print(thislist[:4])
Check if Item Exists
To determine if a specified item is present in a list use the in keyword:
Example
Check if "apple" is present in the list:
if "apple" in thislist:
    print("Yes, 'apple' is in the fruits list")
type()
From Python's perspective, lists are defined as objects with the data type 'list':
<class 'list'>
Example
What is the data type of a list?
mylist=["apple", "banana", "cherry"]
print(type(mylist))
Tuple:
Tuples are used to store multiple items in a single variable.
Tuple is one of 4 built-in data types in Python used to store collections of data, the other 3 are List,
Set, and Dictionary, all with different qualities and usage.
A tuple is a collection which is ordered and unchangeable. Tuples are written with round brackets.
Example
Create a Tuple:
thistuple=("apple", "banana", "cherry")
print(thistuple)
Tuple Items
Tuple items are ordered, unchangeable, and allow duplicate values.
Tuple items are indexed, the first item has index [0], the second item has index [1] etc.
Ordered
When we say that tuples are ordered, it means that the items have a defined order, and that order
will not change.
Unchangeable
Tuples are unchangeable, meaning that we cannot change, add or remove items after the tuple has
been created.
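What "unchangeable" means in practice can be seen by trying to assign to a tuple item. The sketch below catches the resulting error so that the program keeps running; the variable name message is used only for this illustration.

```python
thistuple = ("apple", "banana", "cherry")

try:
    thistuple[1] = "mango"   # tuples do not support item assignment
except TypeError as err:
    message = str(err)
    print("Cannot change a tuple:", message)

print(thistuple)   # the tuple is unchanged
```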
Allow Duplicates
Since tuples are indexed, they can have items with the same value:
Example
Tuples allow duplicate values:
thistuple=("apple", "banana", "cherry", "apple", "cherry")
print(thistuple)
Tuple Length
To determine how many items a tuple has, use the len() function:
Example
Print the number of items in the tuple:
thistuple=("apple", "banana", "cherry")
print(len(thistuple))
type()
From Python's perspective, tuples are defined as objects with the data type 'tuple':
<class 'tuple'>
Example
What is the data type of a tuple?
mytuple=("apple", "banana", "cherry")
print(type(mytuple))
Dictionary collection and set collection.
Dictionary
Dictionaries are used to store data values in key: value pairs.
A dictionary is a collection which is ordered (as of Python 3.7), changeable and does not allow duplicate keys.
Dictionaries are written with curly brackets, and have keys and values:
Example
Create and print a dictionary:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict)
Dictionary Items
Dictionary items are ordered, changeable, and do not allow duplicate keys.
Dictionary items are presented in key:value pairs, and can be referred to by using the key name.
Example
Print the "brand" value of the dictionary:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict["brand"])
Ordered or Unordered?
As of Python version 3.7, dictionaries are ordered. In Python 3.6 and earlier, dictionaries are
unordered.
When we say that dictionaries are ordered, it means that the items have a defined order, and that
order will not change.
Unordered means that the items do not have a defined order, and we cannot refer to an item by using
an index.
Changeable
Dictionaries are changeable, meaning that we can change, add or remove items after the dictionary
has been created.
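A short sketch of changing, adding and removing dictionary items, using the car dictionary from the earlier examples; the "color" key is introduced only for illustration.

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

thisdict["year"] = 2020      # change the value of an existing key
thisdict["color"] = "red"    # add a new key: value pair
del thisdict["model"]        # remove an item by its key

print(thisdict)   # {'brand': 'Ford', 'year': 2020, 'color': 'red'}
```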
Duplicates Not Allowed
Dictionaries cannot have two items with the same key:
Example
Duplicate keys will overwrite existing values:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964,
    "year": 2020
}
print(thisdict)
type()
From Python's perspective, dictionaries are defined as objects with the data type 'dict':
<class 'dict'>
Example
Print the data type of a dictionary:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(type(thisdict))
Accessing Items
We can access the items of a dictionary by referring to its key name, inside square brackets:
Example
Get the value of the "model" key:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
x = thisdict["model"]
There is also a method called get() that will give us the same result:
Example
Get the value of the "model" key:
x = thisdict.get("model")
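The two ways of access give the same result for a key that exists, but they behave differently for a missing key. The sketch below uses a "color" key that is deliberately absent from the dictionary.

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

model = thisdict.get("model")              # same result as thisdict["model"]
missing = thisdict.get("color")            # missing key: get() returns None
fallback = thisdict.get("color", "n/a")    # a default value can be supplied

print(model, missing, fallback)

try:
    thisdict["color"]                      # square brackets raise KeyError
except KeyError:
    print("KeyError: 'color' is not a key of the dictionary")
```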
Nested Dictionaries
A dictionary can contain dictionaries; this is called nested dictionaries.
Example
Create a dictionary that contains three dictionaries:
myfamily = {
    "child1": {
        "name": "Emil",
        "year": 2004
    },
    "child2": {
        "name": "Tobias",
        "year": 2007
    },
    "child3": {
        "name": "Linus",
        "year": 2011
    }
}
print(myfamily)
Output:
{'child1': {'name': 'Emil', 'year': 2004}, 'child2': {'name': 'Tobias', 'year': 2007}, 'child3': {'name': 'Linus',
'year': 2011}}
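Items of a nested dictionary are reached by chaining keys, outer key first. A brief sketch using the same family data:

```python
myfamily = {
    "child1": {"name": "Emil", "year": 2004},
    "child2": {"name": "Tobias", "year": 2007},
    "child3": {"name": "Linus", "year": 2011},
}

# The first key selects the inner dictionary, the second key selects its item.
second_child_name = myfamily["child2"]["name"]
print(second_child_name)   # Tobias

# Loop over the outer dictionary and read one item of each inner dictionary.
for key, child in myfamily.items():
    print(key, child["year"])
```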
Sets
Sets are used to store multiple items in a single variable.
Set is one of 4 built-in data types in Python used to store collections of data, the other 3 are List, Tuple,
and Dictionary, all with different qualities and usage.
A set is a collection which is both unordered and unindexed.
Sets are written with curly brackets.
Example
Create a Set:
thisset={"apple", "banana", "cherry"}
print(thisset)
Set Items
Set items are unordered, unchangeable, and do not allow duplicate values.
Unordered
Unordered means that the items in a set do not have a defined order.
Set items can appear in a different order every time we use them, and cannot be referred to by index
or key.
Unchangeable
Set items are unchangeable, meaning that we cannot change an item after the set has been created, but we can remove items and add new items.
Duplicates Not Allowed
Sets cannot have two items with the same value.
Example
Duplicate values will be ignored:
thisset={"apple", "banana", "cherry", "apple"}
print(thisset)
Get the Length of a Set
To determine how many items a set has, use the len() function.
Example
Get the number of items in a set:
print(len(thisset))
Access Items
We cannot access items in a set by referring to an index or a key.
But we can loop through the set items using a for loop, or ask if a specified value is present in a set,
by using the in keyword.
Example
Loop through the set, and print the values:
for x in thisset:
    print(x)
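Asking whether a value is present in a set with the in keyword can be sketched as follows; "mango" is used only as an example of a value that is not in the set.

```python
thisset = {"apple", "banana", "cherry"}

has_banana = "banana" in thisset
has_mango = "mango" in thisset

print(has_banana)   # True
print(has_mango)    # False
```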
type()
From Python's perspective, sets are defined as objects with the data type 'set':
<class 'set'>
Example
What is the data type of a set?
myset={"apple", "banana", "cherry"}
print(type(myset))
Join Two Sets
There are several ways to join two or more sets in Python.
We can use the union() method that returns a new set containing all items from both sets, or the
update() method that inserts all the items from one set into another:
Example
The union() method returns a new set with all items from both sets:
set1={"a", "b" , "c"}
set2={1, 2, 3}
set3=set1.union(set2)
print(set3)
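For comparison with union(), the update() method changes the first set in place instead of returning a new set. A brief sketch with the same two sets:

```python
set1 = {"a", "b", "c"}
set2 = {1, 2, 3}

returned = set1.update(set2)   # set1 itself receives the items of set2

print(returned)      # None: update() does not return a new set
print(len(set1))     # 6
print(len(set2))     # 3: set2 is not changed
```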
Control structures and functions.
Python Conditions and If statements
Python supports the usual logical conditions from mathematics:
• Equals: a == b
• Not Equals: a != b
• Less than: a < b
• Less than or equal to: a <= b
• Greater than: a > b
• Greater than or equal to: a >= b
These conditions can be used in several ways, most commonly in "if statements" and loops.
An "if statement" is written by using the if keyword.
Example
If statement:
a = 33
b = 200
if b > a:
    print("b is greater than a")
In this example we use two variables, a and b, which are used as part of the if statement to test whether
b is greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so we print to
screen that "b is greater than a".
Indentation
Python relies on indentation (whitespace at the beginning of a line) to define scope in the code. Other
programming languages often use curly-brackets for this purpose.
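What happens when the indentation is left out can be shown safely by compiling a small piece of source text instead of running it. This is only a sketch; the use of compile() and the names bad_source and outcome are choices made for this illustration.

```python
# The body of the if statement is not indented in this source text.
bad_source = 'if 5 > 2:\nprint("Five is greater than two")\n'

try:
    compile(bad_source, "<example>", "exec")
    outcome = "compiled"
except IndentationError:
    outcome = "IndentationError"

print(outcome)   # IndentationError
```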
Elif
The elif keyword is Python's way of saying "if the previous conditions were not true, then try this
condition".
Example
a = 33
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
In this example a is equal to b, so the first condition is not true, but the elif condition is true, so we
print to screen that "a and b are equal".
Else
The else keyword catches anything which isn't caught by the preceding conditions.
Example
a = 200
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")
In this example a is greater than b, so the first condition is not true, also the elif condition is not true,
so we go to the else condition and print to screen that "a is greater than b".
We can also have an else without the elif:
Example
a = 200
b = 33
if b > a:
    print("b is greater than a")
else:
    print("b is not greater than a")
Loops in python
Python provides the while loop and the for loop to handle looping requirements, and loops can also
be nested inside one another. While all of these provide similar basic functionality, they differ
in their syntax and in when the condition is checked.
While Loop:
In Python, a while loop is used to execute a block of statements repeatedly as long as a given
condition is true. When the condition becomes false, the line immediately after the loop in the
program is executed.
With the while loop we can execute a set of statements as long as a condition is true.
Example
Print i as long as i is less than 6:
i = 1
while i < 6:
    print(i)
    i += 1
The break Statement
With the break statement we can stop the loop even if the while condition is true:
Example
Exit the loop when i is 3:
i = 1
while i < 6:
    print(i)
    if i == 3:
        break
    i += 1
The continue Statement
With the continue statement we can stop the current iteration, and continue with the next:
Example
Continue to the next iteration if i is 3:
i = 0
while i < 6:
    i += 1
    if i == 3:
        continue
    print(i)
The else Statement
With the else statement we can run a block of code once when the condition no longer is true:
Example
Print a message once the condition is false:
i = 1
while i < 6:
    print(i)
    i += 1
else:
    print("i is no longer less than 6")
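The else block of a while loop runs only when the loop ends because its condition became false; it is skipped when the loop is left with break. A minimal sketch of this interaction follows; the function name find_first_multiple and its inputs are assumptions made for illustration.

```python
def find_first_multiple(numbers, divisor):
    """Return the first element divisible by divisor, or None if there is none."""
    i = 0
    while i < len(numbers):
        if numbers[i] % divisor == 0:
            result = numbers[i]
            break  # leave the loop early; the else block below is skipped
        i += 1
    else:
        # Reached only when the while condition became false without a break
        result = None
    return result


print(find_first_multiple([3, 5, 8, 10], 4))
print(find_first_multiple([1, 3, 5], 2))
```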
For Loop:
A for loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a
string).
This is less like the for keyword in other programming languages, and works more like an iterator
method as found in other object-oriented programming languages.
With the for loop we can execute a set of statements, once for each item in a list, tuple, set etc.
Example
Print each fruit in a fruit list:
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    print(x)
Looping Through a String
Even strings are iterable objects; they contain a sequence of characters:
Example
Loop through the letters in the word "banana":
for x in "banana":
    print(x)
The range() Function
To loop through a set of code a specified number of times, we can use the range() function.
The range() function returns a sequence of numbers, starting from 0 by default, incrementing by 1
(by default), and ending before a specified number.
Example
Using the range() function (prints the values 0 to 5):
for x in range(6):
    print(x)
The range() function increments the sequence by 1 by default; however, it is possible to specify the
increment value by adding a third parameter: range(2, 30, 3).
Example
Increment the sequence with 3 (default is 1):
for x in range(2, 30, 3):
    print(x)
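To see exactly which numbers a range() call produces, it can be converted to a list. The sketch below covers the one-, two- and three-argument forms; the variable names and the negative-step case are added for illustration.

```python
# range(stop): starts at 0 and stops before 6
first = list(range(6))

# range(start, stop): starts at 2 and stops before 6
second = list(range(2, 6))

# range(start, stop, step): every third number from 2, stopping before 30
third = list(range(2, 30, 3))

# A negative step counts downwards and stops before 0
fourth = list(range(10, 0, -2))

for values in (first, second, third, fourth):
    print(values)
```

In every form the stop value itself is never included in the sequence.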
Nested Loops
A nested loop is a loop inside a loop.
The "inner loop" will be executed one time for each iteration of the "outer loop":
Example
Print each adjective for every fruit:
adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
for x in adj:
    for y in fruits:
        print(x, y)
REORD WORK
preparation

b. maintainable
c. extendable
d. testable and
e. robust
RECORD WORK
EP1. Write Python code to reverse given list.

EP2. Write Python code to perform various operations on lists.
EP3. Write Python code to perform various operations on tuples.
EP4. Write Python code to perform various operations on dictionaries.
EP5. Write Python code to perform various operations on sets.
1. How set differs from list?

2. Explain Union, Intersection, Difference, Symmetric Difference operations in sets.
3. What is Short Hand if Else condition?
4. What is the difference between break and continue?
E5. 8. Introduction to NumPy, Operations on NumPy arrays

1. to learn the use of Numpy.
2. to learn how to use operations on Numpy arrays.
1. apply Numpy on the data.
2. apply operations on Numpy arrays.
CONCEPT AT A GLANCE
NumPy is a Python library.
NumPy is used for working with arrays.
NumPy is short for "Numerical Python".
Example
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
We can create a NumPy ndarray object by using the array() function.
Create a NumPy array:
import numpy as np
arr=np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
OUTPUT
[1 2 3 4 5]
<class 'numpy.ndarray'>
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
Create a 0-D array with value 42
import numpy as np
arr=np.array(42)
print(arr)
1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.
These are the most common and basic arrays.
Example
Create a 1-D array containing the values 1,2,3,4,5:
import numpy as np
arr=np.array([1, 2, 3, 4, 5])
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent matrix or 2nd order tensors.
Example
Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr=np.array([[1, 2, 3],[4, 5, 6]])
print(arr)
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called 3-D array.
These are often used to represent a 3rd order tensor.
Example
Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr=np.array([[[1, 2, 3],[4, 5, 6]],[[1, 2, 3],[4, 5, 6]]])
print(arr)
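The number of dimensions of each array created above can be checked with the ndim attribute of the ndarray. A minimal sketch, reusing the same values as the examples (the variable names a, b, c and d are chosen here for illustration):

```python
import numpy as np

a = np.array(42)                                   # 0-D array (scalar)
b = np.array([1, 2, 3, 4, 5])                      # 1-D array
c = np.array([[1, 2, 3], [4, 5, 6]])               # 2-D array
d = np.array([[[1, 2, 3], [4, 5, 6]],
              [[1, 2, 3], [4, 5, 6]]])             # 3-D array

# ndim is the number of dimensions; shape lists the size of each dimension
for name, arr in [("a", a), ("b", b), ("c", c), ("d", d)]:
    print(name, "ndim =", arr.ndim, "shape =", arr.shape)
```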
Access Array Elements
Array indexing is the same as accessing an array element.
We can access an array element by referring to its index number.
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.
Example
Get the first element from the following array:
import numpy as np
arr=np.array([1, 2, 3, 4])
print(arr[0])
Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.
Example
Access the 2nd element on 1st dim:
import numpy as np
arr=np.array([[1,2,3,4,5],[6,7,8,9,10]])
print('2nd element on 1st dim: ', arr[0, 1])
Access 3-D Arrays
To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.
Example
Access the third element of the second array of the first array:
import numpy as np
arr=np.array([[[1, 2, 3],[4, 5, 6]],[[7, 8, 9],[10, 11, 12]]])
print(arr[0, 1, 2])
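Indexes can also be negative, in which case they count from the end of a dimension. The sketch below repeats the arrays from the examples above and adds negative indexes as an illustration; the variable names are assumptions.

```python
import numpy as np

arr2d = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

second_of_first_row = arr2d[0, 1]     # row 0, column 1
last_of_second_row = arr2d[1, -1]     # row 1, last column

third_of_second_of_first = arr3d[0, 1, 2]   # block 0, row 1, column 2
very_last = arr3d[-1, -1, -1]                # last block, last row, last column

print(second_of_first_row, last_of_second_row)
print(third_of_second_of_first, very_last)
```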
Slicing arrays
Slicing in python means taking elements from one given index to another given index.
We pass slice instead of index like this: [start:end].
We can also define the step, like this: [start:end:step].
If we don't pass start, it is considered 0.
If we don't pass end, it is considered the length of the array in that dimension.
If we don't pass step, it is considered 1.
Example
Slice elements from index 1 to index 5 (not included) from the following array:
import numpy as np
arr=np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
Negative Slicing
Use the minus operator to refer to an index from the end:
Example
Slice from the index 3 from the end to index 1 from the end:
import numpy as np
arr=np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])
STEP
Use the step value to determine the step of the slicing:
Example
Return every other element from index 1 to index 5:
import numpy as np
arr=np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
Slicing 2-D Arrays
Example
From the second element, slice elements from index 1 to index 4 (not included):
import numpy as np
arr=np.array([[1, 2, 3, 4, 5],[6, 7, 8, 9, 10]])
print(arr[1, 1:4])
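In a 2-D array a slice can be given for the rows, for the columns, or for both. A minimal sketch using the same array as above (the result names are chosen for illustration):

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

# From both rows, take the element at column index 2
column_two = arr[0:2, 2]

# From both rows, take columns 1 to 4 (not included); the result is 2-D
block = arr[0:2, 1:4]

# From the second row, take every other element
every_other = arr[1, ::2]

print(column_two)
print(block)
print(every_other)
```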
Shape of an Array
The shape of an array is the number of elements in each dimension.
Get the Shape of an Array.
NumPy arrays have an attribute called shape that returns a tuple with each index having the
number of corresponding elements.
Example
Print the shape of a 2-D array:
import numpy as np
arr=np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
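The shape of an array can also be changed, provided the total number of elements stays the same. The text above does not cover this; the following sketch uses the NumPy reshape() method as an illustration, with array values chosen for the example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
print(arr.shape)            # one dimension with 12 elements

# 12 elements arranged as 4 rows of 3 columns
matrix = arr.reshape(4, 3)
print(matrix.shape)

# The same 12 elements arranged as 2 blocks of 3 rows and 2 columns
cube = arr.reshape(2, 3, 2)
print(cube.shape)
```

A reshape to a shape whose sizes do not multiply to 12 would raise an error.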
Joining NumPy Arrays
Joining means putting contents of two or more arrays in a single array.
In SQL we join tables based on a key, whereas in NumPy we join arrays by axes.
We pass a sequence of arrays that we want to join to the concatenate() function, along with the axis.
If axis is not explicitly passed, it is taken as 0.
Example
Join two arrays
import numpy as np
arr1=np.array([1, 2, 3])
arr2=np.array([4, 5, 6])
arr=np.concatenate((arr1,arr2))
print(arr)
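For 2-D arrays the axis argument decides whether the arrays are stacked as extra rows (axis=0) or as extra columns (axis=1). A minimal sketch with two small arrays made up for the example:

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

# Default axis=0: the rows of arr2 are appended below the rows of arr1
by_rows = np.concatenate((arr1, arr2))

# axis=1: the columns of arr2 are appended to the right of arr1
by_columns = np.concatenate((arr1, arr2), axis=1)

print(by_rows)
print(by_columns)
```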
Splitting NumPy Arrays
Splitting is reverse operation of Joining.
Joining merges multiple arrays into one and Splitting breaks one array into multiple.
We use array_split() for splitting arrays; we pass it the array we want to split and the number of
splits.
Example
Split the array in 3 parts:
import numpy as np
arr=np.array([1, 2, 3, 4, 5, 6])
newarr= np.array_split(arr, 3)
print(newarr)
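array_split() returns a list of arrays, and it also works when the elements cannot be divided evenly. The sketch below splits seven elements into three parts; the seven-element array is an assumption made for the example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Seven elements cannot be divided evenly by three,
# so the first part receives the extra element
parts = np.array_split(arr, 3)

for part in parts:
    print(part)
```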
Searching Arrays
You can search an array for a certain value, and return the indexes that get a match.
To search an array, use the where() method.
Example
Find the indexes where the value is 4:
import numpy as np
arr=np.array([1, 2, 3, 4, 5, 4, 4])
x= np.where(arr== 4)
print(x)
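where() accepts any Boolean condition, not only equality, and it returns a tuple with one index array per dimension. A minimal sketch on the same array (the variable names are chosen for illustration):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

# Indexes where the value is 4; [0] selects the index array for the only dimension
fours = np.where(arr == 4)[0]

# Indexes where the value is even
evens = np.where(arr % 2 == 0)[0]

print(fours)
print(evens)
```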
Sorting Arrays
Sorting means putting elements in an ordered sequence.
Ordered sequence is any sequence that has an order corresponding to elements, like numeric or
alphabetical, ascending or descending.
NumPy provides a function called sort() that returns a sorted copy of a specified array.
Example
Sort the array:
import numpy as np
arr=np.array([3, 2, 0, 1])
print(np.sort(arr))
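np.sort() leaves the original array unchanged, and it also sorts strings alphabetically. The sketch below shows both points and one common way to obtain descending order by reversing the sorted result; the string array is made up for the example.

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

ascending = np.sort(arr)          # sorted copy
descending = np.sort(arr)[::-1]   # reverse the sorted copy

words = np.array(["banana", "cherry", "apple"])
sorted_words = np.sort(words)     # alphabetical order

print(ascending, descending)
print(sorted_words)
print(arr)                        # the original array keeps its order
```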
REORD WORK
1. In-Lab Exercise Problem (EPS): In-lab Exercise Problems (EPs) are the problems to be
preparation
2. Student should focus on developing code for the In-Lab EPs, which is
b. maintainable
c. extendable
d. testable and
e. robust
RECORD WORK
EP1. Write Python code to print column wise addition of a 2-D array.
EP2. Write Python code to print row wise addition of a 2-D array.
EP3. Write Python code to print diagonal elements of a 2-D array.
1. What is reshape function?

2. How to create identity matrix?
3. List the commands used to find min, max and average of array elements?
4. What is slicing?
E6. 9. Introduction to Pandas, Getting and Cleaning Data.

1. to learn introduction to pandas.
2. to learn how to clean the data on the dataset.
1. apply python libraries.
2. apply techniques to clean the data on the dataset.
CONCEPT AT A GLANCE
Pandas is a Python library.
Pandas is used to analyze data.
Example
import pandas
mydataset={
    'cars':["BMW", "Volvo", "Ford"],
    'passings':[3, 7, 2]
}
myvar=pandas.DataFrame(mydataset)
print(myvar)
Pandas as pd
Pandas is usually imported under the pd alias.
Create an alias with the as keyword while importing:
import pandas as pd
Now the Pandas package can be referred to as pd instead of pandas.
Example
import pandas as pd
mydataset={
    'cars':["BMW", "Volvo", "Ford"],
    'passings':[3, 7, 2]
}
myvar=pd.DataFrame(mydataset)
print(myvar)
What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Example
Create a simple Pandas Series from a list:
import pandas as pd
a=[1, 7, 2]
myvar=pd.Series(a)
print(myvar)
Labels
If nothing else is specified, the values are labeled with their index number. First value has index 0,
second value has index 1 etc.
This label can be used to access a specified value.
Example
Return the first value of the Series:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar[0])
Output:
1
Create Labels
With the index argument, we can name our own labels.
Example
Create our own labels:
import pandas as pd
a=[1, 7, 2]
myvar=pd.Series(a,index=["x", "y", "z"])
print(myvar)
Output:
x    1
y    7
z    2
dtype: int64
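Once labels exist, a value can be read through its label. A Series can also be built from a dictionary, in which case the keys become the labels. A minimal sketch follows; the dictionary of calories per day is an illustrative assumption.

```python
import pandas as pd

# Series with our own labels
myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])
print(myvar["y"])           # access a value by its label

# Series created from a dictionary: the keys become the labels
calories = {"day1": 420, "day2": 380, "day3": 390}
from_dict = pd.Series(calories)
print(from_dict)
```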
What is a DataFrame?
A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with
rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Result
   calories  duration
0       420        50
1       380        40
2       390        45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s).
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
Result
calories 420
duration 50
Name: 0, dtype: int64
Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Result
      calories  duration
day1       420        50
day2       380        40
day3       390        45
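With named indexes, loc takes the row names instead of numbers. A single name returns that row as a Series, while a list of names returns a DataFrame. A minimal sketch on the same data (the result names are chosen for illustration):

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# One label: the row comes back as a Series
one_row = df.loc["day2"]
print(one_row)

# A list of labels: the rows come back as a DataFrame
two_rows = df.loc[["day1", "day3"]]
print(two_rows)
```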
Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contain plain text and are a well-known format that can be read by most tools, including Pandas.
In our examples we will be using a CSV file called 'data.csv'.
Example
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
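When the file data.csv is not at hand, the same call can be tried on CSV text held in memory. The sketch below is only an illustration: the three rows are copied from the data set used later in this experiment, and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

# A few comma separated lines standing in for the contents of a CSV file
csv_text = (
    "Duration,Pulse,Maxpulse,Calories\n"
    "60,110,130,409.1\n"
    "60,117,145,479.0\n"
    "45,109,175,282.4\n"
)

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)      # (number of rows, number of columns)
print(df.head(2))    # first two rows
```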
Read JSON
Big data sets are often stored, or extracted as JSON.
JSON is plain text, but has the format of an object, and is well known in the world of programming,
including Pandas.
In our examples we will be using a JSON file called 'data.json'.
Example
Load the JSON file into a DataFrame:
import pandas as pd
df = pd.read_json('data.json')
Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method.
The head() method returns the headers and a specified number of rows, starting from the top.
Example
Get a quick overview by printing the first 10 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Pandas - Cleaning Data
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
Data Set:
Duration Date Pulse Maxpulse Calories
0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 20201226 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
The data set contains some empty cells ("Date" in row 22, and "Calories" in row 18 and 28).
The data set contains wrong format ("Date" in row 26).
The data set contains wrong data ("Duration" in row 7).
The data set contains duplicates (row 11 and 12).
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
Example
Return a new Data Frame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
Example
Replace NULL values with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
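Filling every empty cell with one fixed number is rarely what we want. A common alternative is to fill only one column, using a value calculated from that column, such as its mean. The sketch below assumes a small hand-made DataFrame in place of data.csv.

```python
import numpy as np
import pandas as pd

# Small stand-in for the workout data; one Calories value is missing
df = pd.DataFrame({
    "Duration": [60, 60, 45, 60],
    "Calories": [409.1, 479.0, np.nan, 282.4],
})

# mean() ignores empty cells when it calculates the average
mean_calories = df["Calories"].mean()

# Replace empty cells in the Calories column only
df["Calories"] = df["Calories"].fillna(mean_calories)

print(df)
```

The median() or mode() of the column can be used in the same way.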
Data of Wrong Format
Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
To fix it, you have two options: remove the rows, or convert all cells in the columns into the same
format.
Convert Into a Correct Format
In our Data Frame, two cells in the 'Date' column have the wrong format. Check out rows 22 and 26; the 'Date'
column should be a string that represents a date:
Duration Date Pulse Maxpulse Calories
22 45 NaN 100 119 282.0
26 60 20201226 100 120 250.0
Let's try to convert all cells in the 'Date' column into dates.
Pandas has a to_datetime() method for this:
Example
Convert to date:
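import pandas as pd
df = pd.read_csv('data.csv')
#with pandas 2.0 or later, a column that mixes date formats may need format='mixed'
df['Date'] = pd.to_datetime(df['Date'])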
Result:

0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaT 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 '2020/12/26' 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0
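After the conversion, an empty date shows up as NaT (Not a Time). Such rows can be removed with dropna() restricted to the 'Date' column. The sketch below assumes a small hand-made DataFrame and an explicit date format, so that it behaves the same across pandas versions.

```python
import pandas as pd

# Small stand-in for the workout data; the second date is missing
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Date": ["2020/12/01", None, "2020/12/03"],
})

# Convert the strings to dates; the empty cell becomes NaT
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")
print(df)

# Remove only the rows whose Date is empty
cleaned = df.dropna(subset=["Date"])
print(cleaned)
```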
Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if
someone registered "199" instead of "1.99".
Sometimes we can spot wrong data by looking at the data set, because we have an expectation of
what it should be.
If you take a look at our data set, you can see that in row 7, the duration is 450, but for all the other
rows the duration is between 30 and 60.
It doesn't have to be wrong, but taking into consideration that this is the data set of someone's workout
sessions, we conclude that this person did not work out for 450 minutes.

Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", and we could
just insert "45" in row 7:
Example
Set "Duration" = 45 in row 7:
df.loc[7, 'Duration'] = 45
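Replacing values one row at a time does not scale to big data sets. A rule can be applied to the whole column instead, for example by limiting every duration to a chosen maximum. In the sketch below both the limit of 120 and the small DataFrame are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for the Duration column of the workout data
df = pd.DataFrame({"Duration": [60, 45, 450, 30]})

# Select every row where Duration is above 120 and set it to 120
df.loc[df["Duration"] > 120, "Duration"] = 120

print(df)
```

Rows that break the rule could instead be removed with df.drop(), if deleting them is acceptable for the analysis.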
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.

By taking a look at our test data set, we can assume that row 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row:
Example
Returns True for every row that is a duplicate, otherwise False:
print(df.duplicated())
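Once duplicates have been found, they can be removed with the drop_duplicates() method, which keeps the first occurrence of each row. A minimal sketch with a small hand-made DataFrame in which the second row repeats the first:

```python
import pandas as pd

# Small stand-in for the workout data; row 1 repeats row 0
df = pd.DataFrame({
    "Duration": [60, 60, 60, 45],
    "Pulse": [100, 100, 106, 90],
})

# True marks a row that repeats an earlier row
flags = df.duplicated()
print(flags)

# Keep only the first occurrence of each row
deduped = df.drop_duplicates()
print(deduped)
```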
REORD WORK
preparation
b. maintainable
c. extendable
d. testable and
e. robust
RECORD WORK
EP1. Write Python code to sort values in ascending and descending order.
EP2. Create a dataset with null values and write code to remove null values from the dataset.
1. What is a data frame?

2. How to get the information of data set?
3. List the methods available to clean the data.
4. What is JSON?
E7. 10. Introduction to Data Visualization.
11. Basics of Visualization: Plots, Subplots and their Functionalities.

1. to learn about introduction to data visualization.
2. to learn Basics of visualization like plots, subplots and their functionalities.
1. apply data visualization on datasets
2. apply Basics of visualization tools to analyze the data.
CONCEPT AT A GLANCE
Introduction to Data visualization
Data visualization is the discipline of trying to understand data by placing it in a visual context so
that patterns, trends and correlations that might not otherwise be detected can be exposed.
Python offers multiple great graphing libraries that come packed with lots of different features. No
matter if you want to create interactive, live or highly customized plots python has an excellent library
for you.
To get a little overview here are a few popular plotting libraries:
• Matplotlib: low level, provides lots of freedom
• Pandas Visualization: easy to use interface, built on Matplotlib
• Seaborn: high-level interface, great default styles
• ggplot: based on R’s ggplot2, uses Grammar of Graphics
• Plotly: can create interactive plots
What is Matplotlib?
Matplotlib is a low-level graph plotting library in python that serves as a visualization utility.
Matplotlib was created by John D. Hunter. Matplotlib is open source and we can use it freely.
Matplotlib is mostly written in python, a few segments are written in C, Objective-C and JavaScript
for Platform compatibility.
Installation of Matplotlib
Install it using this command:
C:\Users\Your Name>pip install matplotlib
If this command fails, then use a python distribution that already has Matplotlib installed, like
Anaconda, Spyder etc.
Import Matplotlib
Once Matplotlib is installed, import it in your applications by adding the import module statement:
import matplotlib
Pyplot
Most of the Matplotlib utilities lies under the pyplot submodule, and are usually imported under the
plt alias:
import matplotlib.pyplot as plt
Now the Pyplot package can be referred to as plt.
Example
Draw a line in a diagram from position (0,0) to position (6,250):
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
Basics of visualization: Plots, Subplots and their functionalities
Matplotlib Plotting
Plotting x and y points
The plot() function is used to draw points (markers) in a diagram. By default, the plot() function
draws a line from point to point. The function takes parameters for specifying points in the
diagram. Parameter 1 is an array containing the points on the x-axis. Parameter 2 is an array
containing the points on the y-axis. If we need to plot a line from (1, 3) to (8, 10), we have to pass
two arrays [1, 8] and [3, 10] to the plot function.
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
Matplotlib Subplots
Display Multiple Plots
With the subplot() function we can draw multiple plots in one figure.
The subplot() function takes three arguments that describe the layout of the figure. The layout is
organized in rows and columns, which are represented by the first and second argument. The third
argument represents the index of the current plot.
plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.
plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the second plot.
import matplotlib.pyplot as plt
import numpy as np
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.show()
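The same two plots can be stacked on top of each other by asking for 2 rows and 1 column, and each plot can be given its own title. The sketch below is an illustration only: the titles are made up, and the 'Agg' backend is an assumption that lets the code run without a display (leave that line out and call plt.show() at the end when working interactively).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

fig = plt.figure(figsize=(6, 6))

# 2 rows, 1 column, first plot (top)
plt.subplot(2, 1, 1)
plt.plot(x, np.array([3, 8, 1, 10]))
plt.title("SALES")        # hypothetical title

# 2 rows, 1 column, second plot (bottom)
plt.subplot(2, 1, 2)
plt.plot(x, np.array([10, 20, 30, 40]))
plt.title("INCOME")       # hypothetical title

# One title for the whole figure
plt.suptitle("MY SHOP")
plt.tight_layout()
```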
### Line Plot Graph
import matplotlib.pyplot as plt
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.figure(figsize=(8,5)) ## width,height
plt.title("Line Plot Graph",fontsize=15,color='red')
plt.xlabel("X Axis --->",fontsize=12,color='red')
plt.ylabel("Y Axis --->",fontsize=12,color='red')
## linestyle - solid,dotted,dashed, lw= line width
plt.plot(x,y,color='green',lw=3,linestyle="dotted",label="Line Plot")
plt.legend(loc="best")
plt.show()
### Scatter Plot Graph
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.title("Scatter Plot Graph",fontsize=15,color='red')
### marker options: o,*,d,v,^,<,> (see https://matplotlib.org/stable/api/markers_api.html)
plt.scatter(x,y,color='green',label="Scatter Plot",s=150,marker="*")
plt.show()
### Bar Plot Graph
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.title("Bar Plot Graph",fontsize=15,color='red')
plt.bar(x,y,color=['green','orange'],label="Bar Plot",width=0.6)
plt.show()
REORD WORK
preparation
b. maintainable
c. extendable
d. testable and
e. robust
RECORD WORK
EP1. Write Python code to visualize sine plot.

EP2. Write Python code to visualize cosine plot.
EP3. Write Python code to visualize relplot.
1. What is a scatter graph?

2. What is subplot?
3. What is a legend?
E8. 12. Plotting Data distributions, Categorical and Time-Series data

1. to learn how to plot data distributions for categorical data.
2. to learn how to plot data distributions for time-series data.
1. apply plotting on categorical data.
2. apply plotting on Time-series data.
CONCEPT AT A GLANCE
Data Distribution
In the real world, the data sets are much bigger, but it can be difficult to gather real world data, at
least at an early stage of a project.
How Can we Get Big Data Sets?
To create big data sets for testing, we use the Python module NumPy, which comes with a number
of methods to create random data sets, of any size.
Histogram
To visualize the data set we can draw a histogram with the data we collected. We will use the Python
module Matplotlib to draw a histogram.
Normal Data Distribution
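In probability theory, a data distribution in which the values are concentrated around a central
value and become less frequent further away from it (a bell-shaped histogram) is known as the
normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich
Gauss who came up with the formula of this data distribution.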
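The three ideas above (a random data set from NumPy, a histogram from Matplotlib, and the normal distribution) can be tried together. The sketch below is an illustration under stated assumptions: the mean of 5.0, the standard deviation of 1.0, the size of 100000 values, the seed and the 'Agg' backend are all choices made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Random data set: 100000 values concentrated around 5.0
rng = np.random.default_rng(seed=42)
data = rng.normal(loc=5.0, scale=1.0, size=100_000)

# Histogram with 100 bars; hist() returns the bar heights and the bin edges
counts, bin_edges, _ = plt.hist(data, bins=100)
plt.title("Normal Data Distribution")
plt.xlabel("Value")
plt.ylabel("Number of values")

print(data.mean(), data.std())
```

The bars are highest near 5.0 and fall off on both sides, which gives the bell shape.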
What is a Time Series?
Time series is a sequence of observations recorded at regular time intervals.
Depending on the frequency of observations, a time series may typically be hourly, daily, weekly,
monthly, quarterly and annual. Sometimes, you might have seconds and minute-wise time series as
well, like, number of clicks and user visits every minute etc.
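A daily time series can be represented in pandas as a Series whose index holds the dates. The sketch below is an illustration only: the visit counts are made-up numbers, and the 'Agg' backend is an assumption that lets the plot be built without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Ten consecutive days, one observation per day
dates = pd.date_range(start="2023-01-01", periods=10, freq="D")
visits = [120, 135, 128, 150, 162, 158, 171, 180, 176, 190]  # made-up data

ts = pd.Series(visits, index=dates)

# Line plot with the dates on the x-axis
ax = ts.plot(marker="o", title="Daily user visits")
ax.set_xlabel("Date")
ax.set_ylabel("Visits")

print(ts.head())
```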
What is a Categorical data?
Categorical features can only take on a limited, and usually fixed, number of possible values. For
example, if a dataset is about information related to users, then you will typically find features like
country, gender, age group, etc. Alternatively, if the data you're working with is related to products,
you will find features like product type, manufacturer, seller and so on.
These are all categorical features in your dataset. These features are typically stored as text values
which represent various traits of the observations. For example, gender is described as Male (M) or
Female (F), product type could be described as electronics, apparels, food etc.
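A usual first step with a categorical feature is to count how often each category occurs and draw the counts as bars. The sketch below is an illustration only: the list of product types is made up, and the 'Agg' backend is an assumption that lets the plot be built without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up categorical feature: the product type of six sold items
products = pd.Series(
    ["electronics", "apparels", "food", "food", "electronics", "food"]
)

# Number of observations in each category, most frequent first
counts = products.value_counts()
print(counts)

# One bar per category
plt.bar(list(counts.index), counts.values, color="green")
plt.title("Items sold per product type")
plt.ylabel("Count")
```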
### Bar Plot Graph
import matplotlib.pyplot as plt
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.bar(x,y,color=['green','orange'],label="Bar Plot",width=0.6)
plt.plot(x,y,color='red',lw=1,linestyle="solid",label="Line Plot")
plt.show()
### Horizontal Bar Plot Graph
x = [1,2,3,4,6,8,9,10,12,15]
y = [10,20,30,40,60,80,90,100,102,150]
plt.ylabel("X Axis --->",fontsize=12,color='red')
plt.xlabel("Y Axis --->",fontsize=12,color='red')
plt.barh(x,y,color=['green','orange'],label="Bar Plot",height=0.6)
plt.show()
### Pie Chart
slices = [30,100,50,22,44,66,22,55]
names = ["A","B","C","D","E","F","G","H"]
cols = ["red","blue","orange","green","pink","violet","magenta","yellow"]
plt.figure(figsize=(6,6))
plt.pie(slices,labels=names,colors=cols,autopct="%0.2f%%",explode=(0.2,0,0,0.5,0,0,0,0))
plt.legend(loc=4)
plt.show()
REORD WORK
preparation
b. maintainable
c. extendable
d. testable and
e. robust
RECORD WORK
EP1. Draw all types of charts for sales data analysis.
1. What is a normal distribution curve?

2. Give example for time series data.
3. What tools are used to analyze the data?
4. What is categorical data?
E9. 13. Generate association rules from frequent item sets.

1. to learn the use association rules on the dataset.
2. to learn how to find frequent item sets on the dataset.
1. apply association rules on the dataset.
2. apply association rules to find frequent item sets on the dataset.
CONCEPT AT A GLANCE
Association mining searches for frequent items in a data set. Frequent mining usually finds the
interesting associations and correlations between item sets in transactional and relational databases.
In short, frequent mining shows which items appear together in a transaction or relation.
Need of Association Mining:
Frequent mining is the generation of association rules from a transactional data set. If two items
X and Y are frequently purchased together, then it is good to place them together in stores or to offer
a discount on one item on purchase of the other. This can increase sales. For example, it is
likely that a customer who buys milk and bread also buys butter.
So the association rule is ['milk']^['bread']=>['butter'], and the seller can suggest that the customer buy
butter if he/she buys milk and bread.
Important Definitions:
Support: It is one of the measures of interestingness and reflects the usefulness of a rule.
A support of 5% means that 5% of all transactions in the database contain the items of the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
Confidence: It reflects the certainty of a rule. A confidence of 60% means that 60% of the customers
who purchased milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
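As a small worked example of the two definitions, the sketch below computes the support and confidence of the rule {milk, bread} -> {butter}. The five transactions and the helper names support_count, support and confidence are made up for illustration; they are not taken from the lab dataset.

```python
# Hypothetical toy transactions (not the lab's store_data.csv)
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support_count(A U B) / Support_count(A) for the rule A -> B."""
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))

A = {"milk", "bread"}
B = {"butter"}
rule_support = support(A | B, transactions)        # 2 of the 5 transactions
rule_confidence = confidence(A, B, transactions)   # 2 of the 3 with milk and bread
print(rule_support, rule_confidence)
```

With a minimum support of 0.4 and a minimum confidence of 0.6, this rule would count as a strong rule.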
Program:
!pip install apyori
!pip install fsspec
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori
# Each row of the CSV file is one transaction; the file has no header row
store_data = pd.read_csv('/content/drive/MyDrive/store_data.csv', header=None)
num_records = len(store_data)
print(num_records)
# Build the list of transactions, skipping empty (NaN) cells in the first four columns
records = []
for i in range(0, num_records):
    records.append([str(store_data.values[i, j]) for j in range(0, 4)
                    if str(store_data.values[i, j]) != 'nan'])
print(records)
association_rules = apriori(records, min_support=0.4, min_confidence=0.2)
association_results = list(association_rules)
print(len(association_results))
sup_list = []
conf_list = []
for item in association_results:
    # item[1] is the support of the item set
    # item[2][0] is its first rule: [1] = items added, [2] = confidence, [3] = lift
    if len(list(item[2][0][1])) >= 2:
        # print("Rule: " + list(item[2][0][1])[0] + "->" + list(item[2][0][1])[1])
        print("Rule: " + str(list(item[2][0][1])))
        print("Support: " + str(item[1]))
        sup_list.append(item[1])
    conf_list.append(item[2][0][2])
    print("Confidence: " + str(item[2][0][2]))
    # print("Lift: " + str(item[2][0][3]))
    print("=====================================")
Output:
Rule: ['i2', 'i1']
Support: 0.6
Confidence: 0.6
=====================================
Confidence: 0.6
=====================================
Confidence: 0.4
=====================================
Confidence: 0.6
=====================================
Confidence: 0.4
=====================================
Rule: ['i2', 'i3', 'i1']
Support: 0.4
Confidence: 0.4
=====================================
Rule: ['i3', 'i4', 'i1']
Support: 0.4
Confidence: 0.4

RECORD WORK
EP1. Generate association rules from medical dataset.

EP2. Generate association rules from grocery dataset.
1. Define frequent itemset.

2. Define support.
3. Define confidence.
4. Define Association mining.
E10. 14. Regression and Classification: Linear regression and logistic regression.

1. to learn the Regression and classification model.
2. to learn how to use linear regression and logistic regression.
1. apply Regression and classification model.
2. apply linear regression and logistic regression model.
CONCEPT AT A GLANCE
Regression
Regression analysis is a statistical process for estimating the relationship between a dependent
(criterion) variable and one or more independent variables (predictors). It explains the changes in
the criterion in relation to changes in selected predictors. The conditional expectation of the
criterion is estimated from the predictors, that is, the average value of the dependent variable
when the independent variables are varied. Three major uses of regression analysis are
determining the strength of predictors, forecasting an effect, and trend forecasting.
Types of Linear Regression
Linear regression is of the following two types −
• Simple Linear Regression

• Multiple Linear Regression
Simple Linear Regression (SLR)
It is the most basic version of linear regression which predicts a response using a single feature. The
assumption in SLR is that the two variables are linearly related.
Multiple Linear Regression (MLR)
It is the extension of simple linear regression that predicts a response using two or more features.
Mathematically, it can be explained as follows.
Consider a dataset having n observations and p features (independent variables), with y as the single
response (dependent variable). The regression line for p features is calculated as follows:
$h(x_i) = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip}$
Here, $h(x_i)$ is the predicted response value for the i-th observation and $b_0, b_1, b_2, \dots, b_p$
are the regression coefficients.
A multiple linear regression model also includes the error in the data, known as the residual error
$e_i$, which changes the calculation as follows:
$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + e_i$
We can also write the above equation as follows:
$y_i = h(x_i) + e_i$ or $e_i = y_i - h(x_i)$
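The relation between the coefficients, the predicted response and the residual error can be sketched with NumPy. The data values below are invented for illustration, and ordinary least squares is assumed as the way of estimating the coefficients (the text above does not name a fitting method).

```python
import numpy as np

# Hypothetical data: n = 5 observations, p = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([8.1, 7.9, 16.2, 15.8, 21.0])

# Prepend a column of ones so that b[0] plays the role of the intercept b0
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate of the coefficients (b0, b1, b2)
b, *_ = np.linalg.lstsq(X1, y, rcond=None)

h = X1 @ b   # predicted responses h(x_i)
e = y - h    # residual errors e_i = y_i - h(x_i)
print("coefficients:", b)
print("residuals:", e)
```

Adding the residuals back to the predictions recovers the observed responses, which is exactly the statement y_i = h(x_i) + e_i.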
Introduction to Logistic Regression
Logistic regression is a supervised learning classification algorithm used to predict the probability of
a target variable. The nature of target or dependent variable is dichotomous, which means there
would be only two possible classes.
In simple words, the dependent variable is binary in nature having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).
Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest
ML algorithms that can be used for various classification problems such as spam detection, Diabetes
prediction, cancer detection etc.
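A minimal sketch of how P(Y=1) is obtained from X: logistic regression passes a linear combination of the features through the logistic (sigmoid) function. The coefficient values b0 and b1 and the helper names below are illustrative assumptions, not values fitted to any dataset.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, b0, b1):
    """P(Y=1 | x) for a single feature x."""
    return sigmoid(b0 + b1 * x)

b0, b1 = -4.0, 1.0   # illustrative coefficients only
for x in (2.0, 4.0, 6.0):
    p = predict_proba(x, b0, b1)
    label = 1 if p >= 0.5 else 0   # usual 0.5 decision threshold
    print(x, round(p, 3), label)
```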
Classification
There are two forms of data analysis that can be used for extracting models describing important
classes or to predict future data trends. These two forms are as follows
• Classification
• Prediction
Classification models predict categorical class labels; and prediction models predict continuous
valued functions.
What is classification?
The following are examples of cases where the data analysis task is classification:
• A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
• A marketing manager at a company needs to predict whether a customer with a given profile
will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels.
These labels are risky or safe for loan application data and yes or no for marketing data.
Linear Regression:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt ## needed for the regression plots
data = pd.read_csv("/content/Salary_Data.csv")
data.head()
Output:
YearsExperience Salary
0 1.1 39343.0
1 1.3 46205.0
2 1.5 37731.0
3 2.0 43525.0
4 2.2 39891.0
x = np.array(data[['YearsExperience']]) ## feature
y = np.array(data['Salary']) ## target
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(x,y,train_size=0.8,random_state=9014)
### Build the model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
### Train the model
model.fit(xtrain,ytrain)
Output:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
### Prediction
ypred = model.predict(xtest)
ypred
Output:
array([ 56182.55053157, 100294.69682063, 44919.8748833 , 62752.44465973,
122820.04811718, 115311.59768499])
ytest
Output:
array([ 54445., 101302., 43525., 63218., 122391., 116969.])
xtest
Output:
array([[ 3.2],
[ 7.9],
[ 2. ],
[ 3.9],
[10.3],
[ 9.5]])
### Calculate R2 Score
from sklearn.metrics import r2_score
score = r2_score(ytest,ypred)
score
Output:
0.99842716176972
m = model.coef_
c = model.intercept_
print(m,c)
Output:
[9385.56304023] 26148.74880284306
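The fitted line is y = m*x + c, so a prediction can be reproduced by hand from the coefficient and intercept printed above. The sketch below copies those two numbers from the output; the function name predict_salary is only illustrative.

```python
# Slope and intercept copied from the printed output
m = 9385.56304023
c = 26148.74880284306

def predict_salary(years_experience):
    """Evaluate the fitted regression line y = m*x + c."""
    return m * years_experience + c

# 3.2 and 7.9 are the first two test samples in xtest
for years in (3.2, 7.9):
    print(years, round(predict_salary(years), 2))
```

The two results should agree with the first two entries of ypred up to rounding.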
### Draw line of regression for training samples
plt.scatter(xtrain,ytrain,color="red",label="Actual Samples")
plt.scatter(xtrain,model.predict(xtrain),color="blue",label="Predicted Samples")
plt.plot(xtrain,model.predict(xtrain),color="yellow",label="Line of Regression")
plt.legend()
plt.show()
### Draw line of regression for testing samples
plt.scatter(xtest,ytest,color="red",label="Actual Samples")
plt.scatter(xtest,model.predict(xtest),color="blue",label="Predicted Samples")
plt.plot(xtest,model.predict(xtest),color="yellow",label="Line of Regression")
plt.legend()
plt.show()
Logistic Regression:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
data = sns.load_dataset('titanic')
### check for null
data.isnull().sum()
Output:
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64
mean_age = round(data['age'].mean(),2)
data['age'] = data['age'].fillna(mean_age)
data.isnull().sum()
Output:
survived 0
pclass 0
sex 0
age 0
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64
data = data.drop(["deck","embark_town"],axis=1)
data = data.dropna()
data.isnull().sum()
y = np.array(data['survived']) ## target
x = data[['pclass','sex','age','sibsp','parch','embarked']].copy() ## copy, so the columns can be modified safely
### Label Encoding
x['sex'] = x['sex'].map({"male":0,"female":1})
x['embarked'] = x['embarked'].map({"S":0,"C":1,"Q":2})
### Split the data into training and testing
xtrain,xtest,ytrain,ytest = train_test_split(x,y,train_size=0.80,random_state=3)
## Build the model
model = LogisticRegression()
## Train the model
model.fit(xtrain,ytrain)
### Prediction
ypred = model.predict(xtest)
ypred
Output:
array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0])
ytest
Output:
array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
0, 0])
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest,ypred)
cm
Output:
array([[96, 14],
[27, 41]])
cm.diagonal().sum()/cm.sum()
Output:
0.7696629213483146
from sklearn.metrics import accuracy_score
a = accuracy_score(ytest,ypred)
a
Output:
0.7696629213483146
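The accuracy can be checked by hand from the confusion matrix, together with precision and recall for the "survived" class. The sketch assumes scikit-learn's layout, in which rows are actual classes and columns are predicted classes, ordered 0 then 1.

```python
# Confusion matrix copied from the printed output
cm = [[96, 14],
      [27, 41]]

tn, fp = cm[0]   # actual class 0: true negatives, false positives
fn, tp = cm[1]   # actual class 1: false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```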

RECORD WORK
EP1. Predict salary of an employee based on years of experience using linear regression.
EP2. Classify the loan applicants using logistic regression.
1. Differentiate classification and prediction.

2. List the metrics used in classification.
3. Define linear regression model.
4. Differentiate simple linear regression and multiple linear regression.
E11. 15. Implement Decision tree, random forest, k-Nearest Neighbor algorithms.

1. to learn the use of the decision tree and random forest methods.
2. to learn how to use K-Nearest Neighbor algorithms in Python.
1. apply decision tree and random forest methods.
2. apply K-Nearest Neighbor algorithms in python.
CONCEPT AT A GLANCE
Decision tree Algorithm
In general, decision tree analysis is a predictive modeling tool that can be applied across many areas.
Decision trees are constructed by an algorithmic approach that splits the dataset in different
ways based on different conditions. Decision trees are among the most widely used algorithms in
the category of supervised algorithms.
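To illustrate how a condition for splitting the dataset can be chosen, the sketch below scores candidate thresholds on one feature with the Gini impurity, one common splitting criterion. The temperature and play values and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical feature (temperature) and class label (play)
temperature = [18, 20, 22, 27, 30, 33]
play = ["yes", "yes", "yes", "no", "no", "no"]

# Try every observed value as a threshold and keep the purest split
best = min((split_gini(temperature, play, t), t) for t in temperature)
print(best)
```

A tree-building algorithm repeats this search on each resulting subset until the subsets are pure or a stopping rule is reached.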
Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and predicts the final output based on the majority vote of those predictions. A greater number
of trees in the forest generally gives more stable accuracy and reduces the risk of overfitting
compared with a single decision tree.
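Majority voting can be sketched in a few lines. The three "trees" below are stand-in functions with hard-coded rules, purely hypothetical, to show how the individual votes are combined; a real forest would train each tree on a random subset of the data.

```python
from collections import Counter

# Hypothetical stand-ins for trained decision trees
def tree_a(sample):
    return "yes" if sample["age"] > 30 else "no"

def tree_b(sample):
    return "yes" if sample["income"] > 50000 else "no"

def tree_c(sample):
    return "yes" if sample["age"] > 40 else "no"

def forest_predict(trees, sample):
    """Return the class predicted by the largest number of trees."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

sample = {"age": 35, "income": 60000}
prediction = forest_predict([tree_a, tree_b, tree_c], sample)
print(prediction)   # two of the three trees vote "yes"
```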
K-Nearest Neighbor Algorithm
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for
both classification as well as regression predictive problems. However, it is mainly used for
classification predictive problems in industry. The following two properties would define KNN well.
Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized
training phase and uses all the training data at classification time.
Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it
doesn’t assume anything about the underlying data.
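Both properties show up in a from-scratch sketch: there is no training step, and a prediction only compares the query with the stored samples. Euclidean distance, k = 3 and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training data with two classes
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))
print(knn_predict(train_points, train_labels, (6.4, 6.8)))
```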
DecisionTree:
import math
!pip install xlsxwriter
import xlsxwriter
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,roc_curve,auc
# Workbook that stores the evaluation metrics
book = xlsxwriter.Workbook("dt.xlsx")
sheet = book.add_worksheet()
r=0
sheet.write(r, 0, 'DecisionTree')
r=r+1
sheet.write(r, 0, 'Accuracy')
sheet.write(r, 1, 'Precision')
sheet.write(r, 2, 'Recall')
sheet.write(r, 3, 'F-measure')
sheet.write(r, 4,'Specificity')
sheet.write(r, 5,'GeometricMean')
sheet.write(r, 6,'AUC')
# df_final is assumed to be a preprocessed DataFrame with a binary 'Defective' column,
# loaded before this step (its preparation is not shown here)
x = df_final.drop(['Defective'],axis=1)
y = df_final.Defective
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
clf = DecisionTreeClassifier()
clf = clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
# Rows of c are actual classes, columns are predicted classes
c=confusion_matrix(y_test, predictions)
print("confusion_matrix:")
print(c)
mo=c[1][0] + c[1][1] + c[0][0] + c[0][1]   # all test samples
mo1=c[0][1] + c[1][1]   # predicted positives
mo2=c[1][0] + c[1][1]   # actual positives
if mo!=0:
    acc = round(((c[1][1] + c[0][0]) / mo),2)
else:
    acc=0
if mo1!=0:
    pre = round((c[1][1] / mo1),2)
else:
    pre =0
if mo2!=0:
    rec = round((c[1][1] / mo2),2)
else:
    rec=0
print("Accuracy=",acc)
print("Precision=",pre)
print("Recall=",rec)
mo3=pre+rec
if mo3!=0:
    fm = round(((2 * pre * rec) / mo3),2)
else:
    fm=0
print("F-measure=",fm)
mo4=c[0][0]+c[0][1]   # actual negatives
if mo4!=0:
    sp = round((c[0][0] / mo4),2)
else:
    sp=0
print("Specificity=",sp)
gm = round((math.sqrt(rec * sp)),2)
print("Geometric Mean=",gm)
tp = c[1][1]
fn = c[1][0]
fp = c[0][1]
tn = c[0][0]
col=0   # first spreadsheet column
dt_fpr,dt_tpr,threshold = roc_curve(y_test,predictions)
# Store the score under its own name so that the auc() function stays usable
auc_value = round((auc(dt_fpr,dt_tpr)),2)
print("AUC=",auc_value)
r=r+1
sheet.write(r, col, acc)
sheet.write(r, col + 1, pre)
sheet.write(r, col + 2, rec)
sheet.write(r, col + 3, fm)
sheet.write(r, col + 4, sp)
sheet.write(r, col + 5, gm)
sheet.write(r, col + 6, auc_value)
book.close()
Output:
confusion_matrix:
[[252 25]
[ 26 262]]
Accuracy= 0.91
Precision= 0.91
Recall= 0.91
F-measure= 0.91
Specificity= 0.91
Geometric Mean= 0.91
AUC= 0.91
RandomForest:
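import math
import xlsxwriter
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,roc_curve,auc
book = xlsxwriter.Workbook("rf.xlsx")
sheet = book.add_worksheet()
r=0
sheet.write(r, 0, 'RandomForest')
r=r+1
# x_train, x_test, y_train, y_test come from train_test_split, as in the DecisionTree program
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(x_train, y_train)
predictions = rfc.predict(x_test)
# Rows of c are actual classes, columns are predicted classes
c=confusion_matrix(y_test, predictions)
print("Confusion_matrix:")
print(c)
mo=c[1][0] + c[1][1] + c[0][0] + c[0][1]   # all test samples
mo1=c[0][1] + c[1][1]   # predicted positives
mo2=c[1][0] + c[1][1]   # actual positives
if mo!=0:
    acc = round(((c[1][1] + c[0][0]) / mo),2)
else:
    acc=0
if mo1!=0:
    pre = round((c[1][1] / mo1),2)
else:
    pre =0
if mo2!=0:
    rec = round((c[1][1] / mo2),2)
else:
    rec=0
print("Accuracy=",acc)
print("Precision=",pre)
print("Recall=",rec)
mo3=pre+rec
if mo3!=0:
    fm = round(((2 * pre * rec) / mo3),2)
else:
    fm=0
print("F-measure=",fm)
mo4=c[0][0]+c[0][1]   # actual negatives
if mo4!=0:
    sp = round((c[0][0] / mo4),2)
else:
    sp=0
print("Specificity=",sp)
tp = c[1][1]
fn = c[1][0]
fp = c[0][1]
tn = c[0][0]
rf_fpr,rf_tpr,thresholds = roc_curve(y_test,predictions)
# Store the score under its own name so that the auc() function stays usable
auc_value = round((auc(rf_fpr,rf_tpr)),2)
print("AUC=",auc_value)
r=r+1
# One row of metrics: accuracy, precision, recall, F-measure, specificity, AUC
for col, value in enumerate([acc, pre, rec, fm, sp, auc_value]):
    sheet.write(r, col, value)
book.close()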
Output:
Confusion_matrix:
[[271 14]
[ 35 245]]
Accuracy= 0.91
Precision= 0.95
Recall= 0.88
F-measure= 0.91
Specificity= 0.95
AUC= 0.91
K-Nearest Neighbor(KNN)
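import math
import xlsxwriter
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,roc_curve,auc
book = xlsxwriter.Workbook("knn.xlsx")
sheet = book.add_worksheet()
r=0
sheet.write(r, 0, 'K-Nearest Neighbor(KNN)')
r=r+1
# x_train, x_test, y_train, y_test come from train_test_split, as in the DecisionTree program
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
predictions = knn.predict(x_test)
# Rows of c are actual classes, columns are predicted classes
c=confusion_matrix(y_test, predictions)
print("confusion_matrix:")
print(c)
mo=c[1][0] + c[1][1] + c[0][0] + c[0][1]   # all test samples
mo1=c[0][1] + c[1][1]   # predicted positives
mo2=c[1][0] + c[1][1]   # actual positives
if mo!=0:
    acc = round(((c[1][1] + c[0][0]) / mo),2)
else:
    acc=0
if mo1!=0:
    pre = round((c[1][1] / mo1),2)
else:
    pre =0
if mo2!=0:
    rec = round((c[1][1] / mo2),2)
else:
    rec=0
print("Accuracy=",acc)
print("Precision=",pre)
print("Recall=",rec)
mo3=pre+rec
if mo3!=0:
    fm = round(((2 * pre * rec) / mo3),2)
else:
    fm=0
print("F-measure=",fm)
mo4=c[0][0]+c[0][1]   # actual negatives
if mo4!=0:
    sp = round((c[0][0] / mo4),2)
else:
    sp=0
print("Specificity=",sp)
tp = c[1][1]
fn = c[1][0]
fp = c[0][1]
tn = c[0][0]
knn_fpr,knn_tpr,threshold = roc_curve(y_test,predictions)
# Store the score under its own name so that the auc() function stays usable
auc_value = round((auc(knn_fpr,knn_tpr)),2)
print("AUC=",auc_value)
r=r+1
# One row of metrics: accuracy, precision, recall, F-measure, specificity, AUC
for col, value in enumerate([acc, pre, rec, fm, sp, auc_value]):
    sheet.write(r, col, value)
book.close()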
Output:
confusion_matrix:
[[277 14]
[ 36 238]]
Accuracy= 0.91
Precision= 0.94
Recall= 0.87
F-measure= 0.9
Specificity= 0.95
AUC= 0.91

RECORD WORK
EP1. Using weather dataset, forecast the “Play” using decision tree algorithm.
EP2. Using customer transaction history, predict the customer decision on product purchase.
1. How does KNN differ from the Naïve Bayes algorithm?

2. Differentiate decision tree and random forest algorithms.
3. Define Decision tree.
4. Give the advantages of random forest algorithm.
E12. 16. Implement K-means and hierarchical clustering algorithms.

1. to learn the use of the K-means algorithm.
2. to learn how to use hierarchical clustering algorithms.
1. apply k-means algorithm.
2. apply hierarchical clustering algorithms.
CONCEPT AT A GLANCE
K-Means Algorithm
The K-means clustering algorithm computes the centroids and iterates until the centroids no longer
change. It assumes that the number of clusters is already known. It is also called a flat clustering
algorithm. The number of clusters to be identified in the data is represented by ‘K’ in K-means.
In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared
distances between the data points and their centroid is minimum. Less variation within a cluster
means that the data points within that cluster are more similar.
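The two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched with NumPy as follows. The six data points are hypothetical, and taking the first k points as initial centroids is a simplification chosen to keep the example deterministic; library implementations use more careful initialization.

```python
import numpy as np

def assign(points, centroids):
    """Return each point's nearest-centroid label and the total squared distance."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1).sum()

def kmeans(points, k, n_iter=10):
    """Plain K-means: alternate the assignment step and the centroid update step."""
    centroids = points[:k].copy()   # simplification: first k points as initial centroids
    for _ in range(n_iter):
        labels = assign(points, centroids)[0]
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:    # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    labels, sse = assign(points, centroids)
    return labels, centroids, sse

# Hypothetical data: two well-separated groups of three points
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centroids, sse = kmeans(points, k=2)
print(labels)
print(centroids)
print(sse)   # sum of squared distances, the quantity K-means minimizes
```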
Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together
unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into
the following two categories −
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is
treated as a single cluster, and pairs of clusters are then successively merged or agglomerated
(bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster and the process of clustering involves dividing (top-down
approach) the one big cluster into various small clusters.
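The bottom-up approach can be sketched in plain Python. The example assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; the data points and the function name agglomerative are hypothetical.

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    # Start with every point in its own cluster (clusters hold point indices)
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)                      # closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
result = agglomerative(points, n_clusters=2)
print(result)
```

Recording the order and distance of each merge would give the information that a dendrogram displays.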
K-Means algorithm:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # needed for the elbow plot
from sklearn.cluster import KMeans
df = pd.read_csv('/content/iris.csv')
df.head(10)
x = df.iloc[:, [0,1,2,3]].values
kmeans5 = KMeans(n_clusters=5)
y_kmeans5 = kmeans5.fit_predict(x)
print(y_kmeans5)
kmeans5.cluster_centers_
Output:
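[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 3 0 0 0 3 0 3 3 0 3 0 3 0 0 3 0 3 0 3 0 0
 0 0 0 0 0 3 3 3 3 0 3 0 0 0 3 3 3 0 3 3 3 3 3 0 3 3 4 0 2 4 4 2 3 2 4 2 4
 4 4 0 4 4 4 2 2 0 4 0 2 0 4 2 0 0 4 2 2 2 4 0 0 2 4 4 0 4 4 4 0 4 4 4 0 4
 4 0]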
array([[6.20769231, 2.85384615, 4.74615385, 1.56410256],
[5.006 , 3.418 , 1.464 , 0.244 ],
[7.475 , 3.125 , 6.3 , 2.05 ],
[5.508 , 2.6 , 3.908 , 1.204 ],
[6.52916667, 3.05833333, 5.50833333, 2.1625 ]])
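Error = []
for i in range(1, 11):
    # Fit K-means once for each candidate number of clusters
    kmeans = KMeans(n_clusters = i).fit(x)
    # inertia_ is the sum of squared distances of the samples to their closest centroid
    Error.append(kmeans.inertia_)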
plt.plot(range(1, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()
kmeans3 = KMeans(n_clusters=3)
y_kmeans3 = kmeans3.fit_predict(x)
print(y_kmeans3)
kmeans3.cluster_centers_
Output:
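[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
 2 1]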
array([[5.006 , 3.418 , 1.464 , 0.244 ],
[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
[6.85 , 3.07368421, 5.74210526, 2.07105263]])

RECORD WORK
EP1. Cluster the employees based on their salaries using k-means algorithm.
1. Define cluster.
2. List various partitioning algorithms.
3. Give the difference between agglomerative and Divisive hierarchical clustering methods.
4. What is K-Means Algorithm?
5. Give the library file used for K-Means method in python.