
HOUSE PRICE PREDICTION

Submitted in partial fulfillment of the requirements for


the award of
Bachelor of Engineering degree in Computer Science and Engineering

by

K PAVAN (Reg. No. 37110555)

T RAGHUL(Reg. No. 37110613)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SCHOOL


OF COMPUTING

SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY

(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI – 600 119

MARCH - 2020
i
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY

(DEEMED TO BE UNIVERSITY)
Accredited with “A” grade by NAAC
Jeppiaar Nagar, Rajiv Gandhi Salai, Chennai – 600 119
www.sathyabama.ac.in

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

BONAFIDE CERTIFICATE

This is to certify that this project report is the bonafide work of K PAVAN (Reg. No.
37110555) and T RAGHUL (Reg. No.37110613) who carried out the project entitled
“HOUSE PRICE PREDICTION MODEL” under my supervision from August 2019 to
March 2020.

Internal Guide
Dr. Ashok Kumar, M.E., Ph.D.,

Head of the Department

Submitted for Viva voce Examination held on

Internal Examiner External Examiner

ii
DECLARATION

We, K PAVAN and T RAGHUL, hereby declare that the Project Report entitled
“HOUSE PRICE PREDICTION MODEL” is done by us under the guidance of
Dr. ASHOK KUMAR, M.E., Ph.D., Department of Computer Science and Engineering
at Sathyabama Institute of Science and Technology, and is submitted in partial
fulfillment of the requirements for the award of the Bachelor of Engineering degree in
Computer Science and Engineering.

DATE:

PLACE: CHENNAI SIGNATURE OF THE CANDIDATE

iii
ACKNOWLEDGEMENT

I am pleased to acknowledge my sincere thanks to the Board of Management of
SATHYABAMA for their kind encouragement in doing this project and in
completing it successfully. I am grateful to them.

I convey my thanks to Dr. T. Sasikala, M.E., Ph.D., Dean, School of Computing, Dr. S.
Vigneswari, M.E., Ph.D., and Dr. L. Lakshmanan, M.E., Ph.D., Heads of the Department
of Computer Science and Engineering for providing me necessary support and details at
the right time during the progressive reviews.

I would like to express my sincere and deep sense of gratitude to my Project Guide
Dr. ASHOK KUMAR, M.E., Ph.D., Professor, for his valuable guidance, suggestions and
constant encouragement, which paved the way for the successful completion of my project work.

I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many ways for
the completion of the project.

iv
Abstract:

Usually, the house price index represents the summarized price changes of residential
housing. To make it easier for a family to search for a house, we have made it more
precise by asking for the required square feet and the number of bedrooms and bathrooms
required. With a preloaded dataset and data features, a practical data pre-processing and
creative feature engineering method is examined in this paper. The paper also proposes a
regression technique in machine learning to predict house price.

Keywords: House Price, Regression Technique, Machine Learning

v
TABLE OF CONTENTS

ABSTRACT v
LIST OF FIGURES viii

CHAPTER No. TITLE PAGE No.

1. INTRODUCTION 1
1.1 MACHINE LEARNING 1
1.2 ADVANTAGES AND APPLICATIONS 2

2. LITERATURE SURVEY 7

3. AIM AND SCOPE OF PROJECT 13


3.1 EXISTING SYSTEM 13
3.2 PROPOSED SYSTEM 13
3.3 FEASIBILITY STUDY 14

4. EXPERIMENTAL METHODS AND


ALGORITHMS

4.1 HARDWARE REQUIREMENTS 16


4.2 SOFTWARE REQUIREMENTS 16
4.3 PYTHON 17
4.4 ANACONDA 19
4.5 SYSTEM DESIGN 25
4.6 USE CASE DIAGRAM 27
4.7 SEQUENCE DIAGRAM 28
4.8 ACTIVITY DIAGRAM 29

vi
5. RESULTS AND DISCUSSION 30
5.1 MODULE IMPLEMENTATION 30
5.2 SOFTWARE TESTING 35
5.3 RESULTS

6. CONCLUSION AND FUTURE WORK 39

REFERENCES 40

APPENDIX
A. PAPER ACCEPTANCE MAIL 42
B. PLAGIARISM REPORT 43
C. JOURNAL PAPER 44
D. SOURCE CODE 49

LIST OF FIGURES

FIGURE No. FIGURE NAME PAGE No.

4.1 ANACONDA 19
4.2 ANACONDA NAVIGATOR 19
4.3 SYSTEM DESIGN 25
4.4 USE CASE DIAGRAM 27
4.5 SEQUENCE DIAGRAM 28
4.6 ACTIVITY DIAGRAM 29

vii
CHAPTER 1

INTRODUCTION:

Data is at the heart of technical innovation, and achieving almost any result is now
possible using predictive models. Machine learning is extensively used in this
approach. Machine learning means providing a valid dataset on which predictions
are based; the machine itself learns how much importance a particular event may
have on the entire system, supported by its pre-loaded data, and accordingly
predicts the result. Various modern applications of this technique include predicting
stock prices, predicting the possibility of an earthquake, predicting company sales,
and the list has endless possibilities.

Our aim is to predict a house price based on the user's needs and priorities. By
analyzing previous market trends, price ranges and upcoming developments, future
prices will be predicted. The functioning involves a website which accepts the
customer's specifications and then combines the application of a neural network.

Machine Learning
It is a subset of artificial intelligence (AI). It provides systems the ability to
automatically learn and improve by themselves. It focuses on the development of
computer programs that can access data and learn by themselves. The process of
learning begins with observations based on the examples that we provide. The aim
is to make computers learn by themselves without the need of a human.
Machine Learning Methods
Machine learning can be classified into three types, namely supervised,
unsupervised and reinforcement learning. Supervised machine learning algorithms
can apply what has been learned in the past to new data to predict future events.
Such an algorithm analyzes a known training dataset and produces a function to
predict outputs.

1
The system will provide outputs for inputs after training. The system will compare
them with the correct, intended output, find errors and modify itself to make the
model more practical and useful.

In contrast, unsupervised machine learning algorithms are the ones
which do not require any supervision. They are used when the sample data
used to train is neither classified nor labelled. As the name suggests, the model itself
finds the hidden patterns and insights. The system may or may not produce the right
output, but it explores the data and can draw inferences from the dataset on its own.

Semi-supervised machine learning algorithms are a combination of both
supervised and unsupervised learning. In semi-supervised learning, an algorithm
learns from a dataset that includes both labeled and unlabeled data, usually mostly
unlabeled. Generally it is chosen when labelling the sample data requires skilled
resources in order to train from it; otherwise, it doesn't require additional resources.

Reinforcement machine learning algorithms use a learning method that
works based on feedback. Reinforcement learning differs from supervised learning
in not needing labelled input/output pairs to be presented. It is studied in various
disciplines such as statistics, information theory, etc.

Advantages of Machine Learning

It helps to manage a large amount of data. There is no need for human
interference, and it can also perform complex operations on its own. It is extremely
useful for those who are in the field of e-commerce or even healthcare, and it is
extremely useful in the manufacturing industry.

2
While even experts often cannot be sure where and by which correlation a
production error in a plant fleet arises, machine learning offers the possibility to
identify the error early; this saves downtime and money. Machine learning is now
used in the medical field. In the future, after collecting huge amounts of data, apps
will be able to warn a patient in case his doctor wants to prescribe a drug that he
cannot tolerate. The app can also suggest alternative options by taking into account
the genetics of the patient.

Applications of Machine Learning


1. Virtual Personal Assistants
There are many personal assistants available, like Apple's Siri, Google's Google
Assistant and Amazon's Alexa. Their main job is to find information when
customers ask for it over voice. To ask any question we need to activate them
and ask "What is the time in London?" or similar questions. To answer it, your
personal assistant looks for the information in a browser or collects it from phone
apps. You can even ask your assistants to perform certain tasks like "Set a reminder for
tomorrow" or "Remind me to wish my friend a happy birthday". Here the personal
assistants use machine learning to respond to users' tasks or questions. These assistants are
also integrated in various other devices such as televisions (smart TVs) and
speakers. These assistants make these devices smarter.

2.Traffic Predictions :

Whenever we visit a new place, or when we are not sure about the route, we
generally use maps. They show the distance, the amount of time it takes to cover the
distance, and also provide information regarding traffic congestion. By making
use of machine learning, the app predicts the traffic on a particular route by analyzing the
traffic on that route at the same time on previous days. Hence machine learning helps
us in predicting traffic.

3
3. VIDEO SURVEILLANCE:
A single person cannot monitor multiple cameras at a single time; that's where

machine learning is used. Nowadays video cameras are powered by AI, hence it

helps us by tracking unusual behaviour: for example, if a person is standing

motionless for a long time or if a person is stumbling, it alerts the attendant who

is looking after the camera. It has been used extensively in video surveillance and

has been extremely useful.

4. EMAIL SPAM AND FILTERING

Machine learning has been extensively used in checking spam and malware

emails. It detects new malware and protects users against it.

5. Online customer support:

Many websites provide customers with a chatbot to answer their queries and
doubts, but most of the time there is no executive behind the chatbot. These
chatbots are powered by AI, and machine learning makes them better; they
improve with time.

6.Search Engine Result Refining:

Whenever we search for anything on the web, the search engine (for example,
Google) keeps track of what users open after the results are shown. It
checks whether the users are clicking the top search results or the bottom
ones. Machine learning thus helps make the search engine better with time.

4
7.Product recommendations:
Every time a product is recommended to you, be it after you purchase a
certain product from a website or a new product, machine learning is what
helps in recommending products to customers.

8.Online fraud detection:

It helps in detecting money fraud online. Many payment gateways have started to
implement this technique to prevent fraud; companies like PayPal use machine
learning to detect fraud.

5
INTRODUCTION TO PROJECT

Housing is one of the most valuable economic assets an individual can purchase
during his adult life. Hence we need to be extremely careful before buying a house;
we need to spend the right amount of money to buy a house.

In the following, we explore different machine learning techniques and
methodologies to predict house prices. The data contains a train and a test dataset.
Our objective is to predict house prices based on users' requirements and
needs. Our model predicts the price of a house from the sample data that has been
given.

6
CHAPTER 2

LITERATURE SURVEY

Literature Survey

1. Housing Price Prediction Using Machine Learning


Algorithms: The Case of Melbourne City, Australia

Author: The Danh Phan, 2018

House price prediction is a crucial topic of real estate. The literature attempts to derive
useful knowledge from historical data of property markets. Machine learning
techniques are applied to analyze historical property transactions in
Australia to obtain useful models for house buyers and sellers. The work reveals
the high discrepancy between house prices in the most expensive and most
affordable suburbs within Melbourne city. Moreover, experiments demonstrate
that the combination of Stepwise and Support Vector Machine, based on mean
squared error measurement, is a competitive approach.

2. Predicting Sales Prices of the Houses Using Regression


Methods of Machine Learning

Authors: Parasich Andrey Viktorovich ; Parasich Viktor


Aleksandrovich ; Kaftannikov Igor Leopoldovich ; Parasich Irina
Vasilevna, 2018

This article describes our solution for the “House Prices: Advanced Regression
Techniques” machine learning competition, which was held on the Kaggle
platform. The goal is to predict house sale price from attributes like house
area, year of building, etc. In our solution, we use classic machine learning
algorithms as well as our original methods, which are described here. At the
end of the competition, we took 18th place among 2124 participants from the
whole world.

7
3. Real Estate Value Prediction Using Linear Regression

Authors: Nehal N Ghosalkar ; Sudhir N Dhage, 2018

The real estate market is one of the most competitive in terms of
pricing and keeps fluctuating. It is one of the prime fields in which to
apply the concepts of machine learning to enhance and predict
prices with high accuracy. There are three factors that influence the price of a
house: physical conditions, concept and location. The current
framework involves estimating the value of homes without any expectation of
market prices and price increments. The objective of the paper is the prediction of
residential prices for purchasers considering their financial plans and
wishes. By analyzing past market patterns and price ranges, and upcoming
developments, future prices will be anticipated. This work
aims to predict house prices in Mumbai city with linear regression. It will help
clients to invest in a property without approaching a broker. The result
from this research proved that linear regression gives a minimum prediction error
of 0.3713.

4. Predicting Housing Market Trends Using Twitter Data

Authors: Marlon Velthorst ; Cicek Güven, 2019

In this study, we attempt to predict Dutch housing market trends using text
mining and machine learning as an application of data science methods
in finance. Our main goal is to predict the short-term upward or downward trend
of the average house price in the Dutch market by using text data collected from
Twitter. Twitter is widely used and has been proven to be a helpful
source of data. However, Twitter, text mining (tokenization,
bag-of-words, n-grams, weighted term frequencies) and machine learning
(classification algorithms) have not been combined yet in order to predict
housing market trends in the short term. In this study, tweets including predefined
search words are collected based on domain knowledge, and the
corresponding text is grouped by month as documents. Then words and
word sequences are transformed into numerical values. These values served as
attributes to predict whether the housing market moves up or down,
8
i.e. we approached this as a binomial classification problem relating text data of
a month with (up or down) trends for the subsequent month.
Our main results reveal that there is a correlation between the (weighted) frequency
of words and short-term housing trends; in other words, we were able to make
accurate predictions of trends in the short term using multiple machine learning and
text mining techniques combined.

5. House Price Prediction Using Machine Learning and Neural


Networks

Authors: Ayush Varma ; Abhijit Sarma ; Sagar Doshi ; Rohini


Nair, 2018

Real estate is the least transparent industry in our ecosystem. Housing
prices keep changing day in and day out and are sometimes hyped rather
than being based on valuation. Predicting housing prices from real factors is
the main crux of our research. Here we aim to base our
evaluations on every basic parameter that is considered while
determining the price. We use various regression techniques along
this pathway, and our results are not the sole determination of one technique;
rather, they are the weighted mean of various techniques, to give the most accurate
results. The results proved that this approach yields minimum error and
maximum accuracy compared to individual algorithms applied. We also propose to use
real-time neighborhood details from Google Maps to get exact real-world
valuations.

6. Forecasting house price index of China using dendritic


neuron model

Authors: Ying Yu ; Shuangbao Song ; Tianle Zhou ; Hanaki


Yachi ; Shangce Gao, 2016

Whether the Chinese housing market continues to prosper or not is related to the
development of China, and further it also has an impact on world finance.
Thus forecasting the house price index is extremely important and
challenging. In this paper we propose an unsupervised learnable dendritic neuron

9
model (DNM) by including the nonlinear interactions between excitation and
inhibition on dendrites.
We use the DNM to fit house price index (HPI) data and then forecast the trends of the
Chinese housing market. To verify the effectiveness of the DNM, we use a
standard statistical model (i.e., the exponential smoothing (ES) model) to
make a performance comparison. Three quantitative statistical metrics, including
normalized mean square error, absolute percentage of error, and coefficient of
correlation, are used to evaluate the forecasting performance of the two models.
Experimental results demonstrate that the proposed DNM is better than
ES in all three quantitative statistical metrics.

7. Prediction of real estate price variation based on economic


parameters

Authors: Li Li ; Kai-Hsuan Chu, 2017

It is well documented that many economic parameters may more or less
influence real estate price variation. Additionally, bankers and
investors are also interested in knowing future changes in real estate
prices. There had not been an appropriate model including these factors for price
prediction. Here, the influences of most macroeconomic parameters
on real estate price variation are investigated before establishing the price fluctuation
prediction model. Two schemes, a back propagation neural network (BPN) and a radial
basis function neural network (RBF), are employed to
determine the nonlinear model for real estate price variation prediction of
Taipei, Taiwan, based on leading and simultaneous economic indices. Those
prediction results are compared with the public Cathay house price
index and the Sinyi house price index. Two indices of the price variation, the mean
absolute error and root mean square error, are selected as
the performance indices. The public data of Taipei,
Taiwan real estate variation during 2005–2015 are adopted for analysis and
prediction comparison.

10
8. Predicting house sale price using fuzzy logic, Artificial
Neural Network and K-Nearest Neighbor

Authors: Muhammad Fahmi Mukhlishin ; Ragil Saputra ; Adi


Wibowo, 2017

The value of land and houses is regularly determined at the
earliest by the seller; however, determining the proper price in the sales
process will affect the buyer's desire to select and bid. As special characteristics in
Indonesia, the tax object value (NJOP) and site parameters highly influence
the price. In this paper we propose the prediction of land and house
value using several methods. Fuzzy logic, Artificial Neural Network and
K-Nearest Neighbor are compared in this paper to obtain the
most appropriate method, which can be used as a reference for
sellers when determining the price. Google Maps is employed to represent the
spatial data for the prediction parameters. The variables used in the methods
are the NJOP of land, the location, the age, the NJOP of the house, and
the valuable location of the land. The experimental methods are tested by
comparing the real transaction price and the prediction
using the MAPE formula.

9. Comprehensive Analysis of Housing Price Prediction in


Pune Using Multi-Featured Random Forest Approach

Authors: Rushab Sawant ; Yashwant Jangid ; Tushar


Tiwari ; Saurabh Jain ; Ankita Gupta, 2018

The housing sector in India has been predicted to grow at 30-35%

over the next decade. In terms of employment provided, it is second only to
the agricultural sector. Housing is one of the major domains of real estate. Pune is
emerging as one of the major metropolitan cities of India and has many
prestigious educational institutions and IT parks. This makes it a
perfect place to buy homes. Vagueness in the prices of homes makes
it challenging for the customer to pick their dream house.

11
The interests of both buyer and seller should be satisfied so that they do
not overestimate or underestimate the price. This housing price prediction
model acts as a helping hand for the buyer, the seller or a real estate agent to make a
better-informed decision. To achieve this, diverse features are selected as input
from the feature set and various algorithms are applied, like Random Forest and
Decision Tree.

10. Time-Aware Latent Hierarchical Model for Predicting House


Prices

Authors: Fei Tan ; Chaoran Cheng ; Zhi Wei, 2017

It is widely acknowledged that the price of a house is the combination of a
large number of characteristics. House price prediction thus presents a
unique set of challenges in practice. While a large body of work has been
dedicated to this task, its performance and applications have been limited by
the short time span of transaction data, the absence of real-world
settings and the insufficiency of housing features. To this end,
a time-aware latent hierarchical model is introduced to capture underlying
spatiotemporal interactions behind the evolution of house prices. The
hierarchical perspective obviates the necessity for historical transaction data of
exactly the same houses when temporal effects are considered. The proposed
framework is examined on a large-scale dataset of property transactions in
Beijing. The entire procedure strictly complies with the real-world scenario. The
empirical evaluation results demonstrate the outperformance of our approach
over alternative competitive methods.

12
CHAPTER 3

AIM AND SCOPE OF THE PRESENT SYSTEM

EXISTING SYSTEM

Multi Linear Regression

Multiple linear regression shows the relationship between two or more
explanatory variables and a scalar response variable. Each independent variable's
value is associated with the dependent variable's value.
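In its standard textbook form (stated here for clarity; the notation below is not taken
from this report's own implementation), the multiple linear regression model is

y = b0 + b1*x1 + b2*x2 + ... + bn*xn + e

where y is the scalar response, x1 ... xn are the explanatory variables, b0 ... bn are the
regression coefficients and e is the error term.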

Limitations

The dependent variable y must be continuous.. The independent variables can be


of any type. The dependent variable is usually affected by the independent
variables.

Proposed System

Linear Regression is a technique that helps to identify the relationship between a


scalar response (or dependent variable) and one or more explanatory variables (or
independent variables). The case of one explanatory variable is called simple linear
regression.
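As a minimal sketch of the proposed technique, assuming scikit-learn is available (the
toy numbers below are purely illustrative and are not this project's training data):

# Fit a linear regression on a few illustrative rows (sqft, bedrooms, bathrooms -> price)
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1056, 2, 1], [1440, 3, 2], [2600, 4, 5], [1200, 2, 1]])  # features
y = np.array([39.07, 62.0, 120.0, 51.0])                                # prices

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)    # one slope per feature, plus the intercept
print(model.predict([[1500, 3, 2]]))    # predicted price for an unseen house

The project's actual training code, on the full dataset, appears in the source code appendix.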

Advantages

 Space complexity is very low; it just needs to save the weights at the end of
training. Hence it is a low-latency algorithm.

 It is very simple to understand.

 Good interpretability.

 Feature importance is generated at the time of model building. With the help
of the hyperparameter lambda, you can handle feature selection, hence we can
achieve dimensionality reduction (see the sketch below).
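A small sketch of that last point, assuming scikit-learn's Lasso (whose alpha parameter
plays the role of lambda); the data here is synthetic and only illustrative:

# L1-regularized regression: a large enough alpha (lambda) drives some
# coefficients to exactly zero, which acts as feature selection.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                    # 5 candidate features
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # coefficients of the uninformative features shrink towards zero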

13
FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase, and a business proposal is put
forth with a very general plan for the project and some cost estimates. The
feasibility study of the proposed system is carried out to ensure
that the proposed system is not a burden to the company. Three key aspects are
considered:

1. Economical feasibility

2. Technical feasibility

3. Social feasibility

ECONOMICAL FEASIBILITY

This study is generally carried out to check whether the right amount of funds is
invested in the model. It is done to eliminate excess money being poured into a
single model and makes sure the model is well within the budget. It is extremely
important to spend only the right amount of funds on a model.

TECHNICAL FEASIBILITY

It makes sure that the technical requirements are limited to what we can
offer. Any system developed should not place a high demand on technical resources,
since that puts a burden on the client. It also checks the project's potential, i.e. what
it can do once developed.

14
SOCIAL FEASIBILITY

It is carried out to check how a system interacts with other systems and the level of
acceptance of the system by the user. It trains the user to use the system efficiently,
which is a necessity. Since the client is the final user of the system, he can criticize
the system, but this should be done in a disciplined and meaningful manner.

15
CHAPTER 4

EXPERIMENTAL METHODS AND ALGORITHMS

HARDWARE REQUIREMENTS

The most common set of requirements defined by any operating system or


software application is the physical computer resources, also known as hardware.
A hardware requirements list is often accompanied by a hardware compatibility list,
especially in case of operating systems. The minimal hardware requirements are
as follows,

1. PROCESSOR : PENTIUM IV

2. RAM : 8 GB

3. PROCESSOR : 2.4 GHZ

4. MAIN MEMORY : 8GB RAM

5. PROCESSING SPEED : 600 MHZ

6. HARD DISK DRIVE : 1TB

7. KEYBOARD :104 KEYS

SOFTWARE REQUIREMENTS

Software requirements deal with defining the resource requirements and prerequisites

that need to be installed on a computer to provide functioning of an application.
These requirements need to be installed separately before the software is
installed. The minimal software requirements are as follows,

1. FRONT END :PYTHON

2. IDE : ANACONDA

3. OPERATING SYSTEM :WINDOWS 10

16
Python Language

 Python is an object-oriented programming language.

 It was created by Guido van Rossum in 1989.
 It is ideally designed for rapid prototyping of complex applications.
 It is extensible to C or C++.
 Companies like Google and NASA also use Python.
 It is majorly used in AI.

Python Programming Characteristics

 It provides rich data types

 Syntax is simple

 It is a platform independent scripted language

 Compared to other programming languages, it allows more run-time flexibility

 A module in Python may have one or more classes and free functions

 Python libraries can also run on Linux and Windows

 For building large applications, Python can be compiled to byte-code

 It supports functional and structured programming

 It supports an interactive mode that allows interactive testing and debugging of

snippets of code

 In Python, editing, debugging and testing are fast.

17
Applications of Python Programming

Web Applications

We can create web apps in Python by using frameworks and CMSes, such as
Django, Flask, Pyramid, Plone and Django CMS. Sites like Mozilla, Reddit,
Instagram and PBS are written in Python.

Scientific and Numeric Computing

There are a number of libraries in Python that can be used for scientific and
numeric computing. SciPy and NumPy are used in general-purpose computing,
EarthPy is used for earth science, AstroPy for astronomy, and so on. Python is
also used in machine learning, data mining and deep learning.

Creating software Prototypes

Python is slow but is great for creating prototypes. For example, you can use
Pygame to create a game prototype. If you are satisfied with the prototype,
you can then build the app using C or C++.

Good Language to Teach Programming

Python has been used by many students, and several companies teach Python to
their employees. It has a lot of features and capabilities. The syntax is simple and
it is one of the easiest languages to learn.

About Opencv Package

Python is a general-purpose programming language started by Guido van
Rossum. It became very popular because of its simplicity and code readability. It
helps the programmer express ideas in fewer lines of code.

Compared to other languages like C/C++, Python is slower. Python can be easily
extended with C/C++. We can write codes in C/C++ and create a python wrapper.

18
This gives us two advantages: first, our code is as fast as original C/C++ code and
second, it is very easy to code in Python. Hence OpenCV-Python is a Python
wrapper around original C++ implementation.

Python also supports NumPy, which gives a MATLAB-style syntax. The OpenCV array

structures are converted to and from NumPy arrays. Whatever operations you can
do in NumPy, you can combine with OpenCV, which increases the number of
weapons in your arsenal. Besides that, several other libraries like SciPy and Matplotlib
which support NumPy can be used with it.

So OpenCV-Python is an appropriate tool for fast prototyping of computer vision


problems.

FEATURES OF ANACONDA NAVIGATOR

Anaconda is free

It is open source, easy to install distribution of Python and R programming


languages.

It is used for scientific computing, data science, statistical analysis and machine
learning.

The latest distribution of Anaconda is Anaconda 5.3 .

19
What is Anaconda Navigator?

Anaconda Navigator is a desktop graphical user interface (GUI) included in

the Anaconda distribution. It allows us to launch applications provided in the
Anaconda distribution and to easily manage conda packages, environments and
channels without the use of command-line commands. It is available for
Windows, macOS and Linux.

20
Applications Provided In Anaconda Distribution

The Anaconda distribution comes with the following applications, along with
Anaconda Navigator.

1. JupyterLab

2. Jupyter Notebook

3. Qt Console

4. Spyder

5. Glueviz

6. Orange3

7. RStudio

8. Visual Studio Code

> JupyterLab: This is an extensible working environment for interactive and
reproducible computing, based on the Jupyter Notebook and architecture.

> Jupyter Notebook: This is a web-based, interactive computing notebook
environment. We can edit and run human-readable docs while describing the
data analysis.

> Qt Console: It is a PyQt GUI that supports inline figures, proper multiline
editing with syntax highlighting, graphical calltips, etc.

> Spyder: Spyder is a scientific Python development environment. It is a powerful
Python IDE with advanced editing, interactive testing, debugging and introspection
features.

> VS Code: It is a streamlined code editor with support for development
operations like debugging, task running and version control.

21
> Glueviz: It is used for multidimensional data visualization across files. It
explores relationships within and among related datasets.

> Orange 3: It is a component-based data mining framework. It can be used for
data visualization and data analysis. The workflows in Orange 3 are very
interactive and provide a large toolbox.

> RStudio: This is a set of integrated tools designed to help you be more productive
with R. It includes R essentials and notebooks.

22
New Features of Anaconda 5.3

• Compiled with the latest Python release: Anaconda 5.3 is compiled with Python 3.7,
taking advantage of Python's speed and feature improvements.
• Better reliability: The reliability of Anaconda is improved in the latest
release by capturing and storing the package metadata for the installed
packages.
• MKL 2019: Users deploying TensorFlow can benefit from MKL 2019 for Deep Neural
Networks. These Python binary packages are provided to achieve high CPU
performance.
• New packages added: over 230 packages have been updated or added in the new
release.
• In progress: there is a casting bug in NumPy with Python 3.7, and the team is
currently working on patching it until NumPy is updated.

23
Flask
Flask is an API of Python that permits building web applications. It was
developed by Armin Ronacher. Flask's framework is more explicit than Django's
framework and is also easier to learn, because it needs less base code to
implement a simple web application.

A web application framework, or web framework, is a
collection of modules and libraries that helps the developer write applications
without writing low-level code such as protocols, thread management, etc.
Flask is based on the WSGI (Web Server Gateway Interface) toolkit and the Jinja2
template engine.

METHOD    DESCRIPTION

GET       Sends the form data to the server without encrypting it.

HEAD      Same as GET, but without the response body.

POST      Sends the form data to the server. Data received by the POST method
          is not cached by the server.

PUT       Replaces the current representation of the target resource with the
          content uploaded to the given URL.

DELETE    Deletes the target resource at the given URL.
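A minimal sketch of how such a Flask route could accept the user's inputs and return a
prediction; the route name, form fields and the predict_price helper are illustrative
assumptions, not the project's actual code:

# Minimal Flask sketch: a form posts square feet, BHK, bathrooms and location,
# and the server returns the predicted price as JSON.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    sqft = float(request.form['total_sqft'])
    bhk = int(request.form['bhk'])
    bath = int(request.form['bath'])
    location = request.form['location']
    price = predict_price(location, sqft, bath, bhk)   # model helper defined elsewhere
    return jsonify({'estimated_price': float(price)})

if __name__ == '__main__':
    app.run(debug=True)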

24
SYSTEM DESIGN

Architecture

Collection of dataset → Data loading and pre-processing → Determine dependent and
independent values → Calculate the variables using the regression technique → Get the
results and calculate the coefficients → Calculate the prediction → Determine the
prediction results.
25
UML DIAGRAMS
o UML stands for Unified Modeling Language.
o It is used in the field of object-oriented software engineering.
o The goal is for UML to become a common language for creating models of
object-oriented computer software.
o It consists of two components: a meta-model and a notation.
o The Unified Modeling Language is a standard language for specifying,
visualizing, constructing and documenting the artifacts of a software system,
as well as for business modeling and other non-software systems.
o It has been proven successful in the modeling of large and complex systems.
o The UML is a very important part of developing object-oriented software and
the software development process. It uses graphical notations to express the
design of software projects.
GOALS:
The Primary goals are as follows:
1. Provide users a ready-to-use, expressive visual modeling Language so that
they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
process.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of OO tools market.
6. Support higher level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.

26
USE CASE DIAGRAM:
A use case diagram is a behavioral diagram. Its purpose is to present a
graphical overview of the functionality provided by a system in terms of actors, their
goals (represented as use cases), and any dependencies between those use
cases. The main purpose of a use case diagram is to show what system functions
are performed for which actor. Roles of the actors in the system can be depicted.

27
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is an interaction diagram
that shows how processes operate with one another and in what order. Sequence
diagrams are sometimes called event diagrams, event scenarios, and timing
diagrams.

28
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities
and actions with support for choice, iteration and concurrency. Activity diagrams
can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.

29
CHAPTER 5

RESULTS AND PERFORMANCE ANALYSIS

Module Implementation

Collection of Dataset

The dataset used in this project contains parameters such as the area in square feet,
the location, and the number of bedrooms and bathrooms of each property. Selling
price is a dependent variable that depends on several other independent variables.

Data Preprocessing

It is the process of transforming raw, complex data into systematic,

understandable knowledge. It finds missing and redundant data in the
dataset and thus brings uniformity to the dataset. However, in our dataset there
were no missing values.

Import Libraries

A library is a collection of modules. The first step is to import the libraries that we
require in our system. Libraries provide ready-made functions which can be invoked
without writing the required code ourselves, and there is a well-known list of popular
Python libraries for data science. We have imported the pandas library and named it pd.

30
Import the Dataset

A lot of datasets come in CSV format. At first we have to locate the directory of the
CSV file and read it using a method called read_csv, which can be found in the
pandas library.
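For example, mirroring the call used in the source code appendix:

import pandas as pd

# Read the CSV file into a DataFrame (path as used in the appendix)
dataset = pd.read_csv(r'..\dataset\Bengaluru_House_Data.csv')
print(dataset.head())   # quick look at the first rows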

Encoding categorical data

Sometimes our data is in qualitative form, that is, we have text as our data, and we
can find categories in text form. It is complicated for machines to understand and
process text rather than numbers, since the models are based on mathematical
equations and calculations. Therefore, we have to encode the categorical data.

Split Dataset into Training and Test Set

Now we should split our dataset into two sets — a training set and a test set. We
will train our machine learning models on the training set, i.e. our machine learning
models will try to understand any correlations in the training set, and then we will
test the models on the test set to check how accurately they can predict. In general we
allocate 80% of the dataset to the training set and the remaining 20% to the test
set.

Dependent and independent variable in regression

Regression analysis describes the relationship between the independent variables and

the dependent variable. It predicts the value of the dependent variable by analyzing the
values of the independent variables.
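In code, a sketch of that separation for this dataset (column names follow the sample
dataset shown in the results section; price is the dependent variable):

# Independent variables (features) versus the dependent variable (price)
X = dataset.drop('price', axis='columns')   # location, total_sqft, bath, bhk, ...
y = dataset['price']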

Regression coefficient

It is the same as the slope of the line of the regression equation.
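Once the model is fitted, these slopes can be read directly from it; a small sketch
assuming a scikit-learn regressor trained on the split described above:

from sklearn.linear_model import LinearRegression

regressor = LinearRegression().fit(X_train, y_train)
print(regressor.coef_)        # one regression coefficient (slope) per independent variable
print(regressor.intercept_)   # constant term of the regression line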

31
Prediction

Prediction is nothing but the output of an algorithm after it has been trained on a

dataset and applied to new data. Finally, our model will
predict the house price based on the user's inputs.

SOFTWARE TESTING

General

In a generalized way, we can say that system testing is a type of testing whose
main aim is to make sure that the system performs efficiently and seamlessly.
The process of testing is applied to a program with the main aim of discovering an
unprecedented error, an error which otherwise could have damaged the future of
the software. Test cases which bring up a high possibility of discovering an error
are considered successful. A successful test helps to uncover the still unknown
errors.

TEST CASE

Testing, as already explained earlier, is the process of discovering all possible

weak points in the finalized software product. Testing helps to verify the working
of sub-assemblies, components, assemblies and the complete result. The software is
taken through different exercises with the main aim of making sure that the software
meets the business requirements and user expectations and doesn't fail abruptly.
Several types of tests are used today. Each test type addresses a specific testing
requirement.

Testing Techniques

A test plan is a document which describes the approach, scope, resources and
schedule of the aimed testing exercises. It helps to identify each test item,
the features which are to be tested, the tasks, who will do each task, how
independent the tester is, the environment in which the test takes place,
the design technique, the entry and exit criteria used along with the rationale for
their choice, and whatever kinds of risk require emergency planning. It
can also be referred to as the record of the test planning process. Test plans are
usually prepared with significant input from test engineers.

32
(I) UNIT TESTING

Unit testing involves the design of test cases that help validate the internal program
logic; all decision branches and internal code are validated. It takes place after an
individual unit is completed and before integration. The unit test thus performs a
basic-level test at the component stage and tests a particular business process,
system configuration, etc. A unit test ensures that each unique path of the process is
performed precisely to the documented specifications and contains clearly defined
inputs with the expected results.

(II) INTEGRATION TESTING

These tests are designed to test integrated software items to

determine whether they really execute as a single program or application. The

testing is event driven and is thus concerned with the basic outcome of fields. The
integration tests demonstrate that although the components were individually
satisfactory, as already shown by successful unit testing, the combination of
components is also apt and fine. This type of testing is specially aimed at exposing
the issues that come up when components are combined.

(III) FUNCTIONAL TESTING

Functional tests provide a systematic demonstration that the functions tested are
available as specified by the technical requirements, the system documentation and
the user manual.

(IV) SYSTEM TESTING

System testing, as the name suggests, is the type of testing which ensures that
the software system meets the business requirements and aims. Testing of the
configuration takes place here to ensure predictable results and their analysis.
System testing relies on the description of the process and its flows, stressing
pre-driven processes and integration points.

33
V) WHITE BOX TESTING

White box testing is the type of testing in which the internal components of the
system software are open and can be inspected by the tester. It is therefore a
complex type of testing process. All the data structures, components, etc. are tested
by the tester himself to find a possible bug or error. It is used in situations in
which black box testing is incapable of finding a bug. It is a complex type of testing
which takes more time to apply.

(VI) BLACK BOX TESTING

Black box testing is the type of testing in which the internal components of the
software are hidden and only the input and output of the system are the key for the
tester to find a bug. It is therefore a simple type of testing; a programmer with
basic knowledge can also perform it. It is less time consuming compared
to white box testing, and it is very successful for software which is less complex and
straightforward in nature. It is also less costly than white box testing.

(V) ACCEPTANCE TESTING


User acceptance testing is a critical phase of any project and requires significant
participation by the end user. It also makes sure that the system meets the
functional requirements.

34
RESULTS:

SAMPLE DATA SET:

area_type            availability    location                   size        society    total_sqft  bath  balcony  price
Super built-up Area  19-Dec          Electronic City Phase II   2 BHK       Coomee     1056        2     1        39.07
Plot Area            Ready To Move   Chikka Tirupathi           4 Bedroom   Theanmp    2600        5     3        120
Built-up Area        Ready To Move   Uttarahalli                3 BHK                  1440        2     3        62
Super built-up Area  Ready To Move   Lingadheeranahalli         3 BHK       Soiewre    1521        3     1        95
Super built-up Area  Ready To Move   Kothanur                   2 BHK                  1200        2     1        51
Super built-up Area  Ready To Move   Whitefield                 2 BHK       DuenaTa    1170        2     1        38
Super built-up Area  18-May          Old Airport Road           4 BHK       Jaades     2732        4              204
Super built-up Area  Ready To Move   Rajaji Nagar               4 BHK       Brway G    3300        4              600
Super built-up Area  Ready To Move   Marathahalli               3 BHK                  1310        3     1        63.25
Plot Area            Ready To Move   Gandhi Bazar               6 Bedroom              1020        6              370
Super built-up Area  18-Feb          Whitefield                 3 BHK                  1800        2     2        70
Plot Area            Ready To Move   Whitefield                 4 Bedroom   Prrry M    2785        5     3        295
Super built-up Area  Ready To Move   7th Phase JP Nagar         2 BHK       Shncyes    1000        2     1        38
Built-up Area        Ready To Move   Gottigere                  2 BHK                  1100        2     2        40
Plot Area            Ready To Move   Sarjapur                   3 Bedroom   Skityer    2250        3     2        148
Super built-up Area  Ready To Move   Mysore Road                2 BHK       PrntaEn    1175        2     2        73.5
Super built-up Area  Ready To Move   Bisuvanahalli              3 BHK       Prityel    1180        3     2        48
Super built-up Area  Ready To Move   Raja Rajeshwari Nagar      3 BHK       GrrvaGr    1540        3     3        60

These are samples from the preloaded dataset used in our model.

35
Graphs:

Before deleting anomalies:

After deleting anomalies (we don't have any unwanted data):

One graph shows bathrooms per property; the other represents property price by square feet.

Importing libraries:

We use the pandas library to read the train and test files.

import pandas as pd               # used for data analysis
import numpy as np                # used for numerical computations
import matplotlib.pyplot as plt   # used to plot values in graphs

36
Data preprocessing:

It gets the count of each area type in the dataset and removes unwanted columns.
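The corresponding calls, as they appear in the source code appendix, look like this:

# Count how many rows fall under each area type
print(dataset.groupby('area_type')['area_type'].agg('count'))

# Drop columns that are not used for prediction, then remove rows with missing values
dataset.drop(['area_type', 'society', 'availability', 'balcony'], axis='columns', inplace=True)
dataset.dropna(inplace=True)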

Encoding categorical data:
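One common way to perform this step, assuming pandas one-hot encoding of the location
column (a sketch, not necessarily the project's exact code):

# One-hot encode the categorical 'location' column
dummies = pd.get_dummies(dataset['location'])
dataset = pd.concat([dataset.drop('location', axis='columns'), dummies], axis='columns')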

Splitting the dataset into train and test data:

We take 80% of our data as training data and 20% as test data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

Dependent and independent variables in regression:

Example:

Locality          Area   Bedrooms   Bathrooms   Price
Electronic City   1056   2          1           40
Whitefield        1170   2          1           38

The dependent variable in our model is price, since it relies on other factors for its value.

The independent variables in our model are locality, area, bedrooms and bathrooms,
since they do not depend on other variables for their values.

Using linear regression, the model predicts the value of the output.

37
Linear regression:

It predicts the result value from the user-defined inputs.
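The function below relies on a fitted model named regressor and on the one-hot-encoded
feature frame X; a minimal sketch of that setup, assuming scikit-learn and the 80/20
split shown above:

from sklearn.linear_model import LinearRegression

# Assumed context for predict_price: X holds total_sqft, bath, bhk plus one column
# per location, and regressor is fitted on the training split.
regressor = LinearRegression()
regressor.fit(X_train, y_train)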

def predict_price(location, sqft, bath, bhk):
    # index of the one-hot column that corresponds to the requested location
    loc_index = np.where(X.columns == location)[0][0]

    # build a single feature row: sqft, bath, bhk, then the location flag
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index >= 0:
        x[loc_index] = 1
    return regressor.predict([x])[0]

Screenshot:

Fig: The final output of our model

38
CHAPTER 6

Conclusion and Future Work

In this paper, several tests have been performed using the linear regression algorithm
to perform house price prediction. The algorithm predicts the prices of new
properties that are going to be listed by taking some input variables and predicting
a correct and justified price. It was a great learning experience building this
predictive sale price model. In the future, different methods that match
time-series data will be used in this research to obtain smaller prediction errors,
and more data will be used to get better results.

39
References

1. Housing Price Prediction Using Machine Learning Algorithms:


The Case of Melbourne City, Australia, The Danh Phan, 2018
International Conference on Machine Learning and Data
Engineering (iCMLDE)

2. Predicting Sales Prices of the Houses Using Regression


Methods of Machine Learning, Parasich Andrey
Viktorovich ; Parasich Viktor Aleksandrovich ; Kaftannikov Igor
Leopoldovich ; Parasich Irina Vasilevna, 2018 3rd Russian-Pacific
Conference on Computer Technology and Applications (RPC)

3. Real Estate Value Prediction Using Linear Regression, Nehal N


Ghosalkar ; Sudhir N Dhage, 2018 Fourth International Conference
on Computing Communication Control and Automation (ICCUBEA)

4. Predicting Housing Market Trends Using Twitter Data, Marlon


Velthorst ; Cicek Güven, 2019 6th Swiss Conference on Data
Science (SDS)

5. House Price Prediction Using Machine Learning and Neural


Networks, Ayush Varma ; Abhijit Sarma ; Sagar Doshi ; Rohini
Nair, 2018 Second International Conference on Inventive
Communication and Computational Technologies (ICICCT)

6. Forecasting house price index of China using dendritic neuron


model, Ying Yu ; Shuangbao Song ; Tianle Zhou ; Hanaki
Yachi ; Shangce Gao, 2016 International Conference on Progress
in Informatics and Computing (PIC)

40
7. Prediction of real estate price variation based on economic
parameters, Li Li ; Kai-Hsuan Chu, 2017 International Conference
on Applied System Innovation (ICASI)

8. Predicting house sale price using fuzzy logic, Artificial Neural


Network and K-Nearest Neighbor, Muhammad Fahmi
Mukhlishin ; Ragil Saputra ; Adi Wibowo, 2017 1st International
Conference on Informatics and Computational Sciences (ICICoS)

9. Comprehensive Analysis of Housing Price Prediction in Pune


Using Multi-Featured Random Forest Approach, Rushab
Sawant ; Yashwant Jangid ; Tushar Tiwari ; Saurabh
Jain ; Ankita Gupta, 2018 Fourth International Conference on
Computing Communication Control and Automation (ICCUBEA)

10. Time-Aware Latent Hierarchical Model for Predicting House Prices,


Fei Tan ; Chaoran Cheng ; Zhi Wei, 2017 IEEE International
Conference on Data Mining (ICDM)

41
Paper Acceptance mail:

42
Plagiarism report:

43
C.Journal Paper
House Price Prediction using machine learning
K Pavan,T Raghul
Abstract:

Usually, the house price index represents the summarized price changes of residential housing. To make it easier for
a family to search for a house, we have made it more precise by asking for the required square feet and the number of
bedrooms and bathrooms required. With a preloaded dataset and data features, a practical data pre-processing and
creative feature engineering method is examined in this paper. The paper also proposes a regression technique in
machine learning to predict house price.

Keywords: House Price, Regression Technique, Machine Learning

1. INTRODUCTION:

Data is at the heart of technical innovation, and achieving almost any result is now possible using predictive models.
Machine learning is extensively used in this approach. Machine learning means providing a valid dataset on which
predictions are based; the machine itself learns how much importance a particular event may have on the entire system,
supported by its pre-loaded data, and accordingly predicts the result. Various modern applications of this technique
include predicting stock prices, predicting the possibility of an earthquake, predicting company sales, and the list has
endless possibilities.

Our aim is to predict a house price based on the user's needs and priorities. By analyzing previous market trends, price
ranges and upcoming developments, future prices will be predicted. The functioning involves a website which accepts
the customer's specifications and then combines the application of a neural network.

1.1 Machine Learning

It is a subset of artificial intelligence (AI). It provides systems the ability to automatically learn and improve by
themselves. It focuses on the development of computer programs that can access data and learn by themselves. The
process of learning begins with observations based on the examples that we provide. The aim is to make computers
learn by themselves without the need of a human.

1.2 Machine Learning Methods

Machine learning can be classified into three types, namely supervised, unsupervised and reinforcement learning.
Supervised machine learning algorithms can apply what has been learned in the past to new data to predict future
events. Such an algorithm analyzes a known training dataset and produces a function to predict outputs. The system
will provide outputs for inputs after training. The system will compare them with the correct, intended output, find
errors and modify itself to make the model more practical and useful.

In contrast, unsupervised machine learning algorithms are the ones which do not require any supervision. They are
used when the sample data used to train is neither classified nor labelled. As the name suggests, the model itself finds
the hidden patterns and insights. The system may or may not produce the right output, but it explores the data and can
draw inferences from the dataset on its own.

Semi-supervised machine learning algorithms are a combination of both supervised and unsupervised learning. In
semi-supervised learning, an algorithm learns from a dataset that includes both labeled and unlabeled data, usually
mostly unlabeled. Generally it is chosen when labelling the sample data requires skilled resources in order to train
from it; otherwise, it doesn't require additional resources.

44
Reinforcement machine learning algorithms use a learning method that works based on feedback. Reinforcement
learning differs from supervised learning in not needing labelled input/output pairs to be presented. It is studied in
various disciplines such as statistics, information theory, etc.

The method that we have used here is supervised machine learning.

1.3 PROBLEM STATEMENT:

Buying a house is one of the most valuable purchases an individual can make during his life. Hence we need to be
extremely careful before buying a house; we need to spend the right amount of money to buy a house.

In the following paper, we explore different machine learning techniques and methods to predict house prices. The
data contains the train and the test dataset. Our objective is to predict house prices based on users' requirements and
needs. Our model predicts the price of a house from the sample data that has been given.

2. EXISTING AND PROPOSED SYSTEM

2.1 EXISTING SYSTEM

Multi Linear Regression

Multiple linear regression shows the relationship between two or more explanatory variables and a scalar response
variable. Each independent variable's value is associated with the dependent variable's value.

Limitations

The dependent variable y must be continuous. The independent variables can be of any type. The dependent variable
is usually dependent on the independent variables.

2.2 Proposed System

Linear regression is a technique that helps to identify the relationship between a dependent variable and independent
variables. The regression technique that we used here is linear regression.

Advantages

 Space complexity is very low; it just needs to save the weights at the end of training. Hence it is a low-latency
algorithm.

 It is very simple to understand.

 Good interpretability.

 Feature importance is generated at the time of model building. With the help of the hyperparameter lambda,
you can handle feature selection, hence we can achieve dimensionality reduction.

3. REQUIRED SYSTEM

3.1 HARDWARE REQUIREMENTS

The most common set of requirements defined by any operating system or software application is the physical
computer resources, also known as hardware. A hardware requirements list is often accompanied by a hardware
compatibility list, especially in case of operating systems. The minimal hardware requirements are as follows,

1. PROCESSOR : PENTIUM IV
2. RAM : 8 GB
3. PROCESSOR : 2.4 GHZ
4. MAIN MEMORY : 8 GB RAM
5. PROCESSING SPEED : 600 MHZ
6. HARD DISK DRIVE : 1 TB
7. KEYBOARD : 104 KEYS

3.2 SOFTWARE REQUIREMENTS

Software requirements deal with defining the resource requirements and prerequisites that need to be installed on a
computer to provide functioning of an application. These requirements need to be installed separately. The minimal
software requirements are as follows,

 FRONT END : PYTHON
 IDE : ANACONDA
 OPERATING SYSTEM : WINDOWS 10
45
4. ARCHITECTURE OF PROPOSED SYSTEM:

Collection of dataset → Data loading and pre-processing → Determine dependent and independent values →
Calculate the variables using the regression technique → Get the results and calculate the coefficients →
Calculate the prediction → Determine the prediction results.

4.1 Description of The Architecture

 A sample dataset is collected.
 The sample data is loaded.
 It determines the dependent value; this is the value that depends on other values. Here the dependent value
is price.
 It also determines the independent values; these are the values that do not depend on other values. Here
square feet, area, number of bedrooms and bathrooms are independent values.
 Using linear regression it calculates the variables.
 When a user enters the requirements, it predicts and shows the results.

5. Module Implementation

5.1 Collection of Dataset

The dataset used in this project contains parameters such as the area in square feet, the location, and the number of
bedrooms and bathrooms of each property. Selling price is a dependent variable that depends on several other
independent variables.

5.2 Data Preprocessing

It is the process of transforming raw, complex data into systematic, understandable knowledge. It finds missing and
redundant data in the dataset and thus brings uniformity to the dataset. But in our dataset, there were no missing
values.

5.3 Import Libraries

A library is a collection of modules. The first step is to import the libraries that we require in our system. Libraries
provide ready-made functions which can be invoked without writing the required code ourselves. We have imported
the pandas library and named it pd.

5.4 Import the Dataset

A lot of datasets come in CSV format. At first we have to locate the directory of the CSV file and read it using a
method called read_csv, which can be found in the pandas library.

5.5 Encoding Categorical Data

Sometimes we have text as our data, and we can find categories in text form. It is harder for machines to understand
and process text than numbers, hence we change the categories to numbers. Therefore, we have to encode the
categorical data.

5.6 Split Dataset into Training and Test Set

Now we should split our dataset into two sets — a training set and a test set. We will train our machine learning
models on the training set, i.e. our machine learning models will understand the relationships in the training set, and
then we will test the models on the test set to check how they predict. In general we allocate 80% of the dataset to the
training set and the remaining 20% to the test set.
46
47
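As referenced in Section 5.6, the following is a minimal sketch of the load, encode and split steps described in Sections 5.3 to 5.6; the file name and column names here are placeholders, not the actual project files.

# sketch of load -> encode -> split (placeholder file and column names)
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('house_data.csv')          # hypothetical CSV path

# one-hot encode a categorical text column such as 'location'
df = pd.concat([df.drop('location', axis=1),
                pd.get_dummies(df['location'])], axis=1)

X = df.drop('price', axis=1)                # independent variables
y = df['price']                             # dependent variable (price)

# 80% training set, 20% test set, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=10)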
7. Conclusion

In this paper, several tests have been performed using the
linear regression algorithm to perform house price
prediction. The algorithm predicts the prices of new
properties that are about to be listed by taking some input
variables and producing a correct and justified price. It was
a great learning experience building this predictive sale
price model. In future work, methods that better match
time-series data will be used to obtain smaller prediction
errors, and more data will be used to get better results.

CODING:

# importing libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import matplotlib

matplotlib.rcParams["figure.figsize"] = (20,10)

# importing the dataset

dataset = pd.read_csv('../dataset/Bengaluru_House_Data.csv')

print(dataset.head(10))

print(dataset.shape)

# Data preprocessing

## getting the count of area type in the dataset

print(dataset.groupby('area_type')['area_type'].agg('count'))

## droping unnecessary columns

dataset.drop(['area_type','society','availability','balcony'],
axis='columns', inplace=True)

print(dataset.shape)

## data cleaning

print(dataset.isnull().sum())

dataset.dropna(inplace=True)

print(dataset.shape)

### data engineering

print(dataset['size'].unique())

dataset['bhk'] = dataset['size'].apply(lambda x: float(x.split(' ')[0]))

### exploring 'total_sqft' column

print(dataset['total_sqft'].unique())

#### defining a function to check whether the value is float or not

def is_float(x):
    try:
        float(x)
    except:
        return False
    return True

print(dataset[~dataset['total_sqft'].apply(is_float)].head(10))

#### defining a function to convert the range of column values to a
single value

def convert_sqft_to_num(x):
    tokens = x.split('-')
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1]))/2
    try:
        return float(x)
    except:
        return None

#### testing the function

print(convert_sqft_to_num('290'))

print(convert_sqft_to_num('2100 - 2850'))

print(convert_sqft_to_num('4.46Sq. Meter'))

#### applying this function to the dataset

dataset['total_sqft'] = dataset['total_sqft'].apply(convert_sqft_to_num)

print(dataset['total_sqft'].head(10))

print(dataset.loc[30])

## feature engineering

print(dataset.head(10))

### creating a new column 'price_per_sqft', since in the real estate

### market the price per square foot matters a lot
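### note: the 'price' column appears to be in lakhs, hence the multiplication
### by 100000 below to express the price per square foot in rupees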

dataset['price_per_sqft'] = dataset['price']*100000/dataset['total_sqft']

print(dataset['price_per_sqft'])

### exploring 'location' column

print(len(dataset['location'].unique()))

dataset['location'] = dataset['location'].apply(lambda x: x.strip())

location_stats =
dataset.groupby('location')['location'].agg('count').sort_values(ascendin
g=False)

print(location_stats[0:10])

#### 'location_stats' gives each location with its total occurrence count,

#### and 'location_stats_less_than_10' keeps the locations with <= 10 occurrences

print(len(location_stats[location_stats <= 10]))

location_stats_less_than_10 = location_stats[location_stats <= 10]

print(location_stats_less_than_10)

#### redefining the 'location' column as 'other' where the location count is <= 10

dataset['location'] = dataset['location'].apply(
    lambda x: 'other' if x in location_stats_less_than_10 else x)

print(dataset['location'].head(10))

print(len(dataset['location'].unique()))

## Outlier detection and removal

### checking 'total_sqft'/'bhk'; if the ratio is very low there is some

### anomaly and we have to remove these outliers

print(dataset[dataset['total_sqft'] / dataset['bhk'] < 300]
      .sort_values(by='total_sqft').head(10))

print(dataset.shape)

dataset = dataset[~(dataset['total_sqft'] / dataset['bhk'] < 300)]

print(dataset.shape)

### checking columns where 'price_per_sqft' is very low

### where it should not be that low, so it's an anomaly and

### we have to remove those rows

print(dataset['price_per_sqft'].describe())

### function to remove these extreme cases of very high or low values

### of 'price_per_sqft' based on std()

def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        mean = np.mean(subdf['price_per_sqft'])
        std = np.std(subdf['price_per_sqft'])
        reduced_df = subdf[(subdf['price_per_sqft'] > (mean - std)) &
                           (subdf['price_per_sqft'] <= (mean + std))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out

dataset = remove_pps_outliers(dataset)

print(dataset.shape)

### plotting a graph where we can visualize that, for properties in the same

### location, some 3 bhk properties with higher 'total_sqft' are priced lower

### than 2 bhk properties with lower 'total_sqft'

def plot_scatter_chart(df, location):
    bhk2 = df[(df['location'] == location) & (df['bhk'] == 2)]
    bhk3 = df[(df['location'] == location) & (df['bhk'] == 3)]
    matplotlib.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2['total_sqft'],
                bhk2['price'],
                color='blue',
                label='2 BHK',
                s=50)
    plt.scatter(bhk3['total_sqft'],
                bhk3['price'],
                marker='+',
                color='green',
                label='3 BHK',
                s=50)
    plt.xlabel('Total Square Feet Area')
    plt.ylabel('Price')
    plt.title(location)
    plt.legend()
    plt.show()

plot_scatter_chart(dataset,"Hebbal")

plot_scatter_chart(dataset,"Rajaji Nagar")

### defining a function to find rows where 'location' is the same but a property

### with fewer 'bhk' has a higher price per sqft than a property with more 'bhk';

### this is also an anomaly and we have to remove these properties

def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df['price_per_sqft']),
                'std': np.std(bhk_df['price_per_sqft']),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(exclude_indices,
                    bhk_df[bhk_df['price_per_sqft'] < (stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')

dataset = remove_bhk_outliers(dataset)

print(dataset.shape)


plot_scatter_chart(dataset,"Hebbal")

plot_scatter_chart(dataset,"Rajaji Nagar")

### histogram of properties by price per square feet

matplotlib.rcParams['figure.figsize'] = (20,10)

plt.hist(dataset['price_per_sqft'], rwidth=0.8)

plt.xlabel('Price Per Square Feet')

plt.ylabel('Count')

plt.title('Histogram of Properties by Price Per Square Feet')

plt.show()

### exploring bathroom feature

print(dataset['bath'].unique())

#### having more than 10 bathrooms is unusual,

#### so we will check for and remove these anomalies

print(dataset[dataset['bath'] > 10])

#### plotting histogram of bathroom

plt.hist(dataset['bath'], rwidth=0.8, color='red')

plt.xlabel('Number of Bathrooms')

plt.ylabel('Count')

plt.title('Histogram of Bathroom per Property')

plt.show()

print(dataset[dataset['bath'] > dataset['bhk'] + 2])

dataset = dataset[dataset['bath'] < dataset['bhk'] + 2]

print(dataset.shape)

### after removing outliers, dropping unwanted features

dataset.drop(['size','price_per_sqft'], axis='columns', inplace=True)

print(dataset.head())

## one hot encoding the 'location' column

dummies = pd.get_dummies(dataset['location'])

print(dummies.head())

dataset = pd.concat([dataset,dummies.drop('other', axis='columns')],


axis='columns')

dataset.drop('location', axis=1, inplace=True)

print(dataset.head())

print(dataset.shape)

## distributing independent features in 'X' and dependent feature in 'y'

X = dataset.drop(['price'],axis= 'columns')

y = dataset['price']

print(X.shape)

print(y.shape)

## splitting the dataset into training set and test set

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test =
train_test_split(X,y,test_size=0.2,random_state=10)

## training the model

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X_train,y_train)

print(regressor.score(X_test,y_test))

## k-fold cross validation

from sklearn.model_selection import ShuffleSplit, cross_val_score

cv = ShuffleSplit(n_splits=5, test_size = 0.2, random_state=0)

print(cross_val_score(regressor, X, y, cv=cv))

## grid search, hyper parameter tuning

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso

from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearch(X, y):
    algos = {
        'linear_regression': {
            'model': LinearRegression(),
            'params': {'normalize': [True, False]}
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1, 2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion': ['mse', 'friedman_mse'],
                'splitter': ['best', 'random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'],
                          config['params'],
                          cv=cv,
                          n_jobs=-1,
                          return_train_score=False)
        gs.fit(X, y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })
    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

model_scores = find_best_model_using_gridsearch(X,y)

print(model_scores)

### after running grid search, the linear regression model has the best score,

### so we will use the linear regression model on the whole dataset

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X,y)

## evaluating the model

def predict_price(location, sqft, bath, bhk):
    loc_index = np.where(X.columns == location)[0][0]
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index >= 0:
        x[loc_index] = 1
    return regressor.predict([x])[0]

print(predict_price('1st Phase JP Nagar',1000,2,2))

print(predict_price('1st Phase JP Nagar',1000,3,3))

print(predict_price('Indira Nagar',1000,3,3))

# saving the model

import pickle

with open('bangalore_home_prices_model.pickle', 'wb') as f:
    pickle.dump(regressor, f)

# exporting columns

import json

columns = {'data_columns': [col.lower() for col in X.columns]}

with open("columns.json","w") as f:

f.write(json.dumps(columns))
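The saved pickle file and column list could later be loaded back for prediction roughly as follows. This is only a sketch: it assumes the two files written above exist in the working directory, and predict_from_saved is a hypothetical helper that mirrors predict_price.

# sketch: loading the saved model and columns back for prediction
import json
import pickle
import numpy as np

with open('bangalore_home_prices_model.pickle', 'rb') as f:
    model = pickle.load(f)

with open('columns.json', 'r') as f:
    data_columns = json.load(f)['data_columns']   # lower-cased column names saved above

def predict_from_saved(location, sqft, bath, bhk):
    x = np.zeros(len(data_columns))
    x[0] = sqft       # first three columns are total_sqft, bath and bhk
    x[1] = bath
    x[2] = bhk
    if location.lower() in data_columns:
        x[data_columns.index(location.lower())] = 1
    return model.predict([x])[0]

print(predict_from_saved('1st Phase JP Nagar', 1000, 2, 2))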
