B.E Cse Batchno 106
by
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI – 600 119
MARCH - 2020
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with “A” grade by NAAC
Jeppiaar Nagar, Rajiv Gandhi Salai, Chennai – 600 119
www.sathyabama.ac.in
BONAFIDE CERTIFICATE
This is to certify that this project report is the bonafide work of K PAVAN (Reg. No.
37110555) and T RAGHUL (Reg. No.37110613) who carried out the project entitled
“HOUSE PRICE PREDICTION MODEL” under my supervision from August 2019 to
March 2020.
Internal Guide
Dr. Ashok Kumar, M.E., Ph.D.
DECLARATION
We, K PAVAN and T RAGHUL, hereby declare that the Project Report entitled
“HOUSE PRICE PREDICTION MODEL”, done by us under the guidance of
Dr. ASHOK KUMAR, M.E., Ph.D., Department of Computer Science and Engineering
at Sathyabama Institute of Science and Technology, is submitted in partial
fulfillment of the requirements for the award of Bachelor of Engineering degree in
Computer Science and Engineering.
DATE:
ACKNOWLEDGEMENT
I convey my thanks to Dr. T. Sasikala, M.E., Ph.D., Dean, School of Computing, Dr. S.
Vigneswari, M.E., Ph.D., and Dr. L. Lakshmanan, M.E., Ph.D., Heads of the Department
of Computer Science and Engineering for providing me necessary support and details at
the right time during the progressive reviews.
I would like to express my sincere and deep sense of gratitude to my Project Guide
Dr. ASHOK KUMAR, M.E., Ph.D., Professor, for his valuable guidance, suggestions and
constant encouragement, which paved the way for the successful completion of my
project work.
I wish to express my thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Engineering who were helpful in many ways for
the completion of the project.
Abstract:
A house price index usually represents the summarized price changes of residential
housing. To make it easier for a family to search for a house, we make the estimate
more precise by asking for the required square footage and the number of bedrooms
and bathrooms. Using a preloaded dataset and its features, this paper examines a
practical data pre-processing and feature engineering method. The paper also proposes
a regression technique in machine learning to predict house prices.
TABLE OF CONTENTS
ABSTRACT v
LIST OF FIGURES viii
1. INTRODUCTION 1
1.1 MACHINE LEARNING 1
1.2 ADVANTAGES AND APPLICATIONS 2
2. LITERATURE SURVEY 7
5. RESULTS AND DISCUSSION 30
5.1 MODULE IMPLEMENTATION 30
5.2 SOFTWARE TESTING 35
5.3 RESULTS
REFERENCES 40
APPENDIX
A. PAPER ACCEPTANCE MAIL 42
B. PLAGIARISM REPORT 43
C. JOURNAL PAPER 44
D. SOURCE CODE 49
LIST OF FIGURES
4.1 ANACONDA 19
4.2 ANACONDA NAVIGATOR 19
4.3 SYSTEM DESIGN 25
4.4 USE CASE DIAGRAM 27
4.5 SEQUENCE DIAGRAM 28
4.6 ACTIVITY DIAGRAM 29
CHAPTER 1
INTRODUCTION:
Our aim is to predict house prices based on customers' needs and priorities. Future
prices are predicted by analyzing previous market trends and price ranges, as well as
upcoming developments. The system involves a website that accepts the customer's
specifications and then applies a neural network to them.
Machine Learning
Machine learning is a subset of artificial intelligence (AI). It gives a system the
ability to learn and improve automatically. It focuses on the development of computer
programs that can access data and learn from it by themselves. The process of learning
begins with observations based on the examples that we provide. The aim is to
make computers learn by themselves without human intervention.
Machine Learning Methods
Machine learning can be classified into three types, namely supervised,
unsupervised and reinforcement learning. Supervised machine learning
algorithms apply what has been learned in the past to new data in order to predict
future events. The algorithm analyzes a known training dataset and produces a
function to predict outputs.
After training, the system can provide outputs for new inputs. It can also compare
its output with the correct, intended output, find the errors and modify the model
accordingly to make it more practical and useful.
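The compare-and-correct loop described above can be illustrated with a minimal sketch. It assumes a one-feature linear model trained by batch gradient descent on made-up data; the function name, learning rate and data are illustrative and are not taken from the project.

```python
# Toy supervised learning: fit y ≈ w * x + b by repeatedly comparing
# predictions with the known outputs and correcting the parameters.

def train_linear(xs, ys, lr=0.01, epochs=5000):
    """Fit a one-feature linear model with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error between the model's output and the intended output.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Illustrative data generated from y = 2x + 1 (not from the project dataset).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear(xs, ys)
```

On this noise-free data the learned parameters approach the slope and intercept used to generate it.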
While even experts often cannot be sure where, and through which correlation, a
production error in a plant fleet arises, machine learning offers the possibility of
identifying the error early, which saves downtime and money. Machine learning is now
also used in the medical field. In the future, after collecting huge amounts of data,
apps will be able to warn a patient when his doctor wants to prescribe a drug that he
cannot tolerate. The app can also suggest alternative options by taking the patient's
genetics into account.
2. Traffic Predictions:
Whenever we visit a new place, or when we are not sure about the route, we
generally use maps. A map shows the distance, the amount of time it takes to cover
the distance, and information about traffic congestion. Machine learning predicts
the traffic on a particular route by analyzing the previous days' traffic on that
route at the same time of day. Hence machine learning helps us in predicting traffic.
3. Video Surveillance:
A single person cannot monitor multiple cameras at the same time; that’s where
motionless for a long time or if a person is stumbling, then it alerts the attendant who
is looking after the camera. It has been used extensively in video surveillance and it
Machine learning has been extensively used in checking for spam and malware
emails. It detects new malware and protects users against it. It can detect various
Whenever we search for anything on the web, the search engine (for example,
Google) keeps track of what users open after the results are shown. It checks
whether users click the top search results or the bottom ones. Machine learning
thus helps make the search engine better over time.
7. Product recommendations:
Whenever a product is recommended to you, whether after you purchase a certain
product from a website or because it is a new product, machine learning is what
drives the recommendation.
It helps in detecting online money fraud. Many payment gateways have started to
implement this technique to prevent fraud. Companies like PayPal use machine
learning to detect fraud.
INTRODUCTION TO PROJECT
Housing is one of the most valuable economic assets an individual can purchase
during his adult life. Hence we need to be extremely careful before buying a house,
and we need to pay the right price for it.
CHAPTER 2
LITERATURE SURVEY
House price prediction is a crucial topic in real estate. The literature attempts
to derive useful knowledge from historical data of property markets. Machine
learning techniques are applied to analyze historical property transactions in
Australia to discover useful models for house buyers and sellers. The study
reveals a high discrepancy between house prices in the most expensive and most
affordable places within Melbourne city. Moreover, experiments demonstrate
that the combination of Stepwise and Support Vector Machine, based on mean
squared error measurement, is a competitive approach.
In this article we describe our solution for the “House Prices: Advanced Regression
Techniques” machine learning competition, which was held on the Kaggle
platform. The goal is to predict house sale price from attributes such as house
area, year of building, etc. In our solution, we use classic machine learning
algorithms together with our original methods, which are described here. At the
end of the competition, we took 18th place among 2124 participants from the
whole world.
3. Real Estate Value Prediction Using Linear Regression
The real estate market is one of the most competitive in terms of pricing, and
prices keep fluctuating. It is one of the prime fields in which to apply the ideas
of machine learning to improve and predict prices with high accuracy. Three
factors influence the price of a house: physical condition, concept and location.
The current framework estimates the value of houses without any expectation of
market prices and price increments. The objective of the paper is to predict
residential prices for customers, considering their financial plans and wishes.
Future prices are predicted by analyzing past market patterns and value ranges,
as well as upcoming developments. The study aims to predict house prices in
Mumbai city with linear regression. It will help clients invest in an estate
without approaching a broker. The results of this research showed that linear
regression gives a minimum prediction error of 0.3713.
In this study, we attempt to predict Dutch housing market trends using text
mining and machine learning as an application of data science methods
in finance. Our main goal is to predict the short-term upward or downward trend
of the average house price in the Dutch market by using text data collected from
Twitter. Twitter is widely used and has been proven to be a helpful
source of information. However, Twitter, text mining (tokenization,
bag-of-words, n-grams, weighted term frequencies) and machine learning
(classification algorithms) have not yet been combined in order to predict
housing market trends in the short term. In this study, tweets including predefined
search words are collected based on domain knowledge, and the corresponding
text is grouped by month into documents. Then words and word sequences are
transformed into numerical values. These values serve as attributes to predict
whether the housing market moves up or down, i.e. we approached this as a
binomial classification problem relating the text data of a month to the (up or
down) trend of the subsequent month.
Our main results reveal that there is a correlation between the (weighted) frequency
of words and short-term housing trends; in other words, we were able to make
accurate predictions of short-term trends using multiple machine learning and
text mining techniques combined.
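The step of turning words and word sequences into numerical values can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names and the sample text are hypothetical, and the surveyed study's actual tokenization and term weighting are not reproduced here.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()

def ngrams(tokens, n):
    """Return the list of n-word sequences in order of appearance."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(document, n=2):
    """Count single words (bag-of-words) and n-grams for one document."""
    tokens = tokenize(document)
    return Counter(tokens + ngrams(tokens, n))

# Hypothetical tweets grouped into one monthly document.
month_doc = "House prices rising again. Prices rising fast!"
features = featurize(month_doc)
```

Each monthly document becomes a mapping from terms to counts, which a classifier could then use as attributes.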
Real estate is the least transparent industry in our ecosystem. Housing
prices keep changing day in and day out and are sometimes hyped rather than
being based on valuation. Predicting housing prices with real factors is the
main crux of our research project. Here we aim to base our evaluations on
every basic parameter that is considered while determining the price. We use
various regression techniques in this pathway, and our results are not the sole
determination of one technique; rather, they are the weighted mean of various
techniques, to give the most accurate results. The results proved that this
approach yields minimum error and maximum accuracy compared with the individual
algorithms applied. We also propose to use real-time neighborhood details from
Google Maps to get exact real-world valuations.
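The idea of combining several regression techniques through a weighted mean can be sketched as below. The estimates and weights are hypothetical values chosen for illustration; the surveyed paper's actual models and weighting scheme are not given in this report.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions using a weighted mean."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is needed per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Hypothetical price estimates from three regressors, with hypothetical
# weights reflecting how much each model is trusted.
estimates = [62.0, 58.0, 66.0]
weights = [0.5, 0.3, 0.2]
blended = weighted_ensemble(estimates, weights)
```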
Whether the Chinese housing market continues to prosper is related to the
development of China, and it also has an impact on world finance.
Thus forecasting the house price index is extremely important and
challenging. In this paper we propose an unsupervised learnable neuron
model (DNM) by including the nonlinear interactions between excitation and
inhibition on dendrites.
We use the DNM to fit the House Price Index (HPI) data and then forecast the trends
of the Chinese housing market. To verify the effectiveness of the DNM, we use a
traditional statistical model (i.e., the exponential smoothing (ES) model) to
make a performance comparison. Three quantitative statistical metrics, namely
normalized mean square error, absolute percentage of error, and correlation
coefficient, are used to evaluate the forecasting performance of the two models.
Experimental results demonstrate that the proposed DNM is better than
ES in all three quantitative statistical metrics.
8. Predicting house sale price using fuzzy logic, Artificial
Neural Network and K-Nearest Neighbor
The price of land and houses is usually determined first by the seller; however,
determining the right price in the sales process affects the buyer's desire to
choose and bid. Given the special characteristics of Indonesia, the tax object
value (NJOP) and location parameters have a high influence on the price. In this
paper we propose the prediction of land and house prices using several methods.
Fuzzy logic, Artificial Neural Network and K-Nearest Neighbor are compared in
this paper to find the most appropriate method that can be used as a reference
for determining the price by sellers. Google Maps is used to represent the
spatial data for the prediction parameters. The variables used in the methods
are the NJOP of the land, the location, the age, the NJOP of the house, and
the valuable location of the land. The experimental methods are tested by
comparing the real transaction price and the prediction using the MAPE formula.
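For reference, MAPE (mean absolute percentage error) averages the absolute error relative to each actual value. The sketch below uses made-up prices; the function name and data are illustrative only.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            raise ValueError("MAPE is undefined when an actual value is 0")
        total += abs((a - p) / a)
    return 100.0 * total / len(actual)

# Illustrative transaction prices and predictions (not from any dataset).
actual_prices = [100.0, 200.0, 400.0]
predicted_prices = [110.0, 180.0, 400.0]
error_pct = mape(actual_prices, predicted_prices)
```

Here the three relative errors are 10%, 10% and 0%, so the result is their mean.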
The interests of both buyer and seller should be satisfied so that they do not
overestimate or underestimate the price. This housing price prediction
model acts as an aid for a buyer, a seller or a real estate agent to make a
better-informed decision. To achieve this, diverse features are selected as input
from the feature set, and various algorithms such as Random Forest and
Decision Tree are applied.
CHAPTER 3
EXISTING SYSTEM
Limitations
Proposed System
Advantages
Space complexity is very low: it just needs to save the weights at the end of
training. Prediction is therefore fast, making it a low-latency algorithm.
Good interpretability
Feature importance is generated at the time of model building. With the help
of the hyperparameter lambda, you can handle feature selection and hence
achieve dimensionality reduction.
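The report does not say how lambda is applied. Assuming it denotes the strength of an L1 (lasso-style) penalty, the sketch below shows the soft-thresholding operator that such a penalty induces: coefficients smaller than lambda in magnitude become exactly zero, which is what removes features. The coefficient values are hypothetical, and the one-step shrinkage is exact only for orthonormal features.

```python
def soft_threshold(value, lam):
    """Shrink a coefficient toward zero by lam; small ones become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def select_features(coefficients, lam):
    """Apply the shrinkage to every coefficient and keep the non-zero ones."""
    shrunk = {name: soft_threshold(c, lam) for name, c in coefficients.items()}
    return {name: c for name, c in shrunk.items() if c != 0.0}

# Hypothetical unregularized coefficients for three features.
coefs = {"total_sqft": 2.0, "bath": -1.25, "balcony": 0.25}
kept = select_features(coefs, lam=0.5)
```

With these values the weak "balcony" coefficient is dropped, while the other two are shrunk but kept.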
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is
put forth with a very general plan for the project and some cost estimates. The
feasibility study of the proposed system is carried out to ensure that the proposed
system is not a burden to the company. The three key considerations are:
1. Economical feasibility
2. Technical feasibility
3. Social feasibility
ECONOMICAL FEASIBILITY
This study is carried out to check whether the right amount of funds is invested in
the model. It is done to avoid pouring excess money into a single model and to make
sure that the model stays well within the budget. It is extremely important to spend
only the right amount of funds on a model.
TECHNICAL FEASIBILITY
This study makes sure that the technical requirements are limited to what we can
offer. Any system developed should not place a high demand on technical resources,
since that puts a burden on the client. It also checks the project's potential, that
is, what it can do once developed.
SOCIAL FEASIBILITY
This study is carried out to check how the system interacts with other systems. It
checks the level of acceptance of the system by the user. It includes training the
user to use the system efficiently, which is a necessity. Since the client is the
final user of the system, he can criticize the system, but the criticism should be
disciplined and meaningful.
CHAPTER 4
HARDWARE REQUIREMENTS
1. PROCESSOR : PENTIUM IV
2. RAM : 8 GB
SOFTWARE REQUIREMENTS
2. IDE : ANACONDA
Python Language
Syntax is simple
A module in Python may have one or more classes and free functions
17
Applications of Python Programming
Web Applications
We can create web apps in Python using frameworks and content management systems
(CMS) such as Django, Flask, Pyramid, Plone and Django CMS. Sites like
Mozilla, Reddit, Instagram and PBS are written in Python.
There are many libraries in Python that can be used for scientific and numeric
computing. SciPy and NumPy are used for general-purpose computing, EarthPy is used
for earth science, AstroPy is used for astronomy, and so on. Python is also used in
machine learning, data mining and deep learning.
Python is slow but is great for creating prototypes. For example, you can use
Pygame to create a game prototype. If you are satisfied with the prototype, you
can then build the application using C or C++.
Python is used by many students, and several companies teach Python to their
employees. It has a lot of features and capabilities. The syntax is simple, and it
is one of the easiest languages to learn.
Compared to languages like C/C++, Python is slower. However, Python can be easily
extended with C/C++: we can write code in C/C++ and create a Python wrapper around it.
This gives us two advantages: first, our code is as fast as the original C/C++ code,
and second, it is very easy to code in Python. For example, OpenCV-Python is a Python
wrapper around the original C++ implementation.
Anaconda is free
It is used for scientific computing, data science, statistical analysis and machine
learning.
What is Anaconda Navigator?
Applications Provided In Anaconda Distribution
The Anaconda distribution comes with the following applications along with
Anaconda Navigator.
1. JupyterLab
2. Jupyter Notebook
3. Qt Console
4. Spyder
5. Glueviz
6. Orange3
7. RStudio
> JupyterLab: This is an extensible working environment for interactive and
reproducible computing, based on the Jupyter Notebook and Architecture.
analysis.
> Qt Console: It is a PyQt GUI that supports inline figures, proper multiline
Glueviz: It is used for multidimensional data visualization across files. It
explores relationships within and among related datasets.
RStudio: This is a set of integrated tools designed to help you be more
productive with R. It includes R essentials and notebooks.
New Features of Anaconda 5.3
Flask
Flask is a Python web framework for building web applications. It was
developed by Armin Ronacher. Flask’s framework is more explicit than Django’s
framework and is also easier to learn, because it requires less base code to
implement a simple web application.
METHOD DESCRIPTION
SYSTEM DESIGN
Architecture
[Figure 4.3: System design. Recoverable block labels: "Determine the", "Calculate the Prediction", "Prediction Results"]
UML DIAGRAMS
o UML stands for Unified Modeling Language.
o It is used in the field of object-oriented software engineering.
o The goal is for UML to become a common language for creating models of
object-oriented computer software.
o It consists of two components: a meta-model and a notation.
o The Unified Modeling Language is a standard language for specifying,
visualizing, constructing and documenting the artifacts of a software system,
as well as for business modeling and other non-software systems.
o It has been proven successful in the modeling of large and complex systems.
o The UML is a very important part of developing object-oriented software and
the software development process. It uses graphical notations to show the
design of software projects.
GOALS:
The Primary goals are as follows:
1. Provide users a ready-to-use, expressive visual modeling Language so that
they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
process.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of OO tools market.
6. Support higher level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.
USE CASE DIAGRAM:
A use case diagram is a behavioural diagram. Its purpose is to present a
graphical overview of the functionality provided by a system in terms of actors, their
goals (represented as use cases), and any dependencies between those use
cases. The main purpose of a use case diagram is to show what system functions
are performed for which actor. Roles of the actors in the system can be depicted.
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is an interaction diagram
that shows how processes operate with one another and in what order. Sequence
diagrams are sometimes called event diagrams, event scenarios, and timing
diagrams.
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities
and actions with support for choice, iteration and concurrency. Activity diagrams
can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.
CHAPTER 5
Module Implementation
Collection of Dataset
The dataset used in this project contains parameters such as area in square feet,
location, number of bedrooms and number of bathrooms for each property. Selling
price is the dependent variable, which depends on several independent variables.
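As a sketch, assuming the four inputs that the prediction function in the Results section takes (area, bathrooms, bedrooms and a one-hot encoded location), this dependence can be written as a multiple linear regression model. The symbols below are notation introduced here for illustration and do not appear in the original report.

```latex
\[
\text{price} \;=\; \beta_0
  + \beta_1\,\text{sqft}
  + \beta_2\,\text{bath}
  + \beta_3\,\text{bhk}
  + \sum_{j} \gamma_j\,\mathbf{1}[\text{location} = j]
  + \varepsilon
\]
```

Here \(\beta_0\) is the intercept, \(\beta_1, \beta_2, \beta_3\) and \(\gamma_j\) are the regression coefficients learned from the training data, \(\mathbf{1}[\cdot]\) is 1 for the property's own location and 0 otherwise, and \(\varepsilon\) is the error term.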
Data Preprocessing
Import Libraries
A library is a collection of modules. The first step is to import the libraries that
we require in our system. Libraries provide functions that can be invoked without
writing the required code ourselves. We have imported the pandas library, one of the
most popular Python libraries for data science, and named it pd.
Import the Dataset
Many datasets come in CSV format. First we have to locate the directory of the CSV
file and read it using a function called read_csv, which is found in the pandas
library.
Sometimes our data is in qualitative form, that is, we have text as our data, and the
categories appear in text form. It is harder for machines to understand and process
text than numbers, since the models are based on mathematical equations and
calculations. Therefore, we have to encode the categorical data.
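One common way to encode categorical data is one-hot encoding, in which each category becomes its own 0/1 column. The sketch below is a dependency-free illustration of the idea; the helper name is hypothetical, and in practice a library routine would normally be used instead.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector.

    Returns the sorted list of categories and one encoded row per value.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # flag the column that matches this value
        rows.append(row)
    return categories, rows

# Location names taken from the sample rows of the dataset.
locations = ["Whitefield", "Kothanur", "Whitefield", "Sarjapur"]
categories, encoded = one_hot_encode(locations)
```

Each row contains exactly one 1, in the column of its location, so the model only ever sees numbers.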
Now we should split our dataset into two sets: a training set and a test set. We
train our machine learning models on the training set, i.e. the models try to learn
any correlations in the training set, and then we test the models on the test set
to check how accurately they can predict. In general, we allocate 80% of the dataset
to the training set and the remaining 20% to the test set.
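The 80/20 split can be sketched with the standard library alone. This is an illustration of the idea rather than a reproduction of any library's splitting routine; the function name and the seed are assumptions, and the shuffling order will differ from what other tools produce.

```python
import random

def split_train_test(rows, test_size=0.2, seed=10):
    """Shuffle the rows reproducibly and split them into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for 50 dataset rows.
rows = list(range(50))
train, test = split_train_test(rows)
```

Fixing the seed makes the split repeatable, so results can be compared between runs.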
Regression coefficient
Prediction
SOFTWARE TESTING
General
In general, system testing is a type of testing whose main aim is to make sure that
the system performs efficiently and seamlessly. Testing is applied to a program with
the main aim of discovering an as-yet-undiscovered error, an error which could
otherwise have damaged the future of the software. A test case that has a high
probability of discovering an error is considered successful. Such a successful test
helps to uncover still-unknown errors.
TEST CASE
Testing Techniques
A test plan is a document which describes the approach, scope, resources and
schedule of the intended testing activities. It identifies, among other things, the
test items, the features to be tested, the testing tasks, who will do each task, the
degree of tester independence, the test environment, the test design techniques, the
entry and exit criteria to be used and the rationale for their choice, and any risks
requiring contingency planning. It is a record of the test planning process. Test
plans are usually prepared with significant input from test engineers.
(I) UNIT TESTING
Unit testing involves the design of test cases that validate the internal program
logic. All decision branches and internal code flow should be validated. It is
performed after an individual unit is completed and before integration. The unit
test thus performs basic tests at the component level and tests a particular
business process, system configuration, etc. Unit tests ensure that each unique
path of a process performs accurately to the documented specifications and contains
clearly defined inputs and expected results.
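A unit test of this kind might look like the following sketch, written with Python's built-in unittest module. The function under test, build_feature_vector, is a hypothetical helper introduced here for illustration; it is not part of the project's source code.

```python
import unittest

def build_feature_vector(sqft, bath, bhk, location, locations):
    """Hypothetical helper: numeric features followed by a one-hot location."""
    if sqft <= 0:
        raise ValueError("sqft must be positive")
    vector = [float(sqft), float(bath), float(bhk)] + [0.0] * len(locations)
    if location in locations:
        vector[3 + locations.index(location)] = 1.0
    return vector

class BuildFeatureVectorTest(unittest.TestCase):
    LOCATIONS = ["Kothanur", "Whitefield"]

    def test_known_location_sets_one_flag(self):
        vec = build_feature_vector(1200, 2, 2, "Whitefield", self.LOCATIONS)
        self.assertEqual(vec, [1200.0, 2.0, 2.0, 0.0, 1.0])

    def test_unknown_location_leaves_flags_zero(self):
        vec = build_feature_vector(1200, 2, 2, "Elsewhere", self.LOCATIONS)
        self.assertEqual(vec[3:], [0.0, 0.0])

    def test_non_positive_area_is_rejected(self):
        with self.assertRaises(ValueError):
            build_feature_vector(0, 2, 2, "Whitefield", self.LOCATIONS)

def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BuildFeatureVectorTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test covers one decision branch of the helper (known location, unknown location, invalid input) with clearly defined inputs and expected results.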
Functional tests provide a systematic demonstration that the functions tested are
available as specified by the technical requirements, the system documentation and
the user manual.
System testing, as the name suggests, ensures that the software system meets the
business requirements and aims. It tests a configuration to ensure known and
predictable results. System testing is based on process descriptions and flows,
emphasizing pre-driven process links and integration points.
V) WHITE BOX TESTING
White box testing is the type of testing in which the internal components of the
software are visible to and can be examined by the tester. It is therefore a
complex type of testing. All the data structures, components, etc. are tested
by the tester himself to find possible bugs or errors. It is used in situations in
which black box testing is incapable of finding a bug. It takes more time to apply.
Black box testing is the type of testing in which the internal components of the
software are hidden, and only the inputs and outputs of the system are available to
the tester for finding a bug. It is therefore a simpler type of testing, and a
programmer with basic knowledge can carry it out. It is less time-consuming than
white box testing. It is very successful for software that is less complex and
straightforward in nature. It is also less costly than white box testing.
RESULTS:
area_type           | availability  | location                 | size      | society | total_sqft | bath | balcony | price
Super built-up Area | 19-Dec        | Electronic City Phase II | 2 BHK     | Coomee  | 1056       | 2    | 1       | 39.07
Plot Area           | Ready To Move | Chikka Tirupathi         | 4 Bedroom | Theanmp | 2600       | 5    | 3       | 120
Built-up Area       | Ready To Move | Uttarahalli              | 3 BHK     |         | 1440       | 2    | 3       | 62
Super built-up Area | Ready To Move | Lingadheeranahalli       | 3 BHK     | Soiewre | 1521       | 3    | 1       | 95
Super built-up Area | Ready To Move | Kothanur                 | 2 BHK     |         | 1200       | 2    | 1       | 51
Super built-up Area | Ready To Move | Whitefield               | 2 BHK     | DuenaTa | 1170       | 2    | 1       | 38
Super built-up Area | 18-May        | Old Airport Road         | 4 BHK     | Jaades  | 2732       | 4    |         | 204
Super built-up Area | Ready To Move | Rajaji Nagar             | 4 BHK     | Brway G | 3300       | 4    |         | 600
Super built-up Area | Ready To Move | Marathahalli             | 3 BHK     |         | 1310       | 3    | 1       | 63.25
Plot Area           | Ready To Move | Gandhi Bazar             | 6 Bedroom |         | 1020       | 6    |         | 370
Super built-up Area | 18-Feb        | Whitefield               | 3 BHK     |         | 1800       | 2    | 2       | 70
Plot Area           | Ready To Move | Whitefield               | 4 Bedroom | Prrry M | 2785       | 5    | 3       | 295
Super built-up Area | Ready To Move | 7th Phase JP Nagar       | 2 BHK     | Shncyes | 1000       | 2    | 1       | 38
Built-up Area       | Ready To Move | Gottigere                | 2 BHK     |         | 1100       | 2    | 2       | 40
Plot Area           | Ready To Move | Sarjapur                 | 3 Bedroom | Skityer | 2250       | 3    | 2       | 148
Super built-up Area | Ready To Move | Mysore Road              | 2 BHK     | PrntaEn | 1175       | 2    | 2       | 73.5
Super built-up Area | Ready To Move | Bisuvanahalli            | 3 BHK     | Prityel | 1180       | 3    | 2       | 48
Super built-up Area | Ready To Move | Raja Rajeshwari Nagar    | 3 BHK     | GrrvaGr | 1540       | 3    | 3       | 60
These are sample rows from the preloaded dataset used in our model.
Graph:
Before deleting anomalies:
This graph shows bathrooms per property.
This graph represents property price by square feet.
Importing libraries:
We use the pandas library to read the train and test files.
import pandas as pd  # used for data analysis
Data preprocessing :
We take 80% of our data as training data and 20% as test data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
Eg:
Dependent variable in our model is price(since it relies on other factors for its value)
Linear regression:
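import numpy as np

def predict_price(location, sqft, bath, bhk):
    # Start from an all-zero feature vector with one slot per column of X.
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    # Set the one-hot flag for the location, if it is a known column.
    matches = np.where(X.columns == location)[0]
    if matches.size > 0:
        x[matches[0]] = 1
    return regressor.predict([x])[0]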
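To make explicit what the fitted linear regressor computes for such a feature vector, the sketch below reproduces the arithmetic (intercept plus the weighted sum of the features) in plain Python. The column names, coefficients and intercept are hypothetical values chosen for illustration; they are not the project's fitted model.

```python
def predict_price_sketch(location, sqft, bath, bhk, columns, coef, intercept):
    """Linear prediction: intercept + sum(coef_i * x_i)."""
    x = [0.0] * len(columns)
    x[0], x[1], x[2] = sqft, bath, bhk
    # Only the columns after the three numeric features are location flags.
    if location in columns[3:]:
        x[columns.index(location)] = 1.0
    return intercept + sum(c * v for c, v in zip(coef, x))

# Hypothetical columns and coefficients, chosen only for illustration.
columns = ["total_sqft", "bath", "bhk", "Kothanur", "Whitefield"]
coef = [0.05, 2.0, 1.0, -5.0, 10.0]
intercept = 3.0
known = predict_price_sketch("Whitefield", 1000, 2, 2, columns, coef, intercept)
unknown = predict_price_sketch("Elsewhere", 1000, 2, 2, columns, coef, intercept)
```

An unknown location leaves every location flag at zero, so the prediction falls back to the numeric features alone.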
Screenshot:
CHAPTER 6
In this paper, several tests have been performed using the linear regression
algorithm for house price prediction. The algorithm predicts the prices of new
properties that are going to be listed by taking some input variables and predicting
a correct and justified price. Building this predictive sale price model was a great
learning experience. In the future, different methods suited to time-series data
will be used in the research to obtain smaller prediction errors, and more data will
be used to get better results.
References
7. Prediction of real estate price variation based on economic
parameters, Li Li ; Kai-Hsuan Chu, 2017 International Conference
on Applied System Innovation (ICASI)
Paper Acceptance mail:
Plagiarism report:
C.Journal Paper
House Price Prediction using machine learning
K Pavan,T Raghul
Abstract:
A house price index usually represents the summarized price changes of residential housing. To make it easier for
a family to search for a house, we make the estimate more precise by asking for the required square footage and the
number of bedrooms and bathrooms. Using a preloaded dataset and its features, this paper examines a practical data
pre-processing and feature engineering method. The paper also proposes a regression technique in machine learning to
predict house prices.
1. INTRODUCTION:
Data is at the heart of technical innovations; achieving any result is now possible using predictive
models. Machine learning is extensively used in this approach. Machine learning means providing a valid
dataset on which further predictions are based; the machine itself learns how much importance a
particular event may have on the entire system based on its pre-loaded data and predicts the result
accordingly. Various modern applications of this technique include predicting stock prices, predicting
the possibility of an earthquake and predicting company sales, and the list has endless possibilities.
Our aim is to predict house prices based on customers' needs and priorities. Future prices are predicted
by analyzing previous market trends and price ranges, as well as upcoming developments. The system
involves a website that accepts the customer's specifications and then applies a neural network to them.
1.1 Machine Learning
Machine learning is a subset of artificial intelligence (AI). It gives a system the ability to learn and
improve automatically. It focuses on the development of computer programs that can access data and learn
from it by themselves. The process of learning begins with observations based on the examples that we
provide. The aim is to make computers learn by themselves without human intervention.
1.2 Machine Learning Methods
Machine learning can be classified into three types, namely supervised, unsupervised and reinforcement
learning. Supervised machine learning algorithms apply what has been learned in the past to new data in
order to predict future events. The algorithm analyzes a known training dataset and produces a function
to predict outputs.
After training, the system can provide outputs for new inputs. It can also compare its output with the
correct, intended output, find the errors and modify the model accordingly to make it more practical and
useful.
In contrast, unsupervised machine learning algorithms are the ones which do not require any
supervision. They are used when the sample data used to train is not classified or labeled. As the name
suggests, the model itself finds the hidden patterns and insights. The system may or may not produce the
right output, but it explores the data and can draw inferences from datasets on its own.
Semi-supervised machine learning is a combination of both supervised and unsupervised learning. In
semi-supervised learning, an algorithm learns from a dataset that includes both labeled and unlabeled
data, usually mostly unlabeled. Generally it is chosen when labeling the sample data requires skilled
resources in order to train from it. Otherwise, it does not require additional resources.
Reinforcement machine learning is a learning method that works based on feedback. Reinforcement learning differs from supervised learning in that it does not need labelled input/output pairs to be presented. It is studied in various disciplines such as statistics and information theory.
The method that we have used here is supervised machine learning.
A house is one of the most valuable assets an individual can purchase during their life. Hence we need to
Advantages
Space complexity is very low: the model only needs to save the weights at the end of training. Hence it is a low latency algorithm.
It is very simple to understand.
Good interpretability.
Feature importance is generated at the time the model is built, and feature selection can be handled, so dimensionality reduction can be achieved.
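The points about low space complexity and interpretability can be seen in a small sketch. The feature names follow the dataset columns used in the code listing, but the weights and intercept are hypothetical numbers, not values learned in this project. A trained linear model keeps only one weight per feature plus an intercept, and each prediction is a single weighted sum.

```python
# A fitted linear model is only a list of weights plus an intercept.
# All numbers below are hypothetical and chosen for illustration.
FEATURES = ["total_sqft", "bath", "bhk"]
WEIGHTS = [0.05, 4.0, 6.0]  # price change per unit of each feature
INTERCEPT = 10.0


def predict(row):
    """Prediction is one dot product: cheap to store and fast to evaluate."""
    return INTERCEPT + sum(w * row[name] for name, w in zip(FEATURES, WEIGHTS))


def explain(row):
    """Interpretability: each feature's contribution is weight * value."""
    return {name: w * row[name] for name, w in zip(FEATURES, WEIGHTS)}


house = {"total_sqft": 1000, "bath": 2, "bhk": 3}
price = predict(house)
contributions = explain(house)
stored_numbers = len(WEIGHTS) + 1  # everything the model must keep
print(price, contributions, stored_numbers)
```

Because every contribution is simply weight times value, it is easy to see which feature drives a given prediction.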
Multi Linear Regression
Multiple linear regression shows the relationship between two or more explanatory variables and a scalar response variable. Each value of the independent variables is associated with a value of the dependent variable.
Limitations
The dependent variable y must be continuous. The independent variables can be of any type. The dependent variable is usually dependent on the independent variables.
2.2 Proposed System
Linear regression is a technique that helps to identify the relationship between a dependent variable and one or more independent variables. The regression technique that we used here is linear regression.
2. RAM : 8 GB
3. PROCESSOR : 2.4 GHZ
4. MAIN MEMORY : 8 GB RAM
5. PROCESSING SPEED : 600 MHZ
6. HARD DISK DRIVE : 1 TB
7. KEYBOARD : 104 KEYS
3.2 SOFTWARE REQUIREMENTS
Software requirements define the resources and prerequisites that need to be installed on a computer for an application to function. These requirements need to be installed separately. The minimal software requirements are as follows:
FRONT END : PYTHON
IDE : ANACONDA
OPERATING SYSTEM : WINDOWS 10
4. ARCHITECTURE OF PROPOSED SYSTEM:
The dataset used in this project was
Parameters such as area in square meters, location, number of bedrooms and no
allocate 80% of the dataset to the training set and the remaining 20% to the test set.
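The 80/20 allocation and the fitting of a multiple linear regression can be sketched as follows. This is a minimal illustration on synthetic data, assuming three numeric features (area, bedrooms, bathrooms) and made-up weights; it uses plain NumPy least squares rather than the scikit-learn calls of the code listing, where `train_test_split` with `test_size=0.2` performs the same kind of split in one call.

```python
import numpy as np

rng = np.random.default_rng(10)

# Synthetic data (hypothetical): 100 houses with area, bedrooms and bathrooms.
n = 100
area = rng.uniform(500, 3000, n)
bedrooms = rng.integers(1, 5, n)
bathrooms = rng.integers(1, 4, n)
X = np.column_stack([area, bedrooms, bathrooms])
true_weights = np.array([0.05, 8.0, 3.0])
y = 12.0 + X @ true_weights  # noise-free prices for a clear check

# Shuffle the row indices, then allocate 80% to training and 20% to testing.
indices = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = indices[:split], indices[split:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Multiple linear regression by least squares; the column of ones is the intercept.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Evaluate on the held-out 20% with the R^2 score.
A_test = np.column_stack([np.ones(len(X_test)), X_test])
pred = A_test @ coef
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(len(X_train), len(X_test), round(r2, 4))
```

The test rows are never seen during fitting, so the score on them indicates how well the model generalizes to new houses.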
7.Conclusion
References
1. Housing Price Prediction Using Machine
Learning Algorithms: The Case of Melbourne City,
Australia, The Danh Phan.
CODING:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams["figure.figsize"] = (20,10)
# raw string so that the backslashes of the Windows path are not read as escapes
dataset = pd.read_csv(r'..\dataset\Bengaluru_House_Data.csv')
print(dataset.head(10))
print(dataset.shape)
# Data preprocessing
print(dataset.groupby('area_type')['area_type'].agg('count'))
dataset.drop(['area_type','society','availability','balcony'],
             axis='columns', inplace=True)
print(dataset.shape)
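## data cleaning
print(dataset.isnull().sum())
dataset.dropna(inplace=True)
print(dataset.shape)
print(dataset['size'].unique())
# assumption: 'bhk' (used later) is the leading integer of 'size', e.g. '2 BHK' -> 2
dataset['bhk'] = dataset['size'].apply(lambda x: int(x.split(' ')[0]))
print(dataset['total_sqft'].unique())
def is_float(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True
# rows whose 'total_sqft' is not a plain number (ranges, other units)
print(dataset[~dataset['total_sqft'].apply(is_float)].head(10))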
#### defining a function to convert the range of column values to a single value
def convert_sqft_to_num(x):
    tokens = x.split('-')
    try:
        if len(tokens) == 2:
            # assumption: a range such as '2100 - 2850' is replaced by its midpoint
            return (float(tokens[0]) + float(tokens[1])) / 2
        return float(x)
    except ValueError:
        # values with units such as '4.46Sq. Meter' cannot be converted
        return None
print(convert_sqft_to_num('290'))
print(convert_sqft_to_num('2100 - 2850'))
print(convert_sqft_to_num('4.46Sq. Meter'))
dataset['total_sqft'] = dataset['total_sqft'].apply(convert_sqft_to_num)
print(dataset['total_sqft'].head(10))
print(dataset.loc[30])
## feature engineering
print(dataset.head(10))
# the factor 100000 converts 'price' into rupees (assuming it is given in lakhs)
dataset['price_per_sqft'] = dataset['price']*100000/dataset['total_sqft']
print(dataset['price_per_sqft'])
print(len(dataset['location'].unique()))
#### 'location_stats' holds the number of occurrences of each location
location_stats = dataset.groupby('location')['location'].agg('count').sort_values(ascending=False)
print(location_stats[0:10])
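print(len(location_stats[location_stats <= 10]))
location_stats_less_than_10 = location_stats[location_stats <= 10]
print(location_stats_less_than_10)
#### grouping every location whose occurrence is <= 10
#### (assumption: the label 'other' is used for these locations)
dataset['location'] = dataset['location'].apply(
    lambda x: 'other' if x in location_stats_less_than_10 else x)
print(dataset['location'].head(10))
print(len(dataset['location'].unique()))
### checking 'total_sqft'/'bhk': if it is very low, the row is likely an anomaly
print(dataset.shape)
# assumption: the original threshold is not shown; 300 sqft per bedroom is a placeholder
dataset = dataset[~(dataset['total_sqft'] / dataset['bhk'] < 300)]
print(dataset.shape)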
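### checking columns where 'price_per_sqft' is very low
print(dataset['price_per_sqft'].describe())
### function to remove these extreme cases of very high or low values
def remove_pps_outliers(df):
    kept = []
    # assumption: the statistics are computed separately for each location
    for key, subdf in df.groupby('location'):
        mean = np.mean(subdf['price_per_sqft'])
        std = np.std(subdf['price_per_sqft'])
        # assumption: rows within one standard deviation of the mean are kept
        kept.append(subdf[(subdf['price_per_sqft'] > (mean - std)) &
                          (subdf['price_per_sqft'] <= (mean + std))])
    df_out = pd.concat(kept, ignore_index=True)
    return df_out
dataset = remove_pps_outliers(dataset)
print(dataset.shape)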
### plotting graph to visualize properties in the same location
### where the price of 3 bhk properties with higher 'total_sqft' is less than
### the price of 2 bhk properties
def plot_scatter_chart(df,location):
    # rows of the given location with 2 and with 3 bedrooms
    bhk2 = df[(df['location'] == location) & (df['bhk'] == 2)]
    bhk3 = df[(df['location'] == location) & (df['bhk'] == 3)]
    matplotlib.rcParams['figure.figsize'] = (15,10)
    plt.scatter(bhk2['total_sqft'],
                bhk2['price'],
                color='blue',
                label='2 BHK',
                s=50)
    plt.scatter(bhk3['total_sqft'],
                bhk3['price'],
                marker='+',
                color='green',
                label='3 BHK',
                s=50)
    plt.ylabel('Price')
    plt.title(location)
    plt.legend()
    plt.show()
plot_scatter_chart(dataset,"Hebbal")
plot_scatter_chart(dataset,"Rajaji Nagar")
### defining a function to get the rows where 'bhk' & 'location'
### are the same but the property with fewer 'bhk' has a higher price than the
### property with more 'bhk'. So, it's also an anomaly and we have to remove
### these properties
def remove_bhk_outliers(df):
    exclude_indices = np.array([], dtype=int)
    # assumption: statistics are collected per location, then per 'bhk'
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df['price_per_sqft']),
                'std': np.std(bhk_df['price_per_sqft']),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            # assumption: compare only when the (bhk-1) group has more than 5 rows
            if stats and stats['count'] > 5:
                exclude_indices = np.append(exclude_indices,
                    bhk_df[bhk_df['price_per_sqft'] < (stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')
dataset = remove_bhk_outliers(dataset)
print(dataset.shape)
### plotting the same locations again after removing the bhk outliers,
### reusing plot_scatter_chart defined above
plot_scatter_chart(dataset,"Hebbal")
plot_scatter_chart(dataset,"Rajaji Nagar")
### histogram of 'price_per_sqft'
matplotlib.rcParams['figure.figsize'] = (20,10)
plt.hist(dataset['price_per_sqft'], rwidth=0.8)
plt.ylabel('Count')
plt.show()
### exploring bathroom feature
print(dataset['bath'].unique())
# histogram of the number of bathrooms per property
plt.hist(dataset['bath'], rwidth=0.8)
plt.xlabel('Number of Bathrooms')
plt.ylabel('Count')
plt.show()
print(dataset.shape)
print(dataset.head())
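## one hot encoding the 'location' column
dummies = pd.get_dummies(dataset['location'])
print(dummies.head())
dataset = pd.concat([dataset, dummies], axis='columns')
# assumption: 'location' and the helper columns are dropped here so that the
# features start with total_sqft, bath, bhk (the order predict_price relies on)
dataset = dataset.drop(['location','size','price_per_sqft'], axis='columns')
print(dataset.head())
print(dataset.shape)
X = dataset.drop(['price'],axis= 'columns')
y = dataset['price']
print(X.shape)
print(y.shape)
from sklearn.model_selection import train_test_split
# 80% of the rows for training, 20% for testing
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=10)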
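from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
regressor = LinearRegression()
regressor.fit(X_train,y_train)
print(regressor.score(X_test,y_test))
# cross validation with the same splitter that the grid search uses
cv = ShuffleSplit(n_splits=5,test_size=0.2,random_state=0)
print(cross_val_score(regressor,X,y,cv=cv))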
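from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
def find_best_model_using_gridsearch(X,y):
    algos = {
        'linear_regression': {
            'model': LinearRegression(),
            # assumption: the original grid is not shown; 'fit_intercept' is a stand-in
            'params': {
                'fit_intercept': [True,False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random','cyclic']
            }
        },
        'decision_tree':{
            'model': DecisionTreeRegressor(),
            'params': {
                # 'squared_error' is the current name of the former 'mse' criterion
                'criterion': ['squared_error','friedman_mse'],
                'splitter': ['best','random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5,test_size=0.2,random_state=0)
    # running one grid search per algorithm and recording its best result
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'],
                          config['params'],
                          cv=cv,
                          n_jobs=-1,
                          return_train_score=False
                          )
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })
    return pd.DataFrame(scores,columns=['model','best_score','best_params'])
model_scores = find_best_model_using_gridsearch(X,y)
print(model_scores)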
### so after running grid search, the linear regression model has the best score
regressor = LinearRegression()
regressor.fit(X,y)
def predict_price(location,sqft,bath,bhk):
    # position of the location's one hot column, or -1 if the location is unknown
    matches = np.where(X.columns == location)[0]
    loc_index = matches[0] if len(matches) > 0 else -1
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index >= 0:
        x[loc_index] = 1
    return regressor.predict([x])[0]
print(predict_price('Indira Nagar',1000,3,3))
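# exporting the trained model
import pickle
with open('bangalore_home_prices_model.pickle','wb') as f:
    pickle.dump(regressor,f)
# exporting columns
import json
# assumption: 'columns' is the list of feature names in training order
columns = list(X.columns)
with open("columns.json","w") as f:
    f.write(json.dumps(columns))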
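The exported files could later be loaded by the website back end to answer price queries. The following is only a sketch under stated assumptions: the model is replaced by a plain dictionary of hypothetical weights so that the example runs without the trained regressor, the column file is assumed to be a JSON list of feature names, and the file names and the location "Nowhere" are placeholders.

```python
import json
import os
import pickle
import tempfile

# Stand-in artefacts (hypothetical values); the real script pickles a fitted regressor.
model = {"intercept": 10.0, "weights": [0.05, 4.0, 6.0, 20.0, -5.0]}
columns = ["total_sqft", "bath", "bhk", "Hebbal", "Rajaji Nagar"]

workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model.pickle")
columns_path = os.path.join(workdir, "columns.json")
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(columns_path, "w") as f:
    json.dump(columns, f)

# Later, e.g. in the web back end: load both files once at start-up.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)
with open(columns_path) as f:
    loaded_columns = json.load(f)


def predict_price(location, sqft, bath, bhk):
    """Rebuild the feature vector in the saved column order and apply the model."""
    x = [0.0] * len(loaded_columns)
    x[0], x[1], x[2] = sqft, bath, bhk
    if location in loaded_columns:  # unknown locations keep all dummies at 0
        x[loaded_columns.index(location)] = 1.0
    return loaded_model["intercept"] + sum(
        w * v for w, v in zip(loaded_model["weights"], x))


known = predict_price("Hebbal", 1000, 2, 3)
unknown = predict_price("Nowhere", 1000, 2, 3)
print(known, unknown)
```

Saving the column order next to the model matters because the one hot encoded location must be placed at the same position that was used during training.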