TRINITY INSTITUTE OF PROFESSIONAL

STUDIES

SEC-9 DWARKA, NEW DELHI- 110075


(AFFILIATED TO)

GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, SECTOR-16C, DWARKA, NEW DELHI

A
SUMMER TRAINING PROJECT REPORT ON
“HOUSE PRICE PREDICTION USING DATA ANALYTICS”

BACHELOR OF COMPUTER APPLICATION

BATCH: 2021-2024

SUBMITTED BY: RAM TANWAR
PROJECT GUIDE: Ms. NISHIKA


DESIGNATION: ASSISTANT PROFESSOR
ENROLLMENT NO : 01724002021
SEMESTER: 5th
SHIFT: 2nd

1
CERTIFICATION OF COMPLETION

2
3
DECLARATION BY THE CANDIDATE

I hereby declare that the work presented in this report entitled "HOUSE PRICE
PREDICTION USING DATA ANALYTICS", in fulfilment of the requirement for the
award of the degree of Bachelor of Computer Application, Trinity Institute of Professional
Studies, Dwarka, New Delhi, is an authentic record of my own work carried out during my
degree under the guidance of Ms. Nishika. The work reported herein has not been
submitted by me for the award of any other degree or diploma.

Date: 15 October 2023 Name: RAM TANWAR (01724002021)


Place: New Delhi Semester: BCA 5th Sem

4
To Whom It May Concern

I, RAM TANWAR, Enrolment No. 01724002021, from BCA V Semester, Second Shift,
of the Trinity Institute of Professional Studies, Delhi, hereby declare that the Summer
Project Training (BCA-331) entitled "House Price Prediction using Data Analytics"
at Shri Sahib Technologies is an original work and the same has not been submitted to
any other institute for the award of any other degree. A presentation of the Summer
Project Training was made on October 15, 2023 and the suggestions approved by the
faculty were duly incorporated.

Date: Signature of the Student

Certified that the Dissertation submitted in partial fulfilment of the Bachelor of Computer
Applications (BCA) to be awarded by G.G.S.I.P. University, Delhi, by RAM
TANWAR, Enrolment No. 01724002021, has been completed under my guidance and
is satisfactory.

Date: Signature of the Guide

Name of the Guide:


Ms. NISHIKA

Associate Professor

5
ACKNOWLEDGEMENT

I express my sincere gratitude to Ms. Nishika (Associate Professor) for her valuable
guidance and timely suggestions during the entire duration of my dissertation work, without
which this work would not have been possible. I would also like to convey my deep regards
to all the other faculty members whose effort and guidance at the appropriate times made it
possible for me to finish this work. Finally, I would also like to thank my friends for their
advice and for pointing out my mistakes.

Name: RAM TANWAR (01724002021)


Semester: BCA 5th Sem

6
TABLE OF CONTENTS

S.NO. TOPICS PAGE NO.

1 List of Figures 7
2 List of Abbreviations 8
3 Abstract 9
4 Chapter 1: Introduction 10-13
1.1 About the Organization
1.2 Objective
1.3 Purpose and Scope
1.4 Statement of Problem
5 Chapter 2: System Analysis 15-21
2.1 Feasibility Study
2.1.1 Technical Feasibility
2.1.2 Economic Feasibility
2.1.3 Operational Feasibility
2.1.4 Other Feasibility Dimensions
2.2 Analysis Methodology
2.3 Choice of the Platform
2.4 Specific Requirements
2.4.1 Hardware Requirements
2.4.2 Software Requirements
2.5 Technology Used
6 Chapter 3: System Design 23-40
3.1 Business Model
3.2 DFD Diagram
3.3 Interface Design
7 Chapter 4: Testing 42-45
4.1 Types of Testing
4.1.1 Unit Testing
4.1.2 Module Testing
4.1.3 Integration Testing
4.1.4 System Testing
4.1.5 White Box & Black Box Testing
4.1.6 Acceptance Testing
4.2 Test Data and Cases
8 Chapter 5: Conclusion & References 47-48
5.1 Conclusion
5.2 System Specifications
5.3 References
9 Chapter 6: Annexures 50-52
A-10 Sample Outputs 52

7
LIST OF FIGURES

S.NO. FIGURE NAME PAGE NO.
1 Figure 3.1: Business Model Diagram 21
2 Figure 3.2: Level 0 DFD 23
3 Figure 3.3: Level 1 DFD 23

8
LIST OF ABBREVIATIONS

S.NO. Abbreviated Name Full Name
1 IBM International Business Machines Corporation
2 AI Artificial Intelligence
3 EDA Exploratory Data Analysis
4 SVM Support Vector Machine
5 GPU Graphics Processing Unit
6 RAM Random Access Memory
7 CUDA Compute Unified Device Architecture
8 CSV Comma-Separated Values
9 SQL Structured Query Language
10 API Application Programming Interface
11 SEO Search Engine Optimization
12 DFD Data Flow Diagram

9
ABSTRACT

In the fast-paced world of real estate, predicting house prices accurately makes a real
difference to everyone involved: buyers looking for the right deal, sellers trying to set a
competitive price, and investors seeking worthwhile opportunities. This project harnesses
data analytics to build a dependable house price prediction model. We work through
historical housing data with a toolbox of data analysis techniques, aiming not to guess but
to forecast house prices as precisely as the data allows.

The project is about more than crunching numbers; it is about giving users the confidence
to make informed decisions in the real estate market. An estimate of what a property is
likely to be worth in the coming months or years is a substantial advantage, and here that
estimate rests on solid data and proven methods rather than intuition. The chapters that
follow describe how the model was built and evaluated so that its predictions are
professionally grounded rather than casual guesses in the ever-evolving world of real estate.

10
CHAPTER 1: INTRODUCTION

1.1 ABOUT THE ORGANIZATION

Our clients’ systems support modern society. In making them faster, more
productive, and more secure, we don’t just make business work better. We
make the world work better.
We bring together all the necessary technology and services, regardless of where those
solutions come from, to help clients solve the most pressing business problems.
IBM integrates technology and expertise, providing infrastructure, software (including
market-leading Red Hat) and consulting services for clients as they pursue the digital
transformation of the world’s mission-critical businesses.
The main goal of IBM is to be a leading provider of innovative technology solutions
and services that help clients solve complex business problems and drive digital
transformation.
1.2 OBJECTIVE
The primary objective of implementing a house price prediction project using data
analysis for IBM can be multifaceted and may include:
1. Business Decision Support: Provide IBM with a powerful tool to make informed
business decisions related to real estate investments, acquisitions, and strategic
planning. This includes identifying potential opportunities, assessing risks, and
optimizing the allocation of resources.
2. Customer Engagement: Enhance IBM's engagement with its customers by offering
them valuable insights into the real estate market. This can be in the form of tools or
services that help clients make data-driven decisions about buying, selling, or investing
in properties.
3. Data-Driven Innovation: Showcase IBM's commitment to data-driven innovation
and technology leadership. Demonstrating the capability to harness data for predictive
analytics can strengthen IBM's reputation as a technology and solutions provider.
4. Competitive Advantage: Gain a competitive edge in the real estate technology market
by offering a sophisticated house price prediction solution. This can attract clients
seeking advanced analytics solutions and differentiate IBM from competitors.
5. Revenue Generation: Explore opportunities for revenue generation through the sale
of predictive analytics services, licensing of predictive models, or providing
subscription-based access to the prediction platform.
6. Research and Development: Foster ongoing research and development efforts in the
field of data analytics, machine learning, and artificial intelligence. The project can
serve as a platform for testing and advancing IBM's AI and analytics capabilities.

11
7. User Empowerment: Empower IBM's clients and partners with a user-friendly
interface that allows them to harness the power of data analysis without requiring
extensive technical knowledge.
8. Transparency and Trust: Build trust with clients and stakeholders by providing
transparent and interpretable predictions. Ensure that users understand the factors
contributing to house price predictions and can rely on the accuracy of the model.
9. Scalability: Design the solution to be scalable, allowing IBM to adapt to changing
market dynamics and expanding data sources.
10. Compliance and Ethical Considerations: Ensure that the project complies with data
privacy regulations and ethical considerations related to data usage and analytics.
11. Educational Outreach: Use the project as an educational tool to help clients and
stakeholders understand the value of data analytics in the real estate industry. Provide
training and resources to promote data literacy.
12. Sustainability: Consider the environmental impact of the project and aim to
minimize resource consumption while maximizing the benefits of data analysis.

Ultimately, the objective of a house price prediction project for IBM is to leverage data
analysis to provide value to the organization, its clients, and the real estate market as a
whole. This can be achieved by delivering accurate predictions, fostering innovation,
and supporting data-driven decision-making in the dynamic and competitive real estate
industry.

1.3 PURPOSE AND SCOPE


The purpose and scope of a house price prediction project using data analysis can be defined
to provide clarity on the project's objectives, target outcomes, and the extent of its
coverage. Here are the purpose and scope considerations for such a project:
1.3.1 Purpose
1. Informed Decision-Making: The primary purpose is to assist individuals,
organizations, and stakeholders in the real estate market in making informed decisions
related to buying, selling, investing, or managing properties.
2. Risk Mitigation: Provide a tool to assess and mitigate risks associated with real estate
transactions, such as overpaying for a property or making an unprofitable investment.
3. Market Insights: Generate insights into the housing market's dynamics, trends, and
factors influencing property prices, which can benefit real estate professionals,
investors, and homebuyers.
4. Data-Driven Investment: Enable data-driven real estate investment strategies by
identifying properties with the potential for appreciation and optimizing investment
portfolios.

12
5. Customer Engagement: Engage customers and clients by offering data-driven
services and tools that enhance their experience and help them achieve their real estate
goals.
6. Competitive Advantage: Establish a competitive advantage in the real estate
technology market by providing a sophisticated predictive analytics solution.
7. Research and Innovation: Foster research and innovation in data analytics, machine
learning, and artificial intelligence while addressing real-world challenges in the real
estate domain.

1.3.2 Scope
1. Data Sources: Define the sources of data to be used, which may include historical
property sales data, housing market indicators, economic data, geographical attributes,
and demographic information.
2. Geographical Coverage: Specify the geographical scope of the project, such as
whether it focuses on a specific region, city, or covers a broader national or international
market.
3. Property Types: Determine the types of properties under consideration, such as
residential homes, commercial properties, or specific categories like apartments,
condos, or single-family houses.
4. Predictive Models: Outline the types of predictive models and algorithms that will
be employed for house price prediction, including regression analysis, machine learning
models, or specialized approaches.
5. Feature Engineering: Describe the feature engineering techniques that will be applied
to extract meaningful predictors from the data, including property features, location-
based attributes, and market indicators.
6. Model Evaluation: Define the criteria for evaluating the performance of predictive
models, considering metrics like accuracy, mean squared error, and interpretability.
7. User Interface: Specify the design and functionality of the user interface, ensuring it
is user-friendly and provides transparent explanations of predictions.
8. Data Privacy: Address data privacy and security concerns, outlining how sensitive
information will be handled and ensuring compliance with relevant regulations.
9. Scalability: Consider whether the project is designed to scale with changing data
volumes and market dynamics.
10. Documentation and Reporting: Determine the scope of documentation and
reporting, including the presentation of findings, insights, and recommendations.
11. Ethical Considerations: Address ethical considerations related to data usage, bias
mitigation, and responsible AI practices.

13
12. Maintenance and Updates: Define the scope of ongoing maintenance, model
updates, and support for users.
13. Stakeholders: Identify the primary stakeholders, such as real estate professionals,
investors, homebuyers, and any other parties benefiting from the project.

1.4 STATEMENT OF PROBLEM

The real estate market is a complex and dynamic environment characterized by


fluctuating property prices influenced by a multitude of factors, including location,
property features, economic indicators, and market trends. In this context, there is a
compelling need for an accurate and reliable house price prediction system that
leverages data analysis techniques to assist homebuyers, sellers, real estate
professionals, and investors in making informed decisions.
Despite the wealth of data available, the challenge lies in developing a predictive model
that not only delivers accurate forecasts but also provides transparency and
interpretability. Existing solutions often lack the comprehensibility required for users
to trust and act upon the predictions. Additionally, the real estate market's geographic
diversity and the ever-evolving nature of influencing factors demand a robust and
adaptable approach.

14
15
CHAPTER 2: SYSTEM REQUIREMENT ANALYSIS

2.1 Feasibility Study


2.1.1 Technical Feasibility
Technical feasibility remains a crucial aspect of the project, ensuring that the chosen
open-source platforms and readily available data can effectively support the project's
goals.
a. Technology Stack: The project utilizes a modern and cost-effective technology
stack that leverages open-source tools such as Python, scikit-learn, Pandas,
Matplotlib, and Seaborn. These tools are readily available and do not involve
licensing fees.
b. Data Processing: The use of open-source tools like Pandas ensures efficient data
handling and preprocessing without the need for proprietary software.
c. Machine Learning Models: Scikit-learn's library of machine learning algorithms
provides cost-effective solutions for building predictive models without the need
for expensive proprietary software.
d. Data Visualization: Matplotlib and Seaborn offer cost-effective data visualization
capabilities, allowing us to create insightful visualizations without additional
costs.

2.1.2 Economic Feasibility


Given that there are no direct costs associated with data acquisition or software
licensing, the economic feasibility of the project is highly favorable:
1. Cost Estimation: The project's cost estimation is minimal, as the dataset was
readily available and we used an open-source platform, Google Colab, to run
the code.

2.1.3 Operational Feasibility


Operational feasibility remains a key consideration:
1. Impact Analysis: Implementing the project is expected to enhance decision-
making processes by providing valuable insights into housing market trends. The
impact on existing operations remains minimal.
2. Change Management: The change management plan, including training sessions
and communication strategies, ensures a smooth transition to the new data
analytics system without significant additional costs.
3. Resource Allocation: Adequate human resources have been allocated for project
implementation without incurring extra costs. Cross-functional teams are
prepared to collaborate effectively.
4. Compatibility: The project's compatibility with existing systems remains
unchanged, requiring minimal adjustments to data pipelines.

16
2.1.4 Other Feasibility Dimensions
Additional feasibility dimensions remain consistent:
1. Legal Feasibility: The project complies with data privacy regulations and
intellectual property laws. Data usage and handling adhere to legal requirements
without incurring extra legal costs.
2. Scheduling Feasibility: A detailed project timeline has been established, ensuring
timely delivery without incurring additional costs.
3. Resource Feasibility: The project utilizes existing physical resources and
infrastructure, minimizing resource-related expenses.

2.2 Analysis Methodology:


1. Data Import and Preprocessing:

 Import necessary libraries for data analysis, visualization, and machine learning.
 Load the dataset (housepred.csv) into a Pandas DataFrame.
 Check the data to understand its structure, including columns and data types.
 Categorize features based on their data types (int, float, object) and calculate the
number of each.

2. Exploratory Data Analysis (EDA):

 Visualize the correlation between numerical features using a heatmap.


 Examine unique values for categorical features to understand their distribution.
 Plot bar graphs to visualize the distribution of each categorical feature.

3. Data Cleaning:
 Handle missing values, in this case, by filling empty SalePrice values with their
mean values.
 Drop records with null values.
 Verify that no features have null values in the cleaned dataset.

4. One-Hot Encoding:

 Convert categorical variables into numerical format using one-hot encoding.

5. Data Splitting:

 Split the dataset into training and testing sets.


 Define the independent variables (X) and the dependent variable (Y) for
regression.

17
6. Regression Models:
Train and evaluate different regression models:
 Support Vector Machine (SVM) Regressor: tune hyperparameters using GridSearchCV;
calculate Mean Squared Error and R-squared score.
 Linear Regression: calculate Mean Absolute Percentage Error and R-squared score.
 Random Forest Regressor: tune hyperparameters using GridSearchCV; calculate
Mean Squared Error and R-squared score.

7. Results and Interpretation:

 Analyze the performance of each regression model based on the evaluation


metrics.
 Select the best-performing model for predicting SalePrice.
 Provide insights into the model's ability to predict house prices based on the
given features.

8. Conclusion:

 Summarize the findings and the chosen regression model.


 Discuss the potential applications and limitations of the model.
 Suggest any further improvements or areas for future research.
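
The complete code for these steps is given in Section 3.3. As a compact, simplified sketch of the overall flow described above (it assumes the same housepred.csv dataset and, for brevity, uses pandas get_dummies in place of the OneHotEncoder used later), the pipeline looks roughly like this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv('housepred.csv').dropna()                  # steps 1 and 3: import and clean
df = pd.get_dummies(df, drop_first=True)                    # step 4: one-hot encode categorical columns
X, Y = df.drop('SalePrice', axis=1), df['SalePrice']        # step 5: separate features and target
X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, Y_train)            # step 6: train one regression model
print("R-squared:", r2_score(Y_valid, model.predict(X_valid)))   # step 7: evaluate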

2.3 Choice of the platform


Platform: Google Colab
Google Colab offers the following advantages:
 Cloud-Based: It doesn't require any local installation, which is convenient for
collaboration and access from various devices.
 Free GPU Access: Google Colab provides free access to GPUs, which can
significantly accelerate machine learning tasks, especially training deep
learning models.
 Data Storage: We can easily upload and access datasets stored in Google Drive,
making it convenient for data manipulation.
 Preinstalled Libraries: Google Colab comes with many preinstalled data science
libraries, making it quick to set up and start working on projects.

18
Overall, Google Colab is a versatile and accessible platform for data analysis, machine
learning, and collaborative work in the field of data science.
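
As an illustration of the data storage point above, the sketch below shows one way to read the dataset directly from Google Drive in Colab; the Drive path is only an example, and Section 3.3.2 instead uploads the file with files.upload().

from google.colab import drive
import pandas as pd

drive.mount('/content/drive')   # prompts for authorization inside the notebook
# Example path only; adjust it to wherever housepred.csv is stored in your Drive
df = pd.read_csv('/content/drive/MyDrive/housepred.csv')
print(df.shape)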

2.4 SPECIFIC REQUIREMENTS


This section contains the software requirements to a level of detail sufficient to
enable designers to design the system, and testers to test the system.

2.4.1 Hardware Requirements


 Processing Power: A computer with a minimum of a quad-core CPU to
handle data processing and model training efficiently.
 Memory (RAM): At least 8GB of RAM for handling large datasets and
training complex machine learning models.
 Storage: Adequate storage space for datasets and model files. SSDs are
preferred for faster data access.
 GPU (Optional): For faster model training, a GPU (Graphics
Processing Unit) with CUDA support can be beneficial.

2.4.2 Software Requirements

 Python Environment: Set up a Python environment, preferably using


Anaconda, to manage libraries and dependencies.
 Google Colab: If you're using Google Colab for code development,
ensure it's properly configured.
 Python Libraries: Install necessary Python libraries such as NumPy,
Pandas, Matplotlib, Seaborn, Scikit-Learn, TensorFlow, and Keras for
data manipulation, visualization, and machine learning.
 Data: Access to a dataset containing historical house prices, which can
be sourced from real estate websites or datasets available online.

2.5 TECHNOLOGY USED


 Python: Python is a versatile programming language widely used in
data science. It's known for its simplicity and has a rich ecosystem of
libraries for data analysis and machine learning.
 Google Colab: Google Colab is a cloud-based Python environment
that provides free access to GPUs (Graphics Processing Units) and
allows users to run Python code in a Jupyter Notebook-like interface.
It's commonly used for machine learning and data analysis tasks.
 Pandas: Pandas is a Python library used for data manipulation and
analysis. It provides data structures like data frames and tools for
reading and writing data from various file formats.
 NumPy: NumPy is a fundamental library for numerical computing in
Python. It provides support for arrays and matrices, essential for
scientific and mathematical computations.

19
 Matplotlib and Seaborn: These libraries are used for data
visualization. Matplotlib offers a wide range of plot types, and
Seaborn is built on top of Matplotlib, providing a high-level interface
for creating attractive statistical graphics.
 Scikit-Learn: Scikit-Learn is a machine learning library for Python. It
includes tools for data preprocessing, model training, and evaluation, along
with a wide range of algorithms for classification and regression.
Python:

Figure 2.1

Python is a dynamic and high-level programming language renowned for its


versatility and readability. It was created by Guido van Rossum in the late 1980s and
has since become one of the most popular and widely used programming languages
globally.

What sets Python apart is its clean and easily understandable syntax, emphasizing
code readability. Instead of relying on complex punctuation and braces, Python uses
indentation to define code blocks. This feature makes Python particularly beginner-
friendly and encourages developers to write clean and maintainable code.

A key strength of Python lies in its extensive standard library, which offers a wealth
of modules and functions for various tasks, from file handling to networking and
data manipulation. This rich library ecosystem simplifies development, as
developers can leverage pre-built tools.
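
A few lines of code are enough to see both points: indentation alone defines the block structure, and the standard library (here, the statistics module) is available without any installation. The numbers below are placeholders.

from statistics import mean

prices = [250000, 325000, 410000]
average = mean(prices)            # standard-library function, no extra install needed
for price in prices:
    if price > average:           # indentation, not braces, defines this block
        print(price, "is above the average price")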

20
Google Colab:

Figure 2.2

Google Colab, short for Google Colaboratory, is a cloud-based platform that has
revolutionized the world of data science and machine learning. Developed by
Google, Colab offers a dynamic and collaborative environment for building,
training, and deploying machine learning models, all within a web browser. It has
become a game-changer in the field of data analysis and computational research.
Pandas:

Figure 2.3
Pandas, short for "Panel Data," is a widely-used library in the Python ecosystem,
renowned for its robust data manipulation and analysis capabilities. It serves as the
go-to tool for data professionals, scientists, analysts, and developers when it comes to
efficiently handling structured data. At its core, Pandas revolves around two primary
data structures: Series and DataFrame.
One of Pandas' standout features is its exceptional data loading capabilities. It can
effortlessly read data from a myriad of file formats including CSV, Excel, SQL
databases, and more, making the process of importing data for analysis remarkably
seamless.
Pandas also shines in data cleaning tasks. It provides powerful tools to handle missing
values, eliminate duplicates, and convert data types, ensuring your dataset is pristine
and ready for analysis.
Data filtering and selection are a breeze with Pandas. We can easily extract specific
subsets of data based on criteria relevant to your analysis, simplifying the exploration
of large datasets.
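
As a small illustration of these loading, cleaning, and filtering capabilities (assuming the project's housepred.csv file with its SalePrice and YearBuilt columns):

import pandas as pd

df = pd.read_csv('housepred.csv')                                   # load a CSV into a DataFrame
df = df.drop_duplicates()                                           # eliminate duplicate rows
df['SalePrice'] = df['SalePrice'].fillna(df['SalePrice'].mean())    # handle missing values
recent = df[df['YearBuilt'] >= 2000]                                # filter rows by a condition
print(recent[['YearBuilt', 'SalePrice']].head())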

21
Numpy:

Figure 2.4
NumPy, short for "Numerical Python," is a fundamental library for scientific
computing in the Python programming language. It provides support for large, multi-
dimensional arrays and matrices, along with a collection of high-level mathematical
functions to operate on these arrays. NumPy is a cornerstone library in the Python
data science and numerical computing ecosystem.
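
A brief sketch of the kind of array operations NumPy provides (the numbers are placeholders):

import numpy as np

areas = np.array([1200, 1500, 1800])           # one-dimensional array of floor areas
prices = np.array([240000, 310000, 390000])    # matching sale prices
price_per_sqft = prices / areas                # element-wise arithmetic on whole arrays
print(price_per_sqft.mean(), price_per_sqft.std())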
Matplotlib and Seaborn:

Figure 2.5 Figure 2.6


Matplotlib and Seaborn are two of the most prominent and widely used data
visualization libraries in the Python ecosystem. They empower data scientists,
analysts, and researchers to create insightful and visually appealing graphs, charts,
and plots to convey complex information effectively.
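
A minimal example of the two libraries working together (the data points are made up):

import matplotlib.pyplot as plt
import seaborn as sns

areas = [1200, 1500, 1800, 2100]
prices = [240000, 310000, 390000, 450000]
sns.scatterplot(x=areas, y=prices)    # Seaborn draws the statistical plot
plt.xlabel('Area (sq ft)')            # Matplotlib handles figure-level labelling
plt.ylabel('Sale price')
plt.title('Price vs. area')
plt.show()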
Scikit-Learn:

Figure 2.7
Scikit-Learn, often abbreviated as sklearn, is a popular and open-source machine
learning library built for Python. It serves as a valuable tool for machine learning and
data science practitioners, providing a wide range of tools and algorithms for tasks
such as classification, regression, clustering, dimensionality reduction, and more.
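
A minimal, self-contained sketch of the typical Scikit-Learn workflow used throughout this project (the data here is synthetic and purely illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X = np.array([[1200], [1500], [1800], [2100], [2400], [2700]])   # area in square feet
y = np.array([240000, 310000, 390000, 450000, 500000, 560000])   # sale price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R-squared:", r2_score(y_test, model.predict(X_test)))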

22
23
CHAPTER 3: SYSTEM DESIGN

3.1 BUSINESS MODEL

Figure 3.1: Business Model Diagram

3.1.1 Market Analysis


3.1.1.1 Market Size and Growth
The real estate market is substantial, with a global valuation of over $280 trillion.
The market is growing steadily, with increasing demand for digital tools and data-
driven insights.

3.1.1.2 Target Audience

 Real Estate Agents and Brokers: Seeking accurate pricing for listings.
 Homebuyers: Wanting to make informed purchase decisions.
 Home Sellers: Aiming to price their properties competitively.
 Investors: Analyzing potential returns on real estate investments.

24
3.1.2 Revenue Model
3.1.2.1 Pricing Strategy

 Subscription Model: Charge real estate professionals a monthly fee for access
to the platform.
 Pay-Per-Use Model: Charge users on a per-query basis for property
valuations.

3.1.3 Technology Stack


3.1.3.1 Machine Learning

 Algorithms: Use regression models (e.g., Linear Regression, Random Forest,
XGBoost) for price prediction.
 Data Preprocessing: Handle missing data, outliers, and feature engineering.
 Model Evaluation: Utilize metrics like Mean Absolute Error (MAE) and Root
Mean Squared Error (RMSE) for model performance evaluation.
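
A small sketch of how the MAE and RMSE metrics listed above can be computed with scikit-learn (the arrays are placeholder values):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([200000, 340000, 410000])   # actual sale prices
y_pred = np.array([195000, 355000, 400000])   # model predictions
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("MAE:", mae, "RMSE:", rmse)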

3.1.3.2 Infrastructure

 Cloud Computing: Host the application and databases on a cloud platform like
AWS, Azure, or Google Cloud.
 Database: Use relational databases for data storage and retrieval.
 APIs: Develop RESTful APIs for easy integration with external systems.

3.1.4 Data Strategy


3.1.4.1 Data Sources
 Property Listings: Gather data from real estate websites, MLS (Multiple
Listing Services), and public records.
 Economic Indicators: Include relevant economic data (e.g., interest rates, job
market) to improve predictions.
 User Feedback: Collect user feedback and usage data for continuous
improvement.

3.1.5 Marketing and User Acquisition


 Online Advertising: Utilize digital marketing channels like Google Ads,
Facebook Ads, and SEO optimization.
 Partnerships: Collaborate with real estate agencies, mortgage lenders, and
property listing websites.
 Content Marketing: Create informative blog posts, videos, and webinars about
property valuation and market trends.

3.1.6 Risk Analysis


 Data Quality: Inaccurate or incomplete data can lead to inaccurate predictions.
 Competition: Market incumbents have established user bases and resources.

25
3.2 DFD

Level 0: At the context level, the House Price Prediction system is shown as a single
process together with its external data flows, data stores, and processes.

Figure 3.2 Level 0 DFD

Level 1: The Level 1 DFD expands the system into its main processes (Data Import and
Preprocessing, Exploratory Data Analysis (EDA), and Model Evaluation), the data flows
between them (CSV data to preprocessing, pre-processed data to EDA, EDA visualizations,
and pre-processed data to model training), and the data stores involved (input data (CSV),
pre-processed data, and model evaluation metrics based on the R-squared method).

Figure 3.3 Level 1 DFD

26
3.3 Interface Design

3.3.1 Importing Libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_val_score, GridSearchCV

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score, mean_squared_error

from pylab import rcParams

import matplotlib.animation as animation

from matplotlib import rc

import unittest

%matplotlib inline

sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 14, 8

RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)

def run_tests():
    unittest.main(argv=[''], verbosity=1, exit=False)

3.3.2 Uploading Files


from google.colab import files
uploaded = files.upload()

27
import io
df = pd.read_csv(io.BytesIO(uploaded['housepred.csv']))
print(df)

df.shape

28
df.head()

3.3.3 Data Preprocessing


Now, we categorize the features depending on their datatype (int, float, object) and
then count how many of each there are.
obj = (df.dtypes == 'object')
object_cols = list(obj[obj].index)
print("Categorical variables:", len(object_cols))

int_ = (df.dtypes == 'int')
num_cols = list(int_[int_].index)
print("Integer variables:", len(num_cols))

fl = (df.dtypes == 'float')
fl_cols = list(fl[fl].index)
print("Float variables:", len(fl_cols))

3.3.4 Exploratory Data Analysis


EDA refers to the in-depth analysis of data carried out to discover patterns and spot
anomalies. Before making inferences from the data, it is essential to examine all the
variables.
So here let's make a heatmap using the Seaborn library.
plt.figure(figsize=(18, 10))
sns.heatmap(df.corr(numeric_only=True),  # correlate only the numerical features
            cmap='BrBG',
            fmt='.2f',
            linewidths=2,
            annot=True)

To analyze the different categorical features, let's draw a bar plot.

unique_values = []
for col in object_cols:
    unique_values.append(df[col].unique().size)
plt.figure(figsize=(10, 6))
plt.title('No. of Unique Values of Categorical Features')
plt.xticks(rotation=90)
sns.barplot(x=object_cols, y=unique_values)

30
The plot shows that Neighbourhood has around 25 unique categories. To find out the
actual count of each category, we can plot a bar graph for each feature separately.

plt.figure(figsize=(18, 36))
plt.suptitle('Categorical Features: Distribution')
plt.subplots_adjust(hspace=0.5)
for index, col in enumerate(object_cols, start=1):
    plt.subplot(9, 2, index)
    plt.xticks(rotation=90)
    sns.countplot(x=col, data=df)
    plt.title(col)
plt.tight_layout()
plt.show()

31
3.3.5 Data Cleaning
Data cleaning is the process of improving the data by removing incorrect, corrupted,
or irrelevant records. In our dataset there are some columns that are not important or
relevant for model training, so we can drop them before training. The Id column will
not participate in any prediction, so we can drop it.

df.drop(['Id'],
        axis=1,
        inplace=True)

32
Replacing empty SalePrice values with the mean value to make the data distribution
symmetric.
df['SalePrice'] = df['SalePrice'].fillna(
    df['SalePrice'].mean())

Drop records with null values (as the number of empty records is very small).
new_dataset = df.dropna()

Checking which features still have null values in the new dataframe (if any).
new_dataset.isnull().sum()

3.3.6 OneHotEncoder – For Label categorical features


One-hot encoding is a standard way to convert categorical data into binary vectors,
mapping each category to its own indicator column. By using OneHotEncoder, we can
easily convert object data into numeric form. For that, we first have to collect all
the features which have the object datatype.
from sklearn.preprocessing import OneHotEncoder
s = (new_dataset.dtypes == 'object')

33
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
print('No. of categorical features: ',
      len(object_cols))

Once we have the list of all such features, we can apply one-hot encoding to the
whole list.
OH_encoder = OneHotEncoder(sparse=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(new_dataset[object_cols]))
OH_cols.index = new_dataset.index
OH_cols.columns = OH_encoder.get_feature_names_out()
df_final = new_dataset.drop(object_cols, axis=1)
df_final = pd.concat([df_final, OH_cols], axis=1)

3.3.7 Splitting Dataset into Training and Testing


X and Y splitting (i.e. Y is the SalePrice column and the remaining columns form X).
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = df_final.drop(['SalePrice'], axis=1)
Y = df_final['SalePrice']

# Split the training set into


# training and validation set
X_train, X_valid, Y_train, Y_valid = train_test_split(
X, Y, train_size=0.8, test_size=0.2, random_state=0)

34
3.3.8 Model and Accuracy
Since we have to train the model to predict continuous values, we will be using
these regression models:
 SVM – Support Vector Machine
 Linear Regressor
3.3.9 SVM – Support Vector Machine
SVM can be used for both regression and classification. For regression, it fits a
hyperplane in the n-dimensional feature space.
from sklearn import svm
from sklearn.svm import SVC
from sklearn.metrics import mean_absolute_percentage_error
model_SVR = svm.SVR()
model_SVR.fit(X_train, Y_train)
Y_pred = model_SVR.predict(X_valid)
print(mean_absolute_percentage_error(Y_valid, Y_pred))

3.3.10 R- Squared Score


Y_pred = model_SVR.predict(X_valid)
r2_score_SVR = model_SVR.score(X_valid, Y_valid)
print("R-squared score (SVR):", r2_score_SVR)

# Install necessary libraries


!pip install numpy pandas matplotlib scikit-learn

from sklearn import datasets


from sklearn.preprocessing import StandardScaler

35
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # Using only two features for visualization
y = iris.target

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the SVM model


svm_model = SVC(kernel='linear', C=1.0)
svm_model.fit(X_train_scaled, y_train)

# Function to plot the decision boundary and data points


def plot_decision_boundary(model, X, y):
    h = .02  # Step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

    # Plot data points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('SVM Decision Boundary')
    plt.show()

# Plot the decision boundary


plot_decision_boundary(svm_model, X_train_scaled, y_train)

# Scale the house-price features before tuning the SVR
# (defines the X_train_scaled / X_valid_scaled used below)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

param_grid_SVR = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}
grid_search_SVR = GridSearchCV(model_SVR, param_grid_SVR,
                               scoring='neg_mean_squared_error', cv=5)
grid_search_SVR.fit(X_train_scaled, Y_train)
best_model_SVR = grid_search_SVR.best_estimator_
Y_pred_SVR = best_model_SVR.predict(X_valid_scaled)
mse_SVR = mean_squared_error(Y_valid, Y_pred_SVR)
print("SVM - Best Hyperparameters:", grid_search_SVR.best_params_)
print("SVM - Mean Squared Error:", mse_SVR)
print("SVM - R-squared score:", r2_score(Y_valid, Y_pred_SVR))

3.3.11 Linear Regression


Linear Regression predicts the dependent output value from the given independent
features. Here, for example, we predict SalePrice from features such as MSSubClass,
YearBuilt, BldgType, Exterior1st, etc.

from sklearn.linear_model import LinearRegression


model_LR = LinearRegression()
model_LR.fit(X_train, Y_train)
Y_pred = model_LR.predict(X_valid)
print(mean_absolute_percentage_error(Y_valid, Y_pred))

3.3.12 R- Squared Score


Y_pred = model_LR.predict(X_valid)
r2_score_LR = r2_score(Y_valid, Y_pred)
print("R-squared score (Linear Regression):", r2_score_LR)

38
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the Iris dataset


iris = datasets.load_iris()
X = iris.data[:, :2] # Using only two features for visualization
y = iris.target

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the Linear Regression model


linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train)

# Function to plot the regression line and data points


def plot_regression_line(model, X, y):
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Linear Regression Line')

    # Plot the model's predictions over the feature grid
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

    plt.show()

# Plot the regression line


plot_regression_line(linear_model, X_train_scaled, y_train)

scores_LR = cross_val_score(model_LR, X_train_scaled, Y_train,
                            scoring='neg_mean_squared_error', cv=5)
mse_LR = -scores_LR.mean()
print("Linear Regression - Mean Squared Error:", mse_LR)
model_LR.fit(X_train_scaled, Y_train)  # Fit on entire training data for error analysis

40
3.3.13 Random Forest Regression
Random Forest is an ensemble technique that uses multiple decision trees and can
be used for both regression and classification tasks.
from sklearn.ensemble import RandomForestRegressor
model_RFR = RandomForestRegressor(n_estimators=10)
model_RFR.fit(X_train, Y_train)
Y_pred = model_RFR.predict(X_valid)
mean_absolute_percentage_error(Y_valid, Y_pred)

model_RF = RandomForestRegressor(random_state=RANDOM_SEED)
param_grid_RF = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
grid_search_RF = GridSearchCV(model_RF, param_grid_RF,
                              scoring='neg_mean_squared_error', cv=5)
grid_search_RF.fit(X_train, Y_train)
best_model_RF = grid_search_RF.best_estimator_
Y_pred_RF = best_model_RF.predict(X_valid)
mse_RF = mean_squared_error(Y_valid, Y_pred_RF)
print("Random Forest - Best Hyperparameters:", grid_search_RF.best_params_)
print("Random Forest - Mean Squared Error:", mse_RF)
print("Random Forest - R-squared score:", r2_score(Y_valid, Y_pred_RF))

41
42
CHAPTER 4: TESTING & IMPLEMENTATION

4.1 Types of testing


4.1.1 Unit testing
Unit testing is a technique in which a particular module is tested by the developer to check
whether it contains any errors. The primary focus of unit testing is to test an individual unit
of the system in order to analyse, detect, and fix errors. Python provides the unittest module
to test units of source code.
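
As a hedged sketch, a unit test for a small preprocessing helper might look as follows; the fill_missing_price function is hypothetical and stands in for any unit of the project's code.

import unittest
import pandas as pd

def fill_missing_price(df):
    # Hypothetical helper: fill missing SalePrice values with the column mean
    df = df.copy()
    df['SalePrice'] = df['SalePrice'].fillna(df['SalePrice'].mean())
    return df

class TestPreprocessing(unittest.TestCase):
    def test_no_missing_values_after_fill(self):
        df = pd.DataFrame({'SalePrice': [100.0, None, 300.0]})
        result = fill_missing_price(df)
        self.assertEqual(result['SalePrice'].isnull().sum(), 0)
        self.assertAlmostEqual(result['SalePrice'][1], 200.0)

if __name__ == '__main__':
    unittest.main(argv=[''], verbosity=1, exit=False)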
4.1.2 Module testing
Module testing is a type of software testing where individual units or
components of the software are tested. The purpose of module testing is to isolate a
section of code and verify its correctness. Module testing is usually performed by the
development team during the early stages of software development.

4.1.3 Integration testing


Integration testing is performed after each component is separately tested (unit testing).
After that, all the components are unified into a single application and various actions are
performed to check their execution as a whole. These actions include, for example, calling
a Python API.
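
For example, an integration-style test can call the prediction pipeline end to end through its Python API. This is only a sketch: it assumes the model_LR, X_valid and Y_valid objects built in Section 3.3 are already available in the notebook.

import unittest
import numpy as np

class TestPredictionPipeline(unittest.TestCase):
    def test_predictions_match_validation_set(self):
        # Assumes model_LR, X_valid and Y_valid exist, as built in Section 3.3
        predictions = model_LR.predict(X_valid)
        self.assertEqual(len(predictions), len(Y_valid))
        self.assertTrue(np.isfinite(predictions).all())

run_tests()   # helper defined in Section 3.3.1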

4.1.4 System testing


System testing is a type of software testing that evaluates the overall
functionality and performance of a complete and fully integrated software
solution. It tests if the system meets the specified requirements and if it is suitable
for delivery to the end-users. This type of testing is performed after the integration
testing and before the acceptance testing.

4.1.5 White box & Black box


The Black Box Test is a test that only considers the external behavior of the system; the
internal workings of the software are not taken into account. The White Box Test is a
method used to test software while taking its internal functioning into consideration.

43
4.1.6 Acceptance testing
Acceptance tests check that you are building the correct product. While unit tests and
integration tests are a form of verification, acceptance tests are validation: they validate
that you are building what the user expects.

4.2 Test data and cases

Figure 6.1

Figure 6.2

44
Figure 6.3

Figure 6.4

45
Figure 6.5

This is the output after testing our model. It shows the trained model delivering the kind of
accurate, data-driven predictions the project set out to provide to the organization, its
clients, and the real estate market as a whole.

46
47
CHAPTER 5: CONCLUSION & REFERENCES

5.1 CONCLUSION
In this house price prediction project, we embarked on a journey to harness the power
of data analysis to provide accurate and interpretable predictions in the dynamic real
estate market. Our comprehensive approach involved collecting, preprocessing, and
analysing diverse data sources, with a focus on transparency and usability. As we
conclude this project, several key takeaways and achievements stand out:
1. Data Integration and Feature Engineering: We successfully gathered and
integrated a wide range of data, including historical property sales records,
economic indicators, property attributes, and geographical information. Feature
engineering techniques enabled us to extract meaningful predictors from this data,
contributing to the model’s accuracy.
2. Model Selection and Evaluation: Through rigorous experimentation, we explored
various predictive models, including regression analysis and machine learning
algorithms. Our evaluation criteria highlighted the model’s impressive accuracy,
demonstrating its ability to make reliable house price predictions.
3. Spatial Considerations: Recognizing the importance of location-based factors in
the real estate market, we incorporated spatial elements, such as neighbourhood
characteristics and proximity to amenities, into our predictive model. This enhanced
the model’s performance and relevance.
4. Transparency and Interpretability: A hallmark of our project is the emphasis on
transparency and interpretability. Our user-friendly interface provides users with
clear explanations of predictions, fostering trust and understanding.
5. Real-World Applicability: The success of our predictive model has significant
real-world implications. Homebuyers can make more informed decisions, sellers
can optimize their pricing strategies, and real estate professionals can gain valuable
insights into market trends and client needs.
6. Ethical Considerations: Throughout the project, we maintained a commitment to
ethical data handling and responsible AI practices. Data privacy, fairness, and
unbiased analysis were paramount in our approach.
7. Future Directions: While we have achieved remarkable results, the field of data
analysis is ever-evolving. Future directions for this project may include enhancing
predictive models with more advanced AI techniques, expanding data sources, and
adapting to changing market dynamics.
In conclusion, our house price prediction project stands as a testament to the power of
data analysis in empowering stakeholders in the real estate industry. By providing
accurate, transparent, and user-friendly predictions, we have contributed to informed
decision-making and data-driven strategies. As we look ahead, the potential for further
innovation and impact in this field is vast, and we remain committed to pushing the
boundaries of data analysis to benefit individuals and organizations in the ever-
changing real estate landscape.

48
5.2 System Specifications
5.2.1 Hardware Requirement
 Processor: Multicore processor (Intel Core i5 or equivalent)
 RAM: 8 GB or higher
 Storage: Minimum 256 GB SSD for optimal performance
5.2.2 Software Requirement
The software requirement for the application includes:
Operating System: Windows 10 or later
Python: Python 3.8 or higher

Limitations of the system


The system does not predict the future prices of the specific houses mentioned by the
customer. Because of this, the risk involved in investing in an apartment or an area
increases considerably. To minimize this risk, customers tend to hire an agent, which again
increases the cost of the process. This motivates the modification and further development
of the existing system.

5.3 REFERENCES
1. IBM: https://www.ibm.com/in-en/about
2. GeeksforGeeks: https://www.geeksforgeeks.org/
3. Towards Data Science: https://towardsdatascience.com/
4. W3Schools: https://www.w3schools.com/datascience/
5. Iain Pardoe, 2008, Modeling Home Prices Using Realtor Data
6. https://www.ijcaonline.org/archives/volume152/number2/bhagat-2016-ijca-911775.pdf

49
50
Chapter 6: Annexures
A-10 Sample Outputs

Figure 6.1

Figure 6.2

51
Figure 6.3

Figure 6.4

52
Figure 6.5

53
