Ram STP
Ram STP
Ram STP
STUDIES
A
SUMMER TRAINING PROJECT REPORT ON
“HOUSE PRICE PREDICTION USING DATA ANALYTICS”
BATCH: 2021-2024
1
CERTIFICATION OF COMPLETION
2
3
DECLARATION BY THE CANDIDATE
I hereby declare that the work presented in this report entitled “HOUSE PRICE
PREDICTION USING DATA ANALYTICS”, in fulfilment of the requirement for the
award of the degree Bachelor of Computer Application, Trinity institute of Professional
Studies, Dwarka, New Delhi, is an authentic record of my own work carried out during my
degree under the guidance of Ms. Nishika. The work reported in this has not been
submitted by me for award of any other degree or diploma.
4
To Whom It May Concern
I RAM TANWAR, Enrolment No. 01724002021 from BCA-V Sem, Shift SECOND
of the Trinity Institute of Professional Studies, Delhi hereby declare that the Summer
Project Training (BCA-331) entitled “House Price Prediction using Data Analytics”
at Shri Sahib technologies is an original work and the same has not been submitted to
any other Institute for the award of any other degree. A presentation of the Summer
Project Training was made on October 15, 2023 and the suggestions as approved by the
faculty were duly incorporated.
Associate Professor
5
ACKNOWLEDGEMENT
I express my sincere gratitude to Ms. Nishika (Associate Professor) for his valuable
guidance and timely suggestions during the entire duration of my dissertation work, without
which this work would not have been possible. I would also like to convey my deep regards
to all other faculty members who have bestowed their great effort and guidance at
appropriate times without which it would have been very difficult on my part to finish this
work. Finally, I would also like to thank my friends for their advice and pointing out my
mistakes.
6
TABLE OF CONTENTS
1 List of figures 7
2 List of abbreviations 8
3 Abstract 9
5.1 Conclusion
5.2 Reference
9 Chapter 6: Annexures 50-
A-10 Sample Output 52sssss
7
LIST OF FIGURES
8
LIST OF ABBREVIATIONS
9
ABSTRACT
In the fast-paced world of real estate, predicting house prices accurately is a game-changer.
Whether you're a buyer looking for that perfect deal, a seller trying to set the right price, or
an investor seeking lucrative opportunities, having a reliable model at your disposal can
make all the difference. Our project is all about harnessing the power of data analytics to
create a rock-solid house price prediction model. We're diving headfirst into historical
housing data, armed with a toolbox of data analysis techniques, to not just guess, but to
forecast future house prices with pinpoint precision.
This project is more than just crunching numbers; it's about giving you the confidence to
make informed decisions in the real estate market. Imagine knowing what a property will
be worth in the coming months or years. It's like having a crystal ball, but one that relies on
solid data and proven methods. So, stay tuned as we embark on this journey to build a model
that's not just casual about predictions, but professionally accurate in the ever-evolving
world of real estate.
10
CHAPTER 1: INTRODUCTION
Our clients’ systems support modern society. In making them faster, more
productive, and more secure, we don’t just make business work better. We
make the world work better.
We bring together all the necessary technology and services, regardless of where those
solutions come from, to help clients solve the most pressing business problems.
IBM integrates technology and expertise, providing infrastructure, software (including
market-leading Red Hat) and consulting services for clients as they pursue the digital
transformation of the world’s mission-critical businesses.
The main goal of IBM is to be a leading provider of innovative technology solutions
and services that help clients solve complex business problems and drive digital
transformation.
1.2 OBJECTIVE
The primary objective of implementing a house price prediction project using data
analysis for IBM can be multifaceted and may include:
1. Business Decision Support: Provide IBM with a powerful tool to make informed
business decisions related to real estate investments, acquisitions, and strategic
planning. This includes identifying potential opportunities, assessing risks, and
optimizing the allocation of resources.
2. Customer Engagement: Enhance IBM's engagement with its customers by offering
them valuable insights into the real estate market. This can be in the form of tools or
services that help clients make data-driven decisions about buying, selling, or investing
in properties.
3. Data-Driven Innovation: Showcase IBM's commitment to data-driven innovation
and technology leadership. Demonstrating the capability to harness data for predictive
analytics can strengthen IBM's reputation as a technology and solutions provider.
4. Competitive Advantage: Gain a competitive edge in the real estate technology market
by offering a sophisticated house price prediction solution. This can attract clients
seeking advanced analytics solutions and differentiate IBM from competitors.
5. Revenue Generation: Explore opportunities for revenue generation through the sale
of predictive analytics services, licensing of predictive models, or providing
subscription-based access to the prediction platform.
6. Research and Development: Foster ongoing research and development efforts in the
field of data analytics, machine learning, and artificial intelligence. The project can
serve as a platform for testing and advancing IBM's AI and analytics capabilities.
11
7. User Empowerment: Empower IBM's clients and partners with a user-friendly
interface that allows them to harness the power of data analysis without requiring
extensive technical knowledge.
8. Transparency and Trust: Build trust with clients and stakeholders by providing
transparent and interpretable predictions. Ensure that users understand the factors
contributing to house price predictions and can rely on the accuracy of the model.
9. Scalability: Design the solution to be scalable, allowing IBM to adapt to changing
market dynamics and expanding data sources.
10. Compliance and Ethical Considerations: Ensure that the project complies with data
privacy regulations and ethical considerations related to data usage and analytics.
11. Educational Outreach: Use the project as an educational tool to help clients and
stakeholders understand the value of data analytics in the real estate industry. Provide
training and resources to promote data literacy.
12. Sustainability: Consider the environmental impact of the project and aim to
minimize resource consumption while maximizing the benefits of data analysis.
Ultimately, the objective of a house price prediction project for IBM is to leverage data
analysis to provide value to the organization, its clients, and the real estate market as a
whole. This can be achieved by delivering accurate predictions, fostering innovation,
and supporting data-driven decision-making in the dynamic and competitive real estate
industry.
12
5. Customer Engagement: Engage customers and clients by offering data-driven
services and tools that enhance their experience and help them achieve their real estate
goals.
6. Competitive Advantage: Establish a competitive advantage in the real estate
technology market by providing a sophisticated predictive analytics solution.
7. Research and Innovation: Foster research and innovation in data analytics, machine
learning, and artificial intelligence while addressing real-world challenges in the real
estate domain.
1.3.2 Scope
1. Data Sources: Define the sources of data to be used, which may include historical
property sales data, housing market indicators, economic data, geographical attributes,
and demographic information.
2. Geographical Coverage: Specify the geographical scope of the project, such as
whether it focuses on a specific region, city, or covers a broader national or international
market.
3. Property Types: Determine the types of properties under consideration, such as
residential homes, commercial properties, or specific categories like apartments,
condos, or single-family houses.
4. Predictive Models: Outline the types of predictive models and algorithms that will
be employed for house price prediction, including regression analysis, machine learning
models, or specialized approaches.
5. Feature Engineering: Describe the feature engineering techniques that will be applied
to extract meaningful predictors from the data, including property features, location-
based attributes, and market indicators.
6. Model Evaluation: Define the criteria for evaluating the performance of predictive
models, considering metrics like accuracy, mean squared error, and interpretability.
7. User Interface: Specify the design and functionality of the user interface, ensuring it
is user-friendly and provides transparent explanations of predictions.
8. Data Privacy: Address data privacy and security concerns, outlining how sensitive
information will be handled and ensuring compliance with relevant regulations.
9. Scalability: Consider whether the project is designed to scale with changing data
volumes and market dynamics.
10. Documentation and Reporting: Determine the scope of documentation and
reporting, including the presentation of findings, insights, and recommendations.
11. Ethical Considerations: Address ethical considerations related to data usage, bias
mitigation, and responsible AI practices.
13
12. Maintenance and Updates: Define the scope of ongoing maintenance, model
updates, and support for users.
13. Stakeholders: Identify the primary stakeholders, such as real estate professionals,
investors, homebuyers, and any other parties benefiting from the project.
14
15
CHAPTER 2: SYSTEM REQUIREMENT ANALYSIS
16
2.1.4 Other Feasibility Dimensions
Additional feasibility dimensions remain consistent:
1. Legal Feasibility: The project complies with data privacy regulations and
intellectual property laws. Data usage and handling adhere to legal requirements
without incurring extra legal costs.
2. Scheduling Feasibility: A detailed project timeline remains established, ensuring
timely delivery without incurring additional costs.
3. Resource Feasibility: The project utilizes existing physical resources and
infrastructure, minimizing resource-related expenses.
Import necessary libraries for data analysis, visualization, and machine learning.
Load the dataset (housepred.csv) into a Pandas DataFrame.
Check the data to understand its structure, including columns and data types.
Categorize features based on their data types (int, float, object) and calculate the
number of each.
2. Exploratory Data Analysis (EDA):
3. Data Cleaning:
Handle missing values, in this case, by filling empty SalePrice values with their
mean values.
Drop records with null values.
Verify that no features have null values in the cleaned dataset.
4. One-Hot Encoding:
5. Data Splitting:
17
6. Regression Models:
1. Train and evaluate different regression models:
2. Support Vector Machine (SVM) Regressor.
3. Tune hyperparameters using GridSearchCV.
4. Calculate Mean Squared Error and R-squared score.
5. Linear Regression.
6. Calculate Mean Absolute Percentage Error and R-squared score.
7. Random Forest Regressor.
8. Tune hyperparameters using GridSearchCV.
9. Calculate Mean Squared Error and R-squared score.
8. Conclusion:
18
Overall, Google Colab is a versatile and accessible platform for data analysis, machine
learning, and collaborative work in the field of data science.
19
Matplotlib and Seaborn: These libraries are used for data
visualization. Matplotlib offers a wide range of plot types, and
Seaborn is built on top of Matplotlib, providing a high-level interface
for creating attractive statistical graphics.
Scikit-Learn: Scikit-Learn is a machine learning library for Python. It
includes tools for data
Python:
Figure 2.1
What sets Python apart is its clean and easily understandable syntax, emphasizing
code readability. Instead of relying on complex punctuation and braces, Python uses
indentation to define code blocks. This feature makes Python particularly beginner-
friendly and encourages developers to write clean and maintainable code.
A key strength of Python lies in its extensive standard library, which offers a wealth
of modules and functions for various tasks, from file handling to networking and
data manipulation. This rich library ecosystem simplifies development, as
developers can leverage pre-built tools.
20
Google Colab:
Figure 2.2
Google Colab, short for Google Colaboratory, is a cloud-based platform that has
revolutionized the world of data science and machine learning. Developed by
Google, Colab offers a dynamic and collaborative environment for building,
training, and deploying machine learning models, all within a web browser. It has
become a game-changer in the field of data analysis and computational research.
Pandas:
Figure 2.3
Pandas, short for "Panel Data," is a widely-used library in the Python ecosystem,
renowned for its robust data manipulation and analysis capabilities. It serves as the
go-to tool for data professionals, scientists, analysts, and developers when it comes to
efficiently handling structured data. At its core, Pandas revolves around two primary
data structures: Series and DataFrame.
One of Pandas' standout features is its exceptional data loading capabilities. It can
effortlessly read data from a myriad of file formats including CSV, Excel, SQL
databases, and more, making the process of importing data for analysis remarkably
seamless.
Pandas also shines in data cleaning tasks. It provides powerful tools to handle missing
values, eliminate duplicates, and convert data types, ensuring your dataset is pristine
and ready for analysis.
Data filtering and selection are a breeze with Pandas. We can easily extract specific
subsets of data based on criteria relevant to your analysis, simplifying the exploration
of large datasets.
21
Numpy:
Figure 2.4
NumPy, short for "Numerical Python," is a fundamental library for scientific
computing in the Python programming language. It provides support for large, multi-
dimensional arrays and matrices, along with a collection of high-level mathematical
functions to operate on these arrays. NumPy is a cornerstone library in the Python
data science and numerical computing ecosystem, and it offers several key features:
Matplotlib and Seaborn:
Figure 2.7
Scikit-Learn, often abbreviated as sklearn, is a popular and open-source machine
learning library built for Python. It serves as a valuable tool for machine learning and
data science practitioners, providing a wide range of tools and algorithms for tasks
such as classification, regression, clustering, dimensionality reduction, and more.
22
23
CHAPTER 3: SYSTEM DESIGN
Real Estate Agents and Brokers: Seeking accurate pricing for listings.
Homebuyers: Wanting to make informed purchase decisions.
Home Sellers: Aiming to price their properties competitively.
Investors: Analyzing potential returns on real estate investments.
24
3.1.2 Revenue Model
3.1.2.1 Pricing Strategy
Subscription Model: Charge real estate professionals a monthly fee for access
to the platform.
Pay-Per-Use Model: Charge users on a per-query basis for property
valuations.
3.1.3.2 Infrastructure
Cloud Computing: Host the application and databases on a cloud platform like
AWS, Azure, or Google Cloud.
Database: Use relational databases for data storage and retrieval.
APIs: Develop RESTful APIs for easy integration with external systems.
25
3.2 DFD
Level 0:
DATA FLOW
House
Price
Prediction DATA STORES
PROCESSES
Figure 3.2 Level 0 DFD
Level 1:
Model Evaluation
26
5.2 Interface Design
import pandas as pd
import numpy as np
import unittest
%matplotlib inline
rcParams[‘figure.figsize’] = 14, 8
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
def run_tests():
27
import io
df = pd.read_csv(io.BytesIO(uploaded[‘housepred.csv’]))
print(df)
df.shape
28
df.head()
fl = (df.dtypes == ‘float’)
fl_cols = list(fl[fl].index)
print(“Float variables:”,len(fl_cols))
29
cmap = ‘BrBG’,
fmt = ‘.2f’,
linewidths = 2,
annot = True)
30
222
The plot shows that Neighbourhood has around 25 unique categories. To findout the
actual count of each category we can plot the bargraph of each features separately.
Plt.figure(figsize=(18, 36))
plt.suptitle(‘Categorical Features: Distribution’)
plt.subplots_adjust(hspace=0.5)
for index, col in enumerate(object_cols, start=1):
plt.subplot(9, 2, index)
plt.xticks(rotation=90)
sns.countplot(x=col, data=df)
plt.title(col)
plt.tight_layout()
plt.show()
31
3.3.5 Data Cleaning
Data Cleaning is the way to improvise the data or remove incorrect,
corrupted or irrelevant data.
As in our dataset, there are some columns that are not important and
irrelevant for the model training. So, we can drop that column before
training.
As Id Column will not be participating in any prediction. So we can Drop
it.
Df.drop([‘Id’],
axis=1,
inplace=True)
32
Replacing SalePrice empty values with their mean values to make the data
distribution symmetric.
Df[‘SalePrice’] = df[‘SalePrice’].fillna(
df[‘SalePrice’].mean())
Drop records with null values (as the empty records are very less).
New_dataset = df.dropna()
Checking features which have null values in the new dataframe (if there are still any).
New_dataset.isnull().sum()
33
object_cols = list(s[s].index)
print(“Categorical variables:”)
print(object_cols)
print(‘No. of. Categorical features: ‘,
len(object_cols))
Then once we have a list of all the features. We can apply OneHotEncoding to the
whole list.
OH_encoder = OneHotEncoder(sparse=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(new_dataset[object_cols]))
OH_cols.index = new_dataset.index
OH_cols.columns = OH_encoder.get_feature_names_out()
df_final = new_dataset.drop(object_cols, axis=1)
df_final = pd.concat([df_final, OH_cols], axis=1)
X = df_final.drop([‘SalePrice’], axis=1)
Y = df_final[‘SalePrice’]
34
3.3.8 Model and Accuracy
As we have to train the model to determine the continuous values, so we
will be using these regression models.
SVM-Support Vector Machine
Linear Regressor
3.3.9 SVM – Support vector Machine
SVM can be used for both regression and classification model. It finds the
hyperplane in the n-dimensional plane.
From sklearn import svm
from sklearn.svm import SVC
from sklearn.metrics import mean_absolute_percentage_error
model_SVR = svm.SVR()
model_SVR.fit(X_train,Y_train)
Y_pred = model_SVR.predict(X_valid)
print(mean_absolute_percentage_error(Y_valid, Y_pred))
35
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # Using only two features for visualization
y = iris.target
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
36
# Plot data points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors=’k’)
plt.xlabel(‘Feature 1’)
plt.ylabel(‘Feature 2’)
plt.title(‘SVM Decision Boundary’)
plt.show()
param_grid_SVR = {
‘C’: [0.1, 1, 10],
37
‘kernel’: [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’],
‘gamma’: [‘scale’, ‘auto’]
}
grid_search_SVR = GridSearchCV(model_SVR, param_grid_SVR,
scoring=’neg_mean_squared_error’, cv=5)
grid_search_SVR.fit(X_train_scaled, Y_train)
best_model_SVR = grid_search_SVR.best_estimator_
Y_pred_SVR = best_model_SVR.predict(X_valid_scaled)
mse_SVR = mean_squared_error(Y_valid, Y_pred_SVR)
print(“SVM – Best Hyperparameters:”, grid_search_SVR.best_params_)
print(“SVM – Mean Squared Error:”, mse_SVR)
print(“SVM – R-squared score:”, r2_score(Y_valid, Y_pred_SVR))
38
from sklearn import datasets
from sklearn.linear_model import LinearRegression
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
39
# Plot regression line
x_min, x_max = X[:, 0].min(), X[:, 0].max()
y_min, y_max = X[:, 1].min(), X[:, 1].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max,
100))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.show()
40
3.3.13 Random Forest Regression
Random Forest is an ensemble technique that uses multiple of decision trees and can
be used for both regression and classification tasks.
From sklearn.ensemble import RandomForestRegressor
model_RFR = RandomForestRegressor(n_estimators=10)
model_RFR.fit(X_train, Y_train)
Y_pred = model_RFR.predict(X_valid)
mean_absolute_percentage_error(Y_valid, Y_pred)
model_RF = RandomForestRegressor(random_state=RANDOM_SEED)
param_grid_RF = {
‘n_estimators’: [100, 200, 300],
‘max_depth’: [None, 10, 20],
‘min_samples_split’: [2, 5, 10]
}
grid_search_RF = GridSearchCV(model_RF, param_grid_RF,
scoring=’neg_mean_squared_error’, cv=5)
grid_search_RF.fit(X_train, Y_train)
best_model_RF = grid_search_RF.best_estimator_
Y_pred_RF = best_model_RF.predict(X_valid)
mse_RF = mean_squared_error(Y_valid, Y_pred_RF)
print(“Random Forest – Best Hyperparameters:”, grid_search_RF.best_params_)
print(“Random Forest – Mean Squared Error:”, mse_RF)
print(“Random Forest – R-squared score:”, r2_score(Y_valid, Y_pred_RF))
41
42
CHAPTER 4: TESTING & IMPLEMENTATION
43
4.1.6 Acceptance testing
Acceptance tests check that you are building the correct product. While unit
tests and integration tests are a form of verification, acceptance tests are
validation. They validate that you are building what the user expects. In this
chapter, you will learn about acceptance testing in Python.
Figure 6.1
Figure 6.2
44
Figure 6.3
Figure 6.4
45
Figure 6.5
This is the output after testing our model. The objective of a house price prediction
project for IBM is to leverage data analysis to provide value to the organization, its
clients, and the real estate market as a whole. This can be achieved by delivering
accurate predictions, fostering innovation, and supporting data-driven decision-making
in the dynamic and competitive real estate industry.
46
47
CHAPTER 5: CONCLUSION & REFERENCES
5.1 CONCLUSION
In this house price prediction project, we embarked on a journey to harness the power
of data analysis to provide accurate and interpretable predictions in the dynamic real
estate market. Our comprehensive approach involved collecting, preprocessing, and
analysing diverse data sources, with a focus on transparency and usability. As we
conclude this project, several key takeaways and achievements stand out:
1. Data Integration and Feature Engineering: We successfully gathered and
integrated a wide range of data, including historical property sales records,
economic indicators, property attributes, and geographical information. Feature
engineering techniques enabled us to extract meaningful predictors from this data,
contributing to the model’s accuracy.
2. Model Selection and Evaluation: Through rigorous experimentation, we explored
various predictive models, including regression analysis and machine learning
algorithms. Our evaluation criteria highlighted the model’s impressive accuracy,
demonstrating its ability to make reliable house price predictions.
3. Spatial Considerations: Recognizing the importance of location-based factors in
the real estate market, we incorporated spatial elements, such as neighbourhood
characteristics and proximity to amenities, into our predictive model. This enhanced
the model’s performance and relevance.
4. Transparency and Interpretability: A hallmark of our project is the emphasis on
transparency and interpretability. Our user-friendly interface provides users with
clear explanations of predictions, fostering trust and understanding.
5. Real-World Applicability: The success of our predictive model has significant
real-world implications. Homebuyers can make more informed decisions, sellers
can optimize their pricing strategies, and real estate professionals can gain valuable
insights into market trends and client needs.
6. Ethical Considerations: Throughout the project, we maintained a commitment to
ethical data handling and responsible AI practices. Data privacy, fairness, and
unbiased analysis were paramount in our approach.
7. Future Directions: While we have achieved remarkable results, the field of data
analysis is ever-evolving. Future directions for this project may include enhancing
predictive models with more advanced AI techniques, expanding data sources, and
adapting to changing market dynamics.
In conclusion, our house price prediction project stands as a testament to the power of
data analysis in empowering stakeholders in the real estate industry. By providing
accurate, transparent, and user-friendly predictions, we have contributed to informed
decision-making and data-driven strategies. As we look ahead, the potential for further
innovation and impact in this field is vast, and we remain committed to pushing the
boundaries of data analysis to benefit individuals and organizations in the ever-
changing real estate landscape.
48
5.2 System Specifications
5.2.1 Hardware Requirement
Processor: Multicore processor(Intel Core i5 or equivalent)
RAM: 8GB or higher
Storage: Minimum 256 GB SSD for optimal Performance
5.2.2 Software Requirement
The software requirement for the application includes:
Operating System: Windows 10 or later
Python: Python 3.8 or higher
5.3 REFERENCES
1. IBM
https://www.ibm.com/in-en/about
2. Geeksforgeeks
https://www.geeksforgeeks.org/
6. https://www.ijcaonlineorg/archives/volume152/number2/bhagat-2016-ijca-
911775.pdf
49
50
Chapter 6: Annexures
A-10 Sample Outputs
Figure 6.1
Figure 6.2
51
Figure 6.3
Figure 6.4
52
Figure 6.5
53