A New Algorithm for Large-Scale Geographically Weighted Regression with K-Nearest Neighbors
<p>Flow chart of K-nearest neighbor geographically weighted regression. (<b>a</b>) Optimal bandwidth selection reference k-means. (<b>b</b>) KD Tree construction and search. (<b>c</b>) Restructured <math display="inline"><semantics><mrow><msub><mrow><mover accent="true"><mrow><mi>W</mi></mrow><mo stretchy="false">~</mo></mover></mrow><mrow><mi>i</mi></mrow></msub></mrow></semantics></math>, <math display="inline"><semantics><mrow><msub><mrow><mover accent="true"><mrow><mi>X</mi></mrow><mo stretchy="false">~</mo></mover></mrow><mrow><mi>i</mi></mrow></msub></mrow></semantics></math>, <math display="inline"><semantics><mrow><msub><mrow><mover accent="true"><mrow><mi>Y</mi></mrow><mo stretchy="false">~</mo></mover></mrow><mrow><mi>i</mi></mrow></msub></mrow></semantics></math>. (<b>d</b>) KNN-GWR in calibration.</p> "> Figure 2
<p>An example of finding observations within a certain range around the regression point. 1, 2, 3, and 4 are points within the bandwidth distance from point <span class="html-italic">i</span>, and 5, 6, 7, 8, and 9 are points outside the bandwidth distance from point <span class="html-italic">i</span>.</p> "> Figure 3
<p>Study area. (<b>a</b>) Geographical location of each province of the study area in China. (<b>b</b>) Distribution of house price dataset in the study area.</p> "> Figure 4
<p>Details of each independent variable (<b>a</b>) Map of NBeds parameter estimates for the predictor variables. (<b>b</b>) Map of NBaths parameter estimation for predictor variables. (<b>c</b>) Map of Area parameter estimates for the predictor variables. (<b>d</b>) Map of Age parameter estimation for the predictor variables.</p> "> Figure 5
<p>Estimated regression coefficients <math display="inline"><semantics><mrow><mi>β</mi></mrow></semantics></math> of KNN-GWR method on simulated dataset.</p> "> Figure 6
<p>Comparison of the computational speeds of four software packages(KNN-GWR, MGWR (PySAL), GWmodel, and Spgwr) in simulated dataset experiments.</p> "> Figure 7
<p>Comparison of KNN-GWR with other GWR packages in terms of runtime increase ratio (tested using simulated dataset).</p> "> Figure 8
<p>Space mapping for: (<b>a</b>) number of bedrooms in the house, (<b>b</b>) number of bathrooms, (<b>c</b>) area of the house, and (<b>d</b>) age of the house, by KNN-GWR modeling.</p> "> Figure 9
<p>Comparison of the computational speeds of four software packages (KNN-GWR, MGWR(PySAL), GWmodel, and Spgwr) in the house price dataset in China.</p> "> Figure 10
<p>Comparison of KNN-GWR with other GWR packages in terms of runtime increase ratio (tested using house price dataset in China).</p> ">
Abstract
:1. Introduction
2. Method
2.1. Geographically Weighted Regression
2.2. Geographically Weighted Regression with K-Nearest Neighbors
2.2.1. Optimal Bandwidth Selection Reference K-Means
Algorithm 1: Optimal Bandwidth Selection Reference K-means |
|
2.2.2. KD Tree Construction and Search
2.2.3. Geographically Weighted Regression with K-Nearest Neighbors and Model Evaluation
Algorithm 2: KNN-GWR Algorithm |
I Optimizing bandwidth search by minimizing CV scores |
1. Given the initial data , , , and |
2. Start the loop : |
3. Build KD tree spatial index |
4. Starting loop with : |
5. Take the data from the result of 3 steps: |
6. Reconfiguration , , and |
7. Calculate , |
8. End of loop |
9. Calculate |
10. End of loop |
II Optimal bandwidth selection based on minimum CV criterion |
11. Start the loop : |
12. Reconfiguration , , and |
13. Calculate , |
14. End of loop |
15. Return , |
16. Spatial analysis and model diagnosis using , |
2.2.4. Computational Complexity of GWR and KNN-GWR in Calibration
Time Complexity
Memory Complexity
3. Experiment
3.1. Data Source
3.1.1. Simulated Dataset
3.1.2. House Price Dataset in China
3.2. Testing Specifications and Environment
3.3. Results
3.3.1. Case One: Simulated Dataset
3.3.2. Case Two: House Price Dataset in China
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Fotheringham, A.S.; Brunsdon, C.; Charlton, M.E. Geographically Weighted Regression; John Wiley & Sons: Hoboken, NJ, USA, 2002. [Google Scholar]
- Brunsdon, C.; Fotheringham, A.S.; Charlton, M.E. Geographically Weighted Regression: A Method for Exploring Spatial Nonstationarity. Geogr. Anal. 1996, 28, 281–298. [Google Scholar] [CrossRef]
- Shi, T.; Hu, X.; Guo, L.; Su, F.; Tu, W.; Hu, Z.; Liu, H.; Yang, C.; Wang, J.; Zhang, J.; et al. Digital mapping of zinc in urban topsoil using multisource geospatial data and random forest. Sci. Total Environ. 2021, 792, 148455. [Google Scholar] [CrossRef]
- Jiang, W.; Rao, P.; Cao, R.; Tang, Z.; Chen, K. Comparative evaluation of geological disaster susceptibility using multi-regression methods and spatial accuracy validation. J. Geogr. Sci. 2017, 27, 439–462. [Google Scholar] [CrossRef] [Green Version]
- Kumar, S.; Lal, R.; Liu, D. A geographically weighted regression kriging approach for mapping soil organic carbon stock. Geoderma 2012, 189–190, 627–634. [Google Scholar] [CrossRef]
- Davies, T.J.; Regetz, J.; Wolkovich, E.M.; McGill, B.J.; Kerkhoff, A. Phylogenetically weighted regression: A method for modelling non-stationarity on evolutionary trees. Glob. Ecol. Biogeogr. 2018, 28, 275–285. [Google Scholar] [CrossRef]
- Mellin, C.; Mengersen, K.; Bradshaw, C.J.A.; Caley, M.J. Generalizing the use of geographical weights in biodiversity modelling. Glob. Ecol. Biogeogr. 2014, 23, 1314–1323. [Google Scholar] [CrossRef]
- Yang, L.; Chau, K.W.; Szeto, W.Y.; Cui, X.; Wang, X. Accessibility to transit, by transit, and property prices: Spatially varying relationships. Transp. Res. Part D Transp. Environ. 2020, 85, 102387. [Google Scholar] [CrossRef]
- Wu, C.; Ren, F.; Hu, W.; Du, Q. Multiscale geographically and temporally weighted regression: Exploring the spatiotemporal determinants of housing prices. Int. J. Geogr. Inf. Sci. 2018, 33, 489–511. [Google Scholar] [CrossRef]
- Fotheringham, A.S.; Crespo, R.; Yao, J. Geographical and Temporal Weighted Regression (GTWR). Geogr. Anal. 2015, 47, 431–452. [Google Scholar] [CrossRef] [Green Version]
- Huang, B.; Wu, B.; Barry, M. Geographically and temporally weighted regression for modeling spatio-temporal variation in house prices. Int. J. Geogr. Inf. Sci. 2010, 24, 383–401. [Google Scholar] [CrossRef]
- Hong, Z.; Mei, C.; Wang, H.; Du, W. Spatiotemporal effects of climate factors on childhood hand, foot, and mouth disease: A case study using mixed geographically and temporally weighted regression models. Int. J. Geogr. Inf. Sci. 2021, 35, 1611–1633. [Google Scholar] [CrossRef]
- Hong, Z.; Hao, H.; Li, C.; Du, W.; Wei, L.; Wang, H. Exploration of potential risks of Hand, Foot, and Mouth Disease in Inner Mongolia Autonomous Region, China Using Geographically Weighted Regression Model. Sci. Rep. 2018, 8, 17707. [Google Scholar] [CrossRef] [Green Version]
- Mainardi, S. Modelling spatial heterogeneity and anisotropy: Child anaemia, sanitation and basic infrastructure in sub-Saharan Africa. Int. J. Geogr. Inf. Sci. 2012, 26, 387–411. [Google Scholar] [CrossRef]
- Lu, X.Y.; Chen, X.; Zhao, X.L.; Lv, D.J.; Zhang, Y. Assessing the impact of land surface temperature on urban net primary productivity increment based on geographically weighted regression model. Sci. Rep. 2021, 11, 22282. [Google Scholar] [CrossRef]
- Bivand, R.; Yu, D.; Nakaya, T.; Garcia-Lopez, M.-A. Package SPGWR; R Software Package; R Foundation for Statistical Computing: Vienna, Austria, 2022. [Google Scholar]
- Oshan, T.; Li, Z.; Kang, W.; Wolf, L.; Fotheringham, A. mgwr: A Python Implementation of Multiscale Geographically Weighted Regression for Investigating Process Spatial Heterogeneity and Scale. ISPRS Int. J. Geo-Inf. 2019, 8, 269. [Google Scholar] [CrossRef] [Green Version]
- Gollini, I.; Lu, B.; Charlton, M.; Brunsdon, C.; Harris, P. GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically Weighted Models. J. Stat. Softw. 2015, 63, 1–50. [Google Scholar] [CrossRef] [Green Version]
- Li, Z.; Fotheringham, A.S.; Li, W.; Oshan, T. Fast Geographically Weighted Regression (FastGWR): A scalable algorithm to investigate spatial process heterogeneity in millions of observations. Int. J. Geogr. Inf. Sci. 2019, 33, 155–175. [Google Scholar] [CrossRef]
- Sudmanns, M.; Tiede, D.; Lang, S.; Bergstedt, H.; Trost, G.; Augustin, H.; Baraldi, A.; Blaschke, T. Big Earth data: Disruptive changes in Earth observation data management and analysis? Int. J. Digit. Earth 2019, 13, 832–850. [Google Scholar] [CrossRef]
- Ma, Y.; Wu, H.; Wang, L.; Huang, B.; Ranjan, R.; Zomaya, A.; Jie, W. Remote sensing big data computing: Challenges and opportunities. Future Gener. Comput. Syst. 2015, 51, 47–60. [Google Scholar] [CrossRef] [Green Version]
- Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef] [Green Version]
- Lü, G.; Batty, M.; Strobl, J.; Lin, H.; Zhu, A.X.; Chen, M. Reflections and speculations on the progress in Geographic Information Systems (GIS): A geographic perspective. Int. J. Geogr. Inf. Sci. 2018, 33, 346–367. [Google Scholar] [CrossRef] [Green Version]
- Apte, J.S.; Messier, K.P.; Gani, S.; Brauer, M.; Kirchstetter, T.W.; Lunden, M.M.; Marshall, J.D.; Portier, C.J.; Vermeulen, R.C.H.; Hamburg, S.P. High-Resolution Air Pollution Mapping with Google Street View Cars: Exploiting Big Data. Environ. Sci. Technol. 2017, 51, 6999–7008. [Google Scholar] [CrossRef]
- Lee, J.-G.; Kang, M. Geospatial Big Data: Challenges and Opportunities. Big Data Res. 2015, 2, 74–81. [Google Scholar] [CrossRef]
- Mendi, A.F.; Demir, Ö.; Sakaklı, K.K.; Çabuk, A. A New Approach to Land Registry System in Turkey: Blockchain-Based System Proposal. Photogramm. Eng. Remote Sens. 2020, 86, 701–709. [Google Scholar] [CrossRef]
- Finley, A.O. Comparing spatially-varying coefficients models for analysis of ecological data with non-stationary and anisotropic residual dependence. Methods Ecol. Evol. 2011, 2, 143–154. [Google Scholar] [CrossRef]
- Harris, R. Grid-enabling Geographically Weighted Regression: A Case Study of Participation in Higher Education in England. Trans. GIS 2010, 14, 43–61. [Google Scholar] [CrossRef]
- Yu, D. Modeling Owner-Occupied Single-Family House Values in the City of Milwaukee: A Geographically Weighted Regression Approach. GISci. Remote Sens. 2007, 44, 267–282. [Google Scholar] [CrossRef]
- Feuillet, T.; Commenges, H.; Menai, M.; Salze, P.; Perchoux, C.; Reuillon, R.; Kesse-Guyot, E.; Enaux, C.; Nazare, J.A.; Hercberg, S.; et al. A massive geographically weighted regression model of walking-environment relationships. J. Transp. Geogr. 2018, 68, 118–129. [Google Scholar] [CrossRef]
- Wang, D.; Yang, Y.; Qiu, A.; Kang, X.; Han, J.; Chai, Z. A CUDA-Based Parallel Geographically Weighted Regression for Large-Scale Geographic Data. ISPRS Int. J. Geo-Inf. 2020, 9, 653. [Google Scholar] [CrossRef]
- Tasyurek, M.; Celik, M. RNN-GWR: A geographically weighted regression approach for frequently updated data. Neurocomputing 2020, 399, 258–270. [Google Scholar] [CrossRef]
- Meenakshi; Gill, S. k-dLst Tree: K-d Tree with Linked List to Handle Duplicate Keys. In Proceedings of the Emerging Trends in Expert Applications and Security, Singapore, 17–18 February 2018; pp. 167–175. [Google Scholar]
- Chen, Z.Y.; Liao, I.Y.; Ahmed, A. KDT-SPSO: A multimodal particle swarm optimisation algorithm based on k-d trees for palm tree detection. Appl. Soft Comput. 2021, 103, 107156. [Google Scholar] [CrossRef]
- Shyu, C.R.; Chi, P.H.; Scott, G.; Xu, D. ProteinDBS: A real-time retrieval system for protein structure comparison. Nucleic Acids Res. 2004, 32, W572–W575. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Böhm, C.; Krebs, F. The k-Nearest Neighbour Join: Turbo Charging the KDD Process. Knowl. Inf. Syst. 2004, 6, 728–749. [Google Scholar] [CrossRef]
- Muja, M.; Lowe, D.G. Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2227–2240. [Google Scholar] [CrossRef]
- Muja, M. Fast approximate nearest neighbors with automatic algorithm configuration. Proc. Viss. 2009, 1, 331–340. [Google Scholar]
- Boukerche, A.; Zheng, L.; Alfandi, O. Outlier detection: Methods, models, and classification. ACM Comput. Surv. (CSUR) 2020, 53, 1–37. [Google Scholar] [CrossRef]
- Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
- Fahad, A.; Alshatri, N.; Tari, Z.; Alamri, A.; Khalil, I.; Zomaya, A.Y.; Foufou, S.; Bouras, A. A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis. IEEE Trans. Emerg. Top. Comput. 2014, 2, 267–279. [Google Scholar] [CrossRef]
- Zhao, W.-L.; Deng, C.-H.; Ngo, C.-W. k-means: A revisit. Neurocomputing 2018, 291, 195–206. [Google Scholar] [CrossRef]
- Macqueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1967; Volume 1. [Google Scholar]
- Selim, S.Z.; Ismail, M.A. K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality. IEEE Trans. Pattern Anal. Mach. Intell. 1984, PAMI-6, 81–87. [Google Scholar] [CrossRef]
- Li, S.; Lyu, D.; Huang, G.; Zhang, X.; Gao, F.; Chen, Y.; Liu, X. Spatially varying impacts of built environment factors on rail transit ridership at station level: A case study in Guangzhou, China. J. Transp. Geogr. 2020, 82, 102631. [Google Scholar] [CrossRef]
- Hernández, J.M.; Bulchand-Gidumal, J.; Suárez-Vega, R. Using accommodation price determinants to segment tourist areas. J. Destin. Mark. Manag. 2021, 21, 100622. [Google Scholar] [CrossRef]
- Deng, X.; Gao, F.; Liao, S.; Li, S. Unraveling the association between the built environment and air pollution from a geospatial perspective. J. Clean. Prod. 2023, 386, 135768. [Google Scholar] [CrossRef]
- Murakami, D.; Tsutsumida, N.; Yoshida, T.; Nakaya, T.; Lu, B. Scalable GWR: A Linear-Time Algorithm for Large-Scale Geographically Weighted Regression with Polynomial Kernels. Ann. Am. Assoc. Geogr. 2020, 111, 459–480. [Google Scholar] [CrossRef]
- Murakami, D.; Griffith, D.A. Spatially varying coefficient modeling for large datasets: Eliminating N from spatial regressions. Spat. Stat. 2019, 30, 39–64. [Google Scholar] [CrossRef] [Green Version]
- Mardia, K.V.; Kent, J.T.; Bibby, J.M. Multivariate Analysis; Academic Press: London, UK, 1979. [Google Scholar]
- Carlis, J.; Bruso, K. Rsqrt: An Heuristic for Estimating the Number of Clusters to Report. Electron. Commer. Res. Appl. 2012, 11, 152–158. [Google Scholar] [CrossRef] [Green Version]
- Hassanat, A.B.; Abbadi, M.A.; Altarawneh, G.A.; Alhasanat, A.A. Solving the Problem of the K Parameter in the KNN Classifier Using an Ensemble Learning Approach. Comput. Sci. 2014, 12, 33–39. [Google Scholar]
- Sugar, C.A.; James, G.M. Finding the Number of Clusters in a Dataset. J. Am. Stat. Assoc. 2003, 98, 750–763. [Google Scholar] [CrossRef]
- Tibshirani, R.; Walther, G.; Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B 2001, 63, 411–423. [Google Scholar] [CrossRef]
- Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical Recipes: The Art of Scientific Computing, 3rd ed.; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar]
- Harris, P.; Fotheringham, A.S.; Crespo, R.; Charlton, M. The Use of Geographically Weighted Regression for Spatial Prediction: An Evaluation of Models Using Simulated Data Sets. Math. Geosci. 2010, 42, 657–680. [Google Scholar] [CrossRef]
- Chen, F.; Mei, C.-L. Scale-adaptive estimation of mixed geographically weighted regression models. Econ. Model. 2021, 94, 737–747. [Google Scholar] [CrossRef]
Name of Algorithm | Time Complexity |
---|---|
KNN-GWR | |
FastGWR | [19] |
MGWR(PySAL), GWmodel and Spgwr | [19,31] |
Number of Data Points | KNN-GWR | FastGWR | MGWR(PySAL), GWmodel, and Spgwr |
---|---|---|---|
1000 | 33.8 KB | 19.5 KB | 3.8 MB |
10,000 | 33.8 KB | 195.3 KB | 380 MB |
100,000 | 33.8 KB | 1.9 MB | 37.25 GB |
1,000,000 | 33.8 KB | 19.1 MB | 3.8 TB |
10,000,000 | 33.8 KB | 190.7 MB | 364 TB |
KNN-GWR, MGWR(PySAL), GWmodel, Spgwr | |
---|---|
CPU | Intel(R) Core(TM) i7-1065G7 CPU @ 1.30 GHz, 8 cores |
Memory | 16,384 MB RAM |
Number of Data Points | KNN-GWR | MGWR(PySAL) | GWmodel | Spgwr | ||||
---|---|---|---|---|---|---|---|---|
Runtimes (s) | Runtimes (s) | Runtimes (s) | Runtimes (s) | |||||
1000 | 0.34 | 0.995 | 3.71 | 0.995 | 1.33 | 0.994 | 21.65 | 0.994 |
5000 | 3.11 | 0.996 | 43.51 | 0.995 | 41.51 | 0.997 | 1514.71 | 0.997 |
10,000 | 3.59 | 0.996 | 178.52 | 0.994 | 211.41 | 0.998 | 11,574.78 | 0.997 |
15,000 | 6.56 | 0.993 | 342.51 | 0.996 | 598.91 | 0.992 | n/a | n/a |
20,000 | 8.52 | 0.994 | 758.81 | 0.996 | 1234.06 | 0.993 | n/a | n/a |
50,000 | 31.77 | 0.997 | 6327.91 | 0.996 | 8283.33 | 0.995 | n/a | n/a |
100,000 | 83.06 | 0.998 | n/a | n/a | n/a | n/a | n/a | n/a |
10,000,000 | 865.13 | 0.999 | n/a | n/a | n/a | n/a | n/a | n/a |
Number of Data Points | KNN-GWR | MGWR(PySAL) | GWmodel | Spgwr |
---|---|---|---|---|
1000 | 0.73 | 3.43 | 1.45 | 24.70 |
5000 | 4.51 | 44.92 | 45.26 | 1564.89 |
10,000 | 8.73 | 164.20 | 239.99 | 14,812.23 |
15,000 | 18.32 | 359.21 | 646.92 | n/a |
20,000 | 25.21 | 563.13 | 978.95 | n/a |
50,000 | 51.21 | 4116.12 | 7445.82 | n/a |
100,000 | 138.33 | n/a | n/a | n/a |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yang, X.; Yang, Y.; Xu, S.; Han, J.; Chai, Z.; Yang, G. A New Algorithm for Large-Scale Geographically Weighted Regression with K-Nearest Neighbors. ISPRS Int. J. Geo-Inf. 2023, 12, 295. https://doi.org/10.3390/ijgi12070295
Yang X, Yang Y, Xu S, Han J, Chai Z, Yang G. A New Algorithm for Large-Scale Geographically Weighted Regression with K-Nearest Neighbors. ISPRS International Journal of Geo-Information. 2023; 12(7):295. https://doi.org/10.3390/ijgi12070295
Chicago/Turabian StyleYang, Xiaoyue, Yi Yang, Shenghua Xu, Jiakuan Han, Zhengyuan Chai, and Gang Yang. 2023. "A New Algorithm for Large-Scale Geographically Weighted Regression with K-Nearest Neighbors" ISPRS International Journal of Geo-Information 12, no. 7: 295. https://doi.org/10.3390/ijgi12070295
APA StyleYang, X., Yang, Y., Xu, S., Han, J., Chai, Z., & Yang, G. (2023). A New Algorithm for Large-Scale Geographically Weighted Regression with K-Nearest Neighbors. ISPRS International Journal of Geo-Information, 12(7), 295. https://doi.org/10.3390/ijgi12070295