
CN115424139A - Residential area extraction method fusing remote sensing data and position big data - Google Patents


Info

Publication number
CN115424139A
Authority
CN
China
Prior art keywords
data
remote sensing
residential
grid
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210701304.5A
Other languages
Chinese (zh)
Inventor
夏南
赵鑫
王梓宇
庄苏丹
高醒
陈振杰
李满春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202210701304.5A
Publication of CN115424139A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/817 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level by voting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/176 Urban or other man-made structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Strategic Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a residential area extraction method fusing remote sensing data and position big data, which comprises the following steps: S1, acquiring open-source position big data by adopting a web crawler technology, and extracting time sequence features of the position big data; S2, combining multi-scale remote sensing images and remote sensing classification product data, and constructing and labeling a training sample set and a verification sample set by adopting a stratified random sampling method; S3, obtaining multi-dimensional features of the remote sensing data and the position big data based on a unified evaluation unit, and screening low-redundancy features according to the feature importance analysis result; and S4, constructing a random forest classifier fusing the remote sensing data and the position big data, and realizing extraction of large-area residential areas. The accuracy of the residential area extraction result obtained with combined remote sensing and position features is analyzed by comparison with existing products and with results extracted from a single feature type, and a machine learning model for large-area residential area extraction is constructed, thereby realizing high-accuracy extraction of large-area residential areas.

Description

Residential area extraction method fusing remote sensing data and position big data
Technical Field
The invention relates to the technical field of position data processing, in particular to a residential area extraction method fusing remote sensing data and position big data.
Background
Residential areas are places where people gather for activities such as production and daily life, and are the foundation of various human activities; they can generally be divided into urban residential areas and rural residential areas. The extraction of residential areas is the basis for researching and solving problems related to urbanization, particularly in rapidly urbanizing and densely populated areas, such as the urban heat island effect, extreme urban air temperatures, vegetation reduction and ecological deterioration, exhaustion of urban water resources, and sharp reduction of farmland resources. Because the grid is a standard and efficient form of data organization, it is the main research unit for extracting residential areas over large areas.
Medium- and low-resolution optical remote sensing images, such as MODIS data with 500-meter resolution and Landsat data with 30-meter resolution, are generally used for extracting residential areas over large areas and at large scales. Residential areas and non-residential areas are distinguished by extracting features such as the specific spectrum, texture and morphology of construction land in the remote sensing image. However, optical remote sensing does not directly reflect the regional social attributes and human activities required for residential area identification. Night-time light remote sensing captures direct or indirect light sources on the surface at night, images them, and obtains the night-time light intensity. Night-time light data can reflect human activity intensity to a certain extent: the higher the intensity of human activity at night in an area, the higher the night-time light value on the remote sensing image. VIIRS/DNB (Visible Infrared Imaging Radiometer Suite Day/Night Band) data with a spatial resolution of 500 meters is currently the most widely applied night-time light data in large-scale, large-area research. Integrating night-time light and optical remote sensing data can represent human activity intensity to a certain degree and makes the data complementary, but a direct representation of human activity intensity is still lacking.
Position big data reflects the preferences, commuting, gathering and other behavior of users, and has spatial, temporal and rich social attributes. Such data can represent human activity characteristics, distribution rules, movement patterns and the like, and can be used for mining socio-economic attributes, extracting the spatial extent of residential areas, analyzing the spatial structure of towns and so on. Position data are generally point data with high precision and large volume; the original point data are usually first aggregated by block or building, and a mapping between the point data and the corresponding aggregation unit is established. Feature extraction is the key to mining the latent information in position data. Position data are closely related to the rules of human activity, and daily human activities basically follow a certain biological clock, which appears as distinct time sequence features. Position data generally have fine time granularity and long time series; according to the research purpose, data characteristics, regional differences and the like, they are usually divided into time units such as hours, day and night, days, and weekdays and weekends, and their time sequence variation features are counted and mined per time unit.
Remote sensing image data carry abundant physical characteristics of surface features, while position data carry abundant socio-economic and human activity characteristics; the two are complementary, and integrating them allows residential areas to be extracted effectively. When multi-source data such as remote sensing data and position information are integrated, a machine learning method is generally adopted for information extraction because a large number of multi-source, multi-dimensional features with different meanings must be processed. Commonly used algorithms include Bayesian models, classification and regression trees (CART), support vector machines (SVM), artificial neural networks (ANN), random forests (RF) and the like. As an ensemble classifier, the random forest is superior to common individual classifiers in classification performance and in its capacity to process multi-source features, and is widely applied in land feature classification and information extraction research. However, how to construct a random forest classifier that integrates remote sensing data and big data, and how to select training samples, classification features, model parameters and the like in a scientific, systematic and targeted way, are problems that must be considered in constructing a high-precision classifier.
Therefore, the current extraction of residential area information over large areas still faces certain technical problems:
(1) How to extract the time sequence features of position big data so as to represent the distribution information of residential areas.
(2) Remote sensing data and position big data differ in spatial dimension, data structure, spatio-temporal semantics and other aspects, and how to integrate the two types of data remains to be solved.
(3) How to construct a machine learning model for extracting large-area residential areas and realize high-precision extraction of large-area residential areas.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a residential area extraction method fusing remote sensing data and position big data, so as to overcome the technical problems in the existing related art.
To this end, the invention adopts the following specific technical scheme. The method comprises the following steps:
S1, acquiring open-source position big data by adopting a web crawler technology, and acquiring time sequence features of the position big data;
S2, combining multi-scale remote sensing images and remote sensing classification product data, and constructing and labeling a training sample set and a verification sample set by adopting a stratified random sampling method;
S3, obtaining multi-dimensional features of the remote sensing data and the position big data based on a unified evaluation unit, and screening low-redundancy features according to the feature importance analysis result;
and S4, constructing a random forest classifier fusing the remote sensing data and the position big data, realizing extraction of large-area residential areas, and verifying the effectiveness of the combined features through precision analysis and comparative analysis.
Further, acquiring the open-source position big data by adopting the web crawler technology and acquiring its time sequence features comprises the following steps:
S11, constructing a web crawler by adopting the Python Requests library, collecting Tencent positioning density data, parsing the page data and extracting the required positioning frequency and grid position information;
S12, constructing standard time series curves of the hour-level positioning frequency for residential areas and non-residential areas;
S13, taking the standard time series curve of the residential area as a reference, constructing an hour-level positioning frequency time series curve for each grid, and normalizing it to the range 0-1;
S14, calculating the similarity between each grid time series curve and the standard time series curve by using the Mahalanobis distance and the Pearson correlation coefficient;
S15, calculating the Mahalanobis distance and the Pearson correlation coefficient by adopting the Python SciPy library, to serve as the correlation between a grid curve and the hour-level standard time series curve;
and S16, reflecting the difference in human activity intensity between weekdays and weekends in residential areas and non-residential areas by adopting the ratio between the weekday and weekend daily average positioning frequencies.
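Steps S13 to S16 can be illustrated with a minimal Python sketch. It is not the patented implementation: the 24-hour curves are synthetic, the covariance matrix is assumed to be estimated from the normalized grid curves (the text does not say how it is obtained), and the direction of the weekday/weekend ratio is likewise an assumption. The helper names `min_max_normalize`, `curve_similarity` and `weekday_weekend_ratio` are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import pearsonr


def min_max_normalize(curve):
    """Rescale a 1-D series to the 0-1 range (a constant series maps to zeros)."""
    curve = np.asarray(curve, dtype=float)
    span = curve.max() - curve.min()
    if span == 0:
        return np.zeros_like(curve)
    return (curve - curve.min()) / span


def curve_similarity(grid_curve, standard_curve, inv_cov):
    """Return (reciprocal Mahalanobis distance, Pearson r) for one grid cell."""
    x = min_max_normalize(grid_curve)
    y = min_max_normalize(standard_curve)
    d_m = mahalanobis(x, y, inv_cov)
    r, _ = pearsonr(x, y)
    # The text takes the reciprocal of the distance; guard against d_m == 0.
    reciprocal = np.inf if d_m == 0 else 1.0 / d_m
    return reciprocal, r


def weekday_weekend_ratio(weekday_daily, weekend_daily):
    """Ratio of mean weekday to mean weekend daily frequency (order assumed)."""
    return float(np.mean(weekday_daily) / np.mean(weekend_daily))


# Synthetic 24-hour curves (hypothetical values, for illustration only).
hours = np.arange(24)
standard = 1.0 + np.sin((hours - 6) / 24 * 2 * np.pi)
rng = np.random.default_rng(0)
grids = standard + rng.normal(scale=0.2, size=(200, 24))

# Assumption: the covariance is estimated from the normalized grid curves.
normalized = np.array([min_max_normalize(g) for g in grids])
inv_cov = np.linalg.pinv(np.cov(normalized, rowvar=False))

recip_dm, pearson_r = curve_similarity(grids[0], standard, inv_cov)
print(f"1/D_M = {recip_dm:.3f}, Pearson r = {pearson_r:.3f}")
```

Using the pseudo-inverse keeps the sketch usable when the estimated covariance matrix is close to singular, which can happen when few grid curves are available.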
Further, the Mahalanobis distance is calculated by the following expression:
Figure BDA0003704324700000031
in the formula, D_M represents the distance calculation result, and its reciprocal is taken as the Mahalanobis distance value;
X_i represents the discretized target time series, having a length n;
Y_i represents the discretized standard time series, having a length n;
i represents the serial number of each discrete point in the time series;
S^-1 represents the inverse of the covariance matrix of the target time series and the standard time series.
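The formula image is not reproduced in this text. Assuming it follows the standard definition of the Mahalanobis distance between two vectors, a form consistent with the symbols defined above would be the following, where X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) are the stacked series; this vector notation is an assumption and is not taken from the source.

```latex
D_M = \sqrt{(X - Y)^{\mathsf{T}} \, S^{-1} \, (X - Y)}
```

When S is the identity matrix, this expression reduces to the Euclidean distance between the two series.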
Further, the calculation expression of the Pearson correlation coefficient is:
Figure BDA0003704324700000032
in the formula, D_P denotes the Pearson correlation coefficient; the larger the correlation coefficient D_P, the stronger the correlation;
D_P = 1 represents a perfect positive linear relationship;
D_P = -1 represents a perfect negative linear relationship;
X̄ represents the average value of the target time series;
Ȳ represents the average value of the standard time series;
X_i represents the discretized target time series, having a length n;
Y_i represents the discretized standard time series, having a length n;
i represents the serial number of each discrete point in the time series.
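The formula image is not reproduced in this text. The standard definition of the Pearson correlation coefficient, written with the symbols defined above, is given here on the assumption that the original expression follows it:

```latex
D_P = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
           {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^{2}} \; \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^{2}}}
```

The value is unchanged when either series is rescaled by a positive linear transformation, so normalizing the curves to 0-1 does not affect it.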
Further, combining the multi-scale remote sensing images and the remote sensing classification product data, and constructing and labeling the training sample set and the verification sample set by adopting the stratified random sampling method comprises the following steps:
S21, randomly selecting 0.01-degree grids, and calculating the proportion of residential area in each grid based on a plurality of remote sensing classification product data;
S22, based on Sentinel-2 and Landsat-8 annual composite images synthesized on the GEE (Google Earth Engine) platform, calculating the normalized difference built-up index (NDBI) of the Sentinel-2 and Landsat-8 pixels in the grid respectively;
S23, averaging the NDBI values of the two images to obtain the NDBI value of each pixel in the 0.01-degree sample grid, setting pixels with positive values as construction land pixels and pixels with negative values as non-construction land pixels, and counting the proportion of construction land pixels in the grid;
S24, carrying out a weighted summation of the obtained construction land proportion values to obtain the comprehensive proportion of construction land;
S25, labeling grids whose comprehensive proportion of construction land is larger than 30% as residential areas, and grids whose comprehensive proportion is smaller than 30% as non-residential areas;
and S26, checking the comprehensive proportion of construction land by visual interpretation based on the high-resolution remote sensing images of the GEE platform, so as to label the samples accurately.
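The labeling rule in steps S22 to S25 can be sketched as follows. This is a minimal illustration under several assumptions: NDBI is taken as (SWIR - NIR) / (SWIR + NIR), both images are assumed to be already resampled to the same pixel grid, the reflectance values are invented, and the equal weights in the weighted sum are hypothetical because the text does not state them.

```python
import numpy as np


def ndbi(swir, nir):
    """Normalized difference built-up index: (SWIR - NIR) / (SWIR + NIR)."""
    swir = np.asarray(swir, dtype=float)
    nir = np.asarray(nir, dtype=float)
    return (swir - nir) / (swir + nir)


def built_up_fraction(ndbi_a, ndbi_b):
    """Average two NDBI rasters and return the share of pixels with a positive value."""
    mean_ndbi = (ndbi_a + ndbi_b) / 2.0
    return float(np.mean(mean_ndbi > 0))


def label_grid(fractions, weights, threshold=0.30):
    """Weighted combination of built-up fractions, then the 30% labeling rule."""
    combined = float(np.dot(fractions, weights) / np.sum(weights))
    label = "residential" if combined > threshold else "non-residential"
    return combined, label


# Invented 2 x 2 reflectance patches for one 0.01-degree grid.
s2_ndbi = ndbi([[0.30, 0.10], [0.25, 0.12]], [[0.20, 0.30], [0.20, 0.35]])
l8_ndbi = ndbi([[0.28, 0.12], [0.15, 0.10]], [[0.22, 0.28], [0.25, 0.30]])
ndbi_fraction = built_up_fraction(s2_ndbi, l8_ndbi)

# Hypothetical fractions from two classification products, equal weights assumed.
combined, label = label_grid([ndbi_fraction, 0.40, 0.35], weights=[1.0, 1.0, 1.0])
print(ndbi_fraction, round(combined, 3), label)
```

The text does not say how a grid with a proportion of exactly 30% is labeled; the sketch treats it as non-residential.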
Further, obtaining the multi-dimensional features of the remote sensing data and the position big data based on the unified evaluation unit, and screening low-redundancy features according to the feature importance analysis result comprises the following steps:
S31, extracting 18 features from the remote sensing data and the position big data on the basis of the unified evaluation unit;
S32, calculating the projected area of each grid under the Lambert conic projection, and dividing the feature value of each grid by its projected area to obtain the grid feature value per unit area;
S33, eliminating features with high similarity through feature correlation calculation, so as to reduce redundancy;
S34, determining the input features through the feature dissimilarity index, and controlling the growth of the decision trees;
S35, analyzing the feature contribution rate by adopting the relative importance index, so as to evaluate the extraction result;
S36, calculating the pairwise correlation of the features by adopting the Spearman correlation coefficient;
S37, increasing the dissimilarity between residential areas and non-residential areas during the growth of the decision trees by adopting the Gini index in the random forest classifier;
and S38, determining the optimal splitting feature of each node by measuring the impurity between the classes.
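The redundancy screening in steps S33 and S36 can be sketched as a greedy filter over the pairwise Spearman correlation matrix. The text does not give the correlation threshold or the selection rule, so the threshold of 0.8, the rule of keeping the more important feature of each correlated pair, the feature names and the importance values below are all hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_matrix(features):
    """Pairwise Spearman correlation, computed as Pearson correlation of column ranks."""
    ranks = np.apply_along_axis(rankdata, 0, np.asarray(features, dtype=float))
    return np.corrcoef(ranks, rowvar=False)


def drop_redundant(features, names, importance, max_abs_rho=0.8):
    """Keep features in descending importance, skipping any feature whose
    absolute Spearman correlation with an already kept feature is too high."""
    rho = np.abs(spearman_matrix(features))
    kept = []
    for j in np.argsort(importance)[::-1]:
        if all(rho[j, k] <= max_abs_rho for k in kept):
            kept.append(int(j))
    return [names[j] for j in sorted(kept)]


# Synthetic feature table: 300 grids, 3 features (illustrative values only).
rng = np.random.default_rng(42)
night_light = rng.normal(size=300)
poi_density = 2.0 * night_light + rng.normal(scale=0.05, size=300)  # near duplicate
ndvi = rng.normal(size=300)

features = np.column_stack([night_light, poi_density, ndvi])
names = ["night_light", "poi_density", "ndvi"]
importance = np.array([0.5, 0.3, 0.2])  # hypothetical relative importance

selected = drop_redundant(features, names, importance)
print(selected)
```

In practice the importance values would come from the trained random forest, for example its Gini-based feature importances, rather than being set by hand.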
Further, the 18 features include 8 remote sensing features and 10 location features.
Further, the expression of the Gini index adopted by the random forest classifier is as follows:
Figure BDA0003704324700000051
wherein T represents the training samples;
Figure BDA0003704324700000052 (symbol image) represents the prediction type of the decision tree;
prob_S represents the probability of classification as a residential area;
prob_nS represents the probability of classification as a non-residential area.
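The formula image is not reproduced in this text. Assuming the usual two-class Gini impurity, and writing the prediction type as \hat{y} (a symbol chosen here for illustration), a form consistent with the definitions above would be:

```latex
\mathrm{Gini}(T) = 1 - \sum_{\hat{y} \in \{S,\, nS\}} prob_{\hat{y}}^{2}
                 = 1 - prob_{S}^{2} - prob_{nS}^{2}
```

Under this form, a node containing only one class has an impurity of 0, and a node split evenly between the two classes has the maximum impurity of 0.5.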
Further, constructing the random forest classifier fusing the remote sensing data and the position big data to extract large-area residential areas, and verifying the effectiveness of the combined features through precision analysis and comparative analysis comprises the following steps:
S41, constructing a random forest classifier with a soft voting strategy by adopting the Python open-source machine learning toolkit Scikit-learn;
S42, setting some of the parameters of the random forest classifier to achieve the optimal classification effect;
and S43, adopting a grid search method with a preset parameter range and step interval, performing model training and result prediction based on the training samples, and taking the minimum out-of-bag (OOB) error as the criterion for setting the optimal model parameters.
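Steps S41 to S43 can be sketched with Scikit-learn as follows. The text does not name the tuned parameters or their ranges, so `n_estimators` and `max_features`, the grid values and the synthetic data set are all assumptions made for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the labeled grid samples (not the patent's data).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)

# Hypothetical search ranges; the text only mentions a preset range and step.
param_grid = {"n_estimators": [50, 100], "max_features": [2, 4]}

best_params, best_oob_error = None, float("inf")
for n_estimators, max_features in product(
    param_grid["n_estimators"], param_grid["max_features"]
):
    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        bootstrap=True,
        oob_score=True,  # score each sample with the trees that did not see it
        random_state=0,
    )
    clf.fit(X, y)
    oob_error = 1.0 - clf.oob_score_
    if oob_error < best_oob_error:
        best_params = {"n_estimators": n_estimators, "max_features": max_features}
        best_oob_error = oob_error

print(best_params, round(best_oob_error, 4))
```

Because the OOB error is computed from samples left out of each bootstrap draw, this selection needs no separate validation split, which is presumably why it is used as the tuning criterion.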
Further, the expression of the soft voting strategy is as follows:
Figure BDA0003704324700000053
in the formula, N represents the number of decision trees;
prob represents the prediction probability of the output result;
prob_S represents the probability of classification as a residential area;
prob_nS represents the probability of classification as a non-residential area;
x represents the input feature vector;
Figure BDA0003704324700000061 (symbol image) represents the prediction type corresponding to the output result;
Figure BDA0003704324700000062 (symbol image) represents the prediction type given by each decision tree;
n denotes the nth decision tree.
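The formula image is not reproduced in this text. As a minimal sketch, assuming soft voting here means averaging the class probabilities of the N trees and choosing the class with the highest mean probability, the strategy can be illustrated as follows. The function name `soft_vote` and the probability values are hypothetical.

```python
import numpy as np


def soft_vote(tree_probs, classes=("residential", "non-residential")):
    """Average per-tree class probabilities and pick the most probable class.

    tree_probs has shape (N, 2): one row [prob_S, prob_nS] per decision tree.
    """
    tree_probs = np.asarray(tree_probs, dtype=float)
    mean_probs = tree_probs.mean(axis=0)
    return classes[int(np.argmax(mean_probs))], mean_probs


# Three hypothetical trees: one confident "residential", two weak "non-residential".
tree_probs = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
label, mean_probs = soft_vote(tree_probs)
print(label, mean_probs)
```

In this example a hard majority vote would choose the non-residential class (two trees out of three), whereas soft voting chooses the residential class because the averaged probability favors it.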
The invention has the following beneficial effects: firstly, open-source position big data, including OSM data, POI data, Tencent positioning data and the like, are acquired by adopting a web crawler technology, and their time sequence features are extracted; secondly, multi-dimensional features of the remote sensing data and the position big data are obtained by adopting a feature screening technology, and a training sample set and a verification sample set are constructed by adopting a multi-scale sample selection technology; finally, a random forest classifier is constructed to extract kilometer-grid residential area information over a large area, and the accuracy of the extraction result based on combined remote sensing and position features is analyzed by comparison with existing products and with results extracted from a single feature type; a machine learning model for large-area residential area extraction is thereby constructed, and high-precision extraction of large-area residential areas is achieved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a residential area extraction method fusing remote sensing data and position big data according to an embodiment of the present invention;
FIG. 2 is a graph of the hourly time series of Tencent LRD data in the residential area extraction method fusing remote sensing data and position big data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the labeling process for 0.01-degree residential area grid samples in the residential area extraction method fusing remote sensing data and position big data according to an embodiment of the present invention;
FIG. 4 is a graph showing the effect of random forest parameter settings on the OOB error in the residential area extraction method fusing remote sensing data and position big data according to an embodiment of the present invention;
FIG. 5 is a diagram showing the residential area grid extraction result of the residential area extraction method fusing remote sensing data and position big data according to an embodiment of the present invention;
FIG. 6 is a comparison of the residential area extraction result of the residential area extraction method fusing remote sensing data and position big data with the Landsat image, other products and the reference boundary according to an embodiment of the present invention;
FIG. 7 is a comparison of the residential area grid extraction results obtained with the combined remote sensing and position features, the remote sensing features alone and the position features alone in the residential area extraction method fusing remote sensing data and position big data according to an embodiment of the present invention.
Detailed Description
To further explain the embodiments, the present invention provides drawings, which form part of this disclosure. The drawings mainly illustrate the embodiments and, together with the description, serve to explain the principles of operation of the embodiments; with reference to them, those skilled in the art will understand other possible embodiments and advantages of the invention.
According to the embodiment of the invention, a residential area extraction method fusing remote sensing data and position big data is provided.
In the example of the present invention, the study area is China, with a total area of 9.6341 million square kilometers and a population of 1.41 billion in 2015. According to the MODIS/LC land cover information of 2017, the land cover of China mainly comprises grassland (grassland, sparse grassland, tropical grassland and the like), barren land and cultivated land, with area proportions of about 45.95%, 24.28% and 13.22% respectively. Grassland is mainly distributed in the west, the northwest and other regions; cultivated land is largely distributed in North China, Northeast China, East China and other regions; and residential areas account for only a small proportion of the land cover.
The data products and data sources utilized in the present invention include:
(1) Night-time light remote sensing data
NOAA/NCEI (National Centers for Environmental Information) provides a monthly composite night-time light remote sensing data product, VIIRS/DNB, with a spatial resolution of about 500 m. The product uses the WGS84 geographic coordinate system and covers the region of the globe from 75°N to 65°S. The data are equally divided into 6 tiles, and the study area involves 2 of them, namely 75°N/60°E and 00°N/60°E, where 75°N/60°E denotes the rectangular area covering 0-75°N and 60°E-180°E, and 00°N/60°E denotes the rectangular area covering 65°S-0° and 60°E-180°E. The monthly product is composited from the mean of the daily valid grid values, and noise such as stray light, lightning and moonlight can be removed. However, some grids have no valid observation, for example where they are obscured by cloud. Therefore, the MVC method is adopted to synthesize an annual product, so as to solve the problem of missing data and eliminate night-time light errors caused by seasonal fluctuation. The night-time light data of the 12 months from May 2017 to April 2018 are selected for the synthesis.
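The annual synthesis can be sketched as follows, assuming that MVC stands for maximum value compositing (the text does not expand the abbreviation) and that missing observations are stored as NaN. The array values and the function name `max_value_composite` are hypothetical.

```python
import numpy as np


def max_value_composite(monthly_stack):
    """Per-pixel maximum over the time axis, ignoring missing (NaN) months.

    monthly_stack has shape (months, rows, cols). Pixels with no valid
    observation in any month stay NaN in the composite.
    """
    stack = np.asarray(monthly_stack, dtype=float)
    all_missing = np.all(np.isnan(stack), axis=0)
    filled = np.where(np.isnan(stack), -np.inf, stack)
    composite = filled.max(axis=0)
    composite[all_missing] = np.nan
    return composite


nan = np.nan
# Three hypothetical monthly rasters of 2 x 2 pixels with cloud gaps.
monthly = [
    [[1.0, nan], [nan, 4.0]],
    [[3.0, nan], [2.0, nan]],
    [[nan, nan], [5.0, 1.0]],
]
annual = max_value_composite(monthly)
print(annual)
```

A pixel is filled as long as at least one month has a valid value, which is how compositing addresses the missing data described above.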
(2) MODIS data/product
(1) NDVI (Normalized Difference Vegetation Index) vegetation cover product (MOD13A1, version 6): a 16-day composite product with a spatial resolution of about 500 m; the 23 composites of 2017 are used. Vegetation coverage is generally lower in residential areas because of the greater human activity. The NDVI product is synthesized into an annual product by adopting the MVC method, which minimizes the influence of seasonal vegetation change. (2) LST land surface temperature product: an 8-day composite daytime temperature product with a spatial resolution of about 1 km; the 46 composites of 2017 are used. Because human activity is concentrated, the heat island effect is significant in residential areas, where the average temperature is significantly higher than in non-residential areas. The LST product is synthesized into an annual product by averaging, so as to reduce the influence of seasonal temperature variation. (3) LC land cover product: the MODIS land cover product of 2017, with a spatial resolution of about 1 km.
The MODIS data use a unified tile numbering scheme, and 22 tiles are required to cover China: h23v04-v05, h24v04-v05, h25v03-v06, h26v03-v06, h27v04-v06, h28v05-v08, h29v06-v08. All MODIS products are mosaicked with the Mosaic module of the official MRT toolkit, and the original sinusoidal projection is then converted into WGS84 geographic coordinates through the Sample module of ArcGIS.
(3) Tencent positioning density data
The Tencent location service provides accurate positioning services for users of related applications, including WeChat, QQ, Tencent Maps, DiDi Chuxing and the like. By the end of 2019, the daily average number of positioning requests handled by the Tencent location service exceeded 50 billion, covering about 650 million users. Because individual-level data raise privacy and other concerns, Tencent provides grid-level Location Request Density (LRD) data, which represent the positioning frequency within a kilometer grid (0.01-degree grid) over a certain time. Data are crawled from the web page backend every 5 minutes, and the raw data have already been thinned. To ensure the quality of the hour-level data, the raw data are simply completed according to the grid serial number and the occurrence frequency of the data, generating hour-level grid positioning frequencies and alleviating the sparsity of the raw data. By analyzing the data transmitted by the page, a web crawler is constructed directly with the Python Requests library and the like, Tencent LRD data from September to December 2018 are collected, and the page data are parsed to extract the required positioning frequency and grid position information.
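The aggregation from 5-minute records to hour-level grid frequencies can be sketched as follows. The record layout (grid identifier, timestamp, count) is hypothetical, summation is assumed as the aggregation rule, and the completion step described above is not reproduced because the text does not specify it. The crawl itself is omitted, since the page endpoints are not given.

```python
from collections import defaultdict
from datetime import datetime


def hourly_frequency(records):
    """Sum 5-minute positioning counts into (grid_id, hour) totals.

    Each record is a (grid_id, timestamp, count) tuple; this layout is an
    assumption made for illustration.
    """
    totals = defaultdict(int)
    for grid_id, timestamp, count in records:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[(grid_id, hour)] += count
    return dict(totals)


# Invented sample records for two grids.
records = [
    ("g1", datetime(2018, 9, 1, 8, 0), 3),
    ("g1", datetime(2018, 9, 1, 8, 5), 4),
    ("g1", datetime(2018, 9, 1, 9, 0), 2),
    ("g2", datetime(2018, 9, 1, 8, 55), 7),
]
hourly = hourly_frequency(records)
for key in sorted(hourly):
    print(key, hourly[key])
```

Keying the totals by grid and hour gives one value per grid per hour, the form needed to build the 24-point daily curves used in the similarity calculation.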
(4) OSM-road network/POI data
OpenStreetMap (OSM) is a crowdsourced database that provides open-source road network data and also contains basic geographic information such as building footprints, points of interest, traffic facilities, land cover, water bodies and natural features. An OSM Shapefile vector database current as of June 2018 was obtained. Because OSM data are crowdsourced, the completeness of everything other than the road network and point-of-interest data is mediocre, so only these two data types are used for residential area extraction.
The road network data are divided into different classes and, according to road attributes, can be roughly grouped into motor-vehicle roads and non-motor-vehicle roads. Motor-vehicle roads include expressways, trunk roads, primary, secondary and tertiary roads and their link roads; non-motor-vehicle roads include residential roads, living streets, sidewalks, cycleways, footpaths and their link roads.
The point-of-interest data are spread across several layers of the OSM database: they appear not only in the dedicated POI layers but also in other layers such as building footprints and traffic lines. POI density extraction here therefore also includes such line and area data, to retain as much POI information as possible. The OSM point-of-interest data can be broadly divided into public services (police, post offices, libraries, education, prisons, etc.), medical and health, entertainment (theaters, cinemas, stadiums, etc.), dining, lodging, shopping, finance, tourism, religion, transportation infrastructure (stations, airports, ferries, etc.), transportation accessories (traffic signs, signal lights, cameras, parking lots, etc.) and others (untagged buildings, public toilets, benches, trash cans, etc.).
The present invention is further described below with reference to the drawings and the detailed description. As shown in figs. 1-7, an embodiment of the invention provides a residential area extraction method fusing remote sensing data and position big data, comprising the following steps:
S1, acquiring open-source position big data with web crawler technology and obtaining its time-series features, comprising the following steps:
S11, constructing a web crawler with the Python Requests library, collecting Tencent location request density (LRD) data, parsing the page data, and extracting the required positioning frequency and grid position information;
S12, constructing standard time-series curves of the hour-level positioning frequency for residential and non-residential areas;
The mean hourly positioning frequency of the residential area samples (0.01-degree grids) is computed. The statistics show that the positioning frequency declines gradually from 21:00, reaches its daily trough at around 05:00, then rises steadily until about 13:00, dips slightly between about 13:00 and 16:00, and rises again after 16:00 until it reaches the daily peak at around 21:00 (as shown in fig. 2). The positioning frequency of the non-residential area samples shows no obvious temporal pattern. The 95% confidence intervals of both curves are narrow, indicating that the statistics are reliable.
S13, taking the standard time-series curve of the residential area as the reference, constructing an hour-level time-series curve for each grid and normalizing it to 0-1;
S14, calculating the similarity between each grid time-series curve and the standard time-series curve using the Mahalanobis distance (which effectively handles non-stationarity in a time series and insufficient response to temporal correlation) and the Pearson correlation coefficient (the most commonly used correlation coefficient, widely applied in time-series analysis of remote sensing data);
wherein the Mahalanobis distance is calculated as:
$$D_M = \sqrt{(X - Y)^{T} S^{-1} (X - Y)}, \quad X = (X_1, \dots, X_n), \quad Y = (Y_1, \dots, Y_n)$$
in the formula, $D_M$ denotes the distance measure, whose reciprocal is taken as the Mahalanobis-distance correlation; the smaller $D_M$, the stronger the correlation;
$X_i$ denotes the discretized target time series, of length n;
$Y_i$ denotes the discretized standard time series, of length n;
$i$ denotes the serial number of each discrete point in the time series;
$S^{-1}$ denotes the inverse of the covariance matrix of the target time series and the standard time series.
The Pearson correlation coefficient is calculated as:
$$D_P = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^{2}} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^{2}}}$$
in the formula, $D_P$ denotes the Pearson correlation coefficient; the larger $D_P$, the stronger the correlation;
$D_P = 1$ denotes a positive linear relationship;
$D_P = -1$ denotes a negative linear relationship;
$\bar{X}$ denotes the mean of the target time series;
$\bar{Y}$ denotes the mean of the standard time series;
$X_i$ denotes the discretized target time series, of length n;
$Y_i$ denotes the discretized standard time series, of length n;
$i$ denotes the serial number of each discrete point in the time series.
In addition, the lengths of the time sequences constructed in the invention are all equal, so that the offset of the time sequences does not need to be considered.
S15, calculating the Mahalanobis distance and the Pearson correlation coefficient with the Python SciPy library, as the correlation between each grid curve and the hour-level standard time-series curve;
the reciprocal of the calculated Mahalanobis distance is normalized to 0-1, and when the Mahalanobis distance is 0 the reciprocal value is set to 1; the Pearson correlation coefficient and the Mahalanobis-distance correlation coefficient are then averaged to make the correlation more reliable, and the average is taken as the correlation between the grid curve and the hour-level standard curve, representing the probability that the grid is a residential grid.
S16, using the ratio of the weekday daily mean positioning frequency to the weekend daily mean positioning frequency (the weekday-to-weekend frequency ratio) to reflect the difference in human activity intensity between weekdays and weekends in residential and non-residential areas.
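Steps S13 to S15 can be sketched as follows. This is a minimal illustration and not the patented implementation: the text does not say how the covariance matrix is estimated, so the sketch takes its inverse as an input (`inv_cov`, here simply the identity matrix), and it assumes that the reciprocal distances are min-max scaled across all grids. The function names and the toy curves are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import pearsonr


def min_max(values):
    """Scale a 1-D array to 0-1 (a constant array maps to zeros)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)


def residential_similarity(grid_curves, standard_curve, inv_cov):
    """Average of the Pearson coefficient and the normalized reciprocal distance."""
    standard = min_max(standard_curve)
    dist = np.empty(len(grid_curves))
    pearson = np.empty(len(grid_curves))
    for k, curve in enumerate(grid_curves):
        curve = min_max(curve)                       # S13: normalize to 0-1
        dist[k] = mahalanobis(curve, standard, inv_cov)
        pearson[k], _ = pearsonr(curve, standard)
    # Reciprocal of the distance scaled to 0-1; a zero distance is set to 1.
    similarity = np.ones_like(dist)
    nonzero = dist > 0
    if nonzero.any():
        similarity[nonzero] = min_max(1.0 / dist[nonzero])
    return (similarity + pearson) / 2.0


hours = np.arange(24)
# Toy residential reference curve that peaks at 21:00.
standard = 1.0 + np.cos((hours - 21) * 2 * np.pi / 24)
curves = np.vstack([2 * standard, -standard, (hours % 5).astype(float)])
scores = residential_similarity(curves, standard, np.eye(24))
```

The first toy curve has the same shape as the reference, so its score is close to 1; the mirrored curve scores much lower.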
Training samples are important for a machine learning classifier, and they should be statistically independent, representative, balanced across classes and sufficient in number. During sample labeling, the grids selected by their features can be labeled with higher-resolution remote sensing images and other high-precision remote sensing products. Adding manual assistance and manual verification further improves labeling accuracy. The invention sets two sample classes, residential and non-residential.
S2, combining multi-scale remote sensing images and remote sensing classification product data, and constructing and labeling a training sample set and a validation sample set by stratified random sampling, comprising the following steps:
S21, randomly selecting 0.01-degree grids (resolution about 1 km) and calculating the proportion of residential land in each grid (as shown in fig. 3) from several remote sensing classification or land cover products, including FROM-GLC30 (2015, 30 m resolution), GHSL (2010, 30 m resolution) and HMMGUL (2015, 30 m resolution);
S22, to ensure timeliness, calculating the normalized difference built-up index (NDBI) of the Sentinel-2 and Landsat-8 pixels in each grid from the annual Sentinel-2 (resolution about 20 m) and Landsat-8 (resolution about 30 m) composites produced on the GEE platform;
NDBI is the ratio of the difference to the sum of the short-wave infrared (SWIR) and near-infrared (NIR) bands, i.e. $NDBI = (SWIR - NIR) / (SWIR + NIR)$; for Landsat-8 the SWIR and NIR bands are B6 and B5, and for Sentinel-2 they are B11 and B8, respectively (as shown in fig. 4).
S23, averaging the NDBI values of the two images to obtain the NDBI value of each pixel in the 0.01-degree sample grid, treating positive values as construction land pixels and negative values as non-construction land pixels, and counting the proportion of construction land pixels in the grid;
S24, performing a weighted summation of the obtained construction land pixel proportions to obtain the comprehensive proportion of construction land;
in the invention, the four construction land pixel proportions obtained above, from FROM-GLC30, GHSL, HMMGUL and NDBI, are weighted and summed with weights of 0.2, 0.2, 0.2 and 0.4 respectively, i.e. NDBI is given the largest weight, to obtain the comprehensive proportion of construction land.
S25, labeling grids whose comprehensive proportion of construction land is greater than 30% as residential areas (according to the IGBP definition, a grid with more than 30% construction land can be regarded as construction land) and grids whose proportion is less than 30% as non-residential areas;
S26, checking the comprehensive proportion of construction land by visual interpretation of high-resolution remote sensing images on the GEE platform, i.e. visually judging whether the proportion of construction land in the grid exceeds 30%; at the same time, roughly judging from the sample position whether it lies within a residential area, so that the sample is labeled accurately.
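The labeling rule of steps S22 to S25 can be sketched as below. This is a minimal sketch under stated assumptions: the Sentinel-2 and Landsat-8 NDBI values are assumed to have been resampled to a common pixel grid inside each 0.01-degree cell, the three land cover products are assumed to carry a weight of 0.2 each with 0.4 for NDBI, and the function names and sample values are hypothetical.

```python
import numpy as np

# Assumed weights: 0.2 for each land cover product, 0.4 for NDBI.
WEIGHTS = {"FROM-GLC30": 0.2, "GHSL": 0.2, "HMMGUL": 0.2, "NDBI": 0.4}


def ndbi(swir, nir):
    """Normalized difference built-up index: (SWIR - NIR) / (SWIR + NIR)."""
    swir = np.asarray(swir, dtype=float)
    nir = np.asarray(nir, dtype=float)
    return (swir - nir) / (swir + nir)


def built_up_share(ndbi_sentinel, ndbi_landsat):
    """Share of pixels whose mean NDBI over the two sensors is positive."""
    mean_ndbi = (np.asarray(ndbi_sentinel) + np.asarray(ndbi_landsat)) / 2.0
    return float(np.mean(mean_ndbi > 0))


def label_grid(shares, threshold=0.30):
    """Weighted sum of the four built-up shares and the resulting label."""
    composite = sum(WEIGHTS[name] * shares[name] for name in WEIGHTS)
    label = "residential" if composite > threshold else "non-residential"
    return composite, label


# Toy NDBI values for four pixels of one grid cell.
share = built_up_share([0.2, -0.4, 0.1, -0.1], [0.0, -0.2, 0.3, 0.3])
composite, label = label_grid(
    {"FROM-GLC30": 0.5, "GHSL": 0.4, "HMMGUL": 0.3, "NDBI": share}
)
```

The label produced this way is only a first pass; in the method it is still checked by visual interpretation in step S26.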
S3, extracting multi-dimensional features of the remote sensing and position big data on a unified evaluation unit, and screening low-redundancy features from the results of the feature importance analysis, comprising the following steps:
S31, extracting 18 features from the remote sensing and position big data on the basis of the unified evaluation unit;
wherein the 18 features comprise 8 remote sensing features and 10 position features, as shown in table 1.
TABLE 1 characteristics, data sources and characterization used for residential extraction
Figure BDA0003704324700000121
Each remote sensing dataset yields two features, the area-weighted mean and the maximum. The internet position features comprise: LRD features (the hour-level time-series feature, the daily mean positioning frequency and the weekday-to-weekend positioning frequency ratio); road network density features (motor-vehicle road density, non-motor-vehicle road density and overall road network density); the OSM point-of-interest density feature; and traffic condition features (the travel time, travel distance and average travel speed to the provincial administrative center).
S32, calculating the projected area of each grid under the Lambert conic projection (in km²), and dividing each feature value of the grid by this projected area to obtain the feature value per unit area;
S33, removing highly similar features through feature correlation calculation to reduce redundancy;
S34, determining the input features through the feature dissimilarity index and controlling the growth of the decision trees;
S35, analyzing the feature contribution rates with the relative importance index to evaluate the extraction result;
S36, calculating the pairwise correlation of the features with the Spearman correlation coefficient; the larger the correlation coefficient, the stronger the correlation between the features and the higher the information redundancy;
For each feature pair with a correlation above 0.9, only the feature with the smaller average correlation with the other features is retained, to reduce the input feature dimension. The correlations between F5 and F6 and between F7 and F8 are high, at about 0.95 (table 2): the spatial resolution of the LST and population grid density data is about 1 km, so the area-weighted mean and the maximum are close when unified to a 0.01-degree grid. F16 and F17 are also highly correlated: travel time and travel distance are generally proportional, their ratio being the average speed, whose spatial variation is not pronounced. Therefore F6, F8 and F17, which are more highly correlated with the other features, i.e. the LST maximum, the population density maximum and the travel distance, are removed, and only 15 features are retained as model inputs.
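The pairwise screening rule can be sketched as follows. This is an illustrative reading of the rule, assuming that the absolute Spearman coefficient is compared with the 0.9 threshold and that the "average correlation with the other features" is the mean over all remaining columns; the function names and the synthetic features A, B and C are hypothetical and are not the patent's F1 to F18.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_matrix(features):
    """Spearman correlation matrix: Pearson correlation of column-wise ranks."""
    ranks = np.apply_along_axis(rankdata, 0, np.asarray(features, dtype=float))
    return np.corrcoef(ranks, rowvar=False)


def drop_redundant(features, names, threshold=0.9):
    """For each pair above the threshold, drop the feature that is on average
    more correlated with the other features."""
    rho = np.abs(spearman_matrix(features))
    n = len(names)
    mean_corr = (rho.sum(axis=1) - 1.0) / (n - 1)  # exclude self-correlation
    dropped = set()
    for a in range(n):
        for b in range(a + 1, n):
            if a in dropped or b in dropped:
                continue
            if rho[a, b] > threshold:
                dropped.add(a if mean_corr[a] > mean_corr[b] else b)
    return [name for idx, name in enumerate(names) if idx not in dropped]


rng = np.random.default_rng(0)
col_a = rng.normal(size=200)
col_b = col_a ** 3            # monotone in col_a, so the Spearman coefficient is 1
col_c = rng.normal(size=200)  # independent feature
kept = drop_redundant(np.column_stack([col_a, col_b, col_c]), ["A", "B", "C"])
```

Ranking each column and then taking the ordinary correlation of the ranks is the definition of the Spearman coefficient, which keeps the sketch independent of any particular data-frame library.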
S37, using the Gini index in the random forest classifier to increase the dissimilarity between residential and non-residential areas as the decision trees grow;
wherein the Gini index used by the random forest classifier is expressed as:
$$Gini(T) = 1 - \sum_{\hat{y} \in \{S,\, nS\}} prob_{\hat{y}}^{2}$$
wherein $T$ denotes the training sample;
$\hat{y}$ denotes the decision tree prediction type;
$prob_S$ denotes the probability of classification as a residential area;
$prob_{nS}$ denotes the probability of classification as a non-residential area.
For a given training sample T, the strategy of minimizing the Gini index (highest purity) determines the optimal splitting feature of each node, and the decision tree can grow to its maximum depth without pruning.
S38, determining the optimal splitting feature of each node by measuring the impurity between the classes.
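To make the Gini-minimization strategy concrete, the sketch below computes the standard two-class Gini impurity and searches one feature for the threshold with the lowest weighted impurity. It illustrates the general technique only: the helper names are hypothetical, labels are assumed to be coded 1 for residential and 0 for non-residential, and a real random forest repeats this search over a random subset of features at every node.

```python
import numpy as np


def gini(labels):
    """Two-class Gini impurity: 1 - prob_S**2 - prob_nS**2."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    prob_s = float(np.mean(labels == 1))
    return 1.0 - prob_s ** 2 - (1.0 - prob_s) ** 2


def best_split(feature, labels):
    """Threshold on one feature that minimizes the weighted Gini impurity."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    values = np.unique(feature)
    best_threshold, best_impurity = None, np.inf
    # Candidate thresholds are midpoints between consecutive distinct values.
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left = labels[feature <= threshold]
        right = labels[feature > threshold]
        weighted = (left.size * gini(left) + right.size * gini(right)) / labels.size
        if weighted < best_impurity:
            best_threshold, best_impurity = float(threshold), weighted
    return best_threshold, best_impurity


threshold, impurity = best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

In this toy case a threshold of 2.5 separates the two classes perfectly, so the weighted impurity drops from 0.5 to 0.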
Table 2: residential extracted raw feature correlation analysis
Figure BDA0003704324700000133
Figure BDA0003704324700000141
And S4, constructing a random forest classifier fusing the remote sensing data and the position big data, realizing extraction of large-area residential areas, and verifying the effectiveness of the joint features through precision analysis and comparison analysis.
Random forest (RF) is an ensemble learning classifier that combines a large number of decision tree classifiers and classifies well. It can handle multi-dimensional features, is computationally cheap, is robust to noise and outliers, and is not prone to overfitting; each decision tree grows independently and fully.
To increase the diversity of the decision trees, the random forest uses bootstrap aggregation to draw training subsets at random from the original data set with replacement, which increases the overall stability and robustness of the classifier and reduces the influence of data anomalies, noise and overfitting on the model, thereby improving its classification accuracy.
The method for constructing the random forest classifier fusing the remote sensing data and the position big data comprises the following steps of:
S41, constructing the random forest classifier with a soft voting strategy in the open-source Python machine learning toolkit Scikit-learn;
wherein the soft voting strategy is expressed as:
$$prob(\hat{y} \mid x) = \max\left( \frac{1}{N} \sum_{n=1}^{N} prob_S(\hat{y}_n \mid x),\ \frac{1}{N} \sum_{n=1}^{N} prob_{nS}(\hat{y}_n \mid x) \right)$$
in the formula, $N$ denotes the number of decision trees;
$prob$ denotes the predicted probability of the output result;
$prob_S$ denotes the probability of classification as a residential area;
$prob_{nS}$ denotes the probability of classification as a non-residential area;
$x$ denotes the input feature vector;
$\hat{y}$ denotes the predicted type of the output result;
$\hat{y}_n$ denotes the type predicted by each decision tree;
$n$ denotes the n-th decision tree.
Taking three decision trees as an example, suppose their predicted probabilities of a sample being residential and non-residential are (0.8, 0.2), (0.9, 0.1) and (0.7, 0.3) respectively. The probability that the sample is residential is then (0.8 + 0.9 + 0.7)/3 = 0.8, and the output for this sample point is [residential, 0.8]; that is, the output is the maximum predicted probability together with its type, and the probability ranges from 0.5 to 1.0.
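The three-tree example can be reproduced with a short soft-voting helper. This is a minimal sketch; the function name and the class ordering (residential first) are assumptions made for illustration.

```python
import numpy as np

CLASSES = ("residential", "non-residential")


def soft_vote(tree_probabilities):
    """Average the per-tree class probabilities and return the winning class."""
    mean_prob = np.mean(np.asarray(tree_probabilities, dtype=float), axis=0)
    winner = int(np.argmax(mean_prob))
    return CLASSES[winner], float(mean_prob[winner])


# Probabilities (residential, non-residential) predicted by three trees.
label, probability = soft_vote([(0.8, 0.2), (0.9, 0.1), (0.7, 0.3)])
```

With two classes the winning averaged probability can never fall below 0.5, which is why the reported probability lies in the range 0.5-1.0.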
S42, setting the main parameters of the random forest classifier to achieve the best classification performance;
the two most important parameters are the number of decision trees k and the maximum number of features m. The parameter k is the number of decision trees used for classification, i.e. the size of the classifier; the larger k, the higher the classification accuracy, but also the greater the computational complexity and redundancy. The parameter m controls the maximum number of features a single decision tree considers for splitting; increasing m generally improves model accuracy, but too large an m increases the correlation between the decision trees and reduces their diversity.
S43, using a grid search over preset parameter ranges and step sizes, performing model training and result prediction on the training samples, and taking the minimum out-of-bag (OOB) error as the criterion for the optimal model parameters.
In the invention, k ranges from 10 to 500 with a step of 10, and m ranges from 1 to 15 with a step of 1. The statistics show that, for residential area extraction in China, the OOB error decreases markedly as k increases and converges at around k = 150 (as shown in FIG. 4-A1); with k at the selected value, the OOB error reaches a relative minimum at m = 7 (as shown in FIG. 4-A2). The parameter k of the RF classifier is therefore set to 150 and the parameter m to 7.
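The OOB-based grid search can be sketched with Scikit-learn as below. This is a minimal sketch under stated assumptions: k is mapped to `n_estimators` and m to `max_features` of `RandomForestClassifier`, the data are synthetic (15 features, generated with `make_classification`) rather than the patent's samples, and the grid is much smaller than the 10-500 and 1-15 ranges so that it runs quickly. The function name `search_oob` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def search_oob(features, labels, k_values, m_values, seed=0):
    """Return (k, m, oob_error) for the combination with the lowest OOB error."""
    best = None
    for k in k_values:
        for m in m_values:
            forest = RandomForestClassifier(
                n_estimators=k,      # number of decision trees (k)
                max_features=m,      # features considered at each split (m)
                bootstrap=True,      # sampling with replacement
                oob_score=True,      # score on the out-of-bag samples
                random_state=seed,
            )
            forest.fit(features, labels)
            oob_error = 1.0 - forest.oob_score_
            if best is None or oob_error < best[2]:
                best = (k, m, oob_error)
    return best


# Synthetic stand-in for the 15 screened features of the training grids.
X, y = make_classification(n_samples=300, n_features=15, n_informative=6,
                           random_state=0)
best_k, best_m, best_error = search_oob(X, y, k_values=(20, 40), m_values=(3, 7))
```

`RandomForestClassifier.predict_proba` averages the class probabilities of the individual trees, which corresponds to the soft voting strategy of step S41.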
The following examples are the analysis and validation of the results of the present invention, including:
1. analysis of residential district extraction results
The output of the random forest classifier is a probability, i.e. the probability that a grid is identified as a given class; 0.5 is therefore chosen as the threshold, and grids with a residential extraction probability greater than 0.5 are taken as residential grids. The extraction results show that a total of 192,250 grids (0.01 degree) in China are identified as residential areas, with a total area of about 196,742 km² under the Lambert conic projection. Contiguous residential areas are concentrated mainly in the eastern and southeastern coastal regions; A, B and C in FIG. 5 show the extraction results for three urban agglomerations in China: the Yangtze River Delta, the Pearl River Delta and the Beijing-Tianjin-Hebei region. The city outlines are well preserved: the closer a grid is to a city center, the higher the probability that it is identified as residential, and the farther from the city center, the lower that probability. Scattered (rural) residential areas are sparsely distributed in the result, which is closely tied to China's urbanization rate increasing year by year. The positions of the large towns also agree with the spatial distribution of China's main towns, indicating that the random forest classification result is reliable. Statistical analysis of the grid extraction probabilities shows that the residential extraction probability in China is distributed fairly evenly over 0.5-1.0 with a slight downward trend, and a high-value zone appears only above 0.95.
The accuracy of the extraction result for Chinese residential areas is evaluated by grid-level verification on the training samples and the validation samples respectively, with a confusion matrix constructed to analyze the extraction accuracy. The confusion matrix yields the overall accuracy (OA), producer's accuracy (PA), user's accuracy (UA) and kappa coefficient. In the invention, the overall accuracy is the ratio of the number of grids correctly classified as residential or non-residential to the total number of sample grids; the user's accuracy is the ratio of the number of correctly classified samples of a class to the total number of samples identified as that class; the producer's accuracy is the ratio of the number of correctly classified samples of a class to the total number of reference samples of that class; the kappa coefficient measures classification consistency, and the higher its value, the more consistent the classification result. Overall, the extraction accuracy for Chinese residential grids reaches 90.79%, with an OA of 91.80% for the training samples and 89.79% for the validation samples (table 3). The kappa coefficient of the training samples is 0.812, indicating highly consistent classification, and is higher than the 0.768 of the validation samples. For the training samples, the UA of residential extraction is 90.60%, slightly lower than that of non-residential areas, while for the validation samples the UA difference between residential and non-residential areas reaches 5.48%; the residential PA of the training samples is 84.13%, which is 11.50% lower than the non-residential PA, and for the validation samples the difference between the residential and non-residential PA also reaches 10.53%.
TABLE 3 confusion matrix of classification results based on training samples and validation samples
Figure BDA0003704324700000161
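The four accuracy measures can be computed from a confusion matrix as sketched below, following the usual remote sensing convention (producer's accuracy per reference class, user's accuracy per predicted class). The matrix values are made up for illustration and are not the figures of table 3; the function name is hypothetical.

```python
import numpy as np


def accuracy_metrics(confusion):
    """OA, PA, UA and kappa from a confusion matrix.

    Rows are reference (true) classes, columns are predicted classes.
    """
    cm = np.asarray(confusion, dtype=float)
    total = cm.sum()
    correct = np.diag(cm)
    overall = correct.sum() / total
    producer = correct / cm.sum(axis=1)  # correct / reference total per class
    user = correct / cm.sum(axis=0)      # correct / predicted total per class
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (overall - expected) / (1.0 - expected)
    return overall, producer, user, kappa


# Made-up 2 x 2 matrix: class 0 = residential, class 1 = non-residential.
oa, pa, ua, kappa = accuracy_metrics([[40, 10], [5, 45]])
```

For this toy matrix the overall accuracy is 0.85, the residential producer's accuracy is 0.8 and kappa is 0.7.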
2. Comparison with other land cover products
To evaluate the extraction results more accurately, cadastral survey data were obtained from the local land and resources bureaus (now natural resources departments) and related data production departments for Hefei, Shanghai, Changsha and Nanjing, and used as reference data for evaluating the residential area extraction results. To ensure the reliability of the ground truth, the residential boundaries extracted from these data are supplemented with visual interpretation of high-quality 2018 Landsat-8 remote sensing images, yielding a more accurate 2018 reference boundary for residential areas. The Landsat-8 images are rendered as a 7-6-4 false-color composite, i.e. short-wave infrared SWIR2, short-wave infrared SWIR1 and red, which is the optimal band combination for identifying cities and built-up areas.
Comparing the spatial distribution patterns of residential areas, the RF extraction result and the MODIS/LC product agree better with the reference boundary, while the GHSL30 and HMMGUL products agree less well (as shown in fig. 6). Compared with the other LC products, the RF classifier performs better in southeastern Shanghai, along the northeast-southwest axis of Nanjing, on the periphery of Hefei and in eastern Changsha. However, because their source images have a higher resolution, the GHSL30 and HMMGUL products have a higher spatial resolution. Comparing the extracted area with the reference boundary area for the four cities, the differences for Nanjing and Hefei are small, at about 14%, while the differences for Shanghai and Changsha are larger, at about 17.7%. Whether in terms of the extracted spatial pattern of residential areas or the grid area, and both near cities and far from them, the random forest classification model proposed here is highly consistent with the reference data. Because the position data used have a coarse resolution, the extraction result also shows data overflow beyond boundaries. For example, the banks of the Huangpu River in Shanghai are densely populated, so when 1 km grids are used for classification most of the river surface is also covered by grids classified as residential (as shown in fig. 6). The main reason is that the spatial granularity and resolution of some of the raw data used for residential extraction are coarse.
3. Advantages of the remote sensing-position joint features
To verify the advantages of the joint remote sensing-internet position features for residential grid extraction, random forest classification is also carried out with the remote sensing features alone and with the position features alone, using the same training and validation samples for training and accuracy evaluation. The results show that residential grids extracted from a single feature type, whether remote sensing or position, cover a markedly larger extent than those extracted from the joint features. Compared with the joint-feature result, the single-type remote sensing result and the single-type position result contain 16.63% and 23.49% more grids respectively.
Areas such as Beijing-Tianjin and Shanghai-Suzhou-Wuxi-Changzhou are selected for a detailed comparison of the residential extraction results (figure 7). The result from position features alone contains more noise in non-residential areas, i.e. more scattered grids identified as residential. Unlike remote sensing data, Tencent LRD data, POI data and OSM road network data are open-source or crowdsourced and carry a degree of subjectivity. The raw open-source position data contain some noise, so the classification result does too; fusing them with the remote sensing features markedly reduces the noisy area. In the result from remote sensing features alone, nighttime light and NDVI data have the highest feature importance, so the saturation and blooming of nighttime light data cause the classification result to extend slightly beyond the actual town extent. In addition, among the remote sensing features, regional transport facility land looks very similar to residential land, so the extraction result shows a distinct road network pattern. Position features such as POI density and positioning density, by contrast, effectively separate road networks from residential land: for example, the daily positioning frequency along a regional transport network is low, and its POI density is lower than in residential areas. Fusing with the position features effectively removes the areas around towns and the transport areas.
The extraction accuracy of single-type features is also markedly lower than that of the joint features: against an accuracy above 88% for the joint features, the overall accuracy is only 84.54% for remote sensing features alone and 82.31% for position features alone.
In summary, with the above technical solution, the invention first uses web crawler technology to acquire open-source position big data, including OSM data, POI data and Tencent positioning data, and obtains their time-series features; next, it obtains multi-dimensional features of the remote sensing and position big data with a feature screening technique, and constructs the training and validation sample sets with a multi-scale sample selection technique; finally, it constructs a random forest classifier fusing the remote sensing data and the position big data to extract kilometer-grid residential area information over a large region, and analyzes the accuracy of the joint remote sensing-position extraction result by comparison with existing products and with single-feature extraction results, thereby achieving high-accuracy extraction of residential areas over a large region.
All possible combinations of the technical features of the above embodiments may be arbitrarily combined, and for the sake of brevity, all the possible combinations of the technical features in the above embodiments are not described, however, the combinations of the technical features should be considered as being within the scope of the present description as long as there is no contradiction between the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, and these changes and modifications are all within the scope of the invention. Therefore, the protection scope of the present patent should be subject to the appended claims.

Claims (10)

1. A residential area extraction method fusing remote sensing data and position big data is characterized by comprising the following steps:
S1, acquiring open-source position big data with web crawler technology and extracting its time-series features;
S2, combining multi-scale remote sensing images and remote sensing classification product data, and constructing and labeling a training sample set and a validation sample set by stratified random sampling;
S3, extracting multi-dimensional features of the remote sensing and position big data on a unified evaluation unit, and screening low-redundancy features from the results of the feature importance analysis;
and S4, constructing a random forest classifier fusing the remote sensing data and the position big data, realizing extraction of large-area residential areas, and verifying the effectiveness of the joint features through accuracy analysis and comparative analysis.
2. The residential area extraction method fusing remote sensing data and position big data according to claim 1, wherein acquiring the open-source position big data using web crawler technology and extracting its time-series features comprises the following steps:
S11, building a web crawler with the Python Requests library, collecting Tencent positioning density data, parsing the page data, and extracting the required positioning frequency and grid position information;
S12, constructing standard time-series curves of hourly positioning frequency for residential areas and non-residential areas;
S13, taking the standard time-series curve of residential areas as the reference, building an hourly positioning-frequency time-series curve for each grid and normalizing it to the range 0-1;
S14, measuring the similarity between each grid's time-series curve and the standard time-series curve using the Mahalanobis distance and the Pearson correlation coefficient;
S15, computing the Mahalanobis distance and the Pearson correlation coefficient with the Python SciPy library as the measure of correlation between a grid curve and the hourly standard time-series curve;
S16, using the ratio between the weekday and weekend positioning frequencies to reflect the difference in human activity intensity between weekdays and weekends in residential and non-residential areas.
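As an illustration of steps S13 and S16, the following Python sketch rescales an hourly positioning-frequency curve to the 0-1 range and computes a weekday-to-weekend ratio. The function names, the example counts, the use of min-max scaling, and the direction of the ratio are all assumptions made for illustration; the claim does not specify them.

```python
from typing import List, Sequence


def min_max_normalize(curve: Sequence[float]) -> List[float]:
    """Rescale a curve to [0, 1]; a flat curve maps to all zeros."""
    lo, hi = min(curve), max(curve)
    if hi == lo:
        return [0.0 for _ in curve]
    return [(v - lo) / (hi - lo) for v in curve]


def weekday_weekend_ratio(weekday: Sequence[float], weekend: Sequence[float]) -> float:
    """Mean weekday frequency divided by mean weekend frequency (assumed direction)."""
    weekend_mean = sum(weekend) / len(weekend)
    if weekend_mean == 0:
        return float("nan")
    return (sum(weekday) / len(weekday)) / weekend_mean


# Hypothetical hourly positioning counts for one grid cell (24 values).
grid_curve = [5, 4, 3, 3, 4, 8, 15, 22, 18, 14, 12, 12,
              13, 12, 11, 12, 14, 18, 24, 28, 26, 20, 12, 7]
normalized = min_max_normalize(grid_curve)

# Hypothetical daily totals: five weekdays and two weekend days.
ratio = weekday_weekend_ratio(weekday=[120, 130, 125, 128, 122], weekend=[150, 160])
```

A ratio below 1 here would indicate more positioning activity on weekends than on weekdays for that grid cell.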
3. The residential area extraction method fusing remote sensing data and position big data according to claim 2, wherein the Mahalanobis distance is expressed as follows:
Figure FDA0003704324690000011
where D_M denotes the distance calculation result, the reciprocal of which is taken as the Mahalanobis distance value;
X_i denotes the discretized target time series, of length n;
Y_i denotes the discretized standard time series, of length n;
i denotes the index of each discrete point in the time series;
S^{-1} denotes the inverse of the covariance matrix of the target time series and the standard time series.
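The equation image is not reproduced in this text. For orientation only, the standard vector form of the Mahalanobis distance, written with the symbols defined in this claim, is shown here. Treating X and Y as length-n column vectors is an assumption, and the patent's own expression may differ in detail.

```latex
D_M = \sqrt{(X - Y)^{\mathsf{T}} \, S^{-1} \, (X - Y)},
\qquad
X = (X_1, \dots, X_n)^{\mathsf{T}}, \quad
Y = (Y_1, \dots, Y_n)^{\mathsf{T}}
```

When S is the identity matrix, this reduces to the Euclidean distance between the two series.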
4. The residential area extraction method fusing remote sensing data and position big data according to claim 2, wherein the Pearson correlation coefficient is calculated as follows:
Figure FDA0003704324690000021
where D_P denotes the Pearson correlation coefficient; the larger the absolute value of D_P, the stronger the correlation;
D_P = 1 denotes a positive linear relationship;
D_P = -1 denotes a negative linear relationship;
Figure FDA0003704324690000022 denotes the mean of the target time series;
Figure FDA0003704324690000023 denotes the mean of the standard time series;
X_i denotes the discretized target time series, of length n;
Y_i denotes the discretized standard time series, of length n;
i denotes the index of each discrete point in the time series.
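The Pearson coefficient described in this claim can be sketched in a few lines of Python. Claim 2 names SciPy for this computation (`scipy.stats.pearsonr` returns the same statistic); the version here uses only the standard library so that it runs without dependencies. The function name and the example curves are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the two means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


standard = [0.1, 0.3, 0.6, 1.0, 0.7, 0.2]   # hypothetical standard curve
scaled = [2 * v + 1 for v in standard]      # same shape, positively scaled
mirrored = [-v for v in standard]           # inverted shape
d_p_same = pearson(scaled, standard)
d_p_opposite = pearson(mirrored, standard)
```

The two example series illustrate the limiting cases named in the claim: a positively scaled copy of the standard curve gives a coefficient of 1 (up to rounding), and a mirrored copy gives -1.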
5. The residential area extraction method fusing remote sensing data and position big data according to claim 1, wherein combining the multi-scale remote sensing images with the remote sensing classification product data and constructing and labeling the training sample set and the validation sample set using stratified random sampling comprises the following steps:
S21, randomly selecting 0.01° grids, and calculating the proportion of residential area within each grid based on multiple remote sensing classification products;
S22, based on annual Sentinel-2 and Landsat-8 composite images generated on the Google Earth Engine (GEE) platform, calculating the normalized difference built-up index (NDBI) of the Sentinel-2 and Landsat-8 pixels within the grid, respectively;
S23, averaging the NDBI values of the two images to obtain the NDBI value of each pixel in the 0.01° sample grid, treating pixels with positive values as built-up land and pixels with negative values as non-built-up land, and computing the proportion of built-up pixels in the grid;
S24, computing a weighted sum of the obtained built-up land proportions to obtain the comprehensive built-up land proportion;
S25, labeling grids whose comprehensive built-up land proportion is greater than 30% as residential areas, and grids whose proportion is less than 30% as non-residential areas;
S26, checking the comprehensive built-up land proportion by visual interpretation of high-resolution remote sensing images on the GEE platform, so that the samples are labeled accurately.
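Steps S22 to S25 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the two images have already been resampled to a common pixel grid, uses the usual NDBI definition (SWIR - NIR) / (SWIR + NIR), and takes hypothetical weights that sum to 1. The claim does not say how a grid at exactly 30% is labeled; the sketch treats it as non-residential.

```python
from typing import Sequence, Tuple

Pixel = Tuple[float, float]  # (SWIR reflectance, NIR reflectance)


def ndbi(swir: float, nir: float) -> float:
    """Normalized difference built-up index of one pixel."""
    total = swir + nir
    return 0.0 if total == 0 else (swir - nir) / total


def built_up_share(s2_pixels: Sequence[Pixel], l8_pixels: Sequence[Pixel]) -> float:
    """Share of pixels whose NDBI, averaged over both sensors, is positive (S23)."""
    built = 0
    for (s2_swir, s2_nir), (l8_swir, l8_nir) in zip(s2_pixels, l8_pixels):
        mean_ndbi = (ndbi(s2_swir, s2_nir) + ndbi(l8_swir, l8_nir)) / 2
        if mean_ndbi > 0:
            built += 1
    return built / len(s2_pixels)


def label_grid(shares: Sequence[float], weights: Sequence[float],
               threshold: float = 0.30) -> str:
    """Weighted sum of built-up shares (S24) compared with the 30% threshold (S25)."""
    combined = sum(s * w for s, w in zip(shares, weights))
    return "residential" if combined > threshold else "non-residential"


# Hypothetical reflectances for four co-located pixels in one 0.01-degree grid.
s2 = [(0.30, 0.20), (0.15, 0.35), (0.28, 0.22), (0.10, 0.40)]
l8 = [(0.32, 0.21), (0.14, 0.33), (0.20, 0.24), (0.12, 0.38)]
ndbi_share = built_up_share(s2, l8)

# Hypothetical share from the classification products (S21) and equal weights.
product_share = 0.20
label = label_grid([ndbi_share, product_share], weights=[0.5, 0.5])
```

In this example two of the four pixels have a positive mean NDBI, so the NDBI-based share is 0.5; combined with the product share of 0.2 under equal weights, the comprehensive proportion is 0.35 and the grid is labeled residential.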
6. The residential area extraction method fusing remote sensing data and position big data according to claim 1, wherein extracting the multi-dimensional features of the remote sensing data and the position big data based on the unified evaluation unit and screening low-redundancy features according to the results of feature importance analysis comprises the following steps:
S31, extracting 18 features from the remote sensing data and the position big data based on the unified evaluation unit;
S32, calculating the projected area of each grid under the Lambert conic projection, and dividing each grid's feature values by its projected area to obtain feature values per unit area;
S33, eliminating highly similar features through feature correlation calculation to reduce redundancy;
S34, determining the input features using the feature dissimilarity index to control the growth of the decision trees;
S35, analyzing feature contribution using the relative importance index to evaluate the extraction results;
S36, calculating the correlation between each pair of features using the Spearman correlation coefficient;
S37, using the Gini index in the random forest classifier to increase the dissimilarity between residential and non-residential areas as the decision trees grow;
S38, determining the optimal splitting feature at each node by measuring the impurity between classes.
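The redundancy screening in steps S33 and S36 can be illustrated with a small Spearman-based filter. The sketch uses only the standard library (`scipy.stats.spearmanr` computes the same coefficient). The feature names, the values, the 0.9 threshold, and the greedy keep-first strategy are assumptions; the claim does not state how highly correlated features are chosen for removal.

```python
import math
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)  # undefined (division by zero) for constant features


def drop_redundant(features: Dict[str, Sequence[float]], threshold: float = 0.9) -> List[str]:
    """Keep a feature only if |rho| with every already-kept feature is below threshold."""
    kept: List[str] = []
    for name, values in features.items():
        if all(abs(spearman(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical per-unit-area feature values for five grid cells.
features = {
    "ndvi_mean": [0.61, 0.20, 0.42, 0.15, 0.35],
    "poi_density": [1.0, 4.0, 9.0, 16.0, 30.0],
    "road_density": [0.9, 4.2, 8.5, 17.0, 28.0],  # same ordering as poi_density
    "night_light": [3.0, 1.0, 4.0, 2.0, 5.0],
}
selected = drop_redundant(features)
```

Here `road_density` ranks the grid cells in the same order as `poi_density`, so their Spearman coefficient is 1 and the later of the two is dropped.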
7. The residential area extraction method fusing remote sensing data and position big data according to claim 6, wherein the 18 features comprise 8 remote sensing features and 10 position features.
8. The residential area extraction method fusing remote sensing data and position big data according to claim 6, wherein the Gini index used by the random forest classifier is expressed as follows:
Figure FDA0003704324690000031
where T denotes the training samples;
Figure FDA0003704324690000032 denotes the class predicted by the decision tree;
prob_S denotes the probability of classification as a residential area;
prob_nS denotes the probability of classification as a non-residential area.
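For a two-class problem the Gini impurity of a node is commonly written as 1 - prob_S^2 - prob_nS^2. Whether the patent's equation image uses exactly this form is an assumption. The sketch below applies that definition and shows how a node split could be chosen by minimizing the size-weighted impurity of the child nodes; the function names and the example data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Binary Gini impurity; labels are 1 (residential) or 0 (non-residential)."""
    if not labels:
        return 0.0
    prob_s = sum(labels) / len(labels)
    prob_ns = 1.0 - prob_s
    return 1.0 - prob_s ** 2 - prob_ns ** 2


def split_impurity(left: Sequence[int], right: Sequence[int]) -> float:
    """Size-weighted Gini impurity of the two child nodes of a split."""
    total = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / total


def best_threshold(values: Sequence[float],
                   labels: Sequence[int]) -> Tuple[float, Optional[float]]:
    """Try midpoints between sorted unique values; return (impurity, threshold)."""
    candidates = sorted(set(values))
    best_score, best_t = float("inf"), None
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left: List[int] = [l for v, l in zip(values, labels) if v <= t]
        right: List[int] = [l for v, l in zip(values, labels) if v > t]
        score = split_impurity(left, right)
        if score < best_score:
            best_score, best_t = score, t
    return best_score, best_t


# Hypothetical feature values (e.g. a normalized positioning density) and labels.
density = [0.1, 0.2, 0.6, 0.8]
is_residential = [0, 0, 1, 1]
impurity, threshold = best_threshold(density, is_residential)
```

A perfectly mixed node has impurity 0.5 and a pure node has impurity 0, so the split at the midpoint between 0.2 and 0.6 separates the example classes completely.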
9. The residential area extraction method fusing remote sensing data and position big data according to claim 1, wherein constructing the random forest classifier fusing the remote sensing data and the position big data comprises the following steps:
S41, constructing the random forest classifier with a soft voting strategy using the open-source Python machine learning toolkit Scikit-learn;
S42, tuning selected parameters of the random forest classifier to achieve the best classification performance;
S43, using a grid search over preset parameter ranges and step sizes, performing model training and prediction on the training samples, and selecting as the optimal model parameters those that minimize the out-of-bag (OOB) error.
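Step S43 can be sketched with scikit-learn as follows. Because `GridSearchCV` selects parameters by cross-validation rather than by OOB error, the sketch loops over the grid manually and reads `oob_score_` from each fitted forest. The synthetic data, the parameter names chosen for tuning, and their ranges are assumptions; the patent does not list them.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the labeled grid samples (18 features, two classes).
X, y = make_classification(
    n_samples=300, n_features=18, n_informative=8, random_state=0
)

# Hypothetical search ranges; the actual ranges and steps are not given.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": [3, 6],
}

results = []
for n_estimators, max_features in product(*param_grid.values()):
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        bootstrap=True,
        oob_score=True,   # score each sample with the trees that did not train on it
        random_state=0,
    )
    model.fit(X, y)
    oob_error = 1.0 - model.oob_score_
    results.append(
        (oob_error, {"n_estimators": n_estimators, "max_features": max_features})
    )

# The parameter combination with the smallest OOB error is kept.
best_error, best_params = min(results, key=lambda r: r[0])
```

The OOB estimate reuses the bootstrap samples already drawn for each tree, so no separate validation split is needed during the search.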
10. The residential area extraction method fusing remote sensing data and position big data according to claim 9, wherein the soft voting strategy is expressed as follows:
Figure FDA0003704324690000041
where N denotes the number of decision trees;
prob denotes the predicted probability of the output result;
prob_S denotes the probability of classification as a residential area;
prob_nS denotes the probability of classification as a non-residential area;
x denotes the input feature vector;
Figure FDA0003704324690000042 denotes the predicted class corresponding to the output result;
Figure FDA0003704324690000043 denotes the class predicted by each decision tree;
n denotes the index of the decision tree.
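Soft voting is generally understood as averaging the class probabilities predicted by the N trees and choosing the class with the larger mean probability; scikit-learn's `RandomForestClassifier.predict` follows this rule. The sketch below shows that reading with hypothetical per-tree probabilities. It is an assumption that the patent's equation image matches this form, and the tie-breaking choice is also an assumption.

```python
from typing import Sequence, Tuple


def soft_vote(tree_probs: Sequence[Tuple[float, float]]) -> Tuple[str, float, float]:
    """Average (prob_S, prob_nS) over all trees and return the winning class."""
    n_trees = len(tree_probs)
    prob_s = sum(p[0] for p in tree_probs) / n_trees
    prob_ns = sum(p[1] for p in tree_probs) / n_trees
    # Ties are resolved in favor of "residential" (an arbitrary choice here).
    label = "residential" if prob_s >= prob_ns else "non-residential"
    return label, prob_s, prob_ns


# Hypothetical (prob_S, prob_nS) outputs of three decision trees for one grid cell.
tree_outputs = [(0.9, 0.1), (0.4, 0.6), (0.45, 0.55)]
label, prob_s, prob_ns = soft_vote(tree_outputs)
```

In this example two of the three trees individually favor the non-residential class, yet the averaged probability favors the residential class, which is the practical difference between soft voting and majority (hard) voting.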
CN202210701304.5A 2022-06-21 2022-06-21 Residential area extraction method fusing remote sensing data and position big data Pending CN115424139A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210701304.5A CN115424139A (en) 2022-06-21 2022-06-21 Residential area extraction method fusing remote sensing data and position big data

Publications (1)

Publication Number Publication Date
CN115424139A true CN115424139A (en) 2022-12-02

Family

ID=84196863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210701304.5A Pending CN115424139A (en) 2022-06-21 2022-06-21 Residential area extraction method fusing remote sensing data and position big data

Country Status (1)

Country Link
CN (1) CN115424139A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118172674A (en) * 2024-04-10 2024-06-11 中国水利水电科学研究院 Remote sensing estimation method and system for reset value of residential building

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107121681A (en) * 2017-05-23 2017-09-01 国家地理空间信息中心 Residential area extraction system based on high-resolution (Gaofen) satellite remote sensing data
CN109919875A (en) * 2019-03-08 2019-06-21 中国科学院遥感与数字地球研究所 Residential area extraction and classification method assisted by high temporal-frequency remote sensing image features
US20210192209A1 (en) * 2020-09-29 2021-06-24 Beijing Baidu Netcom Science Technology Co., Ltd. Resident area prediction method, apparatus, device, and storage medium
CN114596489A (en) * 2022-03-03 2022-06-07 湘潭大学 High-precision multisource remote sensing city built-up area extraction method for human habitation index

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
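NAN XIA ET AL: "Mapping Urban Areas Using a Combination of Remote Sensing and Geolocation Data", REMOTE SENSING, 21 June 2019 (2019-06-21), pages 1 - 7 *
HOU BOWEN ET AL: "Extraction of urban built-up areas from high-resolution remote sensing images using an improved convolutional network", Journal of Image and Graphics (中国图象图形学报), vol. 25, no. 12, 31 December 2020 (2020-12-31) *
ZHANG YE: "Extraction and analysis of urban functional zones based on multi-source data", Basic Sciences (基础科学), no. 3, 15 March 2022 (2022-03-15) *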

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination