CN114911787B

CN114911787B - Multi-source POI data cleaning method integrating position and semantic constraint

Info

Publication number: CN114911787B
Application number: CN202210613379.8A
Authority: CN
Inventors: 陈振杰; 许长青; 徐润鹏; 周琛; 曾智伟; 夏南; 马磊; 陈东
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2023-10-27
Anticipated expiration: 2042-05-31
Also published as: CN114911787A

Abstract

The invention relates to a multi-source POI data cleaning method that integrates location and semantic constraints, and belongs to the field of data processing technology. The method performs the following steps: Step 1, perform GeoHash conversion on the collected multi-source POI data; Step 2, perform a neighboring-point query on the converted strings; Step 3, perform redundancy processing on the windows found in Step 2 to contain neighboring points; Step 4, construct a word segmentation scheme; Step 5, perform redundancy processing on the data processed in Step 4; Step 6, complete POI data re-matching based on the term frequency statistics of the word segmentation scheme reconstructed in Step 5. The method completes data cleaning more accurately and efficiently, and the cleaning results are better, more realistic and effective.

Description

A multi-source POI data cleaning method integrating location and semantic constraints

Technical field

The invention relates to a multi-source POI data cleaning method that integrates location and semantic constraints, and belongs to the field of data processing technology.

Background

With the continuous emergence of new ways of publishing information, represented by blogs, social networks, and location-based services (LBS), and with the rise of technologies such as cloud computing and the Internet of Things, data is growing and accumulating at an unprecedented rate, and every field keeps trying to mine the information hidden in big data. However, as the volume of data rises sharply, data quality keeps declining. In a big data environment, data from heterogeneous systems has several problems: ① Disorder: the data of the various application systems lacks uniformly standardized definitions and is highly inconsistent. ② Duplication: the same real-world object has two or more identical physical descriptions in the database. ③ Ambiguity: defects in system design and human factors during use leave attribute values in data records missing or uncertain.

Because of this situation, data cleaning plays an increasingly important role in data analysis and management. Data cleaning aims to identify and correct noise in data so as to minimize the impact of noise on analysis results. As a component of big data, POI data is an important carrier of location services and directly determines the quality of location-service research. To obtain more, and more comprehensive, POI data, researchers and technicians try to acquire data from multiple sources, but this brings problems such as an increase in redundant data and the appearance of incomplete data.

Summary of the invention

The technical problem to be solved by the invention is how to provide a multi-source POI data cleaning method.

To solve this problem, the invention proposes the following technical solution: a multi-source POI data cleaning method integrating location and semantic constraints, which performs the following steps:

Step 1: perform GeoHash conversion on the collected multi-source POI data, converting the two-dimensional coordinate data into strings;

Step 2: perform a neighboring-point query on the converted strings;

Step 3: perform redundancy processing on the windows found in Step 2 to contain neighboring points, carrying out redundant data processing, incomplete data processing, inconsistent data processing, and highly similar data processing in sequence;

Step 4: construct a word segmentation scheme based jointly on the Chinese Language Model and the Hidden Markov Model;

Step 5: perform redundancy processing on the data processed in Step 4;

Step 6: complete POI data re-matching based on the term frequency statistics of the word segmentation scheme reconstructed in Step 5, thereby achieving the multi-source POI data cleaning.

A further improvement of the above technical solution is: the neighboring-point query on the converted strings is performed by prefix matching based on a B+ tree.

A further improvement of the above technical solution is: in Step 3, redundant data, incomplete data, inconsistent data, and highly similar data are handled as follows.

Redundant data processing: for duplicate data caused by continuous tracking of the same platform's data, keep one record; where a small amount of redundant data agrees on only some attributes, keep the record with the highest data completeness based on the location attribute.

Incomplete data processing: first perform a redundancy check against the complete data. If the record is determined to be redundant, remove it; if it is non-redundant, further determine whether it is inconsistent data or highly similar data, handle it in the corresponding way, and attach the corresponding label.

Inconsistent data processing: for inconsistent data at non-neighboring points, verify the POI name and location through repeated geographic resolution and address resolution on different geographic service platforms; for inconsistent data at neighboring points, select the location data with the most resolved information as the entity's location information and remove the other inconsistent data.

Highly similar data processing: use the inconsistent-data processing approach to segment the entity's descriptive name into phrases, build an index of similar data, obtain address data from the region mapping library of the designated area, and store the POI data whose geographic elements are comparatively more complete.

A further improvement of the above technical solution is: in Step 4, splitting of POI names that depends on the existing vocabulary is handled with the Chinese Language Model; for words that are not included in the vocabulary but need to be segmented, the Hidden Markov Model segments the POI name based on character-level word formation.

A further improvement of the above technical solution is: the redundancy processing in Step 5 is the same as in Step 3 except for the objects processed.

A further improvement of the above technical solution is: in the reconstruction process of Step 6, the keywords related to the names of the POI data and the term frequencies of those keywords are determined, and the inverse document frequency is discarded on the basis of the term frequency, so that high-probability keywords are selected to correspond to the respective POI data.

The beneficial effects of the invention are as follows: the invention handles redundancy, errors, missing values, and reclassification in multi-source POI data, and unifies data quality and standards through location constraints and semantic constraints, yielding a highly available and highly credible POI data set. The method completes data cleaning more accurately and efficiently and produces better cleaning results; it is a more practical and effective data cleaning method.

Description of the drawings

Figure 1 is a flow chart of a multi-source POI data cleaning method integrating location and semantic constraints according to an embodiment of the invention.

Detailed description

Embodiment

The multi-source POI data cleaning method integrating location and semantic constraints of this embodiment, as shown in Figure 1, performs the following steps:

Step 1: perform GeoHash conversion on the collected multi-source POI data, converting the two-dimensional coordinate data into strings. GeoHash is a representation of geographic location that encodes position information such as latitude and longitude into a string of letters and digits, reducing two-dimensional data to one dimension. The longer the prefix shared by the GeoHash strings of two locations, the closer the two points are in space.
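The encoding in Step 1 can be sketched as follows. This is a minimal illustration of the standard GeoHash bit-interleaving scheme, not code from the patent; the function names `geohash_encode` and `shared_prefix_len` and the default length of 8 are assumptions made for the example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard GeoHash alphabet


def geohash_encode(lat: float, lon: float, length: int = 8) -> str:
    """Encode a latitude/longitude pair as a GeoHash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two GeoHash strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

A long shared prefix implies proximity, but the converse does not always hold: two points on either side of a cell boundary can be very close and still share only a short prefix.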

Step 2: perform a neighboring-point query on the converted strings. A B+ tree is used to match string prefixes within a fixed window, which implements the neighboring-point query. When choosing the GeoHash length, repeated experiments should be run to compute the latitude error, longitude error, and error in meters for each length, and the most suitable length selected.
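The prefix query in Step 2 can be illustrated with a sorted list and binary search. This is a stand-in, not a B+ tree: Python's standard library has no B+ tree, and a sorted list gives the same ordered range scan that a B+ tree's leaf level provides. The function names are assumptions for the example. The helper `cell_size_degrees` shows how the latitude and longitude extent of a cell follows from the GeoHash length, which is the quantity the length-selection experiments compare.

```python
import bisect


def build_index(hashes):
    """Sorted list of GeoHash strings (stand-in for B+ tree leaves)."""
    return sorted(hashes)


def prefix_query(index, prefix):
    """Return all indexed GeoHash strings that start with `prefix`."""
    i = bisect.bisect_left(index, prefix)  # first key >= prefix
    matches = []
    # Keys sharing a prefix are contiguous in sorted order.
    while i < len(index) and index[i].startswith(prefix):
        matches.append(index[i])
        i += 1
    return matches


def cell_size_degrees(length):
    """Height (latitude) and width (longitude) of a GeoHash cell in degrees."""
    total_bits = 5 * length
    lon_bits = (total_bits + 1) // 2  # longitude gets the extra bit
    lat_bits = total_bits // 2
    return 180.0 / (1 << lat_bits), 360.0 / (1 << lon_bits)


index = build_index(["wtsqr1", "wtsqr9", "wtsqx0", "wtsm00", "wx4g0e"])
neighbors = prefix_query(index, "wtsqr")
```

A latitude extent in degrees can be converted to an approximate distance by multiplying by roughly 111 km per degree; the longitude conversion additionally depends on the latitude of the study area.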

Step 3: if the window from Step 2 (that is, the area) contains only one POI point (POI location) or none, the data is stored for subsequent research operations. If the area contains multiple neighboring POI points, data similarity detection is required: data quality and redundancy are analyzed, and records are checked, filtered, and removed to obtain a low-redundancy, highly usable data set for the area.
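The branching in Step 3 can be sketched as a grouping pass. The record layout (a dict with a `geohash` key), the function name `split_windows`, and the prefix length of 6 are all assumptions for illustration; the patent leaves the window size to experiment.

```python
from collections import defaultdict


def split_windows(pois, prefix_len=6):
    """Group POIs by GeoHash prefix and separate lone points from crowded windows.

    Returns (stored, candidates): `stored` holds POIs that are alone in their
    window; `candidates` maps each prefix to the POIs needing similarity checks.
    """
    windows = defaultdict(list)
    for poi in pois:
        windows[poi["geohash"][:prefix_len]].append(poi)

    stored, candidates = [], {}
    for prefix, group in windows.items():
        if len(group) <= 1:
            stored.extend(group)  # no neighbor: keep as is
        else:
            candidates[prefix] = group  # neighbors present: inspect further
    return stored, candidates


pois = [
    {"name": "A", "geohash": "wtsqr1ab"},
    {"name": "B", "geohash": "wtsqr1cd"},
    {"name": "C", "geohash": "wx4g0e12"},
]
stored, candidates = split_windows(pois)
```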

Step 4: construct a word segmentation scheme based jointly on the Chinese Language Model (CLM) and the Hidden Markov Model (HMM).

Step 4.1: splitting of POI names that depends on the existing vocabulary is handled with the CLM. First, a prefix dictionary is used to scan the word graph of the POI name, and a directed acyclic graph (DAG) is built from all the words that can possibly be formed within the name, giving every segmentation W of the POI name S. Dynamic programming is then used to compute the conditional probability P(W|S) of each segmentation, and the segmentation W^* with the maximum conditional probability is taken as the final segmentation result. By Bayes' formula, W^* can be obtained by solving for P(W), and P(W) can be modeled with the CLM. Taking the bigram model as an example, the formula is as follows:

$$W^{*} = \arg\max_{W} P(W \mid S)$$
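The following sketch shows one way the DAG construction and dynamic programming of Step 4.1 could look. Under the usual bigram assumption, P(W) is approximated by the product of P(w_i | w_{i-1}) over the words of W. The toy counts, the `<s>` start marker, add-one smoothing, and the names `build_dag` and `segment` are assumptions for illustration and are not taken from the patent.

```python
import math
from collections import defaultdict


def build_dag(text, vocab):
    """Map each start index to the end indices of candidate words."""
    dag = {}
    for i in range(len(text)):
        ends = {i + 1}  # a single character is always allowed as a fallback
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in vocab:
                ends.add(j)
        dag[i] = sorted(ends)
    return dag


def segment(text, unigram, bigram):
    """Return the segmentation with the highest bigram probability."""
    if not text:
        return []
    dag = build_dag(text, set(unigram))
    vocab_size = len(unigram) + 1

    def logp(prev, word):
        # Add-one smoothed estimate of P(word | prev).
        return math.log((bigram.get((prev, word), 0) + 1)
                        / (unigram.get(prev, 0) + vocab_size))

    # best[j][i] = (log-probability, start of previous word) for the best
    # path whose last word is text[i:j].
    best = defaultdict(dict)
    for i in range(len(text)):
        for j in dag[i]:
            word = text[i:j]
            if i == 0:
                best[j][i] = (logp("<s>", word), None)
            else:
                best[j][i] = max(
                    (score + logp(text[h:i], word), h)
                    for h, (score, _) in best[i].items()
                )

    # Backtrack from the end of the name.
    j = len(text)
    i = max(best[j], key=lambda start: best[j][start][0])
    words = []
    while True:
        words.append(text[i:j])
        h = best[j][i][1]
        if h is None:
            break
        i, j = h, i
    return words[::-1]


# Hypothetical toy counts, for illustration only.
unigram = {"<s>": 10, "南京": 5, "大学": 5, "南京大学": 8,
           "图书馆": 6, "图书": 3, "馆": 1}
bigram = {("<s>", "南京大学"): 6, ("南京大学", "图书馆"): 5}
result = segment("南京大学图书馆", unigram, bigram)
```

With these toy counts the two-word path through 南京大学 and 图书馆 scores highest, because every other path contains at least one unseen bigram.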

Step 4.2: words that are not included in the vocabulary but need to be segmented, that is, out-of-vocabulary words, are handled with the HMM, which segments the POI name mainly on the basis of character-level word formation. Each character is assigned one of four word positions: word-initial, word-medial, word-final, or single-character word. The POI name is the input and the corresponding sequence of word positions is the output; splitting the word-position sequence gives the segmentation of the POI name. Out-of-vocabulary words that have been discovered are added to the vocabulary to improve the efficiency of the algorithm. An example of the final POI name segmentation is given in the table below.

POI name segmentation example
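The word-position tagging of Step 4.2 is commonly written with the labels B (word-initial), M (word-medial), E (word-final), and S (single-character word). The sketch below decodes the most likely tag sequence with the Viterbi algorithm and then splits the name at E and S tags. The label letters, the toy probabilities, the `floor` value for unseen characters, and the function names are assumptions for illustration; a real model would estimate its parameters from a tagged corpus.

```python
import math

STATES = "BMES"
NEG_INF = float("-inf")


def viterbi(chars, start_p, trans_p, emit_p, floor=-20.0):
    """Most likely B/M/E/S tag string for `chars` (all inputs are log-probs)."""
    if not chars:
        return ""
    score = [{s: start_p.get(s, NEG_INF) + emit_p[s].get(chars[0], floor)
              for s in STATES}]
    back = [{}]
    for t in range(1, len(chars)):
        score.append({})
        back.append({})
        for s in STATES:
            prev, best = max(
                ((p, score[t - 1][p] + trans_p[p].get(s, NEG_INF))
                 for p in STATES),
                key=lambda item: item[1],
            )
            score[t][s] = best + emit_p[s].get(chars[t], floor)
            back[t][s] = prev
    last = max("ES", key=lambda s: score[-1][s])  # a name must end a word
    tags = [last]
    for t in range(len(chars) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return "".join(reversed(tags))


def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in "ES":
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words


# Hypothetical toy parameters; unseen characters fall back to `floor`.
log = math.log
start_p = {"B": log(0.6), "S": log(0.4)}
trans_p = {
    "B": {"M": log(0.3), "E": log(0.7)},
    "M": {"M": log(0.3), "E": log(0.7)},
    "E": {"B": log(0.6), "S": log(0.4)},
    "S": {"B": log(0.6), "S": log(0.4)},
}
emit_p = {s: {} for s in STATES}

tags = viterbi("仙林", start_p, trans_p, emit_p)
words = tags_to_words("南京大学图书馆", "BMMEBME")
```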

Step 5: because the data comes from multiple platforms, the captured data can contain several records that point to the same geographic entity, which produces redundant data.

Redundant data of this kind must be cleaned with respect to both its attributes and its location. For redundant data whose attribute records are completely identical, one record is kept and the other redundant records are removed. For redundant data whose attribute records are only partly identical, the records are all neighboring POI data within the same window, so the POI with the highest data completeness is kept based on the location information and the other redundant records with matching attributes are discarded.
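A minimal sketch of these two rules, applied to the records of one window, might look as follows. The field names, the use of the name as the attribute that marks a partial match, and the function names are assumptions for illustration; the patent does not fix which attributes must agree.

```python
ALL_FIELDS = ("name", "address", "category", "phone")


def completeness(record, fields=ALL_FIELDS):
    """Number of non-empty attributes in a record."""
    return sum(1 for f in fields if record.get(f) not in (None, ""))


def clean_window(records, key_fields=("name",), fields=ALL_FIELDS):
    """Deduplicate the POI records of one GeoHash window."""
    # Rule 1: identical attribute rows, keep one.
    seen, unique = set(), []
    for rec in records:
        signature = tuple(rec.get(f) for f in fields)
        if signature not in seen:
            seen.add(signature)
            unique.append(rec)

    # Rule 2: partly identical rows, keep the most complete one.
    best = {}
    for rec in unique:
        key = tuple(rec.get(f) for f in key_fields)
        if key not in best or completeness(rec, fields) > completeness(best[key], fields):
            best[key] = rec
    return list(best.values())


window = [
    {"name": "A店", "address": "x路1号", "category": "food", "phone": ""},
    {"name": "A店", "address": "x路1号", "category": "food", "phone": ""},
    {"name": "A店", "address": "x路1号", "category": "food", "phone": "p1"},
    {"name": "B馆", "address": "", "category": "", "phone": ""},
]
cleaned = clean_window(window)
```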

For incomplete data that lacks auxiliary category information, a redundancy check is first made against the complete data, and the record is removed if it is determined to be redundant. If it is non-redundant, the similarity between the record and each type is computed from the per-type analysis results established in Step 3, and the POI is given a type label when it meets the similarity threshold or the keyword-matching requirement of a category.

For inconsistent data whose locations disagree, the location data with the most resolved information is selected as the entity's location information and the other inconsistent data is removed. For neighboring POI records whose names are similar but whose locations are identical, an index is built over this kind of inconsistent data within the window, and the name used by the largest group of records is taken as the entity's name.

Example of inconsistent neighboring POI data
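The name-selection rule for records that share a location can be sketched as a majority vote. The record layout and the function name `resolve_names` are assumptions for illustration, and exact coordinate equality stands in for whatever location match an implementation would use.

```python
from collections import Counter, defaultdict


def resolve_names(records):
    """For each location, pick the name used by the largest group of records."""
    names_by_location = defaultdict(list)  # index of same-location records
    for rec in records:
        names_by_location[(rec["lat"], rec["lon"])].append(rec["name"])
    return {
        loc: Counter(names).most_common(1)[0][0]
        for loc, names in names_by_location.items()
    }


records = [
    {"name": "南京大学图书馆", "lat": 32.1, "lon": 118.9},
    {"name": "南京大学图书馆", "lat": 32.1, "lon": 118.9},
    {"name": "南大图书馆", "lat": 32.1, "lon": 118.9},
    {"name": "仙林湖", "lat": 32.2, "lon": 119.0},
]
resolved = resolve_names(records)
```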

For POI data whose names are similar but not identical, and whose coordinates are close but do not coincide exactly, a preliminary similarity check is made first: the word segmentation scheme is used to split the entity's descriptive name into phrases, and a similar-data index is built from POI groups that share more than one identical word after segmentation, or from POIs whose segmentation yields a single word and which are similar. Next, the address data of the records in the similar-data index is standardized, for example by using a mapping library of Chinese administrative divisions to extract and map the original address information into standardized address data. Finally, based on the administrative-region mapping library and the word segmentation scheme, the POI address data is split into 8 parts, namely province, city, district, road, house number, community, building, and text. Different ways of comparing the split results are set for different POI types, and the POI data with complete geographic elements is selected for storage.
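The preliminary similarity check and the completeness comparison could be sketched as below. Several details are assumptions for illustration: the single-word case uses `difflib.SequenceMatcher` with a threshold of 0.6 because the text does not define "similar", the English field names for the 8 address parts are invented, and address parsing itself is assumed to have been done already.

```python
from difflib import SequenceMatcher

ADDRESS_PARTS = ("province", "city", "district", "road",
                 "house_number", "community", "building", "text")


def is_similar(tokens_a, tokens_b, threshold=0.6):
    """Preliminary check on two segmented POI names."""
    if len(tokens_a) == 1 and len(tokens_b) == 1:
        # Single-word names: fall back to a string similarity ratio.
        ratio = SequenceMatcher(None, tokens_a[0], tokens_b[0]).ratio()
        return ratio >= threshold
    # Otherwise require more than one shared word.
    return len(set(tokens_a) & set(tokens_b)) > 1


def address_completeness(address):
    """Number of the 8 address parts that are filled in."""
    return sum(1 for part in ADDRESS_PARTS if address.get(part))


def pick_most_complete(records):
    """Keep the record whose standardized address has the most parts."""
    return max(records, key=lambda rec: address_completeness(rec["address"]))


candidates = [
    {"name": "南京大学图书馆",
     "address": {"province": "江苏省", "city": "南京市"}},
    {"name": "南京大学图书",
     "address": {"province": "江苏省", "city": "南京市",
                 "district": "栖霞区", "road": "仙林大道"}},
]
kept = pick_most_complete(candidates)
```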

Step 6: although Step 5 reconstructs the POI data, a small number of POIs in the multi-source data still have a type that is inconsistent with their actual type. These POI data therefore need type re-matching before the cleaning of the multi-source POI data is complete.

Because POI data carries a category attribute, data of the same category influences a target POI far more than data of other categories. A POI category corpus is therefore built to simulate the corpus context in which each POI actually sits. First, the keywords in the POI names are counted (Term Count, TC), that is, the number of times a word appears in the corpus. Once TC is determined, the term frequency of each keyword is computed. For keyword j in POI names of type i, the term frequency (TF) is given by the formula below, where TC_{i,j} is the number of times the word appears in corpus i and \sum_{k} TC_{i,k} is the total number of occurrences of all words in corpus i.

$$TF_{i,j} = \frac{TC_{i,j}}{\sum_{k} TC_{i,k}}$$
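The per-category term frequency can be computed directly from segmented names. The input layout (a dict from POI type to a list of token lists) and the function name are assumptions for illustration.

```python
from collections import Counter


def term_frequencies(corpus_by_type):
    """Compute TF[i][j] = TC[i][j] / sum_k TC[i][k] for each POI type i."""
    tf = {}
    for poi_type, segmented_names in corpus_by_type.items():
        # TC: occurrences of each word in this type's corpus.
        tc = Counter(tok for name in segmented_names for tok in name)
        total = sum(tc.values())
        tf[poi_type] = {tok: count / total for tok, count in tc.items()}
    return tf


corpus = {"food": [["老王", "面馆"], ["兰州", "面馆"]]}
tf = term_frequencies(corpus)
```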

TF-IDF is improved on the assumption that what a POI name points to is related to where a keyword sits in the name sequence. When a POI name is scanned, a position weight W is added according to the keyword's position k in the sequence of phrases split from the name. W is defined as follows: when there are more than 2 keywords, the last keyword has weight 2, the second-to-last has weight 1.5, and the rest have weight 1; when there are exactly 2 keywords, the last has weight 1.5 and the other has weight 1; when there is 1 keyword, its weight is 1; a keyword that is a standalone number or letter is given no weight. The final weighted term frequency is obtained by multiplying W by TF, that is, W × TF_{i,j}.
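The weighting rules translate into a short function. Two points are interpretations rather than statements from the text: "no weight" for a standalone number or letter is represented here as a weight of 0.0, and such tokens still count toward the number of keywords. The function names are also assumptions.

```python
def position_weights(tokens):
    """Position weight W for each keyword of a segmented POI name."""
    n = len(tokens)
    weights = []
    for k, tok in enumerate(tokens):
        if tok.isascii() and tok.isalnum():
            weights.append(0.0)  # bare number or letters: no weight (assumed 0)
        elif n > 2 and k == n - 1:
            weights.append(2.0)  # last of more than two keywords
        elif n > 2 and k == n - 2:
            weights.append(1.5)  # second-to-last of more than two keywords
        elif n == 2 and k == n - 1:
            weights.append(1.5)  # last of exactly two keywords
        else:
            weights.append(1.0)
    return weights


def weighted_tf(tokens, tf):
    """Weighted term frequency W * TF for each keyword of one name."""
    return {tok: w * tf.get(tok, 0.0)
            for tok, w in zip(tokens, position_weights(tokens))}


example = weighted_tf(["兰州", "面馆"], {"兰州": 0.2, "面馆": 0.4})
```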

Using heap sort and batch processing, the top n keywords by weighted term frequency in each type are taken as the category core words, and a core-word dictionary tree (trie) is built to match core words efficiently. Finally, for the keywords of a given category matched in the POI name, the sum of the corresponding keyword weights is computed, which gives the probability that the POI belongs to that category. The category with the highest probability is assigned to the POI, completing the re-matching.
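A compact sketch of this last stage follows. It uses `heapq.nlargest` for the heap-based top-n selection and a dictionary-based trie; batch processing is omitted. Normalizing the summed weights into probabilities, the toy weighted frequencies, and all names are assumptions for illustration.

```python
import heapq

END = None  # trie key marking the end of a core word (never a character)


def core_words(weighted_tf_by_type, n):
    """Top-n keywords per type by weighted term frequency."""
    return {
        poi_type: dict(heapq.nlargest(n, words.items(), key=lambda kv: kv[1]))
        for poi_type, words in weighted_tf_by_type.items()
    }


def build_trie(cores):
    """Insert every core word, tagged with its type and weight."""
    root = {}
    for poi_type, words in cores.items():
        for word, weight in words.items():
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node.setdefault(END, []).append((poi_type, weight))
    return root


def classify(name, trie):
    """Return (best type, probabilities) from core words found in `name`."""
    scores = {}
    for i in range(len(name)):
        node = trie
        for ch in name[i:]:
            if ch not in node:
                break
            node = node[ch]
            for poi_type, weight in node.get(END, []):
                scores[poi_type] = scores.get(poi_type, 0.0) + weight
    if not scores:
        return None, {}
    total = sum(scores.values())
    probs = {t: s / total for t, s in scores.items()}
    return max(probs, key=probs.get), probs


# Hypothetical weighted term frequencies, for illustration only.
wtf = {
    "food": {"面馆": 0.6, "餐厅": 0.3, "店": 0.1},
    "education": {"大学": 0.7, "学院": 0.2, "店": 0.05},
}
trie = build_trie(core_words(wtf, n=2))
label, probs = classify("兰州面馆", trie)
```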

Claims

1. A multi-source POI data cleaning method integrating location and semantic constraints, characterized by performing the following steps:

step 1, performing GeoHash conversion on collected multi-source POI data, converting two-dimensional coordinate data into character strings;

step 2, performing a neighboring-point query on the converted character strings;

step 3, performing redundancy processing on the windows found in step 2 to contain neighboring points, carrying out redundant data processing, incomplete data processing, inconsistent data processing and highly similar data processing in sequence;

step 4, constructing a word segmentation scheme based jointly on the Chinese Language Model and the Hidden Markov Model, wherein: splitting of POI names that depends on the existing vocabulary is handled with the Chinese Language Model; for words that are not included in the vocabulary but need to be segmented, the Hidden Markov Model is used to segment the POI name based on character-level word formation;

step 5, performing redundancy processing on the data processed in step 4;

step 6, completing POI data re-matching based on term frequency statistics of the word segmentation scheme reconstructed in step 5, thereby achieving the multi-source POI data cleaning.

2. The multi-source POI data cleaning method integrating location and semantic constraints according to claim 1, wherein the neighboring-point query on the converted character strings is performed by prefix matching based on a B+ tree.

3. The multi-source POI data cleaning method integrating location and semantic constraints according to claim 1, wherein the redundant data, incomplete data, inconsistent data and highly similar data in step 3 are processed as follows:

redundant data processing: for duplicate data caused by continuous tracking of the same platform's data, one record is retained; where a small amount of redundant data agrees on only some attributes, the record with the highest data completeness is retained based on the location attribute;

incomplete data processing: a redundancy check is first made against the complete data; if the incomplete data is determined to be redundant, it is removed; if it is non-redundant, it is further determined whether it is inconsistent data or highly similar data, and it is handled in the corresponding way with the corresponding label attached;

inconsistent data processing: for inconsistent data at non-neighboring points, the POI name and location are verified through repeated geographic resolution and address resolution on different geographic service platforms; for inconsistent data at neighboring points, the location data with the most resolved information is selected as the entity's location information and the other inconsistent data is removed;

highly similar data processing: the inconsistent-data processing approach is used to segment the entity's descriptive name into phrases, an index of similar data is built, address data is obtained from the region mapping library of the designated area, and the POI data whose geographic elements are comparatively more complete is selected for storage.

4. The multi-source POI data cleaning method integrating location and semantic constraints according to claim 1, wherein the redundancy processing in step 5 is the same as in step 3 except for the objects processed.

5. The multi-source POI data cleaning method integrating location and semantic constraints according to claim 1, wherein in the reconstruction process of step 6, the keywords related to the names of the POI data and the term frequencies of those keywords are determined, and the inverse document frequency is discarded on the basis of the term frequency, so that high-probability keywords are selected to correspond to the respective POI data.