CN114911787B - Multi-source POI data cleaning method integrating position and semantic constraint - Google Patents
- Publication number
- CN114911787B
- Authority
- CN
- China
- Prior art keywords
- data
- poi
- processing
- inconsistent
- source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a multi-source POI data cleaning method that integrates location and semantic constraints, and belongs to the field of data processing technology. The method performs the following steps: Step 1, perform GeoHash conversion on the collected multi-source POI data; Step 2, perform a neighboring-point query on the converted strings; Step 3, perform redundancy processing on the windows found in Step 2 to contain neighboring points; Step 4, construct a word segmentation scheme; Step 5, perform redundancy processing on the data processed in Step 4; Step 6, complete POI data re-matching based on the word-frequency statistics of the word segmentation scheme reconstructed in Step 5. The method completes data cleaning more accurately and efficiently, and its results are better and more practical.
Description
Technical field
The invention relates to a multi-source POI data cleaning method that integrates location and semantic constraints, and belongs to the field of data processing technology.
Background
With the continuous emergence of new information publishing channels such as blogs, social networks, and location-based services (LBS), and with the rise of technologies such as cloud computing and the Internet of Things, data is growing and accumulating at an unprecedented rate, and every field is trying to mine the information hidden in big data. As data volume rises sharply, however, data quality keeps falling. In a big data environment, data from heterogeneous systems has several problems: (1) disorder: the data of different application systems lacks a uniformly standardized definition and is highly inconsistent; (2) duplication: the same real-world object has two or more identical physical descriptions in the database; (3) fuzziness: flaws in system design and human factors during use leave attribute values in data records missing or uncertain.
For these reasons, data cleaning plays an increasingly important role in data analysis and management. Data cleaning aims to identify and correct noise in data so that its impact on analysis results is minimized. POI (point of interest) data, as a component of big data, is an important carrier of location services and directly determines the quality of location service research. To obtain more, and more comprehensive, POI data, researchers and engineers collect data from multiple sources, but this also increases redundant data and introduces incomplete data.
Summary of the invention
The technical problem to be solved by the invention is how to provide a multi-source POI data cleaning method.
To solve this technical problem, the invention proposes the following technical solution: a multi-source POI data cleaning method integrating location and semantic constraints, which performs the following steps:
Step 1: Perform GeoHash conversion on the collected multi-source POI data, converting the two-dimensional coordinate data into strings;
Step 2: Perform a neighboring-point query on the converted strings;
Step 3: Perform redundancy processing on the windows found in Step 2 to contain neighboring points, carrying out redundant data processing, incomplete data processing, inconsistent data processing, and highly similar data processing in that order;
Step 4: Build a word segmentation scheme jointly from the Chinese Language Model and the Hidden Markov Model;
Step 5: Perform redundancy processing on the data processed in Step 4;
Step 6: Complete POI data re-matching based on the word-frequency statistics of the word segmentation scheme reconstructed in Step 5, thereby cleaning the multi-source POI data.
An improvement of the above technical solution: the neighboring-point query on the converted strings is performed by prefix matching based on a B+ tree.
An improvement of the above technical solution: in Step 3, redundant data, incomplete data, inconsistent data, and highly similar data are processed as follows.
Redundant data processing: for duplicate data caused by continuously tracking data on the same platform, keep a single record; where a small amount of redundant data agrees on only some attributes, keep the record with the highest data completeness, judged on the basis of the location attribute.
Incomplete data processing: first judge redundancy against the complete data. Data judged redundant is eliminated; non-redundant data is further checked to see whether it is inconsistent data or highly similar data, then processed accordingly and given the corresponding label.
Inconsistent data processing: for inconsistent data at non-neighboring points, verify the POI name and location through repeated geocoding and address resolution on different geographic service platforms; for inconsistent data at neighboring points, select the location data with the most resolved information as the entity location and eliminate the other inconsistent data.
Highly similar data processing: following the inconsistent-data approach, segment entity description names into phrases, build a similar-data index, obtain address data from the region mapping library of the designated region, and store the POI data whose geographic elements are more complete.
An improvement of the above technical solution: in Step 4, POI names that can be split using the existing vocabulary are processed with the Chinese Language Model; words that are not in the vocabulary but still need to be segmented are handled by the Hidden Markov Model, which segments the POI name by character-based word formation.
An improvement of the above technical solution: the redundancy processing in Step 5 is the same as in Step 3 except that it operates on different data.
An improvement of the above technical solution: the reconstruction in Step 6 determines the keywords related to the names of the POI data and the term frequency of each keyword; it relies on the term frequency and discards the inverse document frequency, so that high-probability keywords are selected to represent the corresponding POI data.
The beneficial effects of the invention are as follows: the invention applies redundancy, error, missing-data, and reclassification processing to multi-source POI data, and unifies data quality and standards through location and semantic constraints, yielding a highly available and highly credible POI data set. The method completes data cleaning more accurately and efficiently, gives better cleaning results, and is a more practical and effective data cleaning method.
Description of the drawings
Figure 1 is a flow chart of a multi-source POI data cleaning method integrating location and semantic constraints according to an embodiment of the invention.
Detailed description
Embodiment
The multi-source POI data cleaning method integrating location and semantic constraints of this embodiment, shown in Figure 1, performs the following steps:
Step 1: Perform GeoHash conversion on the collected multi-source POI data, converting the two-dimensional coordinate data into strings. GeoHash is a geographic location representation that encodes position information such as latitude and longitude into a string of letters and digits, reducing two-dimensional data to one dimension. The longer the prefix shared by the GeoHash strings of two locations, the closer the two points are in space.
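As an illustration of Step 1, the following is a minimal sketch of the standard GeoHash encoding (interleaved longitude/latitude bisection, base-32 alphabet). The function name `geohash_encode` and the default precision are assumptions made for this example; the patent does not specify an implementation.

```python
# Standard GeoHash base-32 alphabet (omits a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a GeoHash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

A shorter precision yields a coarser cell, so truncating a GeoHash is equivalent to widening the spatial window around the point.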
Step 2: Perform a neighboring-point query on the converted strings, using a B+ tree to match string prefixes within a fixed window. When choosing the GeoHash length, run several experiments to compute the latitude error, longitude error, and error in meters for each length, and choose the most suitable one.
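The prefix query in Step 2 can be sketched with a sorted list and binary search. This is a stand-in for the B+ tree named in the text (both keep keys ordered, so all strings sharing a prefix form one contiguous range); the function name `prefix_range_query` and the sample hashes are hypothetical.

```python
import bisect


def prefix_range_query(sorted_hashes, prefix):
    """Return every GeoHash in the sorted list that starts with `prefix`.

    Assumes GeoHash strings are plain ASCII, so prefix + U+FFFF sorts after
    every string that begins with the prefix.
    """
    lo = bisect.bisect_left(sorted_hashes, prefix)
    hi = bisect.bisect_left(sorted_hashes, prefix + "\uffff")
    return sorted_hashes[lo:hi]


# Hypothetical sample data: three points fall in the "wtw3" cell, two of them in "wtw3s".
hashes = sorted(["wtw3sj", "wtw3sm", "wtw3tq", "wx4g0b", "wtw37z"])
neighbors = prefix_range_query(hashes, "wtw3s")
```

Note that a pure prefix match misses points that are close in space but lie just across a cell boundary; a fuller implementation would usually also query the adjacent cells.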
Step 3: If the region queried in Step 2 contains only one POI point (POI location) or none, store the data for subsequent processing. If the region contains several neighboring POI points, perform data similarity detection, analyze data quality and redundancy, and then check, filter, and eliminate records to obtain a low-redundancy, highly available data set for the region.
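The branching in Step 3 can be illustrated by grouping records into windows keyed by a GeoHash prefix and separating single-POI windows from windows that need similarity detection. This is a sketch under assumptions: the record layout (a dict with a `geohash` field), the function name `split_windows`, and the default `prefix_len` are all hypothetical.

```python
from collections import defaultdict


def split_windows(records, prefix_len=6):
    """Group records by GeoHash prefix.

    Returns (isolated, crowded): records that are alone in their window, and a
    dict mapping each prefix to the list of records that share that window.
    """
    windows = defaultdict(list)
    for rec in records:
        windows[rec["geohash"][:prefix_len]].append(rec)
    isolated = [recs[0] for recs in windows.values() if len(recs) == 1]
    crowded = {prefix: recs for prefix, recs in windows.items() if len(recs) > 1}
    return isolated, crowded


records = [
    {"name": "poi-1", "geohash": "wtw3sjq1"},
    {"name": "poi-2", "geohash": "wtw3sjq9"},
    {"name": "poi-3", "geohash": "wx4g0bcd"},
]
isolated, crowded = split_windows(records)
```

Only the `crowded` windows go on to the redundancy, incompleteness, inconsistency, and similarity checks; the `isolated` records are stored directly.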
Step 4: Build a word segmentation scheme jointly from the Chinese Language Model (CLM) and the Hidden Markov Model (HMM).
Step 4.1: To split POI names that rely on the existing vocabulary, use the CLM. First scan the word graph formed by the POI name with a prefix dictionary and, from every word that can be formed in the name, build a directed acyclic graph (DAG) to obtain all segmentations W of the POI name S. Then compute the conditional probability P(W|S) of each segmentation by dynamic programming and take the segmentation W* with the maximum conditional probability as the final result. By Bayes' rule, P(W|S) = P(S|W)P(W)/P(S); P(S) is the same for every candidate and every candidate W spells out S exactly, so maximizing P(W|S) reduces to maximizing P(W), which the CLM models. Taking the bi-gram model as an example, P(W) is approximated by the product of P(w_i | w_{i-1}) over the words w_1, ..., w_n of W, and the objective is:
$$W^* = \arg\max_{W} P(W \mid S)$$
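Step 4.1 can be sketched as a memoized dynamic program over the DAG of candidate words, scored with bigram log-probabilities. This is an illustrative sketch only: the vocabulary, the probability values, the `<s>` start marker, and the `unseen_logp` fallback are made-up assumptions, and the patent does not disclose its actual model parameters.

```python
from functools import lru_cache


def build_dag(text, vocab):
    """Map each start index to the end indices of vocabulary words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1) if text[i:j] in vocab]
        if i + 1 not in ends:
            ends.append(i + 1)  # fall back to a single character
        dag[i] = ends
    return dag


def segment(text, vocab, bigram_logp, unseen_logp=-20.0):
    """Return the segmentation maximizing the sum of log P(word | previous word)."""
    dag = build_dag(text, vocab)

    @lru_cache(maxsize=None)
    def best(i, prev):
        # Best (score, words) for text[i:], given the previous word `prev`.
        if i == len(text):
            return 0.0, ()
        candidates = []
        for j in dag[i]:
            word = text[i:j]
            rest_score, rest_words = best(j, word)
            step = bigram_logp.get((prev, word), unseen_logp)
            candidates.append((step + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0, "<s>")[1])


# Toy vocabulary and toy log-probabilities, chosen only for illustration.
vocab = {"南京", "大学", "南京大学", "图书", "图书馆"}
bigram_logp = {
    ("<s>", "南京大学"): -1.0,
    ("南京大学", "图书馆"): -1.0,
    ("<s>", "南京"): -2.0,
    ("南京", "大学"): -2.0,
    ("大学", "图书馆"): -2.0,
}
words = segment("南京大学图书馆", vocab, bigram_logp)
```

Working in log space turns the product of bigram probabilities into a sum, which avoids underflow and lets the maximization decompose over DAG edges.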
Step 4.2: For words that are not in the vocabulary but still need to be segmented, that is, out-of-vocabulary words, use the HMM to segment the POI name, mainly by character-based word formation. Each character takes one of four positions: word beginning, word middle, word end, or single-character word. The POI name is the input and the corresponding sequence of position tags is the output; splitting the tag sequence yields the segmentation of the POI name. Out-of-vocabulary words found in this way are added to the vocabulary to improve the efficiency of the algorithm. An example of the final POI name segmentation is given in the following table.
POI name segmentation example
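The last part of Step 4.2, turning a sequence of position tags into words, can be sketched as follows. The tag letters B, M, E, S (beginning, middle, end, single) are the conventional labels for the four positions and are an assumption here, as is the function name `tags_to_words`; the HMM decoding that produces the tags (typically Viterbi) is not shown.

```python
def tags_to_words(name, tags):
    """Split `name` into words according to per-character position tags.

    Tags: B = word beginning, M = word middle, E = word end, S = single-character word.
    """
    if len(name) != len(tags):
        raise ValueError("name and tags must have the same length")
    words = []
    current = ""
    for ch, tag in zip(name, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


words_from_tags = tags_to_words("南京大学图书馆", "BMMEBME")
```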
Step 5: Because the data comes from multiple platforms, the crawled data may contain several records that point to the same geographic entity, which produces redundant data.
Such redundant data must be cleaned with respect to both attributes and location. For redundant data whose attribute records are completely identical, keep one record and eliminate the others. For redundant data whose attribute records are only partially identical, all records are neighboring POI data in the same window, so keep the POI with the highest data completeness on the basis of location information and discard the other redundant records with matching attributes.
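The two redundancy rules above can be sketched for the records of a single window. This is a simplified illustration under assumptions: the field names, the function names, and the choice to treat records with the same `name` as "partially identical" are all hypothetical, since the text does not define which attributes must match.

```python
def completeness(rec, fields):
    """Count how many of the given fields carry a non-empty value."""
    return sum(1 for f in fields if rec.get(f) not in (None, ""))


def drop_redundant(window, fields=("name", "address", "category", "phone")):
    """Remove redundant records from one window of neighboring POIs."""
    # Rule 1: identical attribute rows, keep the first occurrence only.
    seen = set()
    unique = []
    for rec in window:
        key = tuple(rec.get(f) for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # Rule 2 (assumption: same name means partially identical),
    # keep the most complete record per name.
    best = {}
    for rec in unique:
        name = rec.get("name")
        if name not in best or completeness(rec, fields) > completeness(best[name], fields):
            best[name] = rec
    return list(best.values())


window = [
    {"name": "A Cafe", "address": "1 Road", "category": "food", "phone": ""},
    {"name": "A Cafe", "address": "1 Road", "category": "food", "phone": ""},
    {"name": "A Cafe", "address": "1 Road", "category": "food", "phone": "123"},
    {"name": "B Bank", "address": "", "category": "finance", "phone": ""},
]
cleaned = drop_redundant(window)
```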
For incomplete data that lacks some category auxiliary information, first judge redundancy against the complete data and eliminate records judged redundant. For non-redundant data, use the per-type analysis results established in Step 3 to compute the similarity between the record and each type, and assign a type label to the POI data when the record meets that type's similarity threshold or keyword matching requirement.
For inconsistent data whose locations disagree, select the location data with the most resolved information as the entity location and eliminate the other inconsistent data. For neighboring POI records whose names are similar but whose locations are identical, build an index over such inconsistent data within the window and take the name used by the largest group of records as the entity name.
Example of inconsistent neighboring POI data
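One possible reading of the inconsistency rules is a simple vote within the window: the name used by the most records becomes the entity name, and the coordinates reported most often become the entity location. The sketch below implements that reading only; interpreting "most resolved information" as "most frequently reported" is an assumption, and the record fields and sample values are hypothetical.

```python
from collections import Counter


def resolve_entity(records):
    """Pick the most frequently reported name and coordinates within one window."""
    name = Counter(r["name"] for r in records).most_common(1)[0][0]
    lat, lon = Counter((r["lat"], r["lon"]) for r in records).most_common(1)[0][0]
    return {"name": name, "lat": lat, "lon": lon}


records = [
    {"name": "KFC (Xinjiekou)", "lat": 32.04, "lon": 118.78},
    {"name": "KFC (Xinjiekou)", "lat": 32.04, "lon": 118.78},
    {"name": "KFC Xinjiekou store", "lat": 32.05, "lon": 118.78},
]
entity = resolve_entity(records)
```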
For POI data whose names are similar but not identical and whose coordinates are close but do not coincide, first make an initial similarity judgment: segment the entity description names with the word segmentation scheme, and build a similar-data index from POI groups that share more than one identical word after segmentation, or from POI data whose segmentation yields a single word and whose names are similar. Next, standardize the address data in the similar-data index, for example by extracting and mapping the original address information with a Chinese administrative division mapping library to obtain standardized addresses. Finally, using the administrative region mapping library and the word segmentation scheme, split each POI address into 8 parts (province, city, district, road, house number, community, building, and free text), compare the split results in a way that depends on the POI type, and store the POI data whose geographic elements are complete.
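The initial similarity judgment and the address-completeness comparison can be sketched as below. The string-similarity measure (`difflib.SequenceMatcher`), the 0.6 threshold, the English field names for the eight address parts, and the function names are assumptions for illustration; the text does not name a specific measure or threshold.

```python
from difflib import SequenceMatcher

# The eight address parts listed in the text, with hypothetical field names.
ADDRESS_PARTS = ("province", "city", "district", "road",
                 "house_number", "community", "building", "text")


def initially_similar(tokens_a, tokens_b, threshold=0.6):
    """Initial judgment: more than one shared word, or two similar single-word names."""
    if len(set(tokens_a) & set(tokens_b)) > 1:
        return True
    if len(tokens_a) == 1 and len(tokens_b) == 1:
        ratio = SequenceMatcher(None, tokens_a[0], tokens_b[0]).ratio()
        return ratio >= threshold
    return False


def most_complete_address(candidates):
    """Return the candidate whose standardized address fills the most of the 8 parts."""
    return max(candidates, key=lambda rec: sum(1 for p in ADDRESS_PARTS if rec.get(p)))


a = {"province": "江苏省", "city": "南京市", "road": "汉口路"}
b = {"province": "江苏省", "city": "南京市", "district": "鼓楼区",
     "road": "汉口路", "house_number": "22号"}
kept = most_complete_address([a, b])
```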
Step 6: The POI data rebuilt in Step 5 still contains a small number of records whose POI type differs from the actual type, a consequence of the multiple data sources. These records therefore need type re-matching before the cleaning of the multi-source POI data is complete.
Because POI data carries a category attribute, data of the same category influences a target POI far more than data of other categories. A POI category corpus is therefore built to simulate the real corpus context of each POI. First count the occurrences of each keyword in POI names (Term Count, TC), that is, the number of times a word appears in the corpus. With TC determined, compute the term frequency of each keyword. For keyword j in the POI names of type i, the term frequency (TF) is
$$TF_{i,j} = \frac{TC_{i,j}}{\sum_{k} TC_{i,k}}$$
where $TC_{i,j}$ is the number of times the word appears in corpus i and $\sum_{k} TC_{i,k}$ is the total number of occurrences of all words in corpus i.
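The term-frequency definition above, a word's count divided by the total count of all words in the same category corpus, can be sketched as follows. The function name and the sample corpus are hypothetical.

```python
from collections import Counter


def term_frequencies(segmented_names):
    """Compute TF for one category corpus given its segmented POI names.

    `segmented_names` is a list of token lists, one per POI name.
    """
    counts = Counter(tok for name in segmented_names for tok in name)
    total = sum(counts.values())
    # An empty corpus yields an empty dict, so no division by zero occurs.
    return {tok: c / total for tok, c in counts.items()}


# Hypothetical "finance" corpus with two segmented POI names.
tf = term_frequencies([["中国", "银行"], ["工商", "银行"]])
```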
TF-IDF is then modified, on the assumption that what a POI name refers to is related to where a keyword sits in the name sequence. When scanning a POI name, a position weight W is added according to the index k of the keyword among the words the name was split into. W is defined as follows: when there are more than 2 keywords, the last keyword has weight 2, the second-to-last has weight 1.5, and the rest have weight 1; when there are exactly 2 keywords, the last has weight 1.5 and the other has weight 1; when there is 1 keyword, its weight is 1; a keyword that is a standalone number or letter receives no weight. The final weighted term frequency is the product of the position weight and the term frequency, $W \times TF_{i,j}$.
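The position-weight rule can be written down directly. In this sketch, "receives no weight" is interpreted as a weight of 0, and "standalone number or letter" is interpreted as a token made only of ASCII letters and digits; both interpretations, and the function name, are assumptions.

```python
def position_weights(keywords):
    """Return the position weight W for each keyword of one segmented POI name."""
    n = len(keywords)
    weights = []
    for k, kw in enumerate(keywords):
        if kw.isascii() and kw.isalnum():
            w = 0.0  # assumption: standalone digits/letters contribute nothing
        elif n > 2 and k == n - 1:
            w = 2.0  # last keyword of a name with more than 2 keywords
        elif (n > 2 and k == n - 2) or (n == 2 and k == n - 1):
            w = 1.5  # second-to-last of 3 or more, or last of exactly 2
        else:
            w = 1.0
        weights.append(w)
    return weights


weights = position_weights(["南京", "大学", "图书馆"])
```

The rule favors the tail of the name, which matches the common pattern in Chinese POI names where the final word states the kind of place (for example, a name ending in "library" or "bank").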
Using heap sort and batch processing, take the top n keywords by weighted term frequency in each type as that category's core words, and build a core-word trie (dictionary tree) to match core words efficiently. Finally, for the keywords in a POI name that match a given category, sum the corresponding weights to obtain the probability that the POI belongs to that category. The category with the highest probability is assigned to the POI, which completes the re-matching.
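The core-word selection and category assignment can be sketched with a heap-based top-n and a per-category score. For brevity this sketch looks up already-segmented tokens in plain dictionaries instead of building the trie described in the text, and it returns the highest summed weight rather than a normalized probability; the function names and sample values are hypothetical.

```python
import heapq


def core_words(weighted_tf, n):
    """Select the n keywords with the highest weighted term frequency."""
    return dict(heapq.nlargest(n, weighted_tf.items(), key=lambda kv: kv[1]))


def rematch_category(tokens, core_by_category):
    """Assign the category whose core words give the highest summed weight.

    Returns None when no category is defined or no core word matches.
    """
    if not core_by_category:
        return None
    scores = {
        cat: sum(core.get(tok, 0.0) for tok in tokens)
        for cat, core in core_by_category.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


finance_core = core_words({"银行": 0.5, "支行": 0.3, "路": 0.1}, 2)
categories = {"finance": finance_core, "food": {"餐厅": 0.6}}
label = rematch_category(["中国", "银行", "支行"], categories)
```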
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210613379.8A CN114911787B (en) | 2022-05-31 | 2022-05-31 | Multi-source POI data cleaning method integrating position and semantic constraint |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114911787A CN114911787A (en) | 2022-08-16 |
CN114911787B true CN114911787B (en) | 2023-10-27 |
Family
ID=82771332
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210613379.8A Active CN114911787B (en) | 2022-05-31 | 2022-05-31 | Multi-source POI data cleaning method integrating position and semantic constraint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114911787B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20150032378A (en) * | 2013-09-16 | 2015-03-26 | 엔에이치엔엔터테인먼트 주식회사 | Service method and system for providing reward using moving path of users |
CN107153712A (en) * | 2017-05-26 | 2017-09-12 | 南京大学 | Support the personalized customization picture management method of the time and space association of mobile terminal |
CN107798054A (en) * | 2017-09-04 | 2018-03-13 | 昆明理工大学 | A kind of range query method and device based on Trie |
CN108846013A (en) * | 2018-05-04 | 2018-11-20 | 昆明理工大学 | A kind of spatial key word querying method and device based on geohash Yu Patricia Trie |
CN111143588A (en) * | 2019-12-27 | 2020-05-12 | 中科星图股份有限公司 | Image space-time index quick retrieval method based on machine learning |
CN111274341A (en) * | 2020-01-16 | 2020-06-12 | 中国建设银行股份有限公司 | Method and device for site selection |
CN112287055A (en) * | 2020-11-03 | 2021-01-29 | 亿景智联(北京)科技有限公司 | Algorithm for calculating redundant POI data according to cosine similarity and Buffer |
CN112307142A (en) * | 2020-06-05 | 2021-02-02 | 北京沃东天骏信息技术有限公司 | Method and device for determining information point in geographic information system and storage medium |
CN112527938A (en) * | 2020-12-17 | 2021-03-19 | 安徽迪科数金科技有限公司 | Chinese POI matching method based on natural language understanding |
CN113032672A (en) * | 2021-03-24 | 2021-06-25 | 北京百度网讯科技有限公司 | Method and device for extracting multi-modal POI (Point of interest) features |
CN113568951A (en) * | 2021-07-30 | 2021-10-29 | 拉扎斯网络科技(上海)有限公司 | Data mining and processing method and device, storage medium and electronic equipment |
CN113761867A (en) * | 2020-12-29 | 2021-12-07 | 京东城市(北京)数字科技有限公司 | Address recognition method and device, computer equipment and storage medium |
CN114201480A (en) * | 2021-11-04 | 2022-03-18 | 深圳市凯立德科技股份有限公司 | Multi-source POI fusion method and device based on NLP technology and readable storage medium |
CN114491056A (en) * | 2021-12-10 | 2022-05-13 | 新智道枢(上海)科技有限公司 | Method and system for improved POI search in digital policing scenarios |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7937402B2 (en) * | 2006-07-10 | 2011-05-03 | Nec (China) Co., Ltd. | Natural language based location query system, keyword based location query system and a natural language and keyword based location query system |
CN107291785A (en) * | 2016-04-12 | 2017-10-24 | 滴滴(中国)科技有限公司 | A kind of data search method and device |
US10776405B2 (en) * | 2016-07-28 | 2020-09-15 | International Business Machines Corporation | Mechanism and apparatus of spatial encoding enabled multi-scale context join |
US11366866B2 (en) * | 2017-12-08 | 2022-06-21 | Apple Inc. | Geographical knowledge graph |
Non-Patent Citations (2)
Title |
---|
Spatial Data Quality in the Internet of Things: Management, Exploitation, and Prospects;Huan Li等;《ACM Computing Surveys》;第55卷(第3期);第1-41页 * |
结合否定关键词的空间关键词查询;金海等;《微电子学与计算机》;第38卷(第9期);第54-60页 * |
Also Published As
Publication number | Publication date |
---|---|
CN114911787A (en) | 2022-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804521B (en) | Knowledge graph-based question-answering method and agricultural encyclopedia question-answering system | |
CN111522910B (en) | Intelligent semantic retrieval method based on cultural relic knowledge graph | |
US8171029B2 (en) | Automatic generation of ontologies using word affinities | |
CN111967761B (en) | Knowledge graph-based monitoring and early warning method and device and electronic equipment | |
CN116628173B (en) | Intelligent customer service information generation system and method based on keyword extraction | |
CN104679885B (en) | A kind of user's search string organization names recognition method based on semantic feature model | |
CN111767476B (en) | Method for constructing space-time big data spatialization engine of smart city based on HMM model | |
CN109408578B (en) | Monitoring data fusion method for heterogeneous environment | |
CN109145260A (en) | A kind of text information extraction method | |
CN113033183B (en) | Network new word discovery method and system based on statistics and similarity | |
CN114780680B (en) | Retrieval and completion method and system based on place name and address database | |
CN116361474A (en) | Method and device for updating text knowledge graph of defects of power equipment | |
CN112447172B (en) | Quality improvement method and device for voice recognition text | |
CN114911787B (en) | Multi-source POI data cleaning method integrating position and semantic constraint | |
CN118535978A (en) | News analysis method and system based on multi-mode large model | |
CN116304092B (en) | Method for automatically acquiring job concepts and expanding map for recruitment field | |
CN111814457B (en) | Power grid engineering contract text generation method | |
CN113128210B (en) | Webpage form information analysis method based on synonym discovery | |
Feng et al. | Research on the technology of data cleaning in big data | |
CN115617981A (en) | Information level abstract extraction method for short text of social network | |
CN114610842A (en) | Associated searching method and system based on intention identification | |
CN114707574A (en) | Historical error correction method and system for preventing over-splitting of scholarly-theorized library | |
KR20170032084A (en) | System and method for correcting user's query | |
CN115146630B (en) | Word segmentation method, device, equipment and storage medium based on professional domain knowledge | |
KR100745367B1 (en) | Method of index and retrieval of record based on template and question answering system using as the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |