CN104679827A - Big data-based public information association method and mining engine - Google Patents
- Publication number: CN104679827A
- Application number: CN201510017418.8A
- Authority: CN (China)
- Prior art keywords: public information, data, information, source, allowing
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a big data-based public information association method and mining engine. The method comprises the following steps: (1) collecting public information sources on the internet, gathering the data sources associated with massive public information by either direct acquisition or authenticated acquisition; (2) a multi-source matching system matches the information pattern appropriate to each data source; (3) a multi-format information extraction system extracts the specified data and elements according to the format of the information carrier; (4) a multi-dimensional association and integration analysis system integrates and analyzes the aggregated data through operations such as deduplication, denoising, removal of false information and clustering, according to the association algorithm of the public information model; (5) an expert correction system corrects the deep learning algorithms and the systems used in the preceding steps, based on the indicators obtained and a quality assessment model; (6) a visual display system presents the specified public information in an integrated visual form organized as a time series.
Description
Technical field
The present invention relates to the technical field of big data-based public information association methods and mining engines, and specifically to a method for association analysis of the full-lifecycle data of a specified legal-person entity over the course of its development, together with the implementation of the mining engine.
Background technology
In the internet era, data and information have become important enterprise resources, and valuable information must be extracted quickly from rapidly changing mass data. Because information on the internet is voluminous and scattered, general-purpose search engines have become an essential tool for obtaining it. A search engine actively gathers and automatically indexes information and provides a query service: when a user enters a keyword, it returns the addresses of all pages containing that keyword, together with links to them. Many search engine systems exist on the internet today, but all have shortcomings in function and performance; in particular, queries for public information lack relevance and accuracy.
Hadoop is a distributed system architecture, a software platform that makes it easier to develop and run applications that process large-scale data.
NoSQL refers broadly to non-relational databases, which are characterized by easy scaling, large data volumes, high performance, flexible data models and high availability.
A microblog is a platform for sharing, spreading and obtaining information based on user relationships. It emphasizes timeliness and spontaneity, and is well suited to expressing moment-to-moment thoughts and the latest developments.
The WeChat public platform is a new service platform that provides business services and user management capabilities to individuals, enterprises and organizations.
Deeply mining the public information, and the relationships within it, that flows through platforms such as websites, microblogs and WeChat, in order to understand the full-lifecycle data of a legal-person entity truthfully, comprehensively and objectively, has become a practical need. At the same time, the growing maturity of the distributed storage, computing, NoSQL databases, data association analysis tools and data mining algorithms offered by the big data ecosystem provides technical support for big data mining of public information. At present, however, there is no mature big data-based public information association method or mining engine.
Summary of the invention
To overcome the limitations and shortcomings of the existing technology described above, the invention provides a big data-based public information association method and mining engine.
The technical solution adopted by the invention is implemented through the following steps:
(1) Collect public information from the internet, obtaining the data sources of massive public information by direct acquisition and authenticated acquisition.
The engine collects all public information on the internet, covering commercial, proprietary and public data sets. While observing each data set's original access rules, it uses the two modes of direct acquisition and authenticated acquisition to maximize the coverage of public information and its data sources.
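The split between direct and authenticated acquisition in step (1) can be illustrated with a minimal sketch. The patent does not specify how a source is classified, so the `PublicSource` record, its `requires_login` flag and the `plan_collection` helper are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class AcquisitionMode(Enum):
    DIRECT = "direct"                # openly readable, no login needed
    AUTHENTICATED = "authenticated"  # readable only with legitimately held credentials


@dataclass(frozen=True)
class PublicSource:
    # Hypothetical description of one information source.
    name: str
    kind: str             # e.g. "website", "microblog", "wechat", "mobile_app"
    requires_login: bool  # assumed to reflect the source's own access rule


def acquisition_mode(source: PublicSource) -> AcquisitionMode:
    """Choose the collection mode from the source's access rule."""
    if source.requires_login:
        return AcquisitionMode.AUTHENTICATED
    return AcquisitionMode.DIRECT


def plan_collection(sources):
    """Group source names by the acquisition mode each one needs."""
    plan = {mode: [] for mode in AcquisitionMode}
    for source in sources:
        plan[acquisition_mode(source)].append(source.name)
    return plan
```

A real collector would also have to record and honor each data set's access rules, which this sketch reduces to a single boolean.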
(2) The multi-source matching system matches the information pattern that corresponds to each type of information source (website, microblog, WeChat, mobile application). Different information sources have different data source models, and the information patterns of websites, microblogs, WeChat and mobile application clients also differ, so a pattern matching system adapted to multiple sources is developed.
(3) The multi-format information extraction system extracts the specified data and elements according to the format of the information carrier. The platform integrates multi-source data, placing information sets with different information patterns in a unified quantitative analysis environment. By building composite models, in which simple extraction models become the elements of more complex ones, it forms a pipelined, modular information extraction stream system.
Format modeling is the basis of data extraction. A format model is responsible for identifying and converting key information, and it also contains descriptive information about the source data. The objects it represents are the social-attribute information of legal-person entities: a model can represent an institution, a company or an enterprise legal person. Information about natural persons falls outside this data scope.
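The idea that simple extraction models become elements of a more complex one can be sketched as function composition. This is only an illustration under assumptions: the field names, the regular expressions and the `company_model` example are hypothetical and do not come from the patent.

```python
import re
from typing import Callable, Dict, List

# An extractor maps raw text to the fields it managed to find.
Extractor = Callable[[str], Dict[str, str]]


def regex_extractor(field: str, pattern: str) -> Extractor:
    """Simple model: pull one field out of the text with one pattern."""
    compiled = re.compile(pattern)

    def extract(text: str) -> Dict[str, str]:
        match = compiled.search(text)
        return {field: match.group(1).strip()} if match else {}

    return extract


def compose(extractors: List[Extractor]) -> Extractor:
    """Complex model: run the simple models in order and merge their fields."""

    def extract(text: str) -> Dict[str, str]:
        record: Dict[str, str] = {}
        for extractor in extractors:
            record.update(extractor(text))
        return record

    return extract


# Hypothetical format model for a company record.
company_model = compose([
    regex_extractor("name", r"Company name:\s*(.+)"),
    regex_extractor("registered_on", r"Registered on:\s*(\d{4}-\d{2}-\d{2})"),
])
```

Because `compose` itself returns an extractor, composed models can be nested inside larger ones, which is one way to obtain the modular pipeline the text describes.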
(4) The multi-dimensional association and integration analysis system integrates and analyzes the aggregated data through operations such as deduplication, denoising, removal of false information and clustering, according to the association indicators of the public information model. It includes multiple sets of association analysis tools to meet the needs of multi-dimensional analysis and complex association.
The system performs deep learning operations on the data, such as combination, aggregation, transformation, comparison and clustering, covering categorical and relational variables, time series and all kinds of data dimensions. It gathers numerous isolated pieces of data into a specific environment and then derives valuable results through time series and other in-depth analyses, while also supporting real-time analysis.
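As a rough illustration of the cleaning operations named in step (4), the sketch below deduplicates records, drops very short fragments as noise, and groups what remains by entity. The record layout (`entity`, `date`, `text`) and the length threshold are assumptions; the patent does not define its association algorithm, and removal of false information is not attempted here.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.lower().split())


def deduplicate(records):
    """Keep the first record for each (entity, date, text) combination."""
    seen = set()
    unique = []
    for record in records:
        key = (normalize(record["entity"]), record["date"], normalize(record["text"]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def denoise(records, min_length=10):
    """Drop records whose text is too short to carry information (assumed rule)."""
    return [r for r in records if len(r["text"].strip()) >= min_length]


def cluster_by_entity(records):
    """Group records under the normalized entity name they refer to."""
    clusters = defaultdict(list)
    for record in records:
        clusters[normalize(record["entity"])].append(record)
    return dict(clusters)
```

Exact-key grouping stands in for clustering here; a production system would more likely use similarity measures over names and text.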
(5) The expert correction system corrects the deep learning algorithms based on the indicators obtained and the data quality model.
Rapid iteration combined with fine-tuning analysis continually raises the value of the data, so the whole system becomes smarter through continuous cycles.
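One possible reading of the correction loop in step (5) is a quality indicator, computed from expert judgments, that feeds back into a model parameter. The sketch below tunes an acceptance threshold until precision reaches a target. The indicator, the threshold and the step size are all assumptions made for illustration; the patent does not say which indicators or algorithms are corrected.

```python
def precision(decisions, expert_labels):
    """Share of system-accepted associations that the expert confirmed."""
    accepted = [label for decided, label in zip(decisions, expert_labels) if decided]
    return sum(accepted) / len(accepted) if accepted else 1.0


def corrected_threshold(scores, expert_labels, threshold,
                        target_precision=0.9, step=0.05):
    """Raise the acceptance threshold until precision meets the target.

    scores        -- association scores produced by the system (0..1)
    expert_labels -- True where the expert confirmed the association
    """
    while threshold < 1.0:
        decisions = [score >= threshold for score in scores]
        if precision(decisions, expert_labels) >= target_precision:
            break
        threshold = round(threshold + step, 2)
    return threshold
```

Each new batch of expert labels can be passed through `corrected_threshold` again, which mirrors the repeated cycle the text describes.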
(6) The visual display system presents the public information of a legal-person entity in an integrated visual form organized as a time series. The system unifies multi-source data into a single multi-dimensional model for display, uses rich visual forms to make the abstract intuitive, and gives the user an overall view of the data associated with the entity of interest. The display updates in real time with the source data, so the user always sees the most current information.
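Before anything is drawn, the time series organization in step (6) amounts to ordering events by date and grouping those that share a date. A minimal sketch follows, assuming each event is a dictionary with hypothetical `date` and `title` keys.

```python
from datetime import date
from itertools import groupby


def build_timeline(events):
    """Return [(date, [titles...]), ...] in chronological order.

    sorted() is stable, so events on the same date keep their input order.
    """
    ordered = sorted(events, key=lambda event: event["date"])
    return [
        (day, [event["title"] for event in group])
        for day, group in groupby(ordered, key=lambda event: event["date"])
    ]


# Hypothetical sample data for one entity.
sample_events = [
    {"date": date(2015, 6, 3), "title": "Annual report published"},
    {"date": date(2015, 1, 14), "title": "Branch office registered"},
    {"date": date(2015, 6, 3), "title": "Press release issued"},
]
```

The resulting list can be handed to any charting layer; rerunning `build_timeline` whenever the source data changes gives the refreshed view the text calls for.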
In addition, the engine offers extensibility, customizability and application programming interfaces to external users, supporting customized information flows from underlying data integration and user-defined models through to the user interface; it is designed as an open platform. Customized information can be shared, linked and recombined. It is not an unmodifiable product but a material that can be flexibly added to new workflows: it can be iterated on, or added as material to new analysis models.
Compared with the prior art, the invention has the following advantages:
The invention provides a new public information association method and a new data mining engine. It traces the source characteristics of information, extracts and analyzes information according to its carrier format, and on that basis implements an association and integration analysis system for massive public information, comprising an integration analysis module that organizes information in time sequence and an association-dimension model based on the expert correction system. These two parts influence and complement each other, forming a complete data mining engine for public information.
The technical solution helps individuals, enterprises and institutions to perceive, conveniently and dynamically, the full-lifecycle data of a specified entity over its development, providing complete and accurate data support for decision analysis and behavior prediction and maximizing the value of the data.
Embodiment
The invention is further described below with reference to the accompanying drawing.
(1) According to the information model of the specified legal-person entity, determine where its public information is distributed on the internet. According to the nature of each information source (for example government websites, portal websites, professional media or specialized institutions), determine whether direct acquisition or authenticated acquisition is the appropriate technical means, and which data elements can be collected.
The engine collects all public information on the internet, covering commercial, proprietary and public data sets. While observing each data set's original access rules, it uses the two modes of direct acquisition and authenticated acquisition to maximize the coverage of public information and its data sources.
(2) Different information sources (website, microblog, WeChat, mobile application) have different information patterns and different data source models, and their style sheets differ greatly; even an update to a website's structure requires a new style sheet to be developed. The multi-source pattern matching system should therefore match style sheets automatically and detect style sheet matching anomalies promptly.
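Detecting a style sheet mismatch, as step (2) requires, can be approximated by checking that a parsed page still yields every field its template is expected to produce. The sketch below makes that assumption explicit: the `REQUIRED_FIELDS` table and the field names are hypothetical, and the patent does not describe its matching logic.

```python
# Hypothetical table: fields each source type's template is expected to yield.
REQUIRED_FIELDS = {
    "website": {"title", "body", "published_at"},
    "microblog": {"author", "text", "posted_at"},
}


def match_template(source_kind: str, parsed: dict) -> dict:
    """Report whether parsed output satisfies the template for its source type.

    A non-empty "missing" list signals a matching anomaly, for example after
    a site changed its page structure and the old template no longer fits.
    """
    required = REQUIRED_FIELDS.get(source_kind)
    if required is None:
        raise ValueError(f"no template registered for source kind {source_kind!r}")
    missing = sorted(field for field in required if not parsed.get(field))
    return {"matched": not missing, "missing": missing}
```

An anomaly report like this could trigger an alert so that a new template is developed before stale or incomplete records accumulate.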
(3) The platform integrates multi-source data, placing information sets with different information patterns in a unified qualitative and quantitative analysis environment. By building composite models, in which simple extraction models become the elements of more complex ones, it forms a pipelined, modular information extraction stream system. The multi-format information extraction system identifies the format of the information carrier, including file formats such as Word, Excel, WPS and PDF, and accurately extracts the specified data and elements on the basis of Chinese word segmentation.
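Identifying the carrier format is the first routing decision in step (3). A minimal sketch, assuming the format can be inferred from the file suffix, is shown below; the suffix table is an assumption, and actual parsing and Chinese word segmentation are outside its scope.

```python
from pathlib import Path

# Assumed mapping from file suffix to the format families named in the text.
FORMAT_BY_SUFFIX = {
    ".doc": "word",
    ".docx": "word",
    ".xls": "excel",
    ".xlsx": "excel",
    ".wps": "wps",
    ".pdf": "pdf",
}


def detect_format(filename: str) -> str:
    """Return the format family for a file name, or "unknown"."""
    return FORMAT_BY_SUFFIX.get(Path(filename).suffix.lower(), "unknown")


def route(filenames):
    """Group file names by detected format so each group can go to its own extractor."""
    routed = {}
    for name in filenames:
        routed.setdefault(detect_format(name), []).append(name)
    return routed
```

Suffixes can be wrong or missing, so a more robust detector would also inspect file contents before choosing an extractor.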
Format modeling is the basis of data extraction. A format model is responsible for identifying and converting key information, and it also contains descriptive information about the source data. The objects it represents are the social-attribute information of legal-person entities: a model can represent an institution, a company or an enterprise legal person, but not any natural person.
(4) For the multi-dimensional association and integration analysis system, multiple sets of data association analysis tools are developed according to the association indicators of the independently developed information model, to meet the needs of multi-dimensional analysis and complex association. The system must not only complete tasks such as deduplication, denoising, removal of false information and clustering, but also perform deep learning operations on the data such as combination, aggregation, transformation, comparison and clustering, covering categorical and relational variables, time series and all kinds of data dimensions.
The system is required to gather numerous isolated pieces of data into a specific environment and then derive valuable results through time series and other in-depth analyses, while also supporting real-time analysis.
(5) To cope with the complexity and many-sidedness of information, the expert correction system becomes especially important. Based on the indicators obtained and the data quality model, and taking industry-specific characteristics into account, it continually corrects the deep learning algorithms; rapid iteration combined with local analysis continually raises the value of the data. The aim is for the whole system to become more intelligent, which is itself a continuously cycling process.
(6) The system unifies multi-source data into a single multi-dimensional time series model for display. Rich visual forms such as bar charts, pie charts and line charts make the abstract intuitive and give the user an overall view of the data associated with the entity of interest. The display also updates with the source data, showing the most current information in real time.
In addition, the engine offers extensibility, customizability and application programming interfaces to external users, supporting customized information flows from underlying data integration and user-defined models through to the user interface; it is designed as an open platform. Customized information can be shared, linked and recombined. It is not an unmodifiable product but a material that can be flexibly added to new workflows: it can be iterated on, or added as material to new industry analysis models and supplied to partners with various professional data requirements.
Description of the drawing
The accompanying drawing is a diagram of the public information association method and mining engine.
Claims (5)
1. A big data-based public information association method and mining engine, characterized in that the method comprises the following steps:
(1) collection of internet public information sources: collecting the data sources associated with massive public information, classified by direct acquisition and authenticated acquisition;
(2) multi-source matching system: matching the information pattern that corresponds to each type of data source (website, microblog, WeChat, mobile application);
(3) multi-format information extraction system: extracting the specified data and elements according to the format of the information carrier;
(4) multi-dimensional association and integration analysis system: integrating and analyzing the aggregated data through operations such as deduplication, denoising, removal of false information and clustering, according to the association algorithm of the public information model;
(5) expert correction system: correcting the deep learning algorithms and each system in the above steps, based on the indicators obtained and a quality assessment model;
(6) visual display system: presenting the specified public information in an integrated visual form organized as a time series.
2. The big data-based public information association method and mining engine according to claim 1, characterized in that, in step (1), the public information sources comprise data domains that can be acquired directly and data domains that must be acquired with authentication.
3. The big data-based public information association method and mining engine according to claim 1, characterized in that, in step (3), the system must identify multiple file formats including Word, Excel, WPS and PDF and accurately extract the required data on the basis of Chinese word segmentation.
4. The big data-based public information association method and mining engine according to claim 1, characterized in that, in step (4), according to the association algorithm of the public information model, the aggregated data are integrated through operations such as deduplication, denoising, removal of false information and clustering, and new information sources are discovered.
5. The big data-based public information association method and mining engine according to claim 1, characterized in that, in step (5), the expert correction system adopts a mode that combines a non-relational database with machine learning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510017418.8A CN104679827A (en) | 2015-01-14 | 2015-01-14 | Big data-based public information association method and mining engine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510017418.8A CN104679827A (en) | 2015-01-14 | 2015-01-14 | Big data-based public information association method and mining engine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104679827A true CN104679827A (en) | 2015-06-03 |
Family
ID=53314869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510017418.8A Pending CN104679827A (en) | 2015-01-14 | 2015-01-14 | Big data-based public information association method and mining engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104679827A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105677768A (en) * | 2015-12-30 | 2016-06-15 | 芜湖乐锐思信息咨询有限公司 | Networked classification analysis system based on complex products |
CN106203676A (en) * | 2016-06-27 | 2016-12-07 | 浪潮(北京)电子信息产业有限公司 | A kind of Work Flow Optimizing method based on cloud computing framework |
CN106227896A (en) * | 2016-08-28 | 2016-12-14 | 杭州合众数据技术有限公司 | A kind of big data visualization fractional analysis method |
CN106453554A (en) * | 2016-10-11 | 2017-02-22 | 上海携程商务有限公司 | Monitoring system and monitoring method for application dependency in distributed information system |
CN106649298A (en) * | 2015-07-22 | 2017-05-10 | 中国科学院微电子研究所 | Cross-domain association establishment method and system based on Internet of things |
WO2017092696A1 (en) * | 2015-12-02 | 2017-06-08 | 中国银联股份有限公司 | Method for safe integration of big data without leaking privacy |
CN107093019A (en) * | 2017-04-21 | 2017-08-25 | 北京恒冠网络数据处理有限公司 | A kind of big data analysis system for macro adjustments and controls |
CN107391686A (en) * | 2017-07-24 | 2017-11-24 | 威创软件南京有限公司 | A kind of visual configuration data collecting system implementation method |
CN108763565A (en) * | 2018-06-04 | 2018-11-06 | 广东京信软件科技有限公司 | A kind of matched construction method of data auto-associating based on deep learning |
CN110008251A (en) * | 2019-03-07 | 2019-07-12 | 平安科技(深圳)有限公司 | Data processing method, device and computer equipment based on time series data |
CN110874356A (en) * | 2020-01-19 | 2020-03-10 | 南京创维信息技术研究院有限公司 | Cloud big data system and construction method thereof |
CN111625537A (en) * | 2020-04-24 | 2020-09-04 | 山东电子职业技术学院 | Multidimensional data analysis system and multidimensional data analysis method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009003281A1 (en) * | 2007-07-03 | 2009-01-08 | Tlg Partnership | System, method, and data structure for providing access to interrelated sources of information |
CN102523246A (en) * | 2011-11-23 | 2012-06-27 | 陈刚 | Cloud computation treating system and method |
CN104123323A (en) * | 2013-04-28 | 2014-10-29 | 成都勤智数码科技股份有限公司 | Method for collecting and recognizing service activities based on knowledge base |
CN104123317A (en) * | 2013-04-28 | 2014-10-29 | 成都勤智数码科技股份有限公司 | Service organization assessing and analyzing method based on knowledge base |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649298A (en) * | 2015-07-22 | 2017-05-10 | 中国科学院微电子研究所 | Cross-domain association establishment method and system based on Internet of things |
CN106649298B (en) * | 2015-07-22 | 2021-01-22 | 中国科学院微电子研究所 | Cross-domain association establishment method and system based on Internet of things |
WO2017092696A1 (en) * | 2015-12-02 | 2017-06-08 | 中国银联股份有限公司 | Method for safe integration of big data without leaking privacy |
CN105677768A (en) * | 2015-12-30 | 2016-06-15 | 芜湖乐锐思信息咨询有限公司 | Networked classification analysis system based on complex products |
CN106203676A (en) * | 2016-06-27 | 2016-12-07 | 浪潮(北京)电子信息产业有限公司 | A kind of Work Flow Optimizing method based on cloud computing framework |
CN106227896A (en) * | 2016-08-28 | 2016-12-14 | 杭州合众数据技术有限公司 | A kind of big data visualization fractional analysis method |
CN106453554B (en) * | 2016-10-11 | 2019-11-19 | 上海携程商务有限公司 | The monitoring system and monitoring method of dependence are applied in distributed information system |
CN106453554A (en) * | 2016-10-11 | 2017-02-22 | 上海携程商务有限公司 | Monitoring system and monitoring method for application dependency in distributed information system |
CN107093019A (en) * | 2017-04-21 | 2017-08-25 | 北京恒冠网络数据处理有限公司 | A kind of big data analysis system for macro adjustments and controls |
CN107391686A (en) * | 2017-07-24 | 2017-11-24 | 威创软件南京有限公司 | A kind of visual configuration data collecting system implementation method |
CN108763565A (en) * | 2018-06-04 | 2018-11-06 | 广东京信软件科技有限公司 | A kind of matched construction method of data auto-associating based on deep learning |
CN110008251A (en) * | 2019-03-07 | 2019-07-12 | 平安科技(深圳)有限公司 | Data processing method, device and computer equipment based on time series data |
CN110008251B (en) * | 2019-03-07 | 2023-07-04 | 平安科技(深圳)有限公司 | Data processing method and device based on time sequence data and computer equipment |
CN110874356A (en) * | 2020-01-19 | 2020-03-10 | 南京创维信息技术研究院有限公司 | Cloud big data system and construction method thereof |
CN111625537A (en) * | 2020-04-24 | 2020-09-04 | 山东电子职业技术学院 | Multidimensional data analysis system and multidimensional data analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150603 |