CN102214179A - Method for capturing network information - Google Patents
- Publication number
- CN102214179A (application number CN 201010144137)
- Authority
- CN
- China
- Prior art keywords
- webpage
- object node
- structured message
- network information
- current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for capturing network information. The method comprises the following steps: fetching a web page from an initial URL serving as the current URL, analyzing the fetched web page, extracting the structured information in it and storing the structured information as a current object node; then repeatedly fetching a web page from a link address in the fetched web page serving as the current URL, analyzing the fetched web page, extracting the structured information in it, storing the structured information as a current object node, and defining and storing the relationship between the current object node and existing object nodes, until capture of the network information is complete.
Description
[technical field]
The present invention relates to the field of search engines, and in particular to web page capturing (crawling) technology for search engines.
[background technology]
With the rapid development of network communication technology, the Internet has become a huge distributed information space containing potentially valuable knowledge. Network information holds much useful, latent knowledge and many patterns that are not easy to discover, and people urgently need methods and tools for finding and capturing this knowledge and these patterns.
Information on the Internet resides in a great many web pages, which are connected to one another by hyperlinks to form a complex information network. In the early days of the Internet, finding information was very inconvenient, which led to the emergence of search engines. A search engine collects and discovers information on the Internet, interprets, extracts, organizes and processes it, and provides a retrieval service to users. Put simply, a search engine works in three stages: information capturing, information processing and query service. In information capturing, a web crawler starts from the URLs of one or several initial pages and obtains the network information on those pages; it keeps extracting new URLs from the current page and placing them in a queue so as to obtain more pages and the information on them, until a stop condition of the system is satisfied. In information processing, the captured network information is stored in the search engine's database and then processed to facilitate retrieval. Finally, the query service returns the processed network information in response to the user's query.
In the prior art, however, the smallest object a search engine handles is a web page. Referring to Fig. 1, it shows the structural model 100 with which an existing search engine describes the Internet. This structural model 100 is a web page graph model. The web page graph 100 consists of a number of web page nodes and hyperlink edges. During information capturing, the search engine saves each web page as a web page node, such as node 102 in the figure; it then connects the web page nodes using hyperlinks as relationships, such as edge 104 in the figure. The whole Internet is thus stored as a web page graph structure.
It should be noted that not all the information in a web page is information the user wants. Referring to Fig. 2, it shows a prior-art web page 200 containing a structured information block. The web page 200 comprises three parts: a site navigation information block 202, an advertisement and other information block 204, and the subject part 206 of the web page 200. For most users, what they hope to find is only the information in the subject part 206 that relates to their keywords; they are not interested in the site navigation information block 202 or the advertisement and other information block 204. Network information such as the subject part 206 of the web page 200 is referred to here as a structured information block. A structured information block is web page information that, after analysis, can be decomposed into a number of interrelated components with a clear hierarchy among them, and whose use and maintenance are managed through a database. For example, in a page about a notebook computer, the structured information block contains the notebook's "brand, model, CPU, memory, hard disk, display screen ..." information; in a page about real estate, the structured information block contains the property's "type, district, address, floor plan, area, decoration, rent, contact person, telephone number ..." information.
Such information exists on the network in massive quantities, and it is also the information users wish to obtain directly. If a search engine describes the Internet with the web page graph structure of Fig. 1 during information capturing, query results will clearly contain a large amount of useless information, lowering precision. Moreover, storing the relationships between web page nodes as hyperlinks has no logical basis: a search engine presents web page addresses to the user as search results, and when the user clicks a result, the next site reached through a hyperlink may well be a useless advertising site, which differs greatly from what the user expected and wastes the user's time.
Therefore, a new technical solution is needed to overcome the above shortcomings.
[summary of the invention]
The purpose of this section is to summarize some aspects of the embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the abstract and in the title of the invention to avoid obscuring their purpose; such simplifications or omissions shall not be used to limit the scope of the invention.
One object of the present invention is to provide a network information capturing method by which a search engine can capture structured information on the Internet.
To achieve this object, according to one aspect of the present invention, a network information capturing method is provided, the method comprising: taking an initial URL as the current URL, fetching a web page from the current URL, analyzing the fetched web page and extracting the structured information in it, and storing the structured information as a current object node;
taking a link address in the fetched web page as the current URL, continuing to fetch a web page from the current URL, analyzing the fetched web page and extracting its structured information, storing the structured information as a current object node, defining and storing the relationship between the current object node and existing object nodes, and repeating this operation to complete the capturing of the network information.
Further, there are one or more initial URLs.
Further, analyzing the fetched web page and extracting the structured information in it means extracting the structured information blocks in the fetched web page, or converting the semi-structured and unstructured information blocks in the fetched web page into structured information blocks, each structured information block serving as one object node.
Further, one or more structured information blocks may be extracted from a fetched web page, each structured information block serving as one object node.
Further, defining and storing the relationship between the current object node and existing object nodes means defining that relationship through the logical or semantic relationship between the data in the current object node and the data in the existing object nodes, and storing it.
Further, defining and storing the relationship between the current object node and existing object nodes means that each time a current object node is extracted, its relationships with the existing object nodes are defined and stored.
Further, if no structured information can be extracted from a fetched web page, the fetched web page is taken as a pseudo-object node.
Further, the network information captured by the network information capturing method is an object graph.
Further, the method also comprises removing the pseudo-object nodes from the object graph obtained by the network information capturing method.
Compared with the prior art, the present invention describes the Internet with an object graph. The smallest unit the search engine handles is an object node, that is, a structured information block, so users obtain directly useful information with advertisements and useless information excluded. In addition, the relationships between object nodes are defined by logical or semantic relationships, which can give query results better precision.
[description of drawings]
The present invention will be more readily understood from the following detailed description taken in conjunction with the accompanying drawings, in which like reference numerals denote like structural components:
Fig. 1 is the structural model with which an existing search engine describes the Internet;
Fig. 2 is a prior-art web page containing a structured information block;
Fig. 3 is a schematic structural diagram of the object graph of the present invention in one embodiment;
Fig. 4 is a schematic diagram of describing the Internet with the object graph of the present invention; and
Fig. 5 is a flow chart of the network information capturing method of the present invention in one embodiment.
[embodiment]
The detailed description of the present invention is presented mainly in terms of procedures, steps, logic blocks, processes or other symbolic representations that directly or indirectly resemble the operation of the technical solution of the invention. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention; the invention may nevertheless be practiced without these specific details. Those skilled in the art use such descriptions and representations to convey the substance of their work effectively to others skilled in the art. In other words, to avoid obscuring the purpose of the invention, well-known methods, procedures, components and circuits are not described in detail, since they are readily understood.
Reference herein to "one embodiment" or "an embodiment" means a particular feature, structure or characteristic that can be included in at least one implementation of the invention. The appearances of "in one embodiment" in various places in this specification do not all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Furthermore, the order of modules in the methods, flow charts or functional block diagrams representing one or more embodiments does not inherently indicate any particular order and does not limit the invention.
The network information capturing method of the present invention can be implemented on a computer, in combination with suitable programs, as an information capturing module located at the information capturing position of the overall search engine system. When capturing network information, the module takes the structured information block as the smallest processing unit and describes the Internet as an object graph rather than a web page graph. For the sake of focus, only the network information capturing technology relevant to the invention is described below; other aspects of the search engine system are not discussed here.
Referring to Fig. 3, it shows a schematic structural diagram of the object graph of the present invention in one embodiment. The object graph 300 likewise comprises the two basic elements of a graph model: nodes and edges. An object graph is defined as consisting of a number of object nodes (such as node 304 and node 310 in the figure) and relationship edges connecting pairs of object nodes (such as edge 306 in the figure). An object node represents a structured information block within a web page on the Internet. For example, the structured information block 304 in web page 302 in the figure is an object node, and the subject part 206 of web page 200 in Fig. 2 is also an object node. In one embodiment, an object node may represent the structured information of a product, which may include the product name, price, product information and place of origin. In another embodiment, an object node may represent the structured information of a company, which may include the company name, company size, date of incorporation and legal representative. In short, object nodes may represent different information for different subjects. A relationship edge connecting two object nodes represents the relationship between them, usually a logical or semantic relationship between the structured information the two nodes represent. In one embodiment, if the subjects described by two object nodes A and B are both academic papers, their structured information may include the author, publisher, publication date and abstract of the paper, and the relationship between the two object nodes may then be that object node A cites object node B, that A and B have the same author, that A and B have the same publisher, that A and B have the same subject, and so on.
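As a minimal sketch of the node-and-edge structure described above, an object graph could be held in memory as follows. The class names, field names and example URLs are illustrative assumptions and do not come from the disclosure, which does not prescribe any particular representation.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """One structured information block extracted from a web page."""
    node_id: int
    source_url: str
    fields: dict             # e.g. {"author": "...", "publisher": "..."}
    is_pseudo: bool = False  # True when the page yielded no structured information


@dataclass
class ObjectGraph:
    """Object nodes plus labelled relationship edges between pairs of nodes."""
    nodes: dict = field(default_factory=dict)  # node_id -> ObjectNode
    edges: list = field(default_factory=list)  # (node_id_a, node_id_b, label)

    def add_node(self, source_url, fields, is_pseudo=False):
        # Ids are assigned sequentially; this sketch never deletes nodes.
        node = ObjectNode(len(self.nodes), source_url, fields, is_pseudo)
        self.nodes[node.node_id] = node
        return node

    def add_edge(self, a, b, label):
        self.edges.append((a.node_id, b.node_id, label))


# Two academic-paper nodes linked by a "same author" relationship edge.
graph = ObjectGraph()
paper_a = graph.add_node("http://example.com/a", {"author": "Li", "publisher": "P1"})
paper_b = graph.add_node("http://example.com/b", {"author": "Li", "publisher": "P2"})
graph.add_edge(paper_a, paper_b, "same author")
```

Storing the edge label as plain text keeps the sketch small; a real system would likely use a fixed vocabulary of relationship types.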
Referring to Fig. 4, it shows a schematic diagram of describing the Internet with the object graph of the present invention. The Internet 400 comprises many interrelated object graphs. In one embodiment, object graph 402 is a set of object nodes whose subject is academic papers, together with the associated relationship edges. In another embodiment, object graph 404 is an information set representing all the people of a school, in which the object nodes represent the personal information of all students, teachers and staff, and the relationship edges may be logical relationships such as class or age. In another embodiment, object graph 406 represents all the articles of a blog site, in which the object nodes represent information such as the text, author and time of a blog post, and the relationship edges may be authors' shared interests, the same publication time, and so on. Each object graph may be an independent set in terms of subject or semantics, yet the graphs are connected to one another by relationships: for example, a student or teacher in object graph 404 may be the author of an academic paper in object graph 402, and the owner of a blog in object graph 406 may be a staff member in object graph 404. In short, when the Internet is described with object graphs, each object node is intended to contain a logically or semantically independent structured information block, and each relationship between object nodes is a logical or semantic one.
Clearly, describing the Internet with object graphs amounts to having the search engine screen and filter the information on the network in advance. When a user searches, the most important or most desired information can be returned to the user directly.
Referring to Fig. 5, it shows a flow chart of the network information capturing method 500 of the present invention. The method 500 comprises the following steps.
Step 502: taking an initial URL as the current URL, fetching a web page from the current URL, analyzing the fetched web page and extracting the structured information in it, and storing the structured information as a current object node.
The search engine may start fetching web pages from one or more initial URLs. After a web page has been fetched, the structured information in it is extracted as an object node. In one embodiment, a structured information template may be defined before structured information is extracted from web pages. As noted above, the template may be defined completely differently for different data subjects. For a subject such as product information, the structured information may include fields such as product name, product description, product price, product information and place of origin; for a subject such as company information, it may include fields such as company name, company size, date of incorporation and legal representative. The defined structured information template is used to perform a traversal search of the web page; if part of the data in the web page matches the template, that part can be extracted as the structured information of the web page. In another embodiment, a vision-based web structured information extraction technique is used to extract the structured information blocks on the network, or to convert semi-structured and unstructured information blocks into structured information blocks; that is, a complete page is divided into multiple semantic blocks, and one of the semantic blocks is extracted as the structured information block of the web page. In another embodiment, several web structured information block extraction techniques are combined to process web pages comprehensively and obtain more structured information blocks as object nodes.
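The template-matching embodiment can be sketched as follows, under the assumption that an earlier parsing step has already split the page into candidate blocks of label/value pairs. The template contents, the function name `extract_structured_blocks` and the "all fields present" matching rule are hypothetical choices made for illustration; the disclosure does not specify how a match is decided.

```python
# Hypothetical templates: subject name -> field names a block must provide.
TEMPLATES = {
    "product": {"name", "price", "origin"},
    "company": {"name", "size", "founded"},
}


def extract_structured_blocks(candidate_blocks, templates=TEMPLATES):
    """Return (subject, fields) for every candidate block that matches a template.

    candidate_blocks is assumed to be a list of dicts produced by some earlier
    page-parsing step; each dict maps a label found on the page to its value.
    """
    matches = []
    for block in candidate_blocks:
        for subject, required in templates.items():
            if required.issubset(block):  # every template field is present
                # Keep only the template fields; drop everything else.
                matches.append((subject, {key: block[key] for key in required}))
                break
    return matches


page_blocks = [
    {"home": "/", "about": "/about"},                                   # navigation
    {"name": "Green tea", "price": "30", "origin": "Wuxi", "ad": "x"},  # product
]
result = extract_structured_blocks(page_blocks)
```

Here the navigation block matches no template and is ignored, while the product block is reduced to its three template fields.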
In one embodiment, if one piece of structured information is extracted from the current web page, it is taken as one current object node; if two pieces of structured information are extracted, they are taken as two current object nodes, and the relationship between the two is defined; if no structured information block is extracted, the page is first stored as a pseudo-object node. For example, if web page 302 in Fig. 3 is a product shopping guide page, the structured information 304 of the product can be extracted from it, forming object node 304; if web page 308 is a page of user reviews of a product, no structured information can be extracted from it, so a pseudo-object node 310 is created for now; and if web page 312 in Fig. 3 contains two structured information blocks 314 and 316, two object nodes are formed.
After a page has been processed, the next page is fetched according to the link addresses in that page, and structured information is extracted in the same way. In particular, all link addresses in the page are processed in turn according to some strategy; for example, a strategy based on the PageRank algorithm may be used. If a structured information block is extracted, it is taken as an object node; if no structured information block is extracted from the current web page, the page is taken as a pseudo-object node for the time being.
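The fetch, extract and follow-links cycle can be sketched as a simple loop. This sketch substitutes plain first-in-first-out ordering for the link-processing strategy (it does not implement PageRank), and the `fetch` and `extract` callables, the node tuple layout and the in-memory `FAKE_WEB` are all assumptions introduced for illustration.

```python
from collections import deque


def capture(initial_urls, fetch, extract):
    """Crawl from initial_urls and return a list of (url, fields, is_pseudo) nodes.

    fetch(url)    -> (page, links)  hypothetical page fetcher
    extract(page) -> list of dicts  hypothetical structured-information extractor
    """
    start = list(dict.fromkeys(initial_urls))  # drop duplicate start URLs
    queue = deque(start)
    seen = set(start)
    nodes = []
    while queue:
        url = queue.popleft()  # FIFO order stands in for the link strategy
        page, links = fetch(url)
        blocks = extract(page)
        if blocks:
            # One object node per structured information block on the page.
            nodes.extend((url, fields, False) for fields in blocks)
        else:
            nodes.append((url, {}, True))  # pseudo-object node for this page
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return nodes


# Tiny in-memory stand-in for the web: url -> (extractable blocks, outgoing links).
FAKE_WEB = {
    "u1": ([{"name": "tea"}], ["u2", "u3"]),
    "u2": ([], ["u1"]),  # e.g. a review page with nothing to extract
    "u3": ([{"name": "cup"}, {"name": "pot"}], []),
}
nodes = capture(["u1"], fetch=lambda url: FAKE_WEB[url], extract=lambda page: page)
```

The `seen` set is what lets the loop terminate on a finite set of pages even though `u2` links back to `u1`.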
In one embodiment, each time a new object node is extracted, its relationships with the existing object nodes are defined. A relationship is determined from the data or attribute labels of the structured information in each object node, by judging whether the nodes contain identical data or data of the same type, whether the data have citation or inheritance relationships, and so on. For example, in one embodiment, for two object nodes representing food of the same brand, the structured information of both nodes contains the same brand data, so the relationship between the two object nodes is defined as "same brand".
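The "same brand" example can be sketched as a comparison of field values between the new node and each existing node. Only the identical-data case is covered; citation and inheritance relationships are not modelled. The function name `define_relations`, the `(node_id, fields)` node layout and the list of compared keys are hypothetical.

```python
def define_relations(new_node, existing_nodes, keys=("brand", "author", "publisher")):
    """Return (existing_id, new_id, label) edges for fields with equal values.

    Nodes are assumed to be (node_id, fields) pairs; keys lists the hypothetical
    field names that are compared.
    """
    new_id, new_fields = new_node
    edges = []
    for old_id, old_fields in existing_nodes:
        for key in keys:
            # Both nodes must carry the field, and the values must be equal.
            if key in new_fields and key in old_fields and new_fields[key] == old_fields[key]:
                edges.append((old_id, new_id, f"same {key}"))
    return edges


existing = [
    (0, {"name": "biscuit", "brand": "Acme"}),
    (1, {"name": "rice", "brand": "Other"}),
]
edges = define_relations((2, {"name": "cake", "brand": "Acme"}), existing)
```

Comparing the new node against every existing node is linear in the size of the graph; an index from field value to node ids would be one way to avoid the full scan.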
By repeating step 504 above, every web page on the whole Internet can be processed once, at which point an object graph is obtained. The pseudo-object nodes in the object graph can subsequently be removed, and the relationships between the object nodes in the object graph can then be optimized to obtain a more accurate object graph.
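The removal of pseudo-object nodes can be sketched as a filter over the nodes and edges. The data layout below is an assumption, and the subsequent relationship optimization is left out because the description does not say how it is performed.

```python
def remove_pseudo_nodes(nodes, edges):
    """Drop pseudo-object nodes and every edge that touches one.

    nodes: dict node_id -> (fields, is_pseudo)
    edges: list of (node_id_a, node_id_b, label)
    """
    kept = {nid: data for nid, data in nodes.items() if not data[1]}
    kept_edges = [e for e in edges if e[0] in kept and e[1] in kept]
    return kept, kept_edges


nodes = {
    0: ({"name": "tea"}, False),
    1: ({}, True),  # pseudo-object node created for a page with nothing extractable
    2: ({"name": "cup"}, False),
}
edges = [(0, 1, "linked"), (0, 2, "same shop")]
clean_nodes, clean_edges = remove_pseudo_nodes(nodes, edges)
```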
In a specific embodiment, the network data capturing method is implemented on a computer, in combination with suitable programs, as an information capturing module located at the information capturing position of a mobile search engine, which lets users retrieve everyday information such as food, accommodation, travel and products. After entering the keyword "Wuxi coffee shop", the user directly obtains information about coffee shops in Wuxi on the mobile client, free of advertisements or other useless information. This both saves the user's time and makes full use of the phone's small display to show more useful information.
One feature, advantage or benefit of the network data capturing method of the present invention is that it does not capture whole web pages directly; instead it analyzes the data of a web page and captures only the useful part. The amount of data stored is therefore greatly reduced, while subsequent searches are more targeted and search results more accurate. By setting different subjects, data on the Internet can be captured in a targeted way, ensuring both the comprehensiveness and the specificity of the data.
The above description fully discloses specific embodiments of the present invention. It should be pointed out that any change made to the specific embodiments by a person skilled in the art does not depart from the scope of the claims of the present invention. Accordingly, the scope of the claims is not limited to the described embodiments.
Claims (9)
1. A network information capturing method, characterized in that it comprises:
taking an initial URL as the current URL, fetching a web page from the current URL, analyzing the fetched web page and extracting the structured information in it, and storing the structured information as a current object node;
taking a link address in the fetched web page as the current URL, continuing to fetch a web page from the current URL, analyzing the fetched web page and extracting its structured information, storing the structured information as a current object node, defining and storing the relationship between the current object node and existing object nodes, and repeating this operation to complete the capturing of the network information.
2. The network information capturing method according to claim 1, characterized in that there are one or more initial URLs.
3. The network information capturing method according to claim 1, characterized in that analyzing the fetched web page and extracting the structured information in it means extracting the structured information blocks in the fetched web page, or converting the semi-structured and unstructured information blocks in the fetched web page into structured information blocks, each structured information block serving as one object node.
4. The network information capturing method according to claim 1, characterized in that one or more structured information blocks may be extracted from a fetched web page, each structured information block serving as one object node.
5. The network information capturing method according to claim 1, characterized in that defining and storing the relationship between the current object node and existing object nodes means defining that relationship through the logical or semantic relationship between the data in the current object node and the data in the existing object nodes, and storing it.
6. The network information capturing method according to claim 1, characterized in that defining and storing the relationship between the current object node and existing object nodes means that each time a current object node is extracted, its relationships with the existing object nodes are defined and stored.
7. The network information capturing method according to claim 1, characterized in that if no structured information can be extracted from a fetched web page, the fetched web page is taken as a pseudo-object node.
8. The network information capturing method according to claim 7, characterized in that the network information captured by the network information capturing method is an object graph.
9. The network information capturing method according to claim 8, characterized in that it further comprises removing the pseudo-object nodes from the object graph obtained by the network information capturing method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010144137 CN102214179A (en) | 2010-04-12 | 2010-04-12 | Method for capturing network information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010144137 CN102214179A (en) | 2010-04-12 | 2010-04-12 | Method for capturing network information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102214179A (en) | 2011-10-12 |
Family
ID=44745494
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201010144137 Pending CN102214179A (en) | 2010-04-12 | 2010-04-12 | Method for capturing network information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102214179A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103885957A (en) * | 2012-12-20 | 2014-06-25 | 百度在线网络技术(北京)有限公司 | Webpage information extraction method and device |
CN104750812A (en) * | 2015-03-30 | 2015-07-01 | 浪潮集团有限公司 | Automatic data collecting method based on webpage label analysis |
CN105786847A (en) * | 2014-12-22 | 2016-07-20 | 北京奇虎科技有限公司 | Method and system for displaying structured abstracts of commodity web page in e-commerce website |
CN107798101A (en) * | 2017-10-30 | 2018-03-13 | 广州市勤思网络科技有限公司 | The webpage data acquiring method and system of user's free point arrangement |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101164039A (en) * | 2005-03-02 | 2008-04-16 | 谷歌公司 | Generating structured information |
CN101305366A (en) * | 2005-11-29 | 2008-11-12 | 国际商业机器公司 | Method and system for extracting and visualizing graph-structured relations from unstructured text |
CN101630330A (en) * | 2009-08-14 | 2010-01-20 | 苏州锐创通信有限责任公司 | Method for webpage classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20111012 |