CN101866342B - Method and device for generating or displaying webpage label and information sharing system - Google Patents
- Publication number: CN101866342B
- Authority: CN (China)
- Prior art keywords: webpage, web page, annotation, CBF, label
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method and a device for generating or displaying a webpage annotation, and an information sharing system based on webpage annotations. The method for generating webpage annotation information comprises the following steps: in response to a user selecting a target webpage element on a current webpage loaded in a client Web browser as an annotated object, extracting the XPath path of the annotated object in the document object model (DOM) tree of the current webpage; generating a feature code CF of the annotated object based on the content of the annotated object and of the context webpage elements immediately before and after it in the current webpage; and generating the webpage annotation information based on the XPath path and the feature code CF of the annotated object and the annotation input by the user. The webpage annotation information is stored in an annotation database of a remote annotation server, and the feature code CF of the annotated object consists of a content-based feature (CBF) of the annotated object and the CBFs of its context webpage elements.
Description
Technical Field
The present invention relates generally to web page annotation technology, and more particularly, to technology for generating or displaying a web page annotation in consideration of the content of a target web page element as an annotated object on a web page, and technology for realizing information sharing based on such web page annotation.
Background
Annotation is a technique for adding information to a document. The concept originated in paper media, where it includes highlighting keywords, adding side notes, and so on. With the rapid development and increasing popularity of computer and network technologies, network media have become one of the main ways in which people obtain information. Against this background, webpage annotation technology has attracted attention and developed, and it is becoming a popular topic in fields including digital libraries, computer-supported collaborative work, and information sharing and management.
Conventional Web systems provide content or information providers with a convenient information distribution platform, such as a Web page production platform. However, this manner of information communication is essentially unidirectional. The interactions available to a web page reader are limited to clicking on a link, adding a bookmark, and the like. The currently popular Web 2.0 concept emphasizes the participation of, and information sharing among, a large number of Web users, so that the information flow becomes bidirectional or even multidirectional. Currently, common information sharing techniques include:
RSS (Really Simple Syndication): the content to be distributed is aggregated by a server, and the user then selects the content to be acquired. In this way, the user can only passively acquire the content published by the RSS source, and the information flow remains asymmetric;
Interactive Web publishing platforms (e.g., Wiki and Blog): through such a platform, users can publish their own articles and opinions and thereby share information. However, this kind of information sharing must take place on a specific structured web page, and it does not let users share opinions about any web page they view, at any time and in any place.
A webpage annotation system differs from the above two information sharing modes. It provides an annotation device that helps a user annotate a browsed web page, where the annotation device can be a separate software tool containing a browser, a separate software tool independent of the browser, or an extension module integrated in the browser. Annotea, a standard webpage annotation tool provided by the World Wide Web Consortium (W3C), uses RDF (Resource Description Framework) and XPointer to describe annotated web pages. As a W3C project, Annotea provides a standard framework and implementation method for the representation and storage of webpage annotations. In the Annotea system, an RDF database server stores all the webpage annotation information, and a user annotates web pages with a specific software client. On the basis of Annotea, some distinctive webpage annotation systems, such as Annouty, Crit, e-market, YAWAS, and the like, have appeared.
In general, the basic architecture of existing webpage annotation systems is as shown in FIG. 1. The prior art webpage annotation system mainly includes a user command processing unit 110, an annotation query unit 120, a webpage obtaining unit 130, and a webpage annotation synthesizing unit 140. The user command processing unit 110 receives input information from the user (including a web page URL, display options, user information, etc.) and transmits the information to the annotation query unit 120 and the webpage obtaining unit 130. The annotation query unit 120 obtains the annotation information of a web page by querying a remote annotation server via a network such as the Internet according to the web page URL input by the user. The webpage obtaining unit 130 obtains the desired web page through the Internet based on the web page URL provided by the user. The webpage annotation synthesizing unit 140 combines the acquired web page with the related annotation information and provides the result to the user, so that the user can see the related webpage annotations while viewing the desired web page.
Although existing web page annotation systems can implement adding annotations to web pages, there are various problems such as the following:
the case where the annotated object moves to another page cannot be handled. On many websites, a specific element of one page is automatically moved to other pages as the content rolls over, and traditional webpage annotation methods cannot display an annotation on such an element;
when some tolerable change occurs in the format of the annotated object in the web page (e.g., the font of the annotated object changes to italic or bold), the annotation cannot be displayed correctly;
in many cases, the content of the annotated object is modified several times, and conventional webpage annotation systems treat an annotated object whose content has been modified as no longer being the originally annotated content, so its annotation is no longer displayed.
Therefore, there is still a need to provide a method and apparatus for generating a webpage annotation or displaying a webpage annotation in consideration of the content of the annotated object, and a system for more effectively sharing information among users based on the webpage annotation, so as to overcome one or more of the above-mentioned defects in the prior art.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
In order to solve the above problems of the prior art, it is an object of the present invention to provide a method and apparatus for generating or displaying a webpage annotation in consideration of the content of an annotated object on a webpage, wherein webpage annotation information can be associated with the annotated object and the content of contextual webpage elements immediately before and after the annotated object on the webpage, so that the change of the annotated object can be dynamically tracked.
Another object of the present invention is to provide a web page annotation method and apparatus, by which a web page desired to be loaded and displayed by a user and existing annotations previously annotated on the web page stored on a remote annotation server can be displayed on a client browser, and new annotations are added and displayed on the web page.
Still another object of the present invention is to provide an information sharing system for implementing information sharing based on webpage annotation by using the above webpage annotation method and apparatus.
In order to achieve the above objects, according to one aspect of the present invention, there is provided a method for generating webpage annotation information, the method comprising: in response to a user selecting a target webpage element on a current webpage loaded in a client Web browser as an annotated object, extracting the XPath path of the annotated object in the Document Object Model (DOM) tree of the current webpage; generating a feature code CF of the annotated object based on the content of the annotated object and of the context webpage elements immediately before and after it in the current webpage; and generating webpage annotation information based on the XPath path of the annotated object, the feature code CF, and the annotation input by the user, wherein the webpage annotation information is stored in an annotation database of a remote annotation server, the feature code CF of the annotated object is composed of a content-based feature (CBF) of the annotated object and the CBFs of its context webpage elements, and the CBF of a webpage element is composed of a letter projection vector and a letter order vector of the webpage element, wherein the letter projection vector is composed of the counts of all letters in the webpage element over the alphabet Λ = {a, b, c, ..., z}, and the letter order vector is composed of the inversion counts of all letters in the webpage element over the alphabet Λ.
According to another aspect of the present invention, there is also provided an apparatus for generating webpage annotation information, the apparatus comprising: an XPath generator for extracting, in response to a user selecting a target webpage element on a current webpage loaded in a client Web browser as an annotated object, the XPath path of the annotated object in the Document Object Model (DOM) tree of the current webpage; a feature code (CF) generator for generating a feature code CF of the annotated object based on the content of the annotated object and of the context webpage elements immediately before and after it in the current webpage; and an annotation generator for generating webpage annotation information based on the XPath path of the annotated object, the feature code CF of the annotated object, and the annotation input by the user, wherein the feature code CF of the annotated object is composed of a content-based feature (CBF) of the annotated object and the CBFs of its context webpage elements, the webpage annotation information is stored in an annotation database of a remote annotation server, and the CBF of a webpage element is composed of a letter projection vector and a letter order vector of the webpage element, wherein the letter projection vector is composed of the counts of all letters in the webpage element over the alphabet Λ = {a, b, c, ..., z}, and the letter order vector is composed of the inversion counts of all letters in the webpage element over Λ.
According to another aspect of the present invention, there is also provided a method for displaying a Web page and annotations on the Web page in a client Web browser, the method comprising: a) in response to a user inputting the Uniform Resource Locator (URL) of a web page to be loaded and displayed in the browser, analyzing the input URL to obtain a valid URL; b) querying a remote annotation server for all annotations related to the valid URL, so as to obtain an annotation candidate set and the webpage annotation information of each annotation; c) for each annotation in the annotation candidate set, determining from its webpage annotation information whether the annotation annotates a webpage element in the web page to be loaded, that is, whether the annotation should exist in the web page to be loaded, and if so, further determining the position in the web page to be loaded of the webpage element it annotates, that is, the annotation position; and d) synthesizing the annotations determined to be present in the loaded web page with that web page, according to their webpage annotation information and annotation positions, and displaying the synthesized web page to the user through the browser, wherein the webpage annotation information of an annotation comprises the XPath path of the annotated object corresponding to the annotation, the feature code CF of the annotated object, the content and format of the annotation, the URL of the web page where the annotation is located, and the content feature code of that web page, the feature code CF of the annotated object is composed of a content-based feature (CBF) of the annotated object and the CBFs of the context webpage elements immediately before and after the annotated object, and the CBF of a webpage element is composed of a letter projection vector and a letter order vector of the webpage element, wherein the letter projection vector is composed of the counts of all letters in the webpage element over the alphabet Λ = {a, b, c, ..., z}, and the letter order vector is composed of the inversion counts of all letters in the webpage element over Λ.
According to another aspect of the present invention, there is also provided an apparatus for displaying a Web page and annotations on the Web page via a client Web browser, the apparatus comprising: a URL analyzer for analyzing, in response to a user inputting the Uniform Resource Locator (URL) of a web page to be loaded and displayed in the browser, the input URL to obtain a valid URL; an annotation querier for querying a remote annotation server for all annotations related to the valid URL, so as to obtain an annotation candidate set and the webpage annotation information of each annotation; an annotation position determining unit for determining, for each annotation in the annotation candidate set and according to its webpage annotation information, whether the annotation annotates a webpage element in the web page to be loaded, that is, whether the annotation should exist in the web page to be loaded, and if so, further determining the position in the web page to be loaded of the webpage element it annotates, that is, the annotation position; and a synthesizing unit for synthesizing the annotations with the loaded web page according to their webpage annotation information and annotation positions, wherein the synthesized web page is displayed to the user via the browser, the webpage annotation information of an annotation comprises the XPath path of the annotated object corresponding to the annotation, the feature code CF of the annotated object, the content and format of the annotation, the URL of the web page where the annotation is located, and the content feature code of that web page, the feature code CF of the annotated object is composed of a content-based feature (CBF) of the annotated object and the CBFs of the context webpage elements immediately before and after the annotated object, and the CBF of a webpage element is composed of a letter projection vector and a letter order vector of the webpage element, wherein the letter projection vector is composed of the counts of all letters in the webpage element over the alphabet Λ = {a, b, c, ..., z}, and the letter order vector is composed of the inversion counts of all letters in the webpage element over Λ.
In addition, according to another aspect of the present invention, there is also provided a webpage labeling method, including: displaying the Web page on the client Web browser and existing annotations previously annotated on the Web page stored on the remote annotation server by performing the above-described method for displaying the Web page and annotations on the Web page on the client Web browser in response to a user-entered URL of the Web page to be loaded and displayed on the client Web browser; adding a new label to the webpage by executing the method for generating webpage label information, wherein the newly labeled webpage label information is stored on a remote label server; and displaying the added new annotation on the webpage via the browser.
According to another aspect of the present invention, there is also provided a web page labeling apparatus, including: the device for generating webpage labeling information; and the device for displaying the webpage and the label on the webpage through the client Web browser.
According to another aspect of the present invention, there is also provided an information sharing system based on webpage annotation, including: the system comprises a client and a remote labeling server, wherein the client comprises the webpage labeling device, and the remote labeling server comprises a labeling database for storing webpage labeling information and a labeling information accessor for performing access control on the labeling database.
According to other aspects of the invention, corresponding computer-readable storage media and computer program products are also provided.
An advantage of the method, device, and system is that both the XPath path of the annotated object and the content of the annotated object and its context webpage elements are taken into account when the webpage annotation information is generated, so that the annotation can dynamically track the annotated object and the related annotation information can move along with it. Moreover, even if the format of the annotated object changes, the annotation can still be displayed correctly. Even when the content of the annotated object changes, the extent of the change can be evaluated to determine whether the corresponding annotation should still be displayed.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
The invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like reference numerals are used throughout the figures to indicate like or similar parts. The accompanying drawings, which are incorporated in and form a part of this specification, illustrate preferred embodiments of the present invention and, together with the detailed description, serve to further explain the principles and advantages of the invention. In the drawings:
FIG. 1 is a schematic diagram showing the general architecture of a web page annotation system in the prior art;
FIG. 2 is a diagram illustrating the structure of a system for information sharing using webpage annotation according to an embodiment of the present invention;
FIG. 3 is an exemplary flow chart illustrating a process performed when adding a new annotation to a web page using the system shown in FIG. 2 in accordance with embodiments of the present invention;
FIG. 4 is a diagram illustrating in detail an exemplary structure and process of the CBF generator shown in FIG. 2;
FIG. 5 is a block diagram illustrating in detail an exemplary structure of the annotation analyzer shown in FIG. 2;
FIG. 6 is a flowchart illustrating a process, in an embodiment according to the present invention, in which a user enters the URL (Uniform Resource Locator) of a web page to be loaded into a client browser and the system shown in FIG. 2 displays the web page and the existing annotations on it;
FIG. 7 is a flowchart illustrating a process, in an embodiment according to the present invention, of obtaining candidate URLs based on the URL input by the user and determining whether the web page corresponding to each candidate URL is the same as or similar to the web page currently loaded by the browser, so as to obtain a valid URL (i.e., the specific processing of step S610 shown in FIG. 6);
FIG. 8 is a flowchart showing a process, in an embodiment according to the present invention, of determining whether each candidate annotation exists in the currently loaded web page and, if so, its annotation position (i.e., the specific processing of step S630 in FIG. 6); and
FIG. 9 is a diagram showing the structure of the feature code CF of an annotation used in the process shown in FIG. 8 (part (a) of FIG. 9) and the corresponding DOM tree of the current web page (part (b) of FIG. 9).
Skilled artisans appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve the understanding of the embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
Here, it should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structure and/or the processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so related to the present invention are omitted.
Fig. 2 is a diagram illustrating a structure of a system for implementing information sharing using webpage annotation according to an embodiment of the present invention. The system can be divided into two parts, a client side and a server side (i.e. an annotation server) which are connected through a network (not shown).
As shown in FIG. 2, the system mainly comprises, at the client side, a webpage annotation device 200 including a user interface 210, an XPath generator 220, a content-based feature (CBF) generator 230, an annotation generator 240, an annotation analyzer 250, and an XML converter 260, and, at the server side, an annotation information accessor 270 and an annotation database 280.
In a specific implementation example of the system shown in FIG. 2, the webpage annotation device 200 of the client can be implemented in the form of a browser plug-in; the annotation server can be implemented by a Java server. Specifically, the server-side annotation information accessor 270 can be implemented by a Java server, and the annotation database 280 can be implemented by an existing database management system. However, it will be understood by those skilled in the art that the principles of the present invention are not limited thereto and may be embodied in other forms as desired.
At the client, the user can utilize the web page annotation device 200 to add and display new web page annotations on the web page loaded by the browser, as well as to correctly display existing web page annotations that have been previously added on the web page. In the web page annotation device 200, the user interface 210 is responsible for receiving input for the entire device, which may receive any one or more of the following input information: (1) input information relating to configuration parameters of the system; (2) input information relating to a tagged object selected by a user on a web page; (3) input information relating to the annotated content; (4) input information relating to the display mode of the annotation; and so on.
The XPath generator 220 is used to extract the XPath path of an annotated object in the DOM (Document Object Model) tree of a web page. XPath is the W3C-recommended way of addressing elements in a document: each element in the web page corresponds to an XPath path, and any element in the web page can be located through its XPath path. Each node in the DOM tree of the web page corresponds to an element contained in the web page. That is, both the annotated object and the webpage elements immediately before and after it in the web page can be represented as nodes of the DOM tree. For ease of explanation, the webpage elements immediately before and after the annotated object are referred to as the preceding and following context elements. They correspond to the immediately adjacent sibling nodes of the node of the annotated object in the DOM tree, and thus may also be referred to as context nodes or context webpage elements.
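As an illustration only, the following Python sketch shows one way a positional XPath path and the adjacent sibling (context) nodes could be derived from a parsed tree. It uses the standard library's ElementTree on a small well-formed page rather than a browser DOM, and the function names `xpath_of` and `context_nodes` are hypothetical; the patent does not prescribe this procedure.

```python
import xml.etree.ElementTree as ET


def _parent_map(root):
    # ElementTree nodes carry no parent pointer, so build a child -> parent map.
    return {child: node for node in root.iter() for child in node}


def xpath_of(root, target):
    """Return a positional path such as /html/body[1]/p[2] for `target`."""
    parent = _parent_map(root)
    steps = []
    node = target
    while node is not root:
        par = parent[node]
        # 1-based position among siblings that share the same tag name.
        same_tag = [c for c in par if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = par
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


def context_nodes(root, target):
    """Return the (previous, next) sibling elements of `target`; None if absent."""
    if target is root:
        return None, None
    siblings = list(_parent_map(root)[target])
    i = siblings.index(target)
    left = siblings[i - 1] if i > 0 else None
    right = siblings[i + 1] if i + 1 < len(siblings) else None
    return left, right


page = ET.fromstring(
    "<html><body><h1>News</h1><p>first</p><p>second</p><p>third</p></body></html>"
)
target = page.findall("./body/p")[1]  # the annotated object: <p>second</p>
path = xpath_of(page, target)
left, right = context_nodes(page, target)
print(path, left.text, right.text)
```

In a real browser plug-in the same walk would be done over the live DOM; the sketch only shows the relationship between the path and the sibling context nodes.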
The CBF (content-based feature) generator 230 generates a CBF of the annotated object according to the content of the annotated object. The CBF of the annotated object is composed of the letter projection vector (CPF) and the letter order vector (CSF) of the annotated object, namely: CBF = CPF + CSF.
The letter projection vector (CPF) is composed of the counts of all the letters in the annotated object over the alphabet Λ = {a, b, c, ..., z}, and the length of the vector is the length of the alphabet Λ. For example, assuming that the annotated object is a piece of English text on a web page, the numbers num(a), num(b), ..., num(z) of occurrences of each letter a, b, ..., z in the text can be counted, giving the following letter projection vector: CPF = [num(a), num(b), ..., num(z)]. Changes in the CPF reflect, to a certain extent, operations such as deletion, insertion, and replacement applied to the content of the annotated object.
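The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `cpf` is hypothetical, and folding upper case to lower case and ignoring non-letter characters are assumptions that the text does not state.

```python
import string

ALPHABET = string.ascii_lowercase  # the alphabet {a, b, ..., z}


def cpf(text):
    """Letter projection vector: one occurrence count per letter of ALPHABET."""
    counts = [0] * len(ALPHABET)
    for ch in text.lower():          # assumption: case is folded
        idx = ALPHABET.find(ch)
        if idx >= 0:                 # assumption: non-letters are ignored
            counts[idx] += 1
    return counts


vector = cpf("Banana")
print(vector[0], vector[1], vector[13])  # counts of a, b, n
```

Because only counts are recorded, two strings that are rearrangements of each other, such as "bad" and "dab", produce the same vector.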
The letter order vector (CSF) is composed of the inversion counts of all the letters in the annotated object over the alphabet Λ, and the length of the vector is the length of the alphabet. Assume that the alphabet Λ is ordered as a < b < c < ... < z. Then the inversion count of the annotated object x on letter a is the number of letters larger than a (i.e., b, c, ..., z) that precede a in x, the inversion count on letter b is the number of letters larger than b (i.e., c, d, ..., z) that precede b in x, and so on, so that the inversion counts of all letters in the annotated object x over the entire alphabet can be obtained. Changes in the CSF reflect, to some extent, reordering of the content of the annotated object. For example, "bad" and "dab" have the same CPF but different CSFs, which reflects the difference in their letter order.
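A hedged Python sketch of the inversion counting follows. It assumes that, for each occurrence of a letter, every larger letter appearing anywhere earlier in the text is counted; this reading is chosen because it makes "bad" and "dab" differ, as the text states. The name `csf`, the case folding, and the skipping of non-letters are illustrative assumptions.

```python
import string

ALPHABET = string.ascii_lowercase  # ordered a < b < ... < z


def csf(text):
    """Letter order vector: per-letter count of larger letters seen earlier."""
    inversions = [0] * len(ALPHABET)
    seen = [0] * len(ALPHABET)       # occurrences of each letter so far
    for ch in text.lower():          # assumption: case is folded
        idx = ALPHABET.find(ch)
        if idx < 0:                  # assumption: non-letters are ignored
            continue
        # Every already-seen letter with a higher index forms an inversion.
        inversions[idx] += sum(seen[idx + 1:])
        seen[idx] += 1
    return inversions


print(csf("bad")[:4])
print(csf("dab")[:4])
```

For "bad" only the letter a has a larger letter (b) before it, whereas in "dab" both a and b are preceded by d, so the two vectors differ even though the letter counts are identical.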
In order to effectively track whether the context of the annotated object changes, the CBF generator 230 generates the CBFs of the context nodes of the annotated object in addition to the CBF of the annotated object itself. The context nodes of the annotated object can be determined from the XPath path of the annotated object generated by the XPath generator 220. The CBF of the annotated object (represented by DOM tree node x) and the CBFs of its context nodes (nodes x_left and x_right, respectively) constitute the feature code CF of the annotated object, i.e., CF(x) = CBF(x_left) + CBF(x) + CBF(x_right).
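The composition of the feature code can be sketched as follows. The sketch reads the "+" in CBF = CPF + CSF and in CF(x) as vector concatenation, and represents a missing context node by an all-zero CBF; both are assumptions made for illustration, as are the names `cbf` and `feature_code`.

```python
import string

ALPHABET = string.ascii_lowercase


def cbf(text):
    """Content-based feature: letter counts followed by inversion counts."""
    counts = [0] * len(ALPHABET)
    inversions = [0] * len(ALPHABET)
    for ch in text.lower():
        idx = ALPHABET.find(ch)
        if idx < 0:
            continue
        # Larger letters already seen form inversions with this occurrence.
        inversions[idx] += sum(counts[idx + 1:])
        counts[idx] += 1
    return counts + inversions       # assumption: "+" means concatenation


def feature_code(left_text, text, right_text):
    """CF(x) = CBF(x_left) + CBF(x) + CBF(x_right), as one flat vector."""
    # Assumption: an absent sibling contributes an all-zero CBF.
    return cbf(left_text or "") + cbf(text) + cbf(right_text or "")


cf = feature_code("first", "second", "third")
print(len(cf))
```

Since the feature depends only on letter content, wrapping the same text in italic or bold markup would leave the vector unchanged, which is consistent with the format tolerance described in this document.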
The specific structure of the CBF generator 230 and the processing procedure thereof, and how to add a new label to a web page by using the web page labeling apparatus will be described below with reference to fig. 3 and 4.
The annotation generator 240 generates webpage annotation information according to information about the annotated object (e.g., the feature code of the annotated object) and the content and format of the input annotation, and the XML converter 260 converts the generated webpage annotation information into an XML message format suitable for communication with the server side over the network, so that the webpage annotation information can be transmitted to the server side and stored in the annotation database 280 via the annotation information accessor 270. The webpage annotation information includes the URL of the annotation (i.e., the URL of the web page where the annotation is located), the position of the annotation on the web page (i.e., the XPath path of the corresponding annotated object), relevant features of the corresponding annotated object (e.g., its feature code CF), the content feature code of the web page where the annotation is located, and the content and format of the annotation. Here, the content feature code of a web page is a code that identifies the content of the web page: if the content feature codes of two web pages are the same, the content of the two web pages is the same. The content feature code of a web page can be obtained with a conventional encoding method, such as a hash code (e.g., MD5).
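To make the listed fields concrete, the sketch below packs them into a record, computes an MD5 digest as the page content feature code, and serializes the record to XML. The class name, field names, and XML element names are hypothetical; the patent does not define a message schema.

```python
import hashlib
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class AnnotationInfo:
    url: str             # URL of the page carrying the annotation
    xpath: str           # XPath path of the annotated object
    feature_code: list   # feature code CF of the annotated object
    page_code: str       # content feature code of the page
    content: str         # annotation text entered by the user
    fmt: str             # display format of the annotation


def page_content_code(page_text):
    """Content feature code of a page, here an MD5 hex digest."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()


def to_xml(info):
    """Serialize the record to an XML string (element names are made up)."""
    root = ET.Element("annotation")
    ET.SubElement(root, "url").text = info.url
    ET.SubElement(root, "xpath").text = info.xpath
    ET.SubElement(root, "featureCode").text = ",".join(map(str, info.feature_code))
    ET.SubElement(root, "pageContentCode").text = info.page_code
    ET.SubElement(root, "content").text = info.content
    ET.SubElement(root, "format").text = info.fmt
    return ET.tostring(root, encoding="unicode")


info = AnnotationInfo(
    url="http://example.com/news",
    xpath="/html/body[1]/p[2]",
    feature_code=[1, 0, 2],
    page_code=page_content_code("<html>...</html>"),
    content="Worth reading",
    fmt="highlight",
)
message = to_xml(info)
print(message)
```

Identical page text always yields the same digest, which is the property the content feature code relies on.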
The annotation analyzer 250 determines, based on the URL of the current web page, the URLs stored in the annotation database 280 that belong to the same website as the current web page and correspond to web pages that are the same as or similar to the current web page, and takes these URLs as valid URLs. It then queries the annotation database for all annotations related to the valid URLs and matches the annotations obtained by the query against the current web page, in order to determine which annotations annotate elements currently loaded in the web page (i.e., which annotations should exist in the current web page) and at which positions in the current web page those annotations should be displayed. The annotation analyzer 250 can support situations in which the content of an annotated object is transferred from one page to another. The specific structure and processing procedure of the annotation analyzer 250 will be described below with reference to figs. 5 to 9.
The XML converter 260 is used to convert the information that needs to be communicated between the client and the server into an XML message format, so that the web page annotation device 200 of the client can communicate with the server. However, it should be understood by those skilled in the art that XML-formatted messages are used here to facilitate communication between the client and a server implemented by Java Servlet. The principles of the present invention are not limited to converting messages into XML format; other message formats may be used for communication between the client and the server, depending on how the server part shown in fig. 2 is implemented.
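As an illustration of the conversion performed by the XML converter 260, the following is a minimal sketch that serializes the fields of the webpage annotation information into an XML message. The element names ("annotation", "url", "xpath", and so on) and the flat message layout are assumptions made for this example only; the description does not specify the actual message schema.

```python
import xml.etree.ElementTree as ET


def annotation_to_xml(info: dict[str, str]) -> str:
    """Serialize annotation fields into a flat XML message.

    The root element name and the one-element-per-field layout are
    hypothetical; the source does not define the message schema.
    """
    root = ET.Element("annotation")
    for name, value in info.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_annotation(message: str) -> dict[str, str]:
    """Inverse of annotation_to_xml, as the server side might apply it."""
    return {child.tag: child.text or "" for child in ET.fromstring(message)}
```

A round trip through both functions returns the original fields, which is the property the client and the server would rely on.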
As shown in FIG. 2, at the server side, the annotation information accessor 270 accesses the annotation database 280 in response to a request from the client, and the annotation database 280 stores therein the webpage annotation information related to each annotation collected by the information sharing system, which as mentioned above may include the URL of the annotation (i.e., the URL of the webpage where the annotation is located), the location of the annotation on the webpage, the feature code of the corresponding annotated object, the content and format of the annotation, and so on.
The following description is made with reference to fig. 3 and 4. FIG. 3 is an exemplary flow diagram illustrating a process 300 performed when a new annotation is added to a web page using the system shown in FIG. 2 according to an embodiment of the present invention, and FIG. 4 is a schematic diagram illustrating in detail an exemplary structure and process of the CBF generator shown in FIG. 2.
As shown in fig. 3, in step S310, the XPath path of the annotated object in the DOM tree of the current web page is extracted according to the annotated object selected by the user on the current web page. Then, in step S320, based on the content of the annotated object and of its context nodes (which can be determined from the XPath path generated in step S310), the CBFs of the annotated object and of its context nodes are generated as described above, so as to obtain the feature code CF of the annotated object. Next, in step S330, webpage annotation information is generated based on the information about the annotated object, the input annotation content, and the like. In step S340, the webpage annotation information generated in step S330 is converted into a message in XML format suitable for communication with the server side, and then in step S350, the webpage annotation information generated by the client side is stored in the annotation database 280 at the server side via the annotation information accessor 270.
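Step S310 can be illustrated by the following minimal sketch, which derives a positional XPath path for a selected node. It assumes the page is available as well-formed XML parsed with Python's standard ElementTree module, and the function name xpath_of is hypothetical; a real browser plug-in would walk the browser's own DOM instead.

```python
import xml.etree.ElementTree as ET


def xpath_of(root: ET.Element, target: ET.Element) -> str:
    """Build a path such as /html/body[1]/p[2] for target inside root."""
    # ElementTree nodes do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # XPath positions are 1-based and count siblings with the same tag.
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))
```

For a page whose body holds two paragraphs, selecting the second paragraph yields the path /html/body[1]/p[2].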
The CBF generator 230 shown in fig. 2 is shown in detail in fig. 4. As shown in fig. 4, the CBF generator 230 may include an HTML (hypertext markup language) cleaning unit 410, an HTML alphabetizing unit 420, a letter projection vector (CPF) generation unit 430, and a letter order vector (CSF) generation unit 440. The following description takes as an example the case in which the CBF generator 230 generates the CBF of an annotated object.
The HTML cleaning unit 410 is used to remove, from the annotated object selected by the user, HTML tags that have no effect on the content (e.g., format tags such as <b></b> and <u></u>) according to pre-stored HTML cleaning rules (which, as shown in fig. 4, may be pre-stored in the HTML dictionary 450), so as to reduce HTML noise and reduce the influence of web page format changes on the annotated object.
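The cleaning step can be sketched as follows. Only the <b> and <u> tags are named in the description; the remaining tags in FORMAT_TAGS, the regular-expression approach, and the function name clean_html are assumptions standing in for the rules held in the HTML dictionary 450.

```python
import re

# Hypothetical cleaning rule: formatting tags assumed to carry no content.
FORMAT_TAGS = ("b", "u", "i", "em", "strong", "font")

# Matches opening and closing forms of the tags above, with any attributes.
# The \b keeps <b> from also matching <br> or <body>.
_FORMAT_TAG_RE = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(FORMAT_TAGS), re.IGNORECASE
)


def clean_html(fragment: str) -> str:
    """Remove formatting tags while keeping their text content."""
    return _FORMAT_TAG_RE.sub("", fragment)
```

With this rule, a fragment that is later set in bold or underlined cleans to the same string as before, which is what makes the feature code insensitive to such format changes.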
The HTML alphabetizing unit 420 is used to alphabetize the annotated object after HTML cleaning, so that the annotated object is converted, based on its content, into a letter string composed of the letters a to z. For an annotated object that contains Chinese text, the HTML alphabetizing unit 420 first converts the Chinese text in the annotated object into Chinese pinyin with reference to the Chinese dictionary 460 (this step may be omitted when the annotated object contains no Chinese text) and then obtains the letter string. For a polyphonic character, the HTML alphabetizing unit may take the first Chinese pinyin of the character, but it is understood that the principles of the present invention are not so limited.
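A minimal sketch of the alphabetizing step is given below. The three-entry PINYIN table is a hypothetical stand-in for the Chinese dictionary 460, and dropping any remaining tags and all non-letter characters is an assumption about what "based on the content" means here.

```python
import re

# Tiny hypothetical stand-in for the Chinese dictionary 460. For a
# polyphonic character only the first listed pinyin is used.
PINYIN = {"中": ["zhong"], "文": ["wen"], "行": ["xing", "hang"]}


def alphabetize(cleaned_html: str) -> str:
    """Convert cleaned HTML into a string over the letters a to z."""
    text = re.sub(r"<[^>]+>", "", cleaned_html)  # drop remaining tags (assumption)
    letters = []
    for ch in text:
        if ch in PINYIN:
            letters.append(PINYIN[ch][0])
        elif ch.isascii() and ch.isalpha():
            letters.append(ch.lower())
        # digits, punctuation, whitespace and unknown characters are skipped
    return "".join(letters)
```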
The letter projection vector (CPF) generation unit 430 and the letter order vector (CSF) generation unit 440 generate the letter projection vector and the letter order vector of the annotated object, respectively, based on the letter string obtained through HTML alphabetizing and according to the definitions of the letter projection vector (CPF) and the letter order vector (CSF) given above. Then, by concatenating the letter projection vector (CPF) and the letter order vector (CSF), the content-based feature CBF of the annotated object is obtained.
Referring back to fig. 2, when the user inputs the URL of a web page in the client browser in order to browse the web page and the annotation information on it, the client browser loads the desired web page and transmits the URL of the web page and its DOM tree structure to the annotation analyzer 250.
FIG. 5 illustrates an exemplary structure of the annotation analyzer 250 according to an embodiment of the invention. As shown in FIG. 5, the annotation analyzer 250 includes a URL analyzer 510, an annotation querier 520, and a web page annotation synthesizer 530.
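The construction of the CBF and of the feature code CF can be sketched as follows. The CPF is taken to be the per-letter occurrence count. The exact CSF definition is given in an earlier part of the description, so the inversion count used here (for each letter c, the number of later, alphabetically smaller letters) is an assumption, as is reading the "+" in the CF formula as vector concatenation. The function names are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase  # the alphabet Λ = {a, b, ..., z}


def letter_projection_vector(letters: str) -> list[int]:
    """CPF: number of occurrences of each letter of the alphabet."""
    return [letters.count(c) for c in ALPHABET]


def letter_order_vector(letters: str) -> list[int]:
    """CSF under an ASSUMED definition: for each letter c, count the pairs
    (i, j) with i < j, letters[i] == c and letters[j] < c."""
    vector = [0] * len(ALPHABET)
    seen_to_the_right = [0] * len(ALPHABET)
    for ch in reversed(letters):
        idx = ALPHABET.index(ch)  # raises ValueError for input outside a-z
        vector[idx] += sum(seen_to_the_right[:idx])
        seen_to_the_right[idx] += 1
    return vector


def cbf(letters: str) -> list[int]:
    """Content-based feature: CPF concatenated with CSF (52 entries)."""
    return letter_projection_vector(letters) + letter_order_vector(letters)


def feature_code(left: str, obj: str, right: str) -> list[int]:
    """CF(x) = CBF(x_left) + CBF(x) + CBF(x_right), read as concatenation."""
    return cbf(left) + cbf(obj) + cbf(right)
```

Whatever the content length, cbf returns 52 integers and feature_code returns 156, which matches the uniform-length property the description attributes to the CBF.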
The URL analyzer 510 analyzes the URL input by the user, extracts all URLs in the same website with the currently loaded web page (i.e., the web page corresponding to the currently input URL, or simply the current web page) from the annotation database 280 (via the XML converter 260 and the annotation information accessor 270), forms an alternative URL set, performs the same page determination and the similar page determination on the web pages corresponding to all URLs in the alternative URL set (hereinafter, referred to as alternative URLs) and the current web page, and determines the alternative URLs corresponding to the web pages the same as or similar to the current web page as valid URLs.
The annotation querier 520 queries (via the XML converter 260 and the annotation information accessor 270) all annotations associated with the valid URL (i.e., all annotations on the web page corresponding to the valid URL) in the annotation database 280 according to the valid URL determined by the URL analyzer 510, that is, queries all annotations possibly associated with the current web page in the annotation database 280, thereby obtaining an annotation candidate set, and obtains all webpage annotation information of the possible annotations from the annotation database 280.
The web page annotation synthesizer 530 matches the current web page against all possible annotations to determine which annotations most likely annotate which elements or objects currently loaded in the web page, i.e., it determines whether each of the possible annotations exists in the current web page and, if so, where, and synthesizes the annotations with the web page for display to the user via the browser. As shown in fig. 5, the web page annotation synthesizer 530 may further include an annotation position determination unit 532 and a synthesis unit 534.
For each possible annotation in the annotation candidate set, the annotation position determination unit 532 determines, according to the webpage annotation information of the annotation (e.g., information such as an XPath path and a feature code CF of the annotated object corresponding to the annotation), whether the possible annotation identifies a webpage element in the current webpage (i.e., determines whether the possible annotation exists in the current webpage), and further determines the position (i.e., identifies the position) of the webpage element identified by the possible annotation in the current webpage if the possible annotation is determined to exist.
The synthesis unit 534 synthesizes the annotations with the current web page according to the webpage annotation information of the possible annotations determined to be present in the current web page and the determined annotation positions of those annotations in the current web page, and displays the synthesized web page to the user via the browser.
FIG. 6 is a flow diagram illustrating a process 600 for a user entering a URL of a web page to be loaded in a client browser using the information sharing system described above to display the web page and existing annotations therein, according to an embodiment of the invention.
As shown in fig. 6, in step S610, the URL input by the user is analyzed to obtain the alternative URL set, and the web pages corresponding to all the alternative URLs are compared with the web page to be loaded (i.e., the current web page) in a same-page determination and a similar-page determination, so as to determine the valid URLs. The specific processing procedure of step S610 will be described below with reference to fig. 7.
In step S620, according to the determined valid URL, all annotations that may be related to the current web page are queried in the annotation database, so as to obtain an annotation candidate set. Then, in step S630, it is determined which of all possible annotations exist in the current webpage, and the annotation location of the existing annotations in the current webpage is determined. A specific processing procedure related to step S630 will be described below with reference to fig. 8 and 9.
Then, in step S640, the annotations are synthesized with the current web page based on the webpage annotation information of the annotations determined to be present in step S630 and the determined annotation positions of the annotations, and the synthesized web page is displayed to the user via the browser in step S650. Here, the synthesis can be performed by dynamically modifying the DOM code of the current web page: the annotation is first converted into HTML format, and the converted HTML fragment is then inserted into the web page code and displayed in the browser.
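The synthesis of step S640 can be sketched as follows, assuming the page is available as well-formed XML and the annotation position is expressed in the restricted XPath subset understood by Python's ElementTree. Rendering the annotation as a span element with the class name "annotation" is a choice made for this example only.

```python
import xml.etree.ElementTree as ET


def composite_annotation(root: ET.Element, path: str, note: str) -> bool:
    """Insert note as a <span> inside the element located by path.

    Returns False when the path no longer matches any element, in which
    case the page is left unchanged.
    """
    target = root.find(path)
    if target is None:
        return False
    span = ET.SubElement(target, "span", {"class": "annotation"})
    span.text = note
    return True
```

After a successful call, serializing the tree again yields the page code with the annotation fragment embedded at the annotated element.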
Fig. 7 is an exemplary flowchart illustrating a process of obtaining an alternative URL based on a URL input by a user and making the same and similar page determination on a corresponding web page and a web page currently loaded by a browser (i.e., a current web page) in one embodiment according to the present invention (i.e., a specific processing procedure of step S610 illustrated in fig. 6).
As shown in fig. 7, in step S710, as described above, the set of all alternative URLs in the same website as the URL input by the user, that is, the alternative URL set, is obtained based on the input URL. Then, in step S720, it is determined whether the web page corresponding to a given alternative URL is the same as the current web page. Here, if the content feature code of the web page corresponding to the alternative URL is the same as the content feature code of the current web page, the two web pages may be determined to be the same page; otherwise, the two web pages are different. As described above, the content feature code of a web page can be obtained by using an existing encoding method, such as MD5. This determination mainly addresses web pages whose URL differs but whose content has not changed.
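The same-page determination of step S720 can be sketched with the MD5 implementation in Python's standard hashlib module. Hashing the UTF-8 encoded page source as a whole is an assumption; the description does not state which part of the page is fed into the hash.

```python
import hashlib


def content_feature_code(page_content: str) -> str:
    """Content feature code of a page: MD5 digest of its content."""
    return hashlib.md5(page_content.encode("utf-8")).hexdigest()


def is_same_page(candidate_content: str, current_content: str) -> bool:
    """Two pages count as the same page when their digests are equal."""
    return content_feature_code(candidate_content) == content_feature_code(current_content)
```

In practice the digest of the annotated page would be read from the stored webpage annotation information rather than recomputed, so only the current page needs to be hashed.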
If it is determined in step S720 that the two web pages are not the same, it is determined in step S730 whether the two web pages are similar pages. Here, the two web pages may be determined to be similar when both of the following conditions are satisfied; otherwise, the two web pages are not similar:
(1) the titles of the web pages are the same, and
(2) one of the following holds:
(a) parameters are passed between the two web pages, the numeric parameters in the URL are missing, and the rest of the URL is the same;
(b) parameters are passed between the two web pages, the numeric parameters in the URLs differ, the numeric parameters of the web page corresponding to the alternative URL are smaller than those of the web page corresponding to the current URL, and the rest of the URL is the same; or
(c) no parameters are passed between the two web pages, the last address part of the URLs differs, and the rest of the URL is the same.
It is obvious here that the principle of the present invention is not limited to the above-mentioned similar page determination condition, and those skilled in the art can set other different similar page determination conditions as required.
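One possible reading of the similar-page conditions is sketched below. It interprets "parameter passing" as the presence of a URL query string, "numeric parameters" as query parameters with purely decimal values, and "missing" as numeric parameters being present in only one of the two URLs. These interpretations and the function names are assumptions; the conditions leave room for other readings.

```python
from urllib.parse import parse_qsl, urlsplit


def _split(url: str):
    """Return (scheme, host, path), numeric params, non-numeric params."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    numeric = {k: int(v) for k, v in params.items() if v.isdecimal()}
    other = {k: v for k, v in params.items() if not v.isdecimal()}
    return (parts.scheme, parts.netloc, parts.path), numeric, other


def is_similar_page(alt_url: str, alt_title: str, cur_url: str, cur_title: str) -> bool:
    if alt_title != cur_title:  # the titles must be the same
        return False
    alt_base, alt_num, alt_other = _split(alt_url)
    cur_base, cur_num, cur_other = _split(cur_url)
    if alt_num or alt_other or cur_num or cur_other:
        # Parameters are passed: everything except numeric parameters must match.
        if alt_base != cur_base or alt_other != cur_other:
            return False
        if bool(alt_num) != bool(cur_num):
            return True  # numeric parameters missing on one side
        # Numeric parameters differ and the alternative URL's are not larger.
        return (alt_num.keys() == cur_num.keys() and alt_num != cur_num
                and all(alt_num[k] <= cur_num[k] for k in alt_num))
    # No parameters: only the last path segment may differ.
    alt_head, _, alt_last = alt_base[2].rpartition("/")
    cur_head, _, cur_last = cur_base[2].rpartition("/")
    return alt_base[:2] == cur_base[:2] and alt_head == cur_head and alt_last != cur_last
```

Under this reading, a forum thread stored under id=3 is treated as similar to the same thread requested with id=5, but not the other way round.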
When the determination result in step S720 or step S730 is affirmative, the processing proceeds to step S740, and the current alternative URL is determined as a valid URL.
If it is determined in steps S720 and S730 that the two web pages are neither the same nor similar, the process proceeds to step S750, where it is determined whether the alternative URL set still contains URLs that have not yet undergone the same-page and similar-page determinations. If so, in step S760, the next alternative URL is extracted from the alternative URL set, and the process returns to step S720, so that the web page corresponding to the extracted alternative URL is checked against the current web page. The processing of steps S720 to S760 is repeated until it is determined in step S750 that all the alternative URLs in the alternative URL set have undergone the same-page and similar-page determinations, thereby determining all the valid URLs in the alternative URL set.
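The loop of steps S720 to S760 can be sketched as follows. The two predicates are passed in as parameters and stand for the same-page and similar-page determinations; their names and the function name find_valid_urls are hypothetical.

```python
from typing import Callable, Iterable


def find_valid_urls(
    candidates: Iterable[str],
    is_same: Callable[[str], bool],
    is_similar: Callable[[str], bool],
) -> list[str]:
    """Return the alternative URLs judged the same as or similar to the current page."""
    valid = []
    for url in candidates:                     # S750 / S760: next alternative URL
        if is_same(url) or is_similar(url):    # S720, then S730 if S720 fails
            valid.append(url)                  # S740: record as a valid URL
    return valid
```

Because `or` short-circuits, the similar-page check runs only when the same-page check fails, matching the order of the flowchart.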
Fig. 8 is a flowchart showing in detail the processing procedure of step S630 in fig. 6 (i.e., determining whether all possible annotations exist in the current web page and their annotation positions in the current web page), and fig. 9 is a schematic diagram showing the structure of the feature code CF of a certain annotation (as shown in (a) in fig. 9) and its corresponding DOM tree (as shown in (b) in fig. 9) used in the processing procedure shown in fig. 8.
As shown in fig. 8, in step S810, based on the webpage annotation information of the possible annotation currently to be determined (e.g., the feature code CF and the XPath path of the annotated object corresponding to the annotation), and starting from the node located by the XPath path in the DOM tree of the current web page, the nodes in the DOM tree of the current web page are examined in sequence upward and downward, so as to find the nodes in the DOM tree that are the same as or most similar to the annotated object and its context nodes (here, "similar" means that the difference in node content and context is within an allowable range). These nodes are taken as the DOM tree nodes corresponding to the annotation in the current web page.
For example, take the feature code CF of a possible annotation to be determined as shown in (a) of fig. 9, where A, B and C respectively represent the annotated object corresponding to the annotation, its upper context node and its lower context node. Starting from the node located by the XPath path of A, the nodes in the DOM tree are examined in sequence, and the nodes corresponding to A, B and C are determined to be A’, B’ and C’, respectively, as shown in (b) of fig. 9; these may be referred to as the DOM tree nodes corresponding to the annotation to be determined.
Then, in step S820, based on the determined DOM tree nodes corresponding to the possible annotation to be determined, the distance D(A, A’) between the annotation and the DOM tree is calculated as follows:
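The upward and downward search of step S810 can be sketched as follows. The sketch assumes the DOM nodes have been flattened into a list in document order, that the XPath path gives a starting index into that list, and that a caller-supplied distance function scores each node against the stored feature; the max_steps window and all names are hypothetical.

```python
from typing import Callable, Sequence


def closest_node_index(
    nodes: Sequence[str],
    start: int,
    distance: Callable[[str], float],
    max_steps: int = 10,
) -> int:
    """Probe outward from start, alternating above and below, and return
    the index of the node with the smallest distance."""
    best, best_distance = start, distance(nodes[start])
    for step in range(1, max_steps + 1):
        if best_distance == 0:  # exact match found, stop early
            break
        for idx in (start - step, start + step):
            if 0 <= idx < len(nodes):
                d = distance(nodes[idx])
                if d < best_distance:
                    best, best_distance = idx, d
    return best
```

Starting at the old position keeps the search cheap when the page has changed only slightly, since the matching node is then usually a few steps away.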
D(A,A’)=d(A,A’)+α(d(B,B’)+d(C,C’))+βds
wherein,
d(A,A’)=|CBF(A)-CBF(A’)|,
d(B,B’)=|CBF(B)-CBF(B’)|,
d(C,C’)=|CBF(C)-CBF(C’)|,
d_s is the tree structure distance, and α and β are constants: α represents the degree to which a difference in the context of the annotated object affects the difference of the annotated object, β represents the degree to which a difference in the DOM tree structure affects the similarity difference of the annotation, and d_s represents the difference between the context node structure in the current DOM tree and the CF structure of the annotation (i.e., the original context node structure).
Suppose that the lowest common node P of nodes A’, B’ and C’ can be found in the DOM tree, and that l_A’, l_B’ and l_C’ respectively denote the number of nodes passed from nodes A’, B’ and C’ to node P. Then d_s can be calculated as follows:
d_s = l_A’ + l_B’ + l_C’
In the case shown in FIG. 9(b), d_s = 1.
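The distance calculation of step S820 can be sketched as follows. The description writes |CBF(A) - CBF(A’)| without naming a norm, so the L1 norm (sum of absolute component differences) used here is an assumption, as are the default values of alpha and beta and the function names.

```python
from typing import Sequence


def vector_distance(u: Sequence[int], v: Sequence[int]) -> int:
    """|CBF(X) - CBF(X')|, read here as the L1 norm (an assumption)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def tree_structure_distance(l_a: int, l_b: int, l_c: int) -> int:
    """d_s: sum of the node counts from A', B' and C' to their lowest common node P."""
    return l_a + l_b + l_c


def annotation_distance(
    cbf_a, cbf_a2, cbf_b, cbf_b2, cbf_c, cbf_c2,
    d_s: int, alpha: float = 0.5, beta: float = 1.0,
) -> float:
    """D(A, A') = d(A, A') + alpha * (d(B, B') + d(C, C')) + beta * d_s."""
    return (vector_distance(cbf_a, cbf_a2)
            + alpha * (vector_distance(cbf_b, cbf_b2) + vector_distance(cbf_c, cbf_c2))
            + beta * d_s)
```

As a worked example with two-component vectors: d(A, A’) = 1, d(B, B’) = 0, d(C, C’) = 2, d_s = 1, alpha = 0.5 and beta = 2 give D = 1 + 0.5 × 2 + 2 × 1 = 4.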
Referring back to fig. 8, in step S830, it is determined whether the distance D of the annotation to be determined, as calculated in step S820, is smaller than a predetermined threshold. If so, it may be determined in step S840 that the annotation should exist on the current web page, and its position on the current web page is determined. For example, if the calculated D(A, A’) is less than the predetermined threshold, it is determined that the annotation to be determined still annotates an element or object in the current web page and thus should be displayed on the current web page, and the position of node A’ in the DOM tree determines the position at which the annotation should be displayed on the current web page.
If it is determined in step S830 that the distance D of the to-be-determined annotation is not less than the predetermined threshold, then in step S840, the annotation is discarded, i.e., it is determined that the annotation should not be displayed on the current webpage.
As can be seen from the above definitions of the content-based feature CBF and the feature code CF of the tagged object, the CBF is generally unique to the tagged object (especially when the tagged object is web page content represented by english text), and has a uniform length, which is convenient for data transmission and storage; the change of the CBF can truly reflect the change of the content of the marked object; and the distance between the CF's of the labeled objects is a measure of the variation of the objects.
In the information sharing system according to the embodiment of the present invention, the labeled object is identified by using the XPath, and simultaneously, the feature code CF information of the labeled object is also utilized, so that the dynamic tracking of the labeled object in the dynamic webpage can be realized, which is impossible to realize in the conventional webpage information labeling system. This is because, in the conventional webpage information labeling system, the characteristic of the labeled object is generally constructed in the form of a hash function (such as MD5 encoding), and although the characteristic is generally unique and uniform in length, which is convenient for data transmission and storage, the characteristic cannot reflect the degree of change of the labeled content. Such hash encoding causes a slight change in the annotated object to result in a large change in the features, so that the degree of change in the annotated object cannot be measured by the distance between the features.
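The contrast drawn above between a hash-based feature and a content-based feature can be illustrated with the following sketch. For brevity it uses only the letter-count part of the CBF and an L1 distance, both simplifications chosen for this example; the two sample strings are made up.

```python
import hashlib
import string


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def letter_counts(text: str) -> list[int]:
    """Per-letter occurrence counts over a to z (the CPF part of the CBF)."""
    return [text.count(c) for c in string.ascii_lowercase]


def l1(u: list[int], v: list[int]) -> int:
    return sum(abs(a - b) for a, b in zip(u, v))


old = "annotatedparagraph"
new = "annotatedparagraphs"  # one letter appended

# The count vectors differ in exactly one component, by one.
small_change = l1(letter_counts(old), letter_counts(new))

# The digests differ, but how much they differ says nothing about how
# much the text changed.
digests_differ = md5_hex(old) != md5_hex(new)
```

Appending a single letter moves the count vector by a distance of 1, whereas the two MD5 digests are simply unequal, and typically most of their hex digits differ, so no useful measure of change can be read from them.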
In the information sharing method and system based on webpage annotation described above with reference to the drawings, according to the embodiments of the present invention, the feature code of the annotated object can be generated based on the content of the annotated object and the context content thereof, so that when all possible annotations are used for matching in the currently loaded webpage, the change of the annotation can be measured, and thus whether the annotation is displayed or not can be determined according to the degree of the change, thereby implementing dynamic tracking. In addition, in the process of label matching, a lightweight DOM tree searching method based on the characteristics of the context content is adopted to measure the content change and the context change of the labeled object.
As can be seen from the above description, the method and system according to the embodiments of the present invention use a dynamic tracking technology, so that even if an annotated object in a web page changes to some extent, the corresponding annotation can be correctly displayed at the changed position on the web page, while for content that has disappeared from the web page, the corresponding annotation will not be displayed. In addition, when an annotated object in a web page has been transferred from another web page, the corresponding annotation can be displayed at the correct position on the web page for that annotated object. In addition, in the case where the current web page has been annotated under different URLs, all of those annotations will be correctly displayed. In addition, when only the format of the annotated object is changed, for example by bolding, italicizing, or quoting, the annotation can still be displayed correctly. Such changes in format are common in web page updates and in forum content transfers. Therefore, webpage annotation can be used as a means of sharing information among users.
Further, it is apparent that the respective operational procedures of the above-described method according to the present invention can also be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present invention can also be achieved by: a storage medium storing the above executable program code is directly or indirectly supplied to a system or an apparatus, and a computer or a Central Processing Unit (CPU) in the system or the apparatus reads out and executes the program code.
At this time, as long as the system or the apparatus has a function of executing a program, the embodiment of the present invention is not limited to the program, and the program may be in any form, for example, an object program, a program executed by an interpreter, a script program provided to an operating system, or the like.
Such machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, disk units such as optical, magnetic, and magneto-optical disks, and other media suitable for storing information.
In addition, the computer can also implement the present invention by connecting to a corresponding website on the internet, and downloading and installing the computer program code according to the present invention into the computer and then executing the program.
It is also to be noted that the steps of executing the above-described series of processes may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
Finally, it should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, it should be understood that the above-described embodiments are only for illustrating the present invention and are not to be construed as limiting the present invention. Various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such structures, means, methods, or steps of a process, apparatus, manufacture, composition of matter.
Claims (25)
1. A method for generating webpage annotation information, comprising the steps of:
in response to the fact that a user selects a target webpage element as a labeled object on a current webpage loaded on a client Web browser, extracting an XPath of the labeled object in a Document Object Model (DOM) tree of the current webpage;
generating feature codes CF of the labeled objects based on the labeled objects and the contents of the context webpage elements which are immediately before and after the labeled objects in the current webpage; and
generating webpage labeling information based on the XPath path of the labeled object, the feature code CF and the label input by the user,
wherein the webpage annotation information is stored in an annotation database of a remote annotation server,
the feature code CF of the annotated object is composed of the content-based feature CBF of the annotated object and the CBFs of its context web page elements, and
the CBF of a web page element consists of a letter projection vector, consisting of the counts of all letters in the web page element over the alphabet Λ = {a, b, c, d, ..., z}, and a letter order vector, consisting of the inversion counts of all letters in the web page element over the alphabet Λ.
2. The method of claim 1, wherein the step of generating the feature code CF of the labeled object further comprises:
generating CBF of the marked object and the context webpage element thereof according to the following modes:
removing meaningless HTML marks from the webpage elements by referring to a prestored HTML cleaning principle;
HTML alphabetizing the webpage elements cleaned by HTML, so that the webpage elements are converted into letter strings consisting of letters from a to z based on the content of the webpage elements;
counting the occurrences and the inversion counts of all letters in the letter string over the alphabet Λ = {a, b, c, d, ..., z}, so as to generate a letter projection vector and a letter order vector of the web page element;
concatenating the letter projection vector and the letter order vector of the web page element to obtain the CBF of the web page element, and
the feature code CF of the annotated object is obtained as follows: CF = CBF of the preceding context web page element of the annotated object + CBF of the annotated object + CBF of the following context web page element of the annotated object.
3. The method of claim 2, wherein in the case where the annotated object and its context web page elements contain Chinese text, the Chinese text is converted to Chinese pinyin with reference to a Chinese dictionary before the HTML-cleaned web page elements are alphabetized.
4. The method according to any one of claims 1 to 3, wherein the webpage annotation information comprises, in addition to the XPath, feature code CF and annotated content and format of the annotated object, the URL of the webpage in which the annotation is located and the content feature code of the webpage in which the annotation is located.
5. A method according to any one of claims 1 to 3, wherein the remote annotation server is implemented by Java Servlet, and
the method further comprises the steps of: the generated web page annotation information is converted into an XML format suitable for communication with a remote annotation server for transmission to the remote annotation server.
6. An apparatus for generating webpage annotation information, comprising:
the XPath generator is used for responding to the situation that a user selects a target webpage element as a marked object on a current webpage loaded on a client Web browser and extracting an XPath of the marked object in a Document Object Model (DOM) tree of the current webpage;
the feature code CF generator is used for generating the feature code CF of the labeled object based on the labeled object and the contents of the context webpage elements which are immediately before and after the labeled object in the current webpage; and
a label generator for generating webpage label information based on the XPath path of the labeled object, the feature code CF of the labeled object and the label input by the user,
wherein the feature code CF of the labeled object is composed of the content-based feature CBF of the labeled object and the CBF of the context webpage element,
wherein the webpage annotation information is stored in an annotation database of a remote annotation server,
the CBF of a web page element consists of a letter projection vector, consisting of the counts of all letters in the web page element over the alphabet Λ = {a, b, c, d, ..., z}, and a letter order vector, consisting of the inversion counts of all letters in the web page element over the alphabet Λ.
7. The apparatus of claim 6, wherein the feature code CF generator comprises a content-based feature CBF generator to generate a content-based feature CBF for a web page element based on content of the web page element; and
the CBF generator further comprises:
the HTML cleaning unit is used for removing meaningless HTML marks from the webpage elements by referring to a prestored HTML cleaning principle;
the HTML alphabetizing unit is used for performing HTML alphabetizing on the webpage elements after being cleaned by the HTML, so that the webpage elements are converted into letter strings formed by letters from a to z based on the content of the webpage elements;
an alphabet projection vector generating unit, configured to count the number of all letters in the alphabet string over an alphabet of Λ ═ { a, b, c, d, · z }, so as to generate an alphabet projection vector of a web page element;
an alphabetical order vector generating unit, configured to count the number of inverses of all letters in the alphabetical string in an alphabet of Λ ═ { a, b, c, d, · z }, so as to generate an alphabetical order vector of the web page elements; and
means for concatenating the alpha projection vector and the alphabetical order vector of the web page element to obtain the CBF of the web page element, an
The CF generator generates the feature code CF of the labeled object according to the following mode: CBF of the above web page element of the annotated object + CBF of the below web page element of the annotated object.
8. The apparatus as claimed in claim 7, wherein, in the case that the annotated object and its contextual web page elements contain Chinese text, the HTML alphabetizing unit converts the Chinese text of the HTML-cleaned web page element into Chinese pinyin with reference to a Chinese character dictionary and then performs HTML alphabetization on it.
9. The apparatus according to any one of claims 6 to 8, wherein the webpage annotation information includes, in addition to the XPath, feature code CF and annotated content and format of the annotated object, URL of the annotated webpage and content feature code of the annotated webpage.
10. The apparatus according to any one of claims 6 to 8, wherein the apparatus is implemented by a browser plug-in, the remote annotation server is implemented by a Java Servlet, and
the apparatus further comprises an XML converter for converting the generated web page annotation information into an XML format suitable for communication with a remote annotation server.
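As a rough illustration of the XML converter in claim 10, the sketch below serializes one piece of web page annotation information to XML and parses it back using Python's standard `xml.etree.ElementTree` module. The element names (`annotation`, `url`, `page_feature`, `xpath`, `feature_code`, `content`, `format`) are hypothetical; the claims list the fields but do not define a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; the claims only enumerate the kinds of information.
FIELDS = ("url", "page_feature", "xpath", "feature_code", "content", "format")


def annotation_to_xml(info: dict) -> str:
    """Serialize annotation information into a flat XML document."""
    root = ET.Element("annotation")
    for key in FIELDS:
        child = ET.SubElement(root, key)
        child.text = str(info[key])
    return ET.tostring(root, encoding="unicode")


def annotation_from_xml(xml_text: str) -> dict:
    """Parse the XML produced by annotation_to_xml back into a dict of strings."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}
```

All values come back as strings, so a caller would need to re-parse vectors such as the feature code after reading them.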
11. A method for displaying a Web page and annotations on the Web page on a client Web browser, comprising the steps of:
a) in response to a user inputting the Uniform Resource Locator (URL) of a web page to be loaded and displayed on the browser, analyzing the input URL to obtain a valid URL;
b) querying the remote annotation server, according to the valid URL, for all annotations related to the valid URL, so as to obtain an annotation candidate set and the web page annotation information of those annotations;
c) for each annotation in the annotation candidate set, determining, according to the web page annotation information of the annotation, whether the annotation annotates a web page element in the web page to be loaded, that is, whether the annotation should exist in the web page to be loaded, and if so, further determining the position of the annotated web page element in the web page to be loaded, that is, the annotation position; and
d) synthesizing the annotations with the web page to be loaded according to the web page annotation information of the annotations determined to be present in the web page to be loaded and the annotation positions thereof, and displaying the synthesized web page to a user via a browser,
wherein the web page annotation information of an annotation comprises the XPath path of the annotated object corresponding to the annotation, the feature code CF of the annotated object, the annotation content and format, the URL of the web page where the annotation is located, and the content feature code of that web page,
the feature code CF of the annotated object consists of the content-based feature CBF of the annotated object and the CBFs of the contextual web page elements immediately before and after the annotated object,
the CBF of a web page element consists of an alphabetical projection vector, formed from the counts of all letters in the web page element over the alphabet Λ = {a, b, c, d, ..., z}, and an alphabetical order vector, formed from the inversion counts of all letters in the web page element over the alphabet Λ.
12. The method of claim 11, wherein the step a) further comprises:
retrieving from the remote annotation server, based on the input URL, the URLs of all web pages in the same website as the web page to be loaded as candidate URLs; determining whether the web page corresponding to each candidate URL is the same as or similar to the web page to be loaded; and determining as valid URLs the candidate URLs whose corresponding web pages are the same as or similar to the web page to be loaded.
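The candidate-URL filtering of claim 12 might be sketched as follows. The claim does not say how two pages are judged the same or similar; this sketch assumes, purely for illustration, that the server stores a page content feature code (a numeric vector) per URL and that pages count as similar when the L1 distance between their feature codes is at most a caller-chosen threshold. The function names and the `stored` mapping are hypothetical.

```python
from urllib.parse import urlsplit


def same_site(url_a: str, url_b: str) -> bool:
    """Treat two URLs as belonging to the same website if their hosts match."""
    return urlsplit(url_a).netloc.lower() == urlsplit(url_b).netloc.lower()


def l1(u, v) -> int:
    """L1 distance between two equally long numeric vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def valid_urls(input_url: str, page_feature, stored: dict, threshold: int) -> list:
    """Return the stored URLs on the same site whose page feature code is
    within `threshold` of the page being loaded (assumed similarity test)."""
    return [
        url
        for url, feature in stored.items()
        if same_site(url, input_url) and l1(feature, page_feature) <= threshold
    ]
```

A URL that differs from the input only in a query parameter can then still be recognized as addressing the same or a similar page.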
13. The method according to claim 11 or 12, wherein the step c) further comprises:
for each annotation in the annotation candidate set:
starting from the node located by the XPath path in the DOM tree of the web page to be loaded, and based on the feature code CF of the annotated object corresponding to the annotation, examining the nodes of the DOM tree sequentially upward and downward, so as to determine the node in the DOM tree that is the same as or closest to the annotated object corresponding to the annotation and its contextual web page elements, as the DOM tree node corresponding to the annotation;
calculating the distance D between the annotation and the DOM tree based on the feature code of the annotation and its corresponding DOM tree node;
determining whether the calculated distance D is less than a predetermined threshold; and
when the distance D of the annotation is less than the predetermined threshold, determining that the annotation should exist in the web page to be loaded, and determining the annotation position of the annotation in the web page to be loaded based on the determined DOM tree node that is the same as or closest to the annotated object corresponding to the annotation.
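The search in claim 13 can be sketched under simplifying assumptions: the DOM tree is flattened into a list of CBF vectors in document order, the XPath path has already been resolved to a starting index, and only the annotated object's own CBF is compared (the context and structure terms are left out). The L1 norm, the `max_steps` limit, and all names are assumptions made for this illustration.

```python
from typing import Optional, Sequence

Vector = Sequence[int]


def l1(u: Vector, v: Vector) -> int:
    """L1 distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def locate(nodes: Sequence[Vector], start: int, target: Vector,
           threshold: float, max_steps: int = 10) -> Optional[int]:
    """Probe start, then start-1 and start+1, then start-2 and start+2, ...
    Return the index of the closest node if its distance is below the
    threshold, otherwise None (the annotation is not placed)."""
    best_idx, best_d = None, float("inf")
    for step in range(max_steps + 1):
        for idx in (start - step, start + step):  # upward first, then downward
            if 0 <= idx < len(nodes):
                d = l1(nodes[idx], target)
                if d < best_d:
                    best_idx, best_d = idx, d
        if best_d == 0:  # exact match found, stop probing
            break
    return best_idx if best_d < threshold else None
```

Starting at the XPath position keeps the search cheap when the page is unchanged, while the outward probing tolerates elements having been inserted or removed nearby.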
14. The method of claim 13, wherein the distance D of the annotation to the DOM tree is calculated as follows:
assuming that the annotated object and its contextual web page element corresponding to the annotation are A, B and C, and the tree nodes in the DOM tree that are the same as or closest to them are A ', B ' and C ', respectively:
D = d(A,A’) + α(d(B,B’) + d(C,C’)) + β·d_s,
wherein,
d(A,A’) = |CBF(A) - CBF(A’)|,
d(B,B’) = |CBF(B) - CBF(B’)|,
d(C,C’) = |CBF(C) - CBF(C’)|,
wherein d(A,A’) represents the distance between web page element A and the tree node A’ in the DOM tree that is the same as or most similar to web page element A; d(B,B’) represents the distance between web page element B and the tree node B’ in the DOM tree that is the same as or most similar to web page element B; d(C,C’) represents the distance between web page element C and the tree node C’ in the DOM tree that is the same as or most similar to web page element C; CBF(A), CBF(B) and CBF(C) represent the CBFs of web page elements A, B and C, respectively; CBF(A’), CBF(B’) and CBF(C’) represent the CBFs of tree nodes A’, B’ and C’, respectively; α and β are constants, where α represents the degree to which differences in the context of the annotated object influence the difference of the annotated object, and β represents the degree to which differences in the DOM tree structure influence the similarity difference of the annotation; and d_s represents the difference between the structure of the DOM tree and the feature code CF of the annotation.
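A small numeric sketch of the distance D follows. The claim writes |CBF(X) - CBF(X’)| without naming a norm, so the L1 norm is assumed here; the values of α and β are arbitrary, and d_s is passed in by the caller because the claim does not state how it is computed. The function names are hypothetical.

```python
def l1(u, v):
    """Assumed reading of |CBF(X) - CBF(X')|: the L1 norm of the difference."""
    return sum(abs(a - b) for a, b in zip(u, v))


def annotation_distance(cf, matched, d_s, alpha=0.5, beta=0.5):
    """D = d(A,A') + alpha * (d(B,B') + d(C,C')) + beta * d_s.

    cf      -- (CBF(A), CBF(B), CBF(C)): annotated object and its two context elements
    matched -- (CBF(A'), CBF(B'), CBF(C')): the corresponding DOM tree nodes
    d_s     -- structural difference term, supplied by the caller
    """
    (a, b, c), (a2, b2, c2) = cf, matched
    return l1(a, a2) + alpha * (l1(b, b2) + l1(c, c2)) + beta * d_s
```

With d(A,A’) = 1, d(B,B’) = 1, d(C,C’) = 2, d_s = 2, α = 0.5 and β = 0.25, the result is 1 + 0.5 × 3 + 0.25 × 2 = 3.0, which would then be compared against the predetermined threshold.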
15. A method according to claim 11 or 12, wherein the CBF of a web page element is generated as follows:
removing meaningless HTML tags from the web page element by referring to pre-stored HTML cleaning rules;
performing HTML alphabetization on the HTML-cleaned web page element, so that the web page element is converted, based on its content, into a letter string consisting of the letters a to z;
counting the occurrences and the inversions of all letters of the letter string over the alphabet Λ = {a, b, c, d, ..., z}, so as to generate the alphabetical projection vector and the alphabetical order vector of the web page element; and
concatenating the alphabetical projection vector and the alphabetical order vector of the web page element to obtain the CBF of the web page element.
16. A method according to claim 11 or 12, wherein the remote annotation server is implemented by a Java Servlet, and
the method further comprises the steps of: information passed between the client and the remote annotation server is converted to XML format before being sent or received.
17. An apparatus for displaying a Web page and annotations on the Web page via a client Web browser, comprising:
a URL analyzer for, in response to a user inputting the Uniform Resource Locator (URL) of a web page to be loaded and displayed on the browser, analyzing the input URL to obtain a valid URL;
an annotation querier for querying the remote annotation server, according to the valid URL, for all annotations related to the valid URL, so as to obtain an annotation candidate set and the web page annotation information of those annotations;
an annotation position determining unit for determining, for each annotation in the annotation candidate set and according to the web page annotation information of the annotation, whether the annotation annotates a web page element in the web page to be loaded, that is, whether the annotation should exist in the web page to be loaded, and if so, further determining the position in the web page to be loaded of the web page element annotated by the annotation, that is, the annotation position; and
a synthesizing unit for synthesizing the annotations with the web page to be loaded according to the web page annotation information of the annotations determined to be present in the web page to be loaded and their annotation positions,
wherein the synthesized web page is displayed to the user via the browser,
the web page annotation information of an annotation comprises the XPath path of the annotated object corresponding to the annotation, the feature code CF of the annotated object, the annotation content and format, the URL of the web page where the annotation is located, and the content feature code of that web page,
the feature code CF of the annotated object consists of the content-based feature CBF of the annotated object and the CBFs of the contextual web page elements immediately before and after the annotated object, and
the CBF of a web page element consists of an alphabetical projection vector, formed from the counts of all letters in the web page element over the alphabet Λ = {a, b, c, d, ..., z}, and an alphabetical order vector, formed from the inversion counts of all letters in the web page element over the alphabet Λ.
18. The apparatus of claim 17, wherein the URL analyzer retrieves from the remote annotation server, based on the input URL, the URLs of all web pages in the same website as the web page to be loaded as candidate URLs, determines whether the web page corresponding to each candidate URL is the same as or similar to the web page to be loaded, and determines as valid URLs the candidate URLs whose corresponding web pages are the same as or similar to the web page to be loaded.
19. The apparatus according to claim 17 or 18, wherein the annotation position determination unit performs the following for each annotation in the annotation candidate set:
starting from the node located by the XPath path in the DOM tree of the web page to be loaded, and based on the feature code CF of the annotated object corresponding to the annotation, examining the nodes of the DOM tree sequentially upward and downward, so as to determine the node in the DOM tree that is the same as or closest to the annotated object corresponding to the annotation and its contextual web page elements, as the DOM tree node corresponding to the annotation;
calculating the distance D between the annotation and the DOM tree based on the feature code of the annotation and its corresponding DOM tree node;
determining whether the calculated distance D is less than a predetermined threshold; and
when the distance D of the annotation is less than the predetermined threshold, determining that the annotation should exist in the web page to be loaded, and determining the annotation position of the annotation in the web page to be loaded based on the determined DOM tree node that is the same as or closest to the annotated object corresponding to the annotation.
20. The apparatus of claim 19, wherein the annotation position determination unit calculates the distance D of the annotation from the DOM tree as follows:
assuming that the annotated object and its contextual web page element corresponding to the annotation are A, B and C, and the tree nodes in the DOM tree that are the same as or closest to them are A ', B ' and C ', respectively:
D = d(A,A’) + α(d(B,B’) + d(C,C’)) + β·d_s,
wherein,
d(A,A’) = |CBF(A) - CBF(A’)|,
d(B,B’) = |CBF(B) - CBF(B’)|,
d(C,C’) = |CBF(C) - CBF(C’)|,
wherein d(A,A’) represents the distance between web page element A and the tree node A’ in the DOM tree that is the same as or most similar to web page element A; d(B,B’) represents the distance between web page element B and the tree node B’ in the DOM tree that is the same as or most similar to web page element B; d(C,C’) represents the distance between web page element C and the tree node C’ in the DOM tree that is the same as or most similar to web page element C; CBF(A), CBF(B) and CBF(C) represent the CBFs of web page elements A, B and C, respectively; CBF(A’), CBF(B’) and CBF(C’) represent the CBFs of tree nodes A’, B’ and C’, respectively; α and β are constants, where α represents the degree to which differences in the context of the annotated object influence the difference of the annotated object, and β represents the degree to which differences in the DOM tree structure influence the similarity difference of the annotation; and d_s represents the difference between the structure of the DOM tree and the feature code CF of the annotation.
21. The apparatus of claim 17 or 18, further comprising a content-based feature CBF generator to generate a content-based feature CBF for a web page element,
the CBF generator further comprises:
an HTML cleaning unit for removing meaningless HTML tags from the web page element by referring to pre-stored HTML cleaning rules;
an HTML alphabetizing unit for performing HTML alphabetization on the HTML-cleaned web page element, so that the web page element is converted, based on its content, into a letter string consisting of the letters a to z;
an alphabetical projection vector generating unit for counting the occurrences of all letters of the letter string over the alphabet Λ = {a, b, c, d, ..., z}, so as to generate the alphabetical projection vector of the web page element;
an alphabetical order vector generating unit for counting the inversions of all letters of the letter string over the alphabet Λ = {a, b, c, d, ..., z}, so as to generate the alphabetical order vector of the web page element; and
a unit for concatenating the alphabetical projection vector and the alphabetical order vector of the web page element to obtain the CBF of the web page element.
22. The apparatus according to claim 17 or 18, wherein the apparatus is implemented by a browser plug-in, the remote annotation server is implemented by a Java Servlet, and
the apparatus further comprises an XML converter for converting information communicated between the client and the remote annotation server into XML format prior to transmission or reception.
23. A web page annotation method, comprising the steps of:
in response to a user inputting, on a client Web browser, the URL of a Web page to be loaded and displayed, displaying on the browser the Web page together with the existing annotations previously made on the Web page and stored on a remote annotation server, by performing the method of any one of claims 11 to 16;
adding a new annotation to the web page by performing the method of any one of claims 1 to 5, the web page annotation information of the new annotation being stored on the remote annotation server; and
displaying the added new annotation on the webpage via a browser.
24. A web page annotation device, comprising:
the apparatus for generating web page annotation information according to claim 9; and
the apparatus for displaying a Web page and annotations on the Web page via a client Web browser according to any one of claims 17 to 22.
25. An information sharing system based on web page annotation, comprising a client and a remote annotation server, wherein
the client comprises the web page annotation device according to claim 24, and
the remote annotation server comprises an annotation database for storing web page annotation information and an annotation information accessor for performing access control on the annotation database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200910133976 CN101866342B (en) | 2009-04-16 | 2009-04-16 | Method and device for generating or displaying webpage label and information sharing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101866342A (en) | 2010-10-20 |
CN101866342B (en) | 2013-09-11 |
Family
ID=42958073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200910133976 Expired - Fee Related CN101866342B (en) | 2009-04-16 | 2009-04-16 | Method and device for generating or displaying webpage label and information sharing system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101866342B (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102637172B (en) * | 2011-02-10 | 2013-11-27 | 北京百度网讯科技有限公司 | Webpage blocking marking method and system |
US9619580B2 (en) | 2012-09-11 | 2017-04-11 | International Business Machines Corporation | Generation of synthetic context objects |
US9741138B2 (en) | 2012-10-10 | 2017-08-22 | International Business Machines Corporation | Node cluster relationships in a graph database |
US8931109B2 (en) | 2012-11-19 | 2015-01-06 | International Business Machines Corporation | Context-based security screening for accessing data |
CN103942224B (en) * | 2013-01-23 | 2018-12-14 | 百度在线网络技术(北京)有限公司 | A kind of method and device for the mark rule obtaining web page release |
US9053102B2 (en) | 2013-01-31 | 2015-06-09 | International Business Machines Corporation | Generation of synthetic context frameworks for dimensionally constrained hierarchical synthetic context-based objects |
US9069752B2 (en) | 2013-01-31 | 2015-06-30 | International Business Machines Corporation | Measuring and displaying facets in context-based conformed dimensional data gravity wells |
CN104035916B (en) * | 2013-03-07 | 2017-05-24 | 富士通株式会社 | Method and device for standardizing annotation tool |
US10152526B2 (en) | 2013-04-11 | 2018-12-11 | International Business Machines Corporation | Generation of synthetic context objects using bounded context objects |
US9348794B2 (en) | 2013-05-17 | 2016-05-24 | International Business Machines Corporation | Population of context-based data gravity wells |
CN104794174A (en) * | 2015-04-01 | 2015-07-22 | 百度在线网络技术(北京)有限公司 | Webpage marking information display method and device |
CN104811351A (en) * | 2015-04-21 | 2015-07-29 | 中国电子科技集团公司第四十一研究所 | Distributed communication network testing method and system based on XML |
CN105095432B (en) | 2015-07-22 | 2019-04-16 | 腾讯科技(北京)有限公司 | Web page annotation display methods and device |
CN105117498A (en) * | 2015-09-28 | 2015-12-02 | 北京奇虎科技有限公司 | Webpage data processing method and device |
CN106610994A (en) * | 2015-10-23 | 2017-05-03 | 北京国双科技有限公司 | Method and device for counting click paths |
CN105824925B (en) * | 2016-03-17 | 2019-09-10 | 四川长虹电器股份有限公司 | Dynamic label placement method based on browsing device net page element |
CN105930383A (en) * | 2016-04-14 | 2016-09-07 | 青岛海信移动通信技术股份有限公司 | Method and device for implementing electronic bookmarks |
CN106250394B (en) * | 2016-07-15 | 2019-08-02 | 北京邮电大学 | Network resource content sees clearly system and method |
WO2018053620A1 (en) * | 2016-09-23 | 2018-03-29 | Hvr Technologies Inc. | Digital communications platform for webpage overlay |
CN108874373B (en) * | 2017-05-12 | 2023-05-30 | 深圳市雅阅科技有限公司 | Method and device for inserting information into webpage, display terminal and storage medium |
CN107808000B (en) * | 2017-11-13 | 2020-05-22 | 哈尔滨工业大学(威海) | System and method for collecting and extracting data of dark net |
CN110619100B (en) * | 2019-06-18 | 2023-02-28 | 北京无限光场科技有限公司 | Method and apparatus for acquiring data |
CN113688597A (en) * | 2020-05-18 | 2021-11-23 | 北京字节跳动网络技术有限公司 | Display method, device, equipment and storage medium of labeled file |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101124609A (en) * | 2004-07-29 | 2008-02-13 | 雅虎公司 | Search systems and methods using in-line contextual queries |
CN101251855A (en) * | 2008-03-27 | 2008-08-27 | 腾讯科技(深圳)有限公司 | Equipment, system and method for cleaning internet web page |
Also Published As
Publication number | Publication date |
---|---|
CN101866342A (en) | 2010-10-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101866342B (en) | Method and device for generating or displaying webpage label and information sharing system | |
US11372935B2 (en) | Automatically generating a website specific to an industry | |
US8554800B2 (en) | System, methods and applications for structured document indexing | |
US20090089278A1 (en) | Techniques for keyword extraction from urls using statistical analysis | |
US7353268B2 (en) | Network system, server, web server, web page, data processing method, storage medium, and program transmission apparatus | |
Papadakis et al. | Stavies: A system for information extraction from unknown web data sources through automatic web wrapper generation using clustering techniques | |
US20090125529A1 (en) | Extracting information based on document structure and characteristics of attributes | |
CN101551800B (en) | Marked information generation device, inquiry unit and sharing system | |
US20020035619A1 (en) | Apparatus and method for producing contextually marked-up electronic content | |
US20080072140A1 (en) | Techniques for inducing high quality structural templates for electronic documents | |
CN1408093A (en) | Electronic shopping agent which is capable of operating with vendor sites having disparate formats | |
US20110137943A1 (en) | Apparatus for deciding word-related keywords, and method and program for controlling operation of same | |
TW200836075A (en) | Method of converting hypertext markup language web page into pure text and system thereof | |
US20060047693A1 (en) | Apparatus for and method of generating data extraction definition information | |
US20170109442A1 (en) | Customizing a website string content specific to an industry | |
TW201415254A (en) | Method and system for recommending semantic annotations | |
US8140575B2 (en) | Apparatus, method, and program product for information processing | |
US20100082594A1 (en) | Building a topic based webpage based on algorithmic and community interactions | |
KR20090130364A (en) | Method, apparatus and computer-readable recording medium for tagging image contained in web page and providing web search service using tagged result | |
KR20100037401A (en) | Method and apparatus for managing search database | |
Maurer et al. | Transclusions in an html-based environment | |
CN101196883A (en) | Internet information natural language translation general method and system | |
US20030176996A1 (en) | Content of electronic documents | |
JP5321777B2 (en) | Product search device and product search method having function of presenting reference keyword | |
JP2009259248A (en) | Method and unit for tagging images included in web page and providing web retrieval service by using the result and computer-readable recording medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130911; Termination date: 20180416 |