DEAD HYPER LINK DETECTION METHOD AND SYSTEM
Field of Invention
The present invention relates generally to the field of document retrieval and interaction on a distributed computer network. More specifically, the present invention relates to a system for post processing embedded hyperlinks.
Background of the Invention
The World Wide Web (WWW) may be broadly described as a virtual collection of documents with a user being able to access and retrieve these documents through existing telephone or data lines. Documents accessible on the WWW have the capability to direct users to other documents on the web using linking information embedded in the text itself. Typically, the documents are stored in hypertext markup language (HTML) format. Using hypertext linking, an author will integrate references directly into the text of a document which point to other related items of information. Uniform resource locators (URLs) provide a way of converting the integrated reference to a real location where the related information will be located on the Internet. It is possible that links that are valid when they are included in these pages may become defunct or "dead links" over time.
Summary of the Invention
An aspect of the present invention involves a method of testing embedded hyperlinks including receiving a document request from a client; parsing a first document to determine if elements in the first document contain hyperlinks; separating the elements into hyperlinks and all other non-hyperlink elements; testing the hyperlinks in the first document in parallel to determine if the hyperlinks are valid hyperlinks or invalid hyperlinks by comparing the hyperlinks to a predetermined rule set; adding the valid hyperlinks to a list including the other non-hyperlink elements; generating a second document from the list; and providing the second document to the client.
Another aspect of the present invention involves a system including a memory device which stores a first document; and a processor in communication with the memory device, said processor configured to: receive a document request from a client; parse the first document to determine if elements in the first document contain hyperlinks; separate the elements into hyperlinks and all other non-hyperlink elements; test hyperlinks in said first document to determine if said hyperlinks are valid hyperlinks or invalid hyperlinks by comparing the hyperlinks to a predetermined rule set; add the valid hyperlinks to a list including the other non-hyperlink elements; generate a second document using the list; and provide said second document to said client.
Other and further aspects of the present invention will become apparent during the course of the following description and by reference to the attached drawings.
Brief Description of the Drawings
Figure 1 illustrates a block diagram of an internet client/server relationship;
Figure 2 illustrates a block diagram of the server of Figure 1;
Figure 3 illustrates an HTML document in an exploded view;
Figure 4 illustrates a flow chart of the process of link validation of an embodiment of the present invention;
Figure 5 illustrates a first subroutine of the flow chart of Figure 4 in which the hypertext links and other text are separated;
Figure 6 illustrates a second subroutine of the flow chart of Figure 4 in which the hypertext links are tested to determine if they are valid; and
Figure 7 illustrates an alternative embodiment of the present invention which includes a modification of the subroutine of Figure 5 so that invalid hypertext links are processed to strip away the HTML tags.
Detailed Description of the Preferred Embodiments
The ability of a web server application to ascertain the validity of embedded links in web pages at request time is critical for the credibility of a web site. With more and more web sites moving into the e-commerce arena, this question of web site credibility is becoming even more sensitive. The present invention is capable of detecting defunct hyperlinks as soon as they become accessible.
Embodiments of the present invention disclosed herein relate to the serving of web pages or documents by Internet web servers. The pages or documents discussed in this application may be in Hyper Text Markup Language (HTML), Standard Generalized Markup Language (SGML), Extensible Markup Language (XML) or any other format which uses a tagging architecture. In the following discussion of this application, HTML will be used for example purposes only.
The embodiments disclosed herein include a method and system for checking the validity of HTML hyperlinks embedded in HTML web pages being served to clients by a server. This is true of web servers that serve static (or non-changing) HTML web pages or application web servers that serve dynamic HTML web pages. Static web pages are HTML web pages that are written or "constructed" at some point in time and then remain unchanged until a web site administrator manually removes them, updates them, or replaces them with entirely new pages. Dynamic HTML web pages are web pages served through some type of application server utilizing HTML templates and some type of dynamic page generation mechanism. In both cases it is possible that links that are valid when they are included in these pages may become defunct or "dead links" over time.
With reference to the Figures, several embodiments of the present invention will now be shown and described. Referring to Figure 1, electronic content distribution system 100 includes a server 110 and a user computer/client 140, both of which are connected across network backbone 105. Network backbone 105 may include an internet backbone, an intranet backbone, any other conventional network backbone, or a combination thereof.
Server 110 may be a conventional server which includes conventional computer hardware and functionality. Server 110 may be associated with a web site or a content provider, such as a publisher (e.g., a magazine publisher, book publisher, etc.), a news agency, or any distributor or provider of electronic content. Electronic content may correspond to any publications (e.g., a news or magazine article), reports, technical papers and so forth. Electronic content may include a content body including documents with text and/or images with associated metadata as well as traditional index fields generally provided in a header or trailer section of this electronic content. Server 110 is configured to perform automatic dead link checking of hyperlinks to determine if dead links appear in a content body of the electronic content.
Fig. 2 is a schematic block diagram illustrating the components of server 110 of Fig. 1. Conventional computer components are included, such as a processor 200, user input devices 205, e.g., keyboard, mouse, etc., for receiving user inputs, network interface 210 for interconnection to the network backbone 105, RAM 215, ROM 220, display 225 and storage device 230. Storage device 230 stores the software which implements the present invention.
Turning to Figure 1, a request is sent from user computer 140 onto the network backbone 105 for a particular document or other piece of information. The requested document 320, as shown in Figure 3, is stored on server 110. The document 320 may include highlighted text 322 which includes hidden embedded links to other related information as prepared by hypertext authoring tools. The present invention will automatically perform a dead link check on any hyperlinks in the document 320 before sending the document to the user computer 140.
Figure 4 illustrates a flow diagram of the elemental steps of a first embodiment of the present invention. In a first step 410, a user accesses an Internet resource, such as an HTML page, which is served by the server 110. In step 412, the server 110 will, before serving the page to the user, parse that page and isolate the HTML hyperlinks that are embedded in that page. Figure 5 illustrates step 412 in more detail. In step 412a, a comparison is performed between the HTML page and a predefined rule set. Since all HTML hyperlinks employ a defined syntax, the server 110 can work from this predetermined rule set for parsing and isolating these links. This predetermined rule set can optionally be augmented through the use of a web server configuration file. This configuration file may employ an HTML hyperlink meta language that will allow the server 110 to dynamically learn at initialization time the syntax and nature of the HTML hyperlinks that must be isolated. In step 412b, a decision is made whether the text is a hyperlink. If so, it is added to the list of "N" hyperlinks in step 412c (with N representing a number greater than or equal to 0). If the text is not a hyperlink, it is added to the list of all other HTML elements which are not hyperlinks in step 412d. In step 412e, the system determines whether all of the document has been checked and, if not, returns to step 412a to continue checking the document. If the entire document has been reviewed, then the hyperlink parsing is completed in step 412f and the program returns to the flowchart of Figure 4.
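The separation performed in steps 412a through 412f might be sketched as follows. This is an illustration only, not the disclosed implementation: it uses Python's standard `html.parser` module, takes "an anchor tag carrying an `href` attribute" as a stand-in for the predetermined rule set, and the names `LinkSplitter` and `split_page` are hypothetical.

```python
from html.parser import HTMLParser


class LinkSplitter(HTMLParser):
    """Walks a page once and sorts what it finds into two lists."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hyperlinks = []      # the list of "N" hyperlinks (step 412c)
        self.other_elements = []  # every non-hyperlink element (step 412d)
        self._in_link = False     # True while inside an <a href=...> element

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Assumed rule set: an <a> tag with an href attribute is a hyperlink.
        if tag == "a" and href is not None:
            self.hyperlinks.append(href)
            self._in_link = True
        else:
            self.other_elements.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept verbatim as one element.
        self.other_elements.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False  # closing tag belongs to the isolated link
        else:
            self.other_elements.append(f"</{tag}>")

    def handle_data(self, data):
        # Text, including the visible text of a link, is a non-link element.
        self.other_elements.append(data)


def split_page(html):
    """Return (hyperlinks, other_elements) for one HTML page."""
    splitter = LinkSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.hyperlinks, splitter.other_elements
```

A page with no anchors simply yields an empty hyperlink list, which corresponds to the N = 0 case mentioned above.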
Figure 4 shows that in steps 414 and 416 the list of the "N" hyperlinks and the list of the other non-hyperlink HTML elements are separated. Once the server 110 has isolated the list of hyperlinks for a given web page, it may in step 418 employ a multi-threaded socket initiator to simultaneously create hypertext transfer protocol (HTTP) socket connections to all the hyperlinks in the hyperlink list and allow the hyperlinks to be tested in parallel. These socket connections will begin retrieving the specified web pages, looking in particular for web server error messages in the HTTP headers of the incoming pages. For example, a 404 return code signifies that the web page in question no longer exists at the specified location. Once the HTTP header is read, the socket connection may be terminated. It is then a matter of parsing and interpreting the headers for the various web pages.
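One way the parallel test of step 418 could look is sketched below, using Python's standard `http.client` and a thread pool in place of the multi-threaded socket initiator. The names `fetch_status`, `is_valid` and `check_links`, the timeout, the pool size, and the choice to treat any status code of 400 or above (or any connection failure) as NOT VALID are all assumptions; the description itself names only the 404 code.

```python
import http.client
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def fetch_status(url, timeout=5.0):
    """Open a connection, read the status line and headers, then hang up."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"no host in URL: {url!r}")
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target)
        # getresponse() parses the status line and headers; the body is not read.
        return conn.getresponse().status
    finally:
        conn.close()


def is_valid(url, probe=fetch_status):
    """Assumed rule: status below 400 is VALID; errors count as NOT VALID."""
    try:
        return probe(url) < 400
    except (OSError, http.client.HTTPException, ValueError):
        return False


def check_links(urls, probe=fetch_status, max_workers=16):
    """Test every hyperlink in parallel and map each URL to True or False."""
    if not urls:
        return {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(urls))) as pool:
        results = list(pool.map(lambda u: is_valid(u, probe), urls))
    return dict(zip(urls, results))
```

The `probe` parameter exists so the status lookup can be swapped out, for instance to substitute a stub when no network is available.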
Figure 6 discloses step 418 in more detail. In step 418a, hyperlinks 1 to N are tested. If the first through "N" hyperlinks are valid as determined in steps 418a through 418c, then these hyperlinks are given a Boolean value of VALID and added to the list of valid hypertext links in step 418d. If a hyperlink is not valid, then the hyperlink is given the Boolean value of NOT VALID and is not added to the list of valid hyperlinks, and the program returns to the flowchart of Figure 4.
At this point the server 110 has the HTML web page parsed into a dynamic data structure with the hyperlinks separated from the remaining page elements. The server 110 also has a dynamic data structure that has a list of the page's internal links and a Boolean value that represents each link's web status (i.e., VALID or NOT VALID). The server 110 will recombine the VALID hyperlinks with the other HTML elements in step 420 and omit any hyperlinks having a NOT VALID value. The server 110 will recompose the elements of the page in step 422. In this way the user will never see invalid or defunct links being served by the web site that employs a server 110 such as this.
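The recombination of steps 420 and 422 can be illustrated with a short sketch. The description does not specify how document order is preserved across the two lists, so the sketch assumes a single ordered list of `Element` records in which hyperlinks carry their target in `href`; `Element` and `recompose` are hypothetical names, and a link missing from the status map is treated as NOT VALID.

```python
from typing import Dict, List, NamedTuple, Optional


class Element(NamedTuple):
    html: str                   # markup or text exactly as it should be served
    href: Optional[str] = None  # set only for hyperlink elements


def recompose(elements: List[Element], status: Dict[str, bool]) -> str:
    """Rebuild the page, keeping non-link elements and VALID hyperlinks only."""
    kept = []
    for element in elements:
        if element.href is None:
            kept.append(element.html)       # ordinary element, always kept
        elif status.get(element.href, False):
            kept.append(element.html)       # hyperlink marked VALID
        # hyperlinks marked NOT VALID are omitted entirely
    return "".join(kept)
```

Note that omitting a link this way also removes its visible text, which is the behavior the alternative embodiment changes.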
In an alternative embodiment disclosed in Figure 7, subroutine 418 will be modified so that server 110 will recompose the page with the non-valid link but will strip away the HTML tags that empower that link, thus making the link look like plain text. In this embodiment, the net result is the same: a user will never click on a hyperlink that takes them to a defunct page. Subroutine 418 will be modified to include steps 418e through 418g in which, if a hyperlink is found to be invalid, the tag will be stripped and the link will be made to look like text and added to the VALID hyperlink list.
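The tag stripping of steps 418e through 418g might be sketched as below, again with Python's standard `html.parser`. The class name `DeadLinkFlattener` and the function `flatten_dead_links` are hypothetical, the status map is assumed to come from an earlier validity test, and a link absent from that map is treated as NOT VALID.

```python
from html.parser import HTMLParser


class DeadLinkFlattener(HTMLParser):
    """Re-emits a page, dropping the <a> tags of links marked NOT VALID."""

    def __init__(self, status):
        super().__init__(convert_charrefs=False)
        self.status = status  # maps href -> True (VALID) / False (NOT VALID)
        self.out = []
        self._stripped = []   # one flag per open <a>: was its tag dropped?

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            dead = href is not None and not self.status.get(href, False)
            self._stripped.append(dead)
            if dead:
                return        # drop the opening tag, keep the text that follows
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self._stripped and self._stripped.pop():
            return            # drop the closing tag of a stripped link
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def flatten_dead_links(html, status):
    """Return the page with invalid links reduced to their plain text."""
    parser = DeadLinkFlattener(status)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Compared with omitting the link outright, this keeps the sentence around the link readable, at the cost of leaving text that once pointed somewhere.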
The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the law. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.