2. LITERATURE SURVEY

The Internet contains a huge number of information sources of different kinds. Even though a user can browse the Internet, finding relevant information is still a difficult task. Web data extraction addresses this by filtering out irrelevant data and presenting only the relevant results to the user. This can be done with the help of an ontology, which is an explicit specification of some topic: a formal and declarative representation that includes the vocabulary (or names) for referring to the terms in a specific subject area and the logical statements that describe what the terms are and how they are related to each other. SOX (Schema for Object-Oriented XML), a schema definition language for XML documents, is used for ontology definition and data modeling. SOX was developed by Commerce One to apply XML to electronic commerce. The SOX method works only for pages that do not change their structure, and it requires manual intervention. Automatically generated wrappers for extracting data from HTML sites were therefore introduced. Data extraction from HTML is usually performed by software modules called wrappers. Early approaches to wrapping Web sites were based on manual techniques, and a key problem with manually coded wrappers is that writing them is a difficult and labor-intensive task. The automatic approach, in contrast, does not rely on any a priori knowledge about the target pages and their contents: the site generation process can be seen as an encoding of the original database content into strings of HTML code and, as a consequence, data extraction can be seen as a decoding process.
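
To make the notion of a wrapper concrete, the sketch below hard-codes extraction rules for a hypothetical listing page whose records follow a fixed <li class="item"> layout; the page structure and field names are assumptions made only for this illustration, and the fragility of such hand-written rules is exactly why manual wrapping is labor-intensive.

from html.parser import HTMLParser

# A minimal hand-coded wrapper for a hypothetical listing page in which every
# record is rendered as <li class="item"><b>name</b><span>price</span></li>.
# Any change to this layout breaks the wrapper, which is why manually written
# wrappers are costly to build and maintain.
class ItemWrapper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.records, self.current, self.field = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "item") in attrs:
            self.current = {}
        elif self.current is not None and tag in ("b", "span"):
            self.field = "name" if tag == "b" else "price"

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] = data.strip()
            self.field = None

    def handle_endtag(self, tag):
        if tag == "li" and self.current is not None:
            self.records.append(self.current)
            self.current = None

page = ('<ul><li class="item"><b>Pen</b><span>$1</span></li>'
        '<li class="item"><b>Book</b><span>$8</span></li></ul>')
w = ItemWrapper()
w.feed(page)
print(w.records)  # [{'name': 'Pen', 'price': '$1'}, {'name': 'Book', 'price': '$8'}]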

The match algorithm [2] used for this decoding is based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract). It focuses on the data extraction problem alone, requires no human interaction, but cannot automatically label the extracted data. Later work addresses the problem of detecting templates on the Web. A templatized page is one among a number of pages sharing a common administrative authority and a common look and feel. The shared look and feel is very valuable from a user's point of view, since it provides context for browsing. However, templatized pages skew ranking, IR and data mining algorithms and consequently reduce precision. The proliferation of templates causes frequent and systematic violations of the hypertext IR principles. A template is a pre-prepared master HTML shell page that is used as a basis for composing new web pages: the content of the new pages is plugged into the template shell, resulting in a collection of pages that share a common look and feel. Since all pages that conform to a common template share many links, these links cannot be relevant to the specific content of the individual pages. Templates thus violate both the Relevant Linkage Principle and the Topical Unity Principle, and they may also violate the Lexical Affinity Principle if they are interleaved with the actual content of the pages. Therefore, improving hypertext data quality by recognizing and dealing with templates seems essential to the success of hypertext IR tools. Two algorithms are presented for detecting templates in a given collection of pages; both are scalable and designed to process large numbers of pages efficiently. The first, the local template detection algorithm, is more accurate for small sets of pages, while the second, the global template detection algorithm, better suits large sets of pages.
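
The observation that pages conforming to a common template share many links suggests a very rough detector: links that recur on most pages of a site are likely template material. The sketch below is only a toy counting version of that intuition, not the local or global detection algorithms themselves.

from collections import Counter

def template_links(pages_links, threshold=0.8):
    """Flag links that appear on at least `threshold` of the pages.

    pages_links: list of sets, one set of outgoing link URLs per page.
    Links shared by most pages of a site are unlikely to be relevant to
    the specific content of any single page, i.e. they are template links.
    """
    counts = Counter(link for links in pages_links for link in set(links))
    min_pages = threshold * len(pages_links)
    return {link for link, c in counts.items() if c >= min_pages}

# Toy example: three pages of the same site share navigation links.
pages = [
    {"/home", "/about", "/contact", "/article/1"},
    {"/home", "/about", "/contact", "/article/2"},
    {"/home", "/about", "/article/3"},
]
print(sorted(template_links(pages)))  # ['/about', '/home']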

Only tags were used to find the template in the previous method, whereas with EXALG [4] any word can be part of the template. Structured data are extracted from the web pages without any learning examples or similar human input: the extraction algorithm uses sets of words that have similar occurrence patterns in the input pages to construct the template, and the constructed template is then used to extract values from the pages. The World Wide Web is a vast and rapidly growing source of information, most of which resides in unstructured HTML pages targeted at a human audience. The unstructured nature of these pages makes it hard to run sophisticated queries over the information they contain. Extracting structured data from web pages is therefore clearly useful, since it enables complex queries over the data; it is also useful in information integration systems that combine the data present in different web sites. EXALG has several modules whose function is to extract the data efficiently. The input pages are given to the equivalence class generation module, which consists of DiffForm (Differentiate Roles Using Format), FindEq, HandInv and DiffEq. The output is forwarded to the analysis module, which combines ConstTemp and ExVal, and from which the template, schema and values are extracted. The observations underlying these modules are that tokens associated with the same type constructor in the template, and having unique roles, occur in the same equivalence class, and that, for real pages, an equivalence class of large size and support is usually valid. The latter observation is a heuristic, and there is no guarantee that every large and frequently occurring equivalence class will satisfy it.
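
The central notion of an equivalence class, a set of tokens whose occurrence counts are identical across the input pages, can be sketched as follows; this toy version ignores role differentiation by format, invalid-class handling and the other refinements of EXALG.

from collections import defaultdict

def equivalence_classes(pages, min_support=2, min_size=2):
    """Group tokens by their occurrence vector across pages.

    pages: list of token lists (the words and tags of each input page).
    Tokens generated by the template tend to occur the same number of
    times on every page, so they fall into the same large class.
    """
    vectors = defaultdict(lambda: [0] * len(pages))
    for i, tokens in enumerate(pages):
        for tok in tokens:
            vectors[tok][i] += 1
    classes = defaultdict(set)
    for tok, vec in vectors.items():
        classes[tuple(vec)].add(tok)
    # Keep classes that are large and occur on enough pages (support).
    return {vec: toks for vec, toks in classes.items()
            if len(toks) >= min_size and sum(1 for v in vec if v) >= min_support}

pages = [
    "<b> Book </b> Name : Databases Price : 40".split(),
    "<b> Book </b> Name : Networks Price : 25 Reviews : 2".split(),
]
for vec, toks in equivalence_classes(pages).items():
    print(vec, sorted(toks))  # (1, 1) ['</b>', '<b>', 'Book', 'Name', 'Price']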

Much of the information in these unstructured pages is lost when naive keyword indexing is used, and EXALG does not accommodate the existence of data sections. Automatic wrapper generation for search engines was therefore introduced, which can be used to extract search results from dynamically generated result pages. This technique uses both the visual content features of the result page as displayed in a browser and the HTML tag structure of the result page's source file. Search engines are very important tools for reaching the vast amount of information on the World Wide Web; recent studies indicate that Web searching is, behind email, the second most popular activity on the Internet, and surveys indicate that there are hundreds of thousands of search engines on the Web. Not only do Web users interact with search engines, many Web applications also need to do so. The focus here is thus on how to extract search result records (SRRs) from dynamically generated result pages returned by search engines in response to submitted queries. A tool called ViNTs [5] (Visual information and Tag structure based wrapper generator) is devised to automatically produce a wrapper for any given search engine and so extract its search result records. The input to the system is the URL of a search engine's interface page, which contains an HTML form used to accept user queries; the output is a wrapper for that search engine. As mentioned earlier, existing techniques for web information extraction are based on the analysis of HTML tag structures.

Many visual content features that are designed to help people locate and understand information on a web page can also help information extraction. An operational wrapper generation prototype system (ViNTs) was built based on this method, with result page rendering and tag tree construction performed by a commercial tool, the ICE browser. ViNTs can build a wrapper for a search engine from five sample result pages and one no-result page in 3 to 7 seconds on a Pentium 4 1.7 GHz PC. Once a wrapper is built for a search engine, SRRs can be extracted from a new result page of that engine in a small fraction of a second (about 100 milliseconds), so the wrappers generated by ViNTs are practically useful in real-time web applications; in fact, ViNTs has been used in the development of a commercial news metasearch engine (www.allinonenews.com). Many web applications, such as metasearch engines, deep web crawlers and shopping agents, need to interact with search engines, so there is a demand for automated tools (wrappers) that extract SRRs from the HTML result pages returned by search engines. Some search engines, like Google and Amazon, have web services interfaces, which make automated extraction easier, but the vast majority of search engines do not, and have no incentive to develop such interfaces because they support B2C (business-to-customer) applications only. XML has been used to deliver web data in many applications, yet almost all search engines still present their search results in HTML; applications that need to harvest data from search results must therefore deal with the problem of extracting results presented in HTML files.
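
A small sketch of the visual side of this idea is given below, assuming that element positions have already been obtained from a renderer (ViNTs uses the ICE browser for this); the coordinates and the grouping rule are invented for the example, and the real system combines many more visual and tag-structure features.

from collections import defaultdict

def group_by_alignment(boxes, x_tolerance=2):
    """Group rendered elements that share (roughly) the same left edge.

    boxes: list of (element_id, x, y) tuples taken from a rendered page.
    On a result page, search result records are usually left-aligned and
    stacked vertically, so a large group of elements with the same x
    coordinate is a good candidate for the list of result records.
    """
    groups = defaultdict(list)
    for elem, x, y in boxes:
        groups[round(x / x_tolerance)].append((y, elem))
    # Return the largest left-aligned group, ordered top to bottom.
    best = max(groups.values(), key=len)
    return [elem for _, elem in sorted(best)]

# Hypothetical bounding boxes: three result records at x = 40, plus two ads.
boxes = [("ad1", 600, 80), ("r1", 40, 100), ("r2", 40, 220),
         ("r3", 41, 340), ("ad2", 600, 200)]
print(group_by_alignment(boxes))  # ['r1', 'r2', 'r3']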

The Multiple Section Extraction (MSE) algorithm [6] addresses the problem of extracting results from such HTML pages. Complete data extraction from web pages (including result pages returned by search engines) may consist of three tasks: section extraction, i.e. extracting all the sections from each page; record extraction, i.e. extracting the records within each section; and data annotation, i.e. identifying and annotating each data unit within each record. Existing work on data extraction (wrapper generation) has mostly focused on record extraction. By extracting search result records from all dynamic sections and maintaining the section-record relationships, MSE allows an application to select the desired sections for data extraction. However, the problem of extracting incorrect sections is more serious than that of missing correct sections, and incorrect sections are common in HTML pages, which is why XML, which carries structured data, is employed in [7]. XML is emerging as a new standard for data representation and exchange on the web. An XML document can be accompanied by a Document Type Descriptor (DTD), which plays the role of a schema for an XML data collection. DTDs contain valuable information about the structure of documents and thus play a crucial role in the efficient storage of XML data, as well as in the effective formulation and optimization of XML queries. The XTRACT algorithms are used to find the common patterns in XML documents. The XTRACT inference algorithms employ a sequence of sophisticated steps: finding patterns in the input sequences and replacing them with regular expressions to generate general candidate DTDs, factoring the candidate DTDs using adaptations of algorithms from the logic optimization literature, and applying the MDL principle to find the best DTD among the candidates.
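
A flavour of DTD inference can be conveyed by a deliberately small sketch that generalizes observed child-element sequences into a regular-expression-like content model; XTRACT's actual candidate generation, factoring and MDL-based ranking are far more sophisticated than this.

from itertools import groupby

def infer_content_model(child_sequences):
    """Generalize observed child-element sequences into a simple content model.

    child_sequences: list of lists of child tag names seen for one element,
    e.g. for <book>: [['title', 'author'], ['title', 'author', 'author']].
    Consecutive repeats of a tag are collapsed to 'tag+', and the distinct
    generalized sequences are joined as alternatives with '|'.
    """
    patterns = set()
    for seq in child_sequences:
        parts = []
        for tag, run in groupby(seq):
            n = len(list(run))
            parts.append(tag if n == 1 else tag + "+")
        patterns.add(",".join(parts))
    return "(" + " | ".join(sorted(patterns)) + ")"

examples = [["title", "author"],
            ["title", "author", "author", "author"],
            ["title", "editor"]]
print("<!ELEMENT book " + infer_content_model(examples) + ">")
# <!ELEMENT book (title,author | title,author+ | title,editor)>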

The genesis of XML was based on the thesis that structured documents can be freely exchanged and manipulated if published in a standard, open format, and XML today promises to enable a suite of next-generation Web applications ranging from intelligent web searching to electronic commerce. A number of the DTDs correctly identified by XTRACT were fairly complex and contained factors, metacharacters and nested regular expression terms. Devising generic methods for extracting Web data is a complex (if not impossible) task, since the Web is very heterogeneous and there are no rigid guidelines on how to build HTML pages or how to declare the implicit structure of Web pages. Thus, in order to develop effective methods for extracting Web data in a precise and completely automatic manner, it is usually necessary to take into account specific characteristics of the domain of interest. One such domain is that of on-line newspapers and news portals on the Web, which have become one of the most important sources of up-to-date information. The algorithm presented for this domain is RTDM [8], which is based on a restricted form of tree edit distance.
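
A highly simplified sketch of the kind of restricted top-down comparison of tag trees that underlies such tree-edit-distance approaches is shown below, with pages represented as (tag, children) tuples; RTDM itself computes a proper edit distance with insertion, removal and replacement costs.

def top_down_similarity(a, b):
    """Count matching nodes in a restricted top-down comparison of two tag trees.

    a, b: trees given as (tag, [children]) tuples. Children are compared in
    order, and a subtree is explored only while its root tags match, which
    mirrors the top-down flavour of mappings used to detect structurally
    similar (template-generated) pages.
    """
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    matched = 1
    for ka, kb in zip(kids_a, kids_b):
        matched += top_down_similarity(ka, kb)
    return matched

def tree_size(t):
    return 1 + sum(tree_size(c) for c in t[1])

page1 = ("html", [("body", [("div", [("h1", []), ("p", [])]),
                            ("div", [("a", [])])])])
page2 = ("html", [("body", [("div", [("h1", []), ("span", [])]),
                            ("div", [("a", [])])])])
common = top_down_similarity(page1, page2)
print(common / max(tree_size(page1), tree_size(page2)))  # ~0.857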

The abundance of templates on the Web is considered harmful to many web mining and searching methods. Such methods usually base their judgment of relevance on the frequency and distribution of terms (words) and hyperlinks on web pages, and since templates contain a considerable number of common terms and hyperlinks that are replicated in a large number of pages, a relevance assessment that does not take templates into account may turn out to be inaccurate, leading to incorrect results. To address this problem, the RTDM-TD algorithm [9] is devised, which works in two steps: templates are first detected using a set of sample pages, and the patterns identified in the detection step are then used to remove the templates present in the other pages of the collection. This separation leads to an efficient process: although the detection task can be costly, it is applied only to a small number of pages, while template removal, which has to be applied to a large number of pages, can be done through an inexpensive procedure. Many websites have large collections of pages generated dynamically from an underlying structured source such as a database, and the data of a category are typically encoded into similar pages by a common script or template. In recent years, value-added services such as comparison shopping and vertical search in specific domains have motivated research into extraction technologies with high accuracy. Almost all previous work assumes that the input pages of a wrapper induction system conform to a common template and can easily be identified by a common URL schema. However, it is hard to distinguish different templates from dynamic URLs today, and since extraction accuracy heavily depends on how consistent the input pages are, it is risky to decide whether pages share a common template solely on the basis of URLs. Instead, a new approach is proposed that uses the similarity between pages to detect templates: it separates pages with notable inner differences and then generates a wrapper for each group.
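
One way to make the "group pages by structural similarity rather than by URL" step concrete is to cluster pages by the overlap of their tag paths. The following single-pass sketch uses Jaccard similarity with an arbitrary threshold and is only a naive stand-in for the proposed approach.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_pages(pages, threshold=0.7):
    """Greedy single-pass clustering of pages by tag-path similarity.

    pages: dict mapping page id -> set of root-to-leaf tag paths.
    Pages whose structures overlap strongly are assumed to share a
    template, so one wrapper can later be induced per cluster.
    """
    clusters = []  # list of (representative path set, [page ids])
    for pid, paths in pages.items():
        for rep, members in clusters:
            if jaccard(paths, rep) >= threshold:
                members.append(pid)
                break
        else:
            clusters.append((paths, [pid]))
    return [members for _, members in clusters]

pages = {
    "item1.html": {"html/body/div/h1", "html/body/div/span", "html/body/ul/li"},
    "item2.html": {"html/body/div/h1", "html/body/div/span", "html/body/ul/li"},
    "search.html": {"html/body/form/input", "html/body/table/tr/td"},
}
print(cluster_pages(pages))  # [['item1.html', 'item2.html'], ['search.html']]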

Unfortunately, most of the information on the Web is presented in a form accessible only to a human user, e.g. lists or tables that visually lay out relational data. Although newer technologies such as XML and the Semantic Web address this problem directly, only a small fraction of the information on the Web is semantically labeled, and the overwhelming majority of the available data has to be accessed in other ways. Extraction of records or tuples of data from lists or tables in HTML documents is of particular interest, as the majority of Web sites that belong to the hidden Web present their data in this manner, and record extraction is required for a multitude of applications, including web data mining and question answering. The main challenge in automatic extraction of data from tables is the great variability in HTML table styles and layouts. The CSP (constraint satisfaction) approach [11] is proposed to overcome this challenge. CSP is very reliable on clean data, but it is sensitive to errors and inconsistencies in the data source. One such source of inconsistency was observed on the Michigan corrections site, where an attribute had one value on the list pages and another value on the detail pages. This by itself is not a problem; however, the list-page string appeared on one detail page in an unrelated context, and the CSP algorithm could not find an assignment of the variables that satisfied all the constraints.
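
To illustrate the constraint-satisfaction framing, the toy sketch below assigns attribute labels to the columns of an extracted table by backtracking over candidate labels whose type constraints every value in the column must satisfy; the labels, constraints and data are invented for the example and far simpler than those used in [11].

def consistent(label, column):
    """Type constraints each candidate label imposes on a column's values."""
    checks = {
        "name":   lambda v: v.replace(" ", "").isalpha(),
        "id":     lambda v: v.isdigit(),
        "status": lambda v: v in {"in custody", "released", "paroled"},
    }
    return all(checks[label](v) for v in column)

def assign_labels(columns, labels, partial=None):
    """Backtracking search: one distinct label per column, all constraints met."""
    partial = partial or []
    if len(partial) == len(columns):
        return partial
    col = columns[len(partial)]
    for label in labels:
        if label not in partial and consistent(label, col):
            result = assign_labels(columns, labels, partial + [label])
            if result:
                return result
    return None  # no assignment satisfies all the constraints

# Columns as extracted from a hypothetical list page (one value list per column).
columns = [["10482", "99120"], ["John Doe", "Mary Roe"], ["paroled", "released"]]
print(assign_labels(columns, ["name", "id", "status"]))  # ['id', 'name', 'status']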

Information extraction from Web documents is a critical issue for software agents on the Internet. A novel technique is introduced that extracts information blocks without training examples, using a data structure called a PAT tree. PAT trees allow the system to efficiently recognize repeated patterns in a semi-structured Web page. From these repeated patterns, information blocks can easily be located based on pattern length and repeat counts; afterwards, ordered result data are obtained from those blocks through a string alignment algorithm. A Patricia tree is a particular implementation of a binary digital tree (or trie, for short) in which the abstract data type sistring (semi-infinite string) is represented as a suffix that ends with a special character not occurring anywhere in the input string. Like a suffix tree, the Patricia tree stores all its data at the external nodes and keeps one integer, the bit index, in each internal node to indicate which bit of a query is to be used for branching. This avoids empty subtrees and guarantees that every internal node has non-null descendants.
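
The role the PAT tree plays, discovering repeated patterns in the encoded tag string of a page, can be imitated (far less efficiently) with a plain dictionary of substring occurrences; the one-letter tag encoding below is an assumption made only for the example.

from collections import defaultdict

def repeated_patterns(s, min_len=3, min_count=2):
    """Find repeated substrings of an encoded tag string (naive O(n^2) scan).

    A PAT tree (Patricia tree over all suffixes) supports this discovery
    efficiently; here a dictionary of substring counts stands in for it.
    Returns {pattern: count} for patterns that are long and frequent enough.
    """
    counts = defaultdict(int)
    for i in range(len(s)):
        for j in range(i + min_len, len(s) + 1):
            counts[s[i:j]] += 1
    patterns = {p: c for p, c in counts.items() if c >= min_count}
    # Keep only maximal patterns: drop any pattern contained in a longer one
    # that occurs just as often.
    return {p: c for p, c in patterns.items()
            if not any(p != q and p in q and patterns[q] >= c for q in patterns)}

# Tag string of a page region, with each tag encoded as one letter,
# e.g. T=<td>, A=<a>, I=<img>: three repeated record blocks "TAI".
print(repeated_patterns("XTAIYTAIZTAIW"))  # {'TAI': 3}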

A large amount of information on the Web is contained in regularly structured objects, called data records. Such data records are important because they often present the essential information of their host pages, e.g. lists of products or services, and it is useful to mine them in order to extract information and provide value-added services. Existing automatic techniques are not satisfactory because of their poor accuracy, so a more effective technique is devised, based on two observations about data records on the Web and a string matching algorithm. The proposed technique, MDR (Mining Data Records in Web pages) [14], is able to mine both contiguous and non-contiguous data records. It currently finds all data records formed by table- and form-related tags, i.e. table, form, tr, td, etc. Identifying data records from each generalized node (a group of adjacent sibling nodes identified by the algorithm) is relatively easy, because they are nodes (together with their subtrees) at the same level as the generalized node, or at a lower level of the tag tree. MDR, however, only identifies data records; it does not align or extract data items from them. Hence the two-step approach DEPTA (Data Extraction based on Partial Tree Alignment) [15], which is very different from all existing methods, is introduced. As long as a page contains at least two data records, the system will automatically find them. A group of data records containing descriptions of a set of similar objects is typically presented in a contiguous region of a page and formatted using similar HTML tags. The problem with this approach is that the computation is prohibitive, because a data record can start at any tag and end at any tag, and the key task is to match corresponding data items or fields across all the data records. Web pages contain a combination of unique content and template material, which is present across multiple pages and used primarily for formatting, navigation and branding. Two algorithms are proposed, one based on the DOM structure of the web page and the other on syntactic sequences of characters. These methods need multiple pages from the same website for template detection and are error-prone when the number of pages analyzed from a site is statistically insignificant.
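
The data-item matching task can be pictured with the flat sketch below, which aligns the field sequences of several records against a seed record; this is only a simplified nod to partial tree alignment, which in DEPTA operates on whole tag subtrees rather than flat field lists.

from difflib import SequenceMatcher

def partial_align(records):
    """Align data items from several records into a common set of columns.

    records: list of records, each a list of (field_tag, value) pairs in page
    order. The record with the most fields is used as the seed; the others are
    aligned to it by field tag, and tags missing from the seed are appended as
    new columns.
    """
    tag_seqs = [[tag for tag, _ in rec] for rec in records]
    seed = max(tag_seqs, key=len)
    columns = list(seed)
    aligned = []
    for tags, rec in zip(tag_seqs, records):
        row = {}
        for op, i1, i2, j1, j2 in SequenceMatcher(None, seed, tags).get_opcodes():
            if op == "equal":
                for k in range(i2 - i1):
                    row[seed[i1 + k]] = rec[j1 + k][1]
            elif op in ("insert", "replace"):
                for tag, value in rec[j1:j2]:
                    if tag not in columns:
                        columns.append(tag)
                    row[tag] = value
        aligned.append(row)
    return columns, [[row.get(c, "-") for c in columns] for row in aligned]

records = [
    [("name", "Pen"), ("price", "$1")],
    [("name", "Book"), ("price", "$8"), ("rating", "4.5")],
    [("name", "Lamp"), ("shipping", "free")],
]
cols, table = partial_align(records)
print(cols)           # ['name', 'price', 'rating', 'shipping']
for row in table:
    print(row)        # '-' marks a field the record does not contain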
