CN110020312A - The method and apparatus for extracting Web page text - Google Patents
- Publication number
- CN110020312A CN201711306108.3A
- Authority
- CN
- China
- Prior art keywords
- text
- region
- webpage
- value
- constituent parts
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and apparatus for extracting the main text of a web page, and relates to the field of computer technology. One specific embodiment of the method includes: constructing an access model from the web page to be extracted; calculating a similarity value between each unit region of the main part and the feature part; screening a unit text region out of the access model according to the similarity values and the first index value of each unit region; and determining the beginning and end of the main text of the web page to be extracted according to the unit text region, so as to obtain the complete main text of the web page. This embodiment can extract the main text of a web page accurately and completely, reduces labor cost, and improves extraction efficiency.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for extracting the main text of a web page.
Background
With the rapid development of society, the Internet has become the main platform for publishing and obtaining information, and the data on it has been growing geometrically. Internet data covers every field of the real world, such as economics, politics, and culture, and is an important information source for many applications. However, besides the main text that people need, a web page also contains content unrelated to the main text, such as copyright information, advertisements, navigation bars, and decorative information, collectively called noise information. How to filter out noise information and extract the main text of a web page has become a hot topic of current research.
Current methods for extracting the main text of a web page fall into three categories: (1) template-based extraction; (2) extraction based on block text density; (3) extraction based on visual page segmentation. In template-based extraction, template information must be maintained manually, and the main text content is then extracted according to that template information. In extraction based on block text density, a line-block distribution function is first obtained from the proportion of text in each line, and the line blocks whose text proportion exceeds a threshold are then identified to determine the main text content. In extraction based on visual page segmentation, the page is first divided into multiple page blocks according to visual information, and the page blocks are then merged using the separators in the HTML tags to obtain the main text.
In the course of realizing the present invention, the inventors found at least the following problems in the prior art: (1) template-based extraction requires manual participation and a heavy workload, and the template must be reconfigured whenever the page structure changes; (2) extraction based on block text density has difficulty determining the beginning and end of the main text, so the extracted text is often incomplete; (3) extraction based on visual page segmentation requires engines such as a JavaScript engine, is highly complex, and is very time-consuming; (4) none of the prior-art methods is applicable to main-text extraction for all types of web pages.
Summary of the invention
In view of this, embodiments of the present invention provide a method and apparatus for extracting the main text of a web page, which can extract the main text accurately and completely, reduce labor cost, and improve extraction efficiency.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for extracting the main text of a web page is provided.
A method for extracting the main text of a web page according to an embodiment of the present invention includes: constructing an access model from the web page to be extracted, the access model including a feature part and a main part; calculating a similarity value between each unit region of the main part and the feature part; screening a unit text region out of the access model according to the similarity values and a first index value of each unit region; and determining the beginning and end of the main text of the web page to be extracted according to the unit text region, so as to obtain the complete main text of the web page to be extracted.
Optionally, before the access model is constructed from the web page to be extracted, the method further includes: standardizing the source code of the web page to be extracted.
Optionally, calculating the similarity value between each unit region of the main part and the feature part includes: calculating a second index value of the feature part and a second index value of each unit region of the main part; and calculating the similarity value between the feature part and each unit region using the second index value of the feature part and the second index value of each unit region.
Optionally, screening the unit text region out of the access model according to the similarity values and the first index value of each unit region includes: selecting a suspected text region from the access model according to the first index value; and screening the unit text region out of the suspected text region using the similarity values.
Optionally, screening the unit text region out of the suspected text region using the similarity values includes: comparing the similarity values of the unit regions in the suspected text region, and choosing the unit region with the largest similarity value as the unit text region.
Optionally, determining the beginning and end of the main text of the web page to be extracted according to the unit text region includes: iterating over the unit regions upward and downward, centered on the unit text region, and judging whether each unit region meets a preset text condition; if a unit region does not meet the preset text condition, the iteration stops, thereby determining the beginning and end of the main text of the web page to be extracted.
Optionally, judging whether each unit region meets the preset text condition includes: judging whether the similarity value of the unit region is greater than a preset similarity threshold, and if so, determining that the unit region meets the preset text condition; and/or judging whether the link ratio of the unit region is less than a preset link ratio threshold, and if so, determining that the unit region meets the preset text condition; and/or judging whether the symbol ratio of the unit region is greater than a preset symbol ratio threshold, and if so, determining that the unit region meets the preset text condition.
Optionally, after the beginning and end of the main text of the web page to be extracted are determined according to the unit text region, the method further includes: obtaining additional text information of the web page to be extracted, where the additional text information includes at least one of the following: text title, author, date, and source.
Optionally, the access model is a document object model.
Optionally, each unit region takes a line as its unit.
Optionally, the first index value indicates attribute information of each unit region and includes: the unit density of each unit region.
Optionally, the second index value indicates attribute information of a region in the web page and includes: a feature vector value.
To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for extracting the main text of a web page is provided.
An apparatus for extracting the main text of a web page according to an embodiment of the present invention includes: a construction module, configured to construct an access model from the web page to be extracted, the access model including a feature part and a main part; a calculation module, configured to calculate a similarity value between each unit region of the main part and the feature part; a screening module, configured to screen a unit text region out of the access model according to the similarity values and a first index value of each unit region; and a determination module, configured to determine the beginning and end of the main text of the web page to be extracted according to the unit text region, so as to obtain the complete main text of the web page to be extracted.
Optionally, the construction module is further configured to: standardize the source code of the web page to be extracted before the access model is constructed from the web page to be extracted.
Optionally, the calculation module is further configured to: calculate a second index value of the feature part and a second index value of each unit region of the main part; and calculate the similarity value between the feature part and each unit region using the second index value of the feature part and the second index value of each unit region.
Optionally, the screening module is further configured to: select a suspected text region from the access model according to the first index value; and screen the unit text region out of the suspected text region using the similarity values.
Optionally, the screening module is further configured to: compare the similarity values of the unit regions in the suspected text region, and choose the unit region with the largest similarity value as the unit text region.
Optionally, the determination module is further configured to: iterate over the unit regions upward and downward, centered on the unit text region, and judge whether each unit region meets a preset text condition; if a unit region does not meet the preset text condition, stop the iteration, thereby determining the beginning and end of the main text of the web page to be extracted.
Optionally, the determination module is further configured to: judge whether the similarity value of the unit region is greater than a preset similarity threshold, and if so, determine that the unit region meets the preset text condition; and/or judge whether the link ratio of the unit region is less than a preset link ratio threshold, and if so, determine that the unit region meets the preset text condition; and/or judge whether the symbol ratio of the unit region is greater than a preset symbol ratio threshold, and if so, determine that the unit region meets the preset text condition.
Optionally, the determination module is further configured to: obtain additional text information of the web page to be extracted, where the additional text information includes at least one of the following: text title, author, date, and source.
Optionally, the access model is a document object model.
Optionally, each unit region takes a line as its unit.
Optionally, the first index value indicates attribute information of each unit region and includes: the unit density of each unit region.
Optionally, the second index value indicates attribute information of a region in the web page and includes: a feature vector value.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device is provided.
An electronic device according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for extracting the main text of a web page according to the embodiments of the present invention.
To achieve the above object, according to still another aspect of the embodiments of the present invention, a computer-readable medium is provided.
A computer-readable medium according to an embodiment of the present invention stores a computer program which, when executed by a processor, implements the method for extracting the main text of a web page according to the embodiments of the present invention.
One embodiment of the above invention has the following advantages or beneficial effects: because the beginning and end of the main text can be determined, the complete main text of a web page can be extracted intelligently, which reduces labor cost and improves extraction efficiency; the source code of the web page to be extracted is standardized, which facilitates constructing the access model from the standardized source code and reduces extraction time, and the method is applicable to main-text extraction for various types of web pages; the second index value of the feature part and the second index value of each unit region of the main part are calculated, so that the similarity value between the feature part and each unit region can be conveniently calculated from the second index values; a suspected text region is selected by the first index value of each unit region, which narrows the range in which the main text is sought and improves extraction efficiency; the similarity values of the unit regions in the suspected text region are compared and the unit region with the largest similarity value is taken as the unit text region, which improves extraction accuracy; the unit regions are iterated over upward and downward, centered on the unit text region, so that the beginning and end of the main text can be determined and the complete main text is extracted; whether each unit region meets the preset text condition is judged from multiple angles, such as the similarity value, the link ratio, and/or the symbol ratio, which further improves extraction accuracy; additional text information of the web page to be extracted is obtained, which improves the completeness of the extracted text; the first index value may include the unit density of each unit region, so that the suspected text region can be selected using unit density as the attribute; and the second index value may include a feature vector value, so that the similarity value can be calculated from feature vector values.
Further effects of the above non-conventional optional implementations are described below in conjunction with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the main steps of a method for extracting the main text of a web page according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main flow of a method for extracting the main text of a web page according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of standardized source code and its corresponding DOM tree;
Fig. 4 is a schematic diagram of the main steps of calculating the similarity value between each text line and the feature information in a method for extracting the main text of a web page according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main steps of screening out the text line in a method for extracting the main text of a web page according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the line density function obtained by a method for extracting the main text of a web page according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the main steps of determining the beginning and end of the main text in a method for extracting the main text of a web page according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of the main modules of an apparatus for extracting the main text of a web page according to an embodiment of the present invention;
Fig. 9 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 10 is a schematic structural diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Current methods for extracting the main text of a web page do not reach the desired level. Starting from the characteristics of current web page main text and weighing the advantages and disadvantages of the prior art, the present invention devises a method for intelligently extracting the main text of a web page that can extract it accurately and completely, reduce labor cost, and improve extraction efficiency. The characteristics of web page main text may include: its sentences are long; it contains many sentences; it is correlated with the title; it is located in the middle of the page; it contains a low proportion of hyperlinks; and it contains more punctuation marks than other modules.
Fig. 1 is a schematic diagram of the main steps of a method for extracting the main text of a web page according to an embodiment of the present invention. As shown in Fig. 1, the method mainly includes the following steps:
Step S101: construct an access model from the web page to be extracted. In the present invention, the access model may include a feature part and a main part. The feature part may store feature information of the web page, such as the title, keywords, and abstract. The main part may store the text information of the web page.
Step S102: calculate the similarity value between each unit region of the main part and the feature part. In embodiments of the present invention, the main part of the access model may be divided into multiple unit regions, and the similarity value between each unit region and the feature part is then calculated.
Step S103: screen the unit text region out of the access model according to the similarity values and the first index value of each unit region. In embodiments of the present invention, the similarity value between each unit region and the feature part is obtained in step S102 and combined with the first index value of each unit region to judge whether each unit region is a unit text region.
Step S104: determine the beginning and end of the main text of the web page to be extracted according to the unit text region, so as to obtain the complete main text of the web page to be extracted.
In embodiments of the present invention, before the access model is constructed from the web page to be extracted, the method may further include standardizing the source code of the web page to be extracted. Here, standardization includes removing script content and converting special characters. To satisfy users' visual experience, web page source code embeds large amounts of JS (JavaScript, a scripting language for the web that adds dynamic functions to pages and gives users a smoother, more attractive browsing experience) and CSS (Cascading Style Sheets, a language for expressing document style, which can not only style a page statically but also work with scripting languages to format page elements dynamically). The role of this content is to decorate the page; it is unrelated to the main text and interferes heavily with extraction, so script and style content unrelated to the main text can be removed. In addition, to ease subsequent processing, special characters in the source code can be converted to their conventional form, for example converting &lt; to < and &gt; to >.
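The standardization step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name standardize and the regular expression are hypothetical, the patent itself names Jsoup rather than Python, and decoding every entity in raw markup is a simplification that a production parser would apply to text nodes only.

```python
import html
import re

# Hypothetical pattern: a <script> or <style> element together with its content.
_SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def standardize(source: str) -> str:
    """Remove script/style blocks, then decode character entities."""
    without_scripts = _SCRIPT_OR_STYLE.sub("", source)
    # html.unescape turns entities such as &lt; and &gt; into < and >.
    return html.unescape(without_scripts)


page = "<p>a &lt; b</p><script>var x = 1;</script><style>p {}</style>"
print(standardize(page))
```

Removing scripts and styles before any density or similarity calculation keeps their characters from being counted as page text.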
In embodiments of the present invention, calculating the similarity value between each unit region of the main part and the feature part may include: calculating a second index value of the feature part and a second index value of each unit region of the main part; and calculating the similarity value between the feature part and each unit region using the second index value of the feature part and the second index value of each unit region. In embodiments of the present invention, the feature part may store feature information of the web page, such as the title, keywords, and abstract, so the second index value of the feature part can be generated from this feature information and used as the second index model value. The second index model value and the second index value of each unit region are then used to calculate the similarity value between the feature part and each unit region. In embodiments of the present invention, the cosine between the second index model value and the second index value of each unit region can be calculated with the cosine formula and taken as the similarity value between the feature part and that unit region; the closer the cosine is to 1, the higher the similarity. Of course, embodiments of the present invention may also obtain the similarity value between the feature part and each unit region by other algorithms, which is not limited here.
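A minimal sketch of the cosine calculation is given below. The description does not say how the feature vector is built, so the term-frequency vector over whitespace-separated tokens used here (term_vector) is purely an assumption for illustration; Chinese text would need a word segmenter instead.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Hypothetical feature vector: term frequencies of whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


feature = term_vector("storm reaches coast")  # e.g. title and keywords
line = term_vector("the storm reaches the coast on monday")
print(cosine_similarity(feature, line))
```

Because term frequencies are non-negative, the value lies between 0 and 1, which matches the statement that a cosine closer to 1 means higher similarity.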
In embodiments of the present invention, screening the unit text region out of the access model according to the similarity values and the first index value of each unit region may include: selecting a suspected text region from the access model according to the first index value; and screening the unit text region out of the suspected text region using the similarity values. There may be one or more suspected text regions, and the unit text region may consist of one or more unit regions.
In embodiments of the present invention, screening the unit text region out of the suspected text region using the similarity values may include: comparing the similarity values of the unit regions in the suspected text region, and choosing the unit region with the largest similarity value as the unit text region. If several unit regions in the suspected text region share the largest similarity value, all of them may be taken as unit text regions, or any one of them may be chosen as the unit text region; other selection methods are of course also possible, and the present invention does not limit this.
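The selection rule above, including the tie case, might be sketched as follows. The function name and the representation of a suspected text region as a list of line indices are assumptions made for illustration; this variant keeps every line tied for the maximum, which is one of the options the description allows.

```python
def pick_unit_text_lines(candidate_lines, similarity):
    """Return every candidate index whose similarity equals the maximum.

    candidate_lines: indices of the lines in the suspected text region.
    similarity: similarity value of every line, indexed by line number.
    """
    if not candidate_lines:
        return []
    best = max(similarity[i] for i in candidate_lines)
    return [i for i in candidate_lines if similarity[i] == best]


similarity = [0.1, 0.7, 0.3, 0.7]
print(pick_unit_text_lines([1, 2, 3], similarity))
```

Taking only the first element of the returned list gives the alternative behavior in which a single tied line is chosen.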
In embodiments of the present invention, determining the beginning and end of the main text of the web page to be extracted according to the unit text region may include: iterating over the unit regions upward and downward, centered on the unit text region, and judging whether each unit region meets the preset text condition; if a unit region does not meet the preset text condition, the iteration stops, thereby determining the beginning and end of the main text of the web page to be extracted. After the unit text region is screened out in step S103, the unit regions above it are traversed iteratively, starting from the unit text region. First, it is judged whether the unit region immediately above the unit text region meets the preset condition. If it does, that unit region belongs to the main text and the upward iteration continues; if it does not, that unit region does not belong to the main text, and the beginning of the main text has been determined. Similarly, the same method can be used to iterate over the unit regions below the unit text region and determine the end of the main text.
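The upward and downward traversal can be sketched as below. The function name find_text_span and the callback meets_condition are hypothetical; the callback stands in for whatever preset text condition is chosen.

```python
def find_text_span(center, line_count, meets_condition):
    """Grow a span around `center` until a neighbor fails the condition.

    Returns (start, end) as inclusive line indices.
    """
    start = center
    # Walk upward: stop at the first line that does not meet the condition.
    while start - 1 >= 0 and meets_condition(start - 1):
        start -= 1
    end = center
    # Walk downward in the same way to find the end of the main text.
    while end + 1 < line_count and meets_condition(end + 1):
        end += 1
    return start, end


flags = [False, True, True, True, False, True]
print(find_text_span(2, len(flags), lambda i: flags[i]))
```

Note that line 5 in the example is not included even though it meets the condition, because the traversal stops at the first failing line (line 4), as the description specifies.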
In embodiments of the present invention, judging whether each unit region meets the preset text condition may include: judging whether the similarity value of the unit region is greater than a preset similarity threshold, and if so, determining that the unit region meets the preset text condition; and/or judging whether the link ratio of the unit region is less than a preset link ratio threshold, and if so, determining that the unit region meets the preset text condition; and/or judging whether the symbol ratio of the unit region is greater than a preset symbol ratio threshold, and if so, determining that the unit region meets the preset text condition. The preset similarity threshold may be obtained by calculating the arithmetic mean of the similarity values of the unit regions, or by other methods. The link ratio may be the ratio of the number of links to the number of characters, and the symbol ratio may be the ratio of the number of symbols to the number of characters.
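A sketch of one possible preset text condition follows. The description joins the three tests with "and/or"; this example combines them with a logical AND, which is only one of the permitted combinations. The Line class, the punctuation set, and the default thresholds 0.1 and 0.02 are all illustrative assumptions, not values from the patent.

```python
from dataclasses import dataclass

# Assumed punctuation set (ASCII plus common Chinese marks).
PUNCTUATION = set(",.;:!?，。；：！？、")


@dataclass
class Line:
    text: str
    link_count: int
    similarity: float


def link_ratio(line: Line) -> float:
    """Number of links divided by number of characters."""
    return line.link_count / len(line.text) if line.text else 0.0


def symbol_ratio(line: Line) -> float:
    """Number of punctuation symbols divided by number of characters."""
    if not line.text:
        return 0.0
    return sum(ch in PUNCTUATION for ch in line.text) / len(line.text)


def mean_similarity(lines) -> float:
    """Arithmetic mean of the similarity values, used as the threshold."""
    return sum(line.similarity for line in lines) / len(lines) if lines else 0.0


def meets_text_condition(line, similarity_threshold,
                         max_link_ratio=0.1, min_symbol_ratio=0.02) -> bool:
    return (line.similarity > similarity_threshold
            and link_ratio(line) < max_link_ratio
            and symbol_ratio(line) > min_symbol_ratio)


lines = [
    Line("Home | About | Contact", 3, 0.05),
    Line("The storm reached the coast on Monday, and thousands were evacuated.", 0, 0.6),
]
threshold = mean_similarity(lines)
print([meets_text_condition(line, threshold) for line in lines])
```

Replacing the AND with an OR, or using only one of the three tests, gives the other combinations that the "and/or" wording covers.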
In embodiments of the present invention, after the beginning and end of the main text of the web page to be extracted are determined according to the unit text region, the method may further include: obtaining additional text information of the web page to be extracted. The additional text information may include at least one of the following: text title, author, date, and source. In embodiments of the present invention, the text title can be looked up in the main part using the feature information of the feature part. Once the positions of the text title and the main text have been determined, information such as the author, date, and source can be extracted with regular expressions (a computer-science concept commonly used to retrieve or replace text that matches a given rule). The date of an article generally lies between the title and the main text content and is stored in a regular pattern, so it can be extracted with a regular expression. Information such as the source and author generally lies between the title and the main text content, or below the main text, and is stored in a regular pattern, so it too can be extracted with regular expressions.
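As an illustration of the regular-expression step, the sketch below pulls a date, author, and source out of the text that sits between the title and the main text. The three patterns and the "Source:" and "Author:" labels are hypothetical; real pages use many different formats, and the patent does not specify any pattern.

```python
import re

# Hypothetical patterns for one common layout of article metadata.
DATE = re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}")
AUTHOR = re.compile(r"Author\s*[:：]\s*([^|\n]+)")
SOURCE = re.compile(r"Source\s*[:：]\s*([^|\n]+)")


def extract_metadata(between_title_and_body: str) -> dict:
    """Return the first date, author, and source found, or None for each."""
    def first(pattern):
        match = pattern.search(between_title_and_body)
        if match is None:
            return None
        # Use the captured group when the pattern has one, else the whole match.
        value = match.group(1) if pattern.groups else match.group(0)
        return value.strip()

    return {"date": first(DATE), "author": first(AUTHOR), "source": first(SOURCE)}


print(extract_metadata("2017-12-11 09:30 | Source: Example News | Author: Zhang San"))
```

Restricting the search to the text near the title or just below the main text, as the description suggests, keeps such loose patterns from matching dates or names inside the article itself.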
In embodiments of the present invention, the access model may be a document object model, such as a DOM tree (the Document Object Model, abbreviated DOM, is the standard programming interface recommended by the World Wide Web Consortium for processing extensible markup languages).
In embodiments of the present invention, each unit region may take a line as its unit. Of course, other units may also be chosen in embodiments of the present invention.
In embodiments of the present invention, the first index value indicates attribute information of each unit region and may include the unit density of each unit region. For ease of understanding, the unit density is described in detail below with the line as the unit region, so that "unit density" becomes "line density". "Line" is of course not intended to limit the protection scope of the technical solution of the present invention, and "line density" can be adapted to the specific business scenario. In embodiments of the present invention, the unit density can be obtained by the following calculation. First, obtain the line block of each line. Taking line 1 as an example, take k lines downward, where k is set according to the specific situation; when k is 3, the line block of line 1 is "the text of lines 1 to 4". Then, calculate the line block length of each line. Taking the line block of line 1 as an example, remove the whitespace characters from the line block, count its total number of characters, and add (the number of punctuation marks in line 1 * k). Because the main text of a web page contains punctuation marks while other parts generally do not, the term (number of punctuation marks * k) acts as a weighting. Finally, the line density of each line is: line block length / (k+1). In the present invention, other methods may also be chosen to calculate the unit density, which is not limited here.
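The line density calculation described above can be sketched as follows. The punctuation set is an assumption, as is the handling of the last lines of the page, where fewer than k following lines exist and the block is simply truncated while the divisor stays k+1.

```python
# Assumed punctuation set (ASCII plus common Chinese marks).
PUNCTUATION = set(",.;:!?，。；：！？、")


def line_density(lines, i, k=3):
    """Line density of line i: weighted line-block length divided by k + 1."""
    # Line block: line i plus the k lines below it (truncated at the page end).
    block = "".join(lines[i:i + k + 1])
    # Character total of the block with all whitespace removed.
    char_total = len("".join(block.split()))
    # Punctuation marks of line i itself, weighted by k.
    punctuation = sum(ch in PUNCTUATION for ch in lines[i])
    return (char_total + punctuation * k) / (k + 1)


lines = ["Hello, world.", "abc", "", "de f"]
print(line_density(lines, 0))  # (18 + 2 * 3) / 4
```

For the first line the block holds 18 non-whitespace characters and the line has 2 punctuation marks, so with k = 3 the density is (18 + 6) / 4 = 6.0.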
In embodiments of the present invention, the second index value indicates attribute information of a region in the web page and may include a feature vector value. In the present invention, the feature vector value of the feature part and the feature vector value of each unit region can be used to calculate the similarity value between the feature part and each unit region.
For ease of understanding, Figs. 2 to 7 describe embodiments of the present invention in detail with the line as the unit, the "access model" taken as a "DOM tree", the "first index value" taken as "line density", and the "second index value" taken as "feature vector value". Of course, "line as the unit" is not intended to limit the protection scope of the technical solution of the present invention, and the "DOM tree", "line density", and "feature vector value" can be adapted to the specific business scenario.
Fig. 2 is a schematic diagram of the main flow of the method for extracting web page main text according to an embodiment of the present invention. As shown in Fig. 2, the method mainly includes the following flow: step S201, load the source code of the web page to be extracted and standardize the source code; step S202, construct a text DOM tree from the standardized source code; step S203, extract the feature information of the web page according to the DOM tree and determine the title information of the main text; step S204, calculate the similarity value between each text line and the feature information, and the line density of each text line; step S205, select suspected main-text blocks according to the similarity values and line densities, then screen out the main-text line from the suspected main-text blocks; step S206, iteratively traverse the rows above and below the main-text line to determine the beginning and end of the main text; step S207, determine the additional information of the main text.
Step S201 loads the source code of the web page to be extracted and standardizes it. The detailed process may include: loading the source code of the web page to be extracted with Jsoup (a software package for parsing web page content); parsing the source code and converting the format of the loaded source code; removing script and style content such as JS and CSS; and handling special characters.
Step S202 constructs a text DOM tree from the standardized source code. Fig. 3 is a schematic diagram of standardized source code and its corresponding DOM tree. In the present invention, the DOM tree can be constructed with Jsoup and then stored in the form of "text information and corresponding node tag group", forming a text list. Each row is handled as an object: one row is one piece of text and corresponds to one tag. The sequence number of the row in the page and the number of links, punctuation marks, and characters in the row are also maintained in the text list. The "text information and corresponding node tag group" form of the DOM tree in Fig. 3 can be as follows:
"HTML Tree": html → head → title → text;
"Hello!": html → body → table → tr → td → text;
"This is an HTML tree.": html → body → table → tr → td → text.
Here, the text information "Hello!" and "This is an HTML tree." have the same node tag group.
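The "text information and corresponding node tag group" list can be illustrated with a short sketch. The embodiment uses Jsoup; the version below swaps in Python's standard `html.parser` purely for illustration, and the names `TagPathCollector`, `text_rows`, `VOID_TAGS`, and `SKIPPED_TAGS` are assumptions. It also drops script and style content, which loosely mirrors the standardization in step S201.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed, so never pushed
SKIPPED_TAGS = {"script", "style"}                        # content ignored as non-text

class TagPathCollector(HTMLParser):
    """Collects one (text, node tag group) pair per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.rows = []   # (text, node tag group) in page order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIPPED_TAGS.intersection(self.stack):
            self.rows.append((text, " → ".join(self.stack + ["text"])))

def text_rows(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.rows
```

Applied to markup shaped like the Fig. 3 example, the two table cells end up with the same node tag group, which is what later allows neighbouring rows to be grouped.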
Step S203 extracts the feature information of the web page according to the DOM tree and determines the title information of the main text. The DOM tree exposes both the feature information and the body information of the web page: the head tag of the DOM tree holds the feature information of the web page, for example the title content, keywords, and abstract, while the body tag holds the textual content of the web page. According to the DOM tree, text information such as the title content, keywords, and abstract is extracted through tag paths such as html → head → title → text. From the extracted title content, the position of the title content within the body tag can be found.
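As a rough illustration of step S203, the sketch below reads the title, keywords, and abstract from the head of a page. It assumes that the keywords and abstract are carried in `<meta name="keywords">` and `<meta name="description">` tags, which the source does not state; `HeadFeatureParser` and `extract_head_features` are hypothetical names.

```python
from html.parser import HTMLParser

class HeadFeatureParser(HTMLParser):
    """Collects the title text and the keywords/description meta contents."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.features = {"title": "", "keywords": "", "description": ""}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.features[name] = (attributes.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.features["title"] += data.strip()

def extract_head_features(html):
    parser = HeadFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.features
```

The extracted title can then be searched for among the body rows to locate the title of the main text.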
Step S204 calculates the similarity value between each text line and the feature information, and the line density of each text line. Step S203 yields the feature information of the web page, i.e., the title content, keywords, abstract, and similar information, and the main text of the web page has a certain correlation with this information.
Fig. 4 is a schematic diagram of the main steps of calculating the similarity value between each text line and the feature information in the method for extracting web page main text according to an embodiment of the present invention. As shown in Fig. 4, the main steps may include: step S401, perform stop-word removal and word segmentation on the feature information to obtain n feature words, and count the term frequency of these feature words, where stop words are characters or words that are automatically filtered out before or after processing natural language data in information retrieval in order to save storage space and improve search efficiency; step S402, calculate the TF-IDF value of each feature word according to the TF-IDF algorithm, where TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining, the TF-IDF value of a word can be calculated with the TF-IDF algorithm, and the more important a word is to an article, the larger its TF-IDF value; step S403, obtain a set of feature values as the model feature vector value of the web page to be extracted, D = D(W_1, W_2, …, W_n), where W_1 is the term frequency of the 1st feature word * the TF-IDF value of the 1st feature word; step S404, traverse each line of text, segment it into words, and calculate the feature vector value of each row; step S405, calculate the cosine of the angle between the feature vector value of each row and the model feature vector value as the similarity value between that line of text and the feature information, where the cosine formula is expressed as:

$$\mathrm{Sim}(D, D_i) = \frac{\sum_{k=1}^{n} W_k W_{ik}}{\sqrt{\sum_{k=1}^{n} W_k^{2}}\,\sqrt{\sum_{k=1}^{n} W_{ik}^{2}}}$$

where Sim(D, D_i) is the similarity value between the i-th line of text and the model feature vector, and D_i = D(W_i1, W_i2, …, W_in) is the feature vector value of the i-th row.
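Steps S403 to S405 can be sketched as below. The sketch assumes that word segmentation and stop-word removal have already produced token lists and that the TF-IDF value of each feature word is supplied in a dictionary; `feature_vector`, `cosine_similarity`, and the sample weights are hypothetical.

```python
import math
from collections import Counter

def feature_vector(tokens, feature_words, tfidf):
    """One component per feature word: term frequency * TF-IDF value."""
    counts = Counter(tokens)
    return [counts[word] * tfidf.get(word, 0.0) for word in feature_words]

def cosine_similarity(d, d_i):
    """Cosine of the angle between two equally long vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(d, d_i))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in d_i))
    return dot / norm if norm else 0.0
```

A row that shares no feature words with the head information gets an all-zero vector and therefore a similarity value of 0, so navigation rows and advertisements tend to score low.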
Step S205 selects suspected main-text blocks according to the similarity values and line densities, and then screens out the main-text line from the suspected main-text blocks. Fig. 5 is a schematic diagram of the main steps of screening out the main-text line in the method for extracting web page main text according to an embodiment of the present invention. As shown in Fig. 5, the main steps may include: step S501, obtain a line density function from the line densities of the text lines; step S502, obtain the suspected main-text blocks from the regions where the line density function rises and drops sharply; step S503, traverse the suspected main-text blocks and take the text line with the greatest similarity value as the main-text line.
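Step S503 amounts to an arg-max over the rows inside the suspected main-text blocks. A minimal sketch, assuming each block is given as a hypothetical `(start, end)` pair of 0-based row indices with `end` exclusive:

```python
def pick_main_text_line(blocks, similarities):
    """Return the row index with the greatest similarity value inside the blocks, or None."""
    rows = [row for start, end in blocks for row in range(start, end)]
    if not rows:
        return None
    return max(rows, key=lambda row: similarities[row])
```

Rows outside every suspected block are never considered, even if their similarity value is high.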
Fig. 6 is a schematic diagram of the line density function obtained by the method for extracting web page main text according to an embodiment of the present invention. In Fig. 6, the horizontal axis is the row number of each row and the vertical axis is the line density of each row. The position of each suspected main-text block is obtained from the sharp rises and drops of this line density function. For example, let the points on the horizontal axis be X1, …, Xn and the points on the vertical axis be Y(X1), …, Y(Xn); what needs to be determined is the start position Xstart and the end position Xend of the main text. Specifically, the algorithm for determining a suspected main-text block can be as follows:
(1) determine the sharp-rise point Xstart, where Y(Xstart) - Y(X(start-1)) > Y(Xt) * 30% and Y(Xt) is the maximum value of the line density;
(2) to avoid noise, Y(X(start+1)) ≠ 0;
(3) Y(Xend) = 0, i.e., the line density drops sharply to 0, indicating the end;
(4) ensure that a value of 80% of the maximum line density, i.e., Y(Xt) * 80%, occurs between Xstart and Xend.
With the above algorithm, rows 49 to 73 and rows 91 to 97 in Fig. 6 are obtained as suspected main-text blocks. Of course, in the embodiment of the present invention, other methods may also be chosen to obtain the suspected main-text blocks, and the present invention does not limit this.
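Conditions (1) to (4) can be sketched as a single scan over the line densities. Several points here are interpretations rather than statements from the source: the row before the first row is treated as having density 0, the end of the page is accepted as an end position if the density never drops to 0, and condition (4) is read as "some row in the block reaches at least 80% of the maximum density". The name `suspected_blocks` is hypothetical.

```python
def suspected_blocks(density, rise=0.3, peak=0.8):
    """Return (x_start, x_end) pairs; x_end is the first row whose density is 0."""
    if not density or max(density) <= 0:
        return []
    y_max = max(density)
    n = len(density)
    blocks = []
    i = 0
    while i < n:
        previous = density[i - 1] if i > 0 else 0
        sharp_rise = density[i] - previous > rise * y_max    # condition (1)
        not_noise = i + 1 < n and density[i + 1] != 0         # condition (2)
        if sharp_rise and not_noise:
            end = i + 1
            while end < n and density[end] != 0:              # condition (3)
                end += 1
            if max(density[i:end]) >= peak * y_max:           # condition (4)
                blocks.append((i, end))
            i = end + 1
        else:
            i += 1
    return blocks
```

In the test data for this sketch, a short run of moderately dense rows is rejected by condition (4), while an isolated dense row would be rejected by condition (2).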
Step S206 iteratively traverses the rows above and below the main-text line to determine the beginning and end of the main text. Fig. 7 is a schematic diagram of the main steps of determining the beginning and end of the main text in the method for extracting web page main text according to an embodiment of the present invention. As shown in Fig. 7, the main steps may include: step S701, determine the node tag group corresponding to the main-text line; step S702, determine the position of the node tag group in the DOM tree and extract the text of that node tag group; step S703, iteratively traverse upward and downward row by row, centered on the node tag group; step S704, judge whether the similarity value of each row is greater than the preset similarity threshold; step S705, if it is greater, extract the text of the node tag group corresponding to that text line and continue iterating; step S706, if it is not greater, stop iterating and determine the beginning and end of the main text of the web page to be extracted.
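The traversal in steps S703 to S706 can be sketched on a flat list of per-row similarity values. This simplifies the embodiment, which walks node tag groups in the DOM tree; `expand_main_text` and its parameters are hypothetical names.

```python
def expand_main_text(similarities, center, threshold):
    """Grow outward from the main-text line while neighbouring rows exceed the threshold."""
    start = center
    while start - 1 >= 0 and similarities[start - 1] > threshold:
        start -= 1                      # extend the beginning upward
    end = center
    while end + 1 < len(similarities) and similarities[end + 1] > threshold:
        end += 1                        # extend the ending downward
    return start, end                   # inclusive row indices of the main text
```

Each direction stops at the first row that fails the condition, so a later row with a high similarity value is not merged across the gap.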
In the embodiment of the present invention, whether a row meets the preset main-text condition is determined by comparing the similarity value of the row with the similarity threshold. Of course, in the present invention the link ratio or the symbol ratio of each row can also be used to determine whether the row meets the preset main-text condition.
Step S207 determines the additional information of the main text. The additional information may include the author, date, and source. The positions of the title content and the main text in the DOM tree have been found in the above steps, so information such as the author, date, and source can be extracted with regular expressions.
As can be seen from the technical solution for extracting web page main text according to the embodiment of the present invention, the beginning and end of the main text of a web page can be determined, so that the complete main text of the web page can be extracted intelligently, which reduces labor cost and improves the efficiency of extracting the main text. In the embodiment of the present invention, the source code of the web page to be extracted is standardized, which facilitates constructing the access model from the standardized source code, reduces the time needed to extract the main text, and makes the method of the embodiment applicable to main-text extraction from various types of web pages. The second index value of the feature part and the second index value of each unit region of the body part are calculated, so that the similarity value between the feature part and each unit region can be calculated conveniently from the second index values. The suspected main-text region is selected by the first index value of each unit region, which narrows the selection range of the main text and improves extraction efficiency. By comparing the similarity values of the unit regions in the suspected main-text region, the unit region with the greatest similarity value is taken as the unit main-text region, which improves the accuracy of main-text extraction. The unit regions above and below are iteratively traversed, centered on the unit main-text region, so that the beginning and end of the main text can be determined and the complete main text of the web page is extracted. Whether each unit region meets the preset main-text condition is judged from multiple angles such as the similarity value, link ratio, and/or symbol ratio, which can further improve the accuracy of main-text extraction. The main-text additional information of the web page to be extracted is obtained, which improves the completeness of the main text. The first index value may include the unit density of each unit region, so that the attribute information of unit density can be used to select the suspected main-text region. The second index value may include a feature vector value, so that the similarity value can be calculated from the feature vector values.
Fig. 8 is a schematic diagram of the main modules of the device for extracting web page main text according to an embodiment of the present invention. As shown in Fig. 8, the device 800 for extracting web page main text of the present invention mainly includes the following modules: a construction module 801, a calculation module 802, a screening module 803, and a determination module 804.
The construction module 801 can be used to construct an access model from the web page to be extracted. The access model may include a feature part and a body part. The calculation module 802 can be used to calculate the similarity value between each unit region of the body part and the feature part. The screening module 803 can be used to screen the unit main-text region out of the access model according to the similarity value and the first index value of each unit region. The determination module 804 can be used to determine the beginning and end of the main text of the web page to be extracted according to the unit main-text region, so as to obtain the complete main text of the web page to be extracted.
In the embodiment of the present invention, the construction module 801 can also be used to standardize the source code of the web page to be extracted before the access model is constructed from the web page to be extracted.
In the embodiment of the present invention, the calculation module 802 can also be used to: calculate the second index value of the feature part and the second index value of each unit region of the body part; and calculate the similarity value between the feature part and each unit region using the second index value of the feature part and the second index value of each unit region.
In the embodiment of the present invention, the screening module 803 can also be used to: select a suspected main-text region from the access model according to the first index value; and screen the unit main-text region from the suspected main-text region using the similarity value.
In the embodiment of the present invention, the screening module 803 can also be used to compare the similarity values of the unit regions in the suspected main-text region and choose the unit region with the greatest similarity value as the unit main-text region.
In the embodiment of the present invention, the determination module 804 can also be used to iteratively traverse the unit regions above and below, centered on the unit main-text region, and judge whether each unit region meets the preset main-text condition; if the preset main-text condition is not met, the iteration stops, thereby determining the beginning and end of the main text of the web page to be extracted.
In the embodiment of the present invention, the determination module 804 can also be used to: judge whether the similarity value of each unit region is greater than the preset similarity threshold and, if it is greater, determine that the unit region meets the preset main-text condition; and/or judge whether the link ratio of each unit region is less than the preset link ratio threshold and, if it is less, determine that the unit region meets the preset main-text condition; and/or judge whether the symbol ratio of each unit region is greater than the preset symbol ratio threshold and, if it is greater, determine that the unit region meets the preset main-text condition.
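The three checks can be combined in a small predicate. The sketch below is only one reading of the "and/or" wording: every check whose threshold is supplied must pass. The source does not define how the link ratio and symbol ratio are computed, so the sketch takes them as precomputed fields of a hypothetical `row` dictionary.

```python
def meets_main_text_condition(row, sim_threshold=None,
                              link_threshold=None, symbol_threshold=None):
    """True if every enabled check passes; False if no check is enabled."""
    checks = []
    if sim_threshold is not None:
        checks.append(row["similarity"] > sim_threshold)       # similar enough to the head info
    if link_threshold is not None:
        checks.append(row["link_ratio"] < link_threshold)      # few links
    if symbol_threshold is not None:
        checks.append(row["symbol_ratio"] > symbol_threshold)  # enough punctuation
    return bool(checks) and all(checks)
```

A navigation row typically has a high link ratio and almost no punctuation, so enabling the second and third checks rejects it even when its wording happens to resemble the title.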
In the embodiment of the present invention, the determination module 804 can also be used to obtain the main-text additional information of the web page to be extracted. The main-text additional information may include at least one of the following: main-text title, author, date, and source.
In the embodiment of the present invention, the access model can be a document object model.
In the embodiment of the present invention, each unit region can take a row as the unit.
In the embodiment of the present invention, the first index value can be used to indicate attribute information of each unit region and includes the unit density of each unit region.
In the embodiment of the present invention, the second index value can be used to indicate attribute information of a certain region in the web page and includes a feature vector value.
From the above, it can be seen that the beginning and end of the main text of a web page can be determined, so that the complete main text of the web page can be extracted intelligently, which reduces labor cost and improves the efficiency of extracting the main text. In the embodiment of the present invention, the source code of the web page to be extracted is standardized, which facilitates constructing the access model from the standardized source code, reduces the time needed to extract the main text, and makes the method of the embodiment applicable to main-text extraction from various types of web pages. The second index value of the feature part and the second index value of each unit region of the body part are calculated, so that the similarity value between the feature part and each unit region can be calculated conveniently from the second index values. The suspected main-text region is selected by the first index value of each unit region, which narrows the selection range of the main text and improves extraction efficiency. By comparing the similarity values of the unit regions in the suspected main-text region, the unit region with the greatest similarity value is taken as the unit main-text region, which improves the accuracy of main-text extraction. The unit regions above and below are iteratively traversed, centered on the unit main-text region, so that the beginning and end of the main text can be determined and the complete main text of the web page is extracted. Whether each unit region meets the preset main-text condition is judged from multiple angles such as the similarity value, link ratio, and/or symbol ratio, which can further improve the accuracy of main-text extraction. The main-text additional information of the web page to be extracted is obtained, which improves the completeness of the main text. The first index value may include the unit density of each unit region, so that the attribute information of unit density can be used to select the suspected main-text region. The second index value may include a feature vector value, so that the similarity value can be calculated from the feature vector values.
Fig. 9 shows an exemplary system architecture 900 to which the method or device for extracting web page main text of the embodiment of the present invention can be applied.
As shown in Fig. 9, the system architecture 900 may include terminal devices 901, 902, 903, a network 904, and a server 905. The network 904 serves as the medium providing communication links between the terminal devices 901, 902, 903 and the server 905. The network 904 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user can use the terminal devices 901, 902, 903 to interact with the server 905 through the network 904 to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 901, 902, 903, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (examples only).
The terminal devices 901, 902, 903 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.
The server 905 can be a server providing various services, for example a back-end management server (example only) that supports shopping websites browsed by users with the terminal devices 901, 902, 903. The back-end management server can analyze and otherwise process received data such as information query requests, and feed the processing results (for example, target push information or product information; examples only) back to the terminal devices.
It should be noted that the method for extracting web page main text provided by the embodiment of the present invention is generally executed by the server 905; correspondingly, the device for extracting web page main text is generally arranged in the server 905.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 9 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
Referring now to Fig. 10, it shows a schematic structural diagram of a computer system 1000 suitable for implementing a terminal device of the embodiment of the present invention. The terminal device shown in Fig. 10 is only an example and should not impose any limitation on the functions and scope of use of the embodiment of the present invention.
As shown in Fig. 10, the computer system 1000 includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data required for the operation of the system 1000. The CPU 1001, ROM 1002, and RAM 1003 are connected to one another through a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker and the like; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read from it can be installed into the storage section 1008 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flowchart can be implemented as a computer software program. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processing unit (CPU) 1001, the above functions defined in the system of the present invention are executed.
It should be noted that the computer-readable medium described in the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example (but is not limited to), an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media can include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, and that computer-readable medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium can be transmitted with any suitable medium, including but not limited to: wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each box in a flowchart or block diagram can represent a module, a program segment, or a portion of code, and that module, program segment, or portion of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the boxes can occur in an order different from that marked in the drawings. For example, two boxes shown in succession can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flowchart, and combinations of boxes in a block diagram or flowchart, can be implemented with a dedicated hardware-based system that performs the specified functions or operations, or with a combination of dedicated hardware and computer instructions.
The modules described in the embodiment of the present invention can be implemented in software or in hardware. The described modules can also be arranged in a processor; for example, it can be described as: a processor including a construction module, a calculation module, a screening module, and a determination module. The names of these modules do not, under certain circumstances, constitute a limitation on the modules themselves; for example, the construction module can also be described as "a module that constructs an access model from the web page to be extracted".
In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiment, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: construct an access model from the web page to be extracted; calculate the similarity value between each unit region of the body part and the feature part; screen the unit main-text region from the access model according to the similarity value and the first index value of each unit region; and determine the beginning and end of the main text of the web page to be extracted according to the unit main-text region, so as to obtain the complete main text of the web page to be extracted.
According to the technical solution of the embodiment of the present invention, the beginning and end of the main text of a web page can be determined, so that the complete main text of the web page can be extracted intelligently, which reduces labor cost and improves the efficiency of extracting the main text. In the embodiment of the present invention, the source code of the web page to be extracted is standardized, which facilitates constructing the access model from the standardized source code, reduces the time needed to extract the main text, and makes the method of the embodiment applicable to main-text extraction from various types of web pages. The second index value of the feature part and the second index value of each unit region of the body part are calculated, so that the similarity value between the feature part and each unit region can be calculated conveniently from the second index values. The suspected main-text region is selected by the first index value of each unit region, which narrows the selection range of the main text and improves extraction efficiency. By comparing the similarity values of the unit regions in the suspected main-text region, the unit region with the greatest similarity value is taken as the unit main-text region, which improves the accuracy of main-text extraction. The unit regions above and below are iteratively traversed, centered on the unit main-text region, so that the beginning and end of the main text can be determined and the complete main text of the web page is extracted. Whether each unit region meets the preset main-text condition is judged from multiple angles such as the similarity value, link ratio, and/or symbol ratio, which can further improve the accuracy of main-text extraction. The main-text additional information of the web page to be extracted is obtained, which improves the completeness of the main text. The first index value may include the unit density of each unit region, so that the attribute information of unit density can be used to select the suspected main-text region. The second index value may include a feature vector value, so that the similarity value can be calculated from the feature vector values.
The above specific embodiments do not constitute a limitation on the protection scope of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions can occur. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (26)
1. a kind of method for extracting Web page text characterized by comprising
Access Model is constructed according to webpage to be extracted, the Access Model includes: characteristic and main part;
Calculate the constituent parts region of the main part and the similar value of the characteristic;
According to the first index value of the similar value and constituent parts region, unit text region is screened from the Access Model;
The beginning and end of the text of the webpage to be extracted is determined according to unit text region, it is described to be extracted to obtain
The complete text of webpage.
2. the method according to claim 1, wherein according to webpage to be extracted construct Access Model before, institute
State method further include: be standardized the source code of the webpage to be extracted.
3. the method according to claim 1, wherein calculate the main part constituent parts region and the spy
Sign part similar value include:
Calculate the second index value of the second index value of the characteristic and the constituent parts region of the main part;
Using the second index value of the characteristic and second index value in the constituent parts region, the features are calculated
Divide the similar value with the constituent parts region.
4. The method according to claim 1, characterized in that screening the unit text region from the access model according to the similarity value and the first index value of each unit region comprises:
selecting suspected text regions from the access model according to the first index value;
screening the unit text region from the suspected text regions by using the similarity value.
5. The method according to claim 4, characterized in that screening the unit text region from the suspected text regions by using the similarity value comprises:
comparing the similarity values of the unit regions in the suspected text regions, and selecting the unit region with the largest similarity value as the unit text region.
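The two-stage screening of claims 4 and 5 can be sketched as follows. The sketch assumes that the first index value is a numeric density per unit region and that suspected text regions are those at or above a threshold; the threshold parameter `min_density` and the function name `select_unit_text_region` are hypothetical and do not come from the claims.

```python
from typing import Optional, Sequence


def select_unit_text_region(
    densities: Sequence[float],
    similarities: Sequence[float],
    min_density: float,
) -> Optional[int]:
    """Return the index of the unit text region, or None if nothing qualifies."""
    if len(densities) != len(similarities):
        raise ValueError("densities and similarities must have the same length")
    # Stage 1: keep regions whose first index value passes the assumed threshold.
    suspected = [i for i, d in enumerate(densities) if d >= min_density]
    if not suspected:
        return None
    # Stage 2: among the suspected regions, pick the largest similarity value.
    return max(suspected, key=lambda i: similarities[i])
```

Filtering by density first keeps short but title-like rows, such as breadcrumbs, from winning the similarity comparison.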
6. The method according to claim 1, characterized in that determining the start point and end point of the text of the webpage to be extracted according to the unit text region comprises:
traversing the unit regions upward and downward, centered on the unit text region, and judging whether each unit region
meets a preset text condition; if a unit region does not meet the preset text condition, stopping the iteration, thereby determining
the start point and end point of the text of the webpage to be extracted.
7. The method according to claim 6, characterized in that judging whether each unit region meets the preset text condition comprises:
judging whether the similarity value of each unit region is greater than a preset similarity threshold, and if so, determining that the unit region
meets the preset text condition; and/or
judging whether the link ratio of each unit region is less than a preset link ratio threshold, and if so, determining that the unit region
meets the preset text condition; and/or
judging whether the symbol ratio of each unit region is greater than a preset symbol ratio threshold, and if so, determining that the unit region
meets the preset text condition.
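The boundary search of claims 6 and 7 can be sketched as a two-directional scan from the unit text region. This sketch assumes that all three tests of claim 7 are combined with "and" (the claim also allows using them separately), and that the link ratio and symbol ratio are fractions of a region's characters; the `UnitRegion` type, the default thresholds and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class UnitRegion:
    similarity: float    # similarity value to the feature part
    link_ratio: float    # assumed: fraction of characters inside links
    symbol_ratio: float  # assumed: fraction of characters that are punctuation


def meets_text_condition(
    region: UnitRegion,
    min_similarity: float = 0.1,
    max_link_ratio: float = 0.5,
    min_symbol_ratio: float = 0.01,
) -> bool:
    """Preset text condition; the threshold values are illustrative only."""
    return (
        region.similarity > min_similarity
        and region.link_ratio < max_link_ratio
        and region.symbol_ratio > min_symbol_ratio
    )


def find_text_bounds(regions: Sequence[UnitRegion], seed: int) -> Tuple[int, int]:
    """Expand upward and downward from the seed until the condition fails."""
    if not 0 <= seed < len(regions):
        raise IndexError("seed is outside the list of unit regions")
    start = seed
    while start - 1 >= 0 and meets_text_condition(regions[start - 1]):
        start -= 1
    end = seed
    while end + 1 < len(regions) and meets_text_condition(regions[end + 1]):
        end += 1
    return start, end
```

The returned pair gives the inclusive indices of the first and last unit regions of the text; the seed itself is always included because it was already selected as the unit text region.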
8. The method according to claim 1, characterized in that, after the start point and end point of the text of the webpage to be extracted
are determined according to the unit text region, the method further comprises: obtaining text additional information of the webpage to be extracted,
wherein the text additional information includes at least one of the following: text title, author, date and source.
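One way to collect the additional information named in claim 8 is sketched below using Python's standard `html.parser` module. The sketch assumes that the author, date and source appear in `<meta name="..." content="...">` tags with exactly those names and that the title comes from the `<title>` element; the claims do not say where this information is read from, and the class name `MetaCollector` and the mapping `WANTED` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed meta names; real pages use many different conventions.
WANTED = {"author": "author", "date": "date", "source": "source"}


class MetaCollector(HTMLParser):
    """Collects the title and selected <meta> values into self.info."""

    def __init__(self):
        super().__init__()
        self.info = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attr.get("name") or "").lower()
            content = attr.get("content")
            if name in WANTED and content:
                # Keep the first value seen for each field.
                self.info.setdefault(WANTED[name], content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.info.setdefault("title", data.strip())


def extract_additional_info(html: str) -> dict:
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.info
```

Fields that the page does not provide are simply absent from the returned dictionary, which matches the "at least one of" wording of the claim.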
9. The method according to claim 1, characterized in that the access model is a document object model.
10. The method according to claim 1, characterized in that each unit region is in units of rows.
11. The method according to claim 1, characterized in that the first index value is used to indicate attribute information of each unit region,
and comprises: the unit density of each unit region.
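A row-based unit density, as in claims 10 and 11, might be computed as in the following sketch. It assumes that the density of a row is the number of visible non-whitespace characters left after markup is removed; the claims do not define the density formula, so both this metric and the function name `row_densities` are hypothetical.

```python
import re

# Matches a single HTML tag; adequate for already standardized source code.
TAG_RE = re.compile(r"<[^>]+>")


def row_densities(html_rows):
    """Assumed metric: visible non-whitespace characters per row."""
    densities = []
    for row in html_rows:
        visible = TAG_RE.sub("", row)  # drop the markup
        densities.append(len("".join(visible.split())))  # drop whitespace
    return densities
```

Rows dominated by markup, such as navigation links, get a low value under this metric, while rows of running text get a high one.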
12. The method according to claim 3, characterized in that the second index value is used to indicate attribute information of a region in the webpage,
and comprises: a feature vector value.
13. A device for extracting web page text, characterized by comprising:
a building module, configured to construct an access model according to a webpage to be extracted, the access model comprising a feature part and a main part;
a computing module, configured to calculate a similarity value between each unit region of the main part and the feature part;
a screening module, configured to screen a unit text region from the access model according to the similarity value and a first index value of each unit region;
a determining module, configured to determine the start point and end point of the text of the webpage to be extracted according to the unit text region, so as to
obtain the complete text of the webpage to be extracted.
14. The device according to claim 13, characterized in that the building module is further configured to: before the access model is constructed
according to the webpage to be extracted, standardize the source code of the webpage to be extracted.
15. The device according to claim 13, characterized in that the computing module is further configured to:
calculate a second index value of the feature part and a second index value of each unit region of the main part;
calculate the similarity value between the feature part and each unit region by using the second index value of the feature part and the second index value of each unit region.
16. The device according to claim 13, characterized in that the screening module is further configured to:
select suspected text regions from the access model according to the first index value;
screen the unit text region from the suspected text regions by using the similarity value.
17. The device according to claim 16, characterized in that the screening module is further configured to:
compare the similarity values of the unit regions in the suspected text regions, and select the unit region with the largest similarity value as the unit text region.
18. The device according to claim 13, characterized in that the determining module is further configured to:
traverse the unit regions upward and downward, centered on the unit text region, and judge whether each unit region
meets a preset text condition; if a unit region does not meet the preset text condition, stop the iteration, thereby determining
the start point and end point of the text of the webpage to be extracted.
19. The device according to claim 18, characterized in that the determining module is further configured to:
judge whether the similarity value of each unit region is greater than a preset similarity threshold, and if so, determine that the unit region
meets the preset text condition; and/or
judge whether the link ratio of each unit region is less than a preset link ratio threshold, and if so, determine that the unit region
meets the preset text condition; and/or
judge whether the symbol ratio of each unit region is greater than a preset symbol ratio threshold, and if so, determine that the unit region
meets the preset text condition.
20. The device according to claim 13, characterized in that the determining module is further configured to: obtain text additional information
of the webpage to be extracted, wherein the text additional information includes at least one of the following: text title, author, date and
source.
21. The device according to claim 13, characterized in that the access model is a document object model.
22. The device according to claim 13, characterized in that each unit region is in units of rows.
23. The device according to claim 13, characterized in that the first index value is used to indicate attribute information of each unit region,
and comprises: the unit density of each unit region.
24. The device according to claim 15, characterized in that the second index value is used to indicate attribute information of a region in the webpage,
and comprises: a feature vector value.
25. An electronic device, characterized by comprising:
one or more processors;
a storage device for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to
implement the method according to any one of claims 1-12.
26. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor,
implements the method according to any one of claims 1-12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711306108.3A CN110020312B (en) | 2017-12-11 | 2017-12-11 | Method and device for extracting webpage text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711306108.3A CN110020312B (en) | 2017-12-11 | 2017-12-11 | Method and device for extracting webpage text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110020312A (en) | 2019-07-16 |
CN110020312B (en) | 2022-09-06 |
Family
ID=67186859
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711306108.3A Active CN110020312B (en) | 2017-12-11 | 2017-12-11 | Method and device for extracting webpage text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020312B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111581478A (en) * | 2020-05-07 | 2020-08-25 | 成都信息工程大学 | Cross-website general news acquisition method for specific subject |
CN111639250A (en) * | 2020-06-05 | 2020-09-08 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN111651694A (en) * | 2020-05-21 | 2020-09-11 | 深圳市比一比网络科技有限公司 | DOM tree processing method applied to webpage |
CN113722640A (en) * | 2021-08-26 | 2021-11-30 | 长沙博为软件技术股份有限公司 | Method, device and medium for collecting webpage configurable items based on RPA |
CN114172676A (en) * | 2020-09-10 | 2022-03-11 | 中国移动通信有限公司研究院 | Malicious website detection method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102253979A (en) * | 2011-06-23 | 2011-11-23 | 天津海量信息技术有限公司 | Vision-based web page extracting method |
CN102541874A (en) * | 2010-12-16 | 2012-07-04 | 中国移动通信集团公司 | Webpage text content extracting method and device |
US20150067476A1 (en) * | 2013-08-29 | 2015-03-05 | Microsoft Corporation | Title and body extraction from web page |
CN104598577A (en) * | 2015-01-14 | 2015-05-06 | 晶赞广告(上海)有限公司 | Extraction method for webpage text |
CN105095466A (en) * | 2015-07-31 | 2015-11-25 | 山东大学 | Web text information extraction method |
CN105183801A (en) * | 2015-08-25 | 2015-12-23 | 北京信息科技大学 | Web page body text extraction method and apparatus |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102541874A (en) * | 2010-12-16 | 2012-07-04 | 中国移动通信集团公司 | Webpage text content extracting method and device |
CN102253979A (en) * | 2011-06-23 | 2011-11-23 | 天津海量信息技术有限公司 | Vision-based web page extracting method |
US20150067476A1 (en) * | 2013-08-29 | 2015-03-05 | Microsoft Corporation | Title and body extraction from web page |
CN104598577A (en) * | 2015-01-14 | 2015-05-06 | 晶赞广告(上海)有限公司 | Extraction method for webpage text |
CN105095466A (en) * | 2015-07-31 | 2015-11-25 | 山东大学 | Web text information extraction method |
CN105183801A (en) * | 2015-08-25 | 2015-12-23 | 北京信息科技大学 | Web page body text extraction method and apparatus |
Non-Patent Citations (2)
Title |
---|
张瑞雪等: "逆序解析DOM树及网页正文信息提取", 《计算机科学》 * |
王利: "基于内容相似度的网页正文提取", 《计算机工程》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111581478A (en) * | 2020-05-07 | 2020-08-25 | 成都信息工程大学 | Cross-website general news acquisition method for specific subject |
CN111651694A (en) * | 2020-05-21 | 2020-09-11 | 深圳市比一比网络科技有限公司 | DOM tree processing method applied to webpage |
CN111651694B (en) * | 2020-05-21 | 2023-09-29 | 深圳市比一比网络科技有限公司 | DOM tree processing method applied to webpage |
CN111639250A (en) * | 2020-06-05 | 2020-09-08 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN111639250B (en) * | 2020-06-05 | 2023-05-16 | 深圳市小满科技有限公司 | Enterprise description information acquisition method and device, electronic equipment and storage medium |
CN114172676A (en) * | 2020-09-10 | 2022-03-11 | 中国移动通信有限公司研究院 | Malicious website detection method, device, equipment and storage medium |
CN113722640A (en) * | 2021-08-26 | 2021-11-30 | 长沙博为软件技术股份有限公司 | Method, device and medium for collecting webpage configurable items based on RPA |
Also Published As
Publication number | Publication date |
---|---|
CN110020312B (en) | 2022-09-06 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110020312A (en) | The method and apparatus for extracting Web page text | |
CN105677764B (en) | Information extraction method and device | |
CN110334346A (en) | A kind of information extraction method and device of pdf document | |
JP5092165B2 (en) | Data construction method and system | |
CN105574092B (en) | Information mining method and device | |
CN103324666A (en) | Topic tracing method and device based on micro-blog data | |
CN102193936A (en) | Data classification method and device | |
CN110516221A (en) | Extract method, equipment and the storage medium of chart data in PDF document | |
CN103500332B (en) | Character displaying method and device in picture | |
CN109635260B (en) | Method, device, equipment and storage medium for generating article template | |
CN108829854B (en) | Method, apparatus, device and computer-readable storage medium for generating article | |
CN109214730A (en) | Information-pushing method and device | |
CN103166981A (en) | Wireless webpage transcoding method and device | |
CN112579729B (en) | Training method and device for document quality evaluation model, electronic equipment and medium | |
US20230177359A1 (en) | Method and apparatus for training document information extraction model, and method and apparatus for extracting document information | |
CN114970553B (en) | Information analysis method and device based on large-scale unmarked corpus and electronic equipment | |
CN107798622A (en) | A kind of method and apparatus for identifying user view | |
CN110276065A (en) | A kind of method and apparatus handling goods review | |
CN112084342A (en) | Test question generation method and device, computer equipment and storage medium | |
CN103942211A (en) | Text page recognition method and device | |
CN112650910A (en) | Method, device, equipment and storage medium for determining website update information | |
JP6144968B2 (en) | Information presenting apparatus, method, and program | |
US10387545B2 (en) | Processing page | |
Xiang et al. | Effective page segmentation combining pattern analysis and visual separators for browsing on small screens | |
Kucher et al. | Analysis of VINCI 2009-2017 proceedings |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |