WO2019000303A1 - Intelligent web page collection method and system - Google Patents

Intelligent web page collection method and system

Info

Publication number
WO2019000303A1
Authority
WO
WIPO (PCT)
Prior art keywords
collection rule
rule
collection
computer device
webpage
Prior art date
Application number
PCT/CN2017/090717
Inventor
马岩
Original Assignee
麦格创科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 麦格创科技(深圳)有限公司
Priority to PCT/CN2017/090717
Publication of WO2019000303A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The present invention relates to the field of software and computers, and in particular to a method and system for intelligently collecting web pages.
  • Existing web page collection methods have low performance and low efficiency.
  • The application provides an intelligent collection method for web pages that addresses the low performance and low efficiency of prior-art solutions.
  • An intelligent collection method for web pages includes the following steps:
  • the computer device acquires an add task and the URL to be added;
  • the computer device adds a collection rule for the web page and intelligently parses the collection rule;
  • the computer device tests the collection rule and, after the collection rule passes the test, publishes the task;
  • the computer device starts collecting web pages according to the collection rule and publishes the collected web page data.
  • Optionally, after publishing the task, the method further includes synchronizing the published task to a database.
  • Optionally, before intelligently parsing the collection rule, the method further includes:
  • manually formulating rules, specifically: the user formulates rules for extracting element data by analyzing the data structure of the URL; the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
  • Optionally, intelligently parsing the collection rule includes performing semantic-algorithm parsing on the collection rule to identify its valid content and calculating the extraction rule for the data.
  • Optionally, intelligently parsing the collection rule includes:
  • searching the rule base for the collection rule corresponding to the added URL and, if the rule base contains the added URL, extracting and reusing that collection rule.
  • In a second aspect, a computer device is provided, comprising:
  • an obtaining unit, configured to acquire an add task and the URL to be added;
  • a processing unit, configured to add a collection rule for the web page, intelligently parse the collection rule, test the collection rule, publish the task after the collection rule passes the test, start collecting web pages according to the collection rule, and publish the collected web page data.
  • Optionally, the processing unit is further configured to synchronize the published task to a database.
  • Optionally, the processing unit is further configured to support manually formulated rules, specifically: the user formulates rules for extracting element data by analyzing the data structure of the URL; the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
  • Optionally, the processing unit is specifically configured to perform semantic-algorithm parsing on the collection rule to identify its valid content and to calculate the extraction rule for the data.
  • In a third aspect, a computer-readable storage medium is provided on which a computer program is stored; when executed by a processor, the program implements the intelligent web page collection method provided by the first aspect.
  • By automatically parsing collection rules and collecting web pages automatically, the technical solution provided by the invention offers high efficiency and low cost.
  • FIG. 1 is a flowchart of an intelligent web page collection method according to a first preferred embodiment of the present invention.
  • FIG. 2 is a structural diagram of a computer device according to a second preferred embodiment of the present invention.
  • FIG. 3 is a hardware structural diagram of a computer device according to a second preferred embodiment of the present invention.
  • FIG. 1 shows an intelligent web page collection method according to a first preferred embodiment of the present invention. The method is performed by a computer device and, as shown in FIG. 1, includes the following steps:
  • Step S101: The computer device acquires an add task and the URL to be added.
  • Step S102: The computer device adds a collection rule for the web page and intelligently parses the collection rule.
  • Step S103: The computer device tests the collection rule and, after the collection rule passes the test, publishes the task.
  • Step S104: The computer device starts collecting web pages according to the collection rule and publishes the collected web page data.
  • Optionally, between step S103 and step S104, the method may further include:
  • the computer device stores the published task in a database.
  • FIG. 2A is a schematic diagram of an intelligent parsing method for web pages. Intelligently extracting collection rules greatly reduces the time users need to configure them and helps ordinary users extract rules quickly.
  • The parsing scheme comprises element-location extraction, rule-database matching, model-database matching, and a semantic algorithm.
  • Element-location extraction is designed for maximum convenience: the user clicks an element on the web page and the system automatically extracts the corresponding extraction rule (XPath address, regular-expression parameter extraction). Rule-database matching is the first step of intelligent parsing: if the added URL is similar to a URL in the rule base, the rule is automatically extracted and verified. If verification fails, the model database is used: the URL type is determined by semantically analyzing the combination probability of verbs, prepositions, or adverbs in the title, and the model rules of that type are extracted from the model library for a data-extraction test.
  • The first three valid rules are offered to the user. When the user selects or modifies one, the rule is corrected or recorded, so the rule base becomes steadily fuller and more accurate.
  • The semantic algorithm is a supplementary function: when neither the rule base nor the model library can produce a rule, elements of the required type are extracted uniformly, and usable rules are calculated and filtered according to the content.
  • Element rule extraction: when the user clicks a web page element, the system automatically extracts that element's extraction rule, including XPath address and regular-expression parameter extraction.
  • Rule database matching: as the first step of intelligent parsing, matching is based on the primary domain name (for example, GD.*.CN is similar to SZ.*.CN). The corresponding rule is found and parsed in the background; if it extracts data effectively, it is recommended.
  • Model database matching/semantic analysis: as the second step of intelligent parsing, the model library fields are determined by the collection source type. Semantic analysis confirms the collection source type, and the models of that type are extracted from the library. If a rule extracts data effectively, it is recommended (when several rules all collect data, they are recommended together, taking the top three by character length). After the user adjusts a rule and it passes a valid test, the new rule is recorded (the correction count is incremented; if the rule was not adjusted, only the reference count is recorded). When the source rule contains the new rule, the correction is applied directly. When the user adds a field, the system adds that field and its initial rule to the model library for that type.
  • Semantic algorithm: semantic analysis judges the URL type from the combination probability of verbs, prepositions, and adverbs in the title and content; the semantic algorithm then computes over the content and filters out usable rules (for example, for the content field of a news-type page: working backwards from content length, the bottom-level DIV, that is, a DIV containing no nested DIV, with the longest content (text) is taken to identify the element that holds the content, and the extraction rule is generated from it).
  • Semantic algorithm: filters out invalid content through semantic analysis of the collected data, identifies the valid content, and calculates the extraction rule for the data.
  • Rule base: stores the collection rules of existing URLs; when an identical or similar new collection URL appears, the rule can be extracted and reused.
  • Model library: stores mainstream data collection models (model extension is supported), so that the computer device can quickly identify the elements to be extracted and generate rules.
  • For example, with the news model, when a news detail page URL is collected, the system can automatically extract fields such as "title", "author", "source", "release time", and "content".
  • Optionally, before intelligently parsing the rule, the foregoing method may further include manually formulating rules:
  • The user formulates rules for extracting element data by analyzing the data structure of the URL.
  • Element positioning: the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
  • FIG. 2B shows a computer device, the computer device including:
  • an obtaining unit 201, configured to acquire an add task and the URL to be added;
  • a processing unit 202, configured to add a collection rule for the web page, intelligently parse the collection rule, test the collection rule, publish the task after the collection rule passes the test, start collecting web pages according to the collection rule, and publish the collected web page data.
  • Optionally, the processing unit 202 is further configured to synchronize the published task to a database.
  • Optionally, the processing unit 202 is further configured to support manually formulated rules, specifically: the user formulates rules for extracting element data by analyzing the data structure of the URL; the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
  • Optionally, the processing unit 202 is specifically configured to perform semantic-algorithm parsing on the collection rule to identify its valid content and to calculate the extraction rule for the data.
  • FIG. 3 shows a computer device 30 including a processor 301, a transceiver 302, a memory 303, and a bus 304.
  • The transceiver 302 is configured to send data to and receive data from external devices.
  • There may be one or more processors 301.
  • In some embodiments of the application, the processor 301, the memory 303, and the transceiver 302 may be connected by the bus 304 or by other means.
  • The computer device 30 can be used to perform the steps of FIG. 1. For the meaning of the terms used in this embodiment and for examples, refer to the embodiment corresponding to FIG. 1; they are not repeated here.
  • The program code is stored in the memory 303.
  • The processor 301 is configured to call the program code stored in the memory 303 to perform the following operations:
  • the processor 301 is configured to, after starting, receive a plurality of pieces of location information sent by a location sensor, identify the plurality of pieces of location information to obtain a first motion trend, query a first operation corresponding to the first motion trend, and perform the first operation.
  • The processor 301 here may be a single processing element or a collective term for multiple processing elements.
  • For example, the processing element may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application, such as one or more microprocessors (digital signal processor, DSP) or one or more field-programmable gate arrays (FPGAs).
  • CPU: central processing unit
  • ASIC: application-specific integrated circuit
  • DSP: digital signal processor
  • FPGA: field-programmable gate array
  • The memory 303 may be a single storage device or a collective term for multiple storage elements, and is used to store executable program code or the parameters, data, and the like required to run the application. The memory 303 may include random access memory (RAM) and may also include non-volatile memory, such as disk storage or flash memory.
  • RAM: random access memory
  • The bus 304 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 3, but this does not mean that there is only one bus or one type of bus.
  • The terminal may further include an input/output device connected to the bus 304, so as to connect to other parts such as the processor 301 via the bus.
  • The input/output device can provide an input interface through which the operator selects a control item; it can also be another interface through which other devices are connected externally.
  • The program may be stored in a computer-readable storage medium, and the storage medium may include a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
  • ROM: read-only memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

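An intelligent collection method for web pages, the method comprising the following steps: a computer device acquires an add task and the URL to be added (S101); the computer device adds a collection rule for the web page and intelligently parses the collection rule (S102); the computer device tests the collection rule and, after the collection rule passes the test, publishes the task (S103); the computer device starts collecting web pages according to the collection rule and publishes the collected web page data (S104). The method has the advantage of high efficiency.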

Description

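Intelligent web page collection method and system
Technical Field
The present invention relates to the field of software and computers, and in particular to a method and system for intelligently collecting web pages.
Background
In traditional collection, users need a basic ability to read web page code. This shuts out users who need the collection function but cannot configure it, and even users who can configure it spend a great deal of time doing so. With the resulting volume of collection-source configuration work, pressure on collection efficiency and collection volume keeps growing, and ordinary collectors on the market no longer meet our business needs. The R&D department therefore needs to implement a high-performance, high-availability intelligent collector and collection method centered on the business.
Existing web page collection methods have low performance and low efficiency.
Technical Problem
The application provides an intelligent collection method for web pages that addresses the low performance and low efficiency of prior-art solutions.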
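Technical Solution
In a first aspect, an intelligent collection method for web pages is provided, the method comprising the following steps:
a computer device acquires an add task and the URL to be added;
the computer device adds a collection rule for the web page and intelligently parses the collection rule;
the computer device tests the collection rule and, after the collection rule passes the test, publishes the task;
the computer device starts collecting web pages according to the collection rule and publishes the collected web page data.
Optionally, after publishing the task, the method further comprises:
synchronizing the published task to a database.
Optionally, before intelligently parsing the collection rule, the method further comprises:
manually formulating rules, specifically: the user formulates rules for extracting element data by analyzing the data structure of the URL; the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
Optionally, intelligently parsing the collection rule specifically comprises:
performing semantic-algorithm parsing on the collection rule to identify its valid content, and calculating the extraction rule for the data.
Optionally, intelligently parsing the collection rule specifically comprises:
searching the rule base for the collection rule corresponding to the added URL and, if the rule base contains the added URL, extracting and reusing the collection rule corresponding to the added URL.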
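In a second aspect, a computer device is provided, the computer device comprising:
an obtaining unit, configured to acquire an add task and the URL to be added;
a processing unit, configured to add a collection rule for the web page, intelligently parse the collection rule, test the collection rule, publish the task after the collection rule passes the test, start collecting web pages according to the collection rule, and publish the collected web page data.
Optionally, the processing unit is further configured to synchronize the published task to a database.
Optionally, the processing unit is further configured to support manually formulated rules, specifically: the user formulates rules for extracting element data by analyzing the data structure of the URL; the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
Optionally, the processing unit is specifically configured to perform semantic-algorithm parsing on the collection rule to identify its valid content and to calculate the extraction rule for the data.
In a third aspect, a computer-readable storage medium is provided on which a computer program is stored; when executed by a processor, the program implements the intelligent web page collection method provided by the first aspect.
Advantageous Effects
By automatically parsing collection rules and collecting web pages automatically, the technical solution provided by the invention offers high efficiency and low cost.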
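Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an intelligent web page collection method according to a first preferred embodiment of the present invention.
FIG. 2 is a structural diagram of a computer device according to a second preferred embodiment of the present invention.
FIG. 3 is a hardware structural diagram of a computer device according to a second preferred embodiment of the present invention.
Embodiments of the Invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.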
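Referring to FIG. 1, FIG. 1 shows an intelligent web page collection method according to a first preferred embodiment of the present invention. The method is performed by a computer device and, as shown in FIG. 1, includes the following steps:
Step S101: The computer device acquires an add task and the URL to be added.
Step S102: The computer device adds a collection rule for the web page and intelligently parses the collection rule.
Step S103: The computer device tests the collection rule and, after the collection rule passes the test, publishes the task.
Step S104: The computer device starts collecting web pages according to the collection rule and publishes the collected web page data.
Optionally, between step S103 and step S104, the above method may further include:
the computer device stores the published task in a database.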
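The flow of steps S101 to S104 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Task` and `Collector` classes, the `fetch` callable, the representation of a rule as a function from HTML to a field dictionary, and the "at least one non-empty field" pass criterion are all assumptions introduced here, since the text does not define a rule format or a test criterion.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: the patent does not define a rule format.
# Here a rule is any function that maps page HTML to extracted fields.
Rule = Callable[[str], Dict[str, str]]


@dataclass
class Task:
    url: str
    rule: Rule
    published: bool = False


@dataclass
class Collector:
    fetch: Callable[[str], str]                  # assumed page downloader
    database: List[Task] = field(default_factory=list)

    def add_task(self, url: str, rule: Rule) -> Task:
        # S101 + S102: register the URL together with its collection rule.
        return Task(url=url, rule=rule)

    def test_and_publish(self, task: Task) -> bool:
        # S103: assumed criterion, the rule must extract a non-empty field.
        fields = task.rule(self.fetch(task.url))
        task.published = any(value.strip() for value in fields.values())
        if task.published:
            self.database.append(task)           # optional database sync
        return task.published

    def collect(self, task: Task) -> Dict[str, str]:
        # S104: only tasks that passed the rule test are collected.
        if not task.published:
            raise RuntimeError("task has not passed the rule test")
        return task.rule(self.fetch(task.url))
```

In this sketch the in-memory `database` list stands in for the database that published tasks are synchronized to; a real system would persist them.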
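Optionally, the flow for intelligently parsing the collection rule in the above method is shown in FIG. 2A, which is a schematic diagram of an intelligent parsing method for web pages. Intelligently extracting collection rules greatly reduces the time users need to configure them and helps ordinary users extract rules quickly. The parsing scheme comprises element-location extraction, rule-database matching, model-database matching, and a semantic algorithm. Element-location extraction is designed for maximum convenience: the user clicks an element on the web page and the system automatically extracts the corresponding extraction rule (XPath address, regular-expression parameter extraction). Rule-database matching is the first step of intelligent parsing: if the added URL is similar to a URL in the rule base, the rule is automatically extracted and verified. If verification fails, the model database is used: the URL type is determined by semantically analyzing the combination probability of verbs, prepositions, or adverbs in the title, the model rules of that type are extracted from the model library for a data-extraction test, and the first three valid rules are offered to the user. When the user selects or modifies one, the rule is corrected or recorded, so the rule base becomes steadily fuller and more accurate. The semantic algorithm is a supplementary function: when neither the rule base nor the model library can produce a rule, elements of the required type are extracted uniformly, and usable rules are calculated and filtered according to the content.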
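The ordering described above (rule base first, then model library, then the semantic algorithm as a fallback) can be pictured as a simple cascade. The sketch below is only illustrative: the `Resolver` signature, the `resolve_rule` helper, and the three placeholder stage functions are hypothetical names, and the toy return values stand in for whatever each stage would actually compute.

```python
from typing import Callable, List, Optional, Tuple

# A resolver takes (url, html) and returns an extraction rule, or None if it
# cannot produce one that passes verification. All names are hypothetical.
Resolver = Callable[[str, str], Optional[str]]


def resolve_rule(url: str, html: str,
                 stages: List[Tuple[str, Resolver]]) -> Optional[Tuple[str, str]]:
    """Try each stage in order and return (stage name, rule) for the first hit."""
    for name, resolver in stages:
        rule = resolver(url, html)
        if rule is not None:
            return name, rule
    return None


# Placeholder stages standing in for the three strategies named in the text.
def from_rule_base(url: str, html: str) -> Optional[str]:
    return None                                  # no similar URL stored


def from_model_library(url: str, html: str) -> Optional[str]:
    return "//h1" if "<h1>" in html else None    # toy "news model" rule


def from_semantic_algorithm(url: str, html: str) -> Optional[str]:
    return "//body"                              # last-resort fallback


STAGES = [
    ("rule base", from_rule_base),
    ("model library", from_model_library),
    ("semantic algorithm", from_semantic_algorithm),
]
```

Keeping the stages in an ordered list makes the fallback order explicit and lets a stage be swapped out without touching the others.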
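Element rule extraction: when the user clicks a web page element, the system automatically extracts that element's extraction rule, including XPath address and regular-expression parameter extraction.
Rule database matching: as the first step of intelligent parsing, matching is based on the primary domain name (for example, GD.*.CN is similar to SZ.*.CN). The corresponding rule is found and parsed in the background; if it extracts data effectively, it is recommended.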
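One way to read the GD.*.CN / SZ.*.CN example is that two hosts count as similar when they differ only in their leftmost label. The sketch below encodes that reading; it is an assumption, not the matching logic defined by the patent. `domain_key` and `RuleBase` are hypothetical names, rules are stored as plain strings, and a production system would more likely use a public-suffix list to find the registrable domain.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Drop the leftmost host label, e.g. gd.example.cn -> example.cn."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[1:]) if len(labels) > 2 else host


class RuleBase:
    """Maps a domain key to a stored collection rule (here, an XPath string)."""

    def __init__(self) -> None:
        self._rules: Dict[str, str] = {}

    def store(self, url: str, rule: str) -> None:
        self._rules[domain_key(url)] = rule

    def lookup(self, url: str) -> Optional[str]:
        # A hit is only a candidate: per the text, the rule is still tested in
        # the background and recommended only if it extracts data effectively.
        return self._rules.get(domain_key(url))
```

A rule stored for `http://gd.example.cn/...` would then be offered as a candidate for `http://sz.example.cn/...`, since both map to the key `example.cn`.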
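Model database matching/semantic analysis: as the second step of intelligent parsing, the model library fields are determined by the collection source type. Semantic analysis confirms the collection source type, and the models of that type are extracted from the library. If a rule extracts data effectively, it is recommended (when several rules all collect data, they are recommended together, taking the top three by character length). After the user adjusts a rule and it passes a valid test, the new rule is recorded (the correction count is incremented; if the rule was not adjusted, only the reference count is recorded). When the source rule contains the new rule, the correction is applied directly. When the user adds a field, the system adds that field and its initial rule to the model library for that type.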
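The "top three by character length" recommendation can be illustrated with a small ranking function. This is a sketch under stated assumptions: candidate rules are represented as regular expressions with exactly one capture group, and `recommend` is a hypothetical helper name; the patent does not specify how rules are encoded or ranked beyond the character-length criterion.

```python
import re
from typing import Dict, List, Tuple


def recommend(html: str, candidates: Dict[str, str],
              limit: int = 3) -> List[Tuple[str, str]]:
    """Return up to `limit` (rule name, extracted text) pairs, longest text first.

    Each candidate is assumed to be a regex with one capture group.
    Rules that match nothing, or only whitespace, are not recommended.
    """
    hits = []
    for name, pattern in candidates.items():
        match = re.search(pattern, html, flags=re.S)
        if match and match.group(1).strip():
            hits.append((name, match.group(1).strip()))
    hits.sort(key=lambda item: len(item[1]), reverse=True)
    return hits[:limit]
```

The user would then pick or adjust one of the returned rules, which is the point at which the correction and reference counts mentioned above would be updated.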
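Semantic algorithm: semantic analysis judges the URL type from the combination probability of verbs, prepositions, and adverbs in the title and content; the semantic algorithm then computes over the content and filters out usable rules (for example, for the content field of a news-type page: working backwards from content length, the bottom-level DIV, that is, a DIV containing no nested DIV, with the longest content (text) is taken to identify the element that holds the content, and the extraction rule is generated from it).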
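The news-content heuristic (pick the bottom-level DIV with the most text) can be sketched with the standard-library HTML parser. This is one possible reading of the example, not the patented algorithm: `LeafDivFinder` and `content_div` are hypothetical names, text length is counted after stripping whitespace, and text inside script or style tags is not filtered out.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LeafDivFinder(HTMLParser):
    """Record every <div> that contains no nested <div>, with its text length."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[dict] = []                 # currently open <div>s
        self.leaves: List[Tuple[int, dict]] = []     # (text length, attributes)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._stack:
                self._stack[-1]["has_child_div"] = True
            self._stack.append(
                {"attrs": dict(attrs), "text": 0, "has_child_div": False})

    def handle_data(self, data):
        # Text is credited to the innermost open <div>, even inside <p> etc.
        if self._stack:
            self._stack[-1]["text"] += len(data.strip())

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            node = self._stack.pop()
            if not node["has_child_div"]:
                self.leaves.append((node["text"], node["attrs"]))


def content_div(html: str) -> Optional[dict]:
    """Return the attributes of the leaf <div> holding the most text."""
    finder = LeafDivFinder()
    finder.feed(html)
    finder.close()
    if not finder.leaves:
        return None
    return max(finder.leaves, key=lambda leaf: leaf[0])[1]
```

The returned attributes (for instance an `id` or `class`) could then be turned into an extraction rule such as an XPath; that last step is left out here because the text does not say how the rule is serialized.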
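Intelligent parsing rules
1. Semantic algorithm: filters out invalid content through semantic analysis of the collected data, identifies the valid content, and calculates the extraction rule for the data.
2. Rule base: stores the collection rules of existing URLs; when an identical or similar new collection URL appears, the rule can be extracted and reused.
3. Model library: stores mainstream data collection models (model extension is supported), so that the computer device can quickly identify the elements to be extracted and generate rules.
For example, with the news model, when a news detail page URL is collected, the system can automatically extract fields such as "title", "author", "source", "release time", and "content".
4. Self-correction: when a rule has been generated by automatic parsing and the user changes it and runs a valid test (or a valid collection), the computer device automatically updates the model library.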
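The model library and its self-correction behaviour can be pictured as a per-type table of field rules with two counters. The sketch below is an assumption-laden illustration: `FieldRule`, `ModelLibrary`, and `record` are hypothetical names, rules are plain strings, and the update policy (replace the stored rule on any user change) is a simplification of the correction and reference counting described in the text.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FieldRule:
    rule: str
    reference_count: int = 0    # times the rule was accepted unchanged
    correction_count: int = 0   # times a user replaced the rule


@dataclass
class ModelLibrary:
    # source type (e.g. "news") -> field name (e.g. "title") -> rule record
    models: Dict[str, Dict[str, FieldRule]] = field(default_factory=dict)

    def record(self, source_type: str, field_name: str, used_rule: str) -> None:
        """Call after the user's (possibly adjusted) rule passed a valid test."""
        model = self.models.setdefault(source_type, {})
        current = model.get(field_name)
        if current is None:
            model[field_name] = FieldRule(used_rule)   # new field, initial rule
        elif current.rule == used_rule:
            current.reference_count += 1               # accepted as recommended
        else:
            current.rule = used_rule                   # user corrected the rule
            current.correction_count += 1
```

For a news model, repeated calls such as `record("news", "title", ...)` would gradually build up the "title", "author", "source", "release time", and "content" fields along with a history of how often each rule was reused or corrected.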
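Optionally, before intelligently parsing the rule, the above method may further include:
manually formulating rules, specifically:
1. The user formulates rules for extracting element data by analyzing the data structure of the URL.
2. Element positioning: the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.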
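Element positioning amounts to turning a clicked element into a rule that finds it again. The sketch below builds a positional XPath for an element with the standard-library `xml.etree.ElementTree`; it assumes well-formed markup and a hypothetical helper name `xpath_of`, and it says nothing about how the click itself is captured in a browser, which the text does not describe.

```python
import xml.etree.ElementTree as ET


def xpath_of(root: ET.Element, target: ET.Element) -> str:
    """Build a positional XPath from `root` down to `target`.

    Raises KeyError if `target` is not a descendant of `root`.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [child for child in parent if child.tag == node.tag]
        # XPath positions are 1-based and count siblings with the same tag.
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))
```

A positional path like this is brittle when page layout changes, which is one reason a system might also offer attribute-based rules or the regular-expression extraction mentioned earlier.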
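Referring to FIG. 2B, FIG. 2B shows a computer device, the computer device comprising:
an obtaining unit 201, configured to acquire an add task and the URL to be added;
a processing unit 202, configured to add a collection rule for the web page, intelligently parse the collection rule, test the collection rule, publish the task after the collection rule passes the test, start collecting web pages according to the collection rule, and publish the collected web page data.
Optionally, the processing unit 202 is further configured to synchronize the published task to a database.
Optionally, the processing unit 202 is further configured to support manually formulated rules, specifically: the user formulates rules for extracting element data by analyzing the data structure of the URL; the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
Optionally, the processing unit 202 is specifically configured to perform semantic-algorithm parsing on the collection rule to identify its valid content and to calculate the extraction rule for the data.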
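Referring to FIG. 3, FIG. 3 shows a computer device 30 comprising a processor 301, a transceiver 302, a memory 303, and a bus 304. The transceiver 302 is configured to send data to and receive data from external devices. There may be one or more processors 301. In some embodiments of the application, the processor 301, the memory 303, and the transceiver 302 may be connected by the bus 304 or by other means. The computer device 30 can be used to perform the steps of FIG. 1. For the meaning of the terms used in this embodiment and for examples, refer to the embodiment corresponding to FIG. 1; they are not repeated here.
The memory 303 stores program code. The processor 301 is configured to call the program code stored in the memory 303 to perform the following operations:
the processor 301 is configured to, after starting, receive a plurality of pieces of location information sent by a location sensor, identify the plurality of pieces of location information to obtain a first motion trend, query a first operation corresponding to the first motion trend, and perform the first operation.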
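It should be noted that the processor 301 here may be a single processing element or a collective term for multiple processing elements. For example, the processing element may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application, such as one or more microprocessors (digital signal processor, DSP) or one or more field-programmable gate arrays (FPGAs).
The memory 303 may be a single storage device or a collective term for multiple storage elements, and is used to store executable program code or the parameters, data, and the like required to run the application. The memory 303 may include random access memory (RAM) and may also include non-volatile memory, such as disk storage or flash memory.
The bus 304 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 3, but this does not mean that there is only one bus or one type of bus.
The terminal may further include an input/output device connected to the bus 304, so as to connect to other parts such as the processor 301 via the bus. The input/output device can provide an input interface through which the operator selects a control item; it can also be another interface through which other devices are connected externally.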
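It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. A person skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include a flash drive, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.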

Claims (10)

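  1. An intelligent collection method for web pages, characterized in that the method comprises the following steps:
    a computer device acquires an add task and the URL to be added;
    the computer device adds a collection rule for the web page and intelligently parses the collection rule;
    the computer device tests the collection rule and, after the collection rule passes the test, publishes the task;
    the computer device starts collecting web pages according to the collection rule and publishes the collected web page data.
  2. The method according to claim 1, characterized in that, after publishing the task, the method further comprises:
    synchronizing the published task to a database.
  3. The method according to claim 1, characterized in that, before intelligently parsing the collection rule, the method further comprises:
    manually formulating rules, specifically: the user formulates rules for extracting element data by analyzing the data structure of the URL; the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
  4. The method according to claim 1, characterized in that intelligently parsing the collection rule specifically comprises:
    performing semantic-algorithm parsing on the collection rule to identify its valid content, and calculating the extraction rule for the data.
  5. The method according to claim 1, characterized in that intelligently parsing the collection rule specifically comprises:
    searching the rule base for the collection rule corresponding to the added URL and, if the rule base contains the added URL, extracting and reusing the collection rule corresponding to the added URL.
  6. A computer device, characterized in that the computer device comprises:
    an obtaining unit, configured to acquire an add task and the URL to be added;
    a processing unit, configured to add a collection rule for the web page, intelligently parse the collection rule, test the collection rule, publish the task after the collection rule passes the test, start collecting web pages according to the collection rule, and publish the collected web page data.
  7. The computer device according to claim 6, characterized in that the processing unit is further configured to synchronize the published task to a database.
  8. The computer device according to claim 6, characterized in that the processing unit is further configured to support manually formulated rules, specifically: the user formulates rules for extracting element data by analyzing the data structure of the URL; the computer device automatically locates web page elements so that the user can obtain an element's content by clicking it, and the computer device generates the collection rule.
  9. The computer device according to claim 6, characterized in that the processing unit is specifically configured to perform semantic-algorithm parsing on the collection rule to identify its valid content and to calculate the extraction rule for the data.
  10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the intelligent web page collection method according to any one of claims 1 to 5.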
PCT/CN2017/090717 2017-06-29 2017-06-29 Intelligent web page collection method and system WO2019000303A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/090717 WO2019000303A1 (zh) 2017-06-29 2017-06-29 Intelligent web page collection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/090717 WO2019000303A1 (zh) 2017-06-29 2017-06-29 Intelligent web page collection method and system

Publications (1)

Publication Number Publication Date
WO2019000303A1 (zh) 2019-01-03

Family

ID=64740711

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/090717 WO2019000303A1 (zh) 2017-06-29 2017-06-29 Intelligent web page collection method and system

Country Status (1)

Country Link
WO (1) WO2019000303A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
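CN101561802A (zh) * 2008-04-18 2009-10-21 上海复旦光华信息科技股份有限公司 Method and system for extracting structured data from web pages
CN101582075A (zh) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
CN103761330A (zh) * 2014-02-10 2014-04-30 赛特斯信息科技股份有限公司 System and method for automatic Internet information extraction based on template configuration
CN103838796A (zh) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Method for extracting structured information from web pages
CN104881488A (zh) * 2015-06-05 2015-09-02 焦点科技股份有限公司 Configurable information extraction method based on relation tables
CN105468664A (zh) * 2015-05-12 2016-04-06 北京众标网络科技有限公司 Information collection method and device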

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 17915510

Country of ref document: EP

Kind code of ref document: A1

122 EP: PCT application non-entry in European phase

Ref document number: 17915510

Country of ref document: EP

Kind code of ref document: A1