CN108614849B - A Webpage Advertisement Detection Method Based on Dynamic Insertion and Static Multi-Script Page Feature Extraction - Google Patents

Info

Publication number
CN108614849B
Authority
CN
China
Prior art keywords
advertisement
advertisements
webpage
dynamic
script
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710033452.3A
Other languages
Chinese (zh)
Other versions
CN108614849A (en
Inventor
张卫丰
赵晨
刘蕊成
陈贵美
许蕾
张迎周
周国强
王子元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nupt Institute Of Big Data Research At Yancheng
Original Assignee
Nupt Institute Of Big Data Research At Yancheng
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nupt Institute Of Big Data Research At Yancheng
Priority to CN201710033452.3A
Publication of CN108614849A
Application granted
Publication of CN108614849B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for detecting webpage advertisements that combines dynamic and static program analysis to identify advertisement code contained in a webpage. First, dynamic analysis locates candidate advertisement positions in the webpage; the advertisement at each position is then recorded and tracked, and the function call path information it generates is collected to obtain the set of script files involved in generating it. Static features are extracted from the script files on the advertisement generation path and used to classify the file set, and the types and number of static features used are adjusted according to a test set. The method improves detection precision for dynamic webpage advertisements while reducing the missed-detection rate.

Description

A Webpage Advertisement Detection Method Based on Dynamic Instrumentation and Static Multi-Script Page Feature Extraction

Technical Field

The invention relates to a webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction, and belongs to the field of the Internet and software engineering.

Background

With the spread and development of the Internet, online advertising has become an important prerequisite for the profitability and continued development of free websites. With the development of Web 2.0 technology, the widespread use of JavaScript has made webpage advertisements the mainstream form of online advertising: when a user opens a webpage, a webpage advertisement may appear as a pop-up window or may occupy part of the page to attract clicks. Such advertisements have promoted the business of online sellers, chiefly e-commerce companies, and have also improved the quality of service that free websites provide to users.

The rapid development of online advertising has brought many conveniences to online business models, but it has also made it easier for malicious websites to spread. Some malicious websites use advertising networks to load their own malicious scripts into normal pages and induce users to click malicious links disguised as advertisements; some webpage advertisements occupy a large share of the page and seriously degrade the reading experience; and some seriously interfere with normal browsing and collect users' private information in violation of their privacy.

In recent years, detection methods for online advertisements have focused mainly on static pattern matching and static feature matching.

There are two static pattern matching methods: collecting the domain names of all advertising service companies to build a blacklist, and using selector pattern matching to identify advertisement elements in the browser.

Static feature matching extracts features from webpages that contain advertisements, such as native function calls, use of the eval function, code length, the presence of specific strings, and the use of obfuscation techniques, and uses these features to identify and detect webpage advertisements.

In the prior art, static pattern matching cannot correctly detect obfuscated domain names and selectors; in addition, static feature matching uses only a single page as its data set, so its detection accuracy is low.
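
The limitation of blacklist matching can be illustrated with a minimal Python sketch. The blacklist entries, the URL regular expression, and the function names are assumptions made for illustration; they are not taken from the patent or from any real blocking tool.

```python
import re
from urllib.parse import urlparse

# Hypothetical blacklist of ad-serving domains (illustrative values only).
AD_DOMAIN_BLACKLIST = frozenset({"ads.example.com", "adnetwork.example.net"})

# Simple pattern for URLs that appear literally in script source text.
URL_PATTERN = re.compile(r"https?://[^\s'\"<>)]+")


def is_blacklisted(url, blacklist=AD_DOMAIN_BLACKLIST):
    """Return True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blacklist)


def blacklisted_urls_in_source(source, blacklist=AD_DOMAIN_BLACKLIST):
    """Statically scan script source text for literal URLs on the blacklist."""
    return [u for u in URL_PATTERN.findall(source) if is_blacklisted(u, blacklist)]


plain = 'var s = "https://ads.example.com/show.js";'
# The same URL assembled at run time: no literal URL exists for a static scan to find.
obfuscated = 'var s = "ht" + "tps://ads.ex" + "ample.com/show.js";'
```

Under these assumptions the static scan finds the URL in `plain` but returns nothing for `obfuscated`, even though both scripts would load the same file when executed.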

Static advertisements in webpages are usually just an image containing a link inserted into the original page, or sometimes a bare link tag. This is not substantially different from the many image links to external sites found on portal websites, so static advertisements are outside the scope of this work.

Regarding the dynamic nature of advertisements, we focus on dynamic advertisements rather than simple image links. In analyzing dynamic advertisements, we mainly consider advertisements distributed through advertising networks. Such advertisements usually require the page publisher only to insert a designated tag when coding the page, which is used to position and place the advertisement. The advertising network recognizes this tag and dynamically generates the advertisement content to display based on information such as the cookies of the user viewing the page. To display the advertisement, JavaScript script files are inserted into the page; these scripts usually execute automatically and, after a series of function calls, finally display different advertisements on the page.

Summary of the Invention

Technical problem: The purpose of the invention is to overcome the deficiencies of the prior art by dynamically acquiring, along the propagation path of an advertisement in a webpage, all script files on the advertisement generation path and using them as the basis for the features that identify webpage advertisements.

To achieve this, the invention first executes a page containing webpage advertisements and dynamically obtains the function call path information of the advertisement generation path in the page; from the call path information it obtains all JavaScript script files needed to generate the advertisement. On this basis, all of the script files are statically analyzed, and webpage advertisements are identified from their features.

By using dynamic instrumentation to obtain the advertisements in a webpage and the call paths that generate them, the invention overcomes the inability of static pattern matching to detect obfuscated domain names and selectors. Because features are extracted from the multiple scripts used during advertisement generation, the method is well targeted and overcomes the high data noise of static feature matching.

The method of the invention comprises the following steps:

Step 1: Analyze advertisements to obtain their dynamic characteristics, and locate advertisements in the webpage

The dynamic characteristics of webpage advertisements are obtained by analyzing the dynamic advertisements in a webpage. This requires analyzing the complete generation process of a dynamic advertisement and comparing it with that of ordinary page elements in order to locate specific webpage advertisements.

Step 2: Trace the call path of the webpage advertisement

Once step 1 has located the webpage advertisement in the page, the complete call path of the advertisement is traced, including the function call path during advertisement generation and the script code actually executed. From the function call path, all JavaScript script files on the advertisement generation path can be obtained. In this way, the analysis of an advertisement is not limited to the features of elements on a single page.

Step 3: Extract features from the collected script files

Using the method of step 2, advertisement-related JavaScript script files are obtained. A large number of advertisement-related and advertisement-unrelated script files are collected, and features are extracted from them to obtain the static features associated with advertisement generation, including HTML DOM element features, JavaScript script features, and CSS features. A classifier is trained on these features to produce an advertisement code detection model.

Step 4: Feed back the results

The advertisement code detection model is run on test data, its results are compared with the actual advertisements, the thresholds used in classification are adjusted, and the model is then used to detect and identify actual webpage advertisements.

Compared with the prior art, the invention has the following advantages:

Dynamic detection overcomes interference from obfuscated code and lowers the missed-detection rate for pages; obtaining the multiple JavaScript script files on the advertisement call path allows features to be extracted at multiple levels, which is more comprehensive and reliable.

Brief Description of the Drawings

Fig. 1 is a flow chart of locating the starting position of an advertisement;

Fig. 2 is a flow chart of obtaining the function call path of an advertisement.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

Step 1: Analyze advertisements to obtain their dynamic characteristics, and locate advertisements in the webpage

The dynamic characteristics of webpage advertisements are obtained by analyzing the dynamic advertisements in a webpage. This requires analyzing the complete generation process of a dynamic advertisement and comparing it with that of ordinary page elements in order to locate the specific webpage advertisement code.

Regarding the dynamic nature of advertisements, we focus on dynamic advertisements rather than simple image links. In analyzing dynamic advertisements, we mainly consider advertisements distributed through advertising networks. Such advertisements usually require the page publisher only to insert a designated tag when coding the page, which is used to position and place the advertisement. The advertising network recognizes this tag and dynamically generates the advertisement content to display based on information such as the cookies of the user viewing the page. To display the advertisement, JavaScript script files are inserted into the page; these scripts usually execute automatically and, after a series of function calls, finally display different advertisements on the page.

From the advertisement generation process it follows that generating a dynamic advertisement must call a third-party script library, namely that of the advertising network. The dynamic characteristic of advertisement generation used here is therefore the automatic execution of JavaScript code from a third-party script library.

As shown in Fig. 1, page elements are instrumented on the basis of this characteristic: functions that execute automatically are recorded, each is checked to determine whether it comes from a third-party script library, and executions that satisfy both conditions are marked.
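
The marking rule of step 1 can be sketched in Python as a model of the decision logic only; it does not instrument a real browser. The `Execution` record format and the rule that "third-party" means "script host differs from page host" are simplifying assumptions, not details given in the patent.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Execution:
    """One function execution recorded by the instrumentation (hypothetical format)."""
    function_name: str
    script_url: str        # URL of the script file that defines the function
    auto_executed: bool    # True if it ran without any user interaction
    marked: bool = False   # set when the execution looks like an ad entry point


def is_third_party(script_url, page_url):
    """Assumption: a script is third-party when its host differs from the page's host."""
    script_host = urlparse(script_url).hostname
    page_host = urlparse(page_url).hostname
    return script_host is not None and script_host != page_host


def mark_ad_candidates(executions, page_url):
    """Mark executions that are both automatic and from a third-party script."""
    for ex in executions:
        ex.marked = ex.auto_executed and is_third_party(ex.script_url, page_url)
    return [ex for ex in executions if ex.marked]


executions = [
    Execution("render", "https://site.example/app.js", auto_executed=True),
    Execution("showAd", "https://ads.example.net/lib.js", auto_executed=True),
    Execution("onClick", "https://ads.example.net/lib.js", auto_executed=False),
]
candidates = mark_ad_candidates(executions, "https://site.example/index.html")
```

Here only `showAd` is marked: `render` runs automatically but is first-party, and `onClick` is third-party but is triggered by the user.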

Step 2: Trace the call path of the webpage advertisement

Once step 1 has located the webpage advertisement in the page, the complete call path of the advertisement is traced.

Generating a webpage advertisement involves many function calls; the call path passes through multiple jumps and invokes multiple script files. The script files on the call path are the data set from which features are extracted, so for each candidate advertisement position identified in step 1, the call path must be traced and recorded.

For a function call in JavaScript, it is not possible to obtain which functions a function calls while it executes, but it is possible to obtain the function that called it. Based on this property, as shown in Fig. 2, the caller of each function is obtained while the page runs and checked to see whether it has already been marked. If the caller is marked, path information is added to the function itself and the path is traced and saved; if the caller is not marked, nothing is done.

Dynamic instrumentation can attach a custom property containing call information to each JavaScript function. By reading out this call information, the specific set of script files on the path can be obtained, which makes it possible to analyze the features of all script files on the advertisement generation path.
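
The caller-based propagation described above can be modeled with a short Python sketch. It assumes the instrumentation has already produced a list of (caller, callee) events in execution order; the `FunctionInfo` record and its `path` field stand in for the custom property mentioned in the text and are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionInfo:
    name: str
    script_url: str
    marked: bool = False                       # True for entry points found in step 1
    path: list = field(default_factory=list)   # function names from the entry point


def trace_call_paths(functions, calls):
    """Propagate marks along (caller, callee) events given in execution order.

    Returns the set of script URLs that lie on a marked (ad generation) path.
    """
    for info in functions.values():
        if info.marked and not info.path:
            info.path = [info.name]
    for caller_name, callee_name in calls:
        caller = functions[caller_name]
        callee = functions[callee_name]
        if caller.marked and not callee.marked:
            # Caller is on an ad path: extend the path to the callee.
            callee.marked = True
            callee.path = caller.path + [callee.name]
        # If the caller is not marked, nothing is done.
    return {f.script_url for f in functions.values() if f.marked}


functions = {
    "showAd": FunctionInfo("showAd", "https://ads.example.net/lib.js", marked=True),
    "fetchCreative": FunctionInfo("fetchCreative", "https://ads.example.net/fetch.js"),
    "draw": FunctionInfo("draw", "https://cdn.example.net/draw.js"),
    "initMenu": FunctionInfo("initMenu", "https://site.example/app.js"),
    "openMenu": FunctionInfo("openMenu", "https://site.example/app.js"),
}
calls = [("showAd", "fetchCreative"), ("initMenu", "openMenu"), ("fetchCreative", "draw")]
ad_scripts = trace_call_paths(functions, calls)
```

In this example the three scripts reached from `showAd` end up in `ad_scripts`, while `app.js` is left out because neither `initMenu` nor `openMenu` has a marked caller.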

Step 3: Extract features from the collected script files

The script files obtained in step 2 serve as the data set from which the static features of advertisements are extracted. The dynamic instrumentation method of step 2 is applied to a large number of websites containing advertisements to obtain advertisement-related JavaScript script files, which are saved in batches to form the data set for static feature extraction. JavaScript script files unrelated to advertisements are saved in the same way as an advertisement-unrelated control data set.

The static features include the depth of the function call path during advertisement generation, the number of string concatenations in a script file, the number of dynamic code executions, the types and counts of native functions used, and the types and counts of JavaScript event handlers used. Features are extracted from the advertisement files accordingly, and a script file whose features meet the criteria is judged to be an advertisement.
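
A rough Python sketch of extracting these features from JavaScript source text follows. It uses regular expressions rather than a real JavaScript parser, so the counts are heuristic. The lists of native functions and event handlers, the regular expressions, and the feature names are illustrative assumptions; the patent does not enumerate them.

```python
import re

# Illustrative name lists; the patent does not specify which names are counted.
NATIVE_FUNCTIONS = ("document.write", "document.createElement", "appendChild",
                    "setTimeout", "setInterval")
EVENT_HANDLERS = ("onclick", "onload", "onmouseover", "addEventListener")

# Dynamic code execution: eval(...) or the Function(...) constructor.
DYNAMIC_EXEC = re.compile(r"\b(?:eval|Function)\s*\(")
# Heuristic: a '+' directly next to a string literal counts as one concatenation.
STRING_CONCAT = re.compile(r"""["']\s*\+|\+\s*["']""")


def count_names(source, names):
    """Count whole-identifier occurrences of each name in the source text."""
    counts = {}
    for name in names:
        pattern = r"(?<![\w$])" + re.escape(name) + r"(?![\w$])"
        counts[name] = len(re.findall(pattern, source))
    return counts


def extract_features(source, call_path_depth):
    """Build a feature dictionary for one script file on an ad generation path."""
    native = count_names(source, NATIVE_FUNCTIONS)
    events = count_names(source, EVENT_HANDLERS)
    return {
        "call_path_depth": call_path_depth,   # supplied by the step 2 trace
        "string_concat_count": len(STRING_CONCAT.findall(source)),
        "dynamic_exec_count": len(DYNAMIC_EXEC.findall(source)),
        "native_function_kinds": sum(1 for c in native.values() if c),
        "native_function_calls": sum(native.values()),
        "event_handler_kinds": sum(1 for c in events.values() if c),
        "event_handler_uses": sum(events.values()),
    }


sample = 'var u = "a" + "b";\neval(u);\ndocument.write(u);\nel.onclick = go;\nsetTimeout(go, 10);\n'
features = extract_features(sample, call_path_depth=3)
```

The resulting dictionaries, one per script file, could then be turned into feature vectors for whatever classifier is used to build the detection model.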

Step 4: Feed back the results

The results of running the classifier on test data are compared with the actual advertisements, the thresholds used in classification are adjusted, and webpage advertisements are then detected and identified.

The experimental data are divided into two parts. The first is training data, used to classify the features from step 3 into advertisement-related and advertisement-unrelated features; the second is test data, used to check the trained model and evaluate its accuracy.
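
One way to adjust a classification threshold against labeled data is sketched below in Python. The patent does not say which criterion guides the adjustment; choosing the threshold with the highest F1 score is an assumption made here for illustration, as are the function names and the sample scores.

```python
def confusion_counts(scores, labels, threshold):
    """Count true positives, false positives and false negatives at a threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """F1 balances precision against the missed-detection rate (1 - recall)."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def tune_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the best F1 on labeled data (assumed criterion)."""
    return max(candidates, key=lambda t: f1_score(*confusion_counts(scores, labels, t)))


# Hypothetical classifier scores and ground-truth labels (True = advertisement).
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]
best = tune_threshold(scores, labels, candidates=[0.3, 0.5, 0.7])
```

With these sample values the lowest candidate wins because it recovers the advertisement scored 0.4 at the cost of one false positive; a different criterion, such as a cap on false positives, would select a different threshold.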

The invention is not limited to the above examples; all technical solutions formed by equivalent substitution fall within the scope of protection claimed by the invention.

Claims (5)

1. A webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction, characterized in that a page containing webpage advertisements is first executed to dynamically obtain the function call path information of the advertisement generation path in the page, and all JavaScript script files needed to generate the webpage advertisement are obtained from the call path information; on this basis, all of the script files are statically analyzed and webpage advertisements are identified by a feature classification model; the method comprising the following steps:
1) analyzing advertisements to obtain their dynamic characteristics and locating advertisements in the webpage: the dynamic characteristics of webpage advertisements are obtained by analyzing the dynamic advertisements in the webpage, which requires analyzing the complete generation process of a dynamic advertisement and comparing it with that of ordinary page elements in order to locate the specific webpage advertisement code;
2) using the method of 1) to locate the webpage advertisement in the page, and then tracing the complete call path of the webpage advertisement, including the function call path during advertisement generation and the script code actually executed, all JavaScript script files on the advertisement generation path being obtained from the function call path;
3) extracting features from the obtained script files: features are extracted from the multiple JavaScript files obtained in 2) to obtain the static features of advertisement generation, including HTML DOM element features, JavaScript script features, and CSS features, and a classifier is trained on them to produce an advertisement code detection model;
4) feeding back the results: the advertisement code detection model is run on test data, its results are compared with the actual advertisements, the thresholds used in classification are adjusted, and actual webpage advertisements are detected and identified.

2. The webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction according to claim 1, characterized in that in step 1), analyzing advertisements to obtain their dynamic characteristics and locating advertisements in the webpage specifically comprises:
obtaining the dynamic characteristics of webpage advertisements by analyzing the dynamic advertisements in the webpage, which requires analyzing the complete generation process of a dynamic advertisement and comparing it with that of ordinary page elements in order to locate the specific webpage advertisement code;
in analyzing dynamic advertisements, considering advertisements distributed through advertising networks, the generation of which calls a third-party script library, namely that of the advertising network, the dynamic characteristic of advertisement generation being the automatic execution of JavaScript code from the third-party script library.

3. The webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction according to claim 1, characterized in that in step 2), tracing the call path of the webpage advertisement comprises:
after the webpage advertisement has been located in the page, tracing its complete call path; generating a webpage advertisement involves many function calls, and the call path includes multiple jumps, that is, multiple script files are invoked; these script files are the data set subsequently used to obtain advertisement features, so for each candidate advertisement position already identified, the call path must be traced and recorded;
while the page runs, obtaining the caller of each function and checking whether the caller has already been marked; if the caller is marked, adding path information to the function itself and tracing and saving the path; if the caller is not marked, doing nothing;
using dynamic instrumentation to attach to each JavaScript function a custom property containing call information, and reading out the call information to obtain the specific set of script files on the path, thereby enabling the features of all script files on the advertisement generation path to be analyzed.

4. The webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction according to claim 1, characterized in that in step 3), extracting features from the obtained script files comprises:
applying the dynamic instrumentation method to a large number of websites containing advertisements to obtain advertisement-related JavaScript script files, and saving the files in batches to form the data set for extracting the static features of advertisements;
saving JavaScript script files unrelated to advertisements in the same way as an advertisement-unrelated control data set;
extracting static features from the advertisement-related JavaScript script files, including the depth of the function call path during advertisement generation, the number of string concatenations in a script file, the number of dynamic code executions, the types and counts of native functions used, and the types and counts of JavaScript event handlers used, and on this basis extracting features from the advertisement files and determining whether a script file whose features meet the criteria is an advertisement.

5. The webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction according to claim 1, characterized in that in step 4), the results are fed back: the results of running the classifier on test data are compared with the actual advertisements, the thresholds used in classification are adjusted, and webpage advertisements are detected and identified; the training data in the experimental data are used to classify the features used into advertisement-related and advertisement-unrelated features; and the test data are used to check the trained model and evaluate its accuracy.

CN201710033452.3A 2017-01-13 2017-01-13 A Webpage Advertisement Detection Method Based on Dynamic Insertion and Static Multi-Script Page Feature Extraction Active CN108614849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710033452.3A CN108614849B (en) 2017-01-13 2017-01-13 A Webpage Advertisement Detection Method Based on Dynamic Insertion and Static Multi-Script Page Feature Extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710033452.3A CN108614849B (en) 2017-01-13 2017-01-13 A Webpage Advertisement Detection Method Based on Dynamic Insertion and Static Multi-Script Page Feature Extraction

Publications (2)

Publication Number Publication Date
CN108614849A CN108614849A (en) 2018-10-02
CN108614849B CN108614849B (en) 2022-11-18

Family

ID=63658174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710033452.3A Active CN108614849B (en) 2017-01-13 2017-01-13 A Webpage Advertisement Detection Method Based on Dynamic Insertion and Static Multi-Script Page Feature Extraction

Country Status (1)

Country Link
CN (1) CN108614849B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110278212A (en) * 2019-06-26 2019-09-24 中国工商银行股份有限公司 Link detection method and device
CN111177614A (en) * 2019-11-22 2020-05-19 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Source tracking method and device for injecting content to third parties on webpages
CN113870064A (en) * 2020-06-30 2021-12-31 北京奇虎科技有限公司 Advertising evidence collection method, system, storage medium and computer equipment of intelligent terminal

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093148A (en) * 2012-12-28 2013-05-08 广东欧珀移动通信有限公司 Method, system and device for detecting malicious advertisement
CN103177382B (en) * 2013-03-19 2015-11-11 武汉大学 Key propagation path in microblog and the detection method of Centroid
CN103905423B (en) * 2013-12-25 2017-08-11 武汉安天信息技术有限责任公司 A kind of harmful advertising member detection method and system analyzed based on dynamic behaviour
CN104766014B (en) * 2015-04-30 2017-12-01 安一恒通(北京)科技有限公司 Method and system for detecting malicious website

Also Published As

Publication number Publication date
CN108614849A (en) 2018-10-02

Similar Documents

Publication Publication Date Title
CA3018196C (en) Visual regresssion testing tool
US8412569B1 (en) Determining advertising statistics for advertisers and/or advertising networks
Atterer et al. Knowing the user's every move: user activity tracking for website usability evaluation and implicit interaction
CN107908959B (en) Website information detection method and device, electronic equipment and storage medium
US20160140626A1 (en) Web page advertisement configuration and optimization with visual editor and automatic website and webpage analysis
CN110399291A (en) User Page test method and relevant device based on image recognition
US20130054672A1 (en) Systems and methods for contextualizing a toolbar
KR20110035960A (en) Method and system to identify advertisement in web page
CN104766014A (en) Method and system used for detecting malicious website
TW200917057A (en) Automatically instrumenting a set of web documents
CN108614849B (en) A Webpage Advertisement Detection Method Based on Dynamic Insertion and Static Multi-Script Page Feature Extraction
US20220269736A1 (en) Identifying web elements based on user browsing activity and machine learning
US10282761B2 (en) Systems and processes for detecting content blocking software
CN105160027A (en) Advertisement data processing method and device
US20200225927A1 (en) Methods and systems for automating computer application tasks using application guides, markups and computer vision
CN109191158A (en) The processing method and processing equipment of user's portrait label data
US20130091415A1 (en) Systems and methods for invisible area detection and contextualization
Martínez et al. Web-tracking compliance: websites’ level of confidence in the use of information-gathering technologies
CN111914199B (en) A method, device, equipment and storage medium for filtering page elements
CN108171074B (en) An Automatic Detection Method of Web Tracking Based on Content Association
Bevendorff et al. The impact of online affiliate marketing on web search
KR102732928B1 (en) Method and apparatus for analyzing posting of advertising marketer based on posting website
CN118820607B (en) Digital school enrollment information matching method and system combined with dynamic content recommendation
CN114662145B (en) Access tracking detection method, device, readable storage medium and electronic device
US20230096058A1 (en) Data Correlation System And Method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181002

Assignee: Jiangsu Yanan Information Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980047097

Denomination of invention: A web ad detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20231117

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181002

Assignee: Yanmi Technology (Yancheng) Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980047098

Denomination of invention: A web ad detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20231115

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181002

Assignee: Yancheng Nongfu Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980049126

Denomination of invention: A web ad detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20231203

EE01 Entry into force of recordation of patent licensing contract
EC01 Cancellation of recordation of patent licensing contract

Assignee: Yancheng Nongfu Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980049126

Date of cancellation: 20241028

Assignee: Yanmi Technology (Yancheng) Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980047098

Date of cancellation: 20241028

Assignee: Jiangsu Yanan Information Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980047097

Date of cancellation: 20241028

EC01 Cancellation of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181002

Assignee: Yancheng Nongfu Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2024980021382

Denomination of invention: A web advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20241030

Application publication date: 20181002

Assignee: Yancheng Hongrui Huicheng Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2024980020857

Denomination of invention: A web advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20241028

Application publication date: 20181002

Assignee: Shuzhilian (Yancheng) Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2024980020855

Denomination of invention: A web advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20241028

Application publication date: 20181002

Assignee: Borui Hengchuang (Yancheng) Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2024980020851

Denomination of invention: A web advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20241028

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181002

Assignee: Jiangsu Yanan Information Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2024980022197

Denomination of invention: A web advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20241101

Application publication date: 20181002

Assignee: Yanmi Technology (Yancheng) Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2024980021700

Denomination of invention: A web advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20241031

EE01 Entry into force of recordation of patent licensing contract