CN108614849B - Webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction - Google Patents

Info

Publication number
CN108614849B
Authority
CN
China
Prior art keywords
advertisement
webpage
dynamic
script
advertisements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710033452.3A
Other languages
Chinese (zh)
Other versions
CN108614849A (en)
Inventor
张卫丰
赵晨
刘蕊成
陈贵美
许蕾
张迎周
周国强
王子元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nupt Institute Of Big Data Research At Yancheng
Original Assignee
Nupt Institute Of Big Data Research At Yancheng
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nupt Institute Of Big Data Research At Yancheng
Priority to CN201710033452.3A
Publication of CN108614849A
Application granted
Publication of CN108614849B
Active legal status
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • G06F8/43: Checking; Contextual analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a new method for detecting webpage advertisements. It uses program analysis that combines dynamic and static analysis to identify advertisement code contained in a webpage. First, dynamic analysis locates possible advertisement positions in the webpage; the advertisement at each position is then recorded and traced, and the function call path information it generates is collected to obtain the set of script files involved in generating it. The file set is then classified using static features extracted along the advertisement generation path, and the types and number of static features used are tuned against a test set. The method improves the detection precision for dynamic webpage advertisements while reducing the miss rate.

Description

Webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction
Technical Field
The invention relates to a webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction, and belongs to the field of Internet and software engineering.
Background
With the popularization and development of the Internet, online advertising has become an important source of revenue and a precondition for the continued development of free websites. With the development of Web 2.0 technology, the wide use of JavaScript has made webpage advertisements the mainstream form of online advertising: when a user opens a webpage, an advertisement may appear as a pop-up window or occupy part of the page to attract clicks. Advertising promotes the business of online sellers, mainly in e-commerce, and also helps free websites improve the quality of the service they provide to users.
The rapid development of webpage advertising has benefited web business models, but it has also made it easier for malicious websites to spread: some malicious websites load their own malicious scripts into normal pages through an advertisement alliance and induce users to click malicious links disguised as advertisements; some advertisements occupy a large part of the page and seriously degrade the reading experience; and some seriously interfere with normal access, collecting users' private information and infringing on their privacy.
In recent years, detection methods for online advertisements have mainly relied on static pattern matching and static feature matching.
There are two static pattern matching methods: building a blacklist by collecting the domain names of advertisement service companies, and identifying advertisement elements in the browser by selector pattern matching.
The static feature matching method extracts features from a webpage that contains advertisements, such as the native function calls in the page, the use of the eval function, code length, whether specific strings are present, and whether obfuscation is used, and uses these features to identify and detect webpage advertisements.
In the prior art, static pattern matching cannot correctly detect obfuscated domain names and selectors; in addition, static feature matching uses only a single page as its data set, so its detection precision is low.
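To make this limitation concrete, the following Python sketch imitates a domain blacklist check. The domain names and the `is_blacklisted` helper are hypothetical and do not come from the patent; they only illustrate why an exact-match list misses a domain it has never seen.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of advertising domains (illustrative names only).
AD_DOMAIN_BLACKLIST = {"ads.example.com", "adnetwork.example.net"}

def is_blacklisted(script_url: str) -> bool:
    """Return True if the script's host, or one of its parent domains, is listed."""
    host = urlparse(script_url).hostname or ""
    parts = host.split(".")
    # Check the host itself and each parent domain, e.g. a.b.c -> a.b.c, b.c
    candidates = {".".join(parts[i:]) for i in range(len(parts) - 1)}
    return bool(candidates & AD_DOMAIN_BLACKLIST)

# A known ad host is caught; a rotated or obfuscated host is not.
known = is_blacklisted("https://cdn.ads.example.com/show.js")
rotated = is_blacklisted("https://x7f3q.example.org/show.js")
```

Here `known` is true and `rotated` is false even though both URLs could serve the same advertisement script, which is the gap the dynamic approach described later is meant to close.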
A static advertisement in a webpage is usually a picture containing a link inserted into the original page, or sometimes just a link tag. Because this is not substantially different from the picture links on portal sites that link to many external sites, static advertisements are outside the scope of this work.
Regarding the dynamic nature of advertisements, we focus on dynamic advertisements rather than mere picture links. For the analysis of dynamic advertisements, we mainly consider advertisements distributed through an advertisement alliance. Such advertisements generally only require the publisher of a webpage to insert a specified tag when coding the page, which is used to position and place the advertisement; by identifying the tag, the advertisement alliance dynamically generates the advertisement content to display according to information such as the cookies of the user browsing the page. The advertisement is displayed by inserting JavaScript script files into the page; these scripts are usually executed automatically and, through a series of function calls, finally display different advertisements on the page.
Disclosure of Invention
Technical problem: the invention aims to overcome the defects of the prior art. Following the propagation path of an advertisement in a webpage, it dynamically acquires all script files on the advertisement generation path and uses features of these script files to identify the webpage advertisement.
To achieve this purpose, the method first executes a page containing a webpage advertisement, dynamically obtains the function call path information of the advertisement generation path in the page, and uses the call path information to obtain all JavaScript files required to generate the advertisement. On this basis, all of these script files are statically analyzed, and the webpage advertisement is identified through the extracted features.
The method uses dynamic instrumentation to obtain the advertisements in a webpage and the call paths they generate, which overcomes the inability of static pattern matching to detect obfuscated domain names and selectors. Because features are extracted from the multiple scripts used in the advertisement generation process, the method is well targeted and overcomes the high data noise of static feature matching.
The method specifically comprises the following steps:
Step 1: Analyzing the advertisement to obtain its dynamic characteristics and locating the advertisement in the webpage
The dynamic characteristics of webpage advertisements are obtained by analyzing the dynamic advertisements in a webpage. This requires analyzing the complete generation process of a dynamic advertisement and comparing it with ordinary webpage elements in order to locate specific webpage advertisements.
Step 2: Tracing the call path of the webpage advertisement
After step 1 has located the specific position of the webpage advertisement in the page, the advertisement's calls are traced in full. The trace includes information such as the function call path of the advertisement generation process and the script code actually executed; from the function call path, all JavaScript files on the advertisement generation path can be obtained. The analysis of an advertisement is therefore not limited to feature analysis of the elements on a single page.
Step 3: Extracting features from the obtained script files
Using the method in step 2, the JavaScript files related to advertisements are obtained. A large number of advertisement-related and advertisement-unrelated script files are collected and features are extracted from them, yielding static features related to advertisement generation, including HTML DOM element features, JavaScript script features, CSS features and the like. A classifier is trained on these static features to generate an advertisement code detection model.
Step 4: Feeding back the result
Test data is run through the advertisement code detection model, the results are compared with the actual advertisements, the threshold used in classification is adjusted, and actual webpage advertisements are then detected and identified.
Compared with the prior art, the invention has the following advantages:
Dynamic detection overcomes the interference of obfuscated code and reduces the miss rate for pages; acquiring multiple JavaScript files on the advertisement call path allows features to be extracted at multiple levels, which makes the method more comprehensive and reliable.
Drawings
FIG. 1 is a flow chart for obtaining an advertisement starting location;
FIG. 2 is a flow chart for obtaining an advertisement function call path.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Step 1: Analyzing the advertisement to obtain its dynamic characteristics and locating the advertisement in the webpage
The dynamic characteristics of webpage advertisements are obtained by analyzing the dynamic advertisements in a webpage. This requires analyzing the complete generation process of a dynamic advertisement and comparing it with ordinary webpage elements in order to locate specific webpage advertisement code.
Regarding the dynamic nature of advertisements, we focus on dynamic advertisements rather than mere picture links. For the analysis of dynamic advertisements, we mainly consider advertisements distributed through an advertisement alliance. Such advertisements generally only require the publisher of a webpage to insert a specified tag when coding the page, which is used to position and place the advertisement; by identifying the tag, the advertisement alliance dynamically generates the advertisement content to display according to information such as the cookies of the user browsing the page. The advertisement is displayed by inserting JavaScript script files into the page; these scripts are usually executed automatically and, through a series of function calls, finally display different advertisements on the page.
From this generation process it follows that generating a dynamic advertisement calls a third-party script library, namely that of the advertisement alliance. The dynamic characteristic of advertisement generation used here is therefore the automatic execution of JavaScript code from a third-party script library.
As shown in FIG. 1, webpage elements are instrumented according to this characteristic: each automatically executed function is recorded, it is determined whether the function comes from a third-party script library, and executions that satisfy the condition are marked.
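The marking rule of step 1 can be sketched in Python as follows. This is only a model of the decision logic, not the patent's instrumentation code: the `ExecutionRecord` structure, the function and file names, and the host comparison used to decide "third-party" are all assumptions made for illustration (a real implementation would more likely compare registrable domains).

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class ExecutionRecord:
    function_name: str      # name of the function that ran
    script_url: str         # script file the function was defined in
    user_triggered: bool    # True if a user event (e.g. a click) started it
    marked: bool = False    # set when the record looks like an ad entry point

def is_third_party(script_url: str, page_url: str) -> bool:
    """Assumption: a script is third-party when its host differs from the page host."""
    return urlparse(script_url).hostname != urlparse(page_url).hostname

def mark_candidates(records, page_url):
    """Mark functions that ran automatically and come from a third-party script."""
    for record in records:
        if not record.user_triggered and is_third_party(record.script_url, page_url):
            record.marked = True
    return [r for r in records if r.marked]

# Hypothetical execution log collected while loading a page.
records = [
    ExecutionRecord("initMenu", "https://news.example.com/js/site.js", False),
    ExecutionRecord("renderSlot", "https://adunion.example.net/loader.js", False),
    ExecutionRecord("onShare", "https://widgets.example.org/share.js", True),
]
candidates = mark_candidates(records, "https://news.example.com/index.html")
```

Only `renderSlot` is marked: `initMenu` is first-party, and `onShare` ran because of a user action rather than automatically.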
Step 2: Tracing the call path of the webpage advertisement
After step 1 has located the specific position of the webpage advertisement in the page, the advertisement's calls are traced in full.
Generating a webpage advertisement involves many function calls; the call path goes through multiple jumps and calls multiple script files. The script files on the call path of a webpage advertisement are the data set from which features are later extracted, so the call path of each possible advertisement position determined in step 1 must be traced and recorded.
In JavaScript, a running function cannot obtain which functions it calls, but it can obtain the function that called it. Based on this, as shown in FIG. 2, the caller of each function is obtained while the webpage runs, and it is determined whether the caller is marked. If the caller is marked, path information is added to the function itself and the path is traced and saved; if not, no action is taken.
With dynamic instrumentation, a custom attribute containing call information can be added to a function in JavaScript, and the specific set of script files on a path can be obtained by outputting this call information. The characteristics of all script files on the advertisement generation path can then be analyzed.
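The caller-based propagation described above can be modeled with a small Python simulation. It is a sketch under stated assumptions, not the patent's JavaScript instrumentation: the `CallPathTracer` class, its `enter`/`leave` hooks, and the function and file names are hypothetical, and an explicit call stack stands in for the browser's ability to report a function's caller.

```python
class CallPathTracer:
    """Toy model of FIG. 2: propagate a mark from caller to callee and record the path."""

    def __init__(self, marked_entry_points):
        # function name -> call path from a marked entry point to that function
        self.paths = {name: [name] for name in marked_entry_points}
        self.stack = []       # functions currently executing
        self.script_of = {}   # function name -> script file that defines it

    def enter(self, name, script_file):
        """Hook invoked by the instrumentation when a function starts executing."""
        self.script_of[name] = script_file
        caller = self.stack[-1] if self.stack else None
        # Only the direct caller is visible, mirroring the constraint in the text.
        if caller in self.paths and name not in self.paths:
            self.paths[name] = self.paths[caller] + [name]
        self.stack.append(name)

    def leave(self):
        """Hook invoked by the instrumentation when the current function returns."""
        self.stack.pop()

    def scripts_on_ad_path(self):
        """Script files that define at least one function on a marked path."""
        return sorted({self.script_of[n] for n in self.paths if n in self.script_of})

# Hypothetical run: renderSlot was marked in step 1, initMenu was not.
tracer = CallPathTracer({"renderSlot"})
tracer.enter("initMenu", "site.js")
tracer.leave()
tracer.enter("renderSlot", "loader.js")
tracer.enter("fetchCreative", "creative.js")
tracer.enter("drawBanner", "banner.js")
tracer.leave()
tracer.leave()
tracer.leave()
```

After this run, `tracer.paths["drawBanner"]` holds the chain from `renderSlot` through `fetchCreative`, and `tracer.scripts_on_ad_path()` returns the three script files on that chain while leaving out `site.js`.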
Step 3: Extracting features from the obtained script files
The script files obtained in step 2 are used as the data set for extracting static features of advertisements. The dynamic instrumentation method of step 2 is applied to a large number of websites containing advertisements to obtain advertisement-related JavaScript files, and the files are saved by batch processing to serve as the data set for static feature extraction. JavaScript files unrelated to advertisements are saved in the same way and serve as the comparison data set.
The static features include the depth of the function call path in the advertisement generation process, the number of string concatenations in a script file, the number of times code is executed dynamically, the types and usage counts of native functions, the types and usage counts of JavaScript event handler functions, and the like. Features are extracted from the script files accordingly, and a script file that satisfies certain features is judged to be an advertisement.
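A rough Python sketch of how such static features might be counted from a script's source text is given below. The regular expressions, the particular native functions and event handlers listed, and the `extract_static_features` name are illustrative assumptions; the patent does not specify how the counts are computed, and a production extractor would more plausibly parse the script than pattern-match it.

```python
import re

# Native functions and event handlers to count; the selection is illustrative.
NATIVE_FUNCTIONS = ("document.write", "createElement", "appendChild", "setTimeout")
EVENT_HANDLERS = ("onclick", "onload", "onmouseover", "addEventListener")

def extract_static_features(js_source: str, call_path_depth: int) -> dict:
    """Count simple textual indicators in one JavaScript file."""
    features = {
        "call_path_depth": call_path_depth,
        # A '+' adjacent to a string literal approximates string concatenation.
        "string_concats": len(re.findall(r"""["']\s*\+|\+\s*["']""", js_source)),
        # eval(...) and new Function(...) execute code built at run time.
        "dynamic_exec": len(re.findall(r"\beval\s*\(|\bnew\s+Function\s*\(", js_source)),
    }
    for name in NATIVE_FUNCTIONS + EVENT_HANDLERS:
        features[name] = len(re.findall(re.escape(name) + r"\b", js_source))
    return features

# Hypothetical script fragment found on an advertisement call path of depth 3.
sample = """
var url = 'http://' + host + '/ad?id=' + slotId;
eval(payload);
document.write('<iframe src="' + url + '"></iframe>');
el.onclick = function () { track(); };
"""
features = extract_static_features(sample, call_path_depth=3)
```

Each script file becomes one feature dictionary; the dictionaries for advertisement-related and unrelated files can then be turned into vectors for whichever classifier is used.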
Step 4: Feeding back the result
The result of running the classifier on test data is compared with the actual advertisements, the threshold used in classification is adjusted, and webpage advertisements are detected and identified.
The experimental data is divided into two classes. The first is training data, used to classify the features from step 3 into advertisement-related and advertisement-unrelated features. The second is test data, used to test the trained model and evaluate its accuracy.
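The threshold adjustment in step 4 can be illustrated with the Python sketch below. The patent does not say which criterion guides the adjustment, so choosing the threshold that maximizes the F1 score on test data is an assumption made here, as are the function names and the sample scores and labels.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scripts scoring >= threshold are flagged as ads."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def choose_threshold(scores, labels, candidates):
    """Assumed criterion: pick the candidate threshold with the best F1 score."""
    def f1(threshold):
        p, r = precision_recall(scores, labels, threshold)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(candidates, key=f1)

# Hypothetical classifier scores for six test scripts and their true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
best = choose_threshold(scores, labels, candidates=[0.2, 0.5, 0.7])
```

With these sample values the candidate 0.7 wins: it flags only the two highest-scoring scripts, both real advertisements, at the cost of missing one. A lower threshold would recover that miss but add false positives, which is the trade-off the comparison against actual advertisements is meant to expose.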
The invention is not limited to the above examples; all technical solutions formed by equivalent substitution fall within the scope of the invention as claimed.

Claims (5)

1. A webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction, characterized in that a page containing a webpage advertisement is first executed to dynamically acquire function call path information of the advertisement generation path in the webpage, and all JavaScript files required to generate the webpage advertisement are acquired through the call path information; on this basis, all of the script files are statically analyzed, and webpage advertisements are identified through a feature classification model;
the method comprises the following steps:
1) analyzing the advertisement to obtain its dynamic characteristics and locating the advertisement in the webpage: the dynamic characteristics of webpage advertisements are obtained by analyzing the dynamic advertisements in the webpage, which requires analyzing the complete generation process of a dynamic advertisement and comparing it with ordinary webpage elements in order to locate specific webpage advertisement code;
2) after locating the specific position of the webpage advertisement in the page using the method of 1), tracing the complete call path of the webpage advertisement, including the function call path of the advertisement generation process and the script code information actually executed, and obtaining all JavaScript files on the advertisement generation path from the function call path;
3) extracting features from the obtained script files: features are extracted from the multiple JavaScript files obtained in step 2) to obtain static features of advertisement generation, including HTML DOM element features, JavaScript script features and CSS features, and a classifier is trained on the static features to generate an advertisement code detection model;
4) feeding back the result: test data is run through the advertisement code detection model, the results are compared with the actual advertisements, the threshold used in classification is adjusted, and actual webpage advertisements are detected and identified.
2. The webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction according to claim 1, wherein in step 1), analyzing the advertisement to obtain its dynamic characteristics and locating it in the webpage specifically comprises:
obtaining the dynamic characteristics of webpage advertisements by analyzing the dynamic advertisements in the webpage, which requires analyzing the complete generation process of a dynamic advertisement and comparing it with ordinary webpage elements in order to locate specific webpage advertisement code;
for the analysis of dynamic advertisements, considering advertisements distributed through an advertisement alliance, wherein generating such an advertisement calls a third-party script library, namely that of the advertisement alliance, and the dynamic characteristic of advertisement generation is the automatic execution of JavaScript code from the third-party script library.
3. The webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction according to claim 1, wherein in step 2), tracing the call path of the webpage advertisement comprises:
after the specific position of the webpage advertisement in the page is located, tracing the complete call path of the webpage advertisement; generating the webpage advertisement involves multiple function calls, and the call path comprises multiple jumps, that is, multiple script files are called; these script files are the data set required for subsequently acquiring advertisement features, so the call path of each determined possible advertisement position must be traced and recorded;
acquiring the caller of each function while the webpage runs and determining whether the caller is marked; if the caller is marked, adding path information to the function itself and tracing and saving the path; if the caller is not marked, performing no operation;
using dynamic instrumentation to add a custom attribute containing call information to a function in JavaScript, and acquiring the specific set of script files on a path by outputting the call information, thereby analyzing the characteristics of all script files on the advertisement generation path.
4. The webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction according to claim 1, wherein in step 3), extracting features from the obtained script files comprises:
applying the dynamic instrumentation method to a large number of websites containing advertisements to obtain JavaScript script files related to advertisements; saving the corresponding files by batch processing to serve as the data set for extracting static features of advertisements; saving JavaScript files unrelated to advertisements in the same way to serve as the comparison data set; and extracting static features from the advertisement-related JavaScript files, the static features comprising the depth of the function call path in the advertisement generation process, the number of string concatenations in a script file, the number of times code is executed dynamically, the types and usage counts of native functions, and the types and usage counts of JavaScript event handler functions; features of the script files are extracted accordingly, and whether a script file satisfying certain features is an advertisement is determined.
5. The webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction according to claim 1, wherein in step 4), feeding back the result comprises comparing the result of running the classifier on test data with the actual advertisements, adjusting the threshold used in classification, and detecting and identifying webpage advertisements; the training data in the experimental data is used to classify the features used into advertisement-related and advertisement-unrelated features; and the test data is used to test the trained model and evaluate its accuracy.
CN201710033452.3A 2017-01-13 2017-01-13 Webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction Active CN108614849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710033452.3A CN108614849B (en) 2017-01-13 2017-01-13 Webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710033452.3A CN108614849B (en) 2017-01-13 2017-01-13 Webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Publications (2)

Publication Number Publication Date
CN108614849A CN108614849A (en) 2018-10-02
CN108614849B true CN108614849B (en) 2022-11-18

Family

ID=63658174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710033452.3A Active CN108614849B (en) 2017-01-13 2017-01-13 Webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction

Country Status (1)

Country Link
CN (1) CN108614849B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110278212A (en) * 2019-06-26 2019-09-24 中国工商银行股份有限公司 Link detection method and device
CN111177614A (en) * 2019-11-22 2020-05-19 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Source tracking method and device for injecting content to third party of webpage
CN113870064A (en) * 2020-06-30 2021-12-31 北京奇虎科技有限公司 Advertisement evidence obtaining method and system of intelligent terminal, storage medium and computer equipment thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093148A (en) * 2012-12-28 2013-05-08 广东欧珀移动通信有限公司 Detection method, system and device of malicious advertisements
CN103177382B (en) * 2013-03-19 2015-11-11 武汉大学 Key propagation path in microblog and the detection method of Centroid
CN103905423B (en) * 2013-12-25 2017-08-11 武汉安天信息技术有限责任公司 A kind of harmful advertising member detection method and system analyzed based on dynamic behaviour
CN104766014B (en) * 2015-04-30 2017-12-01 安一恒通(北京)科技有限公司 Method and system for detecting malicious website

Also Published As

Publication number Publication date
CN108614849A (en) 2018-10-02

Similar Documents

Publication Publication Date Title
CN107908959B (en) Website information detection method and device, electronic equipment and storage medium
US8869025B2 (en) Method and system for identifying advertisement in web page
CN104766014A (en) Method and system used for detecting malicious website
US20160140626A1 (en) Web page advertisement configuration and optimization with visual editor and automatic website and webpage analysis
CN110399291A (en) User Page test method and relevant device based on image recognition
CN108614849B (en) Webpage advertisement detection method based on dynamic instrumentation and static multi-script page feature extraction
CN110020339B (en) Webpage data acquisition method and device based on non-buried point
US8458584B1 (en) Extraction and analysis of user-generated content
US11436133B2 (en) Comparable user interface object identifications
CN111783016A (en) Website classification method, device and equipment
US20220269736A1 (en) Identifying web elements based on user browsing activity and machine learning
Choudhary et al. A cross-browser web application testing tool
WO2012135690A1 (en) Systems and methods for invisible area detection and contextualization
TeBlunthuis et al. Dwelling on Wikipedia: Investigating time spent by global encyclopedia readers
CN109948080A (en) A kind of counteradvertising based on machine learning intercepts the application method of detection system
CN111125704B (en) Webpage Trojan horse recognition method and system
CN112667502A (en) Page testing method, device and medium
CN108171074B (en) Web tracking automatic detection method based on content association
JP2015001795A (en) Personality analysis system and personality analysis program
TW201931817A (en) Method and system for identifying users on internet
WO2023275887A1 (en) System and method for automated software testing
CN108256338A (en) A kind of Chrome rewritten based on extension API extends sensitive data tracking
CN114239689A (en) Multi-mode-based website type judgment method and device
CN109902004B (en) Method and device for testing application program link channel
Su et al. Research and design of website user behavior data acquisition based on customized event tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181002

Assignee: Jiangsu Yanan Information Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980047097

Denomination of invention: A web ad detection method based on dynamic instrumentation and static multi script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20231117

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181002

Assignee: Yanmi Technology (Yancheng) Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980047098

Denomination of invention: A web ad detection method based on dynamic instrumentation and static multi script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20231115

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181002

Assignee: Yancheng Nongfu Technology Co.,Ltd.

Assignor: NUPT INSTITUTE OF BIG DATA RESEARCH AT YANCHENG

Contract record no.: X2023980049126

Denomination of invention: A web ad detection method based on dynamic instrumentation and static multi script page feature extraction

Granted publication date: 20221118

License type: Common License

Record date: 20231203