CN113821754B - Method and device for identifying crawler of sensitive data interface

Info

Publication number: CN113821754B
Application number: CN202111100833.1A
Authority: CN
Inventors: 葛胜利; 魏国富; 夏玉明
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2024-08-16
Anticipated expiration: 2041-09-18
Also published as: CN113821754A

Abstract

The invention discloses a method and a device for identifying crawlers of sensitive data interfaces. The method comprises the following steps: acquiring a web access log of a website; identifying crawlers according to the web access log; judging the type of each crawler; according to the crawler type, using the parameters of the crawler to initiate requests to the website, acquiring the content of the request responses, aggregating the response content by request URL, and storing the text parts of the content returned by the website in groups by aggregation domain name; extracting feature data from the stored text, and extracting important link addresses and text keyword results from the text under each domain name; and identifying whether the text keyword results contain sensitive information, and outputting whether sensitive data is involved and the type of sensitive data involved. The invention has the following advantages: the crawler's motive is effectively identified, crawler behavior involving sensitive information is identified, and network information security is ensured.

Description

Method and device for identifying crawler of sensitive data interface

Technical Field

The invention relates to the field of crawler identification, in particular to a method and a device for identifying a sensitive data interface crawler.

Background

In the prior art, data on a network can be acquired by means such as a web crawler, that is, a program or script that automatically fetches website information according to certain rules. Most of the prior art aims at intercepting crawlers, but crawlers can bypass interception by changing their programs or simulating the behavior of real users, and the interfaces of websites in particular carry valuable sensitive information.

Crawler identification methods in the prior art fall roughly into two types. The first is an expert rule engine scheme: data is collected from business logs, events on one or more attributes are counted cumulatively, and events exceeding a threshold are intercepted by threshold-type rules; alternatively, requests are intercepted using blacklists built from attributes such as the IP address and the user agent. As technology advances, black-market operators use emulators and dedicated software to probe risk-control rule engines by trial and error and bypass them, so it is difficult to ensure website information security continuously, especially when a website interface exposes valuable sensitive information.

The second type identifies crawlers by anomaly detection on user behavior sequences: user access paths are constructed, the probability of each path is calculated using techniques such as probability models, and the access paths of abnormal users and related users are output using unsupervised learning. However, this scheme produces a large number of false alarms, which makes manual secondary analysis more laborious and complicated, and it is difficult to maintain the information security of website interfaces that carry sensitive information.

Chinese granted patent publication No. CN108712426B discloses a crawler identification method and system based on user behavior event tracking ("buried points"), wherein the method comprises the following steps: S1, a client receives an access request initiated by a user and asynchronously sends the access request to a back-end service system; S2, after receiving the access request, the back-end service system synchronizes an access log of the user, wherein the access log comprises access behavior data of the user; S3, the back-end service system aggregates the access behavior data through a rule engine; S4, the back-end service system judges from the aggregated access behavior data whether the user is a crawler, and if so, aggregates from the access log the crawler feature data that identifies the user as a crawler and asynchronously pushes the crawler feature data to a crawler list in the client through a message queue; S5, the client responds to the access request according to the crawler list. That invention identifies crawlers by synchronizing access logs and aggregating the access behavior data in the logs, which improves the crawler identification rate and accuracy. However, not all crawlers need to be intercepted, and that scheme only identifies crawlers; it cannot identify crawlers that access sensitive data.

In summary, most of the prior art is directed at intercepting crawlers and cannot identify crawlers that access sensitive data, so the security of network information is difficult to ensure.

Disclosure of Invention

The technical problem to be solved by the invention is that the prior art lacks a method for identifying crawlers on interfaces with sensitive information.

The invention solves the technical problems by the following technical means: a method of sensitive data interface crawler identification, the method comprising the steps of:

Step one: acquiring a web access log of a website;

Step two: identifying the crawler according to the web access log;

Step three: judging the type of the crawler;

Step four: according to the crawler type, using the parameters of the crawler to initiate requests to the website, acquiring the content of the request responses, aggregating the response content by request URL, and storing the text parts of the content returned by the website in groups by aggregation domain name;

Step five: extracting feature data from the stored text, and extracting important link addresses and text keyword results from the text under each domain name;

Step six: identifying whether the text keyword results contain sensitive information by using a sensitive data discovery technology, and outputting a corresponding result.

According to the invention, requests are initiated to the website using the parameters of the crawler according to the crawler type, the content of the request responses is acquired and aggregated by request URL, and the text parts of the content returned by the website are stored in groups by aggregation domain name. A sensitive data discovery technology is then used to identify whether the text keyword results contain sensitive information, and whether sensitive data is involved and the type of sensitive data involved are output. In this way the crawler's motive is effectively identified, crawler behavior involving sensitive information is identified, and the security of network information is ensured.

Further, the web access log includes the time of the request, the IP address, user identity information, sessionid, requestbody, responbody, method, and status, the user identity information including an account number, a cookie, and a uuid.

Further, in the second step, a user behavior sequence-based anomaly detection method or a rule engine method is adopted to identify the crawler.

Further, the crawler types in step three include crawlers that switch pages by modifying parameters in the URL, and crawlers that switch pages by modifying the POST request content to transmit different parameters to the same URL.

Still further, the fourth step includes:

Step 401: according to the crawler type, using the parameters of the crawler to initiate a Request to the website, wherein the Request contains additional headers information, so that the crawler Request is simulated;

Step 402: performing page analysis on the website accessed by the crawler to acquire the information returned by the website page, thereby obtaining the content of the request response;

Step 403: aggregating the content of the request responses by request URL: if the crawler address belongs to the type that switches pages by modifying parameters in the URL, the non-parameter part of the crawler address is retained and used as the aggregation domain name; if the crawler address belongs to the type that switches pages by modifying the POST request content to transmit different parameters, the domain name of the crawler address is used directly as the aggregation domain name; and storing the text parts returned by the website in groups by aggregation domain name.
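Steps 401 and 403 can be illustrated with a minimal Python sketch. The function names, the two type labels and the tuple layout are hypothetical and chosen only for illustration; the sketch builds a request without sending it and assumes the crawler's headers and body are available from the access log.

```python
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.request import Request

# Hypothetical labels for the two crawler types described in step three.
URL_PARAM = "url_param"   # pages switched by changing query parameters
POST_BODY = "post_body"   # pages switched by changing the POST body


def build_replay_request(url, crawler_type, headers, body=None):
    """Step 401: rebuild the crawler's request, including its extra headers."""
    if crawler_type == POST_BODY:
        data = (body or "").encode("utf-8")
        return Request(url, data=data, headers=headers, method="POST")
    return Request(url, headers=headers, method="GET")


def aggregation_key(url, crawler_type):
    """Step 403: derive the aggregation domain name for a crawler address."""
    parts = urlsplit(url)
    if crawler_type == URL_PARAM:
        # Keep the non-parameter part: scheme, host and path, no query string.
        return f"{parts.scheme}://{parts.netloc}{parts.path}"
    # POST-type crawler: use the domain name of the address directly.
    return parts.netloc


def group_texts(responses):
    """Group response texts by aggregation key.

    `responses` is an iterable of (url, crawler_type, text) tuples.
    """
    groups = defaultdict(list)
    for url, crawler_type, text in responses:
        groups[aggregation_key(url, crawler_type)].append(text)
    return dict(groups)
```

With this grouping, all pages reached by varying the query string of one interface end up under a single key, so the later keyword extraction sees them as one body of text.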

Further, the fifth step includes:

by the formula

calculating word frequency, extracting words whose word frequency exceeds a threshold in the stored text as feature data, and extracting important link addresses and text keyword results from the text under each domain name according to the word frequency; where n_{i,j} denotes the number of occurrences of the word t_i in text j; the two sums in the formula represent, respectively, the total count of all words in text j and the sum of all word frequencies in the corpus; and nt_i denotes the total frequency of occurrence of the word t_i in the corpus.

Further, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identity card number.

Further, the sensitive data interface crawler identification method further comprises a step seven:

For the crawlers of sensitive data interfaces identified in step six, counting the number of aggregated URL requests, the access rate, the number of requesting IP addresses, the number of URLs accessed per IP, the number of request User-Agents, the number of responses returning status 200, the number of access Referers, the access Method types and the URL sensitive data types, and outputting the crawler risk level and attack type according to the statistics.

The invention also provides a sensitive data interface crawler identification device, which comprises:

The log acquisition module is used for acquiring a web access log of a website;

The crawler identification module is used for identifying the crawler according to the web access log;

The judging module is used for judging the type of the crawler;

The crawler request simulation module is used for initiating requests to the website using the parameters of the crawler according to the crawler type, acquiring the content of the request responses, aggregating the response content by request URL, and storing the text parts of the content returned by the website in groups by aggregation domain name;

The feature extraction module is used for extracting feature data from the stored text, and extracting important link addresses and text keyword results from the text under each domain name;

The sensitive judging module is used for identifying whether the text keyword results contain sensitive information by using a sensitive data discovery technology, and outputting a corresponding result.

Further, the crawler identification module adopts an anomaly detection method or a rule engine method based on a user behavior sequence to identify the crawler.

Further, the crawler types in the judging module include crawlers that switch pages by modifying parameters in the URL, and crawlers that switch pages by modifying the POST request content to transmit different parameters to the same URL.

Still further, the crawler request simulation module includes:

The Request simulation unit is used for initiating a Request to the website using the parameters of the crawler according to the crawler type, wherein the Request contains additional headers information, so that the crawler Request is simulated;

The request response unit is used for performing page analysis on the website accessed by the crawler to acquire the information returned by the website page, thereby obtaining the content of the request response;

The grouping storage unit is used for aggregating the content of the request responses by request URL: if the crawler address belongs to the type that switches pages by modifying parameters in the URL, the non-parameter part of the crawler address is retained and used as the aggregation domain name; if the crawler address belongs to the type that switches pages by modifying the POST request content to transmit different parameters, the domain name of the crawler address is used directly as the aggregation domain name; and storing the text parts returned by the website in groups by aggregation domain name.

Further, the feature extraction module is further configured to:

by the formula

Further, the sensitive data interface crawler identification device further comprises a statistics module, which is used for counting, for the crawlers of sensitive data interfaces identified by the sensitive judging module, the number of aggregated URL requests, the access rate, the number of requesting IP addresses, the number of URLs accessed per IP, the number of request User-Agents, the number of responses returning status 200, the number of access Referers, the access Method types and the URL sensitive data types, and for outputting the crawler risk level and attack type according to the statistics.

The invention has the following advantages: requests are initiated to the website using the parameters of the crawler according to the crawler type, the content of the request responses is acquired and aggregated by request URL, and the text parts of the content returned by the website are stored in groups by aggregation domain name; a sensitive data discovery technology is used to identify whether the text keyword results contain sensitive information, and whether sensitive data is involved and the type of sensitive data involved are output. The crawler's motive is thereby effectively identified, crawler behavior involving sensitive information is identified, and the security of network information is ensured.

Drawings

Fig. 1 is a flowchart of a method for identifying a crawler of a sensitive data interface according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

A method of sensitive data interface crawler identification, the method comprising the steps of:

S1: acquiring a web access log of a website; the web access log includes the time of the request, the IP address, user identity information, sessionid, requestbody, responbody, method, and status, the user identity information including an account number, a cookie, and a uuid.
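As an illustration of the log fields listed in S1, the following Python sketch parses one log line into a typed record. It assumes, purely for the example, that the log is stored as one JSON object per line with the identity fields nested under a hypothetical `user` key; the patent does not specify a log format.

```python
import json
from dataclasses import dataclass


@dataclass
class AccessLogEntry:
    time: str
    ip: str
    account: str      # user identity information
    cookie: str       # user identity information
    uuid: str         # user identity information
    sessionid: str
    requestbody: str
    responbody: str   # spelled as in the field list of S1
    method: str
    status: int


def parse_log_line(line: str) -> AccessLogEntry:
    """Parse one JSON-encoded log line (assumed format) into a record."""
    raw = json.loads(line)
    identity = raw.get("user", {})  # hypothetical nesting of identity fields
    return AccessLogEntry(
        time=raw["time"],
        ip=raw["ip"],
        account=identity.get("account", ""),
        cookie=identity.get("cookie", ""),
        uuid=identity.get("uuid", ""),
        sessionid=raw.get("sessionid", ""),
        requestbody=raw.get("requestbody", ""),
        responbody=raw.get("responbody", ""),
        method=raw["method"],
        status=int(raw["status"]),
    )
```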

S2: identifying the crawler according to the web access log. In this embodiment, crawler identification uses the prior art and does not itself involve the identification of sensitive information; any mature technology capable of crawler identification can be used. Specifically, an anomaly detection method based on user behavior sequences or a rule engine method is used to identify the crawler, for example the scheme disclosed in the patent document cited in the background art.
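Since S2 leaves the identification technique open, the following Python sketch shows only the simplest threshold-type rule of the kind mentioned in the background: an IP is flagged when it sends more requests than a limit within one time window. The function name, window length and threshold are assumptions for illustration, not values from the patent.

```python
from collections import Counter


def flag_crawler_ips(events, window_seconds=60, max_requests=100):
    """Flag IPs whose request count in any fixed window exceeds a threshold.

    `events` is an iterable of (ip, unix_timestamp) pairs taken from the log.
    """
    counts = Counter()
    for ip, ts in events:
        # Bucket each request into a fixed-size time window.
        counts[(ip, int(ts // window_seconds))] += 1
    return {ip for (ip, _), n in counts.items() if n > max_requests}
```

A production rule engine would combine several such attributes (user agent, session, URL), but the counting-and-threshold structure stays the same.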

S3: judging the type of the crawler. The crawler types include crawlers that switch pages by modifying parameters in the URL, and crawlers that switch pages by modifying the POST request content to transmit different parameters to the same URL. For example, for the address http://www.xxx.com.cn/service/api/getMoreInfo.action?project_id=ab922d56d&ID=b7fb6e72ddcdb4&startID=4c8dsd4147148, the values of the URL parameters project_id, ID and startID are modified to switch the accessed page; by continuously trying different values of these parameters, the crawler obtains the information of different pages, which may contain sensitive information. For the address http://www.xxx.com.cn/login/, the POST parameters in the request, such as {'accoun_name': '123456789'}, are modified and the results returned for the failed accoun_name are collected. For instance, a user's account number is a mobile phone number, but the login password may differ between software products; by modifying the POST parameter (in this specific example, the password) and trying repeatedly, the crawler obtains the user's information on different software.
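The two crawler types in S3 can be told apart from the logged requests of one crawler, as in the following Python sketch. The labels and the rule (at least two distinct query strings, or at least two distinct POST bodies, for the same address) are assumptions made for illustration; the patent does not state a concrete decision rule.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def classify_crawler(requests):
    """Classify a crawler from its requests.

    `requests` is an iterable of (method, url, body) tuples for one crawler.
    Returns "post_body", "url_param", or None when neither pattern is seen.
    """
    queries = defaultdict(set)  # base address -> distinct query strings
    bodies = defaultdict(set)   # base address -> distinct POST bodies
    for method, url, body in requests:
        parts = urlsplit(url)
        base = f"{parts.scheme}://{parts.netloc}{parts.path}"
        if method.upper() == "POST":
            bodies[base].add(body)
        elif parts.query:
            queries[base].add(parts.query)
    if any(len(v) > 1 for v in bodies.values()):
        return "post_body"
    if any(len(v) > 1 for v in queries.values()):
        return "url_param"
    return None
```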

S4: according to different crawler types, using parameters of the crawler to initiate a request to a website, acquiring content of a request response, collecting the content of the request response according to request url, and storing text parts of the content returned by the website according to a collection domain name group; the specific process is as follows:

Step 402: performing page analysis on the website accessed by the crawler to obtain the information returned by the website page, whose types include HTML, JSON strings, binary data (such as pictures and video) and the like, thereby obtaining the content of the request response;
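The handling of the response types named in step 402 can be sketched in Python as follows. This is a simplified illustration using only the standard library: it keeps the visible text of HTML (including any script or style text, which a real parser would filter), keeps only string values from JSON, and discards binary content. The function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def _json_strings(value):
    """Yield every string value found in a decoded JSON structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from _json_strings(v)
    elif isinstance(value, list):
        for v in value:
            yield from _json_strings(v)


def extract_text(content_type, body):
    """Return the text part of a response body, or "" for binary data."""
    ctype = content_type.split(";")[0].strip().lower()
    if ctype == "text/html":
        parser = _TextCollector()
        parser.feed(body.decode("utf-8", errors="replace"))
        parser.close()
        return " ".join(parser.chunks)
    if ctype == "application/json":
        return " ".join(_json_strings(json.loads(body.decode("utf-8"))))
    return ""  # pictures, video and other binary data carry no text part
```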

S5: extracting feature data of the stored text, and correspondingly extracting important link addresses and text keyword results from the text under each domain name;

The weights produced by the conventional TF-IDF method are typically very small, close to 0, and its accuracy is not very high. In essence, IDF is a weighting that attempts to suppress noise: it simply assumes that words with a low document frequency are more important and words with a high document frequency are less useful. This is not entirely correct for most text. Because of its simple structure, IDF cannot make the extracted keywords effectively reflect the importance of words and the distribution of feature words, so it cannot adjust the weights well. The drawback is especially serious in corpora of similar texts, where the keywords of similar texts are often masked. For example, if corpus D contains many education articles and text j is also an education article, the IDF values of education-related words will be small, so the recall of keyword extraction is low and the extraction result is inaccurate. On this basis, the invention provides a weighting algorithm calculated from word inverse frequency, namely

By the formula

This weighting method reduces the influence of texts of the same type in the corpus on word weights and expresses more accurately the importance of a word in the document under examination. The result of the formula also solves the problem of the final weight being too small; in practical application, 6 significant digits are retained so that the calculation result is more accurate.
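The formula itself is not reproduced in this text, so the following Python sketch is not the patented formula. It uses one plausible stand-in built from the quantities the description names: the count of a word in text j, the total number of words in text j, the total frequency of the word in the corpus, and the sum of all word frequencies in the corpus. The assumed weight is (count in text / words in text) × log(corpus total / corpus frequency of the word), rounded to 6 significant digits as the description suggests. All function names are hypothetical.

```python
import math
from collections import Counter


def word_weights(corpus, j):
    """Weight each word of text j against word frequencies in the corpus.

    `corpus` is a list of token lists; text j is corpus[j].
    The weighting is an assumed stand-in, not the formula of the patent.
    """
    corpus_counts = Counter()
    for tokens in corpus:
        corpus_counts.update(tokens)            # total frequency per word
    corpus_total = sum(corpus_counts.values())  # all word frequencies
    text_counts = Counter(corpus[j])            # occurrences in text j
    text_total = sum(text_counts.values())      # all words in text j
    weights = {}
    for word, n_ij in text_counts.items():
        tf = n_ij / text_total
        inverse = math.log(corpus_total / corpus_counts[word])
        weights[word] = float(f"{tf * inverse:.6g}")  # 6 significant digits
    return weights


def keywords(corpus, j, threshold):
    """Return words of text j whose weight exceeds the threshold."""
    weights = word_weights(corpus, j)
    ranked = sorted(weights, key=weights.get, reverse=True)
    return [w for w in ranked if weights[w] > threshold]
```

Because the weight depends on word frequencies across the whole corpus rather than on the number of documents containing the word, a word that is frequent inside text j is not zeroed out merely because similar texts also use it.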

S6: identifying whether the text keyword results contain sensitive information by using a sensitive data discovery technology, and outputting a corresponding result. The sensitive information comprises mobile phone numbers, names, addresses, license plate numbers and identity card numbers.
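The patent does not specify which sensitive data discovery technology is used. As a minimal illustration, the Python sketch below applies regular expressions to two of the listed categories. The patterns are assumptions: an 11-digit mainland China mobile number starting with 1, and an 18-character identity card number ending in a digit or X. Names, addresses and license plates would need dictionaries or dedicated models and are not covered.

```python
import re

# Assumed patterns for illustration only; real systems validate more strictly.
PATTERNS = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def find_sensitive(text):
    """Return {sensitive_type: [matches]} for every type found in the text."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found
```

An empty result means no sensitive data of the covered types was found; a non-empty result gives both the yes/no answer and the sensitive data types involved.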

As a further improvement of the present invention, the method for identifying a crawler of a sensitive data interface further includes S7:

For the crawlers of sensitive data interfaces identified in step S6, counting the number of aggregated URL requests, the access rate, the number of requesting IP addresses, the number of URLs accessed per IP, the number of request User-Agents, the number of responses returning status 200, the number of access Referers, the access Method types and the URL sensitive data types, and outputting the crawler risk level and attack type according to the statistics. The crawler risk level can be judged from one or more indexes selected according to actual needs. For example, three indexes may be selected, namely the number of aggregated URL requests, the access rate and the number of 200 responses: a crawler whose number of aggregated URL requests exceeds a first preset value, whose access rate exceeds a second preset value and whose number of 200 responses exceeds a third preset value is classified as high risk.
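The three-index example above can be written as a small Python sketch. The preset values and the label used for crawlers that are not high risk are placeholders chosen for illustration; the patent leaves both open.

```python
from dataclasses import dataclass


@dataclass
class CrawlerStats:
    url_request_count: int   # number of aggregated URL requests
    access_rate: float       # requests per second (assumed unit)
    status_200_count: int    # number of responses returning 200


def risk_level(stats, first_preset=1000, second_preset=10.0, third_preset=500):
    """Return "high" when all three indexes exceed their preset values."""
    if (
        stats.url_request_count > first_preset
        and stats.access_rate > second_preset
        and stats.status_200_count > third_preset
    ):
        return "high"
    return "unclassified"  # placeholder label; other levels are not defined
```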

According to the above technical scheme, requests are initiated to the website using the parameters of the crawler according to the crawler type, the content of the request responses is acquired and aggregated by request URL, and the text parts of the content returned by the website are stored in groups by aggregation domain name; a sensitive data discovery technology is used to identify whether the text keyword results contain sensitive information, and whether sensitive data is involved and the type of sensitive data involved are output. The crawler's motive is thereby effectively identified, crawler behavior involving sensitive information is identified, and network information security is ensured.

Example 2

Based on embodiment 1, embodiment 2 of the present invention further provides a sensitive data interface crawler recognition device, where the device includes:

The log acquisition module is used for acquiring a web access log of a website;

The judging module is used for judging the type of the crawler;

Specifically, the web access log includes the time of the request, the IP address, user identity information, sessionid, requestbody, responbody, method, and status, and the user identity information includes an account number, a cookie, and a uuid.

Specifically, the crawler identification module adopts an anomaly detection method or a rule engine method based on a user behavior sequence to identify the crawler.

Specifically, the crawler types in the judging module include crawlers that switch pages by modifying parameters in the URL, and crawlers that switch pages by modifying the POST request content to transmit different parameters to the same URL.

More specifically, the crawler request simulation module includes:

Specifically, the feature extraction module is further configured to:

by the formula

Specifically, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identity card number.

Specifically, the sensitive data interface crawler identification device further comprises a statistics module, which is used for counting, for the crawlers of sensitive data interfaces identified by the sensitive judging module, the number of aggregated URL requests, the access rate, the number of requesting IP addresses, the number of URLs accessed per IP, the number of request User-Agents, the number of responses returning status 200, the number of access Referers, the access Method types and the URL sensitive data types, and for outputting the crawler risk level and attack type according to the statistics.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for identifying a crawler of a sensitive data interface, the method comprising the steps of:

Step one: acquiring a web access log of a website;

Step two: identifying the crawler according to the web access log;

Step three: judging the type of the crawler; the crawler types in step three include crawlers that switch pages by modifying parameters in the URL, and crawlers that switch pages by modifying the POST request content to transmit different parameters to the same URL;

Step four: according to the crawler type, using the parameters of the crawler to initiate requests to the website, acquiring the content of the request responses, aggregating the response content by request URL, and storing the text parts of the content returned by the website in groups by aggregation domain name;

Step 403: aggregating the content of the request responses by request URL: if the crawler address belongs to the type that switches pages by modifying parameters in the URL, the non-parameter part of the crawler address is retained and used as the aggregation domain name; if the crawler address belongs to the type that switches pages by modifying the POST request content to transmit different parameters, the domain name of the crawler address is used directly as the aggregation domain name; storing the plurality of text parts returned by the website in groups by aggregation domain name;

by the formula

calculating word frequency, extracting words whose word frequency exceeds a threshold in the stored text as feature data, and extracting important link addresses and text keyword results from the text under each domain name according to the word frequency; where n_{i,j} denotes the number of occurrences of the word t_i in text j; the two sums in the formula represent, respectively, the total count of all words in text j and the sum of all word frequencies in the corpus; and nt_i denotes the total frequency of occurrence of the word t_i in the corpus;

2. The method of claim 1, wherein the web access log includes a time of the request, an IP address, user identity information, sessionid, requestbody, responbody, method, status, and the user identity information includes an account number, a cookie, and a uuid.

3. The method for identifying a crawler on a sensitive data interface according to claim 1, wherein in the second step, the crawler is identified by using an anomaly detection method or a rule engine method based on a user behavior sequence.

4. The method for identifying a crawler on a sensitive data interface according to claim 1, wherein the sensitive information includes a mobile phone number, a name, an address, a license plate number, and an identification card number.

5. The method for identifying a crawler of a sensitive data interface of claim 1, further comprising the step of:

6. A sensitive data interface crawler identification device, the device comprising:

The log acquisition module is used for acquiring a web access log of a website;

The judging module is used for judging the type of the crawler; the crawler types in the judging module include crawlers that switch pages by modifying parameters in the URL, and crawlers that switch pages by modifying the POST request content to transmit different parameters to the same URL;

The crawler request simulation module is used for initiating requests to the website using the parameters of the crawler according to the crawler type, acquiring the content of the request responses, aggregating the response content by request URL, and storing the text parts of the content returned by the website in groups by aggregation domain name; the crawler request simulation module is further configured to:

by the formula

7. The sensitive data interface crawler identification device of claim 6, wherein the web access log includes the time of the request, the IP address, user identity information, sessionid, requestbody, responbody, method, and status, and the user identity information includes an account number, a cookie, and a uuid.