CN113821754B - Method and device for identifying crawler of sensitive data interface - Google Patents

Method and device for identifying crawler of sensitive data interface

Info

Publication number
CN113821754B
Authority
CN
China
Prior art keywords
crawler
request
text
website
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111100833.1A
Other languages
Chinese (zh)
Other versions
CN113821754A (en)
Inventor
葛胜利
魏国富
夏玉明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN202111100833.1A
Publication of CN113821754A
Application granted
Publication of CN113821754B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for identifying crawlers of a sensitive data interface. The method comprises the following steps: acquiring a web access log of a website; identifying crawlers according to the web access log; judging the type of each crawler; according to the crawler type, initiating a request to the website using the crawler's parameters, acquiring the content of the request response, aggregating the response content by request url, and storing the text parts of the content returned by the website in groups by aggregation domain name; extracting feature data from the stored text, with important link addresses and text keyword results extracted from the text under each domain name; and identifying whether the text keyword results contain sensitive information, and outputting whether sensitive data is involved and the type of sensitive data involved. The invention has the advantages that the crawler's motive is effectively identified, crawler behavior involving sensitive information is identified, and network information security is ensured.

Description

Method and device for identifying crawler of sensitive data interface
Technical Field
The invention relates to the field of crawler identification, in particular to a method and a device for identifying a sensitive data interface crawler.
Background
In the prior art, data on a network can be acquired by means such as web crawlers, a web crawler being a program or script that automatically captures website information according to certain rules. Most of the prior art aims at intercepting crawlers, but crawlers can bypass interception by changing their programs or simulating the behavior of real users; this matters particularly when the interfaces of a website expose valuable sensitive information.
Crawler identification methods in the prior art fall roughly into two types. The first is the expert rule engine scheme: data is collected from business logs, counts of events on one or more attributes are accumulated, and events exceeding a threshold are intercepted by threshold rules; alternatively, interception is performed through blacklists built from attributes such as the IP and the user agent. As technology has improved, black-market operators use simulators and special software to probe the risk-control rule engine by trial and error and bypass it, so the information security of a website is difficult to ensure continuously, and it is even harder to maintain when the website's interfaces carry valuable sensitive information.
The second type is crawler identification based on anomaly detection over user behavior sequences: user access behavior paths are constructed, the probability of each behavior path is calculated using techniques such as probability models, and the access paths of abnormal users and related users are output using unsupervised learning methods. However, this scheme produces a large number of false alarms, which makes manual secondary analysis heavier and more complicated, and it is difficult to maintain the information security of website interfaces carrying sensitive information.
Chinese granted patent publication No. CN108712426B discloses a crawler identification method and system based on user behavior event tracking (buried points), wherein the method comprises the following steps: S1, a client receives an access request initiated by a user and asynchronously sends the access request to a back-end service system; S2, after receiving the access request, the back-end service system synchronizes an access log of the user, wherein the access log comprises access behavior data of the user; S3, the back-end service system aggregates the access behavior data through a rule engine; S4, the back-end service system judges from the aggregated access behavior data whether the user is a crawler, and if so, aggregates from the access log the crawler feature data that identifies the user as a crawler, and then asynchronously pushes the crawler feature data to a crawler list in the client through a message queue; S5, the client responds to the access request according to the crawler list. That invention identifies crawlers by synchronizing access logs and aggregating the access behavior data in the logs, which improves the crawler identification rate and accuracy. However, not all crawlers need to be intercepted, and this scheme only identifies crawlers; it cannot identify crawlers that access sensitive data.
In summary, most of the prior art is directed at intercepting crawlers and cannot identify crawlers that access sensitive data, so the security of network information is difficult to ensure.
Disclosure of Invention
The technical problem to be solved by the invention is that the prior art lacks a method for identifying crawlers on interfaces with sensitive information.
The invention solves the above technical problem by the following technical means: a method of sensitive data interface crawler identification, the method comprising the following steps:
Step one: acquiring a web access log of a website;
Step two: identifying the crawler according to the web access log;
Step three: judging the type of the crawler;
Step four: according to the crawler type, initiating a request to the website using the parameters of the crawler, acquiring the content of the request response, aggregating the content of the request response by request url, and storing the text parts of the content returned by the website in groups by aggregation domain name;
Step five: extracting feature data from the stored text, with important link addresses and text keyword results extracted from the text under each domain name;
Step six: identifying whether the text keyword results contain sensitive information by using a sensitive data discovery technology, and outputting a corresponding result.
According to the invention, a request is initiated to the website using the parameters of the crawler according to the crawler type, the content of the request response is obtained and aggregated by request url, the text parts of the content returned by the website are stored in groups by aggregation domain name, and a sensitive data discovery technology is used to identify whether the text keyword results contain sensitive information; whether sensitive data is involved and the type of sensitive data involved are output. The crawler's motive is thereby effectively identified, crawler behavior involving sensitive information is identified, and the security of network information is ensured.
Further, the web access log includes the time of the request, the IP address, user identity information, sessionid, requestbody, responbody, method, and status, the user identity information including an account number, a cookie, and a uuid.
Further, in the second step, a user behavior sequence-based anomaly detection method or a rule engine method is adopted to identify the crawler.
Further, the crawler types in step three include crawlers that switch pages by modifying parameters in the url, and crawlers that switch pages by modifying the POST request content to send different parameters to the same url.
Still further, the fourth step includes:
Step 401: according to the crawler type, initiating a Request to the website using the parameters of the crawler, wherein the Request contains additional headers information, thereby simulating the crawler's request;
Step 402: performing page analysis on the website accessed by the crawler to acquire the information returned by the website page, obtaining the content of the request response;
Step 403: aggregating the content of the request responses by request url: for a crawler address that switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained and used as the aggregation domain name; for a crawler address that switches pages by modifying the POST request content to send different parameters, the domain name of the crawler address is used directly as the aggregation domain name; the text parts returned by the website are then stored in groups by aggregation domain name.
Further, the fifth step includes:
by the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

calculating the word frequency, extracting words whose word frequency exceeds a threshold value in the stored text as feature data, and extracting important link addresses and text keyword results from the text under each domain name according to the word frequency; where $n_{i,j}$ denotes the number of occurrences of the word $t_i$ in text $j$, $\sum_{k} n_{k,j}$ represents the total number of words in text $j$, $\sum_{k} nt_{k}$ represents the sum of all word frequencies in the corpus, and $nt_i$ represents the total frequency of occurrence of word $t_i$ in the corpus.
Further, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identity card number.
Further, the sensitive data interface crawler identification method further comprises step seven:
for the crawlers of sensitive data interfaces identified in step six, counting the number of url aggregation requests, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of responses returning 200, the number of access Referers, the access Method types, and the url sensitive data types, and outputting the risk level and the attack type of the crawler according to the statistics.
The invention also provides a sensitive data interface crawler identification device, which comprises:
a log acquisition module, used for acquiring a web access log of a website;
a crawler identification module, used for identifying the crawler according to the web access log;
a judging module, used for judging the type of the crawler;
a crawler request simulation module, used for initiating a request to the website using the parameters of the crawler according to the crawler type, acquiring the content of the request response, aggregating the content of the request response by request url, and storing the text parts of the content returned by the website in groups by aggregation domain name;
a feature extraction module, used for extracting feature data from the stored text, with important link addresses and text keyword results extracted from the text under each domain name;
and a sensitivity judging module, used for identifying whether the text keyword results contain sensitive information by using a sensitive data discovery technology and outputting a corresponding result.
Further, the web access log includes the time of the request, the IP address, user identity information, sessionid, requestbody, responbody, method, and status, the user identity information including an account number, a cookie, and a uuid.
Further, the crawler identification module adopts an anomaly detection method or a rule engine method based on a user behavior sequence to identify the crawler.
Further, the crawler types in the judging module include crawlers that switch pages by modifying parameters in the url, and crawlers that switch pages by modifying the POST request content to send different parameters to the same url.
Still further, the crawler request simulation module includes:
a Request simulation unit, used for initiating a Request to the website using the parameters of the crawler according to the crawler type, wherein the Request contains additional headers information, thereby simulating the crawler's request;
a request response unit, used for performing page analysis on the website accessed by the crawler to acquire the information returned by the website page, obtaining the content of the request response;
a grouping storage unit, used for aggregating the content of the request responses by request url: for a crawler address that switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained and used as the aggregation domain name; for a crawler address that switches pages by modifying the POST request content to send different parameters, the domain name of the crawler address is used directly as the aggregation domain name; the text parts returned by the website are then stored in groups by aggregation domain name.
Further, the feature extraction module is further configured to:
by the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

calculating the word frequency, extracting words whose word frequency exceeds a threshold value in the stored text as feature data, and extracting important link addresses and text keyword results from the text under each domain name according to the word frequency; where $n_{i,j}$ denotes the number of occurrences of the word $t_i$ in text $j$, $\sum_{k} n_{k,j}$ represents the total number of words in text $j$, $\sum_{k} nt_{k}$ represents the sum of all word frequencies in the corpus, and $nt_i$ represents the total frequency of occurrence of word $t_i$ in the corpus.
Further, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identity card number.
Further, the sensitive data interface crawler identification device further comprises a statistics module, used for counting, for the crawlers of sensitive data interfaces identified by the sensitivity judging module, the number of url aggregation requests, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of responses returning 200, the number of access Referers, the access Method types, and the url sensitive data types, and outputting the crawler risk level and the attack type according to the statistics.
The invention has the following advantages: a request is initiated to the website using the parameters of the crawler according to the crawler type, the content of the request response is obtained and aggregated by request url, the text parts of the content returned by the website are stored in groups by aggregation domain name, and a sensitive data discovery technology is used to identify whether the text keyword results contain sensitive information; whether sensitive data is involved and the type of sensitive data involved are output. The crawler's motive is thereby effectively identified, crawler behavior involving sensitive information is identified, and the security of network information is ensured.
Drawings
Fig. 1 is a flowchart of a method for identifying a crawler of a sensitive data interface according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
A method of sensitive data interface crawler identification, the method comprising the steps of:
S1: acquiring a web access log of a website; the web access log includes the time of the request, the IP address, user identity information, sessionid, requestbody, responbody, method, and status, the user identity information including an account number, a cookie, and a uuid.
S2: identifying the crawler according to the web access log. In this embodiment, crawler identification uses the prior art; the identification process does not involve identifying sensitive information, and any mature technique capable of identifying crawlers can be used. Specifically, an anomaly detection method based on user behavior sequences or a rule engine method is used to identify the crawler, for example the scheme disclosed in the patent literature listed in the Background.
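As an illustration of the rule engine option only, the following is a minimal sketch of a threshold rule that flags IP addresses whose request count in the log exceeds a limit. The dictionary key `ip`, the function name, and the threshold value are assumptions made for this example; the patent does not prescribe a particular rule.

```python
from collections import Counter


def flag_by_request_count(log_entries, threshold):
    """Return the set of IPs whose request count exceeds `threshold`.

    `log_entries` is an iterable of dicts with an "ip" key (hypothetical
    field name standing in for the IP address recorded in the access log).
    """
    counts = Counter(entry["ip"] for entry in log_entries)
    return {ip for ip, n in counts.items() if n > threshold}


if __name__ == "__main__":
    entries = [{"ip": "1.1.1.1"}] * 5 + [{"ip": "2.2.2.2"}]
    print(flag_by_request_count(entries, threshold=3))
```

A production rule engine would typically combine several attributes and time windows; this sketch shows only the single-attribute counting idea.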
S3: judging the type of the crawler. The crawler types include crawlers that switch pages by modifying parameters in the url, and crawlers that switch pages by modifying the POST request content to send different parameters to the same url. For example, if the address is http://www.xxx.com.cn/service/api/getMoreInfo.action?project_id=ab922d56d&ID=b7fb6e72ddcdb4&startID=4c8dsd4147148, the crawler modifies the values of the url parameters project_id, ID and startID to switch the accessed address and thereby reach different pages; by continually trying new values of these parameters it obtains the information of different pages, which may contain sensitive information. If the address is http://www.xxx.com.cn/login/, the crawler modifies the POST parameter in the request, such as {'accoun_name': '123456789'}, and collects the result returned for each failed accoun_name. For example, a user's account number may be a mobile phone number while the login passwords for different software differ; by modifying the POST parameter (in this specific example, the POST parameter is a password) and trying repeatedly, the crawler obtains the user's information on different software.
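The two crawler types described in S3 can be told apart from the requests themselves. The sketch below is one possible heuristic, not the patent's stated procedure: it groups a crawler's requests by host and path, then checks whether the query string or the POST body is what varies. The labels `url_param`, `post_body`, and `unknown`, and the `(method, url, body)` tuple layout, are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def classify_crawler(requests):
    """Classify one crawler from its requests.

    `requests` is a list of (method, url, body) tuples. Returns
    "url_param" if some endpoint was hit with varying query strings,
    "post_body" if some endpoint received varying POST bodies,
    otherwise "unknown".
    """
    queries = defaultdict(set)  # endpoint -> distinct query strings
    bodies = defaultdict(set)   # endpoint -> distinct POST bodies
    for method, url, body in requests:
        parts = urlsplit(url)
        endpoint = (parts.netloc, parts.path)
        if parts.query:
            queries[endpoint].add(parts.query)
        if method.upper() == "POST" and body:
            bodies[endpoint].add(body)
    if any(len(v) > 1 for v in queries.values()):
        return "url_param"
    if any(len(v) > 1 for v in bodies.values()):
        return "post_body"
    return "unknown"


if __name__ == "__main__":
    base = "http://www.xxx.com.cn/service/api/getMoreInfo.action"
    print(classify_crawler([
        ("GET", base + "?project_id=a&ID=1", ""),
        ("GET", base + "?project_id=b&ID=2", ""),
    ]))
```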
S4: according to the crawler type, initiating a request to the website using the parameters of the crawler, acquiring the content of the request response, aggregating the content of the request response by request url, and storing the text parts of the content returned by the website in groups by aggregation domain name. The specific process is as follows:
Step 401: according to the crawler type, initiating a Request to the website using the parameters of the crawler, wherein the Request contains additional headers information, thereby simulating the crawler's request;
Step 402: performing page analysis on the website accessed by the crawler to obtain the information returned by the website page, whose types include HTML, json strings, binary data (such as pictures and video) and the like, obtaining the content of the request response;
Step 403: aggregating the content of the request responses by request url: for a crawler address that switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained and used as the aggregation domain name; for a crawler address that switches pages by modifying the POST request content to send different parameters, the domain name of the crawler address is used directly as the aggregation domain name; the text parts returned by the website are then stored in groups by aggregation domain name.
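The grouping rule in step 403 can be sketched as follows, assuming the crawler type has already been labelled `url_param` or `post_body` (hypothetical labels) and that response texts are available as `(url, text)` pairs. Only the standard library's `urllib.parse.urlsplit` is used; the function names are illustrative.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def aggregation_key(url, crawler_type):
    """Compute the "aggregation domain name" for one request url."""
    parts = urlsplit(url)
    if crawler_type == "url_param":
        # Keep the non-parameter part of the address: scheme, host, path.
        return f"{parts.scheme}://{parts.netloc}{parts.path}"
    # POST-body crawlers: use the domain name of the address directly.
    return parts.netloc


def group_responses(responses, crawler_type):
    """Group response texts by aggregation key.

    `responses` is an iterable of (url, text) pairs.
    """
    groups = defaultdict(list)
    for url, text in responses:
        groups[aggregation_key(url, crawler_type)].append(text)
    return dict(groups)


if __name__ == "__main__":
    url = "http://www.xxx.com.cn/service/api/getMoreInfo.action?project_id=ab922d56d"
    print(aggregation_key(url, "url_param"))
    print(aggregation_key("http://www.xxx.com.cn/login/", "post_body"))
```

Under this sketch, requests that differ only in their query parameters land in the same group, which is what allows the later keyword extraction to run over everything a crawler pulled from one interface.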
S5: extracting feature data of the stored text, and correspondingly extracting important link addresses and text keyword results from the text under each domain name;
The weights produced by the conventional TF-IDF method are typically very small, close to 0, and its accuracy is not very high. In essence, IDF is a weighting that attempts to suppress noise: it simply assumes that words with a low text frequency are more important and words with a high text frequency are less useful. This is not entirely correct for most text. Because of its simple structure, IDF cannot make the extracted keywords effectively reflect the importance of words or the distribution of feature words, so it cannot adjust the weights well. The drawback is especially pronounced in corpora of similar texts, where the keywords of similar texts are often masked. For example, if corpus D contains many education articles and text j is itself an education article, the IDF values of education-related words will be small, so the recall of keyword extraction is low and the extraction result is inaccurate. On this basis, the invention provides a weighting algorithm calculated using inverse word frequency, namely:
By the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

calculating the word frequency, extracting words whose word frequency exceeds a threshold value in the stored text as feature data, and extracting important link addresses and text keyword results from the text under each domain name according to the word frequency; where $n_{i,j}$ denotes the number of occurrences of the word $t_i$ in text $j$, $\sum_{k} n_{k,j}$ represents the total number of words in text $j$, $\sum_{k} nt_{k}$ represents the sum of all word frequencies in the corpus, and $nt_i$ represents the total frequency of occurrence of word $t_i$ in the corpus.
This weighting method reduces the influence of texts of the same type in the corpus on word weights and expresses the importance of a word in the document under examination more accurately. The result of the formula also avoids the problem of the final weight being too small; in practical application, 6 significant digits are retained so that the calculation result is more accurate.
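A minimal sketch of this weighting follows. It assumes the weight takes the commonly used TF-IWF form, that is, the word's share of text j multiplied by the logarithm of the ratio between the total word frequency of the corpus and the word's own corpus frequency; the exact formula in the granted patent should be checked against the original publication. Tokenization is assumed to have been done already, and the function names and threshold are illustrative.

```python
import math
from collections import Counter


def tf_iwf(corpus):
    """Compute an assumed TF-IWF weight for every word of every text.

    `corpus` is a list of token lists. Returns one dict per text mapping
    word -> weight, rounded to 6 significant digits.
    """
    corpus_counts = Counter()
    for tokens in corpus:
        corpus_counts.update(tokens)
    total = sum(corpus_counts.values())  # sum of all word frequencies

    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        n_j = len(tokens)  # total number of words in this text
        text_weights = {}
        for word, n_ij in counts.items():
            value = (n_ij / n_j) * math.log(total / corpus_counts[word])
            text_weights[word] = float(f"{value:.6g}")
        weights.append(text_weights)
    return weights


def keywords(text_weights, threshold):
    """Return the words whose weight exceeds `threshold`, sorted by name."""
    return sorted(w for w, v in text_weights.items() if v > threshold)


if __name__ == "__main__":
    docs = [["phone", "phone", "name"], ["name", "address"]]
    for w in tf_iwf(docs):
        print(w)
```

Because the denominator inside the logarithm is a word's frequency across the whole corpus rather than the number of documents containing it, a word that is frequent in one text but appears in many similar texts is penalized less abruptly than under IDF.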
S6: identifying whether the text keyword results contain sensitive information by using a sensitive data discovery technology, and outputting a corresponding result. The sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identity card number.
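The patent does not specify which sensitive data discovery technology is used. As one hedged illustration, pattern matching can cover the two most regular categories, mobile phone numbers and identity card numbers; names, addresses, and license plates generally need dictionaries or models and are left out here. Both regular expressions are simplified assumptions (an 11-digit number starting with 1 and a second digit of 3 to 9; 17 digits followed by a digit or X) and do not validate checksums.

```python
import re

# Simplified, illustrative patterns; not a complete or validated rule set.
PATTERNS = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def find_sensitive(text):
    """Return {sensitive_type: [matches]} for every pattern found in text."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


if __name__ == "__main__":
    print(find_sensitive("contact: 13812345678"))
```

An empty result means no pattern matched; a non-empty result gives both the "sensitive or not" answer and the sensitive data type that S6 outputs.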
As a further improvement of the present invention, the method for identifying a crawler of a sensitive data interface further includes S7:
for the crawlers of sensitive data interfaces identified in S6, counting the number of url aggregation requests, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of responses returning 200, the number of access Referers, the access Method types, and the url sensitive data types, and outputting a crawler risk level and an attack type according to the statistics. The specific crawler risk level can be judged by selecting one or more indexes according to actual needs. For example, three indexes may be selected, namely the number of url aggregation requests, the access rate, and the number of responses returning 200, and a crawler whose number of url aggregation requests exceeds a first preset value, whose access rate exceeds a second preset value, and whose number of 200 responses exceeds a third preset value is classified as high risk.
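The three-index example above can be sketched as a simple conjunction of threshold checks. The class name, field names, default preset values, and the `unclassified` label are all hypothetical; the patent leaves the preset values and any lower risk levels to the implementer.

```python
from dataclasses import dataclass


@dataclass
class CrawlerStats:
    url_group_requests: int   # number of url aggregation requests
    access_rate: float        # requests per unit time
    status_200_count: int     # number of responses returning 200


def risk_level(stats, first_preset=1000, second_preset=10.0, third_preset=500):
    """Return "high" only when all three indexes exceed their presets."""
    if (
        stats.url_group_requests > first_preset
        and stats.access_rate > second_preset
        and stats.status_200_count > third_preset
    ):
        return "high"
    # The source defines only the high-risk condition for this example.
    return "unclassified"


if __name__ == "__main__":
    print(risk_level(CrawlerStats(2000, 20.0, 600)))
```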
In the above technical scheme, a request is initiated to the website using the parameters of the crawler according to the crawler type, the content of the request response is obtained and aggregated by request url, the text parts of the content returned by the website are stored in groups by aggregation domain name, and a sensitive data discovery technology is used to identify whether the text keyword results contain sensitive information; whether sensitive data is involved and the type of sensitive data involved are output. The crawler's motive is thereby effectively identified, crawler behavior involving sensitive information is identified, and network information security is ensured.
Example 2
Based on embodiment 1, embodiment 2 of the present invention further provides a sensitive data interface crawler recognition device, where the device includes:
a log acquisition module, used for acquiring a web access log of a website;
a crawler identification module, used for identifying the crawler according to the web access log;
a judging module, used for judging the type of the crawler;
a crawler request simulation module, used for initiating a request to the website using the parameters of the crawler according to the crawler type, acquiring the content of the request response, aggregating the content of the request response by request url, and storing the text parts of the content returned by the website in groups by aggregation domain name;
a feature extraction module, used for extracting feature data from the stored text, with important link addresses and text keyword results extracted from the text under each domain name;
and a sensitivity judging module, used for identifying whether the text keyword results contain sensitive information by using a sensitive data discovery technology and outputting a corresponding result.
Specifically, the web access log includes the time of the request, the IP address, user identity information, sessionid, requestbody, responbody, method, and status, and the user identity information includes an account number, a cookie, and a uuid.
Specifically, the crawler identification module uses an anomaly detection method based on user behavior sequences or a rule engine method to identify the crawler.
Specifically, the crawler types in the judging module include crawlers that switch pages by modifying parameters in the url, and crawlers that switch pages by modifying the POST request content to send different parameters to the same url.
More specifically, the crawler request simulation module includes:
a Request simulation unit, used for initiating a Request to the website using the parameters of the crawler according to the crawler type, wherein the Request contains additional headers information, thereby simulating the crawler's request;
a request response unit, used for performing page analysis on the website accessed by the crawler to acquire the information returned by the website page, obtaining the content of the request response;
a grouping storage unit, used for aggregating the content of the request responses by request url: for a crawler address that switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained and used as the aggregation domain name; for a crawler address that switches pages by modifying the POST request content to send different parameters, the domain name of the crawler address is used directly as the aggregation domain name; the text parts returned by the website are then stored in groups by aggregation domain name.
Specifically, the feature extraction module is further configured to:
by the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

calculating the word frequency, extracting words whose word frequency exceeds a threshold value in the stored text as feature data, and extracting important link addresses and text keyword results from the text under each domain name according to the word frequency; where $n_{i,j}$ denotes the number of occurrences of the word $t_i$ in text $j$, $\sum_{k} n_{k,j}$ represents the total number of words in text $j$, $\sum_{k} nt_{k}$ represents the sum of all word frequencies in the corpus, and $nt_i$ represents the total frequency of occurrence of word $t_i$ in the corpus.
Specifically, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identity card number.
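Two of the listed categories have fixed formats that a pattern match can detect. The sketch below is a hedged illustration only: it assumes mainland China formats (11-digit mobile numbers beginning with 1, 18-character identity card numbers), the pattern and function names are hypothetical, and names, addresses, and license plates are not covered because they need dictionaries or more elaborate rules.

```python
import re

# Assumed formats; digits on either side are excluded so that a phone-like
# run inside a longer number is not reported.
PATTERNS = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def find_sensitive(text: str) -> dict:
    """Return the matches of each pattern that occurs in the text."""
    found = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found
```

The sketch checks format only; a production detector would typically also validate the identity card checksum digit.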
Specifically, the sensitive data interface crawler identification device further comprises a statistics module which, for each crawler that the sensitive judging module has identified as accessing a sensitive data interface, counts the number of aggregated URL requests, the access rate, the number of requesting IP addresses, the number of URLs accessed per IP, the number of request User-Agents, the number of responses with status 200, the number of access Referers, the access method types, and the types of sensitive data at the URL, and outputs the crawler's risk level and attack type according to the statistical results.
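Part of the statistics module's counting can be illustrated with a minimal Python sketch. The `Record` fields, the `summarize` and `risk_level` names, and the rate thresholds are all hypothetical: the source names the statistics but gives neither thresholds nor a scoring rule, and attack-type classification is omitted here.

```python
from dataclasses import dataclass


@dataclass
class Record:
    timestamp: float   # seconds since epoch
    ip: str
    user_agent: str
    status: int


def summarize(records):
    """Count a subset of the statistics for one aggregated URL."""
    if not records:
        return {"requests": 0, "ips": 0, "user_agents": 0,
                "status_200": 0, "rate": 0.0}
    times = [r.timestamp for r in records]
    duration = max(times) - min(times)
    return {
        "requests": len(records),
        "ips": len({r.ip for r in records}),
        "user_agents": len({r.user_agent for r in records}),
        "status_200": sum(r.status == 200 for r in records),
        # Requests per second; a zero-length window is treated as one second.
        "rate": len(records) / duration if duration > 0 else float(len(records)),
    }


def risk_level(stats, high_rate=10.0, medium_rate=1.0):
    """Map statistics to a coarse risk level using assumed thresholds."""
    if stats["status_200"] == 0:
        return "low"      # nothing was successfully returned to the crawler
    if stats["rate"] >= high_rate:
        return "high"
    if stats["rate"] >= medium_rate:
        return "medium"
    return "low"
```

In practice the thresholds would be tuned per site, and the remaining statistics (Referers, method types, sensitive data types) would feed the same decision.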
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for identifying a crawler of a sensitive data interface, the method comprising the steps of:
Step one: acquiring a web access log of a website;
Step two: identifying a crawler according to the web access log;
Step three: judging the type of the crawler, wherein the crawler types include switching pages by modifying parameters in the URL, and switching pages by sending different parameters to the same URL in modified POST request content;
Step four: according to the crawler type, initiating requests to the website using the crawler's parameters, acquiring the content of the request responses, aggregating that content by request URL, and storing the text parts returned by the website in groups by aggregation domain name, wherein step four comprises:
Step 401: according to the crawler type, initiating a request to the website using the crawler's parameters, wherein the request carries additional header information, thereby simulating the crawler's request;
Step 402: parsing the pages of the website accessed by the crawler to obtain the information returned by each page, thereby obtaining the content of the request response;
Step 403: aggregating the response content by request URL, wherein for a crawler address that switches pages by modifying parameters in the URL, the non-parameter part of the address is kept as the aggregation domain name, and for a crawler address that switches pages by sending different parameters in modified POST request content, the domain name of the address is used directly as the aggregation domain name; and storing the text parts returned by the website in groups by aggregation domain name;
Step five: extracting feature data from the stored text, wherein important link addresses and text keyword results are extracted from the text under each domain name;
calculating word frequency by the formula [formula not reproduced], extracting the words in the stored text whose word frequency exceeds a threshold as feature data, and, according to the word frequency, extracting the important link addresses and text keyword results from the text under each domain name; in the formula, $n_{i,j}$ denotes the number of occurrences of word $t_i$ in text $j$, $\sum_k n_{k,j}$ denotes the total number of word occurrences in text $j$, $\sum_i nt_i$ denotes the sum of the frequencies of all words in the corpus, and $nt_i$ denotes the total frequency of word $t_i$ in the corpus;
Step six: identifying, by means of a sensitive data discovery technique, whether the text keyword results contain sensitive information, and outputting the corresponding result.
2. The method for identifying a crawler of a sensitive data interface of claim 1, wherein the web access log includes the request time, IP address, user identity information, sessionid, requestbody, responbody, method, and status, and the user identity information includes an account number, a cookie, and a uuid.
3. The method for identifying a crawler of a sensitive data interface of claim 1, wherein in step two the crawler is identified by an anomaly detection method or by a rule engine method based on a user behavior sequence.
4. The method for identifying a crawler of a sensitive data interface of claim 1, wherein the sensitive information includes a mobile phone number, a name, an address, a license plate number, and an identity card number.
5. The method for identifying a crawler of a sensitive data interface of claim 1, further comprising the step of:
for the crawlers identified in step six as accessing a sensitive data interface, counting the number of aggregated URL requests, the access rate, the number of requesting IP addresses, the number of URLs accessed per IP, the number of request User-Agents, the number of responses with status 200, the number of access Referers, the access method types, and the types of sensitive data at the URL, and outputting the risk level and attack type of the crawler according to the statistical results.
6. A sensitive data interface crawler identification device, the device comprising:
a log acquisition module, used for acquiring a web access log of a website;
a crawler identification module, used for identifying a crawler according to the web access log;
a judging module, used for judging the type of the crawler, wherein the crawler types include switching pages by modifying parameters in the URL, and switching pages by sending different parameters to the same URL in modified POST request content;
a crawler request simulation module, used for, according to the crawler type, initiating requests to the website using the crawler's parameters, acquiring the content of the request responses, aggregating that content by request URL, and storing the text parts returned by the website in groups by aggregation domain name; the crawler request simulation module is further configured to perform:
Step 401: according to the crawler type, initiating a request to the website using the crawler's parameters, wherein the request carries additional header information, thereby simulating the crawler's request;
Step 402: parsing the pages of the website accessed by the crawler to obtain the information returned by each page, thereby obtaining the content of the request response;
Step 403: aggregating the response content by request URL, wherein for a crawler address that switches pages by modifying parameters in the URL, the non-parameter part of the address is kept as the aggregation domain name, and for a crawler address that switches pages by sending different parameters in modified POST request content, the domain name of the address is used directly as the aggregation domain name; and storing the text parts returned by the website in groups by aggregation domain name;
a feature extraction module, used for extracting feature data from the stored text, wherein important link addresses and text keyword results are extracted from the text under each domain name;
calculating word frequency by the formula [formula not reproduced], extracting the words in the stored text whose word frequency exceeds a threshold as feature data, and, according to the word frequency, extracting the important link addresses and text keyword results from the text under each domain name; in the formula, $n_{i,j}$ denotes the number of occurrences of word $t_i$ in text $j$, $\sum_k n_{k,j}$ denotes the total number of word occurrences in text $j$, $\sum_i nt_i$ denotes the sum of the frequencies of all words in the corpus, and $nt_i$ denotes the total frequency of word $t_i$ in the corpus;
and a sensitive judging module, used for identifying, by means of a sensitive data discovery technique, whether the text keyword results contain sensitive information, and outputting the corresponding result.
7. The sensitive data interface crawler identification device of claim 6, wherein the web access log includes the request time, IP address, user identity information, sessionid, requestbody, responbody, method, and status, and the user identity information includes an account number, a cookie, and a uuid.
CN202111100833.1A 2021-09-18 2021-09-18 Method and device for identifying crawler of sensitive data interface Active CN113821754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111100833.1A CN113821754B (en) 2021-09-18 2021-09-18 Method and device for identifying crawler of sensitive data interface

Publications (2)

Publication Number Publication Date
CN113821754A (en) 2021-12-21
CN113821754B (en) 2024-08-16

Family

ID=78922493

Country Status (1)

Country Link
CN (1) CN113821754B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150542B (en) * 2023-04-21 2023-07-14 河北网新数字技术股份有限公司 Dynamic page generation method and device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108712426A (en) * 2018-05-21 2018-10-26 携程旅游网络技术(上海)有限公司 Crawler identification method and system based on user behavior event tracking
CN112287198A (en) * 2020-10-28 2021-01-29 上海云信留客信息科技有限公司 Spam short message detection method based on crawler technology

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766014B (en) * 2015-04-30 2017-12-01 安一恒通(北京)科技有限公司 Method and system for detecting malicious website
CN106250513B (en) * 2016-08-02 2021-04-23 西南石油大学 Event modeling-based event personalized classification method and system
CN106411578B (en) * 2016-09-12 2019-07-12 国网山东省电力公司电力科学研究院 Web publishing system and method adapted to the power industry
CN106776768B (en) * 2016-11-23 2018-02-02 福建六壬网安股份有限公司 URL crawling method and system for a distributed crawler engine
CN109600272B (en) * 2017-09-30 2022-03-18 北京国双科技有限公司 Crawler detection method and device
CN108256104B (en) * 2018-02-05 2020-05-26 恒安嘉新(北京)科技股份公司 Comprehensive classification method of internet websites based on multidimensional characteristics
CN109308330A (en) * 2018-07-24 2019-02-05 国家计算机网络与信息安全管理中心 Internet-based method for extracting, analyzing, and classifying leaked enterprise information
CN110351248B (en) * 2019-06-14 2022-03-18 北京纵横无双科技有限公司 Safety protection method and device based on intelligent analysis and intelligent current limiting
CN112929390B (en) * 2021-03-12 2023-03-24 厦门帝恩思科技股份有限公司 Network intelligent monitoring method based on multi-strategy fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Improved TF-IDF keyword extraction method; Wang Xiaolin et al.; Computer Science and Application; Vol. 3 (No. 1); 64-68 *

Also Published As

Publication number Publication date
CN113821754A (en) 2021-12-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant