CN109274632A - Website identification method and device - Google Patents

Info

Publication number
CN109274632A
Authority
CN
China
Prior art keywords
website
url
abnormal
request
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710565741.8A
Other languages
Chinese (zh)
Other versions
CN109274632B (en)
Inventor
付为民
郝建忠
郑浩彬
陈涛
邬学农
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Guangdong Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Guangdong Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Guangdong Co Ltd
Priority to CN201710565741.8A
Publication of CN109274632A
Application granted
Publication of CN109274632B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/101 Access control lists [ACL]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present invention provide a website identification method and device. The method includes: receiving a Uniform Resource Locator (URL) request from a user accessing a website; searching a white list for the URL corresponding to the URL request and, if it is found there, connecting to that URL; searching a blacklist for the URL and, if it is found there, generating high-risk prompt information; and, if the URL is found in neither the white list nor the blacklist, calculating each feature weight value of the URL according to preset rules and identifying from those feature weight values whether the URL is an abnormal website. The embodiments identify abnormal websites quickly, accurately, and efficiently, significantly reduce the system's false-positive rate, and improve the user experience.

Description

Website identification method and device
Technical Field
The invention relates to the technical field of computers, in particular to a website identification method and device.
Background
With the rapid development of the mobile internet, users have moved from browsing websites only on PCs to browsing on mobile terminal devices. On June 22, 2016, the China Internet Network Information Center (CNNIC) issued in Beijing its 37th Statistical Report on Internet Development in China, which shows that by December 2015 the number of internet users in China had reached 688 million, of whom 620 million were mobile phone users, a share as high as 90.12%.
Meanwhile, security problems on mobile clients have become increasingly prominent. In 2015, the number of active networked smartphones in China reached 1.13 billion, and counterfeit websites, phishing websites, and malicious programs increased accordingly, threatening users' online security and causing financial loss or leakage of personal information.
Currently, operators mainly rely on a blacklist to intercept, on the network side, the Uniform Resource Locators (URLs) requested by mobile clients.
The blacklist method works as follows: a blacklist is configured on the Wireless Application Protocol (WAP) gateway. When a mobile HTTP request reaches the WAP gateway, the gateway parses the URL from the HyperText Transfer Protocol (HTTP) header and matches it against the blacklist entries in sequence. If the URL hits the blacklist, the WAP gateway no longer proxies the request; it returns a 403 response directly to the mobile terminal, and access to the page is denied.
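The gateway behavior described above can be illustrated with a minimal sketch. The blacklist entries, the `normalize` and `handle_request` names, and the "host/path" matching key are all assumptions made for illustration; a hash-set lookup is used here in place of the sequential matching the text describes.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries in "host/path" form; a real gateway would
# load these from its own configuration.
BLACKLIST = {"phish.example.com/login", "bad.example.net/"}

def normalize(url: str) -> str:
    """Reduce a URL to 'host/path' with a lower-cased host for matching."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host + path

def handle_request(url: str) -> str:
    """Return '403 Forbidden' on a blacklist hit, otherwise 'PROXY'."""
    if normalize(url) in BLACKLIST:
        return "403 Forbidden"  # the gateway answers directly, no proxying
    return "PROXY"              # the gateway forwards to the origin server
```

A call such as `handle_request("http://phish.example.com/login")` is answered by the gateway itself, which is why the origin server is never contacted for blacklisted URLs.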
The advantage of the blacklist method is that it is simple and direct: for any URL that hits the blacklist, the gateway neither proxies the request nor contacts the origin server, which reduces the load on the proxy gateway. The mobile terminal directly receives a 403 access-denied page (displayed by the browser or app).
The blacklist method has the following defects:
1. At present the blacklist is deployed on the WAP gateway, which requires the user to configure the 10.0.0.172 proxy on the terminal. If the proxy is not configured, the user's internet traffic does not pass through the WAP gateway and cannot be intercepted.
According to statistics, more than 90% of users do not configure the 10.0.0.172 proxy on their terminals, so the interception scheme has no effect for these users.
2. The blacklist interception behavior and the page it returns are too simple, so users may mistake the block for a network fault, which makes for a poor experience.
Users mostly reach illegal websites through pushed spam short messages, emails, advertisements, and the like, and do not know that the website they are visiting is illegal, harmful, or fake. Blacklist handling does prevent access, but the user receives an overly simple access-denied page and may conclude that the network or the website service is faulty, which lowers the user's opinion of the operator's network or the website. This handling also tends to make users retry repeatedly, or clients retry automatically. Moreover, as the number of counterfeit and phishing websites grows, the blacklist grows with it, and a larger blacklist means each match takes longer. This increases the processing load on the proxy gateway, reduces its processing efficiency, and thus slows users' internet access.
3. Traditional blacklist interception demands very high data accuracy. To ensure that normal websites are not intercepted by mistake, a great deal of manual, case-by-case review is needed, which is time-consuming and labor-intensive, and it is impossible to review one by one the billions of suspected websites across the internet. In addition, counterfeit and phishing websites change domain names frequently, closely resemble one another, and are short-lived, so the traditional blacklist approach no longer meets current needs.
4. Traditional blacklist interception cannot handle the large number of suspected websites flexibly: adding them to the blacklist for direct interception easily provokes complaints from the websites, while leaving them off the blacklist exposes customers to the risk of privacy leakage.
Therefore, how to improve on traditional blacklist interception and identify abnormal websites quickly, accurately, and efficiently has become an urgent technical problem.
Disclosure of Invention
Aiming at the defects in the prior art, the embodiment of the invention provides a website identification method and device.
In a first aspect, an embodiment of the present invention provides a website identification method, the method including:
receiving a Uniform Resource Locator (URL) request of a user for accessing a website, searching a white list for the URL corresponding to the URL request, and, if the URL is found in the white list, connecting to the URL;
searching a blacklist for the URL corresponding to the URL request, and, if the URL is found in the blacklist, generating high-risk prompt information;
if the URL corresponding to the URL request is found in neither the white list nor the blacklist, calculating each feature weight value of the URL according to a preset rule, and identifying from the feature weight values whether the URL is an abnormal website.
Optionally, the calculating, according to a preset rule, each feature weight value of the URL corresponding to the URL request specifically includes:
and calculating the feature weight values of four dimensions, namely the domain name similarity weight, the webpage content similarity weight, the user reporting amount weight and the secondary visit amount weight of the URL corresponding to the URL request according to a preset rule.
Optionally, the abnormal website specifically includes:
high probability abnormal websites, suspected abnormal websites and high probability normal websites.
Optionally, the method further includes:
if the URL corresponding to the URL request is an abnormal website, performing secondary identification on the URL corresponding to the URL request;
if the secondary identification result is the high-probability abnormal website, generating high-risk prompt information, tracking and identifying the high-probability abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the high-probability abnormal website to the blacklist;
if the secondary identification result is the high-probability normal website, directly connecting the high-probability normal website, and adding the high-probability normal website to the white list;
if the result of the secondary identification is the suspected abnormal website, generating general risk prompt information, tracking and identifying the suspected abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the suspected abnormal website to a grey list.
Optionally, the method further includes:
performing iterative computation and identification on the blacklist, the white list, and the grey list according to periodically updated user feedback information, crawled webpage content, updated webpage content feature similarity values, and website secondary visit volume;
if the identification result is the high-probability abnormal website, adding the high-probability abnormal website into the blacklist;
if the identification result is the high-probability normal website, adding the high-probability normal website into the white list;
and if the identification result is neither the high-probability abnormal website nor the high-probability normal website, keeping the website in the grey list to await identification in the next iterative computation.
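The periodic re-identification of the three lists can be sketched as follows. This is only an illustration under stated assumptions: the `reclassify` function, the `'abnormal'` and `'normal'` labels, and the `classify` callable (standing in for the recomputed feature scoring) are hypothetical names that do not appear in the source.

```python
def reclassify(lists, classify):
    """Re-run identification over every listed URL and file it under its new list.

    lists: dict with keys 'black', 'white', 'grey' mapping to sets of URLs.
    classify: callable returning 'abnormal', 'normal', or any other value
              when the result is still undecided (hypothetical labels).
    """
    all_urls = set().union(*lists.values())
    updated = {"black": set(), "white": set(), "grey": set()}
    for url in all_urls:
        result = classify(url)
        if result == "abnormal":
            updated["black"].add(url)   # high-probability abnormal website
        elif result == "normal":
            updated["white"].add(url)   # high-probability normal website
        else:
            updated["grey"].add(url)    # stays grey until the next iteration
    return updated
```

Returning a fresh dictionary rather than mutating the input keeps each iteration's result easy to compare with the previous one.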
Optionally, the method for calculating the domain name similarity weight includes:
establishing a white list website domain name library;
comparing the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library, and judging whether any of the following exists: common spelling errors, vowel substitution, homophone substitution (characters that sound alike but are written differently), wrong top-level domain substitution, wrong second-level domain substitution, singular/plural conversion, homoglyphs (look-alike characters), omission or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters, to obtain a judgment result;
and according to the judgment result, calculating similarity score values of the domain name of the URL corresponding to the URL request and the domain name in the white list website domain name library, and acquiring the maximum value in the score values as the domain name similarity weight of the URL corresponding to the URL request.
In a second aspect, an embodiment of the present invention provides an apparatus for identifying a website, where the apparatus includes:
the white list processing device is used for receiving a Uniform Resource Locator (URL) request of a user for accessing a website, searching a white list for the URL corresponding to the URL request, and, if the URL is found in the white list, connecting to the URL;
the blacklist processing device is used for searching a blacklist for the URL corresponding to the URL request, and, if the URL is found in the blacklist, generating high-risk prompt information;
and the abnormal website processing device is used for calculating each feature weight value of the URL corresponding to the URL request according to a preset rule if the URL is found in neither the white list nor the blacklist, and identifying from the feature weight values whether the URL is an abnormal website.
Optionally, the abnormal website processing device is specifically used for:
calculating, according to a preset rule, the feature weight values of four dimensions of the URL corresponding to the URL request, namely the domain name similarity weight, the webpage content similarity weight, the user report volume weight, and the secondary visit volume weight.
Optionally, the abnormal website specifically includes:
high probability abnormal websites, suspected abnormal websites and high probability normal websites.
Optionally, the apparatus further comprises:
the secondary identification device is used for carrying out secondary identification on the URL corresponding to the URL request if the URL corresponding to the URL request is an abnormal website;
the high-probability abnormal website processing device is used for generating high-risk prompt information if the secondary identification result is the high-probability abnormal website, tracking and identifying the high-probability abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the high-probability abnormal website to the blacklist;
the high-probability normal website processing device is used for directly connecting the high-probability normal website and adding the high-probability normal website to the white list if the secondary identification result is the high-probability normal website;
and the suspected abnormal website processing device is used for generating general risk prompt information if the secondary identification result is the suspected abnormal website, tracking and identifying the suspected abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the suspected abnormal website to a grey list.
Optionally, the apparatus further comprises:
the iterative computation device is used for performing iterative computation and identification on the blacklist, the white list, and the grey list according to periodically updated user feedback information, crawled webpage content, updated webpage content feature similarity values, and website secondary visit volume;
the high-probability abnormal website iteration device is used for adding the high-probability abnormal website into the blacklist if the identification result is the high-probability abnormal website;
the high-probability normal website iteration device is used for adding the high-probability normal website into the white list if the identification result is the high-probability normal website;
and the suspected abnormal website iteration device is used for continuing to remain in the grey list to wait for the next iteration calculation for identification if the identification result is neither the high-probability abnormal website nor the high-probability normal website.
Optionally, the device for calculating the domain name similarity weight specifically includes:
the white list website establishing device is used for establishing a white list website domain name library;
a comparison device, configured to compare the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library, and determine whether any of the following exists: common spelling errors, vowel substitution, homophone substitution, wrong top-level domain substitution, wrong second-level domain substitution, singular/plural conversion, homoglyphs (look-alike characters), omission or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, or insertion or deletion of separator characters, so as to obtain a determination result;
and the processing device is used for calculating similarity score values of the domain name of the URL corresponding to the URL request and the domain name in the white list website domain name library according to the judgment result, and acquiring the maximum value in the score values as the domain name similarity weight of the URL corresponding to the URL request.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform any of the corresponding methods described above.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium storing a computer program, the computer program causing the computer to perform any of the corresponding methods described above.
According to the website identification method and device provided by the embodiments of the present invention, abnormal websites are comprehensively analyzed and identified along multiple dimensions (domain name similarity, webpage content similarity, user report information, and the website's secondary visit volume), and on this basis a multi-dimensional abnormal website identification algorithm model is established to classify websites and implement tiered access control.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart illustrating a website identification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another method for identifying websites in accordance with an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an identification apparatus for websites according to an embodiment of the present invention;
FIG. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for identifying a website, and fig. 1 is a schematic flow chart of the method for identifying a website in the embodiment of the present invention, and as shown in fig. 1, the method includes:
Step S101, receiving a Uniform Resource Locator (URL) request of a user for accessing a website, searching a white list for the URL corresponding to the URL request, and, if the URL is found in the white list, connecting to the URL;
The white list is the counterpart of the blacklist. For example, in computer systems, black and white list rules are applied by much software: operating systems, firewalls, antivirus software, mail systems, application software, and so on; they are used in almost everything that involves access control. Once a white list is enabled, the users (or IP addresses, IP packets, emails, etc.) on the white list are passed with priority and are not rejected as spam, which greatly improves security and speed. By extension, any application that has a blacklist function also has a corresponding white list function.
The URL request is the URL that a user accessing a website sends and currently wants to connect to.
Step S102, searching a blacklist for the URL corresponding to the URL request, and, if the URL is found in the blacklist, generating high-risk prompt information;
The blacklist means that, once it is enabled, the users (or IP addresses, IP packets, emails, viruses, etc.) listed in it cannot pass.
Step S103, if the URL corresponding to the URL request is found in neither the white list nor the blacklist, calculating each feature weight value of the URL according to a preset rule, and identifying from the feature weight values whether the URL is an abnormal website.
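Steps S101 to S103 amount to a three-stage lookup, sketched below. The `identify` function, the action labels, and the `score_fn` callable (standing in for the feature weight calculation) are hypothetical names chosen for illustration.

```python
def identify(url, whitelist, blacklist, score_fn):
    """Return (action, score) following the order of steps S101 to S103.

    whitelist, blacklist: sets of URLs.
    score_fn: callable computing the feature-based score for an unlisted URL.
    """
    if url in whitelist:                 # S101: white list hit, connect
        return ("connect", None)
    if url in blacklist:                 # S102: blacklist hit, warn the user
        return ("high_risk_prompt", None)
    return ("scored", score_fn(url))     # S103: fall through to feature scoring
```

Because the white list is checked first, the comparatively expensive scoring step only runs for URLs that neither list has seen.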
Weight is a relative concept: the weight of an indicator is its relative importance within the overall evaluation. Weights are assigned across a set of evaluation indicators, and the weights corresponding to a set of evaluation indicators form a weight system.
An abnormal website is, for example, a phishing website that uses deceptive email to steal your identity: the fraudster tries to win your trust under false pretenses so that you disclose valuable personal data such as credit card numbers, passwords, account data, or other information. Abnormal websites also include pornographic websites, Trojan or virus download links, and similar sites. Such schemes may be carried out by telephone or short message, or online through spam or pop-up windows.
In the method for identifying abnormal websites provided by this embodiment, the feature weight values of a website are calculated according to preset rules, and the probability that the website is abnormal is derived from those feature weight values. On the basis of the above method embodiment, the calculating, according to a preset rule, each feature weight value of the URL corresponding to the URL request specifically includes:
and calculating the feature weight values of four dimensions, namely the domain name similarity weight, the webpage content similarity weight, the user reporting amount weight and the secondary visit amount weight of the URL corresponding to the URL request according to a preset rule.
The method for calculating the domain name similarity weight comprises the following steps:
first, establishing a domain name library of common white list websites, including the websites of common operators, banks, e-commerce companies, and public security and judicial authorities;
second, comparing the domain name of the URL to be detected with the domain names in the white list one by one, and judging whether any of the following exists: common spelling errors, vowel substitution, homophone substitution, wrong top-level domain substitution, wrong second-level domain substitution, singular/plural conversion, homoglyphs (look-alike characters), omission or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, insertion or deletion of separator characters, and the like;
third, calculating, from the result of the second step, the similarity score between the domain name and each domain name in the white list, and taking the maximum value as the domain name similarity score of the domain name.
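The patent does not give the scoring formula, so the sketch below substitutes a generic technique: optimal string alignment (restricted Damerau-Levenshtein) distance, which covers omitted, repeated, substituted, and transposed characters but not homoglyph or keyboard-adjacency rules. The function names and the normalization to a 0-to-1 score are assumptions; the maximum over the white list follows the third step above.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

def domain_similarity(domain: str, whitelist_domains) -> float:
    """Return the highest normalized similarity against any whitelisted domain."""
    best = 0.0
    for legit in whitelist_domains:
        longest = max(len(domain), len(legit)) or 1
        best = max(best, 1.0 - osa_distance(domain, legit) / longest)
    return best
```

For instance, `domain_similarity("1008.cn", ["10086.cn", "ccb.com"])` is dominated by the near match with `10086.cn`, one dropped character away.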
The method for calculating the similarity weight of the webpage content comprises the following steps:
first, establishing a webpage content feature library for common white list websites, for example for www.10086.cn and www.ccb.com, where the features include titles, keywords, pictures, and so on;
second, crawling the webpage content features of the suspected abnormal website with a web crawler;
A web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.
A web crawler is a program that automatically fetches web pages; it downloads pages from the World Wide Web for a search engine and is an important component of one. A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs on those pages, and, while fetching pages, continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it filters out links irrelevant to the topic using a web page analysis algorithm, and keeps the useful links in a queue of URLs to be fetched. It then selects the next URL from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. In addition, all pages fetched by the crawler are stored by the system, analyzed, filtered, and indexed for later query and retrieval; for a focused crawler, the results of this analysis may also provide feedback and guidance for subsequent fetching.
third, using the domain name similarity algorithm, retrieving from the white list feature library the feature information of the white list domain name most similar to the suspected abnormal website, and calculating the webpage content feature similarity of the suspected abnormal website against it.
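The source does not say how the content features are compared. As one possible illustration, the sketch below uses Jaccard similarity over token sets for the title and keyword features; the field names, the equal field weights, and the omission of picture features are all assumptions.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity of two token collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def content_similarity(suspect: dict, reference: dict, field_weights=None) -> float:
    """Weighted average of per-field Jaccard similarities.

    suspect, reference: dicts mapping a feature name to a list of tokens.
    field_weights: hypothetical per-field weights, defaulting to equal weights.
    """
    weights = field_weights or {"title": 0.5, "keywords": 0.5}
    total = sum(weights.values())
    score = sum(w * jaccard(suspect.get(f, ()), reference.get(f, ()))
                for f, w in weights.items())
    return score / total
```

Here `reference` would be the feature record of the whitelisted site selected in the third step, and `suspect` the features crawled from the suspected site.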
The method for calculating the weight of the user report volume comprises the following steps:
first, counting the number of times the website has been reported by users as an abnormal website or added by users to a blacklist;
second, counting the number of times the website has been reported by users as a normal website or appealed by users for inclusion in the white list;
third, calculating the website's user report feature score from these counts.
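How the two counts become a score is not specified in the source. One simple possibility, shown purely as an assumption, is the smoothed share of abnormal reports, which yields 0.5 when no reports exist; `report_score` and the add-one smoothing are illustrative choices.

```python
def report_score(abnormal_reports: int, normal_reports: int) -> float:
    """Smoothed share of 'abnormal' reports among all user reports.

    Add-one (Laplace) smoothing is an assumption; it keeps a site with no
    reports at a neutral 0.5 instead of dividing by zero.
    """
    if abnormal_reports < 0 or normal_reports < 0:
        raise ValueError("report counts must be non-negative")
    return (abnormal_reports + 1) / (abnormal_reports + normal_reports + 2)
```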
The calculation method of the secondary visit volume weight comprises the following steps:
counting the number and proportion of secondary visits to the website after a risk prompt has been shown, and calculating the website's secondary visit volume feature score.
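The proportion mentioned above can be computed as in the following sketch. The function name and arguments are hypothetical, and the source does not state how the proportion is mapped to a risk direction, so the sketch returns only the raw ratio.

```python
def secondary_visit_ratio(prompted: int, proceeded: int) -> float:
    """Share of risk prompts after which the user still visited the site.

    prompted: number of times a risk prompt was shown for the website.
    proceeded: number of those prompts followed by a secondary visit.
    """
    if prompted < 0 or proceeded < 0 or proceeded > prompted:
        raise ValueError("require 0 <= proceeded <= prompted")
    return proceeded / prompted if prompted else 0.0
```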
The abnormal website identification algorithm model combines the URL's domain name similarity score, webpage content similarity score, user report feature score, and secondary visit volume feature score, makes a decision according to their different weights, and finally determines whether the URL is a counterfeit URL.
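A weighted combination of the four scores might look like the sketch below. The weight values and the feature key names are invented for illustration; the source only states that the scores are combined according to different weights.

```python
# Hypothetical weights for the four dimensions; the patent gives no values.
WEIGHTS = {"domain": 0.4, "content": 0.3, "reports": 0.2, "revisit": 0.1}

def combined_score(features: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the four feature scores, each assumed to lie in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * features[name] for name in weights) / total
```

Dividing by the weight total keeps the result in the same 0-to-1 range as the inputs even if the weights are later retuned and no longer sum to one.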
On the basis of the embodiment of the method, the abnormal website specifically comprises:
high probability abnormal websites, suspected abnormal websites and high probability normal websites.
A high-probability abnormal website is one that, according to the feature weight values calculated under the preset rules, is very likely to be an abnormal or dangerous website.
A high-probability normal website is one that, according to the feature weight values calculated under the preset rules, is very unlikely to be an abnormal website, that is, a non-dangerous website.
A suspected abnormal website is one whose likelihood of being abnormal, according to the feature weight values calculated under the preset rules, is uncertain and still has to be calculated and investigated further.
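The three categories can be read as two thresholds on a single abnormality score, as in this sketch. The threshold values 0.8 and 0.3, the label strings, and the assumption that a higher score means more likely abnormal are all hypothetical.

```python
def classify(score: float, high: float = 0.8, low: float = 0.3) -> str:
    """Map an abnormality score in [0, 1] to one of the three categories."""
    if not low < high:
        raise ValueError("low threshold must be below high threshold")
    if score >= high:
        return "high_probability_abnormal"
    if score <= low:
        return "high_probability_normal"
    return "suspected_abnormal"   # uncertain, to be investigated further
```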
On the basis of the above method embodiment, the method further comprises:
if the URL corresponding to the URL request is an abnormal website, performing secondary identification on the URL corresponding to the URL request;
the secondary identification tracks and identifies the websites whose first identification result is high-probability abnormal or suspected abnormal, lets the user through and counts the secondary visit amount when the user visits a second time, iteratively recalculates the feature weight values of these two kinds of websites according to the preset rule, and obtains a new identification result.
If the secondary identification result is the high-probability abnormal website, generating high-risk prompt information, tracking and identifying the high-probability abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the high-probability abnormal website to the blacklist;
if the secondary identification result is the high-probability normal website, directly connecting the high-probability normal website, and adding the high-probability normal website to the white list;
if the result of the secondary identification is the suspected abnormal website, generating general risk prompt information, tracking and identifying the suspected abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the suspected abnormal website to a grey list.
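The three branches above amount to a small dispatch on the secondary identification result. The sketch below assumes in-memory sets for the three lists and a dictionary for the secondary connection counts; the label strings, prompt names and function names are hypothetical.

```python
from typing import Dict, Set


def apply_secondary_result(url: str, result: str,
                           lists: Dict[str, Set[str]],
                           secondary_visits: Dict[str, int]) -> str:
    """Update the lists for one URL and return the prompt level shown to the user."""
    if result == "high_probability_abnormal":
        lists["black"].add(url)
        secondary_visits.setdefault(url, 0)  # start tracking secondary connections
        return "high_risk_prompt"
    if result == "high_probability_normal":
        lists["white"].add(url)  # connected directly, no tracking needed
        return "no_prompt"
    if result == "suspected_abnormal":
        lists["grey"].add(url)
        secondary_visits.setdefault(url, 0)
        return "general_risk_prompt"
    raise ValueError(f"unknown result: {result}")


def record_secondary_visit(url: str, secondary_visits: Dict[str, int]) -> None:
    """Count one secondary connection made after a risk prompt."""
    secondary_visits[url] = secondary_visits.get(url, 0) + 1
```

The counts collected by `record_secondary_visit` are the input to the secondary visit amount statistic described earlier.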
On the basis of the above method embodiment, the method further comprises:
performing iterative computation and identification on the blacklist, the white list and the grey list according to each piece of user feedback, the webpage content feature similarity value updated from periodically crawled webpage content, and the periodically updated website secondary visit amount;
if the identification result is the high-probability abnormal website, adding the high-probability abnormal website into the blacklist;
if the identification result is the high-probability normal website, adding the high-probability normal website into the white list;
and if the identification result is neither the high-probability abnormal website nor the high-probability normal website, keeping the website in the grey list to await identification in the next iterative computation.
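A single iteration of this list maintenance might look like the sketch below. It assumes a caller-supplied `reclassify` function that recomputes the identification result from the refreshed features; that function, the label strings and the list keys are hypothetical.

```python
from typing import Callable, Dict, Set


def reevaluate_lists(lists: Dict[str, Set[str]],
                     reclassify: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Re-identify every listed URL and return the regrouped lists."""
    updated: Dict[str, Set[str]] = {"black": set(), "white": set(), "grey": set()}
    for name in ("black", "white", "grey"):
        for url in lists.get(name, set()):
            label = reclassify(url)
            if label == "high_probability_abnormal":
                updated["black"].add(url)
            elif label == "high_probability_normal":
                updated["white"].add(url)
            else:
                # Neither verdict: stay in the grey list until the next iteration.
                updated["grey"].add(url)
    return updated
```

Building new sets rather than mutating the old ones avoids modifying a list while it is being iterated.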
According to the website identification method provided by the embodiment of the invention, a three-level access control and user feedback mechanism is established through a grey list of suspected abnormal websites, a white list of high-probability normal websites and a blacklist of high-probability abnormal websites (a fake base station URL library, a mobile phone malware link library, and a URL blacklist database collected by customer service), and the URLs in these three lists are continuously recalculated and updated according to information such as user feedback and website secondary visit amount, so that the misjudgment rate of the system is effectively reduced and the user experience is improved.
On the basis of the above method embodiment, the method for calculating the domain name similarity weight includes:
establishing a white list website domain name library;
comparing the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library, and judging whether any of the following exists, to obtain a judgment result: common spelling errors, vowel character substitution, homophone substitution (characters that sound alike but are written differently), wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, homoglyph characters, deletion or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters;
and according to the judgment result, calculating similarity score values of the domain name of the URL corresponding to the URL request and the domain name in the white list website domain name library, and acquiring the maximum value in the score values as the domain name similarity weight of the URL corresponding to the URL request.
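The "compare against every whitelisted domain and keep the maximum" step can be sketched as below. The per-rule scoring of the source is not disclosed, so this sketch substitutes a generic technique: it folds a few homoglyphs and separators, then uses the standard library's `difflib.SequenceMatcher` ratio as the similarity score. The homoglyph table and function names are assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable

# Tiny illustrative homoglyph table (hypothetical); a real one would be far larger.
HOMOGLYPHS = str.maketrans({"0": "o", "1": "l", "5": "s"})


def normalize(domain: str) -> str:
    """Lower-case, fold look-alike digits to letters, and drop hyphen separators."""
    return domain.lower().translate(HOMOGLYPHS).replace("-", "")


def domain_similarity(candidate: str, whitelist_domains: Iterable[str]) -> float:
    """Return the highest similarity between the candidate and any whitelisted domain."""
    cand = normalize(candidate)
    return max(
        (SequenceMatcher(None, cand, normalize(known)).ratio() for known in whitelist_domains),
        default=0.0,  # an empty domain library yields no similarity
    )
```

A domain that is nearly, but not exactly, identical to a whitelisted one scores close to 1.0, which is the situation the listed typo patterns are meant to catch.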
According to the website identification method provided by the embodiment of the invention, the domain name is comprehensively analyzed by a domain name similarity analysis algorithm from 16 angles, including common spelling errors, vowel character substitution, homophone substitution, wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, homoglyph characters, deletion or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters, so the identification accuracy is high.
A specific implementation of the embodiment of the invention is as follows:
Fig. 2 is a flowchart of another website identification method in an embodiment of the present invention; as shown in fig. 2, the method specifically includes:
step one, performing white list filtering on a URL request submitted by a user for the first time; if the white list is hit, the website is judged to be a normal website and the request is passed through directly;
step two, if the white list is not hit, performing blacklist filtering on the domain name (the blacklist library mainly comprises a fake base station URL library, a mobile phone malware link library, a URL blacklist database collected by customer service, and a library of trojan-infected websites obtained by a crawler from a trojan reporting platform); if the blacklist is hit, giving the user a high-risk prompt;
step three, if the blacklist is not hit, calculating feature values of four dimensions, namely the domain name similarity weight, the webpage content similarity weight, the user reporting amount weight and the website secondary visit amount weight, and classifying the website with the abnormal website identification algorithm model as a high-probability abnormal website, a suspected abnormal website or a high-probability normal website;
step four, for a high-probability abnormal website, giving the user a high-risk prompt, tracking and identifying the website so that the user is passed through and the secondary visit amount is counted on a second visit, and storing the website in the high-probability abnormal website blacklist library; for a high-probability normal website, judging it to be a normal website, passing the request through directly, and adding the website to the high-probability normal website white list library; for a suspected abnormal website, giving the user a general risk prompt, adding the website to the suspected abnormal website grey list library, and tracking and identifying the website so that the user is passed through and the secondary visit amount is counted on a second visit;
step five, for the high-probability abnormal website blacklist library, the suspected abnormal website grey list library and the high-probability normal website white list library, performing iterative computation and identification according to information such as each piece of user feedback, the webpage content feature similarity value updated from periodically crawled webpage content, and the periodically updated website secondary visit amount; websites identified as high-probability abnormal are stored in the high-probability abnormal website blacklist library; websites identified as high-probability normal are stored in the high-probability normal website white list library; all other websites remain in the suspected abnormal website grey list library to await the next iterative computation.
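The order of the checks in steps one to four can be sketched as a single function. The classifier is passed in as a callable because its internals are described elsewhere; the decision strings, label strings and the function name are hypothetical, and the list updates and visit tracking of step four are left out for brevity.

```python
from typing import Callable, Set


def handle_url_request(url: str,
                       whitelist: Set[str],
                       blacklist: Set[str],
                       classify_url: Callable[[str], str]) -> str:
    """Return the access decision for one URL request."""
    # Step one: a white list hit is passed through without further checks.
    if url in whitelist:
        return "allow"
    # Step two: a blacklist hit produces a high-risk prompt.
    if url in blacklist:
        return "high_risk_prompt"
    # Step three: only unknown URLs reach the feature-based classifier.
    label = classify_url(url)
    # Step four: map the classification to a decision.
    if label == "high_probability_abnormal":
        return "high_risk_prompt"
    if label == "high_probability_normal":
        return "allow"
    return "general_risk_prompt"
```

Checking the two lists first means the comparatively expensive feature calculation runs only for URLs that neither list already covers.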
According to the website identification method provided by the embodiment of the invention, abnormal websites are comprehensively analyzed and identified along multiple dimensions, namely domain name similarity, webpage content similarity, user-reported information and website secondary visit amount, and an abnormal website identification algorithm model for multi-dimensional comprehensive analysis and judgment is established on this basis to classify websites and thereby realize hierarchical access control.
An embodiment of the present invention provides an apparatus for identifying a website, and fig. 3 is a schematic structural diagram of the apparatus for identifying a website in an embodiment of the present invention, and as shown in fig. 3, the apparatus includes: white list processing means 301, black list processing means 302 and abnormal website processing means 303; wherein,
the white list processing device 301 is configured to receive a URL request of a user for accessing a website, search a white list for the URL corresponding to the URL request, and connect to that URL if it is found in the white list; the blacklist processing device 302 is configured to search a blacklist for the URL corresponding to the URL request, and generate high-risk prompt information if the URL is found in the blacklist; the abnormal website processing device 303 is configured to, if the URL corresponding to the URL request is found in neither the white list nor the blacklist, calculate each feature weight value of the URL according to a preset rule, and identify whether the URL is an abnormal website according to the feature weight values.
According to the website identification device provided by the embodiment of the invention, through the abnormal website processing device, the corresponding characteristic weight values of the website are calculated according to the preset rule, and the probability that the website is an abnormal website is judged by calculating the obtained characteristic weight values.
On the basis of the above method embodiment, the calculating, according to a preset rule, each feature weight value of the URL corresponding to the URL request specifically includes:
and calculating the feature weight values of four dimensions, namely the domain name similarity weight, the webpage content similarity weight, the user reporting amount weight and the secondary visit amount weight of the URL corresponding to the URL request according to a preset rule.
On the basis of the embodiment of the method, the abnormal website specifically comprises:
high probability abnormal websites, suspected abnormal websites and high probability normal websites.
Optionally, the apparatus further comprises: a secondary identification device, a high-probability abnormal website processing device, a high-probability normal website processing device and a suspected abnormal website processing device; wherein,
The secondary identification device is used for carrying out secondary identification on the URL corresponding to the URL request if the URL corresponding to the URL request is an abnormal website; the high-probability abnormal website processing device is used for generating high-risk prompt information if the secondary identification result is the high-probability abnormal website, tracking and identifying the high-probability abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the high-probability abnormal website to the blacklist; the high-probability normal website processing device is used for directly connecting the high-probability normal website and adding the high-probability normal website to the white list if the secondary identification result is the high-probability normal website; and the suspected abnormal website processing device is used for generating general risk prompt information if the secondary identification result is the suspected abnormal website, tracking and identifying the suspected abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the suspected abnormal website to a grey list.
On the basis of the above method embodiment, the apparatus further includes: an iterative computation device, a high-probability abnormal website iteration device, a high-probability normal website iteration device and a suspected abnormal website iteration device; wherein,
the iterative computation device is used for performing iterative computation and identification on the blacklist, the white list and the grey list according to each piece of user feedback, the webpage content feature similarity value updated from periodically crawled webpage content, and the periodically updated website secondary visit amount; the high-probability abnormal website iteration device is used for adding the high-probability abnormal website into the blacklist if the identification result is the high-probability abnormal website; the high-probability normal website iteration device is used for adding the high-probability normal website into the white list if the identification result is the high-probability normal website; and the suspected abnormal website iteration device is used for keeping the website in the grey list to await identification in the next iterative computation if the identification result is neither the high-probability abnormal website nor the high-probability normal website.
According to the website identification device provided by the embodiment of the invention, a three-level access control and user feedback mechanism is established through an iterative computation device operating on a grey list of suspected abnormal websites, a white list of high-probability normal websites and a blacklist of high-probability abnormal websites (a fake base station URL library, a mobile phone malware link library, and a URL blacklist database collected by customer service), and the URLs in these three lists are continuously recalculated and updated according to information such as user feedback and website secondary visit amount, so that the misjudgment rate of the system is effectively reduced and the user experience is improved.
On the basis of the above method embodiment, the device for calculating the domain name similarity weight includes: a white list website establishing device, a comparing device and a processing device; wherein,
the white list website establishing device is used for establishing a white list website domain name library; the comparing device is used for comparing the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library, and judging whether any of the following exists, to obtain a judgment result: common spelling errors, vowel character substitution, homophone substitution (characters that sound alike but are written differently), wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, homoglyph characters, deletion or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters; and the processing device is used for calculating, according to the judgment result, similarity score values between the domain name of the URL corresponding to the URL request and the domain names in the white list website domain name library, and taking the maximum of the score values as the domain name similarity weight of the URL corresponding to the URL request.
According to the website identification device provided by the embodiment of the invention, the domain name is comprehensively analyzed by a domain name similarity analysis and calculation device from 16 angles, including common spelling errors, vowel character substitution, homophone substitution, wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, homoglyph characters, deletion or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters, so the identification accuracy is high.
The website identification device provided in the embodiment of the present invention is used for implementing the website identification method provided in the embodiment of the present invention, and the specific implementation manner has been specifically stated in the above method embodiment, and is not described herein again.
According to the website identification device provided by the embodiment of the invention, abnormal websites are comprehensively analyzed and identified along multiple dimensions, namely domain name similarity, webpage content similarity, user-reported information and website secondary visit amount, and an abnormal website identification algorithm model for multi-dimensional comprehensive analysis and judgment is established on this basis to classify websites and thereby realize hierarchical access control.
Fig. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 4, the electronic device includes: a processor (processor)401, a memory (memory)402, and a bus 403;
wherein, the processor 401 and the memory 402 complete the communication with each other through the bus 403; the processor 401 is configured to call program instructions in the memory 402 to perform the methods provided by the above-described method embodiments.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the method embodiments described above.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for identifying a website, the method comprising:
receiving a Uniform Resource Locator (URL) request of a user for accessing a website, searching a URL corresponding to the URL request in a white list, and if the URL corresponding to the URL request is searched in the white list, connecting the URL corresponding to the URL request;
searching a URL corresponding to the URL request in a blacklist, and if the URL corresponding to the URL request is searched in the blacklist, generating high-risk prompt information;
if the URL corresponding to the URL request is not found in the white list and the black list, calculating each characteristic weight value of the URL corresponding to the URL request according to a preset rule, and identifying whether the URL corresponding to the URL request is an abnormal website or not according to each characteristic weight value.
2. The method according to claim 1, wherein the calculating each feature weight value of the URL corresponding to the URL request according to a preset rule specifically includes:
and calculating the feature weight values of four dimensions, namely the domain name similarity weight, the webpage content similarity weight, the user reporting amount weight and the secondary visit amount weight of the URL corresponding to the URL request according to a preset rule.
3. The method according to claim 1, wherein the abnormal website specifically comprises:
high probability abnormal websites, suspected abnormal websites and high probability normal websites.
4. The method of claim 3, further comprising:
if the URL corresponding to the URL request is an abnormal website, performing secondary identification on the URL corresponding to the URL request;
if the secondary identification result is the high-probability abnormal website, generating high-risk prompt information, tracking and identifying the high-probability abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the high-probability abnormal website to the blacklist;
if the secondary identification result is the high-probability normal website, directly connecting the high-probability normal website, and adding the high-probability normal website to the white list;
if the result of the secondary identification is the suspected abnormal website, generating general risk prompt information, tracking and identifying the suspected abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the suspected abnormal website to a grey list.
5. The method of claim 4, further comprising:
performing iterative computation and identification on the blacklist, the white list and the grey list according to each piece of user feedback, the webpage content feature similarity value updated from periodically crawled webpage content, and the periodically updated website secondary visit amount;
if the identification result is the high-probability abnormal website, adding the high-probability abnormal website into the blacklist;
if the identification result is the high-probability normal website, adding the high-probability normal website into the white list;
and if the identification result is neither the high-probability abnormal website nor the high-probability normal website, keeping the website in the grey list to await identification in the next iterative computation.
6. The method according to claim 2, wherein the method for calculating the domain name similarity weight comprises:
establishing a white list website domain name library;
comparing the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library, and judging whether any of the following exists, to obtain a judgment result: common spelling errors, vowel character substitution, homophone substitution, wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, homoglyph characters, deletion or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters;
and according to the judgment result, calculating similarity score values of the domain name of the URL corresponding to the URL request and the domain name in the white list website domain name library, and acquiring the maximum value in the score values as the domain name similarity weight of the URL corresponding to the URL request.
7. An apparatus for identifying a website, the apparatus comprising:
the white list processing device is used for receiving a Uniform Resource Locator (URL) request of a user for accessing a website, searching a URL corresponding to the URL request in a white list, and if the URL corresponding to the URL request is searched in the white list, connecting the URL corresponding to the URL request;
the blacklist processing device is used for searching a URL corresponding to the URL request in a blacklist, and if the URL corresponding to the URL request is searched in the blacklist, high risk prompt information is generated;
and the abnormal website processing device is used for calculating each characteristic weight value of the URL corresponding to the URL request according to a preset rule if the URL corresponding to the URL request is not found in the white list and the black list, and identifying whether the URL corresponding to the URL request is an abnormal website or not according to each characteristic weight value.
8. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 6.
9. A non-transitory computer-readable storage medium storing a computer program that causes a computer to perform the method according to any one of claims 1 to 6.
CN201710565741.8A 2017-07-12 2017-07-12 Website identification method and device Active CN109274632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710565741.8A CN109274632B (en) 2017-07-12 2017-07-12 Website identification method and device

Publications (2)

Publication Number Publication Date
CN109274632A true CN109274632A (en) 2019-01-25
CN109274632B CN109274632B (en) 2021-05-11

Family

ID=65147708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710565741.8A Active CN109274632B (en) 2017-07-12 2017-07-12 Website identification method and device

Country Status (1)

Country Link
CN (1) CN109274632B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101035128A (en) * 2007-04-18 2007-09-12 大连理工大学 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
US7854001B1 (en) * 2007-06-29 2010-12-14 Trend Micro Incorporated Aggregation-based phishing site detection
US8544090B1 (en) * 2011-01-21 2013-09-24 Symantec Corporation Systems and methods for detecting a potentially malicious uniform resource locator
CN103428186A (en) * 2012-05-24 2013-12-04 中国移动通信集团公司 Method and device for detecting phishing website
CN103607385A (en) * 2013-11-14 2014-02-26 北京奇虎科技有限公司 Method and apparatus for security detection based on browser
CN106209488A (en) * 2015-04-28 2016-12-07 北京瀚思安信科技有限公司 For detecting the method and apparatus that website is attacked
CN106603490A (en) * 2016-11-10 2017-04-26 上海斐讯数据通信技术有限公司 Phishing website detecting method and system

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109831465A (en) * 2019-04-12 2019-05-31 重庆天蓬网络有限公司 A kind of invasion detection method based on big data log analysis
CN110069693A (en) * 2019-04-29 2019-07-30 百度在线网络技术(北京)有限公司 Method and apparatus for determining target pages
CN110069693B (en) * 2019-04-29 2021-12-24 百度在线网络技术(北京)有限公司 Method and device for determining target page
CN111147490A (en) * 2019-12-26 2020-05-12 中国科学院信息工程研究所 Directional fishing attack event discovery method and device
CN111756728B (en) * 2020-06-23 2021-08-17 深圳前海微众银行股份有限公司 Vulnerability attack detection method and device, computing equipment and storage medium
CN111756728A (en) * 2020-06-23 2020-10-09 深圳前海微众银行股份有限公司 Vulnerability attack detection method and device
CN112256988A (en) * 2020-10-19 2021-01-22 中国互联网金融协会 Method and device for monitoring cross-border house-buying website, electronic equipment and storage medium
CN112417329A (en) * 2020-10-19 2021-02-26 中国互联网金融协会 Method and device for monitoring illegal internet foreign exchange deposit transaction platform
CN112733057A (en) * 2020-11-27 2021-04-30 杭州安恒信息安全技术有限公司 Network content security detection method, electronic device and storage medium
CN112948725A (en) * 2021-03-02 2021-06-11 北京六方云信息技术有限公司 Phishing website URL detection method and system based on machine learning
CN114389854A (en) * 2021-12-22 2022-04-22 杭州美创科技有限公司 Malicious e-mail detection method and system
CN115801455A (en) * 2023-01-31 2023-03-14 北京微步在线科技有限公司 Website fingerprint-based counterfeit website detection method and device
CN115801455B (en) * 2023-01-31 2023-05-26 北京微步在线科技有限公司 Method and device for detecting counterfeit website based on website fingerprint
CN116366338A (en) * 2023-03-30 2023-06-30 北京微步在线科技有限公司 Risk website identification method and device, computer equipment and storage medium
CN116366338B (en) * 2023-03-30 2024-02-06 北京微步在线科技有限公司 Risk website identification method and device, computer equipment and storage medium
CN116846668A (en) * 2023-07-28 2023-10-03 北京中睿天下信息技术有限公司 Harmful URL detection method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN109274632B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN109274632B (en) Website identification method and device
US11310268B2 (en) Systems and methods using computer vision and machine learning for detection of malicious actions
US11399288B2 (en) Method for HTTP-based access point fingerprint and classification using machine learning
US8438386B2 (en) System and method for developing a risk profile for an internet service
US9954886B2 (en) Method and apparatus for detecting website security
CN104954372B (en) Evidence collection and verification method and system for phishing websites
Ahmed et al. Real time detection of phishing websites
US8763116B1 (en) Detecting fraudulent activity by analysis of information requests
CN109768992B (en) Webpage malicious scanning processing method and device, terminal device and readable storage medium
CN104217160A (en) Method and system for detecting Chinese phishing website
CN108023868B (en) Malicious resource address detection method and device
WO2023044060A1 (en) Malicious homoglyphic domain name detection, generation, and associated cyber security applications
Banerjee et al. SUT: Quantifying and mitigating url typosquatting
CN104135467B (en) Method and device for identifying malicious websites
US20210006592A1 (en) Phishing Detection based on Interaction with End User
Ramesh et al. Identification of phishing webpages and its target domains by analyzing the feign relationship
CN107896225A (en) Phishing website determination method, server and storage medium
CN114244564A (en) Attack defense method, device, equipment and readable storage medium
Gupta et al. Robust injection point-based framework for modern applications against XSS vulnerabilities in online social networks
Thaker et al. Detecting phishing websites using data mining
Roopak et al. On effectiveness of source code and SSL based features for phishing website detection
US10313127B1 (en) Method and system for detecting and alerting users of device fingerprinting attempts
JP4564916B2 (en) Phishing fraud countermeasure method, terminal, server and program
Liu et al. Financial websites oriented heuristic anti-phishing research
Swathi et al. Detection of Phishing Websites Using Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant