CN109274632A - Website identification method and device

Info

Publication number: CN109274632A
Application number: CN201710565741.8A
Authority: CN
Inventors: 付为民; 郝建忠; 郑浩彬; 陈涛; 邬学农
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guangdong Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Guangdong Co Ltd
Priority date: 2017-07-12
Filing date: 2017-07-12
Publication date: 2019-01-25
Anticipated expiration: 2037-07-12
Also published as: CN109274632B

Abstract

An embodiment of the present invention provides a website identification method and device. The method includes: receiving a uniform resource locator (URL) request from a user accessing a website; searching a white list for the URL corresponding to the URL request and, if the URL is found in the white list, connecting to it; searching a blacklist for the URL and, if the URL is found in the blacklist, generating high-risk prompt information; and, if the URL is found in neither the white list nor the blacklist, calculating each feature weight value of the URL according to preset rules and identifying, according to the feature weight values, whether the URL is an abnormal website. The embodiment identifies abnormal websites quickly, accurately and efficiently, significantly reduces the system's misjudgment rate, and improves the user experience.

Description

Website identification method and device

Technical Field

The invention relates to the technical field of computers, in particular to a website identification method and device.

Background

With the rapid development of the mobile internet, users increasingly browse websites on mobile terminal devices rather than only on PCs. On June 22, 2016, the China Internet Network Information Center (CNNIC) released in Beijing its 37th Statistical Report on Internet Development in China. The report shows that by December 2015 the number of internet users in China had reached 688 million, of whom 620 million were mobile phone internet users, a share as high as 90.12%.

Meanwhile, security problems on mobile phone clients have become increasingly prominent. In 2015 the number of active networked smartphone terminals in China reached 1.13 billion, and problems such as counterfeit websites, phishing websites and malicious programs have increased, threatening users' online security and causing monetary loss or leakage of personal information.

Currently, operators mainly rely on a blacklist to intercept, on the network side, the Uniform Resource Locators (URLs) requested by mobile clients.

The blacklist method works as follows: a blacklist is configured on the Wireless Application Protocol (WAP) gateway. After a mobile HyperText Transfer Protocol (HTTP) request reaches the WAP gateway, the gateway parses the URL in the HTTP header and matches it against the blacklist entries in sequence. If the URL hits the blacklist, the WAP gateway no longer proxies the request; it returns a 403 response directly to the mobile terminal, and access to the page is denied.
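The gateway-side check described above can be sketched as follows. This is a minimal illustration only: the function name `gateway_status`, the example host names and the use of 200 to stand for "proxy the request" are assumptions, not details from the source.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries, for illustration only.
BLACKLIST = ["phish.example.com", "malware.example.net"]


def gateway_status(url: str, blacklist: list[str] = BLACKLIST) -> int:
    """Return 403 if the URL's host is blacklisted, otherwise 200.

    200 is a placeholder for "the gateway proxies the request as usual".
    """
    host = (urlsplit(url).hostname or "").lower()
    # Sequential matching, mirroring the "in sequence" lookup in the text.
    for entry in blacklist:
        if host == entry:
            return 403
    return 200
```

Because the entries are scanned one by one, the lookup time grows with the size of the blacklist.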

The advantage of the blacklist method is that it is simple and direct: for any URL that hits the blacklist, the proxy gateway does not forward the request to the origin server, which reduces the load on the proxy gateway. The mobile terminal directly receives a 403 access-denied page (rendered by the browser or the application).

The blacklist method has the following defects:

1. At present, the blacklist is deployed on the WAP gateway, which requires the user to configure the 10.0.0.172 proxy on the terminal. If the proxy is not configured, the user's internet traffic does not pass through the WAP gateway and cannot be intercepted.

According to statistics, more than 90% of users do not configure the 10.0.0.172 proxy on the terminal side, so the interception scheme has no effect for these users.

2. The blacklist interception mode and its page are too simple, so the user may mistakenly believe that a network fault has occurred, and the experience is poor.

Users mostly reach illegal websites through links pushed in illegal short messages, emails, advertisements and the like, and do not know that the website they are accessing is illegal, harmful or fake. The blacklist mode effectively prevents access, but the user only sees an oversimplified access-denied page and may mistakenly believe that the network or the website service has a problem, which lowers the user's opinion of the operator's network or the website. In addition, this mode easily leads the user to retry repeatedly or the client to retry automatically. Moreover, as the number of counterfeit and phishing websites grows, the blacklist becomes larger and larger, and a larger blacklist means that each match takes longer. All of this increases the processing load of the proxy gateway and reduces its processing efficiency, thereby slowing down the user's internet access.

3. The traditional blacklist interception mode requires very high data accuracy. To ensure that normal websites are not intercepted by mistake, a large amount of manual work is needed to audit entries one by one, which is time-consuming and labor-intensive, and the suspected websites across the whole internet, which number in the billions, cannot all be audited individually. In addition, counterfeit and phishing websites change domain names frequently, closely resemble one another and are short-lived, so the traditional blacklist mode no longer meets current needs.

4. The traditional blacklist interception mode cannot flexibly handle the majority of suspected websites: adding them to the blacklist for direct interception easily triggers complaints from the websites, while leaving them out of the blacklist exposes users to the risk of privacy leakage.

Therefore, how to improve the traditional blacklist interception mode and quickly, accurately and efficiently identify the abnormal website becomes a technical problem to be solved urgently.

Disclosure of Invention

Aiming at the defects in the prior art, the embodiment of the invention provides a website identification method and device.

In a first aspect, an embodiment of the present invention provides a method for identifying a website, where the method includes:

receiving a Uniform Resource Locator (URL) request of a user for accessing a website, searching a white list for the URL corresponding to the URL request, and if the URL corresponding to the URL request is found in the white list, connecting to the URL corresponding to the URL request;

searching a blacklist for the URL corresponding to the URL request, and if the URL corresponding to the URL request is found in the blacklist, generating high-risk prompt information;

if the URL corresponding to the URL request is not found in the white list and the black list, calculating each characteristic weight value of the URL corresponding to the URL request according to a preset rule, and identifying whether the URL corresponding to the URL request is an abnormal website or not according to each characteristic weight value.

Optionally, the calculating, according to a preset rule, each feature weight value of the URL corresponding to the URL request specifically includes:

and calculating the feature weight values of four dimensions, namely the domain name similarity weight, the webpage content similarity weight, the user reporting amount weight and the secondary visit amount weight of the URL corresponding to the URL request according to a preset rule.

Optionally, the abnormal website specifically includes:

high probability abnormal websites, suspected abnormal websites and high probability normal websites.

Optionally, the method further includes:

if the URL corresponding to the URL request is an abnormal website, performing secondary identification on the URL corresponding to the URL request;

if the secondary identification result is the high-probability abnormal website, generating high-risk prompt information, tracking and identifying the high-probability abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the high-probability abnormal website to the blacklist;

if the secondary identification result is the high-probability normal website, directly connecting the high-probability normal website, and adding the high-probability normal website to the white list;

if the result of the secondary identification is the suspected abnormal website, generating general risk prompt information, tracking and identifying the suspected abnormal website, secondarily connecting the suspected abnormal website, counting the secondary connection times, and adding the suspected abnormal website to a grey list.

Optionally, the method further includes:

performing iterative computation and identification on the blacklist, the white list and the grey list according to information such as each piece of user feedback, the webpage content feature similarity values updated from periodically crawled webpage content, and the periodically updated website secondary visit quantity;

if the identification result is the high-probability abnormal website, adding the high-probability abnormal website into the blacklist;

if the identification result is the high-probability normal website, adding the high-probability normal website into the white list;

and if the identification result is neither the high-probability abnormal website nor the high-probability normal website, keeping the website in the grey list to await identification in the next iterative computation.

Optionally, the method for calculating the domain name similarity weight includes:

establishing a white list website domain name library;

comparing the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library, and judging whether any of the following exists, to obtain a judgment result: common spelling errors, vowel character substitution, homophone (same sound, different spelling) substitution, wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, look-alike (homoglyph) characters, omission or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters;

and according to the judgment result, calculating similarity score values of the domain name of the URL corresponding to the URL request and the domain name in the white list website domain name library, and acquiring the maximum value in the score values as the domain name similarity weight of the URL corresponding to the URL request.

In a second aspect, an embodiment of the present invention provides an apparatus for identifying a website, where the apparatus includes:

the white list processing device is used for receiving a Uniform Resource Locator (URL) request of a user for accessing a website, searching a white list for the URL corresponding to the URL request, and if the URL corresponding to the URL request is found in the white list, connecting to the URL corresponding to the URL request;

the blacklist processing device is used for searching a blacklist for the URL corresponding to the URL request, and if the URL corresponding to the URL request is found in the blacklist, generating high-risk prompt information;

and the abnormal website processing device is used for calculating each characteristic weight value of the URL corresponding to the URL request according to a preset rule if the URL corresponding to the URL request is not found in the white list and the black list, and identifying whether the URL corresponding to the URL request is an abnormal website or not according to each characteristic weight value.

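Optionally, the abnormal website processing device is specifically configured to calculate, according to a preset rule, the feature weight values of four dimensions of the URL corresponding to the URL request, namely the domain name similarity weight, the webpage content similarity weight, the user reporting amount weight and the secondary visit amount weight.

Optionally, the abnormal website specifically includes high-probability abnormal websites, suspected abnormal websites and high-probability normal websites.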

Optionally, the apparatus further comprises:

the secondary identification device is used for carrying out secondary identification on the URL corresponding to the URL request if the URL corresponding to the URL request is an abnormal website;

the high-probability abnormal website processing device is used for generating high-risk prompt information if the secondary identification result is the high-probability abnormal website, tracking and identifying the high-probability abnormal website, secondarily connecting the high-probability abnormal website, counting the secondary connection times, and adding the high-probability abnormal website to the blacklist;

the high-probability normal website processing device is used for directly connecting the high-probability normal website and adding the high-probability normal website to the white list if the secondary identification result is the high-probability normal website;

and the suspected abnormal website processing device is used for generating general risk prompt information if the secondary identification result is the suspected abnormal website, tracking and identifying the suspected abnormal website, secondarily connecting the suspected abnormal website, counting the secondary connection times, and adding the suspected abnormal website to a grey list.

Optionally, the apparatus further comprises:

the iterative computation device is used for performing iterative computation and identification on the blacklist, the white list and the grey list according to information such as each piece of user feedback, the webpage content feature similarity values updated from periodically crawled webpage content, and the periodically updated website secondary visit quantity;

the high-probability abnormal website iteration device is used for adding the high-probability abnormal website into the blacklist if the identification result is the high-probability abnormal website;

the high-probability normal website iteration device is used for adding the high-probability normal website into the white list if the identification result is the high-probability normal website;

and the suspected abnormal website iteration device is used for continuing to remain in the grey list to wait for the next iteration calculation for identification if the identification result is neither the high-probability abnormal website nor the high-probability normal website.

Optionally, the device for calculating the domain name similarity weight specifically includes:

the white list website establishing device is used for establishing a white list website domain name library;

a comparison device, configured to compare the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library, and determine whether any of the following exists, so as to obtain a determination result: common spelling errors, vowel character substitution, homophone (same sound, different spelling) substitution, wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, look-alike (homoglyph) characters, omission or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters;

and the processing device is used for calculating similarity score values of the domain name of the URL corresponding to the URL request and the domain name in the white list website domain name library according to the judgment result, and acquiring the maximum value in the score values as the domain name similarity weight of the URL corresponding to the URL request.

In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform any of the corresponding methods described above.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium storing a computer program, the computer program causing the computer to perform any of the corresponding methods described above.

According to the website identification method and device provided by the embodiments of the invention, abnormal websites are comprehensively analyzed and identified across multiple dimensions, namely domain name similarity, webpage content similarity, user report information and website secondary visit quantity. On this basis, an abnormal website identification algorithm model based on multi-dimensional comprehensive judgment is established to classify websites and realize hierarchical access control.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart illustrating a website identification method according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for identifying websites in accordance with an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an identification apparatus for websites according to an embodiment of the present invention;

FIG. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a method for identifying a website, and fig. 1 is a schematic flow chart of the method for identifying a website in the embodiment of the present invention, and as shown in fig. 1, the method includes:

Step S101, receiving a Uniform Resource Locator (URL) request of a user for accessing a website, searching a white list for the URL corresponding to the URL request, and if the URL corresponding to the URL request is found in the white list, connecting to the URL corresponding to the URL request;

The white list is the counterpart of the blacklist. In computer systems, black and white list rules are applied in a great deal of software: operating systems, firewalls, antivirus software, mail systems, application software and so on; they appear in almost every aspect that involves access control. Once a white list is enabled, the users (or IP addresses, IP packets, mails and the like) in the white list are passed with priority and are not rejected as spam, which greatly improves both security and speed. By extension, any application that has a blacklist function generally has a corresponding white list function.

The URL request is the request, sent by the user accessing a website, for the URL that the user currently wants to connect to.

Step S102, searching a blacklist for the URL corresponding to the URL request, and if the URL corresponding to the URL request is found in the blacklist, generating high-risk prompt information;

The blacklist means that, once it is enabled, the users (or IP addresses, IP packets, mails, viruses and the like) listed in it are not allowed to pass.

Step S103, if the URL corresponding to the URL request is not found in the white list and the black list, calculating each characteristic weight value of the URL corresponding to the URL request according to a preset rule, and identifying whether the URL corresponding to the URL request is an abnormal website or not according to each characteristic weight value.
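Steps S101 to S103 can be sketched as a three-stage lookup. The sketch below is only an illustration under stated assumptions: the function name `identify`, the returned action strings and the `score_features` callback (standing in for the feature weight calculation) are hypothetical and not specified in the source.

```python
from typing import Callable


def identify(url: str,
             white_list: set[str],
             blacklist: set[str],
             score_features: Callable[[str], str]) -> str:
    """Apply the white list, then the blacklist, then feature-based scoring."""
    if url in white_list:          # S101: known-good URL, connect directly
        return "connect"
    if url in blacklist:           # S102: known-bad URL, warn the user
        return "high-risk prompt"
    # S103: unknown URL, fall through to the feature weight calculation
    return score_features(url)
```

For example, `identify(u, {"a"}, {"b"}, lambda _: "suspected abnormal")` returns `"connect"` for `u == "a"` and delegates to the callback for any URL found in neither list.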

Weight is a relative concept: the weight of an index is its relative importance in the overall evaluation. Weights are assigned by distinguishing the importance of the individual evaluation indexes, and the weights corresponding to a set of evaluation indexes form a weight system.

An abnormal website is, for example, a phishing website that aims to steal the user's identity. In such a scheme, the fraudster tries to win the user's trust under false pretenses so that the user reveals valuable personal data, such as credit card numbers, passwords, account data or other information. Abnormal websites also include pornographic websites, Trojan or virus download links and other such sites. These schemes may be carried out online, by telephone or short message, or through spam mail or pop-up windows.

In the abnormal website identification method provided by the embodiment of the invention, the feature weight values corresponding to the website are calculated through the preset rule, and the probability that the website is an abnormal website is obtained from these feature weight values. On the basis of the above method embodiment, calculating each feature weight value of the URL corresponding to the URL request according to a preset rule specifically includes calculating the feature weight values of four dimensions: the domain name similarity weight, the webpage content similarity weight, the user reporting amount weight and the secondary visit amount weight.

The method for calculating the domain name similarity weight comprises the following steps:

firstly, establishing a domain name library of common white list websites, including the websites of common operators, banks, e-commerce companies, and public security, procuratorial and court authorities;

secondly, comparing the domain name of the URL to be detected with the domain names in the white list one by one, and judging whether any of the following exists: common spelling errors, vowel character substitution, homophone (same sound, different spelling) substitution, wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, look-alike (homoglyph) characters, omission or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, insertion or deletion of separator characters, and the like;

and thirdly, calculating, according to the result of the second step, the similarity score between the domain name and each domain name in the white list, and taking the maximum value as the domain name similarity score of the domain name.
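The "score against every white list domain, keep the maximum" structure of the third step can be sketched as below. The source does not give a scoring formula, so this sketch swaps in normalized Levenshtein edit distance as a stand-in for the individual checks; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def domain_similarity_weight(domain: str, white_list_domains: list[str]) -> float:
    """Highest similarity between the domain and any white list domain."""
    return max((similarity(domain.lower(), w.lower())
                for w in white_list_domains), default=0.0)
```

With this stand-in, `domain_similarity_weight("1oo86.cn", ["10086.cn", "ccb.com"])` is driven by the closest entry, `10086.cn`, which differs by two substituted characters out of eight.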

The method for calculating the similarity weight of the webpage content comprises the following steps:

firstly, establishing a webpage content feature library for common white list websites, such as www.10086.cn and www.ccb.com, wherein the features include titles, keywords, pictures and the like;

secondly, crawling the web page content characteristics of the suspected abnormal website by a crawler technology;

A crawler, or web crawler (also called a web spider or web robot, and in the FOAF community more often called a web chaser), is a program or script that automatically captures web information according to certain rules. Other less commonly used names are ant, automatic indexer, emulator and worm.

A web crawler is a program that automatically extracts web pages; it downloads pages from the World Wide Web for a search engine and is an important component of the search engine. A traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs on those pages and, while capturing pages, continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met. The workflow of a focused crawler is more complex: it filters out links irrelevant to the topic according to a webpage analysis algorithm, keeps the useful links and puts them into the queue of URLs to be captured. It then selects the next page URL from the queue according to a search strategy and repeats the process until a stop condition of the system is reached. In addition, all pages captured by the crawler are stored by the system, analyzed, filtered and indexed for later query and retrieval; for a focused crawler, the analysis results may also provide feedback and guidance for subsequent capturing.

and thirdly, retrieving from the white list feature library the feature information of the white list domain name that the domain name similarity algorithm found to be most similar to the suspected abnormal website, and calculating the webpage content feature similarity of the suspected abnormal website against it.
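One way to compare crawled page features with the stored white list features is sketched below. The source names the features (titles, keywords, pictures) but not the similarity measure, so the use of Jaccard similarity averaged over the three fields, the field names and the function names are all assumptions made for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def content_similarity(page: dict[str, set[str]],
                       reference: dict[str, set[str]]) -> float:
    """Average Jaccard similarity over title terms, keywords and image identifiers."""
    fields = ("title", "keywords", "images")
    scores = [jaccard(page.get(f, set()), reference.get(f, set())) for f in fields]
    return sum(scores) / len(fields)
```

Here `page` would hold the features crawled from the suspected website and `reference` the features of the most similar white list site; image identifiers could be, for instance, file hashes.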

The method for calculating the weight of the user report volume comprises the following steps:

firstly, counting the number of times the website has been reported as an abnormal website or added to a blacklist by users;

secondly, counting the number of times the website has been reported as a normal website by users or has been the subject of a user appeal requesting that it be white-listed;

and thirdly, calculating the user report information feature score of the website from these statistics.
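The source does not state how the two counts are combined into a score. A minimal sketch, assuming the score is simply the share of reports that flag the website as abnormal, with a neutral value when there are no reports (both the formula and the function name are hypothetical):

```python
def user_report_score(abnormal_reports: int, normal_reports: int) -> float:
    """Fraction of user reports that flag the site as abnormal.

    Returns 0.5 (neutral) when no reports exist; this default is an assumption.
    """
    total = abnormal_reports + normal_reports
    if total == 0:
        return 0.5
    return abnormal_reports / total
```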

The calculation method of the secondary visit volume weight comprises the following steps:

counting the number and proportion of secondary visits to the website after users have been prompted about its risk, and calculating the secondary visit amount feature score of the website.
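The proportion mentioned above can be sketched as the share of risk prompts that were followed by a second visit. How this proportion is turned into a feature score is not specified in the source, so the function below, including its name and its handling of the no-prompt case, is only an assumption.

```python
def secondary_visit_ratio(prompts_shown: int, secondary_visits: int) -> float:
    """Share of risk prompts after which the user visited the site again."""
    if prompts_shown <= 0:
        return 0.0  # no prompts yet, so no evidence either way (assumed default)
    # Cap at 1.0 in case one prompt is followed by several revisits.
    return min(secondary_visits, prompts_shown) / prompts_shown
```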

The abnormal website identification algorithm model combines the URL's domain name similarity score, webpage content similarity score, user report information feature score and website secondary visit amount feature score, makes a decision according to their different weights, and finally judges whether the URL is a counterfeit URL.
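A weighted decision of this kind can be sketched as a weighted sum compared against two thresholds. The weights, the thresholds, the dictionary keys and the assumption that every score lies in [0, 1] with higher values indicating a more suspicious website are all hypothetical; the source does not disclose concrete values.

```python
# Hypothetical weights for the four dimensions; they sum to 1.0.
WEIGHTS = {"domain": 0.4, "content": 0.3, "reports": 0.2, "revisits": 0.1}


def classify(scores: dict[str, float],
             weights: dict[str, float] = WEIGHTS,
             high: float = 0.7,
             low: float = 0.3) -> str:
    """Combine per-dimension scores and map the total to one of three classes."""
    total = sum(weights[name] * scores[name] for name in weights)
    if total >= high:
        return "high-probability abnormal"
    if total <= low:
        return "high-probability normal"
    return "suspected abnormal"
```

Using two thresholds rather than one leaves a middle band for websites whose evidence is inconclusive, which matches the three-way classification used in the rest of the description.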

On the basis of the embodiment of the method, the abnormal website specifically comprises:

The high-probability abnormal website is a website that, according to the feature weight values calculated under the preset rule, is very likely to be an abnormal or dangerous website.

The high-probability normal website is a website that, according to the feature weight values calculated under the preset rule, is very unlikely to be an abnormal or dangerous website.

The suspected abnormal website is a website for which the feature weight values calculated under the preset rule do not settle whether it is abnormal, so that further calculation and investigation are needed.

On the basis of the above method embodiment, the method further comprises:

performing secondary identification, namely: tracking and marking the websites whose first identification result is high-probability abnormal or suspected abnormal, allowing the access and counting the secondary visit amount when a user visits again, iteratively recalculating the feature weight values of these two kinds of website according to the preset rule, and obtaining a new identification result.

On the basis of the above method embodiment, the method further comprises:

According to the website identification method provided by the embodiment of the invention, a three-level access control and user feedback mechanism is established through a grey list of suspected abnormal websites, a white list of high-probability normal websites and a blacklist of high-probability abnormal websites (drawing on a pseudo base station URL library, a mobile phone malware link library and a URL blacklist database collected by customer service). The URLs in the three lists are continuously recalculated and updated according to information such as user feedback and website secondary visit amounts, which effectively reduces the system's misjudgment rate and improves the user experience.

On the basis of the above method embodiment, the method for calculating the domain name similarity weight includes:

establishing a white list website domain name library;

According to the website identification method provided by the embodiment of the invention, a domain name similarity analysis algorithm comprehensively analyzes the domain name from 16 angles, such as common spelling errors, vowel character substitution, homophone (same sound, different spelling) substitution, wrong top-level domain name substitution, wrong second-level domain name substitution, singular/plural conversion, look-alike (homoglyph) characters, omission or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters, so the identification accuracy is high.
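A few of the listed angles can be expressed as simple string predicates. The sketch below covers only four of them (adjacent character transposition, character omission, character repetition and wrong top-level domain); the function names are hypothetical and the remaining angles are not shown.

```python
def is_adjacent_swap(candidate: str, reference: str) -> bool:
    """True if candidate is reference with one pair of adjacent characters exchanged."""
    if len(candidate) != len(reference):
        return False
    diff = [i for i, (c, r) in enumerate(zip(candidate, reference)) if c != r]
    return (len(diff) == 2 and diff[1] == diff[0] + 1
            and candidate[diff[0]] == reference[diff[1]]
            and candidate[diff[1]] == reference[diff[0]])


def is_char_omitted(candidate: str, reference: str) -> bool:
    """True if candidate is reference with exactly one character removed."""
    return any(reference[:i] + reference[i + 1:] == candidate
               for i in range(len(reference)))


def is_char_repeated(candidate: str, reference: str) -> bool:
    """True if candidate is reference with exactly one character doubled."""
    return any(reference[:i + 1] + reference[i:] == candidate
               for i in range(len(reference)))


def is_wrong_tld(candidate: str, reference: str) -> bool:
    """True if only the part after the last dot differs."""
    c_name, _, c_tld = candidate.rpartition(".")
    r_name, _, r_tld = reference.rpartition(".")
    return c_name != "" and c_name == r_name and c_tld != r_tld
```

For instance, `10806.cn`, `1086.cn`, `100086.cn` and `10086.com` each trigger exactly one of these predicates when compared with `10086.cn`.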

The embodiment of the invention has the following specific implementation modes:

FIG. 2 is a flowchart of another website identification method in an embodiment of the present invention. As shown in FIG. 2, the method specifically includes:

Step one: white list filtering is first performed on the URL request submitted by the user; if the white list is hit, the website is judged to be a normal website and the request is passed through directly;

Step two: if the white list is not hit, blacklist filtering is performed on the domain name (the blacklist library mainly comprises a pseudo base station URL library, a mobile phone malicious software link library, a URL blacklist database collected by customer service, and a library of trojan-infected websites obtained by crawling a trojan reporting platform); if the blacklist is hit, a high-risk prompt is given to the user;

Step three: if the blacklist is not hit, feature values of four dimensions are calculated, namely the domain name similarity weight, the webpage content similarity weight, the user report volume weight and the website secondary visit volume weight; the website is then classified by the abnormal website identification algorithm model as a high-probability abnormal website, a suspected abnormal website or a high-probability normal website;

Step four: for a high-probability abnormal website, a high-risk prompt is given to the user, the website is tracked and identified so that the user is passed through on a secondary visit and the secondary visit volume is counted, and the website is stored in the high-probability abnormal website blacklist library; for a high-probability normal website, the website is judged to be normal, passed through directly, and added to the high-probability normal website white list library; for a suspected abnormal website, a general risk prompt is given to the user, the website is added to the suspected abnormal website grey list library, and the website is tracked and identified so that the user is passed through on a secondary visit and the secondary visit volume is counted;

Step five: the high-probability abnormal website blacklist library, the suspected abnormal website grey list library and the high-probability normal website white list library are iteratively recalculated and re-identified according to information such as each item of user feedback, webpage content feature similarity values updated from periodically crawled webpage content, and the periodically updated website secondary visit volume; a website whose identification result is high-probability abnormal is stored in the high-probability abnormal website blacklist library; a website whose identification result is high-probability normal is stored in the high-probability normal website white list library; all other websites remain in the suspected abnormal website grey list library to await the next iterative calculation.
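The lookup order of steps one to three can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `Verdict`, `identify` and `classify` are hypothetical, the lists are modelled as plain in-memory sets, and the classifier for step three is passed in as a function because its details are outside this sketch.

```python
from enum import Enum
from typing import Callable, Set


class Verdict(Enum):
    NORMAL = "normal"                                  # white list hit: pass through
    HIGH_RISK = "high_risk"                            # blacklist hit: high-risk prompt
    HIGH_PROB_ABNORMAL = "high_probability_abnormal"   # classifier outcomes (step three)
    SUSPECTED_ABNORMAL = "suspected_abnormal"
    HIGH_PROB_NORMAL = "high_probability_normal"


def identify(url: str,
             whitelist: Set[str],
             blacklist: Set[str],
             classify: Callable[[str], Verdict]) -> Verdict:
    """Check the white list first, then the blacklist, then fall back to the classifier."""
    if url in whitelist:
        return Verdict.NORMAL
    if url in blacklist:
        return Verdict.HIGH_RISK
    # Neither list was hit: hand over to the (assumed) multi-dimensional classifier.
    return classify(url)
```

Because the white list is consulted before the blacklist, a URL present in both lists would be passed through in this sketch; a real deployment would need to decide how such conflicts are resolved.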

According to the website identification method provided by the embodiment of the invention, abnormal websites are comprehensively analyzed and identified from multiple dimensions, namely domain name similarity, webpage content similarity, user report information and website secondary visit volume, and on this basis an abnormal website identification algorithm model for multi-dimensional comprehensive analysis and judgment is established to classify websites and thereby realize hierarchical access control.

An embodiment of the present invention provides an apparatus for identifying a website. Fig. 3 is a schematic structural diagram of the apparatus in an embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a white list processing device 301, a blacklist processing device 302 and an abnormal website processing device 303; wherein,

the white list processing device 301 is configured to receive a uniform resource locator (URL) request from a user accessing a website, search the white list for the URL corresponding to the URL request, and connect to that URL if it is found in the white list; the blacklist processing device 302 is configured to search the blacklist for the URL corresponding to the URL request, and generate high-risk prompt information if that URL is found in the blacklist; the abnormal website processing device 303 is configured to, if the URL corresponding to the URL request is found in neither the white list nor the blacklist, calculate each feature weight value of that URL according to a preset rule, and identify whether that URL is an abnormal website according to the feature weight values.

According to the website identification device provided by the embodiment of the invention, the abnormal website processing device calculates the feature weight values of the website according to the preset rule, and the probability that the website is an abnormal website is judged from the calculated feature weight values.
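How the feature weight values could be turned into a three-way judgment can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the passage does not give the combination rule, so the sketch takes each weight as already normalized to [0, 1] with higher meaning more suspicious, combines them by an unweighted mean, and uses hypothetical thresholds `high` and `low`.

```python
from dataclasses import dataclass


@dataclass
class FeatureWeights:
    # Assumed to be normalized to [0, 1]; higher means more suspicious (assumption).
    domain_similarity: float
    content_similarity: float
    user_reports: float
    secondary_visits: float


def classify(w: FeatureWeights, high: float = 0.7, low: float = 0.3) -> str:
    """Map four feature weights to one of the three classes named in the text.

    The unweighted mean and the thresholds are placeholders, not the
    identification model described by the patent.
    """
    score = (w.domain_similarity + w.content_similarity
             + w.user_reports + w.secondary_visits) / 4.0
    if score >= high:
        return "high_probability_abnormal"
    if score <= low:
        return "high_probability_normal"
    return "suspected_abnormal"
```

Keeping the thresholds as parameters makes it easy to trade false alarms against missed detections when such a rule is tuned on real data.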

On the basis of the above method embodiment, the calculating, according to a preset rule, each feature weight value of the URL corresponding to the URL request specifically includes:

Optionally, the apparatus further comprises: a secondary identification device, a high-probability abnormal website processing device, a high-probability normal website processing device and a suspected abnormal website processing device; wherein,

the secondary identification device is configured to perform secondary identification on the URL corresponding to the URL request if that URL is an abnormal website; the high-probability abnormal website processing device is configured to, if the secondary identification result is a high-probability abnormal website, generate high-risk prompt information, track and identify the high-probability abnormal website, connect to it on a secondary visit, count the number of secondary connections, and add it to the blacklist; the high-probability normal website processing device is configured to, if the secondary identification result is a high-probability normal website, connect to it directly and add it to the white list; and the suspected abnormal website processing device is configured to, if the secondary identification result is a suspected abnormal website, generate general risk prompt information, track and identify the suspected abnormal website, connect to the suspected abnormal website on a secondary visit, count the number of secondary connections, and add it to a grey list.
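The tiered handling performed by these devices can be sketched as below. The sketch is hypothetical throughout: `Lists`, `handle` and `record_secondary_visit` are illustrative names, the three lists are in-memory sets, and the prompt is represented by a short string rather than a real user interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Lists:
    whitelist: Set[str] = field(default_factory=set)
    blacklist: Set[str] = field(default_factory=set)
    greylist: Set[str] = field(default_factory=set)
    # Tracked URLs mapped to their secondary connection counts.
    secondary_visits: Dict[str, int] = field(default_factory=dict)


def handle(url: str, label: str, lists: Lists) -> Optional[str]:
    """Apply the action for a secondary identification result.

    Returns the risk prompt text, or None when the site is connected directly.
    """
    if label == "high_probability_normal":
        lists.whitelist.add(url)
        return None
    if label == "high_probability_abnormal":
        lists.blacklist.add(url)
        prompt = "high risk"
    elif label == "suspected_abnormal":
        lists.greylist.add(url)
        prompt = "general risk"
    else:
        raise ValueError(f"unknown label: {label}")
    # Both risky classes are tracked so later secondary visits can be counted.
    lists.secondary_visits.setdefault(url, 0)
    return prompt


def record_secondary_visit(url: str, lists: Lists) -> int:
    """Count a secondary connection to a tracked URL and return the new total."""
    if url not in lists.secondary_visits:
        raise KeyError(f"{url} is not tracked")
    lists.secondary_visits[url] += 1
    return lists.secondary_visits[url]
```

The counter kept here is the kind of value that a later iterative recalculation could read as the website secondary visit volume.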

On the basis of the above method embodiment, the apparatus further includes: an iterative computation device, a high-probability abnormal website iteration device, a high-probability normal website iteration device and a suspected abnormal website iteration device; wherein,

the iterative computation device is configured to perform iterative calculation and identification on the blacklist, the white list and the grey list according to each item of user feedback, webpage content feature similarity values updated from periodically crawled webpage content, and the periodically updated website secondary visit volume; the high-probability abnormal website iteration device is configured to add the website to the blacklist if the identification result is a high-probability abnormal website; the high-probability normal website iteration device is configured to add the website to the white list if the identification result is a high-probability normal website; and the suspected abnormal website iteration device is configured to keep the website in the grey list to await the next iterative calculation if the identification result is neither a high-probability abnormal website nor a high-probability normal website.
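One pass of this iterative recalculation might look like the following sketch. It rests on stated assumptions: `reclassify` is a hypothetical callback standing in for the re-identification based on refreshed feedback, content similarity and secondary visit volume, and any URL that is neither high-probability abnormal nor high-probability normal is placed in the grey list, including URLs that previously sat in another list.

```python
from typing import Callable, Set, Tuple


def iterate_lists(blacklist: Set[str],
                  whitelist: Set[str],
                  greylist: Set[str],
                  reclassify: Callable[[str], str]
                  ) -> Tuple[Set[str], Set[str], Set[str]]:
    """Re-identify every listed URL once and rebuild the three lists."""
    new_black: Set[str] = set()
    new_white: Set[str] = set()
    new_grey: Set[str] = set()
    for url in blacklist | whitelist | greylist:
        label = reclassify(url)
        if label == "high_probability_abnormal":
            new_black.add(url)
        elif label == "high_probability_normal":
            new_white.add(url)
        else:
            # Undecided URLs wait in the grey list for the next iteration.
            new_grey.add(url)
    return new_black, new_white, new_grey
```

Returning fresh sets instead of mutating the inputs keeps one pass atomic, so a lookup running at the same time never sees a half-updated list.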

According to the website identification device provided by the embodiment of the invention, a three-level access control and user feedback mechanism is established through the iterative computation device operating on the suspected abnormal website grey list, the high-probability normal website white list and the high-probability abnormal website blacklist (a pseudo base station URL library, a mobile phone malicious software link library, and a URL blacklist database collected by customer service). The URLs in these three lists are continuously recalculated and updated according to information such as user feedback and website secondary visit volume, so that the misjudgment rate of the system is effectively reduced and the user experience is improved.

On the basis of the above method embodiment, the device for calculating the domain name similarity weight includes: a white list website establishing device, a comparison device and a processing device; wherein,

the white list website establishing device is configured to establish a white list website domain name library; the comparison device is configured to compare the domain name of the URL corresponding to the URL request with the domain names in the white list website domain name library and judge whether any of the following exists, to obtain a judgment result: common spelling errors, vowel character substitution, homophone substitution (same sound, different spelling), wrong top-level domain substitution, wrong second-level domain substitution, singular/plural conversion, homoglyphs (visually similar characters), deletion or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters; and the processing device is configured to calculate, according to the judgment result, similarity score values between the domain name of the URL corresponding to the URL request and the domain names in the white list website domain name library, and take the maximum of the score values as the domain name similarity weight of that URL.
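The final "take the maximum score" step can be illustrated with a short sketch. Note the substitution: the per-pair score below uses the generic `difflib.SequenceMatcher` ratio from the Python standard library as a stand-in, not the rule-based judgment described above, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable


def similarity(a: str, b: str) -> float:
    """Stand-in pairwise score in [0, 1]; not the patent's rule-based score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def domain_similarity_weight(domain: str, whitelist_domains: Iterable[str]) -> float:
    """Return the highest similarity between `domain` and any white list domain."""
    return max((similarity(domain, w) for w in whitelist_domains), default=0.0)
```

With this stand-in, a look-alike such as `examp1e.com` scores much closer to `example.com` than an unrelated domain does, which is the behaviour the maximum is meant to capture.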

According to the website identification device provided by the embodiment of the invention, the domain name similarity analysis and calculation device comprehensively analyzes the domain name from 16 angles, such as common spelling errors, vowel character substitution, homophone substitution, wrong top-level domain substitution, wrong second-level domain substitution, singular/plural conversion, homoglyphs, deletion or repetition of a character, transposition of adjacent characters, substitution or insertion of keyboard-adjacent characters, and insertion or deletion of separator characters, so that the identification accuracy is high.
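Three of the listed angles (deletion of a character, repetition of a character, and transposition of adjacent characters) are simple enough to sketch directly. The function names are hypothetical and the checks are minimal illustrations; the patent's own judgment rules and the remaining angles are not reproduced here.

```python
def is_adjacent_swap(candidate: str, reference: str) -> bool:
    """True if candidate equals reference with two adjacent characters exchanged."""
    if len(candidate) != len(reference):
        return False
    diff = [i for i, (c, r) in enumerate(zip(candidate, reference)) if c != r]
    return (len(diff) == 2
            and diff[1] == diff[0] + 1
            and candidate[diff[0]] == reference[diff[1]]
            and candidate[diff[1]] == reference[diff[0]])


def is_single_deletion(candidate: str, reference: str) -> bool:
    """True if candidate equals reference with exactly one character removed."""
    if len(candidate) + 1 != len(reference):
        return False
    return any(reference[:i] + reference[i + 1:] == candidate
               for i in range(len(reference)))


def is_single_repetition(candidate: str, reference: str) -> bool:
    """True if candidate equals reference with exactly one character doubled."""
    if len(candidate) != len(reference) + 1:
        return False
    return any(reference[:i + 1] + reference[i] + reference[i + 1:] == candidate
               for i in range(len(reference)))
```

Each check compares a requested domain against one white list domain; a full comparison device would run every check against every domain in the library and feed the results into the similarity score.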

The website identification device provided in the embodiment of the present invention is used for implementing the website identification method provided in the embodiment of the present invention, and the specific implementation manner has been specifically stated in the above method embodiment, and is not described herein again.

According to the website identification device provided by the embodiment of the invention, abnormal websites are comprehensively analyzed and identified from multiple dimensions, namely domain name similarity, webpage content similarity, user report information and website secondary visit volume, and on this basis an abnormal website identification algorithm model for multi-dimensional comprehensive analysis and judgment is established to classify websites and thereby realize hierarchical access control.

Fig. 4 is a logic block diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device includes: a processor 401, a memory 402, and a bus 403;

wherein, the processor 401 and the memory 402 complete the communication with each other through the bus 403; the processor 401 is configured to call program instructions in the memory 402 to perform the methods provided by the above-described method embodiments.

The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the method embodiments described above.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for identifying a website, the method comprising:

2. The method according to claim 1, wherein the calculating each feature weight value of the URL corresponding to the URL request according to a preset rule specifically includes:

3. The method according to claim 1, wherein the abnormal website specifically comprises:

4. The method of claim 3, further comprising:

5. The method of claim 4, further comprising:

6. The method according to claim 2, wherein the method for calculating the domain name similarity weight comprises:

establishing a white list website domain name library;

7. An apparatus for identifying a website, the apparatus comprising:

8. An electronic device, comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 6.

9. A non-transitory computer-readable storage medium storing a computer program that causes a computer to perform the method according to any one of claims 1 to 6.