Embodiment
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Embodiments of the invention can be applied to a computer system/server, which is operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
The computer system/server may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and so on, which perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communications network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media, including storage devices.
A large number of new files appear on the Internet every day, most of which are new software and upgrade patch packages. These new software and upgrade patch packages can be collected as files in a whitelist database on the server side. To include them in the whitelist database in a timely manner, the publication channels of the software must first be identified, usually by checking the official websites of the software, and these official websites are then monitored.
The whitelist database on the server side may also collect updates to the whitelist of legitimate programs, which may be implemented in the following ways.
First way: technicians periodically collect legitimate programs manually, by means of spiders or web crawlers, and/or through user uploads; the program features and/or program behaviors of the legitimate programs are screened manually or automatically by tools and saved in the whitelist.
Second way: unknown program features and program behaviors are analyzed according to the legitimate program features and corresponding program behaviors already in the known whitelist, so as to update the whitelist.
The system for identifying trusted websites according to the embodiment of the invention obtains the download logs of downloaded files, analyzes the download logs, extracts the current site from the download logs, confirms official websites among the current sites, and finally filters out plug-in websites and/or private-server websites from the official websites. By analyzing the download logs of software, more accurate download information can be obtained.
Fig. 1 schematically shows a flowchart of a method for identifying a trusted website according to an embodiment of the invention. As shown in Fig. 1, in this embodiment the identification process may comprise the following steps:
Step S11: extract the download log of the current site within a set time period.
When a client device on the Internet downloads software from a download site, the download behavior of the client device can be collected and recorded as a download log of the software. The download log records download information such as the download path of the software and the site from which the software was downloaded; from this information, the specific circumstances of the download can be obtained.
For example, if the download log contains the site information of two software downloads, http://www.baidu.com/xxxx and http://www.baidu.com/yyyy, the candidate site identifier www.baidu.com can be extracted from them. The site identifier may of course be extracted in other ways; the invention is not limited in this respect. The current site may be a download website, a forum website, or the like.
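A minimal sketch of one way to extract a candidate site identifier from download URLs of the kind shown above, assuming the host name of the URL serves as the identifier. The function name `site_identifier` is hypothetical and not part of the described method.

```python
from urllib.parse import urlsplit


def site_identifier(download_url: str) -> str:
    """Return the host name of a download URL (assumed to be the site identifier)."""
    host = urlsplit(download_url).hostname  # hostname is lower-cased, port stripped
    if host is None:
        raise ValueError(f"no host in URL: {download_url!r}")
    return host


# Two download URLs that share one site yield a single candidate identifier.
urls = ["http://www.baidu.com/xxxx", "http://www.baidu.com/yyyy"]
candidates = {site_identifier(u) for u in urls}
print(candidates)
```

Using the host name is only one option; a deployment might instead group by registered domain or by a path prefix for sites that host many unrelated publishers.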
The download log generally contains the following information: the signature of the software downloaded by the client device, the path from which the client device downloaded the software, the site information of the software download, and the file name of the downloaded software. The download log may of course contain other information, such as the download time of the software; the embodiment of the invention is not limited in this respect. For example, the download log may also contain the user ID, the hash value of the downloaded file, the parent page of the downloaded file, and the URL (Uniform Resource Locator) of the page from which the user downloaded the file. The hash value of the downloaded file is used to uniquely identify the file and may be, for example, an MD5 value. If the downloaded file is a compressed package, the download log also contains the MD5 values of the files inside the package.
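The log fields listed above could be represented as a simple record type. The following is a sketch only: the class `DownloadRecord`, its field names, and the helper `md5_of_bytes` are hypothetical, and the source does not prescribe any particular storage layout.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DownloadRecord:
    """One download-log entry; field names are illustrative assumptions."""
    user_id: str
    file_md5: str          # hash value identifying the downloaded file
    site: str              # site the file was downloaded from
    download_path: str
    file_name: str
    timestamp: float       # download time, seconds since the epoch
    parent_page: Optional[str] = None


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest used here as the file identifier."""
    return hashlib.md5(data).hexdigest()


record = DownloadRecord(
    user_id="u1",
    file_md5=md5_of_bytes(b"hello"),
    site="www.baidu.com",
    download_path="/xxxx",
    file_name="setup.exe",
    timestamp=0.0,
)
```

MD5 is used here only because the text names it as the file identifier; it is not collision resistant and would not be a safe choice where an adversary controls file contents.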
Step S12: according to the user IDs and download file identifiers in the download log extracted in step S11, count the number of samples and the number of users for the download operations performed through the download links of the current site within the set time period.
Step S13: obtain the confidence of the current site according to the number of samples and the number of users counted in step S12.
In general, the number of distinct files downloaded from an official website within a set time period is relatively small, because the files offered on an official website are updated slowly and exist in relatively few versions. Even if the files that any one person downloads from a website are fairly random, when many clients all download the same file from that website within the set time period, the file can be judged to be relatively trustworthy, and the website providing the file can be judged to be an official website.
From the above, suppose that m users download n kinds of samples from a website within a period of time: the smaller n is and the larger m is, the more credible the website. On this basis, one way of obtaining the confidence of the current site is to make it inversely proportional to the number of samples and directly proportional to the number of users, both obtained in step S12.
In the embodiment of the invention, the confidence may be calculated by the following formula (1):
W = m/n        (1)
In formula (1), W is the confidence of the current site, m is the number of users who performed download operations through the download links of the current site within the set time period, and n is the number of samples downloaded through those links within the set time period.
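As an illustration of formula (1), the sketch below counts distinct users (m) and distinct samples (n) from (user ID, file ID) pairs and returns W = m/n. The function name `confidence` and the pair representation are assumptions made for this example.

```python
from typing import Iterable, Tuple


def confidence(records: Iterable[Tuple[str, str]]) -> Tuple[int, int, float]:
    """Return (m, n, W) for (user_id, file_id) pairs from one time period."""
    users, samples = set(), set()
    for user_id, file_id in records:
        users.add(user_id)
        samples.add(file_id)
    m, n = len(users), len(samples)
    if n == 0:
        raise ValueError("no downloads in the time period; W is undefined")
    return m, n, m / n


# Three users downloaded two distinct files, so W = 3 / 2.
log = [("u1", "f1"), ("u2", "f1"), ("u3", "f1"), ("u3", "f2")]
m, n, w = confidence(log)
print(m, n, w)
```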
Step S14: identify whether the current site is an official website according to the confidence of the current site obtained in step S13 and the number of samples of the current site counted in step S12.
Suppose the confidence is calculated by formula (1) above. If the value of n is less than a preset sample number threshold and the value of W is greater than a preset confidence threshold, the current site can be judged to be an official website.
The sample number threshold and the confidence threshold can be obtained empirically. For example, with a sample number threshold ≥ 6 and a confidence threshold ≥ 1.5, 85% of the identified download links (precision) are official-website download links, and they account for 75% of all official download websites (recall). Lowering the sample number threshold reduces precision and raises recall; conversely, raising the sample number threshold improves precision and reduces recall. Raising the confidence threshold improves precision and reduces recall.
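The precision and recall figures mentioned above can be computed from a set of sites judged official and a set of sites known to be official. The sketch below uses the standard definitions; the function name and the toy site names are hypothetical and do not reproduce any measurement from the source.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision = TP / |predicted|, recall = TP / |actual|."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Toy data: 4 sites judged official, 6 sites actually official, 3 in common.
predicted_sites = {"a", "b", "c", "d"}
actual_sites = {"a", "b", "c", "e", "f", "g"}
p, r = precision_recall(predicted_sites, actual_sites)
print(p, r)
```

Sweeping a threshold and recomputing these two numbers on a labelled set is one way the empirical threshold selection described above could be carried out.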
In other embodiments of the invention, if step S14 determines that the current site is an official website, download links may further be crawled from this official website, and the crawled download links may be saved to the whitelist. The crawling may be performed by various web crawler services and/or website monitoring services.
The official websites identified in step S14 may still include third-party websites such as plug-in websites and private-server websites. Given the particular nature of samples from plug-in websites and private-server websites, these websites need to be handled separately. Therefore, optionally, after step S14, plug-in websites and private-server websites may be excluded from the identified official websites to determine the trusted websites. If the current site is determined to be a trusted website, download links may further be crawled from it, and the crawled download links may be saved to the whitelist.
The removal of plug-in websites and private-server websites may be performed with a Bayes classifier. In the embodiment of the invention, a Bayesian text classifier computes feature statistics over the text in a web page and calculates the probability that a given web page belongs to the official website of a plug-in; if this probability is greater than a set probability threshold, the page is considered to belong to the official website of a plug-in.
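The source does not specify the form of the Bayesian text classifier. As one possible reading, the sketch below implements a multinomial naive Bayes classifier with add-one smoothing over whitespace-separated tokens. The class name, the label names, and the toy training pages are assumptions; real Chinese-language pages would need a proper word segmenter instead of `split()`.

```python
import math
from collections import Counter
from typing import Dict, List


class NaiveBayesText:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {}
        self.doc_counts: Counter = Counter()
        self.vocab: set = set()

    def fit(self, docs: List[str], labels: List[str]) -> None:
        for text, label in zip(docs, labels):
            tokens = text.lower().split()
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def predict_proba(self, text: str) -> Dict[str, float]:
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for tok in text.lower().split():
                score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
            log_scores[label] = score
        peak = max(log_scores.values())  # subtract the peak for numerical stability
        exp = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(exp.values())
        return {k: v / norm for k, v in exp.items()}


clf = NaiveBayesText()
clf.fit(
    ["aimbot cheat hack download", "cheat trainer hack",
     "official release notes download", "company product support"],
    ["plugin", "plugin", "other", "other"],
)
probs = clf.predict_proba("cheat hack")
is_plugin_site = probs["plugin"] > 0.5  # 0.5 is a placeholder probability threshold
```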
In addition to removing plug-in websites, private-server websites may also be removed. A specific method may be as follows:
First, obtain reference samples of private-server websites, perform text word segmentation on the web page content of the reference samples using the Bayesian text classifier, and count the word frequencies of the resulting terms within the private-server website category, thereby obtaining reference vectors of the following form:
V-SOFT={word1_count,word2_count,…,wordn_count}
Next, obtain a web page to be classified and perform text word segmentation on its content to obtain the vector:
V-UNKNOWN={word1_count,word2_count,…,wordn_count}
Then, calculate the distance from V-UNKNOWN to V-SOFT and compare it with the corresponding threshold. If the distance is less than the threshold, the web page to be classified is close to the private-server website category, so it can be determined whether the page belongs to a private-server website, and the website to be classified is classified accordingly. This approach is of course not limited to classifying private-server websites and may also be used to classify other websites.
Finally, the private-server websites and plug-in websites are removed from the official websites.
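The vector comparison described in the steps above can be sketched as follows. The source does not name a distance measure, so Euclidean distance is used here purely as an assumption, and the vocabulary, function names, and threshold are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def count_vector(text: str, vocabulary: Sequence[str]) -> List[int]:
    """Word-frequency vector {word1_count, ..., wordn_count} over a fixed vocabulary."""
    counts = Counter(text.lower().split())  # whitespace split stands in for segmentation
    return [counts[word] for word in vocabulary]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_close(v_unknown: Sequence[float], v_ref: Sequence[float], threshold: float) -> bool:
    """True when the page vector lies within the threshold of the reference vector."""
    return euclidean(v_unknown, v_ref) < threshold


vocab = ["server", "private", "patch"]
v_ref = count_vector("private server private", vocab)
v_unknown = count_vector("private server patch", vocab)
print(v_ref, v_unknown, euclidean(v_unknown, v_ref))
```

Raw counts grow with page length, so in practice the vectors would likely be normalized (or cosine distance used) before comparison with a fixed threshold.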
The method for identifying trusted websites according to the embodiment of the invention can identify official websites with higher confidence, thereby providing reliable download sites for users who need to download files, reducing the risk that users download malicious samples, and improving the network security of users.
Fig. 2 schematically shows another flowchart of the method for identifying a trusted website according to an embodiment of the invention. As shown in Fig. 2, the method may comprise:
Step S21: determine the address of the corresponding log storage server according to the URL of the current site. Usually, when users download resources from the current site, a series of data is generated and recorded in the form of logs on the log storage server; each line of the log records the date, the time, the user, and a description of the operation performed, such as the resource downloaded from the current site.
Step S22: extract the download log of the current site within a set time period according to the address of the log storage server.
To assess the credibility of the current site quickly and effectively, preferably only part of the download log is taken from the log storage server for processing. The log can be divided into time periods based on time points, and the download log within one such period, that is, within the set time period, is extracted for fast and effective analysis. The length of the set time period is not particularly limited and can be chosen according to the efficiency of data processing and the reliability of the credibility judgment.
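Selecting the log lines that fall inside a set time period might look like the sketch below. The tuple layout (timestamp, user ID, file ID) and the half-open interval are assumptions made for this example; the source does not define a log format.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

LogLine = Tuple[datetime, str, str]  # (timestamp, user_id, file_id), assumed layout


def in_window(lines: Iterable[LogLine], start: datetime, length: timedelta) -> List[LogLine]:
    """Keep log lines whose timestamp lies in the half-open period [start, start + length)."""
    end = start + length
    return [line for line in lines if start <= line[0] < end]


lines = [
    (datetime(2024, 1, 1, 0, 0), "u1", "f1"),
    (datetime(2024, 1, 1, 12, 0), "u2", "f1"),
    (datetime(2024, 1, 2, 0, 0), "u3", "f2"),
]
window = in_window(lines, datetime(2024, 1, 1), timedelta(days=1))
print(len(window))
```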
Step S23: obtain the user IDs and download file identifiers from the extracted download log.
Most download logs contain both the ID of the user who downloaded a resource from the current site and the identifier (ID) of the downloaded resource, that is, the download file identifier. The user ID identifies the users who downloaded resources from the current site within the set time period, and the download file identifier identifies the files on the current site that were downloaded by users.
Step S24: according to the user IDs and download file identifiers extracted for the set time period, count the number of samples and the number of users for the download operations performed through the download links of the current site within the set time period.
As described above, because only the download log within the set time period is extracted in this embodiment, the statistical analysis is correspondingly performed only on the user IDs and download file identifiers in the download log within that period. Users may be counted by the registered user name under which they logged in and downloaded resources from the current site, or by the IP address from which they anonymously accessed the current site and downloaded resources.
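Counting users by registered name where available and by IP address otherwise could be sketched as follows. The key prefixes and function names are hypothetical, and treating one IP address as one anonymous user is a simplification.

```python
from typing import Iterable, Optional, Tuple


def user_key(user_name: Optional[str], ip_address: str) -> str:
    """Identify a user by registered name if logged in, otherwise by IP address."""
    return f"user:{user_name}" if user_name else f"ip:{ip_address}"


def count_users(visits: Iterable[Tuple[Optional[str], str]]) -> int:
    """Number of distinct users among (user_name, ip_address) pairs."""
    return len({user_key(name, ip) for name, ip in visits})


visits = [
    ("alice", "10.0.0.1"),
    ("alice", "10.0.0.2"),   # same registered user from another address
    (None, "10.0.0.3"),
    (None, "10.0.0.3"),      # same anonymous address twice
    ("", "10.0.0.4"),        # empty name is treated as anonymous
]
print(count_users(visits))
```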
Step S25: obtain the confidence of the current site, the confidence being inversely proportional to the number of samples and directly proportional to the number of users.
In the embodiment of the invention, the confidence may be calculated by the following formula (1):
W = m/n        (1)
In formula (1), W is the confidence of the current site, m is the number of users who performed download operations through the download links of the current site within the set time period, and n is the number of samples downloaded through those links within the set time period.
It will be understood that the embodiment of the invention may also use other similar, nonlinear confidence computations to obtain the confidence of the current site, which are not described here.
Step S26: determine whether the confidence is not less than the set confidence threshold; if so, go to step S27; otherwise, go to step S29.
Step S27: determine whether the number of samples is not less than the set sample threshold; if so, go to step S30; otherwise, go to step S29.
Step S29: determine that the current site is not an official website.
Step S30: determine that the current site is an official website.
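The two threshold checks of Fig. 2 can be sketched as a single function. The function name, the string return values, and the handling of n = 0 are assumptions; the example thresholds 1.5 and 6 are the empirical values quoted earlier in the text.

```python
def classify_site(m: int, n: int, confidence_threshold: float, sample_threshold: int) -> str:
    """Apply the confidence check, then the sample-count check, as in Fig. 2."""
    if n == 0:
        return "unofficial"  # assumption: no downloads means no evidence
    w = m / n  # formula (1)
    if w < confidence_threshold:
        return "unofficial"  # confidence check failed
    if n < sample_threshold:
        return "unofficial"  # sample-count check failed
    return "official"


print(classify_site(30, 10, 1.5, 6))
print(classify_site(30, 4, 1.5, 6))
print(classify_site(5, 10, 1.5, 6))
```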
After step S30, third-party websites such as private-server websites and plug-in websites may be removed from the official websites to obtain the trusted websites. After the trusted websites are collected, the files of the trusted websites may be collected periodically by hand, by means of spiders or web crawlers, and/or through user uploads; the program features and/or program behaviors are subsequently screened manually or automatically by tools and saved in the whitelist database of the programs to which the files relate.
Further, unknown program features and program behaviors may be analyzed according to the legitimate program features and corresponding program behaviors in the known whitelist, so as to update the whitelist.
Fig. 3 schematically shows a flowchart of performing the confidence judgment by updating the sample threshold in a method for identifying a trusted website according to another embodiment of the invention. As shown in Fig. 3, this embodiment differs from the embodiment shown in Fig. 2 in that, to improve the accuracy of the credibility judgment and prevent misjudgment, the confidence is evaluated over set time periods of different lengths while the sample threshold is updated. It may comprise the following steps:
Step S31: for the current set time period, obtain the confidence of the current site within that period, the confidence being inversely proportional to the number of samples and directly proportional to the number of users.
In the embodiment of the invention, the confidence may be calculated by the following formula (1):
W = m/n        (1)
In formula (1), W is the confidence of the current site, m is the number of users who performed download operations through the download links of the current site within the set time period, and n is the number of samples downloaded through those links within the set time period.
It will be understood that the embodiment of the invention may also use other similar, nonlinear confidence computations to obtain the confidence of the current site, which are not described here.
Step S32: determine whether the confidence for the current set time period is not less than the set confidence threshold; if so, go to step S33; otherwise, go to step S34.
Step S33: determine whether the number of samples in the current set time period is not less than the set sample threshold; if so, go to step S35; otherwise, go to step S34.
Step S34: determine that the current site is not an official website.
Step S35: for another set time period, obtain the confidence of the current site within that period, the confidence being inversely proportional to the number of samples and directly proportional to the number of users, and go to step S36.
In step S35, the confidence for the other set time period can be obtained with the computation described above with reference to Fig. 1 for the current time period, which is not repeated here.
Step S36: determine whether the confidence for the other set time period is not less than the set confidence threshold; if so, go to step S37; otherwise, go to step S34.
Step S37: update the sample threshold.
Step S38: determine whether the number of samples in the other set time period is not less than the updated sample threshold; if so, go to step S39; otherwise, go to step S35.
Step S39: determine that the current site is an official website.
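The multi-period flow of Fig. 3 can be sketched as below. The source does not say how the sample threshold is updated or what happens when no further time period is available, so the update rule is passed in as a caller-supplied function and an "undecided" result is returned when the periods run out; both choices, and all names, are assumptions.

```python
from typing import Callable, Iterable, Tuple

Stats = Tuple[int, int]  # (m users, n samples) for one set time period


def classify_multi_window(
    first: Stats,
    later: Iterable[Stats],
    confidence_threshold: float,
    sample_threshold: float,
    update_threshold: Callable[[float], float],
) -> str:
    """Check the first period, then re-check further periods with an updated sample threshold."""

    def w(stats: Stats) -> float:
        m, n = stats
        return m / n if n else 0.0  # formula (1); empty period scores 0 (assumption)

    # First period: confidence check, then sample-count check.
    if w(first) < confidence_threshold or first[1] < sample_threshold:
        return "unofficial"
    # Each further period: confidence check, threshold update, sample-count check.
    for stats in later:
        if w(stats) < confidence_threshold:
            return "unofficial"
        sample_threshold = update_threshold(sample_threshold)
        if stats[1] >= sample_threshold:
            return "official"
    return "undecided"  # assumption: no further period left to examine


def bump(threshold: float) -> float:
    """Hypothetical update rule: raise the sample threshold by 2 each time."""
    return threshold + 2


print(classify_multi_window((30, 10), [(40, 7), (50, 12)], 1.5, 6, bump))
```

In the example, the second period has too few samples for the updated threshold (7 < 8), so a third period is examined and passes (12 ≥ 10).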
After step S39, third-party websites such as private-server websites and plug-in websites may be removed from the official websites to obtain the trusted websites. After the trusted websites are collected, the files of the trusted websites may be collected periodically by hand, by means of spiders or web crawlers, and/or through user uploads; the program features and/or program behaviors are subsequently screened manually or automatically by tools and saved in the whitelist database of the programs to which the files relate.
Further, unknown program features and program behaviors may be analyzed according to the legitimate program features and corresponding program behaviors in the known whitelist, so as to update the whitelist.
Because this scheme raises the probability that the source website of a collected file is trustworthy, it improves the efficiency of collecting the whitelist (trusted websites).
It should be noted that, with reference to the embodiment shown in Fig. 3, there may be a plurality of set time periods, with a confidence computed for each; the credibility of the current site is then judged according to these confidences. The detailed process is not repeated here.
In addition, as described for step S14, lowering the sample number threshold reduces precision and raises recall; conversely, raising the sample number threshold improves precision and reduces recall. Raising the confidence threshold improves precision and reduces recall. Therefore, in this embodiment the credibility of the website is judged only by updating the sample threshold.
In another embodiment, the credibility of the website may also be judged by updating the confidence threshold, which is not described here.
Fig. 4 schematically shows a block diagram of a device for identifying a trusted website according to an embodiment of the invention. As shown in Fig. 4, in this embodiment the device may comprise an extraction module 41, a statistics module 42, an acquisition module 43, and an identification module 44. The extraction module 41 is configured to extract the download log of the current site within a set time period. The statistics module 42 is configured to count, according to the user IDs and download file identifiers in the download log extracted by the extraction module 41, the number of samples and the number of users for the download operations performed through the download links of the current site within the set time period. The acquisition module 43 is configured to obtain the confidence of the current site according to the number of samples and the number of users counted by the statistics module 42. The identification module 44 is configured to identify whether the current site is an official website according to the confidence obtained by the acquisition module 43 and the number of samples counted by the statistics module 42. The identification module 44 is further configured to obtain the trusted websites after removing third-party websites from the identified official websites.
The identification module 44 may further be configured to determine that the current site is an official website when the number of samples is less than a preset sample number threshold and the confidence of the current site is greater than a preset confidence threshold.
Fig. 5 schematically shows another block diagram of the device for identifying a trusted website according to an embodiment of the invention. The device may further comprise a crawling module 45. The crawling module 45 is connected to the identification module 44 and is configured to crawl download links from the official website when the identification module 44 determines that the current site is an official website; the crawling module 45 is further configured to crawl download links from the trusted website when the identification module 44 determines that the current site is a trusted website. Further, the device may also comprise a saving module 46. The saving module 46 is connected to the crawling module 45 and is configured to save the download links crawled by the crawling module 45 to the whitelist database.
The confidence of the current site may be inversely proportional to the number of samples and directly proportional to the number of users.
The current site may be a download website, a forum website, or the like.
By performing the above method for identifying trusted websites, the device for identifying trusted websites according to the embodiment of the invention can identify official websites with higher confidence, thereby providing reliable download sites for users who need to download files, reducing the risk that users download malicious samples, and improving the network security of users.
Fig. 6 schematically shows a block diagram of a system for collecting trusted websites according to an embodiment of the invention. As shown in Fig. 6, in this embodiment the system may comprise a server 51 and a trusted sample database 52.
The server 51 comprises a processor cluster 511 of processors with data processing capability, such as CPUs or DSPs, configured to: extract the download log of the current site within a set time period; count, according to the user IDs and download file identifiers in the extracted download log, the number of samples and the number of users for the download operations performed through the download links of the current site within the set time period; obtain the confidence of the current site according to the counted number of samples and number of users; and identify whether the current site is an official website according to the obtained confidence and the counted number of samples.
At the server 51, the CPU or DSP may control a wired or wireless network adapter to access the current site and extract its download log.
The trusted sample database 52 is used to collect the official websites determined by the server 51.
Optionally, the server comprises:
an extraction module, configured to extract the download log of the current site within a set time period;
a statistics module, configured to count, according to the user IDs and download file identifiers in the download log extracted by the extraction module, the number of samples and the number of users for the download operations performed through the download links of the current site within the set time period;
an acquisition module, configured to obtain the confidence of the current site according to the number of samples and the number of users counted by the statistics module; and
an identification module, configured to identify whether the current site is an official website according to the confidence obtained by the acquisition module and the number of samples counted by the statistics module.
Optionally, the identification module is further configured to determine that the current site is an official website when the number of samples is less than a preset sample number threshold and the confidence of the current site is greater than a preset confidence threshold.
Optionally, the server further comprises a crawling module, connected to the identification module and configured to crawl download links from the official website when the identification module determines that the current site is an official website.
Optionally, the identification module is further configured to obtain the trusted websites after removing third-party websites from the identified official websites.
Optionally, the crawling module is further configured to crawl download links from the trusted website when the identification module determines that the current site is a trusted website.
Optionally, the server further comprises a saving module, connected to the crawling module and configured to save the download links crawled by the crawling module to the whitelist database.
In this embodiment, for the technical description of the official-website identification device and its functional modules, reference may be made to the embodiments above; it is not repeated here.
The system for collecting trusted websites according to the embodiment of the invention obtains the download logs of downloaded files, analyzes the download logs, extracts the current site from the download logs, confirms official websites among the current sites, and finally filters out third-party websites such as plug-in websites and/or private-server websites from the official websites. By analyzing the download logs of software, more accurate download information can be obtained.
Reference herein to "an embodiment", "embodiments", or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is intended to be illustrative, not limiting, of its scope, which is defined by the appended claims.