CN112711678A - Data analysis method, device, equipment and storage medium - Google Patents
- Publication number
- CN112711678A (application CN201911026006.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- word
- user agent
- character string
- agent character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
Abstract
The embodiment of the invention discloses a data analysis method, a data analysis device, data analysis equipment and a storage medium. The method comprises the following steps: acquiring a target user agent character string to be analyzed, and preprocessing the target user agent character string to generate a target characteristic vector; inputting the target characteristic vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string; wherein the data analysis model is a machine learning model. By the technical scheme, the user agent character string can be analyzed more efficiently and more accurately.
Description
Technical Field
The embodiment of the invention relates to the internet technology, in particular to a data analysis method, a data analysis device, data analysis equipment and a storage medium.
Background
A User Agent (UA) string is a special header string that is an inherent property of a browser. When a browser interacts with a server, it sends request headers containing a User-Agent field, and this field carries the browser's user agent string. By parsing the user agent string, the server can identify information such as the operating system and its version, the CPU type, the browser and its version, the browser rendering engine, the browser language, and the browser plug-ins used by the user. User agent parsing therefore plays an important role in front-end user identification; for example, information such as the user's purchasing power, attributes, and occupation can be further analyzed on the basis of the parsed information. User agent parsing also has important applications in front-end performance monitoring and exception reporting.
The user agent parsing technique commonly used at present is as follows: keywords in the user agent strings of each browser are extracted in advance, and regular expressions are constructed from these keywords to build a regular-expression matching library. When a browser's user agent string needs to be parsed, its keywords are extracted, a regular expression is built from them, and that expression is matched against each regular expression in the matching library, from which the various pieces of information in the user agent string are parsed.
In the process of implementing the invention, the inventor found at least the following problems in the prior art: because different user agent strings may contain the same keywords, regular-expression matching can fail, which reduces the accuracy of user agent parsing. In addition, to improve the comprehensiveness and accuracy of user agent parsing, the matching library is usually expanded, which inevitably reduces parsing efficiency.
Disclosure of Invention
Embodiments of the present invention provide a data parsing method, apparatus, device, and storage medium, so as to implement more efficient and more accurate parsing of a user agent string.
In a first aspect, an embodiment of the present invention provides a data parsing method, including:
acquiring a target user agent character string to be analyzed, and preprocessing the target user agent character string to generate a target characteristic vector;
inputting the target characteristic vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string; wherein the data analysis model is a machine learning model.
In a second aspect, an embodiment of the present invention further provides a data parsing apparatus, where the apparatus includes:
the target characteristic vector generation module is used for acquiring a target user agent character string to be analyzed and preprocessing the target user agent character string to generate a target characteristic vector;
the analysis module is used for inputting the target characteristic vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string; wherein the data analysis model is a machine learning model.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the data parsing method provided by any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data parsing method provided in any embodiment of the present invention.
The embodiment of the invention generates a target characteristic vector by acquiring a target user agent character string to be analyzed and preprocessing the target user agent character string; and inputting the target characteristic vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string. The user agent character string is analyzed by the machine learning model, and the analysis efficiency and accuracy of the user agent character string are improved.
Drawings
FIG. 1 is a flowchart of a data parsing method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a data parsing method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a data parsing method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data analysis apparatus according to a fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
The data analysis method provided by the embodiment can be suitable for automatic analysis of the user agent character string. The method may be performed by a data parsing apparatus, which may be implemented by software and/or hardware, and may be integrated in an electronic device, such as a desktop computer or a server. Referring to fig. 1, the method of the present embodiment specifically includes the following steps:
S110, obtaining a target user agent character string to be analyzed, and preprocessing the target user agent character string to generate a target feature vector.
The target user agent character string is a user agent character string needing to be analyzed. The target feature vector is a feature vector obtained by digitizing a target user agent string.
The analysis of the user agent character string first needs to obtain a target user agent character string to be analyzed, which can be obtained from a network request sent by a browser or can be obtained by reading from a storage medium. The specific acquisition mode is determined according to the actual service requirement.
A user agent character string has no fixed structure, whereas the input of a machine learning model must be a vector with a fixed structure. The text features of the target user agent character string therefore need to be converted into fixed numerical features by a model that maps text to fixed-structure vectors; that is, the text features are digitized into the target feature vector. Because user agent parsing does not need to understand the meaning of each word in the user agent character string, and only needs to identify the category of the character string, the target user agent character string can first be segmented into words, and the segmentation result can then be digitized with the term frequency-inverse document frequency (TF-IDF) algorithm to obtain the target feature vector.
S120, inputting the target feature vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model, to serve as an analysis result of the target user agent character string.
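The preprocessing described above can be illustrated with a minimal sketch. The tokenizer (splitting on whitespace and common separators), the smoothed IDF form, and the sample strings are assumptions made for illustration; the patent text does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(ua: str) -> list[str]:
    # Hypothetical tokenizer: split on whitespace, '/', ';', ',', and parentheses.
    return [tok for tok in re.split(r"[\s/;(),]+", ua) if tok]


def tfidf_vector(target: str, corpus: list[str]) -> dict[str, float]:
    """Map each token of `target` to a TF-IDF weight computed against `corpus`."""
    tokens = tokenize(target)
    counts = Counter(tokens)
    docs = [set(tokenize(doc)) for doc in corpus]
    vector = {}
    for term, count in counts.items():
        tf = count / len(tokens)  # normalized term frequency
        df = sum(1 for doc in docs if term in doc)  # strings containing the term
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0  # smoothed IDF (assumption)
        vector[term] = tf * idf
    return vector


corpus = [
    "Mozilla/5.0 (Windows NT 10.0) Chrome/78.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh) Version/13.0 Safari/605.1",
    "Mozilla/5.0 (Windows NT 10.0) Gecko/20100101 Firefox/70.0",
]
vec = tfidf_vector(corpus[0], corpus)
```

In this sketch a token such as `Chrome`, which occurs in only one string, receives a larger weight than `Mozilla`, which occurs in all of them. Laying the weights out over a fixed vocabulary order would give the fixed-structure vector the model requires.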
The data analysis model is a pre-trained machine learning model, and is used for analyzing the input feature vector so as to judge the category corresponding to the feature vector, such as information of a browser identifier, a browser category, a browser version or a central processing unit of user equipment.
The target feature vector obtained in S110 is input into a data analysis model, and the output information of the model, such as the probability that the target feature vector may belong to each browser identifier, may be obtained through the processing of the data analysis model. And then, according to the output information, determining the target browser identification in the target user agent character string. For example, the browser identifier with the highest probability in the output information is determined as the target browser identifier. The target browser identification is the result of parsing the target user agent string. In addition, the composition and composition structure of the target user agent character string can be obtained according to the target browser identifier, and then the whole target user agent character string can be completely analyzed according to the composition structure, so that various information contained in the target user agent character string can be obtained.
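As a minimal sketch of the selection step, assuming the model output is available as a mapping from browser identifier to probability (the function name and the sample values are hypothetical):

```python
def pick_browser(probabilities: dict[str, float]) -> str:
    """Return the browser identifier with the highest predicted probability."""
    if not probabilities:
        raise ValueError("model output is empty")
    return max(probabilities, key=probabilities.get)


# Illustrative model output, not real data.
model_output = {"Chrome": 0.7, "Firefox": 0.2, "Safari": 0.1}
target_browser = pick_browser(model_output)
```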
Illustratively, the data analysis model is pre-trained as follows: acquiring a user agent character string set, and preprocessing each user agent character string in the set to obtain its feature vector; inputting at least one training feature vector among the feature vectors into the data analysis model to obtain a browser prediction result for each training feature vector, and determining the model deviation with a preset loss function according to each browser prediction result and the browser verification result of the corresponding training feature vector; if the model deviation does not meet a preset convergence condition, back-propagating the model deviation by using a logistic regression algorithm to iteratively train the data analysis model; inputting at least one test feature vector among the feature vectors into the data analysis model to obtain a browser test result for each test feature vector, and determining the model accuracy of the data analysis model according to each browser test result and the browser verification result of the corresponding test feature vector; and if the model accuracy is less than a preset accuracy, returning to the step of acquiring the user agent character string set.
Training a machine learning model first requires a training sample set. In this embodiment, user agent character strings are collected from the access logs of a server. To improve collection efficiency and the richness of the sample data, a service with wide coverage (for example, a large volume of user visits) is selected for data collection. For example, the access logs of e-commerce platforms with a large volume of user visits, such as JD.com and Taobao, can be selected as the data source. To ensure an appropriate amount of sample data, this embodiment may randomly select access logs covering a set number of days (e.g., one or three days) from all the access logs, for example no fewer than 1,000,000 records. Character strings are then extracted from the collected access logs to obtain a training sample set composed of a plurality of user agent character strings, namely the user agent character string set. Each user agent character string in the set is then preprocessed by word segmentation, text vectorization, and so on, to obtain the feature vector corresponding to each user agent character string.
Then, the feature vectors corresponding to the user agent character string set are split into 2 subsets. For example, the feature matrix formed by the feature vectors (matrix elements without numerical values are filled with 1) can be split with the train_test_split method: a preset proportion of feature vectors is randomly selected from the feature matrix as training feature vectors for training the data analysis model, and the remaining feature vectors serve as test feature vectors for testing the trained data analysis model.
Next, each training feature vector is input into the data analysis model; that is, each training feature vector is used as an input parameter of the prediction function h(), and the classification performed by this function outputs the browser identifier and prediction probability for each feature vector as the browser prediction result. All browser prediction results, together with the browser identifier verification results of the corresponding feature vectors (collected when the user agent character string set is obtained), are then input into the preset loss function J(θ) to calculate the model prediction deviation (model deviation for short) for each feature vector. If the model deviation does not satisfy the preset deviation (i.e., the preset convergence condition), the model deviation is back-propagated by using the logistic regression algorithm to update the model weights in the data analysis model and thereby train it iteratively; in other words, the logistic regression method is used to find the minimum of the preset loss function J(θ). If the model deviation satisfies the preset convergence condition, the iterative training ends, and the resulting data analysis model is the trained model.
After model training is completed, the accuracy of the model is tested with the test feature vectors. Each test feature vector is input into the data analysis model to obtain the corresponding browser test result. According to the browser test results and the browser verification results, the ratio of the number of test feature vectors parsed correctly by the model to the total number of test feature vectors is taken as the model accuracy. This step can be implemented by calling the lgs.score method. If the model accuracy does not reach the preset accuracy (e.g., 100%), new user agent character strings need to be collected to further refine the data analysis model, i.e., the complete model training process described above is repeated. If the model accuracy reaches the preset accuracy, the whole model training process ends. The advantage of this arrangement is that the logistic regression algorithm improves model training efficiency.
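The random split can be sketched with the standard library alone. The text names a train_test_split method; the helper below is a hypothetical stand-in that imitates the same idea (shuffle, then cut at a preset proportion), and the 80/20 ratio is an assumption.

```python
import random


def split_features(vectors, labels, train_ratio=0.8, seed=0):
    """Randomly split (vector, label) pairs into a training and a test subset."""
    if len(vectors) != len(labels):
        raise ValueError("vectors and labels must have the same length")
    indices = list(range(len(vectors)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(len(indices) * train_ratio)
    train = [(vectors[i], labels[i]) for i in indices[:cut]]
    test = [(vectors[i], labels[i]) for i in indices[cut:]]
    return train, test


# Ten placeholder feature vectors with placeholder labels.
train_set, test_set = split_features(list(range(10)), list("abcdefghij"))
```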
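The accuracy ratio described above is simple enough to state directly. The following sketch assumes the test results and verification results are parallel lists of browser identifiers; the function name and sample values are hypothetical.

```python
def model_accuracy(test_results, verification_results):
    """Fraction of test feature vectors whose predicted browser matches the verified one."""
    if len(test_results) != len(verification_results):
        raise ValueError("result lists must have the same length")
    if not verification_results:
        raise ValueError("no test feature vectors supplied")
    correct = sum(p == e for p, e in zip(test_results, verification_results))
    return correct / len(verification_results)


accuracy = model_accuracy(
    ["Chrome", "Firefox", "Safari", "Chrome"],
    ["Chrome", "Firefox", "Chrome", "Chrome"],
)
```

Here three of the four predictions agree with the verification results, so `accuracy` is 0.75; training would be repeated if this were below the preset accuracy.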
Illustratively, the gradient ascent algorithm used in the logistic regression algorithm is a stochastic gradient ascent algorithm, so that in each training iteration one training feature vector is randomly selected from the training feature vectors for the current iteration of training the data analysis model.
Logistic regression is a discrete choice model. To save computation time, the gradient ascent algorithm used to implement logistic regression is replaced with a stochastic gradient ascent algorithm. The gradient ascent algorithm iterates over all training feature vectors each time it updates the model weights, whereas the stochastic gradient ascent algorithm uses only one randomly chosen training feature vector per update. This reduces the time spent computing the model weights, speeds up the calculation of the decision weights in the prediction function h(), and thus further improves model training efficiency.
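A minimal sketch of stochastic gradient ascent for binary logistic regression follows. The binary setting, the learning rate, the iteration count, and the toy data are all assumptions for illustration; the patent does not give these details, and a browser classifier would need a multi-class extension.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def stochastic_gradient_ascent(samples, labels, iterations=500, lr=0.1, seed=0):
    """Update the weights from ONE randomly chosen sample per iteration."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    for _ in range(iterations):
        i = rng.randrange(len(samples))  # pick a single training vector at random
        x, y = samples[i], labels[i]
        error = y - sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


def predict(weights, x) -> int:
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x))) > 0.5 else 0


# Toy data: first component is a bias term, second is the only real feature.
samples = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
weights = stochastic_gradient_ascent(samples, labels)
```

Batch gradient ascent would instead sum the error term over every sample before each update, which is the cost the stochastic variant avoids.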
According to the technical scheme of the embodiment, a target feature vector is generated by acquiring a target user agent character string to be analyzed and preprocessing the target user agent character string; and inputting the target characteristic vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string. The user agent character string is analyzed by the machine learning model, and the analysis efficiency and accuracy of the user agent character string are improved.
Example two
In this embodiment, based on the first embodiment, further optimization is performed on "preprocessing the target user agent character string to generate the target feature vector". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 2, the data parsing method provided in this embodiment includes:
S210, obtaining a target user agent character string to be analyzed.
S220, performing word segmentation on the target user agent character string to obtain each target feature word.
Before the text digitization model is used, the target user agent character string needs to be subjected to word segmentation processing to obtain each word segmentation result (namely the target characteristic word). And then, each target feature word can be respectively input into a text digital model to realize the conversion of the target user agent character string into a target feature vector.
S230, generating a target feature vector of the target user agent character string by using an improved term frequency-inverse document frequency (TF-IDF) algorithm according to each target feature word and a pre-constructed user agent character string set.
The TF-IDF algorithm is built on the term frequency (TF) value and the inverse document frequency (IDF) value of a word. The term frequency TF is the frequency with which a given word appears in a text; the TF value is computed by normalizing the raw count so that long texts are not favored. The inverse document frequency IDF measures the general importance of a word. The specific calculation formula is as follows:
TF-IDF = TF × IDF (3)
The improved TF-IDF algorithm modifies the TF-IDF algorithm to address the problem that the TF-IDF algorithm may select words with few occurrences as feature words. The TF-IDF algorithm is therefore improved in the embodiment of the invention so as to highlight, among all target feature words, those that better represent the target user agent character string. The specific improvement is as follows: on the basis of the TF-IDF algorithm, a word weight determination sub-algorithm is added, which measures how well a target feature word identifies the target user agent character string. The word weight may be understood as the probability that a character string can be recognized as a particular UA string on the basis of the feature words it contains.
According to the formula of the TF-IDF algorithm, for each target feature word, the relevant numbers of feature words and of user agent character strings are counted in the target user agent character string and in the user agent character string set; these counts are then substituted into the formula of the improved TF-IDF algorithm to calculate the feature weight value [(i, j) weight] of that target feature word. Here, i is the row number of the target user agent character string in the feature matrix (i is 0 when the result is a single feature vector); j is the column number of the target feature word in the feature vector or feature matrix; and weight is the feature weight value. The feature weight values of all target feature words form the target feature vector of the target user agent character string. For example, the target feature vector may be [ (0,0)0.072323, (0,1)0.062323, (0,2)0.052323 ].
Exemplarily, S230 includes:
S231, determining the word weight of each target feature word by using the word weight determination sub-algorithm in the improved TF-IDF algorithm, according to each target feature word, the browser category corresponding to each target feature word, and the user agent character string set.
The formula of the word weight determination sub-algorithm in this embodiment is as follows:
where χ²(t, c) is the word weight of feature word t under browser category c; N is the total number of user agent character strings in the user agent character string set; A is the number of user agent character strings that contain feature word t and belong to browser category c (a browser category may be, for example, a desktop browser or a mobile browser); B is the number of user agent character strings that contain feature word t and do not belong to browser category c; C is the number of user agent character strings that do not contain feature word t and belong to browser category c; and D is the number of user agent character strings that do not contain feature word t and do not belong to browser category c.
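The symbols N, A, B, C and D defined above match those of the standard chi-square statistic used for feature selection in text classification. The sketch below assumes that standard form, χ²(t, c) = N(AD − BC)² / ((A + C)(B + D)(A + B)(C + D)); the counts in the example are invented for illustration.

```python
def chi_square(n: int, a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square statistic for feature word t and browser category c.

    Assumed form: N * (A*D - B*C)^2 / ((A+C) * (B+D) * (A+B) * (C+D)).
    """
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the word or category never (or always) occurs
    return n * (a * d - b * c) ** 2 / denominator


# 100 strings: the word appears in 40 of the 50 strings of category c
# and in 10 of the 50 strings outside it.
weight = chi_square(n=100, a=40, b=10, c=10, d=40)
```

With these counts the weight is 36.0, while a word spread evenly across categories (A = B = C = D) would score 0.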
In the process of acquiring the user agent character string set, the browser category corresponding to each user agent character string can be determined, and then the browser category to which each feature word after word segmentation belongs can be further determined. Further, the browser category corresponding to each target feature word can be determined. Based on the target feature words and the prior knowledge of the browser categories corresponding to the target feature words, the parameter values required in the formula (4) can be counted from the user agent character string set, the word weights of the target feature words in different browser categories can be obtained through calculation of the formula (4), and then the word weights of the corresponding target feature words are obtained through calculation of the word weights corresponding to the browser categories.
According to formula (4), the word weight represents the importance of a target feature word across all collected user agent character strings, that is, how strongly the word can distinguish and identify user agent character strings. The larger the word weight of a target feature word, the better it identifies a user agent character string and the more important it is within that string.
After determining the word weight of each target feature word, the method further includes: and screening the target characteristic words according to the preset quantity according to the word weight of each target characteristic word.
In order to further reduce the amount of calculation in the data analysis process and further improve the analysis efficiency of the user agent, in this embodiment, the target feature words may be sorted in a descending order according to the word weight of each target feature word, and a preset number (for example, 5) of target feature words sorted in the top order may be intercepted, so as to be used for calculating the feature weight values (that is, the target word frequency inverse file frequency values) of subsequent S232 and S233.
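The screening step can be sketched as a descending sort followed by truncation. The function name and the weights are hypothetical; the default of 5 follows the example in the text.

```python
def top_feature_words(word_weights: dict[str, float], limit: int = 5) -> list[str]:
    """Keep the `limit` target feature words with the largest word weights."""
    ranked = sorted(word_weights, key=word_weights.get, reverse=True)
    return ranked[:limit]


# Illustrative word weights, not real measurements.
weights_by_word = {
    "Mozilla": 0.1, "Chrome": 36.0, "Safari": 12.5,
    "Windows": 3.2, "NT": 3.1, "537.36": 8.0, "5.0": 0.2,
}
kept = top_feature_words(weights_by_word)
```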
S232, determining an initial word frequency inverse file frequency value of each target characteristic word by utilizing a word frequency inverse file frequency algorithm in the improved word frequency inverse file frequency algorithm according to each target characteristic word and the user agent character string set.
The formula of the word frequency inverse file frequency algorithm after normalization in this embodiment is as follows:
where TF-IDF is the initial TF-IDF value; tf_i(t_j) is the term frequency (TF) value of the j-th target feature word t_j in the i-th user agent character string, calculated according to formula (1), with i = 0 if the final result is a single feature vector; and n_j is the number of user agent character strings in the user agent character string set that contain the target feature word t_j.
According to formula (1) and formula (5), the number of feature words and the number of user agent character strings required for each target feature word are counted, and the initial TF-IDF value of each target feature word is calculated.
S233, determining the target word frequency inverse file frequency value of each target feature word by using an improved word frequency inverse file frequency algorithm according to the initial word frequency inverse file frequency value and the word weight of each target feature word, and constructing the target feature vector of the target user agent character string by using each target word frequency inverse file frequency value.
The formula of the improved word frequency inverse file frequency algorithm in the embodiment is as follows:
where TF-IDF is the target TF-IDF value, and the weight term is the word weight of the j-th target feature word t_j in the i-th user agent character string.
Based on formula (6), the target TF-IDF value of each target feature word is calculated from the word weight of the target feature word obtained in S231 and the initial TF-IDF value of the target feature word obtained in S232. The target TF-IDF values of all target feature words then form the target feature vector of the target user agent character string.
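The sketch below illustrates how S231 to S233 could fit together. It assumes that the initial TF-IDF value and the word weight are combined by multiplication, which the surviving text does not state explicitly, so treat that combination rule as a placeholder. The output uses the [(i, j) weight] layout described earlier, with i fixed at 0 for a single feature vector.

```python
def build_target_vector(feature_words, initial_tfidf, word_weights):
    """Combine initial TF-IDF values with word weights into ((row, column), value) pairs."""
    vector = []
    for column, word in enumerate(feature_words):
        # Assumed combination rule: product of initial TF-IDF and word weight.
        value = initial_tfidf[word] * word_weights[word]
        vector.append(((0, column), value))
    return vector


# Illustrative inputs, not real measurements.
target_vector = build_target_vector(
    ["Chrome", "Safari"],
    initial_tfidf={"Chrome": 0.5, "Safari": 0.25},
    word_weights={"Chrome": 2.0, "Safari": 4.0},
)
```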
According to the technical scheme of this embodiment, the target feature vector of the target user agent character string is generated by using an improved TF-IDF algorithm according to each target feature word and a pre-constructed user agent character string set. In digitizing the user agent character string into a feature vector, the improved TF-IDF algorithm introduces a word weight that measures how well a target feature word identifies the target user agent character string. This highlights the target feature words that best represent the target user agent character string, so that the resulting target feature vector represents the information in the string more comprehensively and accurately, further improving the accuracy of user agent parsing.
Example three
In this embodiment, based on the second embodiment, further optimization is performed on "segmenting the target user agent character string to obtain each target feature word". Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 3, the data parsing method provided in this embodiment includes:
S310, obtaining a target user agent character string to be analyzed.
S320, determining the length of the target word segmentation according to the pre-established dictionary tree and the target user agent character string.
The dictionary tree stores browser identifiers, that is, the dictionary tree is constructed by using the browser identifiers in advance for fast matching of character strings.
In the related art, the word segmentation of the character string is realized based on a fixed word segmentation length set by experience. However, the word segmentation result obtained in this way is likely to destroy the integrity of the browser identifier, thereby reducing the accuracy of the user agent character string recognition. Therefore, the method of dynamically determining the length of the target word segmentation according to the browser identifier is adopted in the embodiment. In specific implementation, the target user agent character string can be quickly matched with the dictionary tree storing all browser identifications, and the length of the character string which is matched with the target user agent character string is determined as the target word segmentation length.
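A dictionary tree (trie) of browser identifiers and a longest-match lookup can be sketched as follows. The class and function names, the identifier list, and the rule of taking the longest identifier found anywhere in the string are assumptions for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False  # True when a browser identifier ends at this node


class BrowserTrie:
    def __init__(self, identifiers):
        self.root = TrieNode()
        for identifier in identifiers:
            node = self.root
            for ch in identifier:
                node = node.children.setdefault(ch, TrieNode())
            node.is_end = True

    def longest_match(self, text: str, start: int) -> int:
        """Length of the longest stored identifier beginning at text[start]."""
        node, best = self.root, 0
        for offset in range(start, len(text)):
            node = node.children.get(text[offset])
            if node is None:
                break
            if node.is_end:
                best = offset - start + 1
        return best


def target_segment_length(ua: str, trie: BrowserTrie) -> int:
    # Try every start position and keep the longest identifier found.
    return max((trie.longest_match(ua, i) for i in range(len(ua))), default=0)


trie = BrowserTrie(["Chrome", "Firefox", "Safari", "Edg"])
length = target_segment_length("Mozilla/5.0 Gecko/20100101 Firefox/70.0", trie)
```

In this example the only stored identifier present is `Firefox`, so the segmentation length comes out as 7 and the identifier is not cut in half by a fixed-length split.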
Exemplarily, S320 includes:
S321, taking the preset word segmentation length as the current word segmentation length, and performing word segmentation on the target user agent character string according to the current word segmentation length to obtain word segmentation results.
The preset word segmentation length is a fixed word segmentation length set empirically, and may be, for example, 1 character. The current word segmentation length is the word segmentation length used in the current iteration of the loop, and is updated continuously as the loop proceeds.
The idea of dynamically determining the word segmentation length in this embodiment is as follows: the target user agent character string is split according to different word segmentation lengths, and the word segmentation results obtained each time are matched against the dictionary tree one by one. The most suitable word segmentation length is then determined as the target word segmentation length according to the comprehensive matching degree (matching degree for short) between the word segmentation results of each segmentation and the dictionary tree. In a specific implementation, the preset word segmentation length is taken as the current word segmentation length to start the loop. In each iteration, the target user agent character string is segmented according to the current word segmentation length to obtain the word segmentation results.
S322, matching each word segmentation result with each browser identifier in the dictionary tree, determining the current matching degree corresponding to the current word segmentation length, and recording the current word segmentation length and the current matching degree.
Each word segmentation result is matched against the dictionary tree to obtain the matching degree of that word segmentation result. The comprehensive matching degree corresponding to the current word segmentation length is then determined from the individual matching degrees and taken as the current matching degree. The current matching degree may be determined by calculating the mean, the median, or the maximum of the individual matching degrees. Considering that a higher matching degree indicates a more complete browser identifier, this embodiment takes the maximum of the matching degrees as the current matching degree. For subsequent use, the current word segmentation length and its corresponding current matching degree can be recorded.
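The aggregation choices described above (mean, median, or maximum) can be compared with a small sketch. The text does not define how an individual matching degree is computed, so the definition used here is an assumption: the fraction of a browser identifier's characters matched by the start of a word segmentation result. A plain list stands in for the dictionary tree, and all function names and the sample string are hypothetical.

```python
from statistics import mean, median


def segment(text, length):
    """Split `text` into consecutive pieces of at most `length` characters."""
    return [text[i:i + length] for i in range(0, len(text), length)]


def match_degree(word, identifiers):
    """Assumed definition: best fraction of an identifier matched by the
    leading characters of `word`."""
    best = 0.0
    for identifier in identifiers:
        common = 0
        for a, b in zip(word, identifier):
            if a != b:
                break
            common += 1
        best = max(best, common / len(identifier))
    return best


def current_matching_degree(text, length, identifiers, aggregate=max):
    """Aggregate the per-segment matching degrees for one segmentation length."""
    degrees = [match_degree(w, identifiers) for w in segment(text, length)]
    return aggregate(degrees)


ua = "Mozilla/5.0 Chrome/90.0"
identifiers = ["Chrome", "Firefox", "Safari"]
by_max = current_matching_degree(ua, 6, identifiers)            # maximum
by_mean = current_matching_degree(ua, 6, identifiers, mean)     # mean
by_median = current_matching_degree(ua, 6, identifiers, median) # median
```

With a length of 6, only one of the four segments ("Chrome") matches an identifier, so the mean and median are pulled down by the unrelated segments while the maximum still reports a complete match, which is consistent with the preference for the maximum stated above.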
S323, if the current word segmentation length is smaller than the character length of the target user agent character string, increasing the current word segmentation length according to a preset step length, and returning to execute the step of performing word segmentation on the target user agent character string according to the current word segmentation length to obtain word segmentation results.
In order to determine the word segmentation length more accurately, in this embodiment the word segmentation results corresponding to all word segmentation lengths may be matched with the dictionary tree, so as to obtain the current matching degree corresponding to each word segmentation length and then determine the most appropriate word segmentation length from these current matching degrees. Thus, after the operation of S322, it is necessary to determine whether the current word segmentation length is smaller than the character length of the target user agent character string. If not, the loop for calculating the current matching degrees ends, and the process proceeds to S324. If so, the current word segmentation length is increased by a preset step length (for example, 1 character) to update it, and the process returns to the step in S321 of performing word segmentation on the target user agent character string according to the current word segmentation length to obtain word segmentation results, so that the current matching degree corresponding to each word segmentation length is determined and recorded in a loop.
S324, determining the current word segmentation length corresponding to the current matching degree with the largest value among the current matching degrees as the target word segmentation length.
The current matching degrees recorded in S322 are compared to find the maximum current matching degree, and the current word segmentation length corresponding to that current matching degree is determined as the target word segmentation length.
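The loop of S321 to S324 can be sketched as follows. This is a hedged illustration: the matching degree definition (fraction of an identifier matched by the start of a segment), the use of a plain list in place of the dictionary tree, the tie-breaking rule (the smallest length wins), and all names are assumptions rather than details given in the text.

```python
def segment(text, length):
    """Split `text` into consecutive pieces of at most `length` characters."""
    return [text[i:i + length] for i in range(0, len(text), length)]


def match_degree(word, identifiers):
    """Assumed definition: best fraction of an identifier matched by the
    leading characters of `word`."""
    best = 0.0
    for identifier in identifiers:
        common = 0
        for a, b in zip(word, identifier):
            if a != b:
                break
            common += 1
        best = max(best, common / len(identifier))
    return best


def target_segmentation_length(text, identifiers, preset_length=1, step=1):
    if not text:
        raise ValueError("text must not be empty")
    records = []              # (current length, current matching degree)
    current = preset_length   # S321: start from the preset length
    while True:
        # S322: the current matching degree is the maximum over all segments.
        degree = max(match_degree(w, identifiers) for w in segment(text, current))
        records.append((current, degree))
        # S323: stop once the length has reached the string length.
        if current >= len(text):
            break
        current += step
    # S324: pick the length with the largest recorded matching degree;
    # max() keeps the first maximum, i.e. the smallest such length.
    best_length, _ = max(records, key=lambda record: record[1])
    return best_length


best = target_segmentation_length("Mozilla/5.0 Chrome/90.0",
                                  ["Chrome", "Firefox", "Safari"])
```

For this toy string the search settles on a length of 6, the first length at which a segment boundary lines up with the complete identifier "Chrome".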
Exemplarily, after matching each word segmentation result with each browser identifier in the dictionary tree and determining a current matching degree corresponding to the current word segmentation length, the method further includes:
S325, if the current matching degree is equal to or greater than the preset matching degree, taking the current word segmentation length as the target word segmentation length of the target user agent character string.
In order to determine the target word segmentation length more efficiently, a matching degree that meets the service accuracy requirement (i.e., the preset matching degree) may be set in advance in this embodiment. As soon as the current matching degree reaches the preset matching degree, the current word segmentation length can be determined as the target word segmentation length, without obtaining the current matching degree for every word segmentation length. This reduces the number of word segmentation and character string matching operations to a certain extent, thereby improving the efficiency of determining the target word segmentation length.
In specific implementation, after the current matching degree is determined in S322, the current matching degree is compared with the preset matching degree. And if the current matching degree is greater than or equal to the preset matching degree, taking the current word segmentation length as the target word segmentation length.
S326, if the current matching degree is smaller than the preset matching degree, increasing the current word segmentation length according to the preset step length, and returning to execute the step of performing word segmentation on the target user agent character string according to the current word segmentation length to obtain each word segmentation result.
If the current matching degree is smaller than the preset matching degree, the loop needs to continue: the current word segmentation length is updated, and the process returns to the step in S321 of performing word segmentation on the target user agent character string according to the current word segmentation length to obtain each word segmentation result, so as to calculate a new current matching degree and compare it with the preset matching degree.
It should be noted that S323 to S324 and S325 to S326 are parallel, alternative schemes, and which one is used to determine the target word segmentation length depends on the service requirement. For example, if the service requirement focuses on word segmentation accuracy, the scheme of S323 to S324 can be adopted; if the service requirement focuses on balancing word segmentation accuracy and word segmentation efficiency, the scheme of S325 to S326 can be adopted.
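The early-stopping scheme of S325 to S326 can be sketched as below. As before, the matching degree definition, the list used in place of the dictionary tree, and the function names are assumptions. The text does not say what happens if no length reaches the preset matching degree; returning `None` in that case is a choice made for this sketch.

```python
def segment(text, length):
    """Split `text` into consecutive pieces of at most `length` characters."""
    return [text[i:i + length] for i in range(0, len(text), length)]


def match_degree(word, identifiers):
    """Assumed definition: best fraction of an identifier matched by the
    leading characters of `word`."""
    best = 0.0
    for identifier in identifiers:
        common = 0
        for a, b in zip(word, identifier):
            if a != b:
                break
            common += 1
        best = max(best, common / len(identifier))
    return best


def length_with_preset_degree(text, identifiers, preset_degree,
                              preset_length=1, step=1):
    current = preset_length
    while current <= len(text):
        degree = max(match_degree(w, identifiers) for w in segment(text, current))
        # S325: accept the first length whose matching degree is high enough.
        if degree >= preset_degree:
            return current
        # S326: otherwise enlarge the length and segment again.
        current += step
    return None  # assumption: no length reached the preset matching degree


ua = "Mozilla/5.0 Chrome/90.0"
identifiers = ["Chrome", "Firefox", "Safari"]
loose = length_with_preset_degree(ua, identifiers, preset_degree=0.5)
strict = length_with_preset_degree(ua, identifiers, preset_degree=1.0)
```

A lower preset matching degree ends the loop earlier (length 3 here, where only "Chr" is matched) at the cost of a less complete identifier, which illustrates the accuracy versus efficiency trade-off between the two schemes.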
S330, segmenting the target user agent character string according to the target word segmentation length to obtain each target feature word.
S340, generating the target feature vector of the target user agent character string by using an improved word frequency inverse file frequency algorithm according to each target feature word and a pre-constructed user agent character string set.
According to the technical solution of this embodiment, the target word segmentation length is determined according to the pre-established dictionary tree and the target user agent character string, so that the word segmentation length is determined dynamically according to the characteristics of the target user agent character string. A complete browser identifier is thus contained in a target feature word rather than being split across different target feature words, which further improves the parsing efficiency and parsing accuracy of the user agent character string.
EXAMPLE IV
The present embodiment provides a data analysis apparatus, and referring to fig. 4, the apparatus specifically includes:
a target feature vector generation module 410, configured to obtain a target user agent string to be analyzed, and pre-process the target user agent string to generate a target feature vector;
the analysis module 420 is configured to input the target feature vector into a pre-trained data analysis model, determine a target browser identifier in the target user agent string according to information output by the data analysis model, and use the target browser identifier as an analysis result of the target user agent string; wherein, the data analysis model is a machine learning model.
Optionally, the target feature vector generating module 410 includes:
the word segmentation sub-module is used for segmenting the target user agent character string to obtain each target feature word;
the target feature vector generation submodule is used for generating the target feature vector of the target user agent character string by using an improved word frequency inverse file frequency algorithm according to each target feature word and a pre-constructed user agent character string set; the improved word frequency inverse file frequency algorithm comprises a word weight determination sub-algorithm, which is used for determining the recognition degree of each target feature word with respect to the target user agent character string.
Optionally, the word segmentation sub-module is specifically configured to:
determining a target word segmentation length according to a pre-established dictionary tree and a target user agent character string, wherein each browser identifier is stored in the dictionary tree;
and segmenting the target user agent character string according to the target segmentation length to obtain each target characteristic word.
Further, the word segmentation sub-module is specifically configured to:
taking the preset word segmentation length as the current word segmentation length, and performing word segmentation on the target user agent character string according to the current word segmentation length to obtain word segmentation results;
matching each word segmentation result with each browser identification in a dictionary tree, determining the current matching degree corresponding to the current word segmentation length, and recording the current word segmentation length and the current matching degree;
if the current word segmentation length is smaller than the character length of the target user agent character string, increasing the current word segmentation length according to a preset step length, and returning to execute the step of performing word segmentation on the target user agent character string according to the current word segmentation length to obtain each word segmentation result;
and determining the current word segmentation length corresponding to the current matching degree with the maximum value in the current matching degrees as the target word segmentation length.
Further, the word segmentation sub-module is further specifically configured to:
after matching each word segmentation result with each browser identification in a dictionary tree and determining the current matching degree corresponding to the current word segmentation length, if the current matching degree is equal to or greater than the preset matching degree, taking the current word segmentation length as the target word segmentation length of the target user agent character string;
and if the current matching degree is smaller than the preset matching degree, increasing the current word segmentation length according to the preset step length, and returning to execute the step of segmenting the target user agent character string according to the current word segmentation length to obtain each word segmentation result.
Optionally, the target feature vector generation submodule is specifically configured to:
determining the word weight of each target feature word by using the word weight determination sub-algorithm in the improved word frequency inverse file frequency algorithm according to each target feature word, the browser category corresponding to each target feature word and the user agent character string set;
determining an initial word frequency inverse file frequency value of each target characteristic word by utilizing a word frequency inverse file frequency algorithm in an improved word frequency inverse file frequency algorithm according to each target characteristic word and a user agent character string set;
and determining the target word frequency inverse file frequency value of each target characteristic word by utilizing an improved word frequency inverse file frequency algorithm according to the initial word frequency inverse file frequency value and the word weight of each target characteristic word, and constructing a target characteristic vector of the target user agent character string by utilizing each target word frequency inverse file frequency value.
Further, the target feature vector generation submodule is further configured to:
after the word weight of each target feature word is determined, screening the target feature words down to a preset number according to the word weight of each target feature word.
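The screening step can be sketched as a top-N selection. The text only states that the feature words are screened to a preset number according to their word weights; keeping the words with the highest weights is an assumption of this sketch, and the function name and sample weights are hypothetical.

```python
def screen_feature_words(word_weights, preset_number):
    """Keep the `preset_number` feature words with the highest word weights."""
    ranked = sorted(word_weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:preset_number]]


# Hypothetical word weights for the feature words of one user agent string.
weights = {"Mozilla": 0.5, "Chrome": 1.0, "Safari": 0.67, "AppleWebKit": 0.4}
kept = screen_feature_words(weights, preset_number=2)
```

Limiting the number of feature words in this way also bounds the dimension of the feature vector that is built from them.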
Optionally, on the basis of the apparatus, the apparatus further includes a model training module, configured to pre-train the data analysis model by:
acquiring a user agent character string set, and preprocessing each user agent character string in the user agent character string set to obtain each feature vector;
inputting at least one training feature vector in each feature vector into a data analysis model to obtain a browser prediction result corresponding to each training feature vector, and determining model deviation by using a preset loss function according to each browser prediction result and each browser verification result corresponding to the corresponding training feature vector;
if the model deviation does not meet the preset convergence condition, performing error back-propagation of the model deviation by using a logistic regression algorithm to iteratively train the data analysis model;
inputting at least one test feature vector in each feature vector into a data analysis model to obtain a browser test result corresponding to each test feature vector, and determining the model accuracy of the data analysis model according to each browser test result and each browser verification result corresponding to the corresponding test feature vector;
and if the model accuracy is less than the preset accuracy, returning to execute the step of acquiring the user agent character string set.
The gradient ascent algorithm in the logistic regression algorithm is a stochastic gradient ascent algorithm: in each iteration of model training, one training feature vector is randomly selected from the training feature vectors and used for the current iteration of training the data analysis model.
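Stochastic gradient ascent for logistic regression can be sketched as below. This is a simplified illustration, not the training procedure of the embodiment: it handles only two classes, uses a fixed learning rate and iteration count instead of a convergence condition, and its function names, toy feature vectors and labels are all hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sga(features, labels, learning_rate=0.1, iterations=1000, seed=0):
    """Each iteration picks ONE random training vector and moves the weights
    along the gradient of the log-likelihood for that sample."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    for _ in range(iterations):
        i = rng.randrange(len(features))
        x, y = features[i], labels[i]
        prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        error = y - prediction  # gradient factor for this sample
        weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights


def predict(weights, x):
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0


# Toy feature vectors: [bias, weight of word A, weight of word B].
features = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.1], [1.0, 0.0, 1.0], [1.0, 0.1, 1.0]]
labels = [1, 1, 0, 0]
weights = train_sga(features, labels)
accuracy = sum(predict(weights, x) == y for x, y in zip(features, labels)) / len(labels)
```

Using a single randomly chosen sample per update keeps each iteration cheap compared with a full pass over the training set, which is the usual motivation for the stochastic variant.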
Through the data analysis device of the fourth embodiment of the invention, the user agent character string can be analyzed more efficiently and more accurately.
The data analysis device provided by the embodiment of the invention can execute the data analysis method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the data analysis apparatus, each included unit and each included module are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
EXAMPLE V
Referring to fig. 5, the present embodiment provides an electronic device, which includes: one or more processors 520; and a storage device 510 configured to store one or more programs, wherein when the one or more programs are executed by the one or more processors 520, the one or more processors 520 implement the data parsing method provided in the embodiment of the present invention, including:
acquiring a target user agent character string to be analyzed, and preprocessing the target user agent character string to generate a target feature vector;
inputting the target feature vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string; wherein the data analysis model is a machine learning model.
Of course, those skilled in the art can understand that the processor 520 may also implement the technical solution of the data parsing method provided in any embodiment of the present invention.
The electronic device shown in fig. 5 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present invention. As shown in fig. 5, the electronic device includes a processor 520, a storage device 510, an input device 530, and an output device 540; the number of processors 520 in the electronic device may be one or more, and one processor 520 is taken as an example in fig. 5; the processor 520, the storage device 510, the input device 530, and the output device 540 in the electronic device may be connected by a bus or by other means, and connection by a bus 550 is taken as an example in fig. 5.
The storage device 510 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the data parsing method in the embodiment of the present invention (for example, a target feature vector generation module and a parsing module in the data parsing device).
The storage device 510 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the terminal, and the like. Further, the storage device 510 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage device 510 may further include memory located remotely from the processor 520, which may be connected to the electronic device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 540 may include a display device such as a display screen.
EXAMPLE VI
The present embodiment provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a data parsing method, the method comprising:
acquiring a target user agent character string to be analyzed, and preprocessing the target user agent character string to generate a target feature vector;
inputting the target feature vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string; wherein the data analysis model is a machine learning model.
Of course, the computer-executable instructions contained in the storage medium provided by the embodiment of the present invention are not limited to the method operations described above, and may also perform related operations in the data parsing method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions to enable an electronic device (which may be a personal computer, a server, or a network device) to execute the data parsing method provided by the embodiments of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (12)
1. A data parsing method, comprising:
acquiring a target user agent character string to be analyzed, and preprocessing the target user agent character string to generate a target characteristic vector;
inputting the target characteristic vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string; wherein the data analysis model is a machine learning model.
2. The method of claim 1, wherein preprocessing the target user agent string to generate a target feature vector comprises:
performing word segmentation on the target user agent character string to obtain each target characteristic word;
generating the target characteristic vector of the target user agent character string by utilizing an improved word frequency inverse file frequency algorithm according to each target characteristic word and a pre-constructed user agent character string set; the improved word frequency inverse file frequency algorithm comprises a word weight determining sub-algorithm, and the word weight determining sub-algorithm is used for determining the recognition degree of each target feature word to the target user agent character string.
3. The method of claim 2, wherein segmenting the target user agent string to obtain target feature words comprises:
determining a target word segmentation length according to a pre-established dictionary tree and the target user agent character string, wherein each browser identifier is stored in the dictionary tree;
and segmenting the target user agent character string according to the target segmentation length to obtain each target characteristic word.
4. The method of claim 3, wherein determining a target word segmentation length according to a pre-established dictionary tree and the target user agent character string comprises:
taking a preset word segmentation length as a current word segmentation length, and segmenting the target user agent character string according to the current word segmentation length to obtain word segmentation results;
matching each word segmentation result with each browser identification in the dictionary tree, determining the current matching degree corresponding to the current word segmentation length, and recording the current word segmentation length and the current matching degree;
if the current word segmentation length is smaller than the character length of the target user agent character string, increasing the current word segmentation length according to a preset step length, and returning to execute the step of performing word segmentation on the target user agent character string according to the current word segmentation length to obtain word segmentation results;
and determining the current word segmentation length corresponding to the current matching degree with the largest value in the current matching degrees as the target word segmentation length.
5. The method according to claim 4, wherein after matching each of the segmentation results with each browser identifier in the dictionary tree and determining a current matching degree corresponding to a current segmentation length, the method further comprises:
if the current matching degree is equal to or greater than the preset matching degree, taking the current word segmentation length as the target word segmentation length of the target user agent character string;
and if the current matching degree is smaller than the preset matching degree, increasing the current word segmentation length according to a preset step length, and returning to execute the step of performing word segmentation on the target user agent character string according to the current word segmentation length to obtain each word segmentation result.
6. The method of claim 2, wherein generating the target feature vector using a modified word-frequency inverse file frequency algorithm based on each of the target feature words and a pre-constructed set of user agent strings comprises:
determining the word weight of each target feature word by using the word weight determination sub-algorithm in the improved word frequency inverse file frequency algorithm according to each target feature word, the browser category corresponding to each target feature word and the user agent character string set;
determining an initial word frequency inverse file frequency value of each target characteristic word by using a word frequency inverse file frequency algorithm in the improved word frequency inverse file frequency algorithm according to each target characteristic word and the user agent character string set;
and determining a target word frequency inverse file frequency value of each target feature word by using an improved word frequency inverse file frequency algorithm according to the initial word frequency inverse file frequency value and the word weight of each target feature word, and constructing the target feature vector of the target user agent character string by using each target word frequency inverse file frequency value.
7. The method of claim 6, after determining the word weight for each of the target feature words, further comprising:
and screening each target characteristic word according to a preset quantity according to the word weight of each target characteristic word.
8. The method of claim 1, wherein the data parsing model is pre-trained by:
acquiring a user agent character string set, and preprocessing each user agent character string in the user agent character string set to obtain each feature vector;
inputting at least one training feature vector in each feature vector into the data analysis model to obtain a browser prediction result corresponding to each training feature vector, and determining model deviation by using a preset loss function according to each browser prediction result and each browser verification result corresponding to the corresponding training feature vector;
if the model deviation does not meet the preset convergence condition, performing error back-propagation of the model deviation by using a logistic regression algorithm to iteratively train the data analysis model;
inputting at least one test feature vector in each feature vector into the data analysis model to obtain a browser test result corresponding to each test feature vector, and determining the model accuracy of the data analysis model according to each browser test result and each browser verification result corresponding to the corresponding test feature vector;
and if the model accuracy is less than the preset accuracy, returning to execute the step of acquiring the user agent character string set.
9. The method of claim 8, wherein the gradient ascent algorithm in the logistic regression algorithm is a stochastic gradient ascent algorithm, such that during each iterative training process of the model, a training feature vector is randomly selected from the training feature vectors for the current iterative training of the data analysis model.
10. A data analysis device, comprising:
the target characteristic vector generation module is used for acquiring a target user agent character string to be analyzed and preprocessing the target user agent character string to generate a target characteristic vector;
the analysis module is used for inputting the target characteristic vector into a pre-trained data analysis model, and determining a target browser identifier in the target user agent character string according to information output by the data analysis model to serve as an analysis result of the target user agent character string; wherein the data analysis model is a machine learning model.
11. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data parsing method as recited in any of claims 1-9.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data parsing method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911026006.5A CN112711678A (en) | 2019-10-25 | 2019-10-25 | Data analysis method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911026006.5A CN112711678A (en) | 2019-10-25 | 2019-10-25 | Data analysis method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112711678A true CN112711678A (en) | 2021-04-27 |
Family
ID=75540957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911026006.5A Pending CN112711678A (en) | 2019-10-25 | 2019-10-25 | Data analysis method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112711678A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114238965A (en) * | 2021-11-17 | 2022-03-25 | 北京华清信安科技有限公司 | Detection analysis method and system for malicious access |
CN114598597A (en) * | 2022-02-24 | 2022-06-07 | 烽台科技(北京)有限公司 | Multi-source log analysis method and device, computer equipment and medium |
CN114598597B (en) * | 2022-02-24 | 2023-12-01 | 烽台科技(北京)有限公司 | Multisource log analysis method, multisource log analysis device, computer equipment and medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110580308B (en) | Information auditing method and device, electronic equipment and storage medium | |
CN112035599A (en) | Query method and device based on vertical search, computer equipment and storage medium | |
CN112257419A (en) | Intelligent retrieval method and device for calculating patent document similarity based on word frequency and semantics, electronic equipment and storage medium thereof | |
CN111985228A (en) | Text keyword extraction method and device, computer equipment and storage medium | |
CN110597844A (en) | Heterogeneous database data unified access method and related equipment | |
CN115830649A (en) | Network asset fingerprint feature identification method and device and electronic equipment | |
CN114818643A (en) | Log template extraction method for reserving specific service information | |
CN111460114A (en) | Retrieval method, device, equipment and computer readable storage medium | |
CN111723192B (en) | Code recommendation method and device | |
CN117592470A (en) | Low-cost gazette data extraction method driven by large language model | |
CN112711678A (en) | Data analysis method, device, equipment and storage medium | |
CN113656575B (en) | Training data generation method and device, electronic equipment and readable medium | |
CN111104422B (en) | Training method, device, equipment and storage medium of data recommendation model | |
CN117235137B (en) | Professional information query method and device based on vector database | |
CN113590811A (en) | Text abstract generation method and device, electronic equipment and storage medium | |
CN112613176A (en) | Slow SQL statement prediction method and system | |
CN117435189A (en) | Test case analysis method, device, equipment and medium of financial system interface | |
CN114528908B (en) | Network request data classification model training method, classification method and storage medium | |
CN116366312A (en) | Web attack detection method, device and storage medium | |
CN116955534A (en) | Intelligent complaint work order processing method, intelligent complaint work order processing device, intelligent complaint work order processing equipment and storage medium | |
CN116578700A (en) | Log classification method, log classification device, equipment and medium | |
CN113420127A (en) | Threat information processing method, device, computing equipment and storage medium | |
CN114338058A (en) | Information processing method, device and storage medium | |
Li et al. | rLLM: Relational table learning with LLMs | |
CN113987490B (en) | Malicious process detection method, device and system and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||