Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
As described above, in the related art, a preset matching algorithm may be used to improve the efficiency of string matching, for example, an Aho-corestick algorithm (which may be generally referred to as AC automata in the art), a Boyer-Moore algorithm, a Horspool algorithm, or the like may be used.
In order to compare the efficiency of matching the character strings, the efficiency of matching the character strings can be generally determined according to an index of time complexity. Taking an AC automaton as an example, when pattern matching is performed, the time complexity is O (n + m + z), n represents the length of a main string, m represents the length of all substrings, and z represents the number of matches.
Taking an AC automaton as an example, when performing combination matching, the following steps need to be performed:
step 1: and splitting the combined character string, taking the obtained single substring as a basic substring (provided with y basic substrings in total), and obtaining matched characters (provided with x matched substrings in total) matched from the main string by using an AC automaton.
For example, the main string is "zifuchuanpipe", and the combined string is "zi ^ ch", "ua ^ fa", "np ^ uc"; then the base substring may be zi, ch, ua, fa, np, uc (6), the matching substring may be zi, ch, ua, np, uc (5), and the base substring fa does not match.
Step 2: and (3) detecting whether the combined character string can be combined by the characters obtained in the step (1) or not. Therefore, the temporal complexity of the combination matching is 0(x × y).
Wherein x is equal to z under the aforementioned pattern matching, and y is equal to O (n + m + z) under the aforementioned pattern matching; therefore, after converting the combination matching into the aforementioned matching pattern, the time complexity is 0(x × y) ═ 0(z × O (n + m + z)); it can be seen that as the number of the combined character strings increases, the number of the basic substrings may increase, and finally, the time complexity increases continuously; efficiency is severely impacted by using ac automata directly for string matching.
In order to solve the above problem, the present specification provides a matching method of a combined character string, which may be described below with reference to an example shown in fig. 1, and may include the steps of:
step 110: a main character string and a combined character string are obtained.
Step 120: constructing index information of the combined character string and a basic substring; and splitting the basic substring by the combined string to obtain the basic substring.
Step 130: and carrying out mode matching on the main character string based on the basic sub character string to obtain a matched character string.
Step 140: determining a target combined character string from the index information according to the matching character string; wherein the basic substring of the target combination string exists simultaneously with the main string.
The embodiments provided in the present specification can be applied to a device for performing string matching, such as a client or a server.
The client may refer to a client device in hardware, such as a desktop computer, a laptop computer, a tablet computer, a smart phone, a handheld computer, a personal digital assistant ("PDA"), or any other wired or wireless processor-driven device.
The client may also refer to an application client on software, such as an instant messaging APP, an electronic wallet APP, and the like, and these application clients may have a requirement for matching character strings, such as retrieving chat records in the instant messaging APP, searching for merchant information in the electronic wallet APP, and the like.
The client also can be a soft and hard combined client, such as a smart phone with an electronic wallet APP installed.
In an embodiment, the matching of the character strings may be performed locally by the client, that is, the client may perform the matching of the character strings based on a local computing resource. In practical applications, after determining the main character string, the user may input the combined character string, so that the client automatically performs matching of the combined character string.
In an embodiment, the server may include a server for character string matching, a server cluster, or a cloud platform constructed based on the server cluster.
In one embodiment, the server may respond to a string matching request initiated by the client. In practical application, after determining the main character string, the user can input the combined character string, the client can further send the main character string and the combined character string to the server, and the server completes matching of the combined character string.
In an embodiment, the Index information may include Inverted Index (Inverted Index) information. Namely: the index information for constructing the combined string and the basic sub-string may specifically include:
establishing inverted index information of the combined character string and the basic substring based on the HashMap;
and counting the number of basic substrings in the combined string, and recording the number into the inverted index information.
HashMap (HashMap) is a way to efficiently establish a mapping relationship.
The inverted index is a database retrieval mode with high efficiency, and is particularly suitable for retrieval under mass data. There are many specific ways for inverted indexes, and in a broad sense, a search server based on Lucene (full text search engine) can be regarded as using inverted indexes; for example including ES
(ElasticSearch)、Solr。
It should be noted that the use of the inverted index is only one implementation manner provided in this specification, and any other manner that can establish index information in practical application may be used as an implementation manner of the embodiment in this specification.
In one embodiment, the base substring may be obtained by:
splitting the combined character string to obtain a single character string;
and taking the single character string as a basic substring.
Fig. 2 is a schematic diagram of a combined string matching process provided in this specification. In this example, the master string is "BACE", and the combined strings are "A ^ B", "C ^ D", "A ^ C", "E"; first, the reverse index information needs to be constructed:
step 11: splitting the combined character string to obtain a basic sub string;
splitting the 'A ^ B' to obtain single character strings 'A' and 'B';
splitting the 'C ^ D' to obtain a single character string 'C' and 'D';
splitting the 'A ^ C' to obtain a single character string 'A' and 'C';
splitting the 'E' to obtain a single character string 'E';
after deduplication, the remaining base substrings are "a", "B", "C", "D", and "E".
The individual character strings described in this specification do not refer to a character string of one character, but character strings separated by a symbol "^" such as a combined character string "abc ^ ac ^ db" split the resulting base substrings into "abc", "ac", and "db".
Step 12: establishing inverted index information of the combined character string and the basic substring based on the HashMap, which can be as follows:
A→A^B,A^C
B→A^B
C→A^C,C^D
D→C^D
E→E
wherein A → A ^ B, A ^ C represent the base substring A to map A ^ B, A ^ C; the others are similar.
Step 13: counting the number of the basic substrings in the combined string, and recording the number into the inverted index information, which can be as follows:
A→A^B(2),A^C(2)
B→A^B(2)
C→A^C(2),C^D(2)
D→C^D(2)
E→E(1)
wherein A → A ^ B (2), A ^ C (2) represents the base substring A maps A ^ B, A ^ C, and the combination string A ^ B has 2 base substrings, A ^ C has 2 base substrings; the others are similar.
In one embodiment, the pattern matching may be implemented using an AC automaton. That is, the performing pattern matching on the main character string based on the basic sub character string to obtain a matching character string may specifically include:
establishing a corresponding AC automaton aiming at the basic substring;
and carrying out pattern matching on the main character string based on the AC automaton of each basic character string to obtain a matched character string.
As shown in fig. 2, the main string is "BACE", and the base string has "a", "B", "C", "D", and "E"; and performing pattern matching on the main strings by the AC automata based on the 5 basic substrings respectively:
according to the AC automaton of the basic substring A, whether the character A exists in the main string BACE matching; because A exists in the main string, matching can be successful, and a matching character string 'A' can be obtained;
according to the AC automaton of the basic substring B, whether the character B exists in the main string BACE matching; because B exists in the main string, matching can be successful, and a matching character string 'B' can be obtained;
according to the AC automaton of the basic substring C, whether the character C exists in the main string BACE matching; because C exists in the main string, matching can be successful, and a matching character string 'C' can be obtained;
according to an AC automaton of the basic substring D, whether a character D exists in the main string BACE matching; because D does not exist in the main string, the matching can fail, and null is output, or the next pattern matching is directly entered;
according to the AC automaton of the basic substring E, whether the character E exists in the main string BACE matching; because E exists in the main string, matching can be successful, and a matching character string 'E' can be obtained;
in summary, matching strings "a", "B", "C", and "E" can be obtained.
In an embodiment, the determining a target combination string from the index information according to the matching string may specifically include:
searching the combined character string mapped by the matched character string from the index information;
counting the frequency of searching each combined character string;
and determining the combined character strings with the frequency consistent with the number of the basic sub character strings as target combined character strings.
As shown in fig. 2, after the matching character string is obtained, the combined character string mapped by the matching character string may be searched from the foregoing index information:
the combined string that matches the string "A" mapping includes "A ^ B" and "A ^ C";
the combined string that matches the string "B" mapping includes "A ^ B";
the combined string that matches the string "C" mapping includes "C ^ D" and "A ^ C";
the combined string that matches the string "E" mapping includes "E".
Further, the frequency of searching each combined character string is counted:
as can be seen, the combination string "A ^ B" is searched by the matching strings "A" and "B", so the frequency of the combination string "A ^ B" is 2;
the combined character string 'A ^ C' is searched by the matched character strings 'A' and 'C', so the frequency of the combined character string 'A ^ C' is 2;
the combined character string 'C ^ D' is searched by the matched character string 'C', so the frequency of the combined character string 'C ^ D' is 1;
the combination string "E" is searched for by the matching string "E", and thus, the frequency of the combination string "E" is 1.
Finally, the frequency of the combined string is compared with the number of basic substrings, which have been obtained in the preceding step 13, and can therefore be obtained directly from the index information:
the frequency of the combined character string 'A ^ B' is 2, and the number of the basic substrings is 2, so that the combined character string 'A ^ B' is consistent with the basic substrings;
the frequency of the combined character string 'A ^ C' is 2, the number of the basic substrings is 2, and the combined character string 'A ^ C' is consistent with the basic substrings;
the frequency of the combined character string 'C ^ D' is 1, the number of the basic substrings is 2, and the combined character string 'C ^ D' is inconsistent with the basic substrings;
the frequency of the combined character string E is 1, and the number of the basic substrings is 1, so that the combined character string E is consistent with the basic substrings;
in summary, the finally determined target combination strings can be "A ^ B", "A ^ C" and "E"; the target combination string may be output.
The efficiency of the above process is illustrated by the time complexity:
1. assume that the number of obtained combination strings is d, and the number of base substrings is y (where d < ═ y).
2. The number of matching strings obtained by the pattern matching in step 130 is set as x.
3. Assuming that the number of combined strings of the matching string map in step 140 is set to m, the time complexity of this step is 0(x), i.e., x times of searching.
4. The number of the mapped combined character strings after de-duplication is assumed to be n (wherein n < ═ d); the time complexity of this step is 0(m) ═ 0(x × T), where T is the average length of the index sequence, according to zipf law T < < d; thus, the temporal complexity of this step < <0(x × d);
5. the time complexity of step 140 is 0(n), i.e., n combined strings are compared for n frequencies and numbers after deduplication.
In summary, the total time complexity is 0(x) +0(m) +0(n) ═ 0(x + m + n).
Because O (x + m + n) < <0(x + x d + d) <0(x y); where 0(x y) is the aforementioned temporal complexity of the combinatorial matching using existing AC automata; therefore, after the embodiment provided by the specification is adopted, the time complexity is far smaller than that of a mode of only adopting an AC automaton, and the matching efficiency of the combined character strings is improved.
In the embodiments provided in this specification, an indexing technique (e.g., inverted index) is used to construct information of a base substring of a combined string; then, performing pattern matching on the basic substrings by using an AC (alternating current) automatic machine, and finally screening matched character strings based on index information so as to obtain target combined character strings existing in the main string; the efficiency of the matching of the combined character strings is effectively improved.
Corresponding to the embodiment of the matching method of the combined character string, the specification also provides an embodiment of a matching device of the combined character string. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer business program instructions in the nonvolatile memory into the memory for operation through the processor of the device in which the device is located. From a hardware aspect, as shown in fig. 3, the hardware structure diagram of the apparatus where the matching device for combining character strings is located in this specification is shown, except for the processor, the network interface, the memory, and the nonvolatile memory shown in fig. 3, the apparatus where the device is located in the embodiment may also include other hardware according to the actual matching function of the combining character strings, which is not described again.
Referring to fig. 4, a block diagram of an apparatus for matching a composite character string according to an embodiment of the present disclosure is provided, where the apparatus corresponds to the embodiment shown in fig. 1, and the apparatus includes:
an acquisition unit 210 that acquires a main character string and a combination character string;
an index establishing unit 220 for establishing index information of the basic sub-character string aiming at the combined character string; wherein the basic substring is obtained by splitting the combined string;
a pattern matching unit 230, performing pattern matching on the main character string based on the basic sub character string to obtain a matching character string;
a combination matching unit 240 for determining a target combination character string from the index information according to the matching character string; wherein the basic substring of the target combination string exists simultaneously with the main string.
Optionally, the index establishing unit 220 specifically includes:
establishing inverted index information of the combined character string and the basic substring based on the HashMap;
and counting the number of basic substrings in the combined string, and recording the number into the inverted index information.
Optionally, the pattern matching unit 230 specifically includes:
establishing a corresponding AC automaton aiming at the basic substring; and carrying out pattern matching on the main character string based on the AC automaton of each basic character string to obtain a matched character string.
Optionally, the combination matching unit 240 specifically includes:
searching the combined character strings mapped by the matched character strings from the index information, and counting the frequency of searching each combined character string; and determining the combined character strings with the frequency consistent with the number of the basic sub character strings as target combined character strings.
Optionally, the apparatus further comprises:
the basic sub-string obtaining unit splits the combined character string to obtain a single character string; and using the single character string as a base substring.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
Fig. 4 above describes the internal functional modules and the structural schematic of the combined character string matching apparatus, and the substantial execution subject may be an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a main character string and a combined character string;
constructing index information of the combined character string and a basic substring; wherein the basic substring is obtained by splitting the combined string;
performing mode matching on the main character string based on the basic sub character string to obtain a matched character string;
determining a target combined character string from the index information according to the matching character string; wherein the basic substring of the target combination string exists simultaneously with the main string.
Optionally, the constructing index information of the basic character for the combined character string specifically includes:
establishing inverted index information of the combined character string and the basic substring based on the HashMap;
and counting the number of basic substrings in the combined string, and recording the number into the inverted index information.
Optionally, the performing pattern matching on the main character string based on the basic sub character string to obtain a matching character string specifically includes:
establishing a corresponding AC automaton aiming at the basic substring;
and carrying out pattern matching on the main character string based on the AC automaton of each basic character string to obtain a matched character string.
Optionally, the determining a target combined character string from the index information according to the matching character string specifically includes:
searching the combined character string mapped by the matched character string from the index information;
counting the frequency of searching each combined character string;
and determining the combined character strings with the frequency consistent with the number of the basic sub character strings as target combined character strings.
Optionally, the method further includes:
splitting the combined character string to obtain a single character string;
and taking the single character string as a basic substring.
In the above embodiments of the electronic device, it should be understood that the Processor may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and the aforementioned memory may be a read-only memory (ROM), a Random Access Memory (RAM), a flash memory, a hard disk, or a solid state disk. The steps of a method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiment of the electronic device, since it is substantially similar to the embodiment of the method, the description is simple, and for the relevant points, reference may be made to part of the description of the embodiment of the method.
Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.