
CN111400361B - Data real-time storage method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN111400361B
Authority
CN
China
Prior art keywords
data
preset
log
text sequence
converted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010090231.1A
Other languages
Chinese (zh)
Other versions
CN111400361A (en
Inventor
饶鑫
黄望
石晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202010090231.1A
Publication of CN111400361A
Application granted
Publication of CN111400361B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application belongs to the field of data structuring processing, and discloses a data real-time storage method, a data real-time storage device, computer equipment and a storage medium. The method comprises classifying collected log data, writing the classified log data that meets log screening conditions into kafka for buffering, segmenting the log data read from kafka to obtain a character string sequence, matching the segmented character string sequence with a key value text sequence, and storing the data in real time through hbase to obtain structured data. In the structuring step, designated characters found in the text sequence to be determined are replaced with a preset cutting character, and the text sequence to be determined is cut at the preset cutting character. The method solves the technical problem that the user cannot analyze the log data in real time.

Description

Data real-time storage method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing, and in particular, to a method and apparatus for storing data in real time, a computer device, and a storage medium.
Background
In the prior art, to analyze the user access logs produced by a server, the log data on the server can be written in real time, through Hive, a data warehouse tool based on Hadoop, into a database table created in advance in Hive, and operations such as query and analysis are then performed on the data in that table, thereby realizing real-time analysis of the log data. However, when the amount of required log data is relatively large and massive data on the server is written directly into Hive, Hive's own limitations can cause data congestion, so that writes to Hive are delayed, generally for a relatively long time, which affects the storage and analysis of the log data.
Disclosure of Invention
Based on the above, it is necessary to solve the above technical problems, and the present application provides a method, an apparatus, a computer device and a storage medium for storing data in real time, so as to solve the technical problem that in the prior art, the storage and analysis of log data are affected due to serious delay when the read log data is directly written into hive for storage.
A method of data real-time storage, the method comprising:
reading a preset configuration file to obtain log screening conditions, acquiring log data conforming to the log screening conditions from the acquired log data, and writing the acquired log data into kafka according to the preset configuration file to serve as data to be converted;
the data to be converted is read from kafka at regular time, the read data to be converted is cut and matched according to preset slicing conditions to obtain a text sequence to be determined, and the text sequence to be determined is written into a preset data table on hbase;
generating a structure data soft link pointing to the preset database table for hive according to the storage path of the preset database table on the hbase;
And cutting the text sequence to be determined, which is acquired according to the structural data soft link, through a regular expression, and writing the structural data obtained after the cutting process into a structural database on the hive.
A data real-time storage device, the device comprising:
The data buffer module is used for reading a preset configuration file to obtain log screening conditions, acquiring log data conforming to the log screening conditions from the acquired log data, and writing the acquired log data into kafka according to the preset configuration file to serve as data to be converted;
The sequence matching module is used for regularly reading the data to be converted from the kafka, cutting and matching the read data to be converted according to preset slicing conditions to obtain a text sequence to be determined, and writing the text sequence to be determined into a preset data table on hbase;
The link pointing module is used for generating a structure data soft link pointing to the preset database table for hive according to the storage path of the preset database table on the hbase;
And the structuring module is used for cutting the text sequence to be determined, which is acquired according to the structural data soft link, through a regular expression, and writing the structured data obtained after the cutting process into the structured database on the hive.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the data real-time storage method described above when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the data real-time storage method described above.
According to the method, the device, the computer equipment and the storage medium for storing data in real time, the collected log data is classified, the classified log data that meets the log screening conditions is written into kafka for buffering, the log data read from kafka is segmented to obtain a character string sequence, the segmented character string sequence, whose meaning is not yet clear, is matched with a key value text sequence, and the data is stored in real time through hbase to obtain structured data. Before the data is input into hive for storage, it is buffered through hbase, which prevents data congestion and delayed writes to hive when massive data is processed, thereby solving the technical problem that a user cannot analyze log data in real time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an application environment for a data real-time storage method;
FIG. 2 is a flow chart of a method for real-time storage of data;
FIG. 3 is a flow chart illustrating step 204 in FIG. 2;
FIG. 4 is a flow chart of step 302 in FIG. 3;
FIG. 5 is another flow chart of step 306 in FIG. 3;
FIG. 6 is a flow chart of step 202 in FIG. 2;
FIG. 7 is a schematic diagram of a data real-time storage device;
FIG. 8 is a schematic diagram of a computer device in one embodiment.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The data real-time storage method provided by the embodiment of the invention can be applied to an application environment shown in figure 1. The application environment may include, among other things, a terminal 102, a network 106, and a server 104, the network 106 being configured to provide a communication link medium between the terminal 102 and the server 104, the network 106 may include various connection types, such as wired, wireless communication links, or fiber optic cables, etc.
A user may interact with the server 104 using the terminal 102 over the network 106 to receive or send messages, etc. The terminal 102 may have installed thereon various communication client applications such as web browser applications, shopping class applications, search class applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal 102 may be any of a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 players, laptop computers, desktop computers, and the like.
The server 104 may be a server that provides various services, such as a background server that provides support for pages displayed on the terminal 102.
It should be noted that, the method for storing data in real time provided by the embodiment of the present application is generally executed by a server/terminal, and accordingly, the device for storing data in real time is generally disposed in the server/terminal device.
It should be understood that the number of terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The terminal 102 communicates with the server 104 through a network. The server 104 acquires log data from the terminal 102, screens the log data, writes the screened log data into kafka for buffering, cuts and matches the data, cuts and denoises it through a regular expression, and stores the resulting structured data in a database on hive. The network may be a wired network or a wireless network. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device; the terminal 102 may also be a server storing log data. The server 104 may be implemented by an independent server or by a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for storing data in real time is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
Step 202, reading a preset configuration file to obtain log screening conditions, acquiring log data meeting the log screening conditions from the acquired log data, and writing the acquired log data into kafka according to the preset configuration file to serve as data to be converted.
The data source of the log data is generally a server or a background corresponding to the terminal. The information about user access to the system stored on the server is typically generated by web containers such as Nginx (a high-performance HTTP and reverse proxy web server that also provides IMAP/POP3/SMTP services), Tomcat (a free, open-source web application server) and WebLogic (a Java application server for developing, integrating, deploying and managing large distributed web applications, network applications and database applications), and may contain field information such as:
system_name,source,agent,status,method,byte,connecttime,type,appname,request URL,clientip,serverip,acctime, etc.
However, the data is not limited to log data: any data satisfying a predetermined screening condition, such as a format or type of log data agreed upon by the developers, can be processed by the present solution. Log data is used here merely to illustrate one particular application scenario.
The preset configuration file contains the acquisition path of the log data on the data source, the screening information specifying the type or format of the log data that the server side needs to acquire, and the like. It is written to the server side in advance, so that when log data is acquired and screened, the server side only needs to read the preset configuration file and process the log data according to the acquisition path and the screening information in it. In this embodiment, the server may be regarded as a web container. The format and type of the log data generated by a web container vary, so before analysis the preset configuration file is read to obtain the log screening conditions, and after log data is collected from the web container it is screened according to these conditions. The server side acquires the log data from the web container according to the acquisition path as the data to be screened. The data to be screened is log data that has not yet been classified; it may contain log data of various types and formats mixed together, which is messy and inconvenient for subsequent processing and analysis.
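The role of the preset configuration file can be illustrated with a minimal Python sketch. The JSON layout, the key names `acquisition_path` and `required_prefixes`, and the sample lines are assumptions made for illustration only; the application does not specify the file format.

```python
import json

# Hypothetical configuration text; the format and key names are assumptions.
CONFIG_TEXT = """
{
  "acquisition_path": "/var/log/web/access.log",
  "screening": {"required_prefixes": ["weblogic_acc"]}
}
"""


def load_screening_conditions(config_text):
    """Parse the preset configuration and return (acquisition path, prefixes)."""
    config = json.loads(config_text)
    prefixes = tuple(config["screening"]["required_prefixes"])
    return config["acquisition_path"], prefixes


def screen_logs(lines, prefixes):
    """Keep only the log lines that start with one of the configured prefixes."""
    return [line for line in lines if line.startswith(prefixes)]


path, prefixes = load_screening_conditions(CONFIG_TEXT)
raw_lines = [
    "weblogic_acc\tNanning\tGET\t200",
    "nginx_acc\tNanning\tGET\t404",
]
screened = screen_logs(raw_lines, prefixes)
```

Here the screening condition is a simple prefix test; a real deployment might screen on any format or type information agreed upon by the developers.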
Optionally, one line of log information acquired in this embodiment may have the following format:
weblogic_acc appName,agent Nanning 0.012 17:53:54-GET 675 200 /life/selfhire.getAccountInfo 202.103.238.166
A line log of another format of log information may be:
DRV_LOG_ERROR("[0x%08x]-[DWSdk.errorcode=0x%08x]Init DwSDK filded", DRV_INIT_FAILED,initRet)
From the above, the log data obtained by screening according to the log screening condition is a string which looks meaningless and messy. It is obviously difficult to analyze numerous such log data.
Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of consumers on a website.
Step 204, regularly reading the data to be converted from the kafka, cutting and matching the read data to be converted according to preset slicing conditions to obtain a text sequence to be determined, and writing the text sequence to be determined into a preset data table on hbase.
Optionally, the data to be converted may be pulled from the specified path on kafka every 5s or 25s and then cut and matched according to the preset slicing conditions. The specific period depends on the amount of data: for example, when the data volume is relatively large, the data may be pulled every 5s, and if the log data is generated relatively slowly, it may be pulled every 25s. With such short periods, no noticeable delay is introduced.
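The timed pull can be sketched as follows. This is a minimal sketch, not a Kafka client: `fetch` is a hypothetical callable standing in for the read from kafka, and the `busy_threshold` used to choose between the 5s and 25s periods is an assumed value.

```python
import time


def choose_poll_interval(records_per_second, busy_threshold=1000.0):
    """Return 5 (seconds) for fast-growing data, otherwise 25.

    The threshold is an assumption; the text only says the period
    depends on the amount of data.
    """
    return 5 if records_per_second >= busy_threshold else 25


def poll(fetch, interval_seconds, rounds, sleep=time.sleep):
    """Call fetch() a fixed number of times, waiting between calls."""
    batches = []
    for index in range(rounds):
        batches.append(fetch())
        if index < rounds - 1:
            sleep(interval_seconds)
    return batches


# Usage with an in-memory stand-in for kafka and a no-op sleep.
pending = [["line-1", "line-2"], ["line-3"]]
batches = poll(
    fetch=lambda: pending.pop(0) if pending else [],
    interval_seconds=choose_poll_interval(5000),
    rounds=2,
    sleep=lambda seconds: None,
)
```

Injecting `sleep` keeps the loop testable; in production the default `time.sleep` would be used.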
Further, the read data to be converted can be cut and matched through spark operators: the data to be converted is divided into small character strings, each of which has an independent meaning, the character strings obtained after cutting are used as a character string sequence, and the character string sequence is then matched with its definitions to obtain the text sequence to be determined. The text sequence to be determined is then written into the preset database table on hbase. The primary abstraction provided by Spark is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Spark operators are divided into transformation operators and action operators: an action operator triggers a job, while a transformation operator converts one RDD into another RDD, or converts file system data into an RDD.
Wherein, the preset slicing condition is a standard for cutting the data to be converted. Depending on the particular format or type of data to be converted. Because the data to be converted can be log data customized by a developer, and the format and the type are predefined, after the log to be converted is screened out, the data to be converted can be directly cut and matched according to the preset slicing condition.
Optionally, when the data to be converted is 'weblogic_acc appName,agent Nanning 0.012 17:53:54 -GET 675 200/life/selfhire.getAccountInfo 202.103.238.166', a rule can be found by inspecting the data: it consists of a plurality of character strings separated by the delimiter '\t'. The delimiter can therefore be used as the cutting point to cut the data to be converted, and attribute names are matched with the character strings obtained after cutting to obtain the text sequence to be determined.
The text sequence to be determined is then written in hbase. Wherein Hbase data storage format may be value={"appType":"weblogic_acc","appName":"","xForwardedFor":"","agent":"", "city":"Nanning","accTime":"0.012","connectTime":"17:53:54","source":"", "method":"GET","connectByte":"675","status":"200","requestURL": "/life/selfhire.getAccountInfo","clientIp":"202.103.238.166"......};
It can be seen that the data is stored in hbase in the form of a dictionary: each text sequence to be determined is a dictionary in which each attribute corresponds to its value.
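The cutting and attribute matching described above can be sketched in Python. The field order in `FIELD_NAMES` and the simplified tab-separated sample line are assumptions for illustration; in the application the key names come from the preset configuration.

```python
# Assumed attribute order for a simplified sample line (hypothetical).
FIELD_NAMES = [
    "appType", "city", "accTime", "connectTime", "method",
    "connectByte", "status", "requestURL", "clientIp",
]


def to_text_sequence(line, keys, delimiter="\t"):
    """Cut a log line at the delimiter and pair each piece with its key."""
    values = line.split(delimiter)
    if len(values) != len(keys):
        raise ValueError(
            f"expected {len(keys)} fields, got {len(values)}"
        )
    return dict(zip(keys, values))


sample_line = "\t".join([
    "weblogic_acc", "Nanning", "0.012", "17:53:54", "GET",
    "675", "200", "/life/selfhire.getAccountInfo", "202.103.238.166",
])
record = to_text_sequence(sample_line, FIELD_NAMES)
```

The resulting dictionary has the same shape as the hbase value shown above, with one attribute per cut character string.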
Step 206, generating a structure data soft link pointing to the preset database table for hive according to the storage path of the preset database table on hbase.
A link is a method of establishing a connection between a shared file and the directory entries of the several users accessing it. A soft link, also called a symbolic link, is a file that contains the pathname of another file. The target may be any file or directory, and files on different file systems may be linked. It is similar to a shortcut on Windows. The link file stores the absolute path of the file it represents; it is a separate file that occupies its own block on the hard disk, and the stored path is substituted when the link is accessed.
Hbase is not suitable for complex sql queries, so the data is written to hive for subsequent query and analysis operations. However, due to limitations of hive, data cannot be pulled directly from hbase into hive, so a soft link needs to be established to allow the data to be written to hive.
Step 208, cutting the text sequence to be determined obtained according to the soft link of the structural data through the regular expression, and writing the structural data obtained after the cutting process into a structural database on hive.
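The general soft link concept can be illustrated at the file system level with a short Python sketch. It only demonstrates an operating system symbolic link created in a temporary directory; it is not the mechanism by which hive is pointed at the hbase table, and the file names are hypothetical.

```python
import os
import tempfile


def make_soft_link(target_path, link_path):
    """Create a symbolic link and return the path stored inside it."""
    os.symlink(target_path, link_path)
    return os.readlink(link_path)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "table_data.txt")
    with open(target, "w", encoding="utf-8") as handle:
        handle.write("row-1")

    link = os.path.join(workdir, "table_link")
    stored_path = make_soft_link(target, link)

    # Reading through the link transparently reads the target file.
    with open(link, encoding="utf-8") as handle:
        content_via_link = handle.read()
    link_is_symlink = os.path.islink(link)
```

The link is a separate file whose content is only the target path, which is why it can point across file systems.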
Specifically, the processing of the text sequence to be determined may be to remove ideographic symbols in the text sequence to be determined, and only take numeric values or character strings having representative meanings therein.
Further, replacing the designated characters searched from the text sequence to be determined with preset cutting characters, and cutting the text sequence to be determined at the preset cutting characters.
Specifically, after the text sequence to be determined is obtained from the structure soft link, the obtained text sequence to be determined is processed in a json_complete () mode, and the specific processing steps are as follows:
The text sequence to be determined is obtained, and meaningless characters in it, such as square brackets, curly braces or double quotation marks, are removed and replaced with a designated mark; for example, a double vertical bar is used to replace '{', '}' and '"'. The data is then divided at the double vertical bars into a plurality of small character strings, which are output as the structured data, and the structured data is written into the structured database on hive to finish the storage operation. According to the method, the structured data is written into the hive internal database, the data storage performance is improved by more than 10 times compared with that of the traditional method, and the user access logs produced by the system are provided to the hive database in structured form in real time, so that the structured logs can be further analyzed and the value of the logs can be mined.
In the data real-time storage method, the collected log data is classified, the classified log data that meets the log screening conditions is written into kafka for buffering, the log data read from kafka is segmented to obtain a character string sequence, the segmented character string sequence, whose meaning is not yet clear, is matched with a key value text sequence, and the data is stored in real time through hbase to obtain structured data. Before the data is input into hive for storage, it is buffered through hbase, which prevents data congestion and delayed writes to hive when massive data is processed, thereby solving the technical problem that a user cannot analyze log data in real time.
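The replace-then-cut step can be sketched with a regular expression in Python. This is a minimal sketch that does not reproduce json_complete(); the character class, the "||" mark, and the decision to drop pieces that consist only of ':' or ',' separators are assumptions.

```python
import re

# Characters treated as meaningless: square brackets, curly braces, quotes.
STRIP_PATTERN = re.compile(r'[\[\]{}"]')


def structure_record(text, cut_mark="||"):
    """Replace the designated characters with a cut mark, then cut there.

    Pieces that are empty or contain only separators are discarded
    (an assumption about what counts as meaningless).
    """
    marked = STRIP_PATTERN.sub(cut_mark, text)
    return [piece for piece in marked.split(cut_mark) if piece.strip(" ,:")]


stored = '{"city":"Nanning","connectTime":"17:53:54","status":"200"}'
fields = structure_record(stored)
```

Note that a value such as 17:53:54 survives intact because it is enclosed in quotation marks and is therefore cut out as one piece. Empty values are dropped by this sketch, so a production version would need to keep placeholders to preserve column alignment.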
In one embodiment, as shown in FIG. 3, step 204 includes:
Step 302, determining slicing points of the data to be converted according to preset slicing conditions.
The slicing point is extracted from the preset slicing conditions according to the data to be converted. For example, if every two adjacent character strings in the data to be converted are separated by "\t" so that they do not run together, "\t" can be used as the slicing point: the "\t" slicing point is extracted from the preset slicing conditions, all positions of the slicing point in the data to be converted are located, and the data to be converted is cut at the located positions to obtain a character string sequence.
Further, the character string that occurs most frequently in the data to be converted may be taken as the slicing point. Whether this applies depends on the type of the data to be converted; in general, the most frequent character string is the slicing point contained in the preset slicing conditions.
Step 304, positioning the position coordinates of the slicing point in the data to be converted, and cutting the data to be converted into a character string sequence according to the position coordinates.
Cutting the data to be converted at the slicing points generally does not damage the required data; for example, it does not split one required character string into two.
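Locating the position coordinates of a slicing point and cutting at them can be sketched as below. The function names and the sample string are hypothetical.

```python
def locate_slicing_points(data, point):
    """Return the start index of every occurrence of the slicing point."""
    positions = []
    start = data.find(point)
    while start != -1:
        positions.append(start)
        start = data.find(point, start + len(point))
    return positions


def cut_at(data, positions, point_length):
    """Cut data at the given positions, dropping the slicing point itself."""
    pieces = []
    previous = 0
    for position in positions:
        pieces.append(data[previous:position])
        previous = position + point_length
    pieces.append(data[previous:])
    return pieces


data = "GET\t675\t200"
positions = locate_slicing_points(data, "\t")
pieces = cut_at(data, positions, len("\t"))
```

Because the cut happens only at the located coordinates, a character string that does not contain the slicing point is never split.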
Step 306, acquiring a key value text sequence corresponding to the character string sequence according to a preset matching condition, and associating the two to obtain the text sequence to be determined.
The key value text sequence is generally written into the preset configuration file in advance by a developer, and the preset matching conditions are also included in the preset configuration file. The key value text sequence is acquired according to the preset matching condition, its identifier is acquired, and the character string sequence corresponding to the identifier is associated with the key value text sequence to obtain the text sequence to be determined.
According to the embodiment, the text sequence to be determined is obtained by cutting and matching the data to be converted: the key value text sequence is matched with the character string sequence extracted from the meaningless and irregular data, so that the character string sequence, whose meaning was originally unknown, acquires its corresponding attributes, and the meaning of each character string in the log data is defined.
In one embodiment, as shown in FIG. 4, step 302 includes:
Step 402, counting the occurrences of each character string in each row of the data to be converted to form a frequency array, and calculating the variance of the frequency array.
The data to be converted is traversed to obtain each character string in it, and the number of occurrences of each character string in every row is listed as an array. If the data to be converted has 20 rows, the occurrence counts of the same character string in the 20 rows form the frequency array, and the variance of the frequency array is calculated.
Step 404, if the variance of the frequency array is smaller than a preset threshold, the character string is used as the slicing point.
If the variance is smaller than the threshold, the character string occurs about equally often in every row and may be regarded as a separator, that is, as a slicing point. Of course, this is just one way of obtaining slicing points. Generally, after a slicing point has been determined according to the preset slicing conditions, it can be checked once in this way, which increases the accuracy of cutting the data to be converted.
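The variance check can be sketched in Python with the standard library. The threshold of 0.5 is an assumed value, and the additional requirement that the candidate appears in every row is an assumption added here so that a string that never occurs (variance 0) is not accepted.

```python
from statistics import pvariance


def frequency_array(lines, candidate):
    """Count how often the candidate string occurs in each row."""
    return [line.count(candidate) for line in lines]


def is_slicing_point(lines, candidate, threshold=0.5):
    """Accept the candidate when its per-row frequency is nearly constant.

    The threshold and the 'appears in every row' guard are assumptions.
    """
    counts = frequency_array(lines, candidate)
    return min(counts) > 0 and pvariance(counts) < threshold


regular_rows = ["a\tb\tc", "d\te\tf", "g\th\ti"]
irregular_rows = ["x,y\t1", "x,y,z,w\t2"]

tab_ok = is_slicing_point(regular_rows, "\t")
comma_ok = is_slicing_point(irregular_rows, ",")
comma_variance = pvariance(frequency_array(irregular_rows, ","))
```

In the regular rows the tab occurs exactly twice per row, so the variance is 0 and the tab is accepted; the comma counts of 1 and 3 give a population variance of 1, which exceeds the assumed threshold.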
According to the embodiment, the frequency of the character strings in the data to be converted is calculated to determine the segmentation points, so that the determined segmentation points are accurate, and the accuracy of cutting the data to be converted is increased.
In one embodiment, as shown in FIG. 5, step 306 includes:
Step 502, obtaining the key value text sequence of the character string sequence and the identifier of the key value text sequence.
Each character string sequence has a corresponding key value text sequence, but the two still have to be matched with each other. The application therefore determines which character string sequence corresponds to which key value text sequence by acquiring the identifiers of the character string sequence and of the key value text sequence.
When the character string sequences are generated, the database generates a line label for each character string sequence, which can be used as its identifier. The key value text sequence whose identifier matches this identifier is then obtained for association.
Step 504, assigning the character string sequence corresponding to the identifier to the key-value text sequence.
Specifically, after a character string sequence with consistent identifiers and a key value text sequence are obtained, the character string sequence can be assigned to the key value text sequence to serve as a text to be determined.
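The identifier-based association can be sketched as follows. The in-memory layout, in which both kinds of sequence are stored in dictionaries keyed by identifier, and the identifiers themselves are assumptions for illustration.

```python
def associate(string_sequences, key_sequences):
    """Assign each string sequence to the key sequence with the same identifier.

    Both arguments map an identifier (for example a line label) to a list.
    Sequences without a matching identifier are skipped.
    """
    result = {}
    for identifier, values in string_sequences.items():
        keys = key_sequences.get(identifier)
        if keys is None:
            continue
        result[identifier] = dict(zip(keys, values))
    return result


string_sequences = {
    "row-1": ["GET", "675", "200"],
    "row-2": ["orphan"],
}
key_sequences = {
    "row-1": ["method", "connectByte", "status"],
}
texts_to_determine = associate(string_sequences, key_sequences)
```

Matching on identifier equality is a single dictionary lookup per sequence, which is what makes this association step cheap.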
According to the embodiment, through cutting and matching operation on the data to be converted, the key value text sequence is matched with the obtained character string sequence in the meaningless and irregular data, so that the character string sequence which is originally unknown in meaning has the attribute corresponding to the character string sequence, and the text sequence to be determined is obtained. And the consistency of the identifiers is judged to carry out matching association, so that the method is very convenient and quick, and the calculation and matching efficiency is improved.
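A minimal sketch of steps 502 and 504, assuming that both the character string sequences and the key-value text sequences are held in dictionaries keyed by their identifiers. The function name associate_by_identifier, the identifiers "row-1" and "row-2" and the field names are hypothetical and do not come from the embodiment.

```python
def associate_by_identifier(string_sequences, key_sequences):
    """Assign each character string sequence to the key sequence with the same identifier."""
    texts_to_be_determined = {}
    for identifier, values in string_sequences.items():
        keys = key_sequences.get(identifier)
        if keys is None:
            # No key-value text sequence carries this identifier; skip it.
            continue
        # Pair every key with the value at the same position.
        texts_to_be_determined[identifier] = dict(zip(keys, values))
    return texts_to_be_determined


# Hypothetical data: the line label generated for each sequence is the identifier.
string_sequences = {"row-1": ["GET", "200", "0.012"]}
key_sequences = {
    "row-1": ["method", "status", "duration"],
    "row-2": ["method", "status"],
}
texts = associate_by_identifier(string_sequences, key_sequences)
print(texts)
```

Looking the identifier up in a dictionary keeps the association to a single pass over the character string sequences, which is in line with the efficiency argument made above.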
In one embodiment, as shown in FIG. 6, step 202 includes:
Step 602, obtaining format keywords from the collected log data, and classifying the log data according to the format keywords to obtain classified log data.
The format keyword may be a keyword representing the level of the log data, for example: debug: all detailed information used for debugging; info: key transitions, that is, logs proving that the software is running normally; warning: something unexpected has happened that the software cannot handle, but the software can still run normally; error: because of a serious problem the software cannot perform some functions, but it can still run; critical/fatal: a very serious error, and the software cannot continue to run. Other format keywords belong to log formats defined by developers themselves. Log data such as: weblogic_acc appName,agent Nanning 0.012 17:53:54-GET 675 200/life/selfhire.getAccountInfo 202.103.238.166, mentioned above, must be structured; otherwise it is only a meaningless string, and an analyst cannot obtain information from such an irregular string.
Therefore, the meaningless and irregular log data can be classified according to the level of the log data, or according to keywords that uniquely distinguish different types of logs; this is not limited here and depends on the actual situation.
Step 604, obtaining the log data meeting the log screening condition from the classified log data as the data to be written, and writing the data to be written into kafka according to the acquisition path of the data to be written. For example, if the log screening condition is that the log data includes a specific keyword, log data containing that keyword is regarded as data to be written that meets the log screening condition, and it is written into kafka for buffer storage.
According to this embodiment, before the log data is screened, the collected log data is first classified according to level, format, type, special keywords or the like, and the screening is then performed on the classified data. This avoids repeatedly executing the screening operation over the same log data, along with the unnecessary memory occupation on the server side that such repetition causes, and improves the efficiency of log data screening.
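The classify-then-screen order of steps 602 and 604 can be sketched as below. This is only an illustration under stated assumptions: the level is assumed to be the first whitespace-separated token of each line, the function names and sample log lines are hypothetical, and the write to kafka is left out, so the returned list simply stands for the data to be written.

```python
from collections import defaultdict

# Level keywords listed in the description; anything else is grouped as "other".
KNOWN_LEVELS = {"debug", "info", "warning", "error", "critical", "fatal"}


def classify_by_format_keyword(log_lines):
    """Group log lines by level keyword (assumed to be the first token)."""
    classified = defaultdict(list)
    for line in log_lines:
        tokens = line.split(maxsplit=1)
        first = tokens[0].lower() if tokens else ""
        level = first if first in KNOWN_LEVELS else "other"
        classified[level].append(line)
    return dict(classified)


def select_data_to_write(classified, wanted_levels, keyword):
    """Screen only the wanted classes for lines containing the keyword."""
    return [
        line
        for level in wanted_levels
        for line in classified.get(level, [])
        if keyword in line
    ]


# Hypothetical sample logs.
logs = [
    "INFO service started",
    "ERROR payment failed for order 42",
    "ERROR disk check passed",
    "weblogic_acc appName,agent Nanning 0.012",
]
classified_logs = classify_by_format_keyword(logs)
data_to_write = select_data_to_write(classified_logs, ["error"], "payment")
print(data_to_write)
```

Because the keyword test runs only over the classes that were asked for, lines in the other classes are never scanned during screening.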
It should be understood that, although the steps in the flowcharts of FIGS. 2-6 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-6 may include multiple sub-steps or phases, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or phases of other steps.
In one embodiment, as shown in FIG. 7, a data real-time storage device is provided, which corresponds one-to-one to the data real-time storage method in the above embodiments. The data real-time storage device includes:
The data buffer module 702 is configured to read a preset configuration file to obtain log screening conditions, obtain log data meeting the log screening conditions from the collected log data, and write the obtained log data into kafka as data to be converted according to the preset configuration file.
The sequence matching module 704 is configured to read the data to be converted from kafka at regular intervals, perform cutting and matching processing on the read data to be converted according to preset slicing conditions to obtain a text sequence to be determined, and write the text sequence to be determined into a preset database table on hbase.
The link pointing module 706 is configured to generate, for hive, a structure data soft link pointing to the preset database table according to the storage path of the preset database table on hbase.
The structuring module 708 is configured to cut, through a regular expression, the text sequence to be determined that is obtained according to the structure data soft link, and to write the structured data obtained after the cutting into the structured database on hive.
Further, the sequence matching module 704 includes:
The slicing point determining sub-module is used for determining the slicing points of the data to be converted according to the preset slicing conditions.
The data cutting sub-module is used for locating the position coordinates of the slicing points in the data to be converted and cutting the data to be converted into character string sequences according to the position coordinates.
The sequence association sub-module is used for acquiring and associating the key-value text sequence corresponding to the character string sequence according to a preset matching condition to obtain the text to be determined.
Further, the slicing point determining sub-module includes:
The frequency calculating unit is used for taking the number of occurrences of each character string in each line of the data to be converted as a frequency array and calculating the variance of the frequency array.
The first slice confirming unit is used for taking the character string as a slicing point if the variance of the frequency array is smaller than a specific value.
Further, the slicing point determining sub-module further includes:
The second slice confirming unit is used for taking the character string that occurs most frequently in the data to be converted as a slicing point.
Further, the sequence association sub-module includes:
The identifier acquisition unit is used for acquiring the key-value text sequence of the character string sequence and the identifier of the key-value text sequence.
The attribute assignment unit is used for assigning the character string sequence corresponding to the identifier to the key-value text sequence.
Further, the data buffer module 702 includes:
The data classification sub-module is used for acquiring format keywords from the collected log data and classifying the log data according to the format keywords to obtain classified log data.
The data screening sub-module is used for acquiring the log data meeting the log screening conditions from the classified log data as the data to be written, and writing the data to be written into kafka according to the acquisition path of the data to be written.
Further, the structuring module 708 includes:
The cutting processing sub-module is used for replacing the designated characters found in the text sequence to be determined with preset cutting characters and cutting the text sequence to be determined at the preset cutting characters.
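The behaviour of the cutting processing sub-module can be illustrated with a short Python sketch. The function name cut_text_sequence, the pattern for the designated characters (commas and whitespace) and the choice of the control character \x01 as the preset cutting character are assumptions made for this example; the input is a fragment of the sample log line quoted earlier.

```python
import re


def cut_text_sequence(text, designated_pattern=r"[,\s]+", cut_char="\x01"):
    """Replace designated characters with one cutting character, then cut there.

    designated_pattern and cut_char are hypothetical defaults; a real
    deployment would take them from its own configuration.
    """
    # Step 1: normalise every run of designated characters to the cutting character.
    normalized = re.sub(designated_pattern, cut_char, text.strip())
    # Step 2: cut the text sequence at the cutting character.
    return normalized.split(cut_char)


fields = cut_text_sequence("weblogic_acc appName,agent Nanning 0.012")
print(fields)
```

Normalising several different separators to a single cutting character first means the final cut needs only one delimiter, and a control character such as \x01 is unlikely to occur inside the log fields themselves.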
The data real-time storage device classifies the collected log data, writes the classified log data that meets the log screening conditions into kafka for buffering, and cuts the log data read from kafka to obtain character string sequences. Because the meaning of the cut character string sequences is unknown, each character string sequence is matched with a key-value text sequence, and the data is stored as structured data in real time by way of hbase. Before the data is written into hive for storage, it is buffered through hbase, which prevents data congestion and delayed writes to hive when massive data is processed, thereby solving the technical problem that users cannot analyze log data in real time.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing user order data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for storing data in real time.
It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 8 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the data real-time storage method in the foregoing embodiments, such as steps 202 to 208 shown in FIG. 2, or it implements the functions of the modules/units of the data real-time storage device in the foregoing embodiments, such as the functions of modules 702 to 708 shown in FIG. 7. To avoid repetition, no further description is provided here.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the data real-time storage method in the above embodiments, such as steps 202 to 208 shown in FIG. 2, or implements the functions of the modules/units of the data real-time storage device in the above embodiments, such as the functions of modules 702 to 708 shown in FIG. 7. To avoid repetition, no further description is provided here.
Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this description.
The above examples illustrate only a few embodiments of the application. They are described in relative detail, but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several modifications, improvements or equivalent substitutions of some technical features without departing from the concept of the present application; these modifications or substitutions do not cause the essence of the corresponding technical solution to deviate from the spirit and scope of the technical solutions of the embodiments of the present application, and all of them fall within the protection scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (7)

1. A method for storing data in real time, comprising:
reading a preset configuration file to obtain log screening conditions, acquiring log data conforming to the log screening conditions from the acquired log data, and writing the acquired log data into kafka according to the preset configuration file to serve as data to be converted;
the data to be converted is read from kafka at regular time, the read data to be converted is cut and matched according to preset slicing conditions to obtain a text sequence to be determined, and the text sequence to be determined is written into a preset data table on hbase;
Generating a structure data soft link pointing to a preset database table for hive according to a storage path of the preset database table on the hbase;
Cutting the text sequence to be determined obtained according to the structure data soft link through a regular expression, and writing the structural data obtained after cutting into a structural database on the hive;
wherein the cutting and matching of the read data to be converted according to preset slicing conditions to obtain a text sequence to be determined comprises the following steps:
Determining slicing points of the data to be converted according to the preset slicing conditions;
positioning the position coordinates of the slicing points in the data to be converted, and cutting the data to be converted into character string sequences according to the position coordinates;
acquiring a key value text sequence corresponding to the character string sequence according to a preset matching condition and correlating the key value text sequence to obtain a text to be determined;
The obtaining and correlating the key value text sequence corresponding to the character string sequence according to the preset matching condition comprises the following steps:
acquiring a key value text sequence of the character string sequence and an identifier of the key value text sequence;
Assigning the character string sequence corresponding to the identifier to the key value text sequence;
the cutting processing of the text sequence to be determined, which is obtained according to the structure data soft link, through a regular expression comprises the following steps:
And replacing the designated characters searched from the text sequence to be determined with preset cutting characters, and cutting the text sequence to be determined at the preset cutting characters.
2. The method according to claim 1, wherein the determining the slicing point of the data to be converted according to the preset slicing condition comprises:
Taking the occurrence frequency of each character string in the data to be converted in the same row as a frequency array, and calculating the variance of the frequency array;
and if the variance of the frequency array is smaller than a specific numerical value, taking the character string as the slicing point.
3. The method according to claim 1, wherein the determining the slicing point of the data to be converted according to the preset slicing condition comprises:
And taking the character string with the highest occurrence frequency in the data to be converted as the slicing point.
4. The method according to claim 1, wherein the acquiring the log data meeting the log filtering condition from the acquired log data, and writing the acquired log data into kafka according to the preset configuration file, includes:
acquiring format keywords from the acquired log data, and classifying the log data according to the format keywords to obtain classified log data;
And acquiring the log data meeting the log screening conditions from the classified log data as data to be written, and writing the data to be written into kafka according to the acquisition path of the data to be written.
5. A data real-time storage device, characterized in that it performs the steps of the method according to any one of claims 1 to 4, comprising:
The data buffer module is used for reading a preset configuration file to obtain log screening conditions, acquiring log data conforming to the log screening conditions from the acquired log data, and writing the acquired log data into kafka according to the preset configuration file to serve as data to be converted;
the sequence matching module is used for regularly reading the data to be converted from the kafka, cutting and matching the read data to be converted according to preset slicing conditions to obtain a text sequence to be determined, and writing the text sequence to be determined into a preset data table on hbase;
The link pointing module is used for generating a structure data soft link pointing to the preset database table for hive according to the storage path of the preset database table on the hbase;
And the structuring module is used for cutting the text sequence to be determined, which is acquired according to the structural data soft link, through a regular expression, and writing the structured data obtained after the cutting process into the structured database on the hive.
6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 4 when the computer program is executed.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010090231.1A CN111400361B (en) 2020-02-13 2020-02-13 Data real-time storage method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010090231.1A CN111400361B (en) 2020-02-13 2020-02-13 Data real-time storage method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111400361A CN111400361A (en) 2020-07-10
CN111400361B true CN111400361B (en) 2024-08-27

Family

ID=71428375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010090231.1A Active CN111400361B (en) 2020-02-13 2020-02-13 Data real-time storage method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111400361B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199350B (en) * 2020-09-29 2023-10-24 中国平安人寿保险股份有限公司 Function verification method and device based on data screening, computer equipment and medium
CN112749223A (en) * 2021-01-28 2021-05-04 道和云科技(天津)有限公司 Interface log configuration and structured storage method and system
CN113312353B (en) * 2021-06-10 2024-06-04 中国民航信息网络股份有限公司 Storage method and system for tracking belt log
CN114328076B (en) * 2021-09-18 2024-04-30 腾讯科技(深圳)有限公司 Log information extraction method, device, computer equipment and storage medium
CN115587158B (en) * 2022-12-08 2023-04-25 广东名阳信息科技有限公司 Log data conversion method and system based on visual configuration

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729526A (en) * 2017-10-30 2018-02-23 清华大学 A kind of method of text structure
CN109033410A (en) * 2018-08-03 2018-12-18 韩雪松 A kind of SQL analytic method based on canonical and character string cutting

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729442B (en) * 2013-12-30 2017-11-24 华为技术有限公司 Record the method and database engine of transaction journal
CN108847977B (en) * 2018-06-14 2021-06-25 平安科技(深圳)有限公司 Service data monitoring method, storage medium and server

Also Published As

Publication number Publication date
CN111400361A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111400361B (en) Data real-time storage method, device, computer equipment and storage medium
US9245225B2 (en) Prediction of user response actions to received data
US20190163594A1 (en) Using Cognitive Technologies to Identify and Resolve Issues in a Distributed Infrastructure
CN112162965B (en) Log data processing method, device, computer equipment and storage medium
US20150199433A1 (en) Method and system for search engine indexing and searching using the index
CN113688288B (en) Data association analysis method, device, computer equipment and storage medium
CN110837590B (en) Information pushing method and device, computer equipment and storage medium
CN111666298B (en) User service category detection method, device, and computer equipment based on Flink
CN109542764B (en) Webpage automatic testing method and device, computer equipment and storage medium
CN110674360B (en) Tracing method and system for data
CN111445319A (en) Voucher generation method and device, computer equipment and storage medium
CN111652658A (en) Portrait fusion method, apparatus, electronic device and computer readable storage medium
US20220036154A1 (en) Unsupervised multi-dimensional computer-generated log data anomaly detection
CN112506800B (en) Method, apparatus, device, medium and program product for testing code
US20160267586A1 (en) Methods and devices for computing optimized credit scores
CN113778996A (en) Large data stream data processing method and device, electronic equipment and storage medium
CN113221036A (en) Method and device for processing electronic bill mail
US9286349B2 (en) Dynamic search system
CN113961811B (en) Event map-based conversation recommendation method, device, equipment and medium
CN114004212B (en) Data processing method, device and storage medium
CN113360313B (en) Behavior analysis method based on massive system logs
CN111459411B (en) Data migration method, device, equipment and storage medium
CN115357689A (en) Data processing method, device and medium of distributed log and computer equipment
CN113239687A (en) Data processing method and device
CN112650569A (en) Timed task relation network graph generation method based on Oracle code and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant