CN111881086B

CN111881086B - Big data storage method, query method, electronic device and storage medium

Info

Publication number: CN111881086B
Application number: CN202010715304.1A
Authority: CN
Inventors: 查超; 范渊
Original assignee: DBAPPSecurity Co Ltd
Current assignee: DBAPPSecurity Co Ltd
Priority date: 2020-07-23
Filing date: 2020-07-23
Publication date: 2024-03-19
Anticipated expiration: 2040-07-23
Also published as: CN111881086A

Abstract

The application relates to a big data storage method, a query method, an electronic device and a storage medium. The Elasticsearch-based storage method for PB-level data comprises the following steps: acquiring data, storing the data in data shards of a preset duration, and generating, in an Elasticsearch system, a plurality of index shards for querying each data shard; and, when the total data volume of the data shards corresponding to the plurality of index shards is greater than a preset threshold and/or the time span of those data shards is greater than a preset time span, transferring at least one earlier-generated index shard among the plurality of index shards to an HDFS system, so that Lucene can query the corresponding data from the index shards stored in the HDFS system. The application thereby solves the problems of poor stability and low efficiency of data query and storage in big data warehouses in the related art.

Description

Big data storage method, query method, electronic device and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method for storing big data, a method for querying the big data, an electronic device, and a storage medium.

Background

In recent years, with the rapid development and popularization of computer and information technology, industry application systems have grown rapidly in scale, and the data they generate has grown explosively. Industry and enterprise big data often reaches hundreds of TB, or even tens to hundreds of PB, far exceeding the processing capacity of conventional computing technologies and information systems; big data technologies have emerged in response.

In the prior art, big data warehouses generally adopt distributed data storage to meet the storage and query requirements of big data. However, these warehouses rely on a single distributed system for both data storage and querying. In massive-data scenarios, a single distributed system cannot provide enough server resources to guarantee indexing and query functions. Existing distributed systems applied to big data warehouses also suffer from poor stability and low efficiency when storing and querying historical data, and from high operation and maintenance costs.

At present, no effective solution has been proposed for the poor stability, low efficiency, and high operation and maintenance cost of data query and storage in distributed systems applied to big data warehouses in the related art.

Disclosure of Invention

The embodiment of the application provides a storage method, a query method, an electronic device and a storage medium for big data, which at least solve the problems of poor data query and storage stability, low efficiency and high operation and maintenance cost of a distributed system applied to a big data warehouse in the related technology.

In a first aspect, an embodiment of the present application provides an Elasticsearch-based method for storing PB-level data, including:

acquiring data, storing the data in data shards of a preset duration, and generating, in an Elasticsearch system, a plurality of index shards for querying each data shard;

and, when the total data volume of the data shards corresponding to the plurality of index shards is greater than a preset threshold and/or the time span of those data shards is greater than a preset time span, transferring at least one earlier-generated index shard among the plurality of index shards to an HDFS system, so that Lucene can query the corresponding data from the index shards stored in the HDFS system.

In some embodiments, when the total data volume of the data shards corresponding to the plurality of index shards is smaller than the preset threshold and the time span of those data shards is smaller than the preset time span, the storage method includes: storing the plurality of index shards in the Elasticsearch system.

In some embodiments, acquiring data, storing the data in data shards of a preset duration, and generating, in the Elasticsearch system, a plurality of index shards for querying each data shard includes:

collecting the data through Logstash and ingesting it into the Elasticsearch system;

storing the data acquired each day as one data shard in the Elasticsearch system, and generating, day by day, the index shard associated with each data shard.

In some embodiments, when the total data volume of the data shards corresponding to the plurality of index shards is greater than the preset threshold and/or the time span of those data shards is greater than the preset time span, the storage method includes: transferring the index shard with the earliest generation time among the plurality of index shards to the HDFS system, wherein that generation time is the time at which the Elasticsearch system started storing the data shards.

In a second aspect, an embodiment of the present application provides an Elasticsearch-based method for querying PB-level data, including:

receiving a data query request, and determining whether the index shard corresponding to the data requested by the data query request is stored in an Elasticsearch system or in an HDFS system, wherein at least one earlier-generated index shard in the Elasticsearch system has been transferred to the HDFS system;

when the index shard corresponding to the requested data is stored in the Elasticsearch system, querying the corresponding data from the index shards stored in the Elasticsearch system;

and, when the index shard corresponding to the requested data is stored in the HDFS system, calling the query interface of Lucene to query the corresponding data from the index shards stored in the HDFS system.

In some embodiments, the data query request includes a storage timestamp of the data requested to be queried, and determining whether the index shard corresponding to the requested data is stored in the Elasticsearch system or in the HDFS system includes:

determining, according to the storage timestamp, the time span of the index shard corresponding to the requested data;

judging whether the time span of the index shard is greater than a preset time span;

when the time span of the index shard is judged to be greater than the preset time span, determining that the index shard corresponding to the requested data is stored in the HDFS system;

and, when the time span of the index shard is judged to be smaller than the preset time span, determining that the index shard corresponding to the requested data is stored in the Elasticsearch system.

In some embodiments, when the index shards corresponding to the data requested by the data query request are stored in both the Elasticsearch system and the HDFS system, the query method further includes:

splitting the data query request into a first query request and a second query request, wherein the index shards corresponding to the data requested by the first query request are stored in the Elasticsearch system, and the index shards corresponding to the data requested by the second query request are stored in the HDFS system;

querying the data corresponding to the first query request from the index shards stored in the Elasticsearch system, and calling the query interface of Lucene to query the data corresponding to the second query request from the index shards stored in the HDFS system;

merging the data returned by the Elasticsearch system with the data returned by the HDFS system to obtain the data requested by the data query request.
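The splitting and merging steps above can be sketched as follows. This is only an illustrative outline: the names split_query and run_query, the half-open time ranges, and the cutoff parameter (the oldest timestamp still held in Elasticsearch) are assumptions made for this example, and the two backends are passed in as plain callables rather than real Elasticsearch or Lucene clients.

```python
from datetime import datetime
from typing import Callable, List, Optional, Tuple

# Half-open time range [start, end); a hypothetical representation for this sketch.
TimeRange = Tuple[datetime, datetime]
Backend = Callable[[datetime, datetime], List[str]]


def split_query(
    start: datetime, end: datetime, cutoff: datetime
) -> Tuple[Optional[TimeRange], Optional[TimeRange]]:
    """Split [start, end) at cutoff into (es_range, hdfs_range).

    Data at or after cutoff is assumed to be in Elasticsearch,
    data before cutoff is assumed to have been moved to HDFS.
    """
    hdfs_range = (start, min(end, cutoff)) if start < cutoff else None
    es_range = (max(start, cutoff), end) if end > cutoff else None
    return es_range, hdfs_range


def run_query(
    start: datetime,
    end: datetime,
    cutoff: datetime,
    query_es: Backend,
    query_hdfs: Backend,
) -> List[str]:
    """Run each sub-query against its backend and merge the results."""
    es_range, hdfs_range = split_query(start, end, cutoff)
    results: List[str] = []
    if hdfs_range is not None:
        results.extend(query_hdfs(*hdfs_range))  # older data first
    if es_range is not None:
        results.extend(query_es(*es_range))
    return results
```

Passing the backends as callables keeps the routing logic testable without a running cluster; in a real deployment they would wrap an Elasticsearch client and a Lucene-based reader respectively.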

In some embodiments, calling the query interface of Lucene to query the corresponding data from the index shards stored in the HDFS system includes:

calling the query interface of Lucene to locate the corresponding index shard in the HDFS system;

and performing MapReduce data processing on the index shard to obtain the data requested by the data query request.

In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform the Elasticsearch-based method for storing PB-level data according to the first aspect, and/or the Elasticsearch-based method for querying PB-level data according to the second aspect.

In a fourth aspect, an embodiment of the present application provides a storage medium in which a computer program is stored, where the computer program is configured to perform, when executed, the Elasticsearch-based method for storing PB-level data according to the first aspect, and/or the Elasticsearch-based method for querying PB-level data according to the second aspect.

Compared with the related art, the big data storage method, query method, electronic device and storage medium provided by the embodiments of the application acquire data, store the data in data shards of a preset duration, and generate, in an Elasticsearch system, a plurality of index shards for querying each data shard; and, when the total data volume of the data shards corresponding to the plurality of index shards is greater than a preset threshold and/or the time span of those data shards is greater than a preset time span, transfer at least one earlier-generated index shard to an HDFS system, so that Lucene can query the corresponding data from the index shards stored in the HDFS system. This solves the problems of poor stability, low efficiency, and high operation and maintenance cost of data query and storage in distributed systems applied to big data warehouses in the related art, and achieves the beneficial effects of improving the resource utilization of big data servers, saving resources, and improving the efficiency of data query and analysis.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a hardware block diagram of a terminal that runs the Elasticsearch-based PB-level data storage method and the Elasticsearch-based PB-level data query method according to an embodiment of the present application;

FIG. 2 is a flowchart of an Elasticsearch-based PB-level data storage method according to an embodiment of the present application;

FIG. 3 is a flowchart of an Elasticsearch-based PB-level data query method according to an embodiment of the present application;

FIG. 4 is a flowchart of Elasticsearch-based PB-level data storage and querying according to a preferred embodiment of the present application;

FIG. 5 is a structural block diagram of an Elasticsearch-based PB-level data storage device according to an embodiment of the present application;

FIG. 6 is a structural block diagram of an Elasticsearch-based PB-level data query device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions, and advantages of the present application clearer, the application is described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the application and are not intended to limit it. All other embodiments obtained by one of ordinary skill in the art, based on the embodiments provided herein and without creative effort, fall within the scope of protection of the present application.

The drawings in the following description are evidently only some examples or embodiments of the present application; those of ordinary skill in the art can apply the application to other similar situations according to these drawings without creative effort. Moreover, although such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and the disclosure should therefore not be construed as insufficient.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," and similar terms herein do not denote a limitation of quantity and may refer to the singular or the plural. The terms "comprising," "including," "having," and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "And/or" describes a relationship between associated objects and covers three cases; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it. The terms "first," "second," "third," and the like, as used herein, merely distinguish between similar objects and do not represent a particular ordering of the objects.

The various techniques described in this application may be used for storage querying of mass data in the field of big data technology.

Prior to describing and illustrating the embodiments of the present application, the related art used in the present application will be described as follows:

PB-level data refers to petabyte-scale data, on the order of hundreds of millions of records.

An index is similar to a database in a conventional relational database system: it is the place where related documents are stored. It is Elasticsearch's (ES's) logical storage of data, structured for fast and efficient full-text retrieval. To index a document (verb) is to store the document in an index (noun) so that it can be retrieved and queried.

Shards address the case where an index must store more data than the hardware of a single node can hold: the index may not fit on one node's disk, or may be too slow for a single node to serve search requests. Elasticsearch therefore allows an index to be subdivided into a plurality of shards. When the index is created, only the required number of shards needs to be defined; each shard is a fully functional, independent index that can be hosted on any node in the cluster. Sharding serves two main purposes: it allows content to be split and scaled horizontally, and it allows operations to be distributed and parallelized across shards (possibly on multiple nodes), thereby improving performance and throughput.
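As a simplified illustration of how documents can be spread across a fixed number of shards, the sketch below routes each document to a shard by hashing its identifier. It is not Elasticsearch's actual routing code: the CRC32 hash, the function names shard_for and distribute, and the document identifiers are assumptions chosen only to show the principle.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number in [0, num_shards).

    CRC32 is used here only because it is deterministic and in the
    standard library; it is an illustrative stand-in for a real routing hash.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids: Iterable[str], num_shards: int) -> Dict[int, List[str]]:
    """Group document ids by the shard they are routed to."""
    shards: Dict[int, List[str]] = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)


if __name__ == "__main__":
    placement = distribute([f"doc-{i}" for i in range(10)], num_shards=3)
    for shard, ids in sorted(placement.items()):
        print(shard, ids)
```

Because the shard number depends on the shard count, a scheme like this requires the number of shards to be fixed when the index is created, which matches the description above.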

Elasticsearch is a distributed, highly scalable, near-real-time search and data analysis engine that is widely used in big data scenarios such as finance and security. It provides horizontal scalability, a sharding mechanism, and high availability: each shard can be given multiple replicas, so the system continues to operate normally even if a server goes down.

Lucene is an open-source full-text search engine toolkit. Elasticsearch uses it as its core to implement all indexing and search functions. Lucene is not a complete full-text search engine but a full-text search engine framework, providing a complete query engine, an index engine, and part of a text analysis engine.

MapReduce is an easy-to-use software framework in Hadoop (a distributed system architecture). Applications written on this framework can run on a distributed cluster, which makes it suitable for offline processing of massive data at the PB level and above.

HDFS is a distributed file system that stores data on multiple nodes of a cluster and allows multiple users to access the data; it is an important component of the Hadoop ecosystem.

The method embodiments provided herein may be executed in a terminal, a computer, or a similar computing device. Taking execution on a terminal as an example, FIG. 1 is a hardware block diagram of a terminal that runs the Elasticsearch-based PB-level data storage method and the Elasticsearch-based PB-level data query method according to an embodiment of the present application. As shown in FIG. 1, the terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. Those skilled in the art will appreciate that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the terminal described above. For example, the terminal may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

The memory 104 may be used to store a computer program, for example, a software program of an application software and a module, such as a computer program corresponding to a method for storing PB-level data based on an elastic search and/or a method for querying PB-level data based on an elastic search in the embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. The specific example of the network described above may include a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

The present embodiment provides an Elasticsearch-based method for storing PB-level data. FIG. 2 is a flowchart of the Elasticsearch-based PB-level data storage method according to an embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:

Step S201: data is acquired and stored in data shards of a preset duration, and a plurality of index shards for querying each data shard are generated in an Elasticsearch system.

In this embodiment, the data to be stored is data that users push to a distributed messaging system (Kafka). It is ingested directly into the Elasticsearch system through the data collection tool Logstash and then retained by creating one index shard for each data shard of the preset duration. In this embodiment, the preset duration is one day: each data shard corresponds to the data collected and stored on one day, and the Elasticsearch system retains the data with one index shard per day.
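A minimal sketch of the one-shard-per-day scheme is given below, assuming each record carries a timezone-aware timestamp. The index prefix "events", the date pattern, and the function names daily_index_name and group_by_day are hypothetical choices for this example; the source does not specify how daily index shards are named.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple

INDEX_PREFIX = "events"  # hypothetical prefix, not taken from the source


def daily_index_name(ts: datetime, prefix: str = INDEX_PREFIX) -> str:
    """Return the name of the daily index shard that a record belongs to.

    The timestamp must be timezone-aware; it is normalized to UTC so that
    every record maps to exactly one day.
    """
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d}"


def group_by_day(records: Iterable[Tuple[datetime, str]]) -> Dict[str, List[str]]:
    """Bucket (timestamp, payload) records into their daily index shards."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ts, payload in records:
        groups[daily_index_name(ts)].append(payload)
    return dict(groups)
```

With date-stamped names, the age of a shard can be read directly from its name, which simplifies both the later transfer to HDFS and the routing of queries by timestamp.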

Step S202: when the total data volume of the data shards corresponding to the plurality of index shards is greater than a preset threshold and/or the time span of those data shards is greater than a preset time span, at least one earlier-generated index shard among the plurality of index shards is transferred to the HDFS system, so that Lucene can query the corresponding data from the index shards stored in the HDFS system.

In this embodiment, the index shards stored in the Elasticsearch system and in the HDFS system both hold data, so data can be queried by searching the index files in either system. Storing data as index shards across two distributed storage systems, HDFS and Elasticsearch, reduces the server resources required while keeping historical data queryable. When index shards need to be transferred, at least one index shard and its corresponding data are transferred to the HDFS system. For example, when the total data volume of the data shards corresponding to the plurality of index shards is greater than the preset threshold but their time span is not greater than the preset time span, the index shard with the earliest generation time may be transferred to the HDFS system. When the time span is greater than the preset time span and exceeds it by more than the span of a single shard, several earlier-generated index shards need to be transferred to the HDFS system, until the time span of the data shards corresponding to the remaining index shards is no longer greater than the preset time span.

The data storage process, in which it is decided whether data remains in the Elasticsearch system or is moved to the HDFS system, is as follows. When the background service (the executing entity of the storage method) determines that the total data volume of the current data shards in the Elasticsearch system exceeds 100 billion, or that the time span of the index shards exceeds 2 months, it automatically closes the index shard whose data shard was stored first. For example, if data starts being stored on January 1, 2019, then on March 1, 2019, or once the total data volume of the data shards exceeds 100 billion, the system closes the index shard generated on January 1. The background service also stores the closed Elasticsearch index shard in the HDFS system. Because the underlying layer of the Elasticsearch system is implemented on Lucene, the background service queries the index shards stored in HDFS directly by calling the query interface of Lucene.
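The decision made by the background service can be sketched as below. This is an assumed outline rather than the patented implementation: the DailyShard type and select_shards_to_migrate function are hypothetical, "2 months" is approximated as 60 days, and the time span is measured from the oldest shard's day to the current day.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List

MAX_TOTAL_DOCS = 100_000_000_000    # 100 billion, the threshold named in the text
MAX_TIME_SPAN = timedelta(days=60)  # assumption: "2 months" approximated as 60 days


@dataclass(frozen=True)
class DailyShard:
    """One day's index shard and the amount of data it holds (hypothetical type)."""
    day: date
    doc_count: int


def select_shards_to_migrate(
    shards: Iterable[DailyShard],
    today: date,
    max_total_docs: int = MAX_TOTAL_DOCS,
    max_span: timedelta = MAX_TIME_SPAN,
) -> List[DailyShard]:
    """Pick the oldest shards to move to HDFS until both limits are satisfied."""
    remaining = sorted(shards, key=lambda s: s.day)  # oldest first
    total = sum(s.doc_count for s in remaining)
    to_migrate: List[DailyShard] = []
    while remaining and (
        total > max_total_docs or today - remaining[0].day > max_span
    ):
        oldest = remaining.pop(0)
        total -= oldest.doc_count
        to_migrate.append(oldest)
    return to_migrate
```

The loop removes one shard at a time, oldest first, so a small overrun moves a single shard while a large overrun moves as many as needed for the remaining shards to fit within both limits.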

Through steps S201 to S202 above, data is acquired and stored in data shards of a preset duration, and a plurality of index shards for querying each data shard are generated in the Elasticsearch system; when the total data volume of the data shards corresponding to the plurality of index shards is greater than a preset threshold and/or the time span of those data shards is greater than a preset time span, at least one earlier-generated index shard is transferred to the HDFS system, so that Lucene can query the corresponding data from the index shards stored in the HDFS system. This solves the problems of poor stability, low efficiency, and high operation and maintenance cost of data query and storage in distributed systems of big data warehouses in the related art, and achieves fast storage and querying of big data with reduced server resources.

In some embodiments, when the total data volume of the data shards corresponding to the plurality of index shards is smaller than the preset threshold and the time span of those data shards is smaller than the preset time span, the storage method further includes the following step: storing the plurality of index shards in the Elasticsearch system.

In some embodiments, acquiring data, storing the data in data shards of a preset duration, and generating, in the Elasticsearch system, a plurality of index shards for querying each data shard includes the following step:

collecting the data through Logstash and ingesting it into the Elasticsearch system.

In this embodiment, the data collected by Logstash is data that users have pushed to Kafka (a distributed messaging system).

In some embodiments, when the total data volume of the data shards corresponding to the plurality of index shards is greater than the preset threshold and/or the time span of those data shards is greater than the preset time span, the storage method includes the following step: transferring the index shard with the earliest generation time among the plurality of index shards to the HDFS system, wherein that generation time is the time at which the Elasticsearch system started storing the data shards.

The present embodiment provides an Elasticsearch-based method for querying PB-level data. FIG. 3 is a flowchart of the Elasticsearch-based PB-level data query method according to an embodiment of the present application. As shown in FIG. 3, the flow includes the following steps:

Step S301: a data query request is received, and it is determined whether the index shard corresponding to the data requested by the data query request is stored in the Elasticsearch system or in the HDFS system, where at least one earlier-generated index shard in the Elasticsearch system has been transferred to the HDFS system.

In this embodiment, the storage location of the index shard is determined from the generation timestamp of the data requested in the data query request. For example, when the requested data falls within the last two months, the Elasticsearch-based PB-level data storage method will have kept that data in the Elasticsearch system as index files; at query time, the data is identified as recent data from its generation timestamp, and the corresponding index shard is determined to be stored in the Elasticsearch system. When the requested data is more than two months old, the storage method will have transferred the data shards and index shards whose time span exceeds two months to the HDFS system; at query time, the data is identified as historical data from its generation timestamp, and the corresponding index shard is determined to be one that has been transferred to the HDFS system.
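The routing rule described above can be sketched as follows. The Store enum, the locate_index_shard function, and the approximation of "two months" as 60 days are assumptions introduced for this example; the source states only that recent data is served from Elasticsearch and older data from HDFS.

```python
from datetime import datetime, timedelta
from enum import Enum


class Store(Enum):
    """Where the index shard for a given piece of data is assumed to live."""
    ELASTICSEARCH = "elasticsearch"
    HDFS = "hdfs"


def locate_index_shard(
    data_ts: datetime,
    now: datetime,
    retention: timedelta = timedelta(days=60),  # assumption: "two months" as 60 days
) -> Store:
    """Decide from the data's generation timestamp which system to query."""
    if now - data_ts > retention:
        return Store.HDFS           # historical data: shard was moved to HDFS
    return Store.ELASTICSEARCH      # recent data: shard is still in Elasticsearch


if __name__ == "__main__":
    now = datetime(2019, 6, 1)
    print(locate_index_shard(datetime(2019, 5, 20), now))
    print(locate_index_shard(datetime(2019, 1, 1), now))
```

A purely time-based rule like this does not account for shards that were moved early because the total data volume threshold was exceeded; a fuller implementation would presumably consult a record of which shards have actually been transferred.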

Step S302: when the index shard corresponding to the data requested by the data query request is stored in the Elasticsearch system, the corresponding data is queried from the index shards stored in the Elasticsearch system.

In this embodiment, if the data is determined to be recent data, the corresponding data is queried from the index shards stored in the Elasticsearch system.

Step S303: when the index shard corresponding to the data requested by the data query request is stored in the HDFS system, the query interface of Lucene is called to query the corresponding data from the index shards stored in the HDFS system.

In this embodiment, the query interface of Lucene is used to query directly the index shards that were generated in the Elasticsearch system and are now stored in the HDFS system, completing the data query. This solves the difficulty in the related art of querying data with a large time span (historical data).

Through steps S301 to S303 above, a data query request is received, and it is determined whether the index shard corresponding to the requested data is stored in the Elasticsearch system or in the HDFS system; when the index shard is stored in the Elasticsearch system, the corresponding data is queried from the index shards stored in the Elasticsearch system; when the index shard is stored in the HDFS system, the query interface of Lucene is called to query the corresponding data from the index shards stored in the HDFS system. This solves the difficulty in the related art of querying data with a large time span, and allows massive data to be queried in a lightweight manner and with high query efficiency.

In some embodiments, the data query request includes a storage timestamp of the requested data, and determining whether the index shard corresponding to the requested data is stored in the Elasticsearch system or in the HDFS system includes the following steps:

The time span of the index fragment corresponding to the requested data is determined according to the storage timestamp.

In this embodiment, the storage timestamp is the generation timestamp of the data slice or index slice, or the ingestion timestamp, that is, the time at which the data was accessed into the Elasticsearch system and stored as a data slice.

Whether the time span of the index fragment is greater than a preset time span is judged.

When the time span of the index fragment is judged to be greater than the preset time span, it is determined that the index fragment corresponding to the requested data is stored in the HDFS system.

In this embodiment, a time span of the index shard greater than the preset time span indicates that the queried data is historical data from before the preset time, specifically, data from more than two months ago.

When the time span of the index fragment is judged to be smaller than the preset time span, it is determined that the index fragment corresponding to the requested data is stored in the Elasticsearch system.

In this embodiment, a time span of the index shard smaller than the preset time span indicates that the queried data is recent data within the preset time, specifically, data from the last two months.
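As a minimal sketch of the judgment described above, the following Python fragment maps a storage timestamp to a storage location. It follows the reading, consistent with step S406, that data older than the preset time span is historical data held in the HDFS system. The function name locate_index_shard, the constant PRESET_TIME_SPAN, and the 60-day approximation of "two months" are illustrative assumptions and are not part of the application.

```python
from datetime import datetime, timedelta

# Assumption: the embodiment speaks of "two months"; 60 days approximates it.
PRESET_TIME_SPAN = timedelta(days=60)


def locate_index_shard(storage_timestamp: datetime, now: datetime) -> str:
    """Return the system expected to hold the index shard for this timestamp."""
    # Time span between the moment the data was stored and the query time.
    time_span = now - storage_timestamp
    if time_span > PRESET_TIME_SPAN:
        # Historical data: the shard is assumed to have been moved to HDFS.
        return "hdfs"
    # Recent data: the shard is assumed to still be in Elasticsearch.
    return "elasticsearch"
```

A caller would pass the storage timestamp carried in the data query request together with the current time, then dispatch the request to the system named by the return value.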

In some embodiments, when the index shards corresponding to the data requested by the data query request are stored in both the Elasticsearch system and the HDFS system, the query method further includes the following steps:

The data query request is split to obtain a first query request and a second query request, wherein the index fragment corresponding to the data requested by the first query request is stored in the Elasticsearch system, and the index fragment corresponding to the data requested by the second query request is stored in the HDFS system.

In this embodiment, when the data queried by the user includes both data with a time span greater than the preset time span and data with a time span smaller than the preset time span, the background service splits the data query request, distributes the resulting requests to the Elasticsearch system and the HDFS system, and then queries the data in each system by using the index shards.

In parallel, the data corresponding to the first query request is queried from the index shards stored in the Elasticsearch system, and the data corresponding to the second query request is queried from the index shards stored in the HDFS system by calling the query interface of Lucene.

In this embodiment, the query against the index shards stored in the Elasticsearch system and the Lucene query against the index shards stored in the HDFS system are performed simultaneously, and the data is summarized after both queries have returned.

The data queried from the Elasticsearch system and the data queried from the HDFS system are summarized to obtain the data requested by the data query request.

In summary, the data query request is split into a first query request, whose index fragments are stored in the Elasticsearch system, and a second query request, whose index fragments are stored in the HDFS system; the query for the first query request against the index fragments stored in the Elasticsearch system and the Lucene query for the second query request against the index fragments stored in the HDFS system are executed in parallel; and the data queried from the two systems is summarized to obtain the data requested by the data query request. In this way, historical data (data with a time span greater than the preset time span) and recent data (data with a time span smaller than the preset time span) can be queried at the same time.
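The split, parallel query, and summarization steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the request is modeled as a time range, boundary is the hypothetical point in time separating recent data from historical data, and query_es and query_hdfs are hypothetical callables standing in for the Elasticsearch query and the Lucene query over HDFS.

```python
from concurrent.futures import ThreadPoolExecutor


def split_request(start, end, boundary):
    """Split [start, end) at boundary into (first, second) sub-requests.

    first  -> recent range, served by Elasticsearch (None if empty)
    second -> historical range, served by HDFS (None if empty)
    """
    first = (max(start, boundary), end) if end > boundary else None
    second = (start, min(end, boundary)) if start < boundary else None
    return first, second


def query_both(start, end, boundary, query_es, query_hdfs):
    """Run both sub-requests in parallel and merge their results."""
    first, second = split_request(start, end, boundary)
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = []
        if first:
            futures.append(pool.submit(query_es, *first))
        if second:
            futures.append(pool.submit(query_hdfs, *second))
        merged = []
        for future in futures:
            # result() blocks until that sub-query has finished.
            merged.extend(future.result())
    return merged
```

A thread pool is used here only because both sub-queries are assumed to be I/O-bound; any mechanism that issues the two requests concurrently and waits for both would serve the same purpose.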

In some embodiments, calling the query interface of Lucene to query the corresponding data from the index shards stored in the HDFS system includes the following steps:

The query interface of Lucene is called to query the corresponding index fragments from the HDFS system.

MapReduce data processing is performed on the index fragments to obtain the data requested by the data query request.

By calling the query interface of Lucene to query the corresponding index fragments from the HDFS system and performing MapReduce data processing on them to obtain the requested data, this embodiment directly queries the index fragments that have been transferred to the HDFS system and applies MapReduce data processing to the query process, which solves the problem that historical data is difficult to query.
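The map and reduce phases described above can be illustrated with a small Python sketch. It does not call Lucene or Hadoop: each index fragment is modeled as a plain list of documents, and the names map_shard, reduce_hits, and query_shards, as well as the "ts" field, are assumptions introduced only for illustration.

```python
from functools import reduce


def map_shard(shard_docs, predicate):
    """Map phase: search one index fragment independently of the others."""
    return [doc for doc in shard_docs if predicate(doc)]


def reduce_hits(left, right):
    """Reduce phase: merge two partial result lists, ordered by timestamp."""
    return sorted(left + right, key=lambda doc: doc["ts"])


def query_shards(shards, predicate):
    """Apply the map phase to every fragment, then fold the partial results."""
    partials = [map_shard(docs, predicate) for docs in shards]
    return reduce(reduce_hits, partials, [])
```

In a real deployment the map phase would run on the nodes holding each fragment, so that only matching documents travel to the reduce phase.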

FIG. 4 is a flowchart of PB-level data storage and querying based on Elasticsearch according to a preferred embodiment of the present application. As shown in FIG. 4, the flow comprises the following steps:

In step S401, Kafka receives data pushed by the user.

In step S402, Logstash collects the data in Kafka.

In step S403, the Elasticsearch system imports the data collected by Logstash.

In step S404, it is judged whether the total data amount in the Elasticsearch system exceeds 100 billion or whether the time span of the index shards exceeds two months; if so, step S405 is executed, followed by step S406.

In step S405, the HDFS system receives the index fragments transferred from the Elasticsearch system, and then step S406 is performed.

In step S406, the data query is performed: data from the last two months (data with a time span smaller than the preset time span) is queried in the Elasticsearch system, and historical data (data with a time span greater than the preset time span) is queried in the HDFS system.
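The judgment in step S404 and the transfer in step S405 can be sketched as follows. This is a hedged illustration, not the implementation of the application: the IndexShard class, the function shards_to_migrate, the interpretation of the 100 billion threshold as a document count, and the 60-day approximation of "two months" are all assumptions. The loop that keeps moving the oldest shard until neither condition holds is one possible reading of "at least one index fragment with an earlier generation time".

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_TOTAL_DOCS = 100_000_000_000  # threshold from step S404; unit assumed
MAX_AGE = timedelta(days=60)      # assumption for "two months"


@dataclass
class IndexShard:
    day: date        # generation date of the shard
    doc_count: int   # amount of data the shard indexes


def shards_to_migrate(shards, today):
    """Return the shards, oldest first, to move from Elasticsearch to HDFS."""
    remaining = sorted(shards, key=lambda s: s.day)
    total = sum(s.doc_count for s in remaining)
    moved = []
    # Keep moving the oldest shard while either condition of S404 holds.
    while remaining and (total > MAX_TOTAL_DOCS
                         or today - remaining[0].day > MAX_AGE):
        oldest = remaining.pop(0)
        total -= oldest.doc_count
        moved.append(oldest)
    return moved
```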

It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.

This embodiment also provides a PB-level data storage device based on Elasticsearch, which is used to implement the foregoing embodiments and preferred embodiments; what has already been described is not repeated. As used below, the terms "module," "unit," "sub-unit," and the like may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.

Fig. 5 is a block diagram of a PB-level data storage device based on Elasticsearch according to an embodiment of the present application. As shown in fig. 5, the device includes:

The storage module 51 is configured to acquire data, store the data according to data slices of a preset duration, and generate, in the Elasticsearch system, a plurality of index slices for querying each data slice.

The processing module 52 is coupled to the storage module 51, and is configured to, when the total amount of data of the data slices corresponding to the plurality of index slices is greater than a preset threshold value and/or the time span of the data slices corresponding to the plurality of index slices is greater than a preset time span, transfer at least one index slice with an earlier generation time in the plurality of index slices to the HDFS system, so that Lucene queries the corresponding data according to the index slices stored in the HDFS system.

In some embodiments, the processing module 52 is configured to store the plurality of index slices in the Elasticsearch system when the total amount of data of the data slices corresponding to the plurality of index slices is less than a preset threshold and the time span of the data slices corresponding to the plurality of index slices is less than a preset time span.

In some embodiments, the storage module 51 is configured to collect data via Logstash and access it into the Elasticsearch system; the data collected each day is stored as one data slice in the Elasticsearch system, and an index slice associated with each data slice is generated on a daily basis.
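The daily slicing described above can be sketched in Python as follows. The index naming pattern "logs-YYYY.MM.DD", the "timestamp" field, and the function names are assumptions chosen for illustration; the application does not specify them.

```python
from collections import defaultdict
from datetime import datetime


def daily_index_name(ts: datetime, prefix: str = "logs") -> str:
    """Build a hypothetical per-day index name such as 'logs-2020.07.23'."""
    return f"{prefix}-{ts:%Y.%m.%d}"


def group_by_day(records):
    """Group records into one data slice per day, keyed by daily index name."""
    slices = defaultdict(list)
    for record in records:
        slices[daily_index_name(record["timestamp"])].append(record)
    return dict(slices)
```

Because each day maps to exactly one index name, the generation time of a slice can later be recovered from its name when deciding which slice is the earliest.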

In some embodiments, the processing module 52 is configured to, when the total amount of data of the data slices corresponding to the plurality of index slices is greater than a preset threshold and/or the time span of the data slices corresponding to the plurality of index slices is greater than a preset time span, transfer the index slice with the earliest generation time among the plurality of index slices to the HDFS system, where the generation time of that index slice is the time at which the Elasticsearch system started to store the data slices.

Fig. 6 is a block diagram of a PB-level data query device based on Elasticsearch according to an embodiment of the present application. As shown in fig. 6, the device includes:

The judging module 61 is configured to receive a data query request and judge whether the index fragment corresponding to the data requested by the data query request is stored in the Elasticsearch system or in the HDFS system, where at least one index fragment with an earlier generation time in the Elasticsearch system has been transferred to the HDFS system.

The query module 62 is coupled to the judging module 61 and is configured to: when the index fragment corresponding to the requested data is stored in the Elasticsearch system, query the corresponding data from the index fragments stored in the Elasticsearch system; and when the index fragment corresponding to the requested data is stored in the HDFS system, call the query interface of Lucene to query the corresponding data from the index fragments stored in the HDFS system.

In some embodiments, the data query request includes a storage timestamp of the requested data, and the judging module 61 is configured to: determine the time span of the index fragment corresponding to the requested data according to the storage timestamp; judge whether the time span of the index fragment is greater than a preset time span; when the time span of the index fragment is greater than the preset time span, determine that the index fragment corresponding to the requested data is stored in the HDFS system; and when the time span of the index fragment is smaller than the preset time span, determine that the index fragment corresponding to the requested data is stored in the Elasticsearch system.

In some embodiments, the query module 62 is configured to: split the data query request to obtain a first query request and a second query request, where the index shard corresponding to the data requested by the first query request is stored in the Elasticsearch system and the index shard corresponding to the data requested by the second query request is stored in the HDFS system; execute in parallel the query for the first query request against the index fragments stored in the Elasticsearch system and the Lucene query for the second query request against the index fragments stored in the HDFS system; and summarize the data queried from the Elasticsearch system and the data queried from the HDFS system to obtain the data requested by the data query request.

In some embodiments, the query module 62 is further configured to call the query interface of Lucene to query the corresponding index shards from the HDFS system, and to perform MapReduce data processing on the index fragments to obtain the data requested by the data query request.

The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.

The present embodiment also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in the present embodiment, the above-described processor may be configured to execute the following steps by means of a computer program:

Acquiring data, storing the data according to data fragments of a preset duration, and generating, in the Elasticsearch system, a plurality of index fragments for querying each data fragment.

When the total data amount of the data fragments corresponding to the index fragments is greater than a preset threshold and/or the time span of the data fragments corresponding to the index fragments is greater than a preset time span, transferring at least one index fragment with an earlier generation time among the index fragments to the HDFS system, so that Lucene queries the corresponding data according to the index fragments stored in the HDFS system.

Receiving a data query request, and judging whether the index fragment corresponding to the data requested by the data query request is stored in the Elasticsearch system or in the HDFS system, where at least one index fragment with an earlier generation time in the Elasticsearch system has been transferred to the HDFS system; when the index fragment is stored in the Elasticsearch system, querying the corresponding data from the index fragments stored in the Elasticsearch system.

When the index fragment corresponding to the data requested by the data query request is stored in the HDFS system, calling the query interface of Lucene to query the corresponding data from the index fragments stored in the HDFS system.

It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and alternative implementations, which are not repeated here.

In addition, in combination with the Elasticsearch-based PB-level data storage method and/or the Elasticsearch-based PB-level data query method in the above embodiments, the embodiments of the present application may provide a storage medium for implementation. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements any of the Elasticsearch-based PB-level data storage methods and/or Elasticsearch-based PB-level data query methods described in the above embodiments.

It should be understood by those skilled in the art that the technical features of the above-described embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above-described embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.

The above examples represent only a few embodiments of the present application; although they are described in relative detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A method for storing PB-level data based on Elasticsearch, comprising:

when the total data amount of the data fragments corresponding to the index fragments is larger than a preset threshold value and/or the time span of the data fragments corresponding to the index fragments is larger than a preset time span, at least one index fragment with earlier generation time in the index fragments is transferred to an HDFS system so that Lucene can query corresponding data according to the index fragments stored in the HDFS system;

acquiring data, storing the data according to data fragments of a preset duration, and generating a plurality of index fragments for querying each data fragment in an Elasticsearch system comprises the following steps:

collecting the data through Logstash and accessing the Elasticsearch system;

2. The method for storing PB-level data based on Elasticsearch of claim 1, wherein in a case where a total amount of data of the data slices corresponding to the plurality of index slices is smaller than a preset threshold and a time span of the data slices corresponding to the plurality of index slices is smaller than a preset time span, the storage method comprises: storing the plurality of index shards in the Elasticsearch system.

3. The method for storing PB-level data based on Elasticsearch according to claim 1, wherein in a case where a total amount of data of the data slices corresponding to the plurality of index slices is greater than a preset threshold and/or a time span of the data slices corresponding to the plurality of index slices is greater than a preset time span, the storage method comprises: transferring the one index fragment with the earliest generation time among the index fragments to an HDFS system, wherein the generation time of the one index fragment with the earliest generation time is the time when the Elasticsearch system starts to store the data fragments.

4. A method for querying PB-level data based on Elasticsearch, comprising:

when the storage position of the index fragment corresponding to the data requested to be queried by the data query request is in the HDFS system, calling a query interface of Lucene to query the corresponding data from the index fragment stored by the HDFS system;

the data query request comprises a storage timestamp of data requested to be queried, and judging whether the storage position of the index fragment corresponding to the data requested to be queried by the data query request is in an Elasticsearch system or an HDFS system comprises:

5. The method for querying PB-level data based on Elasticsearch of claim 4, wherein, in a case where a storage location of an index shard corresponding to data requested to be queried by the data query request includes the Elasticsearch system and the HDFS system, the querying method further comprises:

6. The method for querying PB-level data based on Elasticsearch of claim 4, wherein invoking a query interface of Lucene to query corresponding data from an index shard stored by the HDFS system comprises:

7. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method for storing PB-level data based on Elasticsearch according to any of claims 1 to 3 and/or to perform the method for querying PB-level data based on Elasticsearch according to any of claims 4 to 6.

8. A storage medium, wherein the storage medium has stored therein a computer program, wherein the computer program is arranged to perform, at runtime, the method for storing PB-level data based on Elasticsearch according to any of claims 1 to 3 and/or the method for querying PB-level data based on Elasticsearch according to any of claims 4 to 6.