
CN115982309A - Rail transit data analysis method based on big data - Google Patents

Rail transit data analysis method based on big data

Info

Publication number
CN115982309A
CN115982309A
Authority
CN
China
Prior art keywords
data
rail transit
analysis method
dictionary
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211730650.2A
Other languages
Chinese (zh)
Inventor
黄相辉
赵方捷
金斌斌
徐军
林静
陈帆
邹海双
吴磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zte Wenzhou Rail Communication Technology Co ltd
Original Assignee
Zte Wenzhou Rail Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zte Wenzhou Rail Communication Technology Co ltd filed Critical Zte Wenzhou Rail Communication Technology Co ltd
Priority to CN202211730650.2A priority Critical patent/CN115982309A/en
Publication of CN115982309A publication Critical patent/CN115982309A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of big data analysis, and in particular to a rail transit data analysis method based on big data, comprising the following steps: S1, collecting data; S2, cleaning data; S3, constructing a data dictionary; S4, segmenting words with the data dictionary; and S5, indexing data. Through the coordinated operation of these steps, the method facilitates the analysis and processing of existing rail transit data, so that the data is well utilized and can be fully and effectively retrieved with high retrieval efficiency.

Description

Rail transit data analysis method based on big data
Technical Field
The invention relates to the technical field of big data analysis, and in particular to a rail transit data analysis method based on big data.
Background
Rail transit is a class of transportation systems in which the operating vehicles run on dedicated rails; the most typical example is the railway system consisting of conventional trains and standard track. With the diversified development of train and railway technology, rail transit now takes more and more forms, covering not only long-distance land transportation but also widely used medium- and short-distance urban public transport. Against this background, the analysis of rail transit big data is of great importance.
In this regard, Chinese patent publication No. CN111831658A discloses a rail transit big data analysis method and system, the method comprising: collecting and summarizing raw rail transit data; storing the collected raw data; preprocessing the raw data, correcting abnormal data, and removing data that cannot be corrected; uploading the preprocessed rail transit data to a cloud platform, which stores the data and makes it available for retrieval; and a user sending an analysis requirement to the cloud platform through a client, the cloud platform returning the analysis result to the user client after performing the data analysis. That invention can effectively eliminate erroneous information from the collected data, improves the accuracy of data analysis, can perform predictive analysis on data over a future period, provides users with visualized data analysis, responds quickly when the same analysis request is received again, and reduces waiting time.
However, a problem remains when that invention is used: domestic subway equipment currently produces a large amount of unstructured data, yet rail companies have no consistent fault standard library, fault data lacks an accurate theoretical basis, and fault phenomena are described in unstructured natural language. This hinders data analysis and retrieval, so the utilization rate of rail transit data is low; that is, when existing rail transit data is analyzed and processed, effective retrieval is difficult and retrieval efficiency is low.
Therefore, it is necessary to design a rail transit data analysis method based on big data to solve the above problems.
Disclosure of Invention
The invention aims to provide a rail transit data analysis method based on big data, so as to solve the problems identified in the background art: existing rail transit data is difficult to retrieve effectively during analysis, and retrieval efficiency is low.
In order to achieve the purpose, the invention provides the following technical scheme: a rail transit data analysis method based on big data comprises the following steps:
s1, collecting data;
s2, cleaning data;
s3, constructing a data dictionary;
s4, segmenting words of the data dictionary;
and S5, indexing data.
Preferably, the data acquisition in step S1 includes acquisition of maintenance data of the rail equipment, and the maintenance data includes structured data, semi-structured data and unstructured data.
Preferably, the step S2 of cleaning the data comprises the following steps:
S11, missing-value processing: first analyze the importance of each field containing missing values, and calculate the missing-value ratio;
S12, format-content processing: for data strongly affected by human factors, such as user-filled entries, format-content problems fall into the following types: format problems (for example, inconsistent date formats) caused by differing input ends, where one format is generally selected as the standard during processing and the other formats are converted to it; and content problems, such as characters that should not appear in the content, which are handled once the problem type is identified, generally by filtering the data to remove the non-conforming content;
S13, logical-error cleaning: first remove duplicates, then remove unreasonable values, and finally correct contradictory content;
S14, removal of non-required data: when it is unclear whether a field will be needed later, the general practice is to retain it rather than delete it;
S15, correlation verification: if the data come from multiple sources, correlation verification across those sources is required.
Preferably, the missing-value ratio calculated in step S11 is handled according to the following importance/missing-ratio rules: fields with high importance and a high missing ratio require re-collection of the data or supplementation from other channels; fields with high importance and a low missing ratio are filled with the missing content; fields with low importance and a high missing ratio can be removed directly; and fields with low importance and a low missing ratio need no processing or only a simple fill.
Preferably, for a field that needs its missing content filled, data filling can generally be performed in the following ways: mode 1, filling missing values through manual participation, such as accumulated experience and professional knowledge; mode 2, simple calculation from data in the same field, such as the mean or the mode; mode 3, filling missing values from a combination of several fields through a rule-based calculation; and mode 4, if an indicator is very important and its missing rate is high, re-collecting the data or acquiring related data sets through other channels.
Preferably, the step S3 of constructing a data dictionary includes the following steps:
S21, full segmentation of the data, in which the unstructured maintenance text is exhaustively segmented into single characters;
S22, construction of a character relation graph: during construction, a hash calculation is performed on the segmented single characters to obtain a first-character hash table, and a linked list is generated for each character; the linked list contains one or more pointers to the next single character and stores the occurrence count of the current single character;
S23, word extraction: during extraction, a threshold is first set to filter the single-character counts; filtering then starts from the lowest count, characters with equal counts are joined into phrases, and counts that also match at higher levels form longer phrases; the lowest count is then deleted and filtering restarts from the next-lowest count, until the relation graph yields no further phrases.
Preferably, in step S22, when a character string is processed, each character is looked up in the first-character hash table; if the single character does not exist, it is created in the hash table together with a new linked list, and a hash-table pointer is set to point to that linked list; if the single character already exists, the pointer operations on its linked list are performed; finally, the occurrence count of the single character is updated.
Preferably, in step S4, a forward maximum matching algorithm is adopted when segmenting words against the data dictionary. The algorithm first finds the longest word length L in the dictionary; for the character string to be segmented, the leftmost L characters are taken out for matching. If the extracted string can be matched in the dictionary, the string to be segmented is split in two: the front part is a matched word and the rear part becomes the next string to be segmented. If the extracted string cannot be matched in the dictionary, L is decremented by one and the next cycle begins.
Preferably, the data index in step S5 includes global queries and local queries, both of which segment the data and write the processed data into an Elasticsearch cluster. The data writing process includes a database operation and an Elasticsearch cluster operation. The database operation acquires a data set after confirming that the client has successfully connected to the database, and also acquires a station information table. During data processing, the fields in the data set are analyzed to determine which fields can provide station information, and the stations and their coordinate values are obtained from the station information table. If a record in the data set contains station information, the corresponding station information is mapped onto it; for a record from which no station information can be extracted, the station information is set to empty. Finally, the processed data set is sent to the Elasticsearch cluster.
Preferably, the Elasticsearch cluster operation first confirms that the connection to the Elasticsearch cluster has succeeded. After confirmation, it is judged whether the data set to be written has a corresponding index, and if not, the index is created. Then, in the space-time segmentation, the abscissa and ordinate of the station are combined and used as a type. Data judged writable is then converted to the JSON format supported by the Elasticsearch cluster, and data processing finishes once every record in the data set submitted by the database has been written.
Compared with the prior art, the invention has the following beneficial effects: through the coordinated operation of the steps of collecting data, cleaning data, constructing a data dictionary, segmenting words with the data dictionary, and indexing data, the method facilitates the analysis and processing of existing rail transit data, so that the data is well utilized and can be fully and effectively retrieved with high retrieval efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a rail transit data analysis method according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The embodiment provided by the invention comprises the following steps:
as shown in fig. 1, a rail transit data analysis method based on big data includes the following steps:
s1, collecting data;
the step S1 includes data acquisition of maintenance data of the track equipment, wherein the maintenance data includes structured data, semi-structured data and unstructured data.
S2, cleaning data;
s3, constructing a data dictionary;
s4, segmenting words of the data dictionary;
and S5, indexing data.
Through the coordinated operation of the steps of collecting data, cleaning data, constructing a data dictionary, segmenting words with the data dictionary, and indexing data, the method facilitates the analysis and processing of existing rail transit data, so that the data is well utilized and can be fully and effectively retrieved with high retrieval efficiency.
Specifically, the step S2 of cleaning the data includes the following steps:
S11, missing-value processing: first analyze the importance of each field containing missing values, and calculate the missing-value ratio;
S12, format-content processing: for data strongly affected by human factors, such as user-filled entries, format-content problems fall into the following types: format problems (for example, inconsistent date formats) caused by differing input ends, where one format is generally selected as the standard during processing and the other formats are converted to it; and content problems, such as characters that should not appear in the content, which are handled once the problem type is identified, generally by filtering the data to remove the non-conforming content;
S13, logical-error cleaning: first remove duplicates, then remove unreasonable values, and finally correct contradictory content;
S14, removal of non-required data: when it is unclear whether a field will be needed later, the general practice is to retain it rather than delete it;
S15, correlation verification: if the data come from multiple sources, correlation verification across those sources is required.
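The cleaning steps above (format standardization, content filtering, and logical-error de-duplication) can be sketched in plain Python. This is a minimal illustration, not the patent's implementation; the record fields, accepted date formats, and filtering rule are assumptions made for the example:

```python
from datetime import datetime

# Hypothetical maintenance records with inconsistent date formats,
# stray characters, and a duplicate entry.
records = [
    {"id": "A1", "date": "2022/12/30", "fault": "door jam##"},
    {"id": "A1", "date": "2022-12-30", "fault": "door jam"},   # duplicate once cleaned
    {"id": "B2", "date": "30-12-2022", "fault": "signal loss"},
]

DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y")

def normalize_date(raw: str) -> str:
    """S12 format problem: convert every accepted input format to one standard."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def clean(recs):
    seen, out = set(), []
    for r in recs:
        # S12 content problem: filter characters that should not appear.
        fault = "".join(ch for ch in r["fault"] if ch.isalnum() or ch.isspace())
        r = dict(r, date=normalize_date(r["date"]), fault=fault)
        key = (r["id"], r["date"], r["fault"])
        if key not in seen:          # S13 logical-error cleaning: de-duplication
            seen.add(key)
            out.append(r)
    return out

cleaned = clean(records)
```

After cleaning, the two "A1" entries collapse into one record with the standardized date `2022-12-30`.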
Further, the missing-value ratio calculated in step S11 is handled according to the following importance/missing-ratio rules: fields with high importance and a high missing ratio require re-collection of the data or supplementation from other channels; fields with high importance and a low missing ratio are filled with the missing content; fields with low importance and a high missing ratio can be removed directly; and fields with low importance and a low missing ratio need no processing or only a simple fill.
Further, for a field that needs its missing content filled, data filling can generally be performed in the following ways: mode 1, filling missing values through manual participation, such as accumulated experience and professional knowledge; mode 2, simple calculation from data in the same field, such as the mean or the mode; mode 3, filling missing values from a combination of several fields through a rule-based calculation; and mode 4, if an indicator is very important and its missing rate is high, re-collecting the data or acquiring related data sets through other channels.
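A minimal sketch of the importance/missing-ratio rules and of filling mode 2 (calculation from data in the same field); the 50% ratio threshold and the action labels are illustrative assumptions, not values given in the patent:

```python
from statistics import mean

def missing_value_action(importance_high: bool, missing_ratio: float,
                         threshold: float = 0.5) -> str:
    """Map the (importance, missing-ratio) quadrant to a cleaning action."""
    high_ratio = missing_ratio >= threshold
    if importance_high and high_ratio:
        return "re-collect or fill from other channels"   # mode 4
    if importance_high:
        return "fill missing content"
    if high_ratio:
        return "drop field"
    return "no processing or simple fill"

def fill_with_mean(values):
    """Mode 2: fill gaps using data from the same field (here, the average)."""
    present = [v for v in values if v is not None]
    avg = mean(present)
    return [avg if v is None else v for v in values]
```

For example, a highly important field missing 80% of its values would be routed to re-collection, while `fill_with_mean([1, None, 3])` patches the gap with the field average.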
Specifically, the step S3 of constructing the data dictionary includes the following steps:
S21, full segmentation of the data, in which the unstructured maintenance text is exhaustively segmented into single characters;
S22, construction of a character relation graph: during construction, a hash calculation is performed on the segmented single characters to obtain a first-character hash table, and a linked list is generated for each character; the linked list contains one or more pointers to the next single character and stores the occurrence count of the current single character;
S23, word extraction: during extraction, a threshold is first set to filter the single-character counts; filtering then starts from the lowest count, characters with equal counts are joined into phrases, and counts that also match at higher levels form longer phrases; the lowest count is then deleted and filtering restarts from the next-lowest count, until the relation graph yields no further phrases.
Further, in step S22, when a character string is processed, each character is looked up in the first-character hash table; if the single character does not exist, it is created in the hash table together with a new linked list, and a hash-table pointer is set to point to that linked list; if the single character already exists, the pointer operations on its linked list are performed; finally, the occurrence count of the single character is updated.
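The step S22 construction can be sketched as follows. This is an approximation under stated assumptions: a Python dict keyed by character stands in for the first-character hash table, and a nested dict of successor counts stands in for the linked list of next-character pointers; the sample fault strings are invented for illustration:

```python
from collections import defaultdict

def build_relation_graph(texts):
    """S22 sketch: count each single character and its immediate successors."""
    char_count = defaultdict(int)                        # first-character hash table
    successors = defaultdict(lambda: defaultdict(int))   # plays the linked-list role
    for text in texts:
        for i, ch in enumerate(text):
            char_count[ch] += 1                  # create the entry or update its count
            if i + 1 < len(text):
                successors[ch][text[i + 1]] += 1  # "pointer" to the next character
    return char_count, successors

# Two hypothetical unstructured fault descriptions sharing a prefix.
counts, succ = build_relation_graph(["信号故障", "信号丢失"])
```

Step S23 would then read phrases off this graph: the edge 信→号 has count 2, flagging "信号" as a recurring phrase, while the count-1 edges fall below a frequency threshold.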
Further, in step S4, a forward maximum matching algorithm is adopted when segmenting words against the data dictionary. The algorithm first finds the longest word length L in the dictionary; for the character string to be segmented, the leftmost L characters are taken out for matching. If the extracted string can be matched in the dictionary, the string to be segmented is split in two: the front part is a matched word and the rear part becomes the next string to be segmented. If the extracted string cannot be matched in the dictionary, L is decremented by one and the next cycle begins.
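The forward maximum matching loop described above can be written directly; this is a standard textbook rendering of the algorithm (with a single-character fallback for out-of-dictionary characters), and the sample dictionary is an invented example:

```python
def forward_max_match(text, dictionary):
    """Forward maximum matching: repeatedly take the longest dictionary
    word from the left edge of the remaining string."""
    max_len = max(len(w) for w in dictionary)    # longest word length L
    words, pos = [], 0
    while pos < len(text):
        length = min(max_len, len(text) - pos)
        while length > 1 and text[pos:pos + length] not in dictionary:
            length -= 1                          # decrement L and retry
        words.append(text[pos:pos + length])     # front part: a matched word
        pos += length                            # rear part: next string to segment
    return words
```

With the hypothetical dictionary `{"轨道交通", "数据", "分析"}`, the string "轨道交通数据分析" segments into `["轨道交通", "数据", "分析"]`: the first pass matches the full 4-character word, and each remainder matches a 2-character word.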
Further, queries are divided into global queries and local queries. In the global-query case: if the data is not indexed, the cost of creating the index during data entry is avoided, but data segmentation still benefits distributed computation and thus improves query efficiency.
In the local-query case: if the data is not indexed, the cost of a full-table query becomes excessive when the data volume is large; with data segmentation, by contrast, the amount of data scanned per query is greatly reduced, improving query efficiency.
Combining the two cases, data segmentation is essential when most queries are local. Therefore, the data index in step S5 includes global queries and local queries, both of which segment the data and write the processed data into an Elasticsearch cluster.
It is worth mentioning that Elasticsearch is a distributed, highly scalable, near-real-time search and data analysis engine that conveniently provides large volumes of data with search, analysis, and exploration capabilities; making full use of its horizontal scalability makes the data more valuable in a production environment. Elasticsearch is well-known prior art and is therefore not elaborated here.
Further, the data writing process includes a database operation and an Elasticsearch cluster operation. The database operation acquires a data set after confirming that the client has successfully connected to the database, and also acquires a station information table. During data processing, the fields in the data set are analyzed to determine which fields can provide station information, and the stations and their coordinate values are obtained from the station information table. If a record in the data set contains station information, the corresponding station information is mapped onto it; for a record from which no station information can be extracted, the station information is set to empty. Finally, the processed data set is sent to the Elasticsearch cluster. The Elasticsearch cluster operation first confirms that the connection to the Elasticsearch cluster has succeeded, then judges whether the data set to be written has a corresponding index and, if not, creates it. In the space-time segmentation, to facilitate indexing, the abscissa and ordinate of the station are combined and used as a type, so for each record in the data set it is judged whether this type exists; if not, it is created, otherwise the time information in the data item is extracted to set the data routing. Data judged writable is then converted to the JSON format supported by the Elasticsearch cluster, and data processing finishes once every record in the data set submitted by the database has been written.
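The record-shaping half of this write path (station mapping, empty station for unextractable records, JSON conversion) can be sketched as below. The station table, field names, and index name are invented for illustration; the commented-out section shows what the cluster write might look like with the official `elasticsearch` Python client, assuming a cluster is running:

```python
import json

# Hypothetical station information table: station -> (abscissa, ordinate).
stations = {"Station A": (120.65, 28.01), "Station B": (120.70, 28.05)}

def to_actions(dataset, index_name):
    """Map station info onto each record (empty when none can be extracted)
    and shape the documents for an Elasticsearch bulk write."""
    actions = []
    for rec in dataset:
        coords = stations.get(rec.get("station"))
        doc = dict(rec,
                   station=rec.get("station") if coords else None,
                   location={"lon": coords[0], "lat": coords[1]} if coords else None)
        json.dumps(doc)  # confirm the record converts to Elasticsearch-supported JSON
        actions.append({"_index": index_name, "_source": doc})
    return actions

actions = to_actions(
    [{"station": "Station A", "fault": "door jam"},
     {"fault": "no station extractable"}],   # station info set to empty
    "rail-maintenance")

# Against a live cluster, the write itself might look like this (not executed here):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# if not es.indices.exists(index="rail-maintenance"):
#     es.indices.create(index="rail-maintenance")
# helpers.bulk(es, actions)
```

The first record picks up its station coordinates; the second, with no extractable station, is written with empty station fields, matching the mapping rule described above.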
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A rail transit data analysis method based on big data is characterized by comprising the following steps:
s1, collecting data;
s2, cleaning data;
s3, constructing a data dictionary;
s4, segmenting words of the data dictionary;
and S5, indexing data.
2. The big data based rail transit data analysis method according to claim 1, wherein the data collection in step S1 includes collection of rail equipment maintenance data, and the maintenance data includes structured data, semi-structured data, and unstructured data.
3. The big data-based rail transit data analysis method according to claim 1, wherein the step S2 of cleaning the data comprises the following steps:
S11, missing-value processing: first analyze the importance of each field containing missing values, and calculate the missing-value ratio;
S12, format-content processing: for data strongly affected by human factors, such as user-filled entries, format-content problems fall into the following types: format problems (for example, inconsistent date formats) caused by differing input ends, where one format is generally selected as the standard during processing and the other formats are converted to it; and content problems, such as characters that should not appear in the content, which are handled once the problem type is identified, generally by filtering the data to remove the non-conforming content;
S13, logical-error cleaning: first remove duplicates, then remove unreasonable values, and finally correct contradictory content;
S14, removal of non-required data: when it is unclear whether a field will be needed later, the general practice is to retain it rather than delete it;
S15, correlation verification: if the data come from multiple sources, correlation verification across those sources is required.
4. The big data-based rail transit data analysis method according to claim 3, wherein the missing-value ratio calculated in step S11 is handled according to the following importance/missing-ratio rules: fields with high importance and a high missing ratio require re-collection of the data or supplementation from other channels; fields with high importance and a low missing ratio are filled with the missing content; fields with low importance and a high missing ratio can be removed directly; and fields with low importance and a low missing ratio need no processing or only a simple fill.
5. The big-data-based rail transit data analysis method according to claim 4, wherein, for a field that needs its missing content filled, data filling can generally be performed in the following ways: mode 1, filling missing values through manual participation, such as accumulated experience and professional knowledge; mode 2, simple calculation from data in the same field, such as the mean or the mode; mode 3, filling missing values from a combination of several fields through a rule-based calculation; and mode 4, if an indicator is very important and its missing rate is high, re-collecting the data or acquiring related data sets through other channels.
6. The big data-based rail transit data analysis method according to claim 1, wherein the step S3 of constructing a data dictionary comprises the following steps:
S21, full segmentation of the data, in which the unstructured maintenance text is exhaustively segmented into single characters;
S22, construction of a character relation graph: during construction, a hash calculation is performed on the segmented single characters to obtain a first-character hash table, and a linked list is generated for each character; the linked list contains one or more pointers to the next single character and stores the occurrence count of the current single character;
S23, word extraction: during extraction, a threshold is first set to filter the single-character counts; filtering then starts from the lowest count, characters with equal counts are joined into phrases, and counts that also match at higher levels form longer phrases; the lowest count is then deleted and filtering restarts from the next-lowest count, until the relation graph yields no further phrases.
7. The big-data-based rail transit data analysis method according to claim 6, wherein in step S22, when a character string is processed, each character is looked up in the first-character hash table; if the single character does not exist, it is created in the hash table together with a new linked list, and a hash-table pointer is set to point to that linked list; if the single character already exists, the pointer operations on its linked list are performed; finally, the occurrence count of the single character is updated.
8. The big-data-based rail transit data analysis method according to claim 1, wherein in step S4 a forward maximum matching algorithm is adopted when segmenting words against the data dictionary: the algorithm first finds the longest word length L in the dictionary; for the character string to be segmented, the leftmost L characters are taken out for matching; if the extracted string can be matched in the dictionary, the string to be segmented is split in two, the front part being a matched word and the rear part becoming the next string to be segmented; if the extracted string cannot be matched in the dictionary, L is decremented by one and the next cycle begins.
9. The big-data-based rail transit data analysis method according to claim 1, wherein the data index in step S5 includes global queries and local queries, both of which segment the data and write the processed data into an Elasticsearch cluster; the data writing process includes a database operation and an Elasticsearch cluster operation; the database operation acquires a data set after confirming that the client has successfully connected to the database, and also acquires a station information table; during data processing, the fields in the data set are analyzed to determine which fields can provide station information, and the stations and their coordinate values are obtained from the station information table; if a record in the data set contains station information, the corresponding station information is mapped onto it, and for a record from which no station information can be extracted, the station information is set to empty; finally, the processed data set is sent to the Elasticsearch cluster.
10. The big-data-based rail transit data analysis method according to claim 9, wherein the Elasticsearch cluster operation first confirms that the connection to the Elasticsearch cluster has succeeded; after confirmation, it is judged whether the data set to be written has a corresponding index, and if not, the index is created; then, in the space-time segmentation, the abscissa and ordinate of the station are combined and used as a type; the time information in the data item is then extracted to set the data routing; data judged writable is then converted to the JSON format supported by the Elasticsearch cluster; and data processing finishes once every record in the data set submitted by the database has been written.
CN202211730650.2A 2022-12-30 2022-12-30 Rail transit data analysis method based on big data Pending CN115982309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211730650.2A CN115982309A (en) 2022-12-30 2022-12-30 Rail transit data analysis method based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211730650.2A CN115982309A (en) 2022-12-30 2022-12-30 Rail transit data analysis method based on big data

Publications (1)

Publication Number Publication Date
CN115982309A true CN115982309A (en) 2023-04-18

Family

ID=85966442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211730650.2A Pending CN115982309A (en) 2022-12-30 2022-12-30 Rail transit data analysis method based on big data

Country Status (1)

Country Link
CN (1) CN115982309A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034317A (en) * 2023-08-11 2023-11-10 哈尔滨慧忠科技信息咨询有限公司 Computer network intelligent analysis system and method based on big data


Similar Documents

Publication Publication Date Title
CN110472066B (en) Construction method of urban geographic semantic knowledge map
Mark Geographic information science: Defining the field
US20020021838A1 (en) Adaptively weighted, partitioned context edit distance string matching
CN112650848A (en) Urban railway public opinion information analysis method based on text semantic related passenger evaluation
CN102119383A (en) Method and subsystem for information acquisition and aggregation to facilitate ontology and language-model generation within a content-search-service system
CN113190687B (en) Knowledge graph determining method and device, computer equipment and storage medium
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN109165273A (en) General Chinese address matching method facing big data environment
CN104346438A (en) Data management service system based on large data
CN108268440A (en) A kind of unknown word identification method
CN113626400A (en) Log event extraction method and system based on log tree and analytic tree
CN113779429A (en) Traffic congestion situation prediction method, device, equipment and storage medium
CN115982309A (en) Rail transit data analysis method based on big data
CN110188092A (en) The system and method for novel contradiction and disputes in a kind of excavation people's mediation
CN112364627A (en) Safety production accident analysis method and device based on text mining, electronic equipment and storage medium
CN114201480A (en) Multi-source POI fusion method and device based on NLP technology and readable storage medium
CN117493906A (en) City event allocation method, system and storage medium
CN112148938B (en) Cross-domain heterogeneous data retrieval system and retrieval method
CN114385845A (en) Image classification management method and system based on graph clustering
CN118445406A (en) Integration system based on massive polymorphic circuit heritage information
CN116956930A (en) Short text information extraction method and system integrating rules and learning models
Christen et al. A probabilistic geocoding system utilising a parcel based address file
CN106157651B (en) A kind of traffic radio traffic information broadcasting system based on voice semantic understanding
CN115953041A (en) Construction scheme and system of operator policy system
CN112860815A (en) Finance and tax informatization data processing system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination