CN118378117B

CN118378117B - Ship data real-time intelligent analysis method based on data acquisition

Info

Publication number: CN118378117B
Application number: CN202410814496.XA
Authority: CN
Inventors: 高禄增; 王志
Original assignee: Dalian Haida Yinghai Technology Co ltd
Current assignee: Dalian Haida Yinghai Technology Co ltd
Priority date: 2024-06-24
Filing date: 2024-06-24
Publication date: 2024-08-20
Anticipated expiration: 2044-06-24
Also published as: CN118378117A

Abstract

The invention relates to the technical field of navigation track data processing, in particular to a ship data real-time intelligent analysis method based on data acquisition, which comprises the following steps: acquiring characteristic data of each ship under each dimension of a navigation track and longitude and latitude data of each moment; obtaining an initial cluster based on the distance distribution between the starting point position and the end point position of the navigation track; obtaining a navigation time interval of each initial cluster; obtaining the characteristic occupation coefficient of the navigation time interval of each initial cluster in each dimension according to the abnormal distribution condition of the difference distance between the characteristic data of the navigation track in each dimension in the navigation time interval of each initial cluster and the fluctuation condition of the longitude and latitude data at each moment; and carrying out comprehensive evaluation by utilizing the characteristic occupation coefficient, and clustering the navigation tracks of all the ships according to the evaluation result to obtain a clustering result of the navigation tracks. The invention can obtain more accurate clustering results of the ship navigation tracks.

Description

Ship data real-time intelligent analysis method based on data acquisition

Technical Field

The invention relates to the technical field of navigation track data processing, in particular to a ship data real-time intelligent analysis method based on data acquisition.

Background

In recent years, as the digitalization of the shipping industry has accelerated, real-time intelligent analysis of ship data based on data acquisition has become a key technology in the industry. Current research in the field mainly focuses on analyzing and processing ship operation data in real time using advanced sensor technology, big data processing and artificial intelligence algorithms, so as to improve the operating efficiency, safety and environmental performance of ships. In practice, such methods are widely applied to navigation planning, energy consumption management, fault diagnosis and other aspects of ship operation. Cluster analysis of ship navigation tracks can identify different types of navigation and behavior patterns, such as berthing and sailing, so that the movement patterns of ships can be better understood. It can also help identify abnormal behavior in navigation track data, which helps to uncover potential safety risks and dangerous situations.

In general, the AIS data of a sailing ship is rich in attributes, including position coordinates, sailing speed, heading and the like, and in the analysis of ship navigation tracks a single distance measure cannot accurately and comprehensively reflect the similarity or difference between two navigation tracks. Conventional methods therefore often cluster the navigation tracks of different ships by combining the distance measurement results of several kinds of attribute data recorded during navigation. However, these methods usually combine the distance measurement results of the different attribute data using equal weights or empirically set weights, so the distance used for the final clustering is inaccurate, which in turn reduces the accuracy of the clustering result of the navigation tracks.

Disclosure of Invention

In order to solve the technical problem that the distance measurement results of the existing method are inaccurate and the clustering result precision of the navigation track is low, the invention aims to provide a ship data real-time intelligent analysis method based on data acquisition, and the adopted technical scheme is as follows:

acquiring the navigation track of each ship, and determining the characteristic data of the navigation track of each ship under each dimension and the longitude and latitude data of each moment based on the characteristic distribution of different attribute data of the navigation track of each ship;

clustering the navigation tracks based on the distance distribution between the starting point position and the end point position of the navigation track of each ship to obtain an initial cluster; dividing the time according to the fluctuation degree of longitude and latitude data of each navigation track in each initial cluster at each time to obtain a navigation time interval of each initial cluster;

obtaining the characteristic occupation coefficient of the navigation time interval of each initial cluster in each dimension according to the abnormal distribution condition of the difference distance between the characteristic data of the navigation track in each dimension in the navigation time interval of each initial cluster and the fluctuation condition of the longitude and latitude data at each moment;

and comprehensively evaluating the difference distances of the characteristic data of the navigation tracks in all dimensions in the navigation time interval of each initial cluster by utilizing the characteristic occupation coefficient, and clustering the navigation tracks of all ships according to the evaluation result to obtain a clustering result of the navigation tracks.

Preferably, the dividing the time according to the fluctuation degree of the longitude and latitude data of each navigation track in each initial cluster at each time to obtain the navigation time interval of each initial cluster specifically includes:

For any one initial cluster, determining longitude and latitude discrete factors of the initial cluster at any one moment based on the discrete degree of longitude data of all navigation tracks at any one moment and the discrete degree of latitude data at any one moment;

Dividing all moments corresponding to the initial cluster into two time categories based on the difference distance between longitude and latitude discrete factors of every two moments corresponding to the initial cluster;

And screening the time category based on the characteristic distribution of the longitude and latitude discrete factors at all times in the time category, and dividing all times according to the screening result to obtain the navigation time interval of the initial cluster.

Preferably, the filtering of the time class is performed based on the feature distribution of the longitude and latitude discrete factors at all times in the time class, and the dividing of all times is performed according to the filtering result to obtain the navigation time interval of the initial cluster, which specifically includes:

Obtaining the time category with the smallest mean value of the longitude and latitude discrete factors over all of its moments, recording it as a first characteristic class, and classifying the moments in the first characteristic class based on the difference distance between every two moments to obtain two subcategories;

And dividing all the moments based on the mean moment of all the moments in each subcategory to obtain the navigation time interval of the initial cluster.

Preferably, the obtaining the feature occupancy coefficient of the navigation time interval of each initial cluster in each dimension according to the abnormal distribution condition of the difference distance between the feature data of the navigation track in each dimension in the navigation time interval of each initial cluster and the fluctuation condition of the longitude and latitude data of each moment specifically includes:

Obtaining data association indexes of each dimension according to abnormal distribution conditions of difference distances among all feature data of each dimension; obtaining the data association degree of the navigation time interval of each initial cluster in each dimension according to the abnormal distribution condition of the difference distance between the characteristic data in each dimension in the navigation time interval of each initial cluster;

the characteristic data of the navigation track under one dimension comprises Euclidean distance between the navigation track from the starting point to the end point, the dimension is marked as a selected dimension, and other dimensions except the selected dimension are marked as characteristic dimensions;

For any one initial cluster, determining the data importance degree of each navigation time interval of the initial cluster under the selected dimension by combining the data association index of the selected dimension based on the difference distribution condition of longitude and latitude discrete factors of the navigation time interval;

For any one navigation time interval, determining the data importance degree of the navigation time interval under each characteristic dimension by combining the negative correlation mapping value of the difference between the data association degree of the navigation time interval under each characteristic dimension and the data association index of the same characteristic dimension based on the negative correlation coefficient of the data importance degree corresponding to the navigation time interval;

And determining the characteristic occupation coefficient of each navigation time interval of the initial cluster in each dimension based on the data importance degree of each navigation time interval in each dimension, wherein the characteristic occupation coefficient is a normalized value.

Preferably, the obtaining the data association index of each dimension according to the abnormal distribution condition of the difference distances between all the feature data in each dimension specifically includes:

For any dimension, carrying out anomaly detection on the difference distance between every two feature data in the dimension, and determining a first duty ratio coefficient based on the number duty ratio of the anomaly data in the anomaly detection result; determining a first discrete coefficient based on the degree of dispersion of the difference distance between every two feature data in the dimension;

Obtaining a data association index of the dimension according to the first duty ratio coefficient and the first discrete coefficient; the first duty ratio coefficient and the first discrete coefficient are in negative correlation with the data association index;

Obtaining the data association degree of the navigation time interval of each initial cluster in each dimension according to the abnormal distribution condition of the difference distance between the characteristic data in each dimension in the navigation time interval of each initial cluster, wherein the method specifically comprises the following steps:

For any one navigation time interval of any one initial cluster, acquiring a quantity ratio of abnormal data belonging to the navigation time interval of the initial cluster in an abnormal detection result corresponding to any one dimension, determining a second duty ratio coefficient, and determining a second discrete coefficient based on the discrete degree of the difference distance between every two feature data of the navigation time interval of the initial cluster in the any one dimension;

And obtaining the data association degree of the navigation time interval of the initial cluster under any dimension according to the second duty ratio coefficient and the second discrete coefficient, wherein the second duty ratio coefficient and the second discrete coefficient are in negative correlation with the data association degree.

Preferably, the determining the data importance degree of each navigation time interval of the initial cluster under the selected dimension based on the difference distribution condition of longitude and latitude discrete factors of the navigation time interval and in combination with the data association index of the selected dimension specifically includes:

For any one initial cluster, marking all navigation time intervals as a first time interval, a second time interval and a third time interval according to the time sequence;

for the second time interval, determining the data importance degree of the second time interval under the selected dimension based on the normalized value of the data association index of the selected dimension;

Taking the mean value of longitude and latitude discrete factors at all moments in a first time interval and a third time interval as a first characteristic factor; taking the average value of longitude and latitude discrete factors at all moments in a third time interval as a second characteristic factor;

And for any one of the first time interval and the third time interval, determining the data importance degree of any one of the first time interval and the third time interval under the selected dimension based on the negative correlation coefficient of the difference between the first characteristic factor and the second characteristic factor and the normalized value of the data association index of the selected dimension.

Preferably, the comprehensive evaluation of the difference distance of the feature data of the navigation track in all dimensions in the navigation time interval of each initial cluster by using the feature occupancy coefficient specifically includes:

for any dimension, taking the difference distance between the characteristic data of each two navigation tracks in the dimension as the initial difference distance of each two navigation tracks in the dimension;

And for any two navigation tracks, weighting each initial difference distance in each dimension by using the characteristic occupation coefficient to obtain the comprehensive difference distance of the any two navigation tracks, wherein the comprehensive difference distance represents the comprehensive evaluation result of the difference distances of the characteristic data of the any two navigation tracks in all dimensions.
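The weighting described above can be illustrated with a minimal sketch. It assumes, for simplicity, a single characteristic occupation coefficient per dimension that sums to 1 over the dimensions; the dimension names, the numeric values and the function name `comprehensive_distance` are hypothetical and only serve the illustration.

```python
def comprehensive_distance(initial_distances, occupancy):
    """Weighted sum of the per-dimension initial difference distances of one pair of tracks."""
    return sum(occupancy[dim] * dist for dim, dist in initial_distances.items())

# Hypothetical initial difference distances of two navigation tracks in each dimension
initial = {"heading": 2.0, "speed": 4.0, "track_length": 8.0, "start_end": 8.0}
# Hypothetical normalized characteristic occupation coefficients (sum to 1)
weights = {"heading": 0.5, "speed": 0.25, "track_length": 0.125, "start_end": 0.125}

combined = comprehensive_distance(initial, weights)
```

In the method itself the coefficients are determined per navigation time interval of each initial cluster, so a fuller implementation would look up the weight by interval as well as by dimension.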

Preferably, the clustering of the navigation tracks of all the ships according to the evaluation result to obtain a clustering result of the navigation tracks specifically includes:

Based on the comprehensive difference distance between every two navigation tracks, a clustering result of the navigation tracks is obtained by using a clustering algorithm.

Preferably, the clustering of the navigation tracks based on the distance distribution between the starting point position and the ending point position of the navigation track of each ship to obtain an initial cluster specifically includes:

For the navigation track of any ship, acquiring the Euclidean distance between the starting point position and the end point position of the navigation track as the characteristic distance of the navigation track; based on the difference between the characteristic distances of the navigation tracks of the ships, clustering the navigation tracks by using a clustering algorithm to obtain an initial cluster.

Preferably, the characteristic data of the navigation track of each ship in each dimension comprises characteristic data corresponding to the heading, the speed, the track length and the distance between the starting point and the ending point of the track.

The embodiment of the invention has at least the following beneficial effects:

According to the invention, characteristic data in the dimensions corresponding to different attribute data are first acquired for the navigation tracks of different ships, together with the longitude and latitude data at each moment, which provides the data basis for the subsequent analysis of how the position information of the navigation tracks is distributed at each moment. Initial clustering is then performed on the distance distribution between the start and end points of the ship navigation tracks. The resulting initial clusters consider the data difference distribution in only a single dimension, so their accuracy is difficult to guarantee and further analysis is needed. Specifically, because the importance of the data features of different dimensions differs between sailing stages, the moments are divided by analyzing the fluctuation of the longitude and latitude data; the longitude and latitude data reflect the position of the ship at each moment of the navigation track, so the resulting navigation time intervals characterize the different sailing stages. Further, each initial cluster has its own adaptively divided navigation time intervals, so the abnormal conditions of the characteristic data of the navigation tracks in each dimension and the longitude and latitude fluctuation are analyzed within a local time range. This takes both data anomalies and the fluctuation of the position information fully into account when determining the characteristic occupation coefficient of each dimension within the local time range, which represents the share and the importance of the data features in that dimension. Finally, when the characteristic occupation coefficients are used to comprehensively evaluate the difference distances of the characteristic data in all dimensions, a corresponding weight is set adaptively for the distance measure in each dimension, so that a more accurate combined feature is obtained and the clustering result is more accurate.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of steps of a method for intelligent analysis of ship data in real time based on data acquisition;

FIG. 2 is a flow chart of the steps of the method for acquiring the navigation time interval of each initial cluster provided by the invention;

FIG. 3 is a flow chart of steps of a method for obtaining feature occupancy coefficients in each dimension provided by the present invention;

FIG. 4 is a block diagram of a ship data real-time intelligent analysis system based on data acquisition.

Detailed Description

In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of specific implementation, structure, characteristics and effects of the ship data real-time intelligent analysis method based on data acquisition according to the invention with reference to the accompanying drawings and the preferred embodiment. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The invention provides a specific scheme of a ship data real-time intelligent analysis method based on data acquisition, which is specifically described below with reference to the accompanying drawings.

Referring to FIG. 1, a flowchart of the steps of a ship data real-time intelligent analysis method based on data acquisition according to an embodiment of the present invention is shown. The method includes the following steps:

Step S100, acquiring the navigation track of each ship, and determining the characteristic data of the navigation track of each ship under each dimension and the longitude and latitude data of each moment based on the characteristic distribution of different attribute data of the navigation track of each ship.

Satellite AIS is a ship positioning technology that enables monitoring of ships sailing in open sea areas. AIS data can be obtained while a ship is sailing; it includes the longitude, latitude, sailing speed, heading and other information of the ship during navigation. Because the AIS data of ships in the same area are distributed with a certain regularity, the characteristic data of the navigation track in each dimension and the longitude and latitude data at each moment can be obtained from the characteristic distribution of the navigation data of different attributes recorded during the navigation of different ships.

Specifically, each ship corresponds to a navigation track, and each dimension corresponds to navigation data of one attribute, including heading, speed, track length and distance between the start point and the end point of the track.

In this embodiment, an arbitrary ship is taken as an example. The Euclidean distance between the start point position and the end point position of the ship's navigation track is used as the characteristic data in one dimension. The total length of the navigation track is used as the characteristic data in another dimension; it will be appreciated that the total length reflects the overall distance travelled over the whole voyage. Further, the heading along the navigation track is used as the characteristic data in one dimension, and the speed along the navigation track is used as the characteristic data in another dimension.

It should be noted that the navigation track of each ship has one item of characteristic data in the position information dimension and one item in the navigation length dimension. The longitude and latitude data of the navigation track at each moment refer to the longitude data and latitude data of the ship at each moment of the whole navigation process covered by the track, and thus reflect the position of the ship at each moment. Similarly, the heading and the speed each take one value at every moment of the whole navigation process and reflect the real-time navigation information at each moment.
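As an illustration of the two single-valued dimensions described above, the following minimal sketch computes the start-to-end Euclidean distance and the total track length of one navigation track. It assumes the track is given as a time-ordered list of (longitude, latitude) pairs and treats them as planar coordinates; the function name `track_features` and the sample track are hypothetical.

```python
import math

def track_features(points):
    """points: time-ordered list of (lon, lat) tuples for one navigation track (assumed layout)."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    # Euclidean distance between the start point position and the end point position
    start_end = math.hypot(x1 - x0, y1 - y0)
    # Total track length: sum of the distances between consecutive positions
    length = sum(
        math.hypot(bx - ax, by - ay)
        for (ax, ay), (bx, by) in zip(points, points[1:])
    )
    return {"start_end_distance": start_end, "track_length": length}

track = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]   # hypothetical positions
features = track_features(track)
```

Heading and speed are reported per moment in the AIS data, so they would be carried alongside these values as time series rather than computed here.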

Step S200, clustering the navigation tracks to obtain an initial cluster based on the distance distribution between the starting point position and the end point position of the navigation track of each ship; dividing the time according to the fluctuation degree of the longitude and latitude data of each navigation track in each initial cluster at each time to obtain the navigation time interval of each initial cluster.

For the characteristic data of the ship under one dimension of the navigation track, the Euclidean distance between the starting point position and the end point position of the navigation track can intuitively reflect the operation data characteristics of the navigation track, and firstly, the characteristic data under a single dimension is adopted to preliminarily classify the navigation track, so that whether the tracks of different ship routes are abnormal or not can be preliminarily reflected. That is, the navigation tracks are clustered based on the distance distribution between the starting point position and the ending point position of the navigation track of each ship to obtain an initial cluster.

In this embodiment, for the navigation track of any one ship, the Euclidean distance between the start position and the end position of the navigation track is taken as the characteristic distance of the navigation track. Based on the differences between the characteristic distances of the navigation tracks of the ships, the navigation tracks are clustered using the K-means clustering algorithm to obtain the initial clusters. The clustering distance can be measured by the Euclidean distance between characteristic distances, and the optimal number of clusters can be determined by the silhouette coefficient method.
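The initial clustering can be sketched as follows. This is a minimal, standard-library illustration of K-means on scalar characteristic distances with the number of clusters chosen by the mean silhouette coefficient; the deterministic initialization, the candidate range of cluster numbers and the sample distances are assumptions made for the example, not values taken from the embodiment.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns one cluster label per value."""
    ordered = sorted(values)
    # Deterministic initialization from evenly spaced order statistics (an assumption)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(values, labels):
    """Mean silhouette coefficient with absolute difference as the distance."""
    if len(set(labels)) < 2:
        return -1.0
    scores = []
    for i, v in enumerate(values):
        own = [abs(v - w) for j, w in enumerate(values) if labels[j] == labels[i] and j != i]
        if not own:                      # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(abs(v - w) for w, l in zip(values, labels) if l == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical characteristic distances (start-to-end Euclidean distance per track)
distances = [10.1, 10.4, 9.8, 50.2, 49.7, 50.5, 120.0, 119.5]
best_k = max(range(2, 5), key=lambda k: silhouette(distances, kmeans_1d(distances, k)))
initial_labels = kmeans_1d(distances, best_k)
```

A production implementation would more likely rely on an established clustering library; the hand-written version is used here only to keep the example self-contained.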

Cluster analysis based on navigation data of a single attribute has low accuracy, so the preliminary clustering result needs to be adjusted in combination with the distribution of the characteristic data in other dimensions. In the preliminary clustering result, the start and end positions of the navigation tracks in each initial cluster are similar, and under normal conditions the navigation tracks belonging to the same initial cluster comprise the same navigation stages, such as a departure stage, a navigation stage and an arrival stage. The importance of the characteristic data of different dimensions also differs between stages.

Specifically, in the departure stage and the arrival stage the ship is relatively close to the port, that is, the longitude and latitude data of the different navigation tracks differ little in these stages. In these stages, the difference distribution of the characteristic data in the other dimensions therefore needs to be given greater weight, so that the differences between navigation tracks are reflected more accurately and comprehensively. Based on this, dividing the moments according to the fluctuation degree of the longitude and latitude data of the navigation tracks in each initial cluster at each moment, to obtain the navigation time intervals of each initial cluster, can be achieved through steps S201 to S203 as shown in FIG. 2.

Step S201, for any one initial cluster, determining longitude and latitude discrete factors of the initial cluster at any one time based on the discrete degrees of longitude data of all navigation tracks at any one time and the discrete degrees of latitude data at any one time.

The longitude data and latitude data at each moment of a navigation track reflect the position of the ship at that moment. Within the same initial cluster the data of different navigation tracks are similar, so the fluctuation of the longitude and latitude data at each moment is small overall; however, the fluctuation of the longitude and latitude data takes different forms in the different navigation stages of the tracks.

For example, the longitude and latitude data of each navigation track in the same initial cluster are relatively close in the departure stage and the entry stage of navigation, and the longitude and latitude data of the navigation tracks of the ships with different navigation behaviors have larger difference in the navigation stage of navigation. Based on the method, firstly, fluctuation conditions of longitude and latitude data of all ships at each moment in the whole sailing process are analyzed, and data discrete conditions of all sailing tracks in the same initial cluster at each moment are obtained in a quantification mode.

In this embodiment, taking an arbitrary initial cluster as an example, the longitude and latitude data at each moment are analyzed jointly. Specifically, the degree of dispersion of the longitude data may be represented by the variance of the longitude data of all navigation tracks belonging to the same initial cluster at the same moment; in other embodiments, the degree of dispersion may be represented by another dispersion measure such as the standard deviation or the relative standard deviation.

Similarly, according to the same method, the degree of dispersion of the latitude data can be represented by using the variances of the latitude data of all navigation tracks belonging to the same initial cluster at the same moment. It should be noted that, the degree of dispersion of the longitude data and the latitude data needs to be measured by using the same discrete coefficient.

The two degrees of dispersion are then combined: the mean of the variance of the longitude data and the variance of the latitude data is taken as the longitude and latitude discrete factor of each moment. The longitude and latitude discrete factor reflects how dispersed the longitude and latitude positions of all navigation tracks in one initial cluster are at one moment; the larger its value, the less close and the more scattered the positions of the navigation tracks in the initial cluster are at the current moment.
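The longitude and latitude discrete factor can be sketched as follows. The example assumes that all navigation tracks of one initial cluster have already been aligned to the same sequence of moments and uses the population variance; the data layout, the function name `discrete_factors` and the sample positions are hypothetical.

```python
from statistics import pvariance

def discrete_factors(cluster_tracks):
    """cluster_tracks: list of tracks, each a list of (lon, lat) sampled at the
    same moments (assumed, pre-aligned layout). Returns one factor per moment."""
    factors = []
    for positions in zip(*cluster_tracks):        # all tracks at one moment
        lon_var = pvariance([lon for lon, _ in positions])
        lat_var = pvariance([lat for _, lat in positions])
        factors.append((lon_var + lat_var) / 2)   # mean of the two variances
    return factors

tracks = [
    [(0.0, 0.0), (1.0, 2.0), (5.0, 5.0)],
    [(0.0, 0.0), (3.0, 4.0), (5.0, 5.0)],
]
factors = discrete_factors(tracks)
```

In this toy cluster the two tracks coincide at the first and last moments and diverge in between, so only the middle moment has a non-zero factor.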

Step S202, all moments corresponding to the initial cluster are divided into two time categories based on the difference distance between longitude and latitude discrete factors of every two moments corresponding to the initial cluster.

In the course of navigation, the distribution of position information between different navigation tracks in the port entering stage and the port leaving stage is close, the distribution of position information between different navigation tracks in the navigation stage is scattered and is not close, and the longitude and latitude discrete factors of the initial cluster at each moment reflect the degree of the discrete distribution of the position between different navigation tracks in the same cluster at each moment, so that the preliminary time division can be carried out through the discrete features of the distribution of the position information at different moments.

For any one initial cluster, classifying all the moments by using a clustering algorithm based on the difference distance between longitude and latitude discrete factors of every two moments, and in the embodiment, clustering all the moments under each initial cluster by using a K-means clustering algorithm, wherein the number of clustering categories is 2.

It should be noted that the number of cluster categories is 2 because the port entering and port leaving stages can be regarded as periods in which the position information is relatively closely distributed, whereas the navigation stage can be regarded as a period in which the position information is relatively dispersed. The distance measure used for clustering can be the Euclidean distance between the longitude and latitude discrete factors of two moments, which reflects the difference distance between the longitude and latitude discrete factors.
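A minimal sketch of this two-category division is given below. It runs K-means with two categories on the scalar longitude and latitude discrete factors; initializing the two centers at the smallest and largest factor is an assumption of the example, and the factor values are hypothetical.

```python
def two_means(values, iters=100):
    """Split scalars into two categories with Lloyd's algorithm (k = 2)."""
    lo, hi = min(values), max(values)    # initial centers: the extremes (an assumption)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, l in zip(values, labels) if l == 0]
        group1 = [v for v, l in zip(values, labels) if l == 1]
        new_lo = sum(group0) / len(group0)
        new_hi = sum(group1) / len(group1) if group1 else hi
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return labels

# Hypothetical discrete factors of one initial cluster, one value per moment
factors = [0.01, 0.02, 0.03, 0.90, 1.10, 1.00, 0.95, 0.04, 0.02, 0.01]
time_class = two_means(factors)
```

Here label 0 collects the moments with closely distributed positions and label 1 the moments with dispersed positions; the label numbering itself carries no meaning in the method.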

So far, each initial cluster performs the time class division operation, and as the maximum value of the time length of the navigation track contained in each initial cluster may have a certain difference, the division result of the time class corresponding to each initial cluster may be the same or different.

And step S203, screening the time category based on the characteristic distribution of longitude and latitude discrete factors at all times in the time category, and dividing all times according to the screening result to obtain the navigation time interval of the initial cluster.

For any one initial cluster, one of the two time categories is the time belonging to the sailing stage, and the other is the time belonging to the entering stage and the leaving stage, so that the time category is screened based on the characteristic distribution of longitude and latitude discrete factors at all moments in the time category to distinguish the sailing characteristic attributes of the two time categories.

Specifically, a time class corresponding to the minimum value of the mean value of longitude and latitude discrete factors at all times in the time class is obtained and is recorded as a first characteristic class. Because the distribution of the position information in the navigation track in the port entering stage and the port leaving stage is relatively close, the corresponding value of the longitude and latitude discrete factors is smaller, the overall distribution of the position information discrete condition in the time class can be reflected by calculating the average value of the longitude and latitude discrete factors at all moments in the time class for any one time class, and the class with the smaller average value of the longitude and latitude discrete factors in the two time classes can be regarded as the time class for gathering the position information corresponding to the port entering stage and the port leaving stage. That is, the first feature class characterizes the time period in which the approach and departure phases of all the travel tracks in the initial cluster are located.

The moments in the first feature class are classified based on the difference distance between every two moments to obtain two subcategories. The first feature class contains the time periods of two stages, namely the port leaving stage and the port entering stage, and the time distribution difference between the moments allows it to be divided into the classes corresponding to these two stages. The moments in the subcategory belonging to the port leaving stage have smaller time sequence values, and the moments in the subcategory belonging to the port entering stage have larger time sequence values.

It should be noted that, in this embodiment, the K-means clustering algorithm is used to divide all the moments in the first feature class, and the measurement distance between moments may be the difference in time order between every two moments; for example, the difference in time order between the 1st moment and the 5th moment is 4.

And dividing all the moments based on the average value moment of all the moments in each subcategory to obtain the navigation time interval of the initial cluster. For any one subcategory, calculating the time sequence average value of all the moments of the subcategory to obtain average value moments, taking two average value moments corresponding to the two subcategories as the segmentation points of the moments respectively, and carrying out time period division operation on all the moments to obtain three time intervals.

For example, for the navigation time distribution of any one initial cluster, the two mean moments can be expressed as $t_1$ and $t_2$ with $t_1 < t_2$, and the three time intervals may be represented as $[1, t_1]$, $[t_1, t_2]$ and $[t_2, T]$, where $T$ represents the maximum of the time lengths of all navigation tracks in the initial cluster. Time interval $[t_1, t_2]$ can be approximately regarded as the sailing stage, time interval $[1, t_1]$ as the port leaving stage, and time interval $[t_2, T]$ as the port entering stage.

It should be noted that, in this embodiment, each time interval includes both endpoint moments. In other embodiments, an implementer may set the time intervals according to the specific implementation scenario, and each interval may include only one endpoint moment; for example, the three time intervals may be expressed as $[1, t_1)$, $[t_1, t_2)$ and $[t_2, T]$.
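The division of the moments into three navigation time intervals can be sketched as follows. The sketch is illustrative only: it assumes that moments are numbered from 1, that the first feature class is given as a list of moment indices, and that the mean moments are used without rounding. The names `two_means_split` and `navigation_intervals`, and the sample data, are hypothetical.

```python
def two_means_split(moments, max_iter=100):
    """Split moment indices into an early and a late group (K-means, K=2)."""
    if len(set(moments)) < 2:
        raise ValueError("need at least two distinct moments")
    c0, c1 = float(min(moments)), float(max(moments))
    early, late = [], []
    for _ in range(max_iter):
        # The measurement distance is the difference in time order.
        early = [t for t in moments if abs(t - c0) <= abs(t - c1)]
        late = [t for t in moments if abs(t - c0) > abs(t - c1)]
        n0 = sum(early) / len(early)
        n1 = sum(late) / len(late)
        if (n0, n1) == (c0, c1):
            break
        c0, c1 = n0, n1
    return early, late


def navigation_intervals(first_class_moments, total_length):
    """Return the three intervals as (start, end) pairs in time order."""
    early, late = two_means_split(first_class_moments)
    t1 = sum(early) / len(early)   # mean moment of the port leaving subcategory
    t2 = sum(late) / len(late)     # mean moment of the port entering subcategory
    return [(1, t1), (t1, t2), (t2, total_length)]


# Hypothetical first feature class of a cluster whose longest track has 20 moments.
intervals = navigation_intervals([1, 2, 3, 4, 17, 18, 19, 20], total_length=20)
print(intervals)  # -> [(1, 2.5), (2.5, 18.5), (18.5, 20)]
```

Whether an endpoint moment belongs to one or both neighbouring intervals is left to the implementer, as the description allows.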

Step S300, obtaining the characteristic occupation coefficient of the navigation time interval of each initial cluster in each dimension according to the abnormal distribution condition of the difference distance between the characteristic data of the navigation track in each dimension in the navigation time interval of each initial cluster and the fluctuation condition of the longitude and latitude data of each moment.

Firstly, since the feature distribution of the navigation tracks of different ships differs between initial clusters, the ship navigation tracks in each initial cluster need to be analyzed separately when the multidimensional data are weighted and integrated. If the difference distributions of the feature data in a certain dimension are highly similar and strongly associated, setting an excessive weight for that dimension may prevent the feature distribution of the feature data in the other dimensions from being reflected well, so that the clustering effect is poor.

Meanwhile, since different navigation time intervals are adaptively divided in each initial cluster, the attention degree of the feature data of different dimensions may differ between navigation time intervals. For example, the position distribution information in the port entering and port leaving stages is necessarily relatively close, so the attention degree of the data dimension containing the position information is smaller, and the importance and attention degree of the other data dimensions are larger. Based on this, each navigation time interval in each initial cluster is taken as an analysis unit, and the attention degree of the feature data in each dimension, that is, the feature occupation coefficient in each dimension, is determined by combining the difference distance distribution of the feature data in each dimension with the position distribution information of the navigation tracks, that is, the fluctuation of the longitude and latitude data. This can be implemented by steps S301 to S304 shown in fig. 3.

Step S301, obtaining data association indexes of each dimension according to abnormal distribution conditions of difference distances among all feature data of each dimension; and obtaining the data association degree of the navigation time interval of each initial cluster in each dimension according to the abnormal distribution condition of the difference distance between the characteristic data in each dimension in the navigation time interval of each initial cluster.

Firstly, analyzing the distribution relevance of the difference distance between the characteristic data in each dimension, and further analyzing the distribution relevance specific to each dimension in the navigation time interval in each initial cluster, so that the difference condition between the whole relevance and the local relevance can be reflected.

First, an arbitrary dimension is taken as an example. For any dimension, anomaly detection is performed on the difference distances between every two feature data in the dimension. In this embodiment, the LOF anomaly detection algorithm is applied to these difference distances, with the Euclidean distance between two feature data taken as the difference distance. If there are M ship navigation tracks, the total number of data for anomaly detection in any dimension is $M(M-1)/2$.
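For illustration, the set of difference distances that is passed to the anomaly detection can be built as sketched below. The sketch assumes one scalar feature value per navigation track, for which the Euclidean distance is the absolute difference. The LOF step itself is not reproduced here, and the function name and sample values are hypothetical.

```python
from itertools import combinations


def pairwise_difference_distances(features):
    """Difference distance between every two feature values (one per track)."""
    return [abs(a - b) for a, b in combinations(features, 2)]


# Hypothetical track lengths of M = 4 navigation tracks.
track_lengths = [10.0, 12.0, 15.0, 40.0]
distances = pairwise_difference_distances(track_lengths)
print(distances)       # -> [2.0, 5.0, 30.0, 3.0, 28.0, 25.0]
print(len(distances))  # -> 6, i.e. 4 * 3 / 2 pairs
```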

Determining a first duty ratio coefficient based on the number duty ratio of the abnormal data in the abnormal detection result; determining a first discrete coefficient based on the degree of dispersion of the difference distance between every two feature data in the dimension; obtaining a data association index of the dimension according to the first duty ratio coefficient and the first discrete coefficient; the first duty ratio coefficient and the first discrete coefficient are in negative correlation with the data association index. Wherein the negative correlation relationship may be a reciprocal relationship, a negative exponent power relationship, or the like. The degree of dispersion may be a dispersion coefficient such as a variance, a standard deviation, a mean of absolute values of the differences, and the like.

In this embodiment, taking the ith dimension as an example for explanation, the calculation formula of the data association index in the ith dimension may be expressed as:

$R_i = \exp\left(-\frac{n_i}{N_i}\right) \times \exp\left(-\mathrm{norm}\left(\sigma_i^2\right)\right)$;

Wherein, $R_i$ represents the data association index in the ith dimension, $n_i$ represents the total number of outlier data in the ith dimension, $N_i$ represents the total number of difference distances in the ith dimension, $\sigma_i^2$ represents the variance of all difference distances in the ith dimension, norm() represents the linear normalization function, and exp() represents the exponential function based on the natural constant e.

The first duty ratio coefficient reflects the duty ratio condition of the difference distance of the abnormality in the ith dimension, the larger the value is, the more abnormal data quantity is indicated, more independent and discrete ship data are indicated in the current dimension, and the smaller the value of the corresponding data association index is through negative correlation mapping, the smaller the association between the current dimension data is indicated.

The first discrete coefficient reflects the discrete condition and fluctuation condition of all difference distances in the ith dimension, and the larger the value is, the larger the difference among all feature data in the current dimension is, the larger the discrete degree of the difference distance is, and the smaller the value of the corresponding data association index is after negative correlation mapping, the smaller the association among the difference distances of the data in the current dimension is.

The data association index in each dimension reflects the association condition between the data of the whole dimension in terms of abnormal distribution and discrete distribution of the difference distance of the whole data in the current dimension.
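A minimal sketch of how a data association index could be computed per dimension is given below. It is one possible instantiation only: it assumes that both negative correlations are realised as negative exponent powers and multiplied together, that the variance is min-max normalized across the dimensions, and that the outlier flags come from a separate anomaly detection step. All names and sample values are hypothetical.

```python
import math
from statistics import pvariance


def min_max(values):
    """Linear (min-max) normalization to [0, 1]; constant input maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def association_indices(distances_by_dim, outlier_flags_by_dim):
    """One association index per dimension (assumed multiplicative form)."""
    # First duty ratio coefficient: share of difference distances flagged abnormal.
    ratios = [sum(flags) / len(flags) for flags in outlier_flags_by_dim]
    # First discrete coefficient: variance of the difference distances,
    # normalized across dimensions.
    norm_vars = min_max([pvariance(d) for d in distances_by_dim])
    # Both coefficients are mapped with a negative exponent, so a larger
    # coefficient yields a smaller index.
    return [math.exp(-r) * math.exp(-v) for r, v in zip(ratios, norm_vars)]


# Two hypothetical dimensions: a tight one without outliers, a spread one with one.
distances = [[1.0, 1.0, 1.0, 1.0], [1.0, 5.0, 9.0, 20.0]]
flags = [[0, 0, 0, 0], [0, 0, 0, 1]]
print(association_indices(distances, flags))
```

The same routine could be applied to the difference distances of a single navigation time interval to obtain a local association measure, again under the stated assumptions.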

Further, for each initial cluster, the local time interval of adaptive division is analyzed, and the local relevance condition between the data under each dimension is analyzed.

Specifically, for any one navigation time interval of any one initial cluster, acquiring a second duty ratio coefficient based on the quantity duty ratio of abnormal data belonging to the navigation time interval of the initial cluster in an abnormal detection result corresponding to any one dimension, and determining a second discrete coefficient based on the discrete degree of the difference distance between every two feature data of the navigation time interval of the initial cluster in the any one dimension; and obtaining the data association degree of the navigation time interval of the initial cluster under any dimension according to the second duty ratio coefficient and the second discrete coefficient, wherein the second duty ratio coefficient and the second discrete coefficient are in negative correlation with the data association degree.

According to the same calculation method as the data association index in the ith dimension, the calculation formula of the data association degree of the kth navigation time interval in the ith dimension in any one initial cluster can be expressed as follows:

$G_{k,i} = \exp\left(-\frac{n_{k,i}}{N_i}\right) \times \exp\left(-\mathrm{norm}\left(\sigma_{k,i}^2\right)\right)$;

Wherein, $G_{k,i}$ represents the data association degree of the kth navigation time interval in the initial cluster in the ith dimension, $n_{k,i}$ represents the number of outlier data contained in the kth navigation time interval in the initial cluster in the ith dimension, $N_i$ represents the total number of difference distances in the ith dimension, $\sigma_{k,i}^2$ represents the variance of all difference distances of the kth navigation time interval in the initial cluster in the ith dimension, norm() represents the linear normalization function, and exp() represents the exponential function based on the natural constant e.

The second duty ratio coefficient reflects the duty ratio of abnormal data in a local time period, namely the duty ratio of abnormal data in the difference data in a single navigation time interval, the larger the value is, the more abnormal data in the current dimension in the navigation time interval is, the smaller the value of the corresponding data association degree is through negative correlation mapping, and the smaller the data association of the current dimension in the navigation time interval is.

The second discrete coefficient reflects fluctuation conditions and discrete conditions of the difference distance in the local time period, and the larger the value is, the larger the fluctuation of the data of the difference distance in the navigation time period is, the smaller the value of the corresponding data association degree is through negative correlation mapping, and the smaller the data association of the current dimension in the navigation time period is.

For the navigation time interval of the initial cluster, the data association degree reflects the local association condition of a single dimension in terms of abnormal distribution and discrete distribution of the single dimension data in a local time range.

Step S302, for any one initial cluster, determining the data importance degree of each navigation time interval of the initial cluster under the selected dimension by combining the data association index of the selected dimension based on the difference distribution condition of the longitude and latitude discrete factors of the navigation time interval.

Considering that the characteristic data of the navigation track in one dimension comprises Euclidean distance from the starting point to the end point of the navigation track, and the position information of different navigation tracks is relatively close in a time period corresponding to the port entering stage and the port leaving stage in the whole navigation time of the navigation track, if the characteristic data of the dimension where the position information is located is provided with a relatively large weight, the more important data difference condition of other dimensions can not be reflected well, and therefore, the attention condition of the dimension can be properly reduced through data characteristic analysis.

Firstly, the data dimension corresponding to the Euclidean distance between the starting point and the end point of the navigation track in the navigation process is marked as a selected dimension, and then other dimensions except the selected dimension are marked as characteristic dimensions.

In the port entering stage and the port leaving stage, the distribution among the data in the selected dimension in the same initial cluster is relatively close, and in the navigation stage, the distribution among the data in the selected dimension in the same initial cluster can be relatively discrete, so that under the selected dimension, the feature analysis of the importance degree and the attention degree of the data in different navigation time intervals is required to be carried out respectively.

Specifically, for any one initial cluster, all the navigation time intervals are respectively marked as a first time interval, a second time interval and a third time interval in time order; that is, the navigation time intervals are, in time order, the port leaving stage, the sailing stage and the port entering stage.

For the sailing stage, the position information of the navigation tracks of different ships may be either more dispersed or closer, so the data feature distribution of this stage in the selected dimension does not need to be adjusted. For the second time interval, the data importance degree in the selected dimension is determined from the normalized value of the data association index in the selected dimension; that is, the data association index in the selected dimension is normalized to obtain the data importance degree of the second time interval in the selected dimension. Min-max normalization may be used.

The importance of the data distribution in the second time interval is thus characterized by the association distribution of the overall data in the selected dimension: the larger the association among the overall data, the more important the distribution of the data in the dimension, and the more attention needs to be given to the feature data of the dimension.

For the port entering stage and the port leaving stage, the distribution of the position information among the navigation tracks of different ships is relatively close, and the characteristic data of the selected dimension is required to be given smaller weight through characteristic analysis, so that important data characteristics of other dimensions can be focused more.

Specifically, the average value of the longitude and latitude discrete factors at all moments in the first time interval and the third time interval is taken as a first characteristic factor, and the average value of the longitude and latitude discrete factors at all moments in the second time interval is taken as a second characteristic factor. For either of the first time interval and the third time interval, the data importance degree of that interval in the selected dimension is determined based on the negative correlation coefficient of the difference between the first characteristic factor and the second characteristic factor and the normalized value of the data association index of the selected dimension. The data importance degree is a normalized value.

As a specific example, the negative correlation coefficient of the difference between the first and second characteristic factors in the selected dimension for the first time interval of any one initial cluster may be expressed as $\frac{1}{\left|F_1 - F_2\right| + \varepsilon}$, wherein $F_1$ represents the first characteristic factor, $F_2$ represents the second characteristic factor, and $\varepsilon$ is a preset hyperparameter that prevents the denominator from being 0, with a value of 0.1 or 0.01.

The first characteristic factor reflects the dispersion of the position information in the two stages in which the position information is closely distributed, and the second characteristic factor reflects the dispersion of the position information in the one stage in which the position information fluctuates more. $\left|F_1 - F_2\right|$, the difference between the first characteristic factor and the second characteristic factor, reflects the difference between the degree of dispersion of the position information in the port leaving or port entering stage and that in the second time interval. The larger the difference, the larger the difference in position information between the port leaving or port entering stage and the sailing stage, and the more concentrated the position information distribution corresponding to the first time interval and the third time interval. The data importance degree of the first time interval and the third time interval in the selected dimension is therefore set smaller, and the mapping is carried out by a negative correlation method so that the importance of the data in the selected dimension is smaller.

Each navigation time interval in the initial cluster is in a selected dimension, different data importance degrees are determined in a self-adaptive mode through unique data feature distribution of different stages, and the data importance and the data attention degree conditions in the selected dimension can be fully represented.
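By way of a hedged example, the importance score of a port stage in the selected dimension could be computed as sketched below. The sketch assumes a reciprocal form for the negative correlation coefficient with a small constant in the denominator, takes the second characteristic factor as the mean discrete factor of the sailing stage, combines the coefficient with the normalized association index by multiplication, and leaves out the final normalization of the score. The function names and sample values are hypothetical.

```python
from statistics import mean


def negative_correlation_coefficient(f1, f2, eps=0.1):
    """Reciprocal mapping: a larger gap between the factors gives a smaller value."""
    return 1.0 / (abs(f1 - f2) + eps)


def port_stage_score(disp_first, disp_second, disp_third, norm_index, eps=0.1):
    """Unnormalized importance score of a port stage in the selected dimension."""
    # First characteristic factor: mean discrete factor over both port stages.
    f1 = mean(disp_first + disp_third)
    # Second characteristic factor: mean discrete factor over the sailing stage
    # (assumption of this sketch).
    f2 = mean(disp_second)
    return negative_correlation_coefficient(f1, f2, eps) * norm_index


# Hypothetical discrete factors per interval and a normalized association index.
score = port_stage_score([0.1, 0.2], [2.0, 2.4], [0.2, 0.1], norm_index=0.8)
print(round(score, 4))  # -> 0.3721
```

The score shrinks as the port stages become more concentrated relative to the sailing stage, which matches the intent of giving the selected dimension a smaller weight in those stages.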

Step S303, for any one of the voyage time intervals, determining the data importance degree of the voyage time interval in each feature dimension based on the negative correlation coefficient of the data importance degree corresponding to the voyage time interval, in combination with the negative correlation mapping value of the difference between the data association degree of the voyage time interval in each feature dimension and the data association index of the same feature dimension.

Because the feature data of the selected dimension has special aggregated and dispersed distribution characteristics, step S302 analyzes the importance and attention degree of the feature data of the selected dimension separately. The feature data of the other dimensions does not exhibit such specific distribution characteristics, so a data importance analysis is further performed for each dimension other than the selected dimension. Meanwhile, the smaller the data importance in the selected dimension, the more attention should be paid to the data in the other dimensions; the data importance in each feature dimension is therefore determined by applying a negative correlation mapping to the data importance in the selected dimension and combining it with the difference distribution in several respects.

Specifically, for any one voyage time interval, the negative correlation mapping value of the difference between the data association degree and the data association index and the negative correlation coefficient of the data importance degree corresponding to the voyage time interval can be combined in a multiplication or addition mode under any one characteristic dimension.

In this embodiment, as a specific example, the method for obtaining the data importance level of the kth navigation time interval of the initial cluster in the mth feature dimension may be expressed by a formula as follows:

$Z_{k,m} = \exp\left(-\left|G_{k,m} - R_m\right|\right) \times \exp\left(-Z_k\right)$;

Wherein, $Z_{k,m}$ represents the data importance degree of the kth navigation time interval in the initial cluster in the mth feature dimension, $G_{k,m}$ represents the data association degree of the kth navigation time interval in the initial cluster in the mth feature dimension, $R_m$ represents the data association index in the mth feature dimension, $Z_k$ represents the data importance degree of the kth navigation time interval in the initial cluster in the selected dimension, and exp() represents the exponential function based on the natural constant e.

$\exp\left(-Z_k\right)$ is the negative correlation coefficient of the data importance degree of the navigation time interval in the selected dimension: the greater the data importance in the selected dimension, the less attention is paid to the data in the other dimensions, and the smaller the data importance in the selected dimension, the greater the attention paid to the data in the other dimensions.

$\left|G_{k,m} - R_m\right|$ is the difference between the local data association degree of the navigation time interval and the overall data association index in the current feature dimension, and reflects the difference between the data association in the local time range and that of the dimension as a whole. The larger the value, the larger the data difference in the current dimension within the current local time range and the smaller the corresponding importance, that is, the smaller the attention; negative correlation mapping is carried out in the form of a negative exponent power, so the resulting data importance degree is smaller.
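The combination described for the feature dimensions can be sketched as follows. This is an assumption-laden illustration: both negative correlations are taken as negative exponent powers and combined by multiplication, which is one of the combinations the description permits. The function name and the numeric inputs are hypothetical.

```python
import math


def feature_dim_importance(assoc_degree, assoc_index, selected_importance):
    """Importance of one navigation time interval in one feature dimension.

    assoc_degree        -- local association degree of the interval in the dimension
    assoc_index         -- overall association index of the same dimension
    selected_importance -- importance of the interval in the selected dimension
    """
    # A larger local/global gap lowers the importance (negative exponent).
    gap_term = math.exp(-abs(assoc_degree - assoc_index))
    # A larger importance of the selected dimension lowers it as well
    # (assumed mapping for this sketch).
    selected_term = math.exp(-selected_importance)
    return gap_term * selected_term


print(feature_dim_importance(0.6, 0.5, 0.2))
print(feature_dim_importance(0.6, 0.5, 0.9))  # smaller: selected dimension matters more
```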

And step S304, determining the characteristic occupation coefficient of each navigation time interval of the initial cluster in each dimension based on the data importance degree of each navigation time interval in each dimension, wherein the characteristic occupation coefficient is a normalized value.

The data importance degree of each local time range in each dimension is quantified in a self-adaptive manner through the data feature distribution conditions in different dimensions, so that the degree and importance condition of the data needing to be concerned in each navigation time interval can be fully and accurately represented. And further, normalizing the importance degree of the data in each navigation time interval in each initial cluster under each dimension to obtain a corresponding feature occupancy coefficient, wherein the normalization method can adopt a maximum value and minimum value normalization method or a ratio method for processing, and is not limited. It should be noted that, the normalization processing operation may not be performed for the degree of importance of the data that is already the normalized value.

The feature occupation coefficient of each navigation time interval in each initial cluster in each dimension represents the importance of the data in the corresponding dimension: the larger the coefficient, the higher the attention paid to the data of that dimension, and the larger the weight that can be given to the data of that dimension for feature fusion.
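The two normalization options mentioned above can be sketched as follows. The sketch assumes that the normalization is taken over the dimensions within one navigation time interval; the description does not fix the set over which the values are normalized. The function names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Maximum-minimum normalization to [0, 1]; constant input maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def ratio_normalize(values):
    """Ratio method: each value divided by the sum, so the weights add up to 1."""
    total = sum(values)
    return [v / total for v in values]


# Hypothetical importance degrees of one interval in several dimensions.
print(ratio_normalize([1.0, 3.0]))          # -> [0.25, 0.75]
print(min_max_normalize([2.0, 4.0, 6.0]))   # -> [0.0, 0.5, 1.0]
```

The ratio method keeps every dimension's weight positive, whereas min-max normalization assigns a weight of 0 to the least important dimension; which behaviour is preferable depends on the implementation scenario.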

And S400, comprehensively evaluating the difference distances of the characteristic data of the navigation tracks in all dimensions in the navigation time interval of each initial cluster by utilizing the characteristic occupation coefficient, and clustering the navigation tracks of all ships according to the evaluation result to obtain a clustering result of the navigation tracks.

The feature occupation coefficient of each navigation time interval of the initial cluster under each dimension can reflect the importance degree of the navigation data features of the corresponding types of the dimensions, and represents the occupation degree of each dimension in comprehensive navigation data and the attention degree of each dimension in the global data range. Based on the method, the characteristic occupation coefficient corresponding to each dimension can be utilized, the difference distance between the characteristic data of each navigation track in all dimensions is integrated, the measurement distance for clustering among the navigation tracks of different ships is adjusted, the data difference conditions of a plurality of dimensions can be effectively and comprehensively considered, and the final clustering effect is good.

Firstly, the data difference distance between every two different navigation tracks of the ship in each dimension needs to be obtained, and the characteristic occupation coefficients are further utilized to integrate the data difference distances in all dimensions. Specifically, for any one dimension, the difference distance between the feature data of each two navigation tracks in the dimension is taken as the initial difference distance of each two navigation tracks in the dimension. The difference distance may be measured by using euclidean distance, or by calculating the absolute value of the difference.

It should be noted that, in this embodiment, the feature data of the heading dimension and the navigational speed dimension include a data value at each moment of the whole time length of the navigation track, so that, for any two navigation tracks, there is an initial difference distance at each moment in these two dimensions. In the two dimensions of track length and distance between the start point and the end point of the track, there is only one initial difference distance between any two navigation tracks.

And then, weighting each initial difference distance under each dimension by utilizing the characteristic occupation coefficient for any two navigation tracks to obtain the comprehensive difference distance of the any two navigation tracks, wherein the comprehensive difference distance represents the comprehensive evaluation result of the difference distances of the characteristic data of the any two navigation tracks under all dimensions.

It should be noted that each navigation time interval of each initial cluster has its own feature occupation coefficient in each dimension, and the weighted summation must use, for each initial difference distance, the feature occupation coefficient of the navigation time interval in which the corresponding feature data is located.

Specifically, in any one dimension of the heading and the navigational speed, taking any two navigational tracks in any one initial cluster as an example for explanation, the weighted difference distance of the two navigational tracks in the initial cluster in the dimension can be expressed as:

$D_{a,b} = \sum_{t=1}^{T} w_t \times d_{a,b}^{t}$;

Wherein, $D_{a,b}$ represents the weighted difference distance of navigation track a and navigation track b in the initial cluster in the heading dimension, $T$ represents the maximum of the navigation time lengths of navigation track a and navigation track b in the initial cluster, $d_{a,b}^{t}$ represents the initial difference distance of navigation track a and navigation track b in the initial cluster at the t-th moment in the heading dimension, and $w_t$ represents the feature occupation coefficient, in the heading dimension, of the navigation time interval in which the t-th moment is located.

In the weighted difference distance, the feature occupation coefficient corresponding to each initial difference distance is used as the weight of the data in the heading dimension, and the initial difference distances are weighted and summed to obtain the comprehensive evaluation distance in the current dimension.

It should be noted that the navigation time lengths of navigation track a and navigation track b may differ. The feature data are compared over the larger of the two time lengths; at moments where one of the navigation tracks has no navigation data, the difference distance between the two tracks can still be measured, and the resulting distance may be larger.
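The weighted summation over the moments can be sketched as follows. The sketch assumes that the initial difference distance at every moment has already been computed, that moments are numbered from 1, and that a boundary moment is assigned to the earlier interval. The names `interval_index` and `weighted_difference_distance`, the boundaries and the coefficients are hypothetical.

```python
def interval_index(t, t1, t2):
    """Index (0, 1 or 2) of the navigation time interval containing moment t."""
    if t <= t1:
        return 0
    if t <= t2:
        return 1
    return 2


def weighted_difference_distance(initial_distances, coeffs, t1, t2):
    """Sum of per-moment distances, each weighted by its interval's coefficient.

    initial_distances -- distance between two tracks at moments 1..T
    coeffs            -- feature occupation coefficients of the three intervals
    t1, t2            -- boundary moments separating the three intervals
    """
    total = 0.0
    for t, d in enumerate(initial_distances, start=1):
        total += coeffs[interval_index(t, t1, t2)] * d
    return total


# Hypothetical heading differences over T = 6 moments.
result = weighted_difference_distance(
    [1.0, 1.0, 2.0, 2.0, 1.0, 1.0], coeffs=[0.5, 1.0, 0.25], t1=2, t2=4
)
print(result)  # -> 5.5
```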

For two dimensions of track length of navigation tracks and distance between track start point and end point, for two navigation tracks in an initial cluster, only one initial difference distance is included, and a plurality of different navigation time intervals have different characteristic occupation coefficients in the two dimensions, in this embodiment, an average method is adopted to obtain overall evaluation values of characteristic occupation coefficients of all navigation time intervals, and in other embodiments, a median method, a maximum value method and the like can also be adopted as weighting weights in the two dimensions to represent overall distribution levels of characteristic occupation coefficients of the initial cluster in the corresponding dimensions.

Specifically, in this embodiment, for a track length dimension, a mean value of feature occupation coefficients of all navigation time intervals in an initial cluster is used as a weighted weight of the initial cluster in the track length dimension, and then for any two navigation tracks in the initial cluster, a product between the weighted weight and a corresponding initial difference distance is obtained, so that a weighted difference distance between the two navigation tracks in the dimension can be obtained.

Further, for the navigation track a and the navigation track b in the initial cluster, the comprehensive difference distance between the navigation track a and the navigation track b can be obtained by integrating the weighted difference distances in all dimensions, and the embodiment adopts a summation mode for integration.
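The handling of the single-valued dimensions and the final summation can be sketched as follows. The sketch follows the averaging and summation choices of this embodiment; the function names, dimension keys and numbers are hypothetical.

```python
from statistics import mean


def scalar_dim_weighted_distance(initial_distance, interval_coeffs):
    """Weighted distance for a dimension with one value per track pair.

    The weight is the mean of the feature occupation coefficients of all
    navigation time intervals in that dimension.
    """
    return mean(interval_coeffs) * initial_distance


def comprehensive_distance(weighted_by_dim):
    """Sum the weighted difference distances of all dimensions."""
    return sum(weighted_by_dim.values())


# Hypothetical weighted distances of one track pair.
weighted = {
    "heading": 5.5,
    "speed": 1.5,
    "track_length": scalar_dim_weighted_distance(4.0, [0.25, 0.5, 0.75]),
    "start_end_distance": 1.0,
}
print(comprehensive_distance(weighted))  # -> 10.0
```

A median or maximum of the coefficients could be substituted for the mean in `scalar_dim_weighted_distance` without changing the rest of the computation.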

In other embodiments, considering that the longitude and latitude information of the navigation track also corresponds to a data value at each moment in the navigation process, the difference distance of the longitude and latitude information between every two navigation tracks can be additionally considered besides the characteristic data of each dimension, the characteristic occupation coefficient of the navigation time interval at the corresponding moment is adopted as the weight to carry out weighted summation, the corresponding weighted difference distance is obtained, and the final clustering measurement distance is comprehensively evaluated.

Finally, the comprehensive difference distance between every two navigation tracks in each initial cluster is obtained and used as the clustering measurement distance between the navigation tracks of different ships, and a clustering algorithm is applied to re-cluster the navigation tracks. This embodiment uses the DBSCAN clustering algorithm, with the neighborhood radius parameter ε set to 0.1 and the MinPts parameter set to 6, to obtain the final clustering result of the navigation tracks.
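The re-clustering step can be sketched as DBSCAN run directly on a precomputed matrix of comprehensive difference distances. The code below is a plain-Python illustration rather than the implementation used in the embodiment; the function name and the toy matrix are hypothetical, while `eps=0.1` and `min_pts=6` are the values stated above. A point counts itself among its neighbours, which is one common convention.

```python
def dbscan_precomputed(dist, eps=0.1, min_pts=6):
    """Cluster items from a symmetric distance matrix; returns one label per item, -1 for noise."""
    n = len(dist)
    noise = -1
    labels = [None] * n  # None marks an item that has not been visited yet

    def neighbours(i):
        # Every item within eps of item i, including i itself.
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


# Toy matrix: six mutually close navigation tracks and one distant track.
n_tracks = 7
dist = [
    [0.0 if i == j else (0.05 if i < 6 and j < 6 else 1.0) for j in range(n_tracks)]
    for i in range(n_tracks)
]
labels = dbscan_precomputed(dist, eps=0.1, min_pts=6)
```

In this toy case the six close tracks form one cluster and the distant track is labelled as noise. Library implementations such as scikit-learn's `DBSCAN` with `metric="precomputed"` accept the same kind of matrix.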

Therefore, the invention adaptively sets a weight for the distance measurement feature in each dimension according to the distribution characteristics of the navigation tracks of different ships, which yields more accurate combined features and hence a more accurate clustering result.

An embodiment of a ship data real-time intelligent analysis system based on data acquisition:

Fig. 4 is a block diagram of a ship data real-time intelligent analysis system based on data acquisition, which includes:

The data preprocessing module is used for acquiring the navigation track of each ship, and determining the characteristic data of the navigation track of each ship under each dimension and the longitude and latitude data of each moment based on the characteristic distribution of different attribute data of the navigation track of each ship;

The interval dividing module is used for clustering the navigation tracks to obtain an initial cluster based on the distance distribution between the starting point position and the end point position of the navigation track of each ship; dividing the time according to the fluctuation degree of longitude and latitude data of each navigation track in each initial cluster at each time to obtain a navigation time interval of each initial cluster;

The characteristic analysis module is used for obtaining the characteristic occupation coefficient of the navigation time interval of each initial cluster in each dimension according to the abnormal distribution condition of the difference distance between the characteristic data of the navigation track in each dimension in the navigation time interval of each initial cluster and the fluctuation condition of the longitude and latitude data of each moment;

and the evaluation clustering module is used for comprehensively evaluating the difference distances of the characteristic data of the navigation tracks in all dimensions in the navigation time interval of each initial cluster by utilizing the characteristic occupation coefficient, and clustering the navigation tracks of all ships according to the evaluation result to obtain a clustering result of the navigation tracks.

In other embodiments, an apparatus is also provided that includes a memory and a processor. The memory is used for storing executable program codes, and the processor is used for calling and running the executable program codes from the memory, so that the device executes the ship data real-time intelligent analysis method based on data acquisition. The apparatus may be embodied as a chip, component or module, which may include a processor and memory coupled together; the memory is used for storing instructions, and when the processor calls and executes the instructions, the chip can be enabled to execute the ship data real-time intelligent analysis method based on data acquisition.

In other embodiments, a computer program product is also provided which, when run on a computer, causes the computer to perform the related steps described above so as to implement the ship data real-time intelligent analysis method based on data acquisition provided in the above embodiments.

In other embodiments, a computer readable storage medium is also provided in which computer program code is stored; when the code is run on a computer, it causes the computer to perform the related method steps described above so as to implement the ship data real-time intelligent analysis method based on data acquisition provided in the above embodiments.

The system, the apparatus, the computer program product, and the computer readable storage medium are all configured to perform the corresponding methods provided above. For their advantages, reference may therefore be made to the advantages of the corresponding methods provided above, which are not repeated here.

The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application and are intended to be included within the scope of the application.

Claims

1. The ship data real-time intelligent analysis method based on data acquisition is characterized by comprising the following steps of:

Comprehensively evaluating the difference distances of the characteristic data of the navigation tracks in all dimensions in the navigation time interval of each initial cluster by utilizing the characteristic occupation coefficient, and clustering the navigation tracks of all ships according to the evaluation result to obtain a clustering result of the navigation tracks;

The method for acquiring the characteristic occupation coefficient comprises the following steps:

determining a characteristic occupation coefficient of each navigation time interval of the initial cluster under each dimension based on the data importance degree of each navigation time interval under each dimension, wherein the characteristic occupation coefficient is a normalized value;

Dividing the time according to the fluctuation degree of longitude and latitude data of each navigation track in each initial cluster at each time to obtain a navigation time interval of each initial cluster, wherein the method specifically comprises the following steps:

Screening the time category based on the characteristic distribution of longitude and latitude discrete factors at all times in the time category, and dividing all times according to screening results to obtain the navigation time interval of the initial cluster;

The characteristic data of the navigation track of each ship in each dimension comprises characteristic data corresponding to the heading, the speed, the track length and the distance between the starting point and the ending point of the track.

2. The method for real-time intelligent analysis of ship data based on data acquisition according to claim 1, wherein the method is characterized by screening time categories based on feature distribution of longitude and latitude discrete factors at all times in the time categories, and dividing all times according to screening results to obtain navigation time intervals of initial clusters, and specifically comprises the following steps:

3. The method for intelligently analyzing ship data in real time based on data acquisition according to claim 1, wherein the method is characterized in that the data association index of each dimension is obtained according to abnormal distribution conditions of difference distances between all feature data in each dimension, and specifically comprises the following steps:

4. The method for intelligently analyzing ship data in real time based on data acquisition according to claim 3, wherein the determining the importance degree of the data of each navigation time interval of the initial cluster under the selected dimension by combining the data association index of the selected dimension based on the difference distribution condition of longitude and latitude discrete factors of the navigation time interval specifically comprises the following steps:

5. The method for real-time intelligent analysis of ship data based on data acquisition according to claim 1, wherein the comprehensive evaluation of the difference distance of the characteristic data of the navigation track in all dimensions in the navigation time interval of each initial cluster by using the characteristic occupation coefficient comprises the following steps:

6. The method for intelligent analysis of ship data in real time based on data acquisition according to claim 5, wherein the clustering of the navigation tracks of all ships according to the evaluation result to obtain the clustering result of the navigation tracks specifically comprises:

7. The method for intelligent analysis of ship data in real time based on data collection according to claim 1, wherein the clustering of the navigation tracks based on the distance distribution between the starting point position and the end point position of the navigation track of each ship to obtain an initial cluster specifically comprises: