
CN110826686A - Machine learning system and method with attribute sequence - Google Patents

Machine learning system and method with attribute sequence

Info

Publication number
CN110826686A
Authority
CN
China
Prior art keywords
attribute
sequence
network
layer
network module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910719231.0A
Other languages
Chinese (zh)
Other versions
CN110826686B (en)
Inventor
庄中方
A·阿罗拉
J·扎瓦维
孔祥南
E·润丹斯坦纳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amadeus SAS
Original Assignee
Amadeus SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/057,025 (US12086718B2)
Priority claimed from FR1857340A (FR3084946B1)
Application filed by Amadeus SAS filed Critical Amadeus SAS
Publication of CN110826686A
Application granted
Publication of CN110826686B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a machine learning system and method for attributed sequences: a machine learning system and method for embedding attributed sequence data. The attributed sequence data includes an attribute data portion having a fixed number of attribute data elements and a sequence data portion having a variable number of sequence data elements. An attribute network module includes a feed-forward neural network configured to convert the attribute data portion into an encoded attribute vector having a first number of attribute features. A sequence network module includes a recurrent neural network configured to convert the sequence data portion into an encoded sequence vector having a second number of sequence features. In use, the machine learning system learns and outputs a fixed-length feature representation of the input attributed sequence data, encoding dependencies between different attribute data elements, dependencies between different sequence data elements, and dependencies between attribute data elements and sequence data elements within the attributed sequence data.

Description

Machine learning system and method with attribute sequence
Technical Field
The invention relates to applications of machine learning. In particular, embodiments of the present invention provide unsupervised and supervised learning of feature embeddings of attributed sequences, i.e. data instances comprising both fixed-length attribute data and variable-length sequence data, with characteristics desirable for practical applications including, but not limited to, fraud detection, and the analysis and data mining of click streams of web users, purchase histories of online customers, or DNA sequences.
Background
Ordered data occurs naturally in a wide range of applications. Examples of ordered data include the click streams of web users, the purchase histories of online customers, and the DNA sequences of genes. Ordered data comprises variable-length sequences of categorical items, and typically requires careful design of a feature representation before being fed to a learning algorithm. One method of feature learning for ordered data is known as sequence embedding, in which the goal is to transform variable-length sequences into fixed-length feature representations.
Prior art methods for sequence embedding focus on learning from ordered data only. However, in many real-world applications, variable-length sequences are typically associated with a fixed-size set of attributes. For example, in an online purchasing system, each user transaction includes a set of attributes (e.g., "username," "browser," and "IP address") indicating the context of the transaction and a series of user actions (e.g., "login," "search," "add item to shopping cart," "check out," etc.). As another example, in gene function analysis, each gene may be represented by a DNA sequence and a set of attributes that indicate the level of expression of the gene in different types of cells.
In the sequence embedding problem, conventional approaches focus on modeling item dependencies, i.e. dependencies between different items within a sequence. However, a given ordering of items may have different meanings when associated with different attribute values. Therefore, learning an embedding with the characteristics desired for a practical application requires jointly considering three types of dependencies: item dependencies (i.e. dependencies between different items in a sequence); attribute dependencies (i.e. dependencies between different attributes); and attributed sequence dependencies (i.e. dependencies between the attributes and the items in a sequence).
A closely related problem is distance metric learning. It is generally desirable that the feature representations of the observation data have the property that similar observations have similar features, i.e., such observations are clustered in feature space, while the representations of dissimilar observations are more separated in distance. Thus, in distance metric learning, the goal is to learn a suitable distance metric based on a set of similar/dissimilar instance pairs. Many real-world applications from information retrieval to healthcare informatics may benefit greatly from distance metric learning. For example, in healthcare informatics, it may be desirable to learn a distance metric that accurately measures similarity between patients to find the correct treatment for the patient.
Conventional approaches to distance metric learning generally focus on learning a Mahalanobis distance metric, which is equivalent to learning a linear transformation of the data attributes. In a non-linear setting, a non-linear mapping function may first be learned to project instances into a new space, and the final metric then becomes the Euclidean distance metric in that space. Deep metric learning is generally the method of choice for learning non-linear mappings in practice. While advances have been made in metric learning using ordered data, the challenges discussed above reappear where the ordered data depends on an associated context/attributes.
Thus, for many practical applications, there is a need for effective systems and methods for learning features and distance metrics of data sets and observations that include fixed-length attribute data and associated variable-length ordered data.
Disclosure of Invention
In one aspect, the invention provides a machine learning system for embedding attributed sequence data including an attribute data portion having a fixed number of attribute data elements and a sequence data portion having a variable number of sequence data elements into a fixed-length feature representation. The system comprises: an attribute network module comprising a feed-forward neural network configured to convert the attribute data portion into an encoded attribute vector having a first predetermined number of attribute features; and a sequence network module comprising a recurrent neural network configured to convert the sequence data portions into an encoded sequence vector having a second predetermined number of sequence features. The attribute network module and the sequence network module are operably coupled such that, in use, the machine learning system is configured to learn and output a fixed length feature representation of the input attribute-containing sequence data, the fixed length feature representation encoding dependencies between different attribute data elements in the attribute data portion, dependencies between different sequence data elements in the sequence data portion, and dependencies between attribute data elements and sequence data elements within the attribute-containing sequence data.
Advantageously, the coupling of the attribute network module comprising the feedforward neural network with the sequence network module comprising the recurrent neural network enables the system to learn a non-linear function of the input attributed sequence data that can account for both the isomorphic dependencies (i.e. those within the attribute and sequence data portions) and the heterogeneous dependencies (i.e. those between the attribute and sequence data portions) of items within the attributed sequence.
In an embodiment of the invention, the attribute network module comprises a multi-layer feedforward neural network having an attribute vector output layer comprising a first predetermined number of units, and the recurrent neural network of the sequence network module comprises a Long Short Term Memory (LSTM) network having a second predetermined number of hidden units. In this way, the number of features in the attribute vector becomes a design parameter of the attribute network, and the number of features in the sequence vector becomes a design parameter of the sequence network. Advantageously, the design parameters are independent of the number of attribute data elements, the length of any sequence data portion, and the number of different items comprising sequence data.
In another aspect, embodiments of the invention provide a method of training a machine learning system for embedding attributed sequence data, comprising an attribute data portion having a fixed number of attribute data elements and a sequence data portion having a variable number of sequence data elements, into a fixed-length feature representation. The machine learning system includes a multi-layer feed-forward neural network having an attribute data input layer and an attribute vector output layer, the attribute vector output layer comprising a first predetermined number of units, the multi-layer feed-forward neural network being operatively coupled to an LSTM network comprising a second predetermined number of hidden units. The training method comprises the following steps: a data set including a plurality of attributed sequences is provided and, for each attributed sequence in the data set, the multi-layer feedforward neural network is trained via backpropagation using the attribute data portion of the attributed sequence with respect to a first objective function, and the LSTM network is trained via backpropagation using the sequence data portion of the attributed sequence with respect to a second objective function. The training of the multi-layer feedforward neural network is coupled with the training of the LSTM network such that, when trained, the machine learning system is configured to output a fixed-length feature representation of the input attributed sequence data, the fixed-length feature representation encoding dependencies between different attribute data elements in the attribute data portion, dependencies between different sequence data elements in the sequence data portion, and dependencies between attribute data elements and sequence data elements within the attributed sequence data.
A further advantage is that, in various embodiments of the invention, different coupling arrangements may be employed, resulting in alternative network architectures capable of generating different embeddings of the input attributed sequence data.
Thus, in one exemplary arrangement, the attribute network module is operatively coupled to the sequence network module by passing the output of the attribute vector output layer to an attribute vector input of the sequence network module. In particular, the attribute vector input of the sequence network module may comprise the hidden state of the LSTM network at the first evaluation step, the first predetermined number of attribute vector output layer units may be equal to the second predetermined number of hidden units of the sequence network module, and the fixed-length feature representation of the input attributed sequence data may comprise the hidden state of the LSTM network at the final evaluation step. In this case, the resulting number of features in the embedding is equal to the second predetermined number, i.e. the number of hidden units in the LSTM network.
In a related embodiment of the training method, the multi-layer feedforward neural network includes an encoder having an encoder input layer comprising the attribute data input layer and an encoder output layer comprising the attribute vector output layer. The multi-layer feedforward neural network also includes a decoder having a decoder input layer coupled to the encoder output layer, and a decoder output layer comprising a reconstructed estimate of the input to the encoder input layer. The first objective function may comprise a distance measure between the input of the encoder input layer and the reconstructed estimate. Training the multi-layer feedforward neural network may then comprise iteratively performing the steps of forward propagation and backward propagation, using the attribute data portion of the attributed sequence as input to the encoder input layer, until the distance measure meets a first convergence goal. The second objective function may comprise a likelihood measure of incorrect prediction of the next sequence item at each of a plurality of training time steps of the LSTM network. Training the LSTM network may comprise iteratively repeating the plurality of training time steps until the likelihood measure satisfies a second convergence goal. Each iteration includes copying the output of the attribute vector output layer to a hidden state of the LSTM network at the first training time step; and, at the final training time step, calculating the likelihood measure. The distance measure may comprise a mean squared error loss function, and the likelihood measure may comprise a categorical cross-entropy loss function.
In another exemplary arrangement, the attribute network module is operatively coupled to the sequence network module by passing the output of the sequence network module to an input layer of the attribute network module. In particular, the number of units in the input layer of the attribute network module may be equal to the sum of the fixed number of attribute data elements and the second predetermined number of sequence network module hidden units, the output of the sequence network module may comprise the hidden state of the LSTM network at the final evaluation step, the hidden state of the LSTM network at the final evaluation step being concatenated with the fixed number of attribute data elements to produce a concatenated attribute network input vector which is passed to the input layer of the attribute network module, and the fixed-length feature representation of the input attributed sequence data may comprise the output of the attribute vector output layer. In this case, the number of features in the resulting embedding is equal to the first predetermined number, i.e. the number of units in the attribute vector output layer.
In a related embodiment of the training method, the second objective function may include a likelihood measure of incorrect predictions for the next sequence item at each of a plurality of training time steps of the LSTM network, and training the LSTM network may include iteratively repeating the plurality of training time steps until the likelihood measure satisfies the second convergence goal. Each iteration may include: at a first training time step, copying the output of the attribute vector output layer to a hidden state of the LSTM network; and at a final training time step, calculating a likelihood measure. The multi-layer feedforward neural network may include: an encoder having an encoder input layer comprising an attribute data input layer and an encoder output layer comprising an attribute vector output layer; and a decoder having a decoder input layer coupled to the encoder output layer, and a decoder output layer including a reconstructed estimate of an input to the encoder input layer. The first objective function may comprise a distance measure between an input of the encoder input layer and the reconstructed estimate. Training the multi-layer feedforward neural network may include applying the hidden state of the LSTM network to the encoder input layer at a final training time step concatenated with a fixed number of attribute data elements, and iteratively performing the steps of forward propagation and backward propagation until the distance measurement meets a second convergence goal.
In yet another exemplary arrangement, the attribute network module is operatively coupled to the sequence network module via a fusion network, the fusion network including an input concatenation layer configured to concatenate the output of the attribute vector output layer with the output of the sequence network module, and a nonlinear function module configured to learn a nonlinear function of the concatenated input that encodes dependencies between attribute data elements and sequence data elements within the attributed sequence data. In particular, the number of units in the input concatenation layer may be equal to the sum of the first predetermined number of attribute features and the second predetermined number of sequence features, the output of the sequence network module may comprise the hidden state of the LSTM network at the final evaluation step, the nonlinear function module may comprise a fully connected feedforward neural network layer, and the fixed-length feature representation of the input attributed sequence data comprises the output vector of the fully connected feedforward neural network layer.
In this case, the number of features in the resulting embedding is equal to the size of the output of the non-linear function module, and may in particular be equal to the sum of the first and second predetermined numbers, i.e. the combined count of the units in the attribute vector output layer and the hidden units in the LSTM network.
In some embodiments, advantageously configured to learn the embedding in a supervised manner using labeled samples of similar and dissimilar attributed sequences, the system further comprises a metric network module bidirectionally coupled to the attribute network module and the sequence network module. The metric network module is configured to receive pairs of fixed-length feature representations of corresponding samples of attributed sequence data. Each pair is labeled to indicate whether it comprises similar or dissimilar attributed sequence data. The metric network module is further configured to compute gradient information based on a loss function defined according to a predetermined distance metric. The goal is to learn an embedding whereby the fixed-length feature representations of corresponding samples of attributed sequence data have a smaller distance under the predetermined distance metric when labeled as similar than when labeled as dissimilar. The metric network module is further configured to back-propagate the gradient information through the attribute network module and the sequence network module, thereby updating the parameters of the attribute network module and the sequence network module toward achieving the goal.
In yet another aspect, embodiments of the invention provide a method of training a machine learning system for embedding attributed sequence data, comprising an attribute data portion having a fixed number of attribute data elements and a sequence data portion having a variable number of sequence data elements, into a fixed-length feature representation. The machine learning system includes: a multi-layer feedforward neural network module having an attribute data input layer and an attribute vector output layer, the attribute vector output layer comprising a first predetermined number of units; a Long Short Term Memory (LSTM) network comprising a second predetermined number of hidden units; and a fusion network comprising an input concatenation layer having a number of units equal to the sum of the first predetermined number and the second predetermined number, and a nonlinear function layer comprising a fully connected feedforward neural network layer. The training method comprises the following steps: a data set is provided comprising a plurality of pairs of attributed sequences, wherein each pair is labeled to indicate whether it comprises similar or dissimilar attributed sequence data. For each pair of attributed sequences in the data set, the method includes computing a pair of attribute vectors using the multi-layer feed-forward neural network, each attribute vector having a number of elements equal to the first predetermined number and corresponding to the attribute data portion of an attributed sequence; computing a pair of sequence vectors using the LSTM network, each sequence vector having a number of elements equal to the second predetermined number and corresponding to the sequence data portion of an attributed sequence; concatenating corresponding ones of the computed attribute vectors and sequence vectors to generate a pair of fixed-length feature representations of the pair of attributed sequences; computing a non-linear transformation function of the fixed-length feature representations to generate a pair of transformed feature representations; and computing gradient information based on a loss function defined according to a predetermined distance metric over the transformed feature representation pairs. The goal is to learn an embedding whereby pairs of fixed-length feature representations of corresponding samples of attributed sequence data have a smaller distance under the predetermined distance metric when labeled as similar than when labeled as dissimilar. For each pair of attributed sequences in the data set, the method further includes back-propagating the gradient information through the multi-layer feedforward neural network and the LSTM network, thereby updating the parameters of the networks toward achieving the goal.
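For illustration only, the following sketch shows one conventional form of a pairwise loss over such transformed feature representation pairs; the specific contrastive formula, the function name and the example vectors are assumptions made for the sketch and are not quoted from this disclosure.

```python
import numpy as np

def contrastive_loss(z_i, z_j, label, margin):
    """Pairwise objective on transformed feature representations: similar pairs
    (label = 1) are pulled together, dissimilar pairs (label = 0) are pushed
    apart by at least `margin`. This exact form is an assumed example."""
    d = np.linalg.norm(np.asarray(z_i) - np.asarray(z_j))
    return label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2

# Transformed feature representations of a labeled pair of attributed sequences.
z_i = np.array([0.1, 0.8, 0.3])
z_j = np.array([0.2, 0.7, 0.4])
print(contrastive_loss(z_i, z_j, label=1, margin=1.0))   # small: pair is similar
print(contrastive_loss(z_i, z_j, label=0, margin=1.0))   # penalises closeness
```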
Other aspects, advantages, and features of embodiments of the present invention will be apparent to those skilled in the relevant art from the following description of the various embodiments. It is to be understood, however, that the invention is not limited to the described embodiments, which are provided to illustrate the principles of the invention and to assist those skilled in putting such principles into practice.
Drawings
Embodiments of the present invention will now be described with reference to the drawings, wherein like reference numerals refer to like features.
FIG. 1 is a block diagram illustrating an exemplary networked system including an e-commerce fraud detection system according to an embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating data associated with a user's interaction with the e-commerce system of FIG. 1.
FIG. 3 is a schematic diagram illustrating the importance of attributed sequence data in the context of the fraud detection system of FIG. 1.
FIG. 4 is a diagram illustrating a mapping of attributed sequences to an exemplary feature space.
Fig. 5 is a schematic diagram of an attribute network according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a sequence network according to an embodiment of the present invention.
Fig. 7 illustrates an attribute-network-first (attribute-network-first) coupling in accordance with an embodiment of the present invention.
Fig. 8 illustrates sequence-network-first (sequence-network-first) coupling according to an embodiment of the invention.
FIG. 9 illustrates balanced coupling according to an embodiment of the present invention.
Fig. 10 is a flow diagram illustrating an exemplary algorithm for unsupervised learning for embedding of attributed sequences using attribute-network-first coupling in accordance with an embodiment of the present invention.
FIG. 11 is a schematic diagram illustrating a supervised distance metric learning system in accordance with an embodiment of the present invention.
Fig. 12 is a flow diagram illustrating an exemplary algorithm for supervised distance metric learning, according to an embodiment of the present invention.
Detailed Description
FIG. 1 is a block diagram illustrating an exemplary networked system 100 including a fraud detection system 102 embodying the present invention. In particular, the fraud detection system 102 includes a machine learning system configured to generate embeddings of attributed sequence data according to embodiments of the invention. As will be understood by those skilled in the art of machine learning, the term "embedding" refers to a feature representation of input data samples, whereby characteristics of the data are encoded within a feature space such that similarities or differences between samples may be represented by a measure of distance in the feature space. The meaning of the term "attributed sequence" is discussed in more detail below with reference to FIG. 2. It should be appreciated that the example of the fraud detection system 102 is provided by way of illustration only, as a specific context for illustrating the principles of the invention and to assist the skilled person in putting these principles into practical effect. However, embodiments of the invention may be applied in other contexts where generating embeddings of attributed sequence data is advantageous, such as click streams of web users (e.g. for targeted advertising or recommender systems), analysis and data mining of DNA sequences, or the purchase histories of online customers.
The fraud detection system 102 may comprise a computer system having a conventional architecture. In particular, as shown, the fraud detection system 102 includes a processor 104. The processor 104 is operatively associated with non-volatile memory/storage 106, e.g. via one or more data/address buses 108 as shown. The non-volatile storage 106 may be a hard disk drive and/or may include solid state non-volatile memory, such as ROM, flash memory, a Solid State Drive (SSD), and the like. The processor 104 is also interfaced with volatile storage 110, such as RAM, containing program instructions and transient data relating to the operation of the fraud detection system 102.
In configuration, the storage device 106 maintains program and data content related to the normal operation of the fraud detection system 102. For example, storage 106 may contain operating system programs and data, as well as other executable application software needed for the intended function of fraud detection system 102. The storage device 106 also contains program instructions that, when executed by the processor 104, cause the fraud detection system 102 to perform operations associated with embodiments of the invention, such as the operations described in more detail below and in particular with reference to FIGS. 5-12. In operation, instructions and data held on storage device 106 are transferred to volatile memory 110 for execution as needed.
The processor 104 is also operatively associated with a communication interface 112. The communication interface 112 facilitates access to a wide area data communication network, such as the internet 116.
In use, the volatile storage 110 contains corresponding bodies of program instructions 114, the program instructions 114 being transferred from the storage device 106 and configured to perform processes and other operations that implement features of embodiments of the present invention. The program instructions 114 comprise technical contributions to the art specifically developed and configured to implement embodiments of the invention beyond and above the activities well known, routine and conventional in the art of machine learning systems, as further described below, particularly with reference to fig. 5-12.
With respect to the foregoing overview of the fraud detection system 102, and the other processing systems and devices described in this specification, unless the context requires otherwise, terms such as "processor," "computer," and the like, should be understood to refer to a range of possible implementations of devices, apparatuses, and systems that include a combination of hardware and software. This includes single-processor and multi-processor devices and apparatus, including portable devices, desktop computers, and various types of server systems, including cooperating hardware and software platforms that may be co-located or distributed. The physical processors may include general purpose CPUs, digital signal processors, Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), and/or other hardware devices suitable for efficiently executing desired programs and algorithms. As will be appreciated by those skilled in the art, a GPU may be employed, among other things, for high performance implementation of a deep neural network that includes various embodiments of the present invention, under the control of one or more general purpose CPUs.
The computing system may include a personal computer architecture or other general purpose hardware platform. The software may include open source and/or commercially available operating system software as well as various applications and service programs. Alternatively, the computing or processing platform may include a custom hardware and/or software architecture. To enhance scalability, the computing and processing system may include a cloud computing platform, enabling physical hardware resources to be dynamically allocated in response to service demands. While all such variations fall within the scope of the present invention, for ease of explanation and understanding, the exemplary embodiments are described herein with illustrative reference to a single processor general purpose computing platform, a commonly available operating system platform, and/or a widely available consumer product, such as a desktop PC, a notebook or laptop PC, a smart phone, a tablet computer, and the like.
In particular, the terms "processing unit" and "module" are used in this specification to refer to any suitable combination of hardware and software configured to perform certain defined tasks, such as accessing and processing offline or online data, performing an unsupervised or supervised training step of a machine learning model, performing a feature embedding step of a machine learning model, performing a distance metric evaluation step, or performing a fraud detection step. Such processing units or modules may include executable code that executes at a single location on a single processing device or may include cooperating executable code modules that execute at multiple locations and/or on multiple processing devices. For example, in some embodiments of the invention, the embedding of the data samples may be performed entirely by code executing on a single system, such as fraud detection system 102, while in other embodiments, corresponding processing may be performed in a distributed manner across multiple systems.
Software components (e.g., program instructions 114) that implement features of the invention may be developed using any suitable programming language, development environment, or combination of languages and development environments, as will be familiar to those skilled in the art of software engineering. For example, the C programming language, Java programming language, C++ programming language, Go programming language, Python programming language, R programming language, and/or other languages suitable for implementing machine learning algorithms may be used to develop suitable software. Development of software modules embodying the present invention can be supported by the use of machine learning code libraries, such as the TensorFlow, Torch, and Keras libraries. However, those skilled in the art will recognize that embodiments of the present invention involve the implementation of software structures and code that are not well known, routine, or conventional in the art of machine learning systems, and that although pre-existing libraries may assist in implementation, they require specific configuration and extensive enhancement (i.e. additional code development) in order to implement the specific structures, processes, computations and algorithms described below, particularly with reference to FIGS. 5-12.
The above examples of languages, environments, and code libraries are not intended to be limiting, and it should be appreciated that any convenient language, library, and development system may be employed according to system requirements. The descriptions, block diagrams, flowcharts, equations, etc. presented in this specification are provided by way of example to enable one skilled in the software engineering and machine learning arts to understand and appreciate the features, nature and scope of the present invention and to implement one or more embodiments of the present invention by implementing suitable software code using any suitable language, framework, library or development system in accordance with the present disclosure without employing additional inventive innovations.
Program code embodied in any of the applications/modules described herein can be distributed as a program product in a variety of different forms, individually or collectively. In particular, the program code may be distributed using a computer readable storage medium having computer readable program instructions thereon for causing a processor to perform aspects of embodiments of the present invention.
Computer-readable storage media that are non-transitory in nature may include volatile and nonvolatile, removable and non-removable tangible media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media may also include Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, portable compact disc read-only memory (CD-ROM), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be read by a computer. The computer-readable storage medium should not be interpreted as a transitory signal per se (e.g., a radio wave or other propagating electromagnetic wave, an electromagnetic wave propagating through a transmission medium such as a waveguide, or an electrical signal transmitted through a wire). The computer-readable program instructions may be downloaded from a computer-readable storage medium to a computer, another type of programmable data processing apparatus, or other devices, or may be downloaded to an external computer or external storage device via a network.
The computer readable program instructions stored in the computer readable medium may be used to direct a computer, other type of programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function, act, and/or operation specified in the flowchart, sequence diagram, and/or block diagram block or blocks. The computer program instructions may be provided to one or more processors of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the one or more processors, cause a series of computations to be performed to implement the functions, acts, and/or operations specified in the flowchart, sequence diagram, and/or block diagram block or blocks.
Continuing with the discussion of FIG. 1, the networked system 100 also includes a monitored system 118. By providing a specific example to illustrate the principles of the present invention, monitored system 118 may be an online sales or e-commerce system. As is well known, users may employ a web browser or other application software to access an e-commerce system 118 from their own personal computer 120 or other suitable device via the Internet 116. User interaction with the e-commerce system 118 may generally involve a number of ordered events or actions, such as logging in, searching for and/or browsing items, selecting items, adding items to an electronic shopping cart, performing checkout processing (e.g., providing payment details, providing shipping details, and confirming purchases), and logging out. These exemplary events and actions are not intended to be limiting, and it should be appreciated that any online system, such as the e-commerce system 118, supports a particular limited (albeit possibly large) set of individual events and actions, and/or sequences of individual events and actions.
In this context, FIG. 2 is a diagram 200 illustrating data associated with a user's interaction with the e-commerce system 118. Two exemplary data samples 202, 204 are shown, each exemplary data sample 202, 204 being associated with a single interaction of a user with the e-commerce system 118. Each interaction has a number of associated attributes 206, 208 including, for example, a name or identifier of the user, an IP address associated with the device (e.g., 120) used to access the system 118, the operating system of the device, and information about the web browser or other application software used to access the system 118. These attributes 206, 208 provide a "fingerprint" form of the user, device, and software. As will be appreciated, the user's interaction with the website may have alternative or additional attributes (not shown), such as the time of the interaction, and the geographic location of the user device. The attributes selected in any particular implementation include data records having a known fixed size.
Each interaction also has an associated sequence of actions or events 210, 212, such as those outlined above. In contrast to the attributes 206, 208, each sequence 210, 212 includes data records containing a variable number of items. Furthermore, the sequential ordering of the items in the sequence is often important.
The term "attributed sequence" is used throughout this specification to refer to any data sample, such as e-commerce interaction data 202, 204, that includes associated attributes and sequence records. More particularly, a fixed length attribute vector x is includedkAnd variable length sequences SkHas an attribute sequence JkCan be represented as Jk=(xk,Sk). In some cases, S may be conveniently filled by determining the length T of the longest sequence in a set of sequences, and filling all shorter sequences to that length with null entrieskConversion to a fixed length representation.
FIG. 3 is a diagram 300 illustrating the importance of attributed sequence data in the context of an exemplary fraud detection application. Five attributed sequences 302 are shown, labeled J_1 to J_5. Embedding the sequence data alone results in a set of feature vectors represented by the tree diagram 304, which indicates that the sequences J_1 and J_2 are similar (i.e. separated by a relatively short distance measure in feature space), and that the sequences J_3, J_4 and J_5 are similar. Thus, no individual sequence stands out as an anomaly or outlier. Embedding the attribute data alone results in a separate set of feature vectors represented by the tree diagram 306, which indicates that the attribute records J_1, J_2 and J_5 are similar, and that the attribute records J_3 and J_4 are similar. Likewise, no individual record stands out as an anomaly or outlier.
A problem with the approach of processing sequence data and attribute data separately is that, while it may account for the dependencies between different items in the sequence and the dependencies between different elements in the attribute record, it does not take into account the dependencies between the sequence data and the attribute data. As shown in the tree diagram 308, once this heterogeneous dependency is taken into account, a different grouping may emerge. For example, as shown, alternative feature vectors derived from the attributed sequences may reveal that J_1 and J_2 are similar, that J_3 and J_4 are similar, and that J_5 maps to an embedding 310 that is very different from all other feature vectors. This is further illustrated in FIG. 4, which is a schematic diagram 400 (limited to two dimensions/features for simplicity of illustration) illustrating the mapping of attributed sequences 402 to a feature space 404, where one of the attributed sequences 406 has an embedding that is relatively distant from the clusters 408, 410 corresponding to the other attributed sequences.
Thus, attributed sequence embeddings can lead to the identification of anomalous data even in cases where sequence embeddings and attribute embeddings, considered separately, do not. Such outliers 310 are important because they may represent fraud that should be flagged by the fraud detection system 102. It is therefore necessary to generate embeddings of attributed sequences that take into account all three types of dependencies, i.e. the isomorphic dependencies within the sequence and attribute data, and the heterogeneous dependencies between the sequence and attribute data.
Embodiments of the present invention generate such embeddings through the coupled combination of at least two machine learning modules. More particularly, in some embodiments of the invention, as described below with reference to FIGS. 5-10, an attribute network module is coupled to a sequence network module to provide a system configured to learn feature representations of attributed sequences in an unsupervised manner (i.e. without any labeled data identifying similar and/or dissimilar attributed sequences). In other embodiments of the invention, as described with reference to FIGS. 11-12, a third module, identified as a "metric network", is additionally coupled to the attribute network module and the sequence network module to provide a system configured to learn feature representations of attributed sequences in a supervised or semi-supervised manner (i.e. by learning based at least in part on data that has been labeled, for example by a human expert).
In particular embodiments, as disclosed herein, the attribute network may be a fully connected neural network configured to encode the fixed-length attribute data portion of an attributed sequence using a non-linear transformation. The sequence network may be a Long Short Term Memory (LSTM) network, i.e. a recurrent neural network, configured to encode the structural information of the variable-length sequence data portion of the attributed sequence into a fixed-length vector. The metric network may be a feedback module configured to generate gradient information, from a loss function and learning objective based on the labeled data, which is back-propagated through the attribute and sequence networks.
FIG. 5 is a schematic diagram of an attribute network 500, the attribute network 500 having input attributes 502 comprising a fixed number u of entries x_k, an input layer 504, and a plurality of further layers, e.g. 506, 508. In particular, the attribute network 500 may comprise M layers, with d_m hidden units in the m-th layer and a corresponding output V_k^(m), m = 1, ..., M. The structure of the attribute network 500 may then be represented as:
V_k^(m) = δ(W_A^(m) V_k^(m-1) + b_A^(m)), m = 1, ..., M, with V_k^(0) = x_k    (1)
In equation (1), δ is a non-linear activation function, e.g. sigmoid, ReLU or tanh, W_A^(m) is a matrix of weight parameters, and b_A^(m) is a vector of bias parameters. Where the system is configured to learn the feature representations of attributed sequences in an unsupervised manner (i.e. without any labeled data identifying similar and/or dissimilar attributed sequences), it is convenient to define a substitute network size parameter M' such that M = 2M', and the structure of the attribute network 500 is then defined as:
V_k^(m) = ρ(W_A^(m) V_k^(m-1) + b_A^(m)), m = 1, ..., M'
V_k^(m) = σ(W_A^(m) V_k^(m-1) + b_A^(m)), m = M'+1, ..., 2M'    (2)
In equation (2), the activation functions ρ and σ may be the same or different. In certain embodiments, it has been found that using ρ(z) = ReLU(z) and σ(z) = sigmoid(z) performs better than using a single activation function. In the attribute network 500 having 2M' layers there are two components, as defined in equation (2): an encoder, comprising the first M' layers, which generates a feature representation having d_M' components; and a decoder, comprising the further M' layers, which attempts to reconstruct the input, whereby the decoder output x̂_k = V_k^(2M') is the reconstruction result.
The number of units d_M in the output layer defined by V_k^(M) in equation (1) and, likewise, the number of units d_M' in the output layer defined by V_k^(M') in equation (2), are parameters of the attribute network 500 that are determined at the time of design and/or configuration of the network 500 and are subsequently fixed during operation. This parameter thus constitutes a first predetermined number that contributes to the particular embedding of attributed sequence data generated by embodiments of the invention.
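A minimal numpy sketch of a forward pass through an encoder/decoder attribute network of the kind described by equations (1) and (2) is given below; the layer sizes, random weights and the assignment of ReLU and sigmoid activations are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

u, d1, d2 = 3, 8, 4                  # input size and layer widths (M' = 2 here)
x_k = rng.random(u)

# Encoder: M' layers with activation rho = ReLU.
W = [rng.normal(size=(d1, u)), rng.normal(size=(d2, d1))]
b = [np.zeros(d1), np.zeros(d2)]
V1 = relu(W[0] @ x_k + b[0])
V2 = relu(W[1] @ V1 + b[1])          # V_k^(M'): the d_M'-dimensional encoding

# Decoder: a further M' layers with activation sigma = sigmoid, attempting to
# reconstruct the input x_k.
W_dec = [rng.normal(size=(d1, d2)), rng.normal(size=(u, d1))]
b_dec = [np.zeros(d1), np.zeros(u)]
V3 = sigmoid(W_dec[0] @ V2 + b_dec[0])
x_hat = sigmoid(W_dec[1] @ V3 + b_dec[1])    # reconstruction of x_k

reconstruction_error = np.sum((x_k - x_hat) ** 2)
```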
FIG. 6 is a schematic diagram of a sequence network 600. The sequence network 600 is a variant of the LSTM model. As will be appreciated by those skilled in machine learning using neural networks, LSTM models are recurrent neural networks, i.e. they operate via internal feedback at each evaluation time step. Conventionally, however, the sequence network 600 is represented in "unrolled" form, whereby the input 602 represents sequential input passed to the units 604 of the network at each successive step, resulting in the generation of corresponding successive hidden states 606. The size of the sequence network (i.e. the number of hidden units) is denoted d_S. The structure of the sequence network 600 can be represented as:
i_k^(t) = σ(W_i s_k^(t) + U_i h_k^(t-1) + b_i)
f_k^(t) = σ(W_f s_k^(t) + U_f h_k^(t-1) + b_f)
o_k^(t) = σ(W_o s_k^(t) + U_o h_k^(t-1) + b_o)
g_k^(t) = tanh(W_c s_k^(t) + U_c h_k^(t-1) + b_c)
c_k^(t) = f_k^(t) ⊙ c_k^(t-1) + i_k^(t) ⊙ g_k^(t)
h_k^(t) = o_k^(t) ⊙ tanh(c_k^(t))    (3)
In equation (3), s_k^(t) denotes the categorical item of the sequence S_k at time t; σ is a sigmoid gating function; i_k^(t), f_k^(t), o_k^(t) and g_k^(t) are internal gates; c_k^(t) is the cell state and h_k^(t) is the hidden state (all expressed as vectors of length d_S); W_i, W_f, W_o, W_c, U_i, U_f, U_o and U_c are weight matrices; and b_i, b_f, b_o and b_c are bias vectors. The operator ⊙ denotes element-wise multiplication.
The output of the sequence network 600 may then be defined as:
y_k^(t) = softmax(W_y h_k^(t) + b_y)    (4)
In equation (4), W_y is a weight matrix and b_y is a bias vector. The quantity y_k^(t) is a vector of length r, equal to the number of distinct items from which the input sequence is drawn, and can be interpreted as a probability distribution over the r items that can be used to predict the next item in the input sequence.
The number of hidden units d_S is a parameter of the network 600 that is determined at the time of design and/or configuration of the network 600 and is subsequently fixed during operation. This parameter thus constitutes a second predetermined number that contributes to the particular embedding of attributed sequence data generated by embodiments of the invention.
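The following numpy sketch steps through the recurrences of equations (3) and (4) for a short item sequence; the dimensions, the one-hot item encoding and the random initialisation are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

r, d_S = 5, 6                        # item vocabulary size and hidden units
seq = [0, 1, 2, 3]                   # item indices of S_k

# Gate parameters (equation (3)) and output parameters (equation (4)).
W = {g: rng.normal(scale=0.1, size=(d_S, r)) for g in "ifoc"}
U = {g: rng.normal(scale=0.1, size=(d_S, d_S)) for g in "ifoc"}
bias = {g: np.zeros(d_S) for g in "ifoc"}
W_y, b_y = rng.normal(scale=0.1, size=(r, d_S)), np.zeros(r)

h = np.zeros(d_S)
c = np.zeros(d_S)
for t in seq:
    x_t = np.eye(r)[t]               # one-hot encoding of the item at step t
    i = sigmoid(W["i"] @ x_t + U["i"] @ h + bias["i"])
    f = sigmoid(W["f"] @ x_t + U["f"] @ h + bias["f"])
    o = sigmoid(W["o"] @ x_t + U["o"] @ h + bias["o"])
    g = np.tanh(W["c"] @ x_t + U["c"] @ h + bias["c"])
    c = f * c + i * g                # cell state c_k^(t)
    h = o * np.tanh(c)               # hidden state h_k^(t)
    y_t = softmax(W_y @ h + b_y)     # distribution over the next item
```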
To generate an embedding of attributed sequences, embodiments of the present invention employ a coupling between the attribute network 500 and the sequence network 600. FIG. 7 illustrates an attribute-network-first coupling 700 in which an attribute network 702 is coupled to a sequence network 704 via a connection 706, the connection 706 passing the output of the attribute network 702 to the input of the sequence network 704. In the case of an unsupervised system, i.e. as described by equation (2), the output of the M'-th layer of the attribute network 702 is coupled to the hidden state of the sequence network 704 at the first step, i.e. by modifying equation (3) according to:
h_k^(0) = V_k^(M')    (5)
in the case of a supervised system, i.e. as described in equation (1), a similar modification can be made to replace M' in equation (5) with M. In order for this coupling to work, the number d of hidden units in the coupling layer of the attribute networkM'(or d)M) Must equal the number d of hidden units in the sequence networkS. Both of these values are design parameters of the network. Then, after the last time step in the processing sequence, will have a sequence length lkHas an attribute sequence Jk= (xk,Sk) By embedding (i.e. fixed length feature representation) ofIs the cell state of the sequence network 704
Figure BDA0002156573870000172
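A compact sketch of this coupling is given below, using a toy one-layer attribute network and a simplified LSTM-style recurrence; the sizes and random weights are assumptions, and the final cell state is taken as the embedding as described above.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

u, r, d = 3, 5, 6                      # attributes, item vocabulary, d_M' = d_S = d
x_k = rng.random(u)
S_k = [0, 1, 2, 3, 4]

# Toy one-layer "attribute network" and a single LSTM-style recurrence standing
# in for equations (2) and (3); weights are random placeholders.
W_A = rng.normal(scale=0.1, size=(d, u))
b_A = np.zeros(d)
W_x = rng.normal(scale=0.1, size=(4 * d, r))
U_h = rng.normal(scale=0.1, size=(4 * d, d))
b_g = np.zeros(4 * d)

h = np.maximum(W_A @ x_k + b_A, 0.0)   # attribute encoding seeds the hidden state
c = np.zeros(d)
for t in S_k:
    gates = W_x @ np.eye(r)[t] + U_h @ h + b_g
    i, f, o = (sigmoid(gates[j * d:(j + 1) * d]) for j in range(3))
    g = np.tanh(gates[3 * d:])
    c = f * c + i * g
    h = o * np.tanh(c)

embedding = c                          # fixed-length representation of J_k
print(embedding)
```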
FIG. 8 illustrates a sequence-network-first coupling 800 in which a sequence network 802 is coupled to an attribute network 804 via a connection 806, the connection 806 passing the output of the sequence network 802 to the input of the attribute network 804. The coupling may be effected by concatenating the hidden state h_k^(l_k) of the sequence network 802 after the last time step with the attribute data x_k, i.e. by modifying equations (1) and (2) such that the input to the first layer of the attribute network becomes x_k ⊕ h_k^(l_k), where ⊕ is the concatenation operator.
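The concatenation step under this coupling can be sketched as follows; the vector h_final merely stands in for the final hidden state of the sequence network, and the sizes and weights are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
u, d_S, d_out = 3, 6, 4

x_k = rng.random(u)                    # fixed-length attribute data
h_final = rng.random(d_S)              # stands in for the LSTM hidden state
                                       # after the last time step

# Concatenated attribute-network input (u + d_S units in the input layer).
attr_input = np.concatenate([x_k, h_final])

W_A = rng.normal(scale=0.1, size=(d_out, u + d_S))
b_A = np.zeros(d_out)
embedding = np.maximum(W_A @ attr_input + b_A, 0.0)   # attribute network output
```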
FIG. 9 illustrates a balanced coupling 900 in which an attribute network 902 and a sequence network 904 are coupled to a fusion network 906, the fusion network 906 comprising a concatenation layer 908 and a fully connected layer 910 that applies a nonlinear function to the concatenated inputs in order to capture dependencies between the attributes and the sequence. In the case of an unsupervised system, i.e. as described by equation (2), the output V_k^(M') of the M'-th layer of the attribute network 902 is coupled to the concatenation layer 908 via connection 912, and the hidden state h_k^(l_k) of the sequence network 904 after processing the last time step is coupled to the concatenation layer 908 via connection 914.
Coupled to the cascaded layer 908 via connection 914. By ykAs an output of the cascaded layer 908, and zkAs having a weight matrix WzAnd an offset vector bzThe output of the fully connected layer 910, which can be expressed as:
FIG. 10 is a flow diagram 1000 illustrating an exemplary algorithm for unsupervised learning of embeddings of attributed sequences using the attribute-network-first coupling 700. From the following description, those skilled in the art will readily recognize the modifications required to apply the algorithm in the case of the sequence-network-first coupling 800 and the balanced coupling 900. The algorithm operates on the following network parameters:
the attribute network parameters {W_A^(m), b_A^(m)}, m = 1, ..., 2M', and the sequence network parameters {W_i, U_i, b_i, W_f, U_f, b_f, W_o, U_o, b_o, W_c, U_c, b_c, W_y, b_y}.
the attribute network 702 is intended to minimize the difference between the input and reconstructed attribute values. The learning objective function of the attribute network 702 is defined as:
L_A = || x_k − x̂_k ||_2^2    (10)
the sequence network 704 is intended to minimize the log-likelihood of incorrect predictions of the next item at each time step. Thus, the class cross entropy can be used to express the sequence network 704 learning objective function as:
L_S = − Σ_{t=1}^{l_k−1} s_k^(t+1) · log y_k^(t)    (11)

where s_k^(t+1) denotes the (one-hot encoded) item observed at the following time step.
the learning process consists of multiple iterations, and the parameters are updated during each iteration based on the calculated gradient. L isτ AAnd Lτ SRepresenting the τ th iteration of the attribute network and the sequence network, respectively. L isτ AAnd Lτ SIs defined as epsilonAAnd εS. The maximum iteration number of the attribute network and the sequence network is TAAnd TS。TAAnd TSAre not necessarily equal becauseThe number of iterations required for the attribute network and the sequence network may be different. After the attributed sequence learning process, each attributed sequence can be embedded using the learned parameters of the resulting attributed network 702 and sequence network 704.
Returning to the flowchart 1000, at step 1002 the parameter vectors of the attribute network and the sequence network are initialized, for example using random values drawn from a uniform distribution. Learning begins at step 1003, where an initial attributed sequence J_1 is selected. Using the attribute data portion of the attributed sequence as input, a loop 1004 iterates over each of the 2M' attribute network layers to compute a forward propagation 1006 through the attribute network 702. The loop 1008 then iterates back over each of the 2M' attribute network layers, calculating the gradients via back propagation. The loop 1012 iterates back over the attribute network to update 1014 the attribute network parameters.
At step 1016, the learning objective function is calculated according to equation (10). In the second and subsequent passes through the learning process, it is compared to the value from the previous iteration to determine whether convergence has been reached (i.e. whether the difference is less than ε_A). If so, or if the maximum number of iterations T_A has been reached, the algorithm continues with sequence network training. Otherwise, control returns to the loop 1004 for a further iteration.
The loop 1020 iterates over all items in the current sequence, using as inputs the sequence data portion of the attributed sequence and the output of layer M' of the attribute network 702. The loop computes the forward propagation 1022 to obtain the output y_k^(t) (see equation (4)), calculates the gradients 1024 of the sequence network, and updates 1026 the sequence network parameters.
At step 1028, the learning objective function is calculated according to equation (11). In the second and subsequent passes through the learning process, it is compared to the value from the previous iteration to determine whether convergence has been reached (i.e. whether the difference is less than ε_S). If so, or if the maximum number of iterations T_S has been reached, the sequence training loop terminates. Otherwise, control returns to the loop 1020 for a further iteration.
At step 1032, the algorithm checks whether there is a further attributed sequence J_k. If so, control returns to step 1003 and a further attributed sequence is selected. Otherwise, the algorithm terminates.
Fig. 11 is a schematic diagram illustrating a supervised distance metric learning system 1100 for generating embeddings of attributed sequences. The system 1100 may be employed when feedback (i.e. labeled training data) is available, for example based on manual labelling by a person with appropriate expertise. In particular, the feedback items may be defined as triplets (p_i, p_j, l_ij), where p_i and p_j are distinct attributed sequences drawn from the set {J_1, ..., J_n}, and l_ij is an indication of whether p_i and p_j are similar (l_ij = 1) or dissimilar (l_ij = 0). A similar feedback set S = {(p_i, p_j, l_ij) | l_ij = 1} and a dissimilar feedback set D = {(p_i, p_j, l_ij) | l_ij = 0} can therefore be defined. The goal of the system 1100 is then to learn an embedding of the attributed sequences which, under a predetermined distance metric, results in the attributed sequences in the similar feedback set being "more closely spaced" and the attributed sequences in the dissimilar feedback set being "less closely spaced" (under an appropriate definition of these terms).
In particular, given a sequence p of generated attributesiAnd pjAnd a distance metric D, andΘ(pi,pj) Learning objective of System 1100Can be defined as:
[Equation (12) (learning objective of the system 1100) — rendered as an image in the original document]
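Equation (12) is likewise only reproduced as an image. Purely as an assumption about its general shape, a group-margin objective over the similar set S and dissimilar set D, with margin parameter g (discussed below), might be sketched as:

min_Θ  Σ_{(p_i, p_j) ∈ S} D_Θ(p_i, p_j)²  +  Σ_{(p_i, p_j) ∈ D} max(0, g − D_Θ(p_i, p_j))²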
In equation (12), g is a group-based margin parameter specifying that the distance between two attributed sequences from the dissimilar feedback set should be greater than g. This prevents the dataset from being reduced to a single point. As will be appreciated by those skilled in the art of distance metric learning, a common approach is to employ a Mahalanobis distance function:
D_Θ(p_i, p_j) = ( (Θ(p_i) − Θ(p_j))ᵀ Λ (Θ(p_i) − Θ(p_j)) )^(1/2)    (13)
In equation (13), Λ is a symmetric, positive semi-definite matrix. When Λ = I, equation (13) reduces to the Euclidean distance, as follows:
D_Θ(p_i, p_j) = ||Θ(p_i) − Θ(p_j)||    (14)
As will be appreciated, the nonlinear transformation function Θ that generates the embeddings of attributed sequences p_i and p_j can be defined by any of the coupled network structures 700, 800, 900 described above. As a specific example, the system 1100 employs the balanced network architecture 900 and includes two such balanced networks 1102, 1104. Each of these balanced networks comprises an attribute network 1106, 1112, a sequence network 1108, 1114, and a fusion network 1110, 1116, and the nonlinear transformation function Θ can be defined as:
[Definition of Θ — rendered as an image in the original document]
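The definition itself is only available as an image. As an illustrative assumption consistent with the balanced architecture just described (the attribute network output concatenated with the sequence network output and passed through the fully connected layer of the fusion network), one plausible form is:

Θ(p) = σ( W_F [ A(x_p) ; S(s_p) ] + b_F )

where A(·) and S(·) denote the attribute network and sequence network outputs for the attribute data portion x_p and sequence data portion s_p of p, [ · ; · ] denotes concatenation, σ is a nonlinear activation, and W_F, b_F are fusion network parameters; all of these symbols are introduced here for illustration only.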
The two balanced networks 1102, 1104 are identical, and are used to generate the embeddings Θ(p_i) and Θ(p_j) respectively. As will be appreciated, since the two networks 1102, 1104 are identical, in an alternative embodiment a single network can be employed to generate the embeddings Θ(p_i) and Θ(p_j) sequentially; however, a parallel implementation in which Θ(p_i) and Θ(p_j) are computed simultaneously is more efficient in the common case where sufficient multiprocessing resources are available. A further metric network 1118 is coupled to the balanced networks 1102, 1104 to receive the encoded attributed sequences via connections 1120, 1124, and to propagate the learning information (i.e. gradients) back to the networks via connections 1122, 1126.
The metric network 1118 is designed using a contrastive loss function, so that after learning the distance metric, the attributed sequences in each similar pair in S have a smaller distance than the attributed sequences in D. In a particular embodiment, the metric network 1118 uses the labels to compute the Euclidean distance between each pair, and propagates the gradients back through all the components of the networks 1102, 1104. The learning objective of the metric network can be written as:
[Equation (15) (learning objective of the metric network) — rendered as an image in the original document]
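Equation (15) is only reproduced as an image. A standard contrastive loss of the kind described, given here purely as an assumed sketch of its form, is:

L_M = Σ_{(p_i, p_j, l_ij)} [ l_ij · D_Θ(p_i, p_j)² + (1 − l_ij) · max(0, g − D_Θ(p_i, p_j))² ]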
For a learning rate γ, the parameters W_A, W_S, U_S, b_A and b_S may be updated using the following equation until convergence:
[Equation (16) (gradient descent parameter update) — rendered as an image in the original document]
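Equation (16) is only reproduced as an image; the generic gradient descent update that it describes can be sketched, with θ standing for any one of the parameters listed above and L for the learning objective of equation (15), as:

θ ← θ − γ · ∂L/∂θ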
To enable these updates to be performed, the gradients computed and backpropagated by the metric network 1118 may be determined using the following equations:
[Equations — gradients computed and backpropagated by the metric network 1118; rendered as images in the original document]
For the m-th layer of the attribute network, the update equation is then given by:
[Equation — update equation for the m-th layer of the attribute network; rendered as an image in the original document]
When deriving the update equation for the sequence network, it may be convenient to write Δ_t = (Δ_it, Δ_ft, Δ_ot, Δ_ct), whose components can be written as:
[Equation (22) (components of Δ_t) — rendered as an image in the original document]
By substituting the appropriate parameters for z_k^(t) in equation (22), the update equation of the sequence network at time step t is given by:
[Equation — update equation of the sequence network at time step t; rendered as an image in the original document]
where I is an identity matrix of appropriate dimensions.
The initialization of the parameters may be important when using a gradient descent method during training of the network. In an embodiment of the present invention, a uniform distribution is used to initialize the weight matrix W_A in Θ_A and the weight matrix W_S in Θ_S, and the biases b_A and b_S are initialized with the zero vector 0. The recurrent matrix U_S is initialized using an orthogonal matrix. Using d_m as the output dimension of the m-th layer and d_S as the output dimension of Θ_S, the weights of the m-th layer in Θ_A and the weights W_S in Θ_S are initialized to:
[Equation — uniform initialization range for the weight matrices; rendered as an image in the original document]
In an embodiment of the present invention, l2 regularization has been used, combined with an early stopping strategy, to prevent overfitting.
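A short Python (NumPy) sketch of this initialization scheme follows. It is illustrative only: the function name init_parameters is hypothetical, and the uniform range shown is an assumed Glorot-style bound, since the equation giving the exact range is only available as an image in the original document.

import numpy as np

def init_parameters(layer_dims, d_s, rng=np.random.default_rng(0)):
    """Initialise the attribute network weights/biases and the LSTM input,
    recurrent and bias parameters, following the scheme described above."""
    params = {}
    # Attribute network: uniform weights, zero biases.
    for m in range(len(layer_dims) - 1):
        d_in, d_out = layer_dims[m], layer_dims[m + 1]
        bound = np.sqrt(6.0 / (d_in + d_out))      # assumed Glorot-style range
        params[f"W_A{m}"] = rng.uniform(-bound, bound, size=(d_out, d_in))
        params[f"b_A{m}"] = np.zeros(d_out)
    # Sequence network: uniform input weights W_S, zero bias b_S,
    # and an orthogonal block per LSTM gate for the recurrent matrix U_S.
    d_in = layer_dims[-1]
    bound = np.sqrt(6.0 / (d_in + d_s))
    params["W_S"] = rng.uniform(-bound, bound, size=(4 * d_s, d_in))
    params["b_S"] = np.zeros(4 * d_s)
    blocks = []
    for _ in range(4):                             # input, forget, output, cell gates
        q, _ = np.linalg.qr(rng.standard_normal((d_s, d_s)))
        blocks.append(q)
    params["U_S"] = np.vstack(blocks)
    return params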
Fig. 12 is a flow diagram 1200 illustrating an exemplary algorithm for supervised distance metric learning of attributed sequences using the balanced network structure 1100. At step 1202, the network parameters are initialized, for example using the methods described above. At step 1204, the algorithm resets to the start of the feedback set drawn from {J_1, ..., J_n}, and at step 1206 the next (initially the first) feedback triplet is drawn from the set. At step 1208, the embeddings Θ(p_i) and Θ(p_j) are calculated. At step 1214, D_Θ is calculated using equation (14), and the loss is then calculated according to equation (15) at step 1216. A convergence check is performed at step 1218 (e.g. by comparing the calculated loss with the loss of the previous iteration and determining whether the difference is within a defined convergence error ε). If convergence has occurred, the algorithm terminates; otherwise the gradients are calculated using equations (17) through (23) at step 1220, and the network is updated using equation (16) at step 1222. A check is then performed at step 1224 to determine whether there are more feedback items available, and if so, control returns to step 1206. Otherwise, control passes to step 1226, where a check is performed to determine whether the maximum number of iterations has been reached. If not, control returns to step 1204 and a further pass is performed over the feedback set. Otherwise, the algorithm terminates.
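The procedure of flowchart 1200 may likewise be summarized in code. In the following Python (PyTorch) sketch, embed stands for the balanced network Θ (however it is realized), the loss follows the contrastive form sketched after equation (15) above, and all names and hyperparameter values are hypothetical rather than a reproduction of the patented implementation.

import torch

def train_metric(embed, feedback, margin=1.0, lr=0.01, max_epochs=100, eps=1e-4):
    """feedback: list of (p_i, p_j, l_ij) triplets; embed: a torch.nn.Module
    mapping an attributed sequence to its fixed-length feature representation."""
    opt = torch.optim.SGD(embed.parameters(), lr=lr)
    prev_total = None
    for _ in range(max_epochs):                    # steps 1204 / 1226
        total = 0.0
        for p_i, p_j, l_ij in feedback:            # steps 1206-1224
            z_i, z_j = embed(p_i), embed(p_j)      # step 1208
            d = torch.norm(z_i - z_j)              # Euclidean distance, equation (14)
            loss = (l_ij * d.pow(2)                # contrastive loss (cf. equation (15))
                    + (1 - l_ij) * torch.clamp(margin - d, min=0).pow(2))
            opt.zero_grad()
            loss.backward()
            opt.step()                             # parameter update (cf. equation (16))
            total += loss.item()
        if prev_total is not None and abs(prev_total - total) < eps:
            break                                  # convergence check, step 1218
        prev_total = total
    return embed

Note that, for brevity, the convergence check of step 1218 is applied in this sketch once per pass over the feedback set rather than per triplet.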
Mining tasks on ordered data (such as click streams and gene sequences) require careful design of the feature representations that are used by learning algorithms. Many real-world applications involve attributed sequences, where each instance consists of a sequence of categorical items and a set of attributes. Advantageously, embodiments of the invention disclosed herein are capable of learning representations of attributed sequences in an unsupervised or supervised manner. Obtaining such representations is central to many important data mining tasks, from user behavior analysis to gene sequence clustering. The embeddings generated by embodiments of the present invention are task independent and can be used for a variety of mining tasks on attributed sequences.
An exemplary system for fraud detection employing embodiments of the present invention is also disclosed. Such a system is able to learn embeddings of user action sequences in conjunction with associated attributes, such that "normal" or usual behavior is represented by clusters of points in the feature space, while unusual, abnormal or outlier behavior can be identified as more distant or isolated points.
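Purely by way of illustration of how such embeddings might be scored once learned (the scoring rule below is an assumption and is not prescribed by the present disclosure), a simple nearest-neighbour distance over the fixed-length feature representations can serve as an outlier score:

import numpy as np

def anomaly_scores(embeddings):
    """embeddings: (n, d) array of fixed-length feature representations.
    Each point is scored by its distance to its nearest neighbour, so that
    isolated points (potential outlier behaviour) receive high scores."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)    # ignore each point's distance to itself
    return dists.min(axis=1)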
Embodiments of the present invention have been disclosed that include a supervised learning capability that employs a deep learning framework to learn distance metrics that effectively measure similarity and dissimilarity between attributed sequences.
It is to be understood that although specific embodiments and variations of the invention have been described herein, further modifications and substitutions will be apparent to those skilled in the relevant art. In particular, these examples are provided by way of illustration of the principles of the present invention and provide many specific methods and arrangements for implementing these principles. In general, embodiments of the invention rely on providing a technical arrangement whereby embeddings or feature representations of attributed sequences can be autonomously learned using a coupled combination of at least two machine learning modules. In some such technical arrangements, an attribute network module is coupled to the sequence network module to provide a system configured to learn feature representations of attributed sequences in an unsupervised manner (i.e., without any labeled data identifying similar and/or dissimilar attributed sequences). In other such technical arrangements, a third module is additionally coupled to the attribute network module and the sequence network module to provide a system configured to learn the feature representations of the attributed sequences in a supervised or semi-supervised manner (i.e., by learning based at least in part on data that has been tagged by a human expert, for example).
Accordingly, the described embodiments should be understood as being provided by way of example for the purpose of teaching general features and principles of the invention, but should not be construed as limiting the scope of the invention.

Claims (19)

1. A machine learning system (102) for embedding attributed sequence data (202, 204) including an attribute data portion (206, 208) having a fixed number of attribute data elements and a sequence data portion (210, 212) having a variable number of sequence data elements into a fixed-length feature representation, the machine learning system comprising:
an attribute network module (500) comprising a feed forward neural network configured to convert an attribute data portion into an encoded attribute vector having a first predetermined number of attribute features; and
a sequence network module (600) comprising a recurrent neural network configured to convert sequence data portions into an encoded sequence vector having a second predetermined number of sequence features,
wherein the attribute network module and the sequence network module are operatively coupled (700, 800, 900) such that, in use, the machine learning system is configured to learn and output a fixed-length feature representation of the input attributed sequence data, the fixed-length feature representation encoding dependencies between different attribute data elements in the attribute data portion, dependencies between different sequence data elements in the sequence data portion, and dependencies between attribute data elements and sequence data elements within the attributed sequence data.
2. The machine learning system of claim 1, wherein the attribute network module (500) comprises a multi-layer feedforward neural network having an attribute vector output layer comprising a first predetermined number of units, and the recurrent neural network of the sequence network module (600) comprises a long-short term memory (LSTM) network having a second predetermined number of hidden units.
3. The machine learning system of claim 2, wherein the attribute network module (702) is operatively coupled (706) to the sequence network module (704) by passing an output of the attribute vector output layer to an attribute vector input of the sequence network module.
4. The machine learning system of claim 3, wherein:
said attribute vector input of the sequence network module comprises the hidden state of the LSTM network at the first evaluation step,
the first predetermined number of attribute vector output layer units equals the second predetermined number of sequence network module hidden units, and
the fixed-length feature representation of the input attributed sequence data comprises the hidden state of the LSTM network at the final evaluation step.
5. The machine learning system of claim 2, wherein the attribute network module (804) is operatively coupled (806) to the sequence network module (802) by passing an output of the sequence network module to an input layer of the attribute network module.
6. The machine learning system of claim 5, wherein:
the number of units in the input layer of the attribute network module is equal to the sum of the fixed number of attribute data elements and the second predetermined number of hidden units of the sequence network module,
the output of the sequence network module comprises the hidden state of the LSTM network at the final evaluation step, which is concatenated with the fixed number of attribute data elements to produce a concatenated attribute network input vector that is passed to the input layer of the attribute network module, and
the fixed-length feature representation of the input attributed sequence data comprises the output of the attribute vector output layer.
7. The machine learning system of claim 2, wherein the attribute network module (902) is operatively coupled to the sequence network module (904) via a converged network module (906), the converged network module (906) comprising an input cascade layer (908) and a nonlinear function module (910), the input cascade layer (908) being configured to generate a cascade comprising an output (912) of the attribute vector output layer cascaded with an output (914) of the sequence network module, and the nonlinear function module (910) being configured to learn a nonlinear function of the cascade, the nonlinear function encoding dependencies between attribute data elements and sequence data elements within the attributed sequence data.
8. The machine learning system of claim 7, wherein:
the number of cells in the input cascade layer is equal to the sum of the first predetermined number of attribute features and the second predetermined number of sequence features,
the output of the sequence network module includes the hidden state of the LSTM network at the final evaluation step,
the nonlinear function module includes a fully connected feedforward neural network layer, and
the fixed-length feature representation of the input attributed sequence data includes the output vector of the fully connected feedforward neural network layer.
9. The machine learning system of any of the preceding claims, further comprising:
a metric network module (1118) bi-directionally coupled to the attribute network module and the sequence network module, the metric network module configured to:
receiving fixed-length feature representation pairs (1120, 1124) of corresponding samples of attributed sequence data, wherein each pair is marked to indicate whether it comprises similar or dissimilar attributed sequence data;
calculating gradient information based on a loss function defined according to a predetermined distance metric, wherein the goal is to learn an embedding whereby fixed-length feature representations of corresponding samples of the attributed sequence data have a smaller distance under the predetermined distance metric when marked as similar than when marked as dissimilar; and
propagating (1122, 1126) the gradient information back through the attribute network module and the sequence network module, thereby updating the parameters of the attribute network module and the sequence network module toward achieving the goal.
10. A training method of a machine learning system (102) for embedding attributed sequence data (202, 204) comprising attribute data portions (206, 208) having a fixed number of attribute data elements and sequence data portions (210, 212) having a variable number of sequence data elements into a fixed-length feature representation, wherein the machine learning system comprises a multi-layer feed-forward neural network (500) having an attribute data input layer and an attribute vector output layer, the attribute vector output layer comprising a first predetermined number of units, the multi-layer feed-forward neural network (500) operatively coupled to a Long Short Term Memory (LSTM) network (600) comprising a second predetermined number of hidden units, the training method comprising:
providing a data set comprising a plurality of attributed sequences;
for each of the attributed sequences in the dataset,
training the multi-layer feedforward neural network via back propagation with respect to a first objective function, using the attribute data portion of the attributed sequence, and
training the LSTM network via back propagation with respect to a second objective function, using the sequence data portion of the attributed sequence,
wherein training of the multi-layered feedforward neural network is coupled with training the LSTM network such that, when trained, the machine learning system is configured to output a fixed-length feature representation of the input attributed sequence data, the fixed-length feature representation encoding dependencies between different attribute data elements in the attribute data portion, dependencies between different sequence data elements in the sequence data portion, and dependencies between attribute data elements and sequence data elements within the attributed sequence data.
11. The training method of claim 10, wherein a first predetermined number of attribute vector output layer units equals a second predetermined number of LSTM network hidden units, and the multi-layer feedforward neural network comprises:
an encoder having an encoder input layer including an attribute data input layer and an encoder output layer including an attribute vector output layer; and
a decoder having a decoder input layer coupled to the encoder output layer and a decoder output layer comprising a reconstructed estimate of an input to the encoder input layer,
and wherein:
the first objective function comprises a distance measure between the input of the encoder input layer and the reconstructed estimate, and
training the multi-layer feedforward neural network comprises iteratively performing forward propagation and back propagation steps, using the attribute data portion of the attributed sequence as input to the encoder input layer, until the distance measure satisfies a first convergence goal.
12. The training method of claim 11, wherein the second objective function comprises a likelihood measure of incorrect predictions of the next sequence item at each of a plurality of training time steps of the LSTM network, and training the LSTM network comprises:
iteratively repeating the plurality of training time steps until the likelihood measure satisfies a second convergence goal, each iteration comprising:
at a first training time step, copying the output of the attribute vector output layer to a hidden state of the LSTM network; and
at the final training time step, calculating the likelihood measure.
13. The training method of claim 12, wherein the distance measure comprises a mean square error loss function and the likelihood measure comprises a categorical cross entropy loss function.
14. The training method of claim 10, wherein:
the number of units in the attribute data input layer is equal to the sum of the fixed number of attribute data elements and the second predetermined number of LSTM network hidden units,
the second objective function includes a likelihood measure of incorrect predictions of the next sequence item at each of a plurality of training time steps of the LSTM network, and
Training the LSTM network includes iteratively repeating the plurality of training time steps until the likelihood measures satisfy a first convergence goal, each iteration including:
at a first training time step, copying the output of the attribute vector output layer to a hidden state of the LSTM network; and
at the final training time step, calculating the likelihood measure.
15. The training method of claim 14, wherein the multi-layer feedforward neural network comprises:
an encoder having an encoder input layer comprising an attribute data input layer and an encoder output layer comprising an attribute vector output layer; and
a decoder having a decoder input layer coupled to the encoder output layer and a decoder output layer comprising a reconstructed estimate of an input to the encoder input layer,
and wherein:
the first objective function comprises a distance measure between the input of the encoder input layer and the reconstructed estimate, and
training the multi-layer feedforward neural network comprises applying the hidden state of the LSTM network at the final training time step, concatenated with the fixed number of attribute data elements, to the encoder input layer, and iteratively performing the steps of forward propagation and back propagation until the distance measure satisfies a second convergence goal.
16. A method of training a machine learning system (102) for embedding attributed sequence data (202, 204) comprising an attribute data portion (206, 208) having a fixed number of attribute data elements and a sequence data portion (210, 212) having a variable number of sequence data elements into a fixed-length feature representation, wherein the machine learning system comprises:
an attribute network module comprising a multi-layer feed-forward neural network (500) having an attribute data input layer and an attribute vector output layer, the attribute vector output layer comprising a first predetermined number of cells;
a sequence network module comprising a Long Short Term Memory (LSTM) network (600) comprising a second predetermined number of hidden units; and
a converged network (906), comprising: an input cascade layer (908) having a number of cells equal to a sum of the first predetermined number and the second predetermined number; and a non-linear function layer (910) comprising a fully connected feedforward neural network layer,
wherein the training method comprises the following steps:
providing a data set comprising a plurality of pairs of attributed sequence data, wherein each pair is marked to indicate whether it comprises similar or dissimilar attributed sequence data; and
for each pair of attributed sequences in the dataset,
computing, using the multi-layer feed-forward neural network, pairs of attribute vectors corresponding to the attribute data portions of the attributed sequences, each attribute vector having a number of elements equal to the first predetermined number;
computing, using the LSTM network, pairs of sequence vectors corresponding to the sequence data portions of the attributed sequences, each sequence vector having a number of elements equal to the second predetermined number;
concatenating each computed attribute vector with the corresponding computed sequence vector to generate a fixed-length feature representation pair for the pair of attributed sequences (1120, 1124);
computing a non-linear transformation function of the fixed length feature representation to generate transformed feature representation pairs;
computing gradient information based on a loss function defined according to a predetermined distance metric over the transformed feature representation pairs, wherein the objective is to learn an embedding whereby pairs of attributed sequence data marked as similar have fixed-length feature representations with smaller distances under the predetermined distance metric than pairs marked as dissimilar; and
propagating (1122, 1126) the gradient information back through the multi-layer feed-forward neural network and the LSTM network, thereby updating the parameters of the attribute network module and the sequence network module toward achieving the goal.
17. A computer-readable storage medium having stored thereon a computer program comprising instructions which, when executed by a processor, cause the processor to carry out the steps of the method according to any one of claims 10-16.
18. An apparatus comprising means for performing the method of any of claims 10-16.
19. An apparatus, comprising:
a processor; and
a memory coupled to the processor and comprising instructions stored thereon that, when executed by the processor, cause the processor to perform the method of any of claims 10-16.
CN201910719231.0A 2018-08-07 2019-08-06 Machine learning system and method with attribute sequences Active CN110826686B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US16/057,025 US12086718B2 (en) 2018-08-07 2018-08-07 Machine learning systems and methods for attributed sequences
FR1857340 2018-08-07
FR1857340A FR3084946B1 (en) 2018-08-07 2018-08-07 MACHINE LEARNING METHODS AND SYSTEMS FOR ASSIGNED SEQUENCES
US16/057,025 2018-08-07

Publications (2)

Publication Number Publication Date
CN110826686A true CN110826686A (en) 2020-02-21
CN110826686B CN110826686B (en) 2024-09-06

Family

ID=69547746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910719231.0A Active CN110826686B (en) 2018-08-07 2019-08-06 Machine learning system and method with attribute sequences

Country Status (1)

Country Link
CN (1) CN110826686B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101480A (en) * 2020-09-27 2020-12-18 西安交通大学 Multivariate clustering and fused time sequence combined prediction method
CN113239024A (en) * 2021-04-22 2021-08-10 辽宁工程技术大学 Bank abnormal data detection method based on outlier detection
CN114897127A (en) * 2021-01-26 2022-08-12 安信资讯安全私人有限公司 System and method for detecting domain generation algorithm
WO2024140829A1 (en) * 2022-12-27 2024-07-04 华为技术有限公司 Data processing method and apparatus, computing device, system and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
US20160350653A1 (en) * 2015-06-01 2016-12-01 Salesforce.Com, Inc. Dynamic Memory Network
CN106919977A (en) * 2015-12-25 2017-07-04 科大讯飞股份有限公司 A kind of feedforward sequence Memory Neural Networks and its construction method and system
CN108021983A (en) * 2016-10-28 2018-05-11 谷歌有限责任公司 Neural framework search
CN107622485A (en) * 2017-08-15 2018-01-23 中国科学院深圳先进技术研究院 A kind of medical image data analysis method and system for merging depth tensor neutral net

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JONATHAN KRAUSE et al.: "A hierarchical approach for generating descriptive image paragraphs", HTTPS://ARXIV.ORG/PDF/1611.06607.PDF, pages 1-9 *
YUNCHEN PU et al.: "Variational Autoencoder for Deep Learning of Images, Labels and Captions", HTTPS://ARXIV.ORG/PDF/1609.08976.PDF, pages 1-13 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101480A (en) * 2020-09-27 2020-12-18 西安交通大学 Multivariate clustering and fused time sequence combined prediction method
CN112101480B (en) * 2020-09-27 2022-08-05 西安交通大学 Multivariate clustering and fused time sequence combined prediction method
CN114897127A (en) * 2021-01-26 2022-08-12 安信资讯安全私人有限公司 System and method for detecting domain generation algorithm
CN114897127B (en) * 2021-01-26 2023-12-12 安信资讯安全私人有限公司 System and method for detecting domain generation algorithm
CN113239024A (en) * 2021-04-22 2021-08-10 辽宁工程技术大学 Bank abnormal data detection method based on outlier detection
CN113239024B (en) * 2021-04-22 2023-11-07 辽宁工程技术大学 Bank abnormal data detection method based on outlier detection
WO2024140829A1 (en) * 2022-12-27 2024-07-04 华为技术有限公司 Data processing method and apparatus, computing device, system and readable storage medium

Also Published As

Publication number Publication date
CN110826686B (en) 2024-09-06

Similar Documents

Publication Publication Date Title
US12086718B2 (en) Machine learning systems and methods for attributed sequences
EP3857377B1 (en) Disk drive failure prediction with neural networks
US11620568B2 (en) Using hyperparameter predictors to improve accuracy of automatic machine learning model selection
CN110956260B (en) System and method for neural architecture search
CN110826686B (en) Machine learning system and method with attribute sequences
US11581067B2 (en) Method and apparatus for generating a chemical structure using a neural network
US20200097810A1 (en) Automated window based feature generation for time-series forecasting and anomaly detection
CN108563782B (en) Commodity information format processing method and device, computer equipment and storage medium
WO2022036452A1 (en) Two-headed attention fused autoencoder for context-aware recommendation
CN110889747B (en) Commodity recommendation method, device, system, computer equipment and storage medium
Liu et al. Multi-task recommendations with reinforcement learning
US20190138929A1 (en) System and method for automatic building of learning machines using learning machines
JP6334431B2 (en) Data analysis apparatus, data analysis method, and data analysis program
CN111178986B (en) User-commodity preference prediction method and system
Kanjamapornkul et al. Kolmogorov space in time series data
CN114511387A (en) Product recommendation method and device, electronic equipment and storage medium
CN110689110A (en) Method and device for processing interaction event
CN106600347B (en) Method for constructing sequence prediction model based on multi-view data and cyclic network
Trask et al. Unsupervised physics-informed disentanglement of multimodal data for high-throughput scientific discovery
Yuan et al. Deep learning from a statistical perspective
Xu et al. Optimally connected deep belief net for click through rate prediction in online advertising
CN110288444A (en) Realize the method and system of user's associated recommendation
CN111402003B (en) System and method for realizing user-related recommendation
Hu et al. Neural-PDE: a RNN based neural network for solving time dependent PDEs
CN117252665B (en) Service recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant