AUDIO MIX INSTRUCTION FILE WITH TIMING INFORMATION REFERRING TO UNIQUE PATTERNS WITHIN AUDIO TRACKS
Technical Field
The present invention relates to methods for creating and using mix instructions files for audio files, to computer program products, and to an apparatus for playing audio files.
Background and Related Art
Since it has become possible to store music and other audio information in data files on hard disks or other memory, DJ systems have been developed in which the songs that are to be played are stored in a memory, for example, a data base from which they can be retrieved and played when desired. Typically these systems comprise similar functions to the traditional DJ systems in that they enable the mixing of two tracks and manipulation of each of the tracks to achieve a good mix, for example, a smooth transition between two songs.
In applicant's co-pending application PCT/SE2006/050030 a hand-held DJ system comprising a data base for music tracks is proposed in which one set of controls is used to control both tracks.
Recently systems have become available in which the parameters used to achieve a certain mix can be stored in a mix instructions file and applied to the music files at a later time. The mix instructions file is sometimes referred to as a mix recipe or a mix recipe file. Such a system is the TRAKTOR DJ Studio 3 available from Native Instruments Software Synthesis GmbH. The content of the mix instructions file can be created, for example, by registering the actions performed during a DJ session, that is, which music tracks are used and how the DJ manipulates them. This may be assisted by software that, for example, adjusts the playback speed of two tracks. For example, the mix instructions file can comprise data such as the music files' identification, the point in each music file where the playback should start from, and if the
playback speed should be modified, how much and at which point and/or in which interval. The system may apply the data in the mix instructions file to recreate the same mix again based on the same music files, which are stored in the system. This is a great improvement over the method previously known, which was to record the entire mix. Also, this system makes it easier to edit the mix since the parameters set in the mix instruction file can be altered instead of recording the entire mix again.
As disc jockey systems become more and more computer based, the possibility to make mixes of music tracks and share them with other users arises. Co-pending patent application PCT/SE2007/050491 discloses the possibility to create a so-called "mix instructions file" identifying each music track that is used in the mix, and setting parameters for the tracks and how to mix them, but without including the actual music tracks themselves. The mix instructions file can then be shared with other users without having to include all the music tracks. This means that a smaller file can be shared, and that copyright issues can be avoided in that each person using the mix instructions file must obtain the music files.
In this document the creator of such a recipe file will be referred to as a "Recipe Creator", and users of such a file will be referred to as "Recipe Readers". Note that a recipe file normally has a single Creator, whereas it can be shared by one or more Readers. Of course, any given user may be both a reader and a creator.
When the creator shares a certain mix instructions file with many readers, each reader is expected to (1) identify the audio files that are needed in the mix, and (2) follow the included instructions that use these files in order to reproduce the expected mix result. When executing the included instructions, the timing of the instructions plays a critical role in producing the same results as expected by the creator of the mix.
This is especially critical since the reader does not have access to the exact same files used by the creator. Instead, another audio file that is identified as the same as the creator's is used.
Given that various digital representations of an audio track exist, a problem may arise when a reader uses one audio file representation that differs from that used by the creator, even though they are identified as representing the same audio track.
For example, the starting point of the music within the file may vary. Also, the files may have been created using different encoding and/or different sampling frequencies. Hence, although the audio contents of the files, as experienced by the user, are similar, the files may not be identical when the time-scale is considered, and one file cannot necessarily be substituted for another without some adjustments. Unless the differences in the timing characteristics of the different representations of the same audio track are handled by the reader of the recipe file, the mix experienced by a reader may differ from that expected by the mix creator.
Object of the Invention
It is an object of the invention to facilitate the sharing of mix instructions files.
Summary of the Invention
This object is achieved according to the present invention by a method of processing an audio file representing an audio track, comprising the steps of analyzing the audio file to define at least one reference point or reference segment in the audio file, said reference point or segment indicating an identifiable pattern within the audio file, said reference point or reference segment identifying a unique position within the audio track, independent of the audio file representing the audio track, storing information related to the reference point or reference segment, and using this pattern to align files for use in a mix.
This first method of the invention enables the analysis of an audio track to determine its timing within the file containing it, to enable the use of the audio track with an existing mix instructions file.
The object is also achieved by a method of creating a mix instructions file specifying a playback mix of at least a first and a second audio file, said method comprising the steps of:
- identifying an audio track to be used in the mix instructions file,
- storing in the mix instructions file information related to the playback of the audio track in the mix, including timing information,
- analyzing the audio file to define at least one reference point or reference segment in the audio file, said reference point or segment indicating an identifiable pattern within the audio file independent of the audio file representation,
- storing information related to the reference point or reference segment in the mix instructions file,
- defining the timing information in relation to the at least one reference point or reference segment.
This second method enables the creation of mix instructions files using audio files having a determined timing of the audio information within the file.
The identifiable pattern should define a position within the audio track as uniquely as possible.
The object is also achieved by a method of playing a mix of at least a first and a second audio track using a mix instructions file, said method comprising the following steps:
- identifying an audio file comprising an audio track used by the mix instructions file,
- identifying at least one reference point or reference segment in the mix instructions file, associated with the audio track, said reference point or segment indicating an identifiable pattern within the identified audio file,
- analyzing the audio file to define at least one indication point or indication segment in the audio file matching the at least one reference point or reference segment,
- comparing the indication point or segment to at least one reference point or segment stored in the mix instructions file,
- aligning the audio file on the basis of the comparison.
This method enables the playing of a mix instructions file generated according to the second method above, using audio files generated according to the first method above.
By introducing in the mix instructions file reference time points or segments for each audio file that is used in the mix, any reader can use a mix instructions file created and shared by another creator, even if the audio files to be used by the mix instructions file do not match the exact digital representations used by the creator. Unique reference points or segments are defined in each audio file and are used to determine the timing information related to the respective audio file within the mix instructions file. The reference time points or segments are expected to be purely based on the audio properties of the corresponding track, and are independent of the digital representation of the audio tracks.
To align the two audio files sufficiently precisely to be able to replace one with the other in a mix instructions file, the reference time points or segments require a sufficiently high resolution.
According to the invention, therefore, the reader of a mix instructions file is enabled to use a different set of tracks that are identified as similar to those used by the Mix Creator, yet the tracks used by the Reader may differ in quality and digital representation, causing a difference in the timing properties of the tracks. Unless this difference in timing between the tracks is taken care of, the timing of the mix instructions during playback may no longer match the Creator's timing.
In any of the three methods above, the reference point or segments may be defined, for example, as the point in the audio file having the highest amplitude. This will constitute an easily identifiable reference point in each audio file. Of course, a defined number of amplitude maxima may be used to obtain several reference points.
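By way of illustration, the amplitude-maximum approach may be sketched as follows. This is an illustrative Python sketch only (the function name, the use of NumPy, and the assumption of a decoded, uncompressed sample buffer are choices made for the example, not part of the methods above).

```python
import numpy as np

def amplitude_reference_points(samples, sample_rate, n_peaks=3):
    """Return the time positions (in seconds) of the n_peaks highest
    absolute-amplitude samples of a decoded audio signal.

    Illustrative sketch of the amplitude-maximum reference point
    described above; `samples` is assumed to be a 1-D array of PCM
    sample values.
    """
    magnitudes = np.abs(np.asarray(samples, dtype=float))
    # Indices of the n_peaks largest magnitudes, sorted by time position.
    peak_indices = np.sort(np.argpartition(magnitudes, -n_peaks)[-n_peaks:])
    return peak_indices / float(sample_rate)
```

Because the positions are derived purely from the amplitude of the audio content, the same positions should be found in any representation of the track, up to the representation's own time offset.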
Alternatively a beat analysis may be performed to define the reference point or segment.
Each of the methods may further comprise the following steps:
Dividing the second file into frames,
Performing a Fourier Transform on at least two frames,
Converting the transformed signal to periodograms to obtain corresponding spectra,
Determining the difference between the at least two frames,
Selecting at least one processed frame for use in determining the offset between a first and a second audio track comprising the same audio information.
In the latter case, the method may further comprise the steps of determining the offset between the first and the second audio track using cross-correlation, by identifying the maximum value of the cross-correlation of a first and a second signal, each of said first and second signal being a scalar or a vector signal, the first signal representing a characteristic segment of the first track and the second signal representing at least a part of the second track, then determining the time scale to determine the offset.
Preferably, to increase the resolution, the method further comprises the steps of obtaining a first and a second cross-correlation value and interpolating between the first and second cross-correlation values.
The method may also further comprise the step of converting the spectra corresponding to the segment(s) to a representation that uses a perception-based frequency scale before comparing the characterizing sequences. This will result in a lower dimensionality of the vector signal, while keeping the most significant part of the audio signal, that is, the part that is in the audible range.
The characterizing sequence of a track may be selected as the one having the highest entropy of a number of such sequences. This will optimize the reliability of the matching.
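The entropy-based selection may be sketched as follows (an illustrative Python sketch; the function names are assumptions for the example, and the empirical symbol entropy is used here as one reasonable reading of "entropy of a sequence"):

```python
import numpy as np

def empirical_entropy(seq):
    """Shannon entropy (bits per symbol) of the empirical symbol
    distribution of a sequence."""
    _, counts = np.unique(np.asarray(seq).ravel(), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def select_characterizing_sequence(candidates):
    """Of a number of candidate sequences, return the one with the
    highest empirical entropy, as described above."""
    return max(candidates, key=empirical_entropy)
```

A constant sequence has entropy 0 and would match almost anywhere; a high-entropy sequence is far more likely to match at a unique position.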
The invention also relates to a computer program product characterized in that it comprises computer readable code means which, when run in an apparatus for playing audio tracks, will cause the apparatus to perform any of the above methods, and to an apparatus for playing audio tracks, comprising such a computer program product.
Brief Description of the Drawings
The invention will be described in more detail in the following, by way of example and with reference to the appended drawings in which:
Fig. 1 illustrates a DJ system that may be used according to the invention,
Fig. 2 is a flow chart of a method for creating a mix instructions file according to an embodiment of the invention,
Fig. 3 is a flow chart of a method for using a mix instructions file with new audio files,
Fig. 4 is a graphical representation of the audio content of an example audio file,
Fig. 5 is a flow chart of a method for determining reference points or segments according to a second embodiment of the invention,
Fig. 6 is a flow chart of a method of determining the offset between two files.
Detailed Description of Embodiments
Fig. 1 is a simplified version of Fig. 2 of co-pending application PCT/SE2007/050491. A more detailed description is given in this co-pending application. Fig. 1 illustrates a DJ system according to the invention. A computer 23 comprises a first data base 25 for holding audio files and a second data base 27 for holding mix instructions files similar to the ones known in the art. The computer has user input/output means represented by a keyboard 6 and a screen 8.
Mix instructions files may be created in the computer and/or retrieved from another source, for example, through the Internet, and stored in the second data base 27.
According to the invention the computer comprises a playback software program 33 arranged to retrieve a mix instructions file from the second data base 27 and, as prescribed in the mix instructions file, to retrieve at least one audio file comprising a music track at a time from the first data base 25 and manipulate them according to the mix instructions file to create a mix of the music tracks. The computer preferably also has retrieval software 35 for retrieving, through a data network 36, mix instructions files and/or audio files from external sources. These may be used directly upon retrieval or may be stored in the second data base 27.
According to the invention, even if the mix instructions file is based on music tracks that are found in the first data base 25, it may be that the copies of the music tracks do not have the exact same digital representation of the tracks as the files used when the mix instructions file is created. The computer also preferably has retrieval software 35 for retrieving, through a data network 36, mix instructions files and/or music tracks from sources such as a music track data base 37 or a mix instructions file data base 39 in the network.
The portable DJ system also comprises a mix creation unit 40 to enable the operator to create mix instructions files in the portable DJ system and/or the direct retrieval of mix instructions files and/or music tracks from the network to the portable DJ system. From the creator's point of view, the mix instructions files are created by user input to the computer in a way known per se. According to the invention, the portable DJ system is arranged to analyze each audio file used in a mix instructions file and define one or more reference time points or reference segments in the file to be included in the mix instructions file. The system is also arranged to define all timing properties of the mix instructions file in relation to the reference points or reference segments. In this way, when the mix instructions file is used with other audio files, the same reference time points, or segments, can be identified in these other audio files and used to align the audio file as required for use in the mix. As will be understood, this is achieved by means of software arranged in the portable DJ system. How this can be achieved will be discussed in more detail in the following.
It will be understood that each of the units 33, 35 and 40 in the computer 23 and the playback unit 43 of the portable unit 41 are implemented as software modules stored in the computer, or portable unit, respectively. As the skilled person will understand, the illustration shown in Figure 1 is only a logical diagram, and the actual functions can be implemented in software in a number of different ways.
As will be understood, different audio files that use different digital representations, yet have the same audio content, may differ in several respects. For example they may have different delay times before the audio signal actually starts, or after the audio signal ends. Also, they may have a different file format, have been created using a different encoding scheme and/or sampling frequency. Further, they may be
affected by noise in different ways. Hence, if a reference time point has been set a number of seconds into a music track based on one particular representation of the track, and a different representation of the track, obtained from another source, is used instead, simply placing the reference time point the same number of seconds into the track may put it in the wrong position relative to the actual content of the file. According to the invention, therefore, the mix creation unit 40 of the portable DJ system is arranged, when creating a mix, to identify one or more reference points, or reference segments, which are well defined points or segments in each music track, related to the audio content of the music track. According to the invention, any time properties in the mix instructions file (such as the point in time of a cue point) are defined in relation to the reference points or reference segments of the appropriate track, making the mix instructions file independent of any digital representation of the included audio tracks. According to the invention, the mix playback unit 33 of the portable DJ system is also arranged, when reading a mix, to identify these reference points, or reference segments, in order to properly align the reader's audio tracks. These reference time points or segments may be determined in a number of different ways, as will be discussed below.
According to the invention, therefore, when creating a mix instructions file each of the audio files used in the mix instructions file should have one or more reference time points, or reference segments, defined, and the definitions should be included in the mix instructions file. The reference time points or segments may be specified in the audio file, or stored in a separate data base associated with the audio file. Examples of how to do this will be given in the following. An overall procedure for creating the mix instructions file is shown in Figure 2.
In step S21 a reference to an audio file is included in the mix instructions file. The audio file can be manipulated in any way that is common in the art, to mix it with one or more other audio files, increase its speed etc.
In step S22 the audio file is analyzed to see if it already has reference time points or segments defined for it. The reference time points or segments can be either directly embedded in the audio file, or stored in an independent data base that refers to the audio file. Step S23 is a decision step: if reference points or segments exist for the file, go to step S25; if not, go to step S24.
In step S24 at least one reference point or segment is defined for the file. This can be done in a number of different ways, some of which will be discussed below. In step S25 information about the reference time points, or segments is stored in connection with the audio file itself, and with the mix instructions file.
In step S26, the mix instructions file is redefined so that any time properties in the mix instructions file are defined in relation to the reference points or reference segments for the corresponding audio track.
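The result of step S26 may be illustrated with a hypothetical, minimal data structure for one track's entry in a mix instructions file (all field names below are invented for the example; the document does not prescribe a file format):

```python
# Hypothetical entry for one track after step S26: all timing is
# expressed relative to the track's content-derived reference points,
# not in absolute file time.
track_entry = {
    "track_id": "artist-title",          # identifies the audio track
    "reference_points": [12.35, 98.70],  # seconds, derived from audio content
    "cue_points": [                      # timing relative to a reference point
        {"relative_to": 0, "offset": -2.0},  # 2 s before the first reference
        {"relative_to": 1, "offset": 10.5},
    ],
}

def absolute_time(entry, cue_index, reference_points):
    """Resolve a cue point to absolute time in a given representation,
    using that representation's own reference points."""
    cue = entry["cue_points"][cue_index]
    return reference_points[cue["relative_to"]] + cue["offset"]
```

A reader resolves the same entry against the reference points found in its own representation of the track, so the cue points land on the same audio content regardless of leading silence or encoding differences.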
Figure 3 illustrates a method of using a mix instructions file retrieved from another source, with the reader's own audio files. As discussed above, these may not be identical to the ones used when creating the mix instructions file. Co-pending application PCT/SE2007/050491 discloses a method for retrieving audio tracks to be used with the mix instructions file, if necessary.
In step S31 the mix instructions file is searched to identify the audio files used by it. Then for each audio file, the following steps are performed: In step S32 the audio file is analyzed to see if it has reference time points, or segments, that match the ones defined for the audio file in the mix instructions file. This means that the references should be created in the same way, so that they represent the same characteristic of the audio file.
Step S33 is a decision step: if matching references are found, go to step S35; if not, go to step S34.
In step S34 reference time points, or segments, are defined for the audio file in the same way as was done in S24 of Figure 2. Information about the references is stored in association with the audio file.
In step S35 the references created in step S34, or found in step S33, are compared to the reference time points for the audio file in the mix instructions file, and the result of the comparison is used to align the audio file with the one used when creating the mix instructions file.
The alignment of the reference time points or reference segments can be performed by determining the time difference between the reference points or reference segments of the track to be used and the reference points or reference segments of the same track as defined in the mix instructions file. This difference can then be offset for each time property in the mix instructions file, which is defined in relation to the creator's reference points or reference segments for that track. If the differences are not identical for all reference time points or reference segments, then the difference can advantageously be interpolated between such points (before the first or after the last time point or segment the difference can be held constant). Linear interpolation of the differences can be used.
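The offset-with-interpolation scheme described above may be sketched as follows (an illustrative Python sketch; the function name and the use of NumPy are assumptions for the example):

```python
import numpy as np

def align_time_property(t, creator_refs, reader_refs):
    """Map a time property t, defined relative to the creator's
    reference points, onto the reader's representation of the track.

    creator_refs and reader_refs are matching reference time points
    (seconds) in the two representations. As described above, the
    per-reference differences are linearly interpolated between
    reference points and held constant outside them.
    """
    creator_refs = np.asarray(creator_refs, dtype=float)
    offsets = np.asarray(reader_refs, dtype=float) - creator_refs
    # np.interp holds the end values constant outside the given range.
    return t + np.interp(t, creator_refs, offsets)
```

Non-constant differences between reference points can arise, for example, from slightly different playback speeds of the two representations; the interpolation absorbs such drift.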
When all the audio files to be used with the mix instructions file have been aligned properly, the mix instructions file can be used together with the reader's audio files to play the mix. It would of course be possible to align the audio files as they become needed while playing the mix as well, if this could be handled fast enough. It would of course also be possible to first redefine the mix instructions file so that any time properties in the mix instructions file are defined in relation to the reader's audio files, before using the mix instructions file.
With reference to Figure 4, some methods of determining the reference points, or segments, in the time domain will be discussed. Figure 4 shows a graphical representation in the time domain of the audio content of an audio file. As can be seen,
the audio content has clearly distinguishable features, such as amplitude peaks in a certain pattern. A simple way of determining a reference time point would be to find the maximum amplitude in the track and determine the distance of this maximum peak from the start of the file. To create a more reliable reference, a number of the highest peaks, for example three or five of the highest peaks, could be used.
In another embodiment of the invention, the reference points are determined based on an analysis of certain time characteristics of the audio track such as the time positions of the audio beats in the track. An example of such an analysis is the beat analysis produced by the software algorithm aufTAKT (http://www.zplane.de/).
Such an analysis produces a vector of time positions. With the same analysis applied to different digital representations of the same audio track, different time vectors will be produced. However, given that the analysis focuses on the audio properties of the tracks, and is independent of the digital representations, the different time vectors can be aligned to provide a time alignment of the different track representations.
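One possible way of aligning two such beat-time vectors is sketched below. This approach, pairing the vectors entry by entry and taking the median difference, is an assumption for the example and is not prescribed above; it presumes the two analyses detected largely the same beats, with the median lending some robustness to a few spurious or missed detections.

```python
import numpy as np

def beat_vector_offset(creator_beats, reader_beats):
    """Estimate the time offset between two representations of the
    same track from their beat-time vectors (seconds).

    Illustrative sketch: pair the vectors entry by entry and take
    the median of the pairwise differences.
    """
    n = min(len(creator_beats), len(reader_beats))
    diffs = (np.asarray(reader_beats[:n], dtype=float)
             - np.asarray(creator_beats[:n], dtype=float))
    return float(np.median(diffs))
```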
In another embodiment of the invention, the reference points, or segments, are determined based on a sequence of periodograms, which form estimates of short-time Fourier transforms, as will be discussed in more detail below.
Several methods are known for generating a hash signal. For example, WO 02/065782 discloses a method of generating a hash signal, also referred to as a signature. The hash signal may be seen as a summary of the file, and can be matched with hash signals stored in a database to identify the information signal. This can be used, for example, to verify correct receipt of a large file by sending only the hash value of the file.
Fig. 5 is a flow chart of the steps performed to generate one or more sequences of periodograms, that can be used according to the invention to determine the offset between two files comprising the same music track.
In the first step S51, at least a part of the music track is divided into frames. A frame length of 20-30 ms has been found to be suitable. To increase the resolution, preferably, the frames overlap.
In step S52 a Fast Fourier Transform (FFT) is performed on each frame, to transform the frames to the frequency domain. The resulting frequency domain representation is discrete and each frequency point is referred to as a frequency "bin".
In step S53, for each frame, the signal is then converted to a periodogram by computing the energy for each frequency bin and discarding the phase. Advantageously the square-root of the periodogram spectrum values can be taken to obtain an estimate of the short-term magnitude spectrum. Below we refer to the estimate of the magnitude spectrum or the periodogram, whichever is used, as the "spectrum".
Step S54 is an optional step in which the spectrum is converted to a perceptual scale, that is, the spectrum values within each of a set of pre-defined frequency bands are summed and scaled, where the pre-defined frequency bands and scaling factors are selected in a manner consistent with the audible frequency range. The summations can be weighted for increased accuracy. The summations result in a "perceptual" spectrum with fewer bins. The well-known mel or ERB (equivalent rectangular band) scales can be used to construct the pre-defined frequency bands.
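Step S54 may be sketched as follows, using the well-known mel scale with unweighted rectangular band summation (an illustrative Python sketch; the band count and the choice of unweighted sums are simplifying assumptions for the example, since step S54 also permits weighted summation and the ERB scale):

```python
import numpy as np

def hz_to_mel(f):
    """Common mel-scale formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_band_edges(n_bands, f_min, f_max):
    """Band edges equally spaced on the mel scale, returned in Hz."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 1)
    return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)

def perceptual_spectrum(spectrum, sample_rate, n_bands=24):
    """Sum the spectrum values within each mel-spaced band (step S54).

    `spectrum` holds magnitude values for FFT bins 0..N/2 of one
    frame; the result is a "perceptual" spectrum with n_bands bins.
    """
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, len(spectrum))
    edges = mel_band_edges(n_bands, 0.0, sample_rate / 2.0)
    out = np.zeros(n_bands)
    for b in range(n_bands):
        mask = (bin_freqs >= edges[b]) & (bin_freqs < edges[b + 1])
        out[b] = spectrum[mask].sum()
    return out
```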
Advantageously, a processing step S55 can be included that removes the undesired sensitivity to fixed offsets in the spectrum. Such fixed offsets can be caused by stationary noise. For each time frame, step S55 has as input a spectrum corresponding to that time frame. The signal is now represented by a sequence of spectra. For each frequency bin a scalar time sequence of spectrum values exists. Thus, we can distinguish a set of scalar frequency "channels", each channel corresponding to one frequency bin. Each channel is a time signal that has one time sample per time frame of the original signal. These channels are significantly down-sampled compared to the original audio signal. The sequence of spectra forms a vector signal with each of the channels being a component of the vector. In step S55 each frame is compared to the previous frame to determine the difference between them. That is, for each channel, an output sample is formed by subtracting the previous sample from the current sample. The resulting vector signal of time differences of the spectra describes the changes in the spectra corresponding to successive time segments. Thus, such vector signals are less sensitive to stationary additive noise than the spectra themselves.
The difference signals are sensitive to the power level of the audio file. A simple method to remove information about the power level of a signal is to consider only the sign of the signal. That is, positive signal samples are represented by +1 and negative signal samples are represented by -1. In step S56 the sign of the difference signal is determined. The end result is a binary representation of the time differences of the sequence of spectra.
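Steps S51 to S53, S55 and S56 can be sketched together as follows (an illustrative Python sketch; the exact frame and hop lengths are choices made for the example within the 20-30 ms range mentioned above, and the optional perceptual conversion of step S54 is omitted for brevity):

```python
import numpy as np

def binary_spectral_signature(samples, sample_rate,
                              frame_ms=25, hop_ms=12):
    """Sketch of steps S51-S53, S55 and S56: frame the signal with
    overlap, take the magnitude spectrum of each frame, difference
    successive spectra, and keep only the sign.

    Returns an (n_frames - 1, n_bins) array of +1/-1 values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    samples = np.asarray(samples, dtype=float)
    n_frames = 1 + (len(samples) - frame_len) // hop
    # S51: overlapping frames; S52/S53: magnitude spectrum per frame.
    spectra = np.array([
        np.abs(np.fft.rfft(samples[i * hop:i * hop + frame_len]))
        for i in range(n_frames)
    ])
    # S55: difference of successive spectra; S56: keep the sign only.
    diff = np.diff(spectra, axis=0)
    return np.where(diff >= 0, 1, -1)
```

The output is heavily down-sampled relative to the audio (one vector sample per hop) and binary, which keeps the subsequent cross-correlation cheap.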
In step S57 one or more processed frames are selected for use in the method depicted in Fig. 6, for determining the offset between the file used in Fig. 5 and another file comprising the same music track. The result, as illustrated by step S57, is a characteristic representation of a part of the audio file. We refer to this characteristic representation of the part as the "characteristic vector sequence". Some characteristic vector sequences will be more efficient in the sense that they can be shorter to obtain a certain alignment performance. A suitable selection criterion for selecting such a part or parts of the signal to be used for computing the characteristic vector sequence is the entropy of the characteristic vector sequence, as a sequence with high entropy is likely to have features that facilitate matching it in different files. However, the parts can also be selected manually. As a third option, the parts can be selected as a 500 ms part centered around the middle of the file used for the original recipe.
Fig. 6 is a flow chart of how the characteristic vector sequence obtained according to Fig. 5 can be used to find the offset between two files. As explained above, each characteristic vector sequence is a vector signal segment or a set of vector signal segments. In step S61 the characteristic vector sequence is cross-correlated with the entire vector sequence representation of the other file. This can be described as sliding the characteristic vector sequence across the vector sequence of the other file and cross-correlating at different positions to find the best match. By finding the maximum in the cross-correlation, the location of the characteristic vector sequence in the other file is found, which means that the time scales of the two files can be synchronized. It is noted that the cross-correlation is performed on a vector signal that is down-sampled significantly compared to the audio signal. Thus, in step S62 an interpolation is made between the distinct cross-correlation sample values obtained, to increase the resolution of the cross-correlation curve. In step S63 the maximum value of the cross-correlation curve is found, to identify more precisely the part of the other file that matches the characteristic vector sequence. In step S64 the time scale is determined to determine the offset between the two files.
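Steps S61 to S64 may be sketched as follows. This is an illustrative Python sketch; one common way to realize steps S62 and S63, used here as an assumption, is to locate the discrete correlation maximum first and then refine it by parabolic interpolation of the three surrounding correlation values.

```python
import numpy as np

def find_offset(char_seq, other_seq, hop_seconds):
    """Sketch of steps S61-S64: slide the characteristic vector
    sequence over the other file's vector sequence, correlate at
    each lag, refine the peak by parabolic interpolation, and
    convert the lag from frames to seconds.

    char_seq: (m, d) array; other_seq: (n, d) array with n >= m;
    hop_seconds is the frame hop of the down-sampled vector signal.
    """
    m = len(char_seq)
    n = len(other_seq)
    # S61: cross-correlate at every possible position.
    corr = np.array([
        np.sum(char_seq * other_seq[lag:lag + m])
        for lag in range(n - m + 1)
    ])
    k = float(np.argmax(corr))
    i = int(k)
    # S62/S63: parabolic interpolation around the discrete maximum
    # for sub-frame accuracy.
    if 0 < i < len(corr) - 1:
        y0, y1, y2 = corr[i - 1], corr[i], corr[i + 1]
        denom = y0 - 2 * y1 + y2
        if denom != 0:
            k = i + 0.5 * (y0 - y2) / denom
    # S64: convert the lag from frames to seconds.
    return k * hop_seconds
```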
The offset value obtained in step S64 may be used to correct the parameters in the mix instructions file related to the processed audio file.