WO2010020646A2

WO2010020646A2 - A method of processing packetised data

Info

Publication number: WO2010020646A2
Application number: PCT/EP2009/060686
Authority: WO
Inventors: Jay Fenton
Original assignee: Howler Technologies Limited
Priority date: 2008-08-19
Filing date: 2009-08-18
Publication date: 2010-02-25
Also published as: WO2010020646A3; GB2462825A; GB0815121D0

Abstract

A method of processing, in real-time, packetised network-sourced data using a multi-core processor comprising one or more first processing cores and a plurality of second processing cores, the one or more first processing cores being configured to allocate processing tasks to the plurality of second processing cores and each of the second processing cores having a dedicated working memory for receiving data to be processed by the second processing core, the method comprising: receiving a packet of data at an I/O interface; allocating processing of the packet to a processing core of the plurality of second processing core; processing the packet on the processing core of the plurality of second processing cores; and transferring the packet to the I/O interface. The multi-core processor may be based on the STI Cell Broadband Engine architecture.

Description

A method of processing packetised data

Field of the Invention

The invention relates to the processing of packetised data in real-time using a multi- core processor.

Background of the Invention

Many applications require the transfer of data in real-time. For example, when data is transmitted across different networks, the data may have to be compressed, encrypted, encoded, decoded, decrypted, decompressed, analysed and processed in real-time as the data passes between the networks. Examples of such applications are voice calls and video streaming. In both examples, an audio or video codec is typically used to encode and packetise the data at the source into a form suitable for transmission over a network. As the packets traverse the networks, in cases where two networks do not share a common codec, or have bandwidth or other constraints, it may be required to re-encode the data at the network edge into another codec format, a process known as transcoding. This is especially prevalent in voice networks, an example of which would be at the boundary between a Mobile Network Operator (MNO) and the Public Switched Telephone Network (PSTN), the two of which are based on differing technologies and incompatible codecs. Any delays in the conversion of the data will be perceptible to the persons on the call. Consequently, such transcoding systems must operate within real-time constraints and commit to processing a packet with a guaranteed Round-Trip Time (RTT) . Conventionally, the prior art systems which perform the transcoding and signal processing operations do so using Digital Signal Processor (DSP) technology, a type of processor which is specifically designed for streaming real-time math operations. However, extensive software development, hardware engineering, and manufacturing resources are required to develop a solution using this technology, and the implementation of the entire system is always highly bespoke. Additionally, due to their limited individual processing capacity a large number of DSPs are required in order to concurrently process the number of calls typically seen by providers today, resulting in substantial power, cooling and space requirements for hardware using this technology.

There exist general purpose processing units that can carry out parallel processing on the same silicon die, examples of which would be the homogenous multi-core CPUs from the likes of Intel and AMD. However these general purpose CPU implementations hit a performance ceiling due to the requirement that a traditional programming model be maintained for backwards compatibility with existing software. As more cores are added, there is increased contention amongst the cores for access to finite resources such as the shared memory space and input/output (I/O) peripherals, and additional overhead in maintaining synchronicity between the CPU cores.

Some vendors are now creating a new generation of multi-core processing units using a different approach wherein the individual cores operate as discrete units, sharing minimal resources, and yet still co-exist on the same silicon die. An example of such a processing unit is the STI Cell Broadband Engine processor (Cell/B.E). A typical Cell/B.E processor contains two sets of disparate processor elements. The first set consists of one or more Power Processing Elements (PPE), each a general-purpose Power architecture processor. The PPEs are typically configured to run a single operating system image such as Linux, by way of Symmetric Multi-Processing (SMP). The second set of processor elements, known as Synergistic Processing Elements (SPEs), each consist of one or more Synergistic Processing Units, a Memory Flow Controller and a small amount of local memory, dubbed Local Store (LS). The PPEs and the SPEs communicate over the Element Interconnect Bus (EIB), consisting of a number of paired high-bandwidth counter-rotating rings to which all components of the system are connected. Two or more Cell Broadband Processors can be linked via a Broadband Interface Protocol (BIF) which allows the multiple processors to act as a single SMP-enabled system image. The design enables a minimally contended, highly asynchronous, highly parallel and scalable computer environment in a small power efficient form-factor, whilst retaining the ability for high-speed, low latency inter-core communication. However, whilst this new architecture radically increases performance, it comes at the expense of a substantially modified programming model when compared to a traditional CPU. A processing unit of this type will excel at problem sets that are highly parallelisable, where each block of data has minimal or no dependencies on other blocks of data. For problems where there exists large numbers of inter-dependencies between blocks of data, it is possible for this particular architecture to be less performant than a traditional CPU due to the overhead of having to transfer blocks of data from the main memory store to the local store on which it will be processed, and the limited local memory of the local store. In programming for the Cell/B.E in particular, the various nuances and caveats have to be taken into account in order to achieve a system that uses the processor to its fullest potential. Issues typically abstracted away in a more typical CPU must be dealt with directly by the programmer, including memory alignment, asynchronous memory transfers and memory management, vectorisation of algorithms, and instruction pipelining. Additionally, as is typical for a processor unit of this type, the data bus connecting the components is optimised for the transfer of large blocks of data, and performs less efficiently with many small transactions. This directly affects the processing of real-time network- sourced data, as it is typically delivered in very small frames.

The invention aims to improve on the prior art.

Summary of the Invention

According to the invention, there is provided a method of processing, in real-time, packetised network-sourced data using a multi-core processor comprising one or more first processing cores and a plurality of second processing cores, the one or more first processing cores being configured to allocate processing tasks to the plurality of second processing cores and each of the second processing cores having a dedicated working memory for receiving data to be processed by the second processing core, the method comprising: receiving a packet of data at an I/O interface; allocating processing of the packet to a processing core of the plurality of second processing core; processing the packet on the allocated processing core; and transferring the packet to the I/O interface. - A -

The packet may be a Real-time Transfer Protocol (RTP) packet sourced from a network. The multi-core processor may be based on STI Cell Broadband Engine architecture. The I/O interface may be a network interface card.

The method may further comprise transferring the packet to one of the one or more first processing cores and after allocating processing of the packet to the processing core of the plurality of second processing cores, transferring the packet to the processing core of the plurality of second processing cores. Additionally, the method may comprise creating a unit for receiving two or more packets of data; and adding two or more packets of data to the unit; and wherein transferring the packet comprises transferring a unit comprising the packet. In response to adding a first packet to said unit, a timer may be started and transfer of the unit may be initiated on expiry of a time window defined by the timer.

Consequently, according to an alternative embodiment of the invention, the invention provides a way of using a multi-core processor system, such as a STI Cell Broadband Engine, to process packetized data in real-time. By aggregating the packets into units comprising more than one packet, the blocks of data transferred within the system are larger than the size of a single packet and therefore provided in a form that the system can handle more efficiently.

The creating of a unit for receiving two or more packets of data may be carried out by an external processing device connected to the I/O interface and transferring the packet to the one or more first processing cores may comprise transferring the unit comprising the packet. By creating the aggregate units in external processing devices, the per-packet overhead on the one or more first processing cores is reduced.

Alternatively, the method may comprise transferring a packet directly from the I/O interface to a dedicated memory of the processing core of the plurality of second processing cores using a memory transfer operation. Transferring a packet directly from the I/O interface may comprise transferring a packet directly from the I/O interface to a processing core of the plurality of second processing cores selected in accordance with a scheduling algorithm.

Consequently, according to another embodiment of the invention, the invention provides a way of processing packetized data in real-time, using a multi-core processor system such as a STI Cell Broadband Engine, in which the overhead of having to transfer blocks of data from the main memory store to the local store by the PPE is avoided.

Allocating processing of the packet may be carried out by a processing kernel of the processing core of the plurality of second processing cores. Allocating processing of the packet may also be carried out by one of the one or more first processing cores.

The transferring of the packet to the I/O interface may comprise transferring the packet to the first processing core for subsequent transfer to the I/O interface. Alternatively, the transferring of the packet to the I/O interface may comprise transferring the packet directly from the second processing core of the plurality of second processing cores to the I/O interface through a memory transfer operation.

The method may further comprise transferring the packet from a processing core of the plurality of second processing cores to another processing core of the plurality of second processing cores for processing. Yet further, the method may also comprise calculating one or more packet checksums on a processing core of the plurality of second processing cores.

According to the invention, there is provided a computer program comprising instructions that when executed by a processor system comprising a plurality of processor cores causes the processor system to carry out the method described above. The instructions may comprise multiple program object codes for execution by the plurality of processor cores.

Furthermore, according to the invention, there is also provided an apparatus for processing, in real-time, packetized network-sourced data, the apparatus comprising one or more multi-core processors comprising one or more first processing cores and a plurality of second processing cores, the one or more first processing cores being configured to allocate processing tasks to the plurality of second processing cores and each of the second processing cores having a dedicated working memory for receiving data to be processed by the second processing core, and an I/O interface for receiving a packet of data; the one or more first processing cores comprising a processing core for allocating processing of a packet received at the I/O interface to one of the plurality of second processing cores and said one of the plurality of second processing cores being operable to process the packet.

The multi-core processor system may be based on the STI Cell Broadband Engine architecture. The dedicated working memory may be an on-die local working memory.

The processing core of the plurality of second processing cores may be operable to transfer the packet from the processing core of the one or more first processing cores to the dedicated memory of the processing core of the plurality of second processing cores using a memory transfer operation.

The apparatus may further comprising means for creating a unit for receiving two or more packets of data and adding two or more packets of data to the unit, and transferring the packet may comprise transferring a unit comprising the packet. The means for creating a unit may further be operable to start a timer when a first packet has been added to the unit and initiate transfer of the unit on expiry of a time window defined by the timer. The processing core of the one or more first processing cores may comprise the means for creating a unit. Alternatively, the apparatus may further comprise a further processing device, the further processing device comprising the means for creating a unit and means for transferring the unit to the processing core of the one or more processing cores.

The processing core of the plurality of second processing cores may be operable to transfer a packet directly from the I/O interface to a dedicated memory of the processing core of the plurality of second processing cores using a memory transfer operation.

The processing core of the plurality of second processing cores may configured to transfer a processed packet to the processing core of the one or more first processing cores for subsequent transfer to the I/O interface. Alternatively, the processing core of the plurality of second processing cores may be configured to transfer a processed packet directly to the I/O interface using a memory transfer operation.

The processing core of the plurality of second processing cores may also be configured to transfer the packet to another processing core of the plurality of second processing cores for further processing. Furthermore, the processing core of the plurality of second processing cores may be operable to calculate a packet checksum for the packet.

The apparatus may comprise telecom switching apparatus and the processing core of the plurality of second processing cores may be operable to perform at least one out of encoding, decoding, compressing, decompressing, encrypting, decrypting, signal analysis and signal processing.

Brief Description of Drawings

Embodiments of the invention will now be described, by way of example, with reference to Figures 1 to 10 of the accompanying drawings, in which: Figure 1 is a schematic diagram of a typical communication system; Figure 2 is a schematic diagram of an example of a data processing apparatus of the communication system of Figure 1 ;

Figure 3 is a schematic diagram of a multi-core processor of the data processing apparatus; Figure 4 shows how the virtual memory of a first type of processor of the multi-core processor can be organised;

Figure 5 shows how received packets can be organised;

Figure 6 illustrates a method of receiving, processing and transmitting data packets by the data processing apparatus; Figure 7 illustrates a method of preparing, in the first type of processor, a packet for processing by another type of processor in the multi-core processor;

Figure 8 illustrates a method of indicating, by the first type of processor, that packets are available for processing by the second type of processor;

Figure 9 illustrates a method of transferring packets from the first type of processor to the second type of processor and back;

Figure 10 illustrates a method of processing packets in the second type of processor of the multi-core processor;

Figure 11 is a schematic diagram of another example of a data processing apparatus of the communication system of Figure 1.

Detailed Description

With reference to Figure 1, a communication system 1 comprises a first user agent 2 operating in a first network 3, a second user agent 4 operating in a second network 5.

The first and second networks comprise data processing apparatus 6, 7 for processing the data communicated from the first user agent 2 to the second user agent 4 and vice versa. The first user agent 2 may be a mobile phone and the first network may be a

Mobile Network Operator (MNO) network. The first mobile phone may utilise the

Adaptive Multi-Rate Narrow Band (AMR-NB) audio codec to communicate with the

Mobile Network Operator (MNO) network. The second user agent 4 may be a conventional telephone and the second network 5 may be a Public Switched Telephone

Network (PSTN). The conventional telephone may be connected via an Analogue circuit to the PSTN. The first and second networks 3 and 5 are inter-connected by way of a Time-Division Multiplexed (TDM) circuit. The packets of audio data are transcoded from the AMR-NB codec used in network 3 to the G.711 codec used in network 5 (and vice versa in the opposite direction). The mobile network operator may perform the transcoding steps in both directions using the processing apparatus 6, which will decode the packets from the first network 3 into a normalised form such as signed linear, and re-encode the signed linear into a new packet using the codec of the second network 5. The capacity of the processing apparatus is greatly reduced when transcoding to a complex codec such as AMR-NB, when compared to the processing of a less computationally expensive codec such as G.711. The data processing apparatus 6 may also carry out further processing of the data, such as encryption, decryption, signal analysis and signal processing. For example, the signal processing may involve echo cancellation.

With reference to Figure 2, the processing apparatus 6 comprises a multi-core processor system 8, a network interface 9 for receiving data from, and transmitting data to, the networks 3, 5 and main memory 10. The multi-core processor system may comprise one or more multi-core processors in communication with each other. The multi-core processors may have separate and shared regions of the main memory 10, managed by the operating system. The network interface 9 may comprise one or more network cards for each multi-core processor.

With reference to the Figure 3, each multi-core processor 8a comprises one or more marshalling cores 11 and a plurality of processing cores 12a-12h. The marshalling core 11 may be a general purpose processor. As shown in Figure 3, one or more marshalling cores are connected to and control one or more associated processing cores 12a- 12h. Each processing core has a separate dedicated working memory for receiving and working with data to be processed by that core. The dedicated working memory is not managed by the operating system for programs running on the processing core, and is accessible by other cores only through an explicit memory transfer operation. The dedicated working memory of an SPE may be an on-die local working memory. A marshalling core 11 communicates with its processing cores 12a-12h via a circular highspeed data bus 13. The bus also connects the marshalling core to the main memory 10 and the network interface 9 in Figure 2. It is contemplated that for each marshalling core there may exist a separate region of the main memory 10 and a separate I/O lane forming part of the network interface 9. The separate region of the main memory may be a local main memory bank 10a and the I/O lane may host a topologically local Network Interface Card (NIC) 9a. Two or more multi-core processors can be linked via a coherent interface 14 to form the multi-processing system. In one embodiment, the multi-core processor system 8 in Figure 2 may comprise four marshalling cores, and sixteen associated processing cores, provided as two multi-core processors 8a.

The multi-core processor system 8 of Figure 2 may consist of one or more Cell Broadband Engine processors or its derivatives and variations, such as the PowerXCell/8i processor. In a Cell Broadband Engine processor, one or more marshalling cores form a unit referred to as a Power Processing Element (PPE). The PPEs are formed from general purpose Power architecture based processors. The processing cores are referred to as Synergistic Processing Elements (SPEs). The circular high-speed data bus is an Element Interconnect Bus (EIB). The coherent interface may be, but is not limited to, a Broadband Interface Protocol (BIF). The marshalling core units, the processing cores, the bus and the coherent interface will hereinafter be referred to as the PPEs, the SPEs, the EIB and the BIF. However, it should be recognised that a Cell Broadband Engine is just one example of a multi-core processing system and other examples are contemplated.

The PPE allocates processing tasks to the SPEs. It also transfers code for carrying out the processing tasks to the SPEs through the same memory transfer mechanism used for data. Each SPE comprises a Memory Flow Controller (MFC) for asynchronously transferring data to and from the SPE, and a Local Store (LS) for storing data and instructions. In some embodiments, each SPE has a 256KB on-chip local store, which is locally addressable only by that particular SPE through load/store instructions. Consequently, the other SPEs or the PPE cannot directly access the local store other than through memory transfer operations. Both the code and data executing on the

SPE must reside within its local store memory region. The SPEs cannot directly access other memory regions in the system through load/ store instructions except their own dedicated on-chip local store. If the SPE needs to access main memory or other memory regions within the system, then it needs to initiate memory transfer operations, such as Direct Memory Access (DMA) operations, via the MFC. The MFC can be used to effectively transfer data between the main memory and an SPE local store, or between two different SPE local stores.

AU the PPEs in the system may run a shared operating system, such as, but not limited to, Linux. With reference to Figures 4, the Operating System divides virtual memory, which maps onto memory addresses in the main memory 10, into a kernel-space 15 for running the kernel and a user-space 16 for running user applications. The kernel-space 15 comprises a first kernel buffer 17 for receiving data from the network via the network interface 9 and a second kernel buffer 18 for transferring data to the network via the network interface 9. The user-space comprises a user-space buffer 19 for storing data packets to be processed by the SPEs and a memory mapped bit-field 20 for each SPE 12a-12h for indicating to the SPEs when data is available to be processed. The memory mapped bit-field 20 in the user space corresponds to a direct problem state register on the SPE to which both the PPE and associated SPE have transparent coherent access to through an initial memory mapping operation performed during initialisation.

Data is transferred from the PPE 11 to the SPE 12a-12h in groups of packets which will hereinafter be referred to as stripes. With reference to Figure 5, a stripe 21 comprises a stripe metadata field 22 and a plurality of stripe slots 23a-23h for receiving a structure consisting of a packet of data and session information including routing information. The stripe metadata field 22 contains a clock field and storage space for passing general system state information between the PPE and the SPE. A stripe comprises 'n' stripe slots for receiving 'n' data packets, where n may be varied at runtime based on the availability of local store memory after the code modules have been loaded onto a given SPE. As an example, 'n' is typically varied between 4 and 32.

An example of a process for constructing and processing a packet in a user agent and in the data processing apparatus will now be described with reference to Figure 6. At step 6.1, the first user agent samples data. If the first user agent is a VoIP phone, the data may be 20 ms of audio data sampled by the microphone. Alternatively, if the first user agent is a streaming video server, the data may consist of video data sourced from, for example, an attached video camera through an appropriate device driver. At step 6.2, the user agent encodes the data into a desired codec and constructs a packet according to a suitable protocol. As mentioned with respect to Figure 1, for voice data the desired codec may be AMR-NB. The protocol may be the Real-time Transfer Protocol (RTP). The RTP packet is then transmitted over the first network 3 at step 6.3. The negotiation of desired codec, IP addresses and port numbers for each channel end- point of the call (such as the network interface of the data processing apparatus 6) will have been determined at call setup time by, for example, a Mobile Switching Controller (MSC), Media Gateway Controller (MGC), or Session Border Controller (SBC) using, for example, the Session Initiation Protocol (SIP) and/or Megaco/H.248 signalling protocols.

At step 6.4, the network card of the network interface 9 of the data processing apparatus 6 of the first network receives the RTP packet into a local buffer, and examines its next free buffer descriptor to determine the address in, for example, main memory to which to relay the packet. Subsequently, at step 6.5, the network card initiates a memory transfer, such as a DMA transfer, of the packet to the address in the first kernel buffer 17 and on completion of one or more transfers, generates an interrupt to indicate to the operating system that one or more packets are waiting. At step 6.6, the operating system processes the packet through the IP stack in the kernel and places the packet into a queue for an existing socket. At step 6.7, a user-space application checks whether there are any packets available on the relevant socket, transfers available packets to the buffer 19 in the user-space 16 and prepares the packets into a form suitable for processing by the SPEs 12a-12h. When a stripe has been prepared for an SPE, the user-space application indicates to the SPEs at step 6.8 that packets are ready for processing. The SPEs then collect, process and return the packets at step 6.9. A packet may be transferred between several SPEs for different processing operations before being returned to the user application buffer 19. Steps 6.7, 6.8 and 6.9 will be described in more detail with respect to Figures 7, 8, 9 and 10. At step 6.10, the user application checks for processed stripes in the user-space buffer 19, loops through the packets to transfers the processed packets to the kernel and the kernel dispatches the received packets to the second user agent 4. The user application may open a raw socket in the second buffer 18 and data copied into this buffer may be immediately dispatched to the network card, bypassing the IP stack. In an alternative embodiment, the processed packets are copied directly from the SPEs to a NIC via a memory transfer operation across the EIB 13.

An example of a method of preparing packets by the PPE for processing by an SPE (step 6.7) will now be described in more detail with respect to Figure 7. At step 7.1, the user application checks whether there are packets in a queue on the relevant sockets. The sockets are monitored continuously for new packets. In an alternative embodiment, a raw socket may be used such that only one socket is needed to receive packets for a plurality of destination addresses and ports. If there are packets available, the packets are transferred to the user-space buffer 19 at step 7.2.

The user application then parses the packet at step 7.3 to identify relevant information in the header of the packet. The relevant cache of session information previously stored in a buffer in main memory is identified based on the source and/or destination address and port stored in the header of the packet. It may also be identified based on other information in the header. The session information in the cache includes information such as which codec has been used to encode the packet, which codec should be used for re-encoding the packet before forwarding, and the source and destination network addresses.

At step 7.4, a suitable SPE is allocated based on the session information. It is contemplated that some of the SPEs are allocated to decode packets and some SPEs are allocated to encode packets, amongst other tasks. Different SPEs may decode/encode packets with different codec schemes. The remaining SPEs may be allocated for compressing the data in the packets. For example, three SPEs may be allocated for decoding packets, four SPEs may be allocated for re-encoding decoded packets and one SPE may be allocated for running signal analysis using an algorithm such as Goertzel for DTMF tone detection. An SPE may also host one or more of these processing kernels, subject to the finite local store space available. The data in a packet may require multiple operations such as decoding, signal analysis, and re- encoding. One or more SPEs are allocated to fulfil each operation based on which processing kernels they are hosting, and the routing data for the packet is updated in the session information to reflect this. The routing data will be interrogated during each operation such that the packet can then be transferred from the selected SPE to other SPEs for further operations such as compression and the encoding tasks, and ultimately returned to the PPE for potential dispatch to the network.

After a suitable SPE for a packet has been determined, the next free stripe slot 23a-23h for the SPE is determined at step 7.5. In some embodiments, the stripes are arranged in a ring buffer in the user-space buffer 19. The ring buffer may comprise four stripes, each containing eight slots, and there may exist one ring buffer for each SPE. The next free slot is determined by following a pointer which is updated each time a slot is filled to point to the next slot (and possibly stripe) on the ring for a particular SPE. At step 7.6, the user application marks the identified next free stripe slot for the SPE as 'building'. At step 7.7, the user application determines whether the identified stripe slot 23a- 23h is the first stripe slot 23a in a stripe. In other words, it determines whether any other packets have previously been associated with the stripe. If the stripe slot is the first stripe slot in the stripe, the clock field in the stripe metadata field 22 is set at step 7.8. The clock field may be set to the current time of the system clock plus an offset. The setting of the clock field starts a time window, at the expiry of which the stripe will be dispatched to an SPE regardless of whether it is fully populated. If the packet is not the first packet in the stripe, the process proceeds directly to step 7.9.

At step 7.9, the packet is copied into the stripe slot and at step 7.10, session information from the session information cache is added to the slot. The session information includes information about the processing that is required to each packet. The session information that is added to the slot is also updated with routing information for the packet. For example, the session information may instruct the SPE to perform a series of operations such as decode the packet into signed linear, perform signal analysis, perform echo cancellation, encode the packet to another codec, compute checksums, and return the packet to the PPE for dispatch. The routing information includes information about which SPEs the packet should be transferred to for the required processing. The routing information includes the first selected SPE module, any subsequent modules and SPEs, and ends with the PPE. The session information may also contain session information from previous transcoding operations for the session, in cases where some persistent state is required to be kept. This persistent session information will be stored in the session information cache. The stripe slot is then marked as 'ready' at step 7.11.

An example of a method of deciding and indicating that a stripe is ready for processing by an SPE (step 6.8) will now be described in more detail with reference to Figure 8. The user application loops through the buffer containing the stripes and selects a stripe at step 8.1. The clock field of the stripe metadata field 22 is checked at step 8.2 to determine whether the timer has expired. If a predetermined time window has ended since the first packet was placed in the stripe, the timer has expired and the process proceeds to step 8.3. The predetermined time period is configured so as not to impose substantial latency on the real-time stream and is typically kept low enough to be imperceptible to the channel end-points. At step 8.3, the user-space application indicates to the SPE for which the stripe is intended that the stripe is ready for pick-up and processing. In some embodiments, the user-space application toggles a bit corresponding to the stripe in the shared bit-field 20 for the SPE to indicate that the stripe is ready. Consequently, the stripe may be indicated as ready for pick-up by the SPE even if it is not full. If at step 8.2 it is determined that the timer has not expired, the process proceeds to step 8.4 instead and it is checked whether the stripe is full. If the stripe is full, the user application indicates to the SPE that the stripe is ready for pick up even if the timer has not expired. If the timer has not expired and the stripe is not full, the user application loops to the next stripe in the buffer and checks whether the next stripe is ready to be processed by the SPE. The user application loops through all the stripes repeatedly to update the states of the stripes and indicate to the SPEs when the stripes are ready for pick-up. An example of a process for transferring stripes to SPEs, processing the packets in the stripes and returning the packets back to the PPE (step 6.9) will now be described with respect to Figures 9 and 10. The process will be described for one SPE but the process applies, of course, to all SPEs. The SPE checks the relevant shared bit-field 20 at step 9.1 to check whether the PPE has stripes that are ready to be processed for the SPE. If there are ready stripes, the SPE accesses a local table at step 9.2 to find the memory addresses in the main memory 10 of the ready stripes. The local table of memory addresses of stripes for the SPE is loaded into the local store of the SPE from the PPE at start up. In one embodiment, the local table may be optimised away such that the address can be calculated using a pre-shared address and a computable offset from this address, based on the stripe ID in question and the known byte size of a stripe. Once the address is found, the transfer of the stripe is started at step 9.3. The SPE issues a non-blocking asynchronous request to the MFC to retrieve the stripe at the address in the main memory 10 and transfer it to an address in the local store. The transfer may be initiated with a memory transfer command to retrieve the stripes from the main memory and transfer into the SPE local store. The number of stripes that can be held by the SPE can be dynamically configurable, and is based on the available local store memory. Once all requests have been dispatched, the SPE checks whether any stripes have arrived from other SPEs at step 9.4 by interrogating a local bit-field register or cache line on the SPE. If stripes have arrived in the local bank of stripes reserved for SPE-SPE stripes, the relevant stripes are processed at step 9.5. The processing of the stripes will be described in more detail with respect to Figure 10.

If no SPE-SPE stripes have arrived, the SPE checks whether there are stripes in transfer from the PPE at step 9.6 and if there are, checks to see if a first stripe has been completely transferred at step 9.7. If not, the SPE returns to 9.4. If a stripe has been transferred, the SPE then counts the leading zeroes on the bit-field that is returned by the transfer request function, at step 9.8, to identify the ID of the stripe that has been completely transferred so that the completely transferred stripe can be accessed in the local store. The stripe can then be processed at step 9.5 and transferred back to the PPE at step 9.9. Not all packets in the stripe will be ready for transfer back to the PPE. Some of the packets will have been transferred as part of the processing of the packets to other SPEs for further processing, leaving empty slots. A stripe may therefore be sent back to the PPE at step 9.9 with empty slots. If packets received in slots from other SPEs at step 9.4 have been processed and are ready to be sent back to the PPE the packets can be included in the empty slots in a stripe that is being sent back to the PPE. Alternatively, they can be sent back as a stripe comprising a single slot.

When the stripe has been transferred at step 9.9, it is checked whether there are further stripes to be processed. The SPE checks both whether stripes have arrived from other SPEs and whether there are stripes in transfer from the PPE at steps 9.4 and 9.6. If no stripes have arrived from other SPEs, it continues to check whether another transfer request has been completed and processes the transferred stripe. When there are no more transfer requests pending, the process returns to step 9.1 and it is determined whether any stripes are ready for processing.

It should be realised that although a certain order of processing steps has been described with respect to Figure 9, the order can be varied. It is contemplated that the stripes are processed in parallel with transfers happening in the background.

With reference to Figure 10, an example of a method for processing a stripe starts with the SPE parsing the session information stored in the first slot in the stripe at step 10.1. The SPE then determines the processing operations required for the packet data in the slot and where the required processing will take place. At step 10.2, it is determined based on the session information whether the first processing task can be performed locally. If the packet can be processed locally, the action will be performed at step 10.3. The process then proceeds, at step 10.4, to the next action listed in the session information and checks whether the next action can be performed locally at step 10.2. If the process cannot be performed locally, the routing data is checked at step 10.5 to determine whether the slot should be transferred to another SPE for the other SPE to perform the required processing. If the routing data indicates that the slot should be transferred, the target SPE is interrogated by reading its associated cache line to identify a target slot memory address, and the slot is transferred directly, at step 10.6, to the SPE local store indicated in the routing data through a memory transfer operation. The routing data may indicate that the slot should be transferred to two or more SPEs for processing in parallel, in which case the procedure will be performed multiple times to transfer the slot directly to the multiple SPE local stores. In one embodiment, a new stripe may be formed to transfer the packet to the other SPEs. If at step 10.2, it is determined that the action cannot be performed locally and, at step 10.6, the routing data does not indicate that it should be transferred to another SPE, the processing is complete. The routing data may indicate the PPE as the final destination at step 10.7. The slot is then marked as processed at step 10.8. The slot is transferred back to the PPE in a stripe, as mentioned with respect to step 9.9, when all the slots in the stripe have been processed. If some of the packets of a stripe are transferred to another SPE, the corresponding slots are returned empty to the PPE or filled with processed packets transferred to the SPE from other SPEs. As mentioned above, in an alternative embodiment, the packet contents in the slot can be copied directly to the network interface 9 as soon as it has been processed. In this instance, the slot may still be returned to the PPE to update the session information cache.

Before transferring a slot to another SPE or a PPE, the SPE can add more session information to the slot. Also, before copying the slot back to the PPE or the network interface, the SPE may calculate checksums for the packet and add those checksum to the packet. The checksum may be, but is not limited to, an Ethernet, IP, TCP and/or UDP checksum.

When the slot has been transferred to another SPE at step 10.6 or marked as processed at step 10.8, the SPE checks at step 10.9 if there are more slots in the stripe. If there are more slots, the SPE proceeds to the next slot at step 10.10 and steps 10.1 to 10.9 are repeated for the next slot. If, instead, all slots in the stripe have been processed, the process ends at step 10.11 and the SPE transfers the processed slots of a stripe back to the PPE and checks whether there are other stripes to process or transfer as shown in Figure 9. As mentioned above, in alternative embodiments, the outbound packets may be transferred directly from the SPE to the network card. Similarly, some of the steps, shown in Figures 6 and 7, for transferring the inbound packets from the network card to the SPEs may be avoided. For example, the user application may use a raw socket to receive the packets directly, bypassing the Operating Systems IP stack and allowing it to receive packets for multiple IP addresses and ports without the overhead of managing a plurality of sockets. Event sockets may be used such that the Operating System will inform the user application when there are new packets. The PPE user application may also reside in the form of a kernel module to reduce context switching.

Moreover, the PPE may instruct the network interface to send packets directly to the SPE local stores. For example, PPE user code may configure buffer descriptors in the network interface such that the SPE may receive packets directly from the network card into its local store, bypassing main memory. The PPE user code may further elect to configure the buffer descriptors such that a sequence of SPE local stores receives packets directly from the network card, in accordance with a scheduling algorithm such as 'round-robin'. The SPEs may also poll the network card buffers directly. In situations where the PPE is bypassed, processing kernels on the SPE may emulate the packet parsing and session information caching and retrieval functionality, with the session information cache residing in main memory. The network interface may send an interrupt to the operating system when one or more packets have been transferred to an SPE and the operating system may inform the SPE that a packet has arrived. The PPE can then retrieve the session information from the session information cache in main memory and send routing and processing task instructions to the SPE. The SPE will perform any local tasks and transfer the packet on for further processing on other SPEs. The processed packets can subsequently be transferred directly to the network interface or to the PPE for dispatch to the network interface. Elements of the PPE user application may be ported to run on the SPE as processing kernels. Some of the task allocation functions of the PPE may then be implemented on the processing kernel of the SPE. The PPE would still store and update the session information cache. Typically, a dual Cell/B.E system comprises two dual core PPEs and sixteen SPEs. In order to speed up the processing, provisions are made to ensure that data handled by a first PPE are processed by the SPEs associated with the first PPE and not by the SPEs associated with the second PPE. Provisions are also made such that interrupts for the network interface card are delivered to the same processor core as will initially process the network data. Provisions are also made such that buffers used for memory access are topologically local to the processor core(s) that will handle the packet. Additionally, provisions are made to maintain data locality such that network I/O is constrained to a nearby physical processor.

With reference to Figure 11, another example of a data processing apparatus 6 is shown. The data processing apparatus 6 comprises a multi-core processing system 8, a network interface 9 and a memory 10. It also comprises one or more external processing devices 24. In the embodiment of Figure 11, the external processing device 24 receive packets from multiple RTP streams and aggregates them into stripes before forwarding the stripes to the multi-core processing system thus reducing per-packet overhead on the multi-core processor system 8. In some embodiments, the data processing apparatus is a single physical unit comprising the external processing devices 24, the multi-core processor system 8, the network interface 9 and the memory 10. In other embodiments, the external processing devices 24 may be connected to the multi-core processing system over the network.

Whilst specific examples of the invention have been described, the scope of the invention is defined by the appended claims and not limited to the examples. The invention could therefore be implemented in other ways, as would be appreciated by those skilled in the art.

For example, it should be realised that although the processing of the packets have been described to be processed in data processing apparatus 6 in the first network 3, the processing could also be carried out in the data processing apparatus 7 in the second network 5. Alternatively, the processing of the packets could be carried out in a data processing apparatus operated independently of the first and second networks. Moreover, the data processing apparatus may comprise a conventional telecom switch connected to a separate unit comprising the multi-processor system 8 for carrying out the processing of the packets.

Moreover, although the embodiments have been described with respect to a voice call between a VoIP-capable handset and a conventional telephone handset, the invention can be used in any application whereby packetised data is transferred over a network where there is a requirement to apply functions on said data in real-time, including, but not limited to applications such as video streaming, video conferencing, audio streaming, telephony systems and signal processing. Additionally, although the embodiments have been described with respect to an IP network and the PSTN, the invention can be used with any packet-based protocol including, but not limited to, Ethernet, Infiniband, GPRS, 3G, ATM networks and point-to-point serial links.

Claims

1. A method of processing, in real-time, packetised network-sourced data using a multi-core processor comprising one or more first processing cores and a plurality of second processing cores, the one or more first processing cores being configured to allocate processing tasks to the plurality of second processing cores and each of the second processing cores having a dedicated working memory for receiving data to be processed by the second processing core, the method comprising: receiving a packet of data at an I/O interface; allocating processing of the packet to a processing core of the plurality of second processing cores; processing the packet on the allocated processing core; and transferring the packet to the I/O interface.

2. A method according to claim 1, wherein the packets is a Real-time Transfer Protocol (RTP) packet sourced from a network.

3. A method according claim 1 or 2, wherein the multi-core processor is based upon STI Cell Broadband Engine architecture.

4. A method according to claim 1, 2 or 3, wherein the I/O interface is a network interface card.

5. A method according to any one of the preceding claims, further comprising transferring the packet to one of the one or more first processing cores and after allocating processing of the packet to the processing core of the plurality of second processing cores, transferring the packet to the processing core of the plurality of second processing cores.

6. A method according to claim 5, further comprising: creating a unit for receiving two or more packets of data; and adding two or more packet of data to the unit; and wherein transferring the packet comprises transferring a unit comprising the packet.

7. A method according to claim 6 further comprising: in response to adding a first packet to said unit, starting a timer; and initiating transfer of the unit on expiry of a time window defined by the timer.

8. A method according to claim 7, wherein the creating of a unit for receiving two or more packets of data is carried out by an external processing device connected to the I/O interface and wherein transferring the packet to the one or more first processing cores comprises transferring the unit comprising the packet.

9. A method according to claims 1 to 4, further comprising transferring a packet directly from the I/O interface to a dedicated memory of the processing core of the plurality of second processing cores using a memory transfer operation.

10. A method according to claim 9, wherein transferring a packet directly from the I/O interface comprises transferring a packet directly from the I/O interface to a processing core of the plurality of second processing cores selected in accordance with a scheduling algorithm.

11. A method according to claim 9 or 10, wherein allocating processing of the packet to a processing core is carried out by a processing kernel of the processing core of the plurality of second processing cores.

12. A method according to any one of claims 1 to 10, wherein allocating processing of the packet to a processing core is performed by one of the one or more first processing cores.

13. The method of any one of the preceding claims, wherein the transferring of the packet to the I/O interface comprises transferring the packet to the first processing core for subsequent transfer to the I/O interface.

14. The method of any of claims 1 to 13, wherein the transferring of the packet to the I/O interface comprises transferring the packet directly from the second processing core of the plurality of second processing cores to the I/O interface through a memory transfer operation.

15. A method of any one of the preceding claims, further comprising transferring the packet from a processing core of the plurality of second processing cores to another processing core of the plurality of second processing cores for processing.

16. A method of any of the preceding claims, further comprising calculating one or more packet checksums on a processing core of the plurality of second processing cores.

17 A computer program comprising multiple program object codes that when executed by a processor system comprising a plurality of processor cores causes the processor system to carry out the method of any one of claims 1 to 16.

18. Apparatus for processing, in real-time, packetised network-sourced data, the apparatus comprising one or more multi-core processors comprising one or more first processing cores and a plurality of second processing cores, the one or more first processing cores being configured to allocate processing tasks to the plurality of second processing cores and each of the second processing cores having a dedicated working memory for receiving data to be processed by the second processing core, and an I/O interface for receiving a packet of data; the one or more first processing cores comprising a processing core for allocating processing of a packet received at the I/O interface to one of the plurality of second processing cores and said one of the plurality of second processing cores being operable to process the packet.

19. Apparatus according to claim 18, wherein the packet is a Real-time Transfer Protocol (RTP) packet sourced from a network.

20. Apparatus according claim 18 or 19, wherein the multi-core processor is based on STI Cell Broadband Engine architecture.

21. Apparatus according to claim 18, 19 and 20, wherein the dedicated working memory is an on-die local working memory

22. Apparatus according to any one of claims 18 to 21, wherein the processing core of the plurality of second processing cores is operable to transfer the packet from the processing core of the one or more first processing cores to the dedicated working memory of the processing core of the plurality of second processing cores using a memory transfer operation.

23. Apparatus according to any one of claims 18 to 22, further comprising means for creating a unit for receiving two or more packets of data and adding two or more packet of data to the unit, and wherein the packet is transferred to the processing core of the plurality of second processing cores in a unit.

24. Apparatus according to claim 23, wherein the means for creating a unit is further operable to start a timer when a first packet has been added to the unit and initiate transfer of the unit on expiry of a time window defined by the timer.

25. Apparatus according to claim 23 or 24, wherein the processing core of the one or more first processing cores comprises the means for creating a unit.

26. Apparatus according to claim 23 or 24, wherein the apparatus further comprises a further processing device, the further processing device comprising the means for creating a unit and means for transferring the unit to the processing core of the one or more processing cores.

27. Apparatus according to any one of claims 18 to 22, wherein the processing core of the plurality of second processing cores is operable to transfer a packet directly from the I/O interface to a dedicated memory of the processing core of the plurality of second processing cores using a memory transfer operation.

28. Apparatus according to any one of claims 18 to 27, wherein the processing core of the plurality of second processing cores is configured to transfer a processed packet to the processing core of the one or more first processing cores for subsequent transfer to the I/O interface.

29. Apparatus according to any one of claims 18 to 27, wherein the processing core of the plurality of second processing cores is configured to transfer a processed packet directly to the I/O interface using a memory transfer operation.

30. Apparatus according to any one of claims 18 to 29, wherein the processing core of the plurality of second processing cores is configured to transfer the packet to another processing core of the plurality of second processing cores for further processing.

31. Apparatus according to any one of claims 18 to 30, wherein the processing core of the plurality of second processing cores is operable to calculate a packet checksum for the packet.

32. Apparatus according to any one of claims 18 to 31 further comprising telecom switching apparatus and wherein the processing core of the plurality of second processing cores is operable to perform at least one out of encoding, decoding, compressing, decompressing, encrypting, decrypting, signal analysis and signal processing.