WO2006030822A1

WO2006030822A1 - Gene expression data processing method and processing program

Info

Publication number: WO2006030822A1
Application number: PCT/JP2005/016927
Authority: WO
Inventors: Tomokazu Konishi
Original assignee: Toudai Tlo, Ltd.
Priority date: 2004-09-14
Filing date: 2005-09-14
Publication date: 2006-03-23
Also published as: JPWO2006030822A1

Abstract

Unevenness in data collected from a DNA chip is appropriately detected and, if possible, corrected. In a gene expression data processing method that obtains analyzable data by processing array data acquired according to the expression levels of genes on a DNA chip, the DNA chip is divided into small regions. The data values constituting the array data are standardized (step 300). For each small region, the average or median of the standardized data values is obtained, and the standard deviation of these values is calculated (step 310). The presence of unevenness in the data collected from the DNA chip is judged on the basis of an increase in this standard deviation (step 310).

Description

Specification

Gene expression data processing method and processing program

Technical field

[0001] The present invention relates to a technique for statistically analyzing gene expression data.

Background art

[0002] It is known to use a DNA chip to obtain gene expression data.

A DNA chip is obtained by fixing a plurality of genes as separate spots on a substrate such as a glass slide. For example, thousands to tens of thousands of genes are fixed as targets in a microarray. Single-stranded DNA or mRNA is used as the target.

[0003] Available base materials for a DNA chip include glass plates with various coatings, membranes made of nylon or nitrocellulose, hollow fibers, semiconductor materials, metal materials, and organic substances. As a target, a copy of all or part of a cDNA, a copy of part of genomic DNA, synthetic DNA, and/or synthetic RNA can be used. Known methods for fixing the target to the substrate include synthesizing oligo DNA on a glass plate by photolithography and attaching the target to the substrate using a spotter or the like.

[0004] For example, fluorescently labeled DNA or RNA (the analysis target) is hybridized to such a DNA chip. Analytes that are complementary to a target form a double strand with it. Since the analysis target is fluorescently labeled, image data can be acquired after hybridization by scanning the DNA chip with a fluorescence scanner. Based on the image data acquired in this way, it is possible to know at which spots a double strand has formed. More specifically, in the obtained image, the spots derived from each DNA appear as a result of hybridization. Therefore, by integrating the signal intensity over a predetermined region that includes the spot position, array data composed of values indicating the signal intensity of each spot can be obtained.

[0005] For example, with a microarray on which thousands to tens of thousands of targets are fixed, array data showing the expression of that many genes can be obtained in a single experimental operation. Accordingly, when measuring the increase or decrease in the expression data of one gene, it is common to calculate the average of the data (signal intensity values) for the many genes used as targets and to standardize the data on that basis. More specifically, the data are standardized before expression data are compared between experiments. For example, Non-Patent Document 1 discloses an example of standardization.

[0006] The probability distribution of the acquired data is nonparametric. Nevertheless, as disclosed for example in Non-Patent Document 2, the acquired data are standardized by methods such as Z-standardization or t-standardization, or by dividing the integrated signal intensity of each spot by the arithmetic mean of the values.

[0007] Since these are not non-parametric methods, there is a problem that such standardization significantly impairs data accuracy.

[0008] In addition, array data based on an image acquired by a fluorescence scanner always include a background component. This is because a background signal is present throughout the image data, and because the measurement range does not always match the actual size and shape of the spot. Therefore, for accurate analysis it is important to obtain data having the true signal value by subtracting the background component from the numerical value of the acquired image data. The same applies to array data obtained by other methods, such as detection of electrical signals or detection of radiation.

[0009] Conventionally, the background component has been estimated by obtaining an average or median value per pixel from numerical values representing the signal intensity around a specific spot or in a non-spotted background area, and multiplying this value by the number of pixels in the measurement area.

[0010] The present inventor has therefore found that the logarithmic values of the data obtained from the DNA chip (data indicating the amount of luminescence due to gene expression) follow a three-parameter normal distribution, and, as shown in Patent Document 1, has proposed logarithmically transforming the data, further standardizing them (for example, by Z-standardization), and calculating a background value in order to obtain more suitable data.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2004-13573

Non-Patent Document 1: “Normalization strategies for cDNA microarrays”, Johannes Schuchhardt et al., Nucleic Acids Research, Vol. 29, No. 10, 2000

Non-Patent Document 2: “Chasing the dream: plant EST microarrays”, Todd Richmond et al., Current Opinion in Plant Biology, Vol. 3, 2000, pp. 108-116

Non-Patent Document 3: Tomokazu KONISHI, “Three-parameter lognormal distribution ubiquitously found in cDNA microarray data and its application to parametric data treatment”, BMC Bioinformatics, 2004

Disclosure of the invention

Problems to be solved by the invention

In microarray experiments, the hybridization reaction may not proceed uniformly across the chip. In addition, the scanning that records the reaction result as an image may produce density that is non-uniform from one part of the image to another. Such unevenness affects the data as noise. Until now, unevenness has often been detected by visual inspection of the image data and then corrected by averaging over each small part of the chip. In practice, the averaging has sometimes been applied without any such confirmation. These techniques obtain an image that is uniform as a whole by using, for example, a weighted average or a spline function. Because they can be applied on an ad hoc basis, any image can be made uniform.

However, this approach is itself a new source of noise. For example, since the spots on the chip have (roughly) random signal intensities, spots with high or low signal intensity may happen to gather in a small part of the chip. Even where no gathering of extremely high or low spots occurs, the true picture is that some degree of variation in intensity is present at random. A uniform image differs from this true picture, so correcting to a uniform image introduces errors. When errors are introduced, the noise components increase and reproducibility decreases. For this reason, these methods have been used only with considerable caution. Of course, if the unevenness is not noticed, if its state is not grasped, or if no measures are taken to improve it, the unevenness reduces the accuracy of the data.

Therefore, it is desirable to detect and remove unevenness by methods that are neither ad hoc nor applied indiscriminately.

[0012] An object of the present invention is to provide a method and a program for appropriately detecting unevenness of data on a DNA chip and correcting the irregularity if possible.

Means for solving the problem

[0013] The object of the present invention is achieved by a method for processing gene expression data, in which array data obtained based on the expression levels of genes on a DNA chip and stored in a storage device are processed to obtain analyzable data, the method comprising:

Normalizing data values comprising the array data;

Storing the information of the small area in a storage device when the DNA chip is divided into a plurality of small areas;

For each of the small areas, calculating a representative value of the standardized data values of the array data included in the small area;

Calculating a standard deviation of the representative value distribution;

And detecting the presence of unevenness in the array data of the DNA chip based on the increase in the standard deviation.
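The steps above can be sketched as follows. This is a minimal illustration, not the implementation of the specification: the function names, the use of square blocks as small areas, the choice of the median as the representative value, and the Z-standardization of log-transformed values are all assumptions made for the example.

```python
import math
import statistics


def standardize(values):
    """Z-standardize log-transformed signal values (values must be positive).

    Assumption: plain log transform followed by Z-standardization.
    """
    logs = [math.log(v) for v in values]
    mean = statistics.fmean(logs)
    sd = statistics.stdev(logs)
    return [(x - mean) / sd for x in logs]


def region_representatives(grid, block):
    """Median of the values in each block x block small area of a 2-D grid."""
    rows, cols = len(grid), len(grid[0])
    reps = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cell = [grid[r][c]
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            reps.append(statistics.median(cell))
    return reps


def regional_sd(signal_grid, block):
    """Standard deviation of the per-area representative values.

    A larger value suggests that the chip data are spatially uneven.
    """
    rows, cols = len(signal_grid), len(signal_grid[0])
    flat = [v for row in signal_grid for v in row]
    z = standardize(flat)
    z_grid = [z[r * cols:(r + 1) * cols] for r in range(rows)]
    return statistics.stdev(region_representatives(z_grid, block))
```

When every small area contains the same mix of intensities, the representative values coincide and the standard deviation is close to zero; brightening one half of the chip separates the representative values and raises it.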

In a preferred embodiment, the step of detecting the presence of the unevenness comprises:

Calculating a significance level of the chi-square (χ²) distribution according to a predetermined criterion; and comparing the standard deviation with the significance level of the chi-square distribution, and determining that there is unevenness in the data if the standard deviation is larger than the significance level of the chi-square distribution.
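One way such a chi-square criterion could be realized is sketched below. The sketch rests on assumptions that the text does not state: the representative value is the mean of n standardized values per small area, so its variance without unevenness is taken to be 1/n; the critical value is obtained with the Wilson-Hilferty approximation instead of a statistical table; and all function names are hypothetical.

```python
import math
import statistics


def chi2_upper_quantile(df, alpha):
    """Approximate upper-alpha critical value of the chi-square distribution.

    Uses the Wilson-Hilferty approximation (an assumption standing in for a
    table lookup or a statistics library).
    """
    z = statistics.NormalDist().inv_cdf(1.0 - alpha)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3


def sd_threshold(n_regions, spots_per_region, alpha=0.05):
    """Largest SD of regional means expected at level alpha without unevenness.

    Assumption: each regional mean averages `spots_per_region` independent
    standardized values, so its variance is 1 / spots_per_region.
    """
    df = n_regions - 1
    expected_var = 1.0 / spots_per_region
    return math.sqrt(chi2_upper_quantile(df, alpha) * expected_var / df)


def is_uneven(regional_means, spots_per_region, alpha=0.05):
    """True if the SD of the regional means exceeds the chi-square threshold."""
    sd = statistics.stdev(regional_means)
    return sd > sd_threshold(len(regional_means), spots_per_region, alpha)
```

The significance level `alpha` plays the role of the predetermined criterion; a smaller value makes the detection more conservative.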

[0015] In another preferred embodiment, the step of detecting the presence of the unevenness comprises:

Calculating a difference between the standard deviation and an expected value of an average value of the standard deviation; and Calculating an expected value of the standard deviation;

Comparing the difference with a predetermined value based on an expected value of the standard deviation, and determining that there is unevenness in the data if the difference is greater than the predetermined value.

[0016] For example, the predetermined value is a value 2σ larger than the expected value of the standard deviation.

[0017] In still another preferred embodiment, a method for processing gene expression data, in which array data obtained based on the expression levels of genes on a DNA chip and stored in a storage device are processed to obtain analyzable data, comprises:

Standardizing the data values constituting the array data of each of a plurality of DNA chips; calculating, for each DNA chip, a standard deviation of the logarithmic values of its data values; calculating the median of the distribution of the standard deviations calculated for the plurality of DNA chips;

Calculating an expected value of the standard deviation;

For each DNA chip, calculating the difference between the standard deviation for the DNA chip and the median;

Comparing the difference with a predetermined second value based on the expected value of the standard deviation, and determining that there is unevenness in the data for the DNA chip when the difference is greater than the predetermined second value.

[0018] For example, the predetermined second value is a value 2σ larger than the expected value of the standard deviation.
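The comparison across chips can be illustrated with a short sketch. The function name and the dictionary input are hypothetical, and the threshold is passed in by the caller because the text leaves the derivation of the predetermined second value to the expected value of the standard deviation.

```python
import math
import statistics


def flag_uneven_chips(chips, threshold):
    """Flag chips whose log-scale SD exceeds the median SD by more than threshold.

    chips: dict mapping a chip name to a list of positive signal values.
    threshold: the predetermined second value (supplied by the caller here).
    """
    sds = {name: statistics.stdev([math.log(v) for v in values])
           for name, values in chips.items()}
    median_sd = statistics.median(sds.values())
    return {name: (sd - median_sd) > threshold for name, sd in sds.items()}
```

Using the median of the per-chip standard deviations as the reference keeps a single badly affected chip from shifting the baseline against which all chips are judged.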

In another preferred embodiment, the method comprises:

Calculating a median from the distribution of the signal intensities around the spots, measured together with the image, in which dust or residue of cleaning liquid that causes the unevenness is detected;

Calculating the standard deviation robustly;

Calculating a difference between each of the signal intensities and the median; comparing the difference with a predetermined value based on the standard deviation; and, if the difference is larger than the predetermined value, determining that the data of the spot are under the influence of unevenness and storing in the storage device information indicating that the spot is under the influence of unevenness.

[0020] In still another preferred embodiment, the method comprises, instead of using the robustly calculated standard deviation, calculating the median of the robustly calculated standard deviations of a plurality of DNA chips; comparing the difference with a predetermined value based on that median; and determining that the data of the spot are under the influence of unevenness when the difference is greater than the predetermined value.

[0021] The method further comprises determining that the data of any spot located within a predetermined distance from a spot under the influence of the unevenness are also under the influence of the unevenness, regardless of the value of the difference for that spot. For example, the predetermined distance is 2 spots.
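The spot-level detection and its extension to neighboring spots might look like the sketch below. Several details are assumptions made only for illustration: the robust standard deviation is estimated as 1.4826 times the median absolute deviation, the threshold is `k` times that estimate, only intensities above the median are flagged, and the distance between spots is counted in rows and columns (Chebyshev distance).

```python
import statistics


def flag_spots(surround, k=2.0, reach=2):
    """Flag spots affected by unevenness from their surrounding intensities.

    surround: 2-D list of signal intensities measured around each spot.
    k: multiple of the robust SD used as the predetermined value (assumed).
    reach: neighbors within this many spots of a flagged spot are flagged too.
    """
    flat = [v for row in surround for v in row]
    med = statistics.median(flat)
    # Robust SD via the median absolute deviation (an assumed estimator).
    mad = statistics.median([abs(v - med) for v in flat])
    robust_sd = 1.4826 * mad

    rows, cols = len(surround), len(surround[0])
    seeds = [(r, c) for r in range(rows) for c in range(cols)
             if surround[r][c] - med > k * robust_sd]

    flagged = [[False] * cols for _ in range(rows)]
    for r, c in seeds:
        # Mark the spot itself and every spot within `reach` of it.
        for rr in range(max(0, r - reach), min(rows, r + reach + 1)):
            for cc in range(max(0, c - reach), min(cols, c + reach + 1)):
                flagged[rr][cc] = True
    return flagged
```

Flagging the neighbors regardless of their own difference reflects the idea that dust or residue usually affects an area somewhat larger than the spots where it is clearly measurable.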

[0022] Further, an object of the present invention is a method for processing gene expression data, which obtains analyzable data by processing array data obtained based on the expression level of genes on a DNA chip.

Normalizing data values comprising the array data;

Determining a shape and arrangement of a small region for dividing the DNA chip, and storing information on the shape and arrangement of the small region in the storage device;

Determining a spatial correction function representing the arranged small area group; calculating a function value by a spatial correction function for a data value belonging to the small area for each small area;

It is also achieved by a method of processing gene expression data, comprising the step of storing a function value in the storage device.

[0023] Preferably, according to an embodiment, the step of determining the shape and arrangement of the subregions comprises:

Storing the information of the small region candidates when the DNA chip is divided into a plurality of small region candidates in the storage device;

Calculating a standard deviation of data values belonging to the small region candidate;

Calculating the median of the standard deviations over the small region candidates; and repeating the dividing step, the standard deviation calculating step, and the median calculating step, determining the small region candidates having the smallest median as the small regions, and storing information on the determined small regions in the storage device.
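The search over small region candidates can be sketched as a loop over candidate sizes. The restriction to square blocks and the function names are assumptions for the example; the text allows the shape and arrangement of the small regions to vary more generally.

```python
import statistics


def median_block_sd(grid, block):
    """Median, over all block x block candidates, of the SD inside each one."""
    rows, cols = len(grid), len(grid[0])
    sds = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cell = [grid[r][c]
                    for r in range(r0, min(r0 + block, rows))
                    for c in range(c0, min(c0 + block, cols))]
            if len(cell) >= 2:  # the SD needs at least two values
                sds.append(statistics.stdev(cell))
    return statistics.median(sds)


def choose_block(grid, candidates):
    """Return the candidate block size with the smallest median SD."""
    return min(candidates, key=lambda b: median_block_sd(grid, b))
```

A division that follows the actual pattern of unevenness leaves little variation inside each small region, which is why the candidate with the smallest median standard deviation is selected.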

[0024] Further, the object of the present invention is to process gene expression data obtained by processing array data obtained based on gene expression levels on a DNA chip and stored in a storage device to obtain analyzable data. A method,

Setting the volume of bubbles and the volume of solution in the chamber when the DNA chip is subjected to hybridization.

Calculating the relative time that the observation point has been immersed in the solution at the observation point set on the DNA chip in accordance with the rotation of the chamber in the hybridization;

Normalizing data values comprising the array data;

Dividing the standardized data value by the relative time of the observation point corresponding to the data value;

It is also achieved by a method of processing gene expression data, comprising the step of storing the divided data value in the storage device.

[0025] Further, an object of the present invention is a processing method of gene expression data for processing array data obtained based on the expression level of a gene on a DNA chip to obtain analyzable data,

Dividing the DNA chip into a plurality of small regions;

Normalizing data values comprising the array data;

For each of the small regions, calculating an average value of the data values belonging to the small region; setting first to nth significance levels so that the sensitivity gradually increases; determining, for each of the small regions, whether or not the small region is affected by unevenness based on the first to nth significance levels; and storing in the storage device information indicating whether or not the small region is affected by unevenness. The object is also achieved by a method of processing gene expression data characterized by comprising these steps.

In addition, the object of the present invention is achieved by a computer-readable gene expression data processing program for processing array data obtained based on the expression levels of genes on a DNA chip and stored in a storage device to obtain analyzable data, the program causing the computer to execute:

Normalizing the data values comprising the array data;

For each of the small areas, calculating the average or median of the standardized data values of the array data belonging to the small area, and calculating the standard deviation of these averages or medians; and

Detecting the presence of unevenness in the array data of the DNA chip based on an increase in the standard deviation.

[0026] In a preferred embodiment, in the step of detecting the presence of the unevenness, in the computer,

A step of calculating the significance level of the chi-square distribution according to predetermined criteria,

The standard deviation is compared with the significance level of the chi-square (χ²) distribution, and if the standard deviation is greater than the significance level of the chi-square distribution, a step of determining that there is unevenness in the data is executed.

[0027] In another preferred embodiment, in the step of detecting the presence of the unevenness, in the computer,

Calculating a difference between the standard deviation and an expected value of an average value of the standard deviation; calculating an expected value of the standard deviation; and

The difference is compared with a predetermined value based on the expected value of the standard deviation, and if the difference is larger than the predetermined value, a step of determining that there is unevenness in the data is executed.

[0028] In still another preferred embodiment, a computer-readable gene expression data processing program for processing array data obtained based on the expression levels of genes on a DNA chip and stored in a storage device to obtain analyzable data causes the computer to execute:

Standardizing the data values constituting the array data of each of a plurality of DNA chips; calculating a standard deviation of the data values of each DNA chip; calculating the median of the standard deviations calculated for the DNA chips;

Calculating an expected value of the standard deviation;

For each DNA chip, calculating the difference between the standard deviation for the DNA chip and the median; and

Comparing the difference with a predetermined second value based on the expected value of the standard deviation, and determining that there is unevenness in the data for the DNA chip when the difference is greater than the predetermined second value.

[0029] Further, an object of the present invention is to process gene expression data obtained by processing array data obtained based on gene expression levels on a DNA chip and stored in a storage device to obtain analyzable data. A program comprising:

Normalizing the data values comprising the array data;

Determining a shape and arrangement of a small region to divide the DNA chip, and storing information on the shape and arrangement of the small region in the storage device;

Determining a spatial correction function representing the arranged small region group, calculating a function value by the spatial correction function for a data value belonging to the small region for each small region, and

This is achieved by a gene expression data processing program characterized in that the step of storing the function value in the storage device is executed.

[0030] Preferably, in an embodiment, in the step of determining the shape and arrangement of the small region,

Storing the information on the small region candidates in the storage device when the DNA chip is divided into a plurality of small region candidates;

Calculating the median of the standard deviations of all the small region candidates; and repeating the dividing step, the standard deviation calculating step, and the median calculating step, and determining the small region candidates having the minimum median as the small regions, are executed.

[0031] Furthermore, an object of the present invention is to process gene expression data obtained by processing array data obtained based on the expression level of genes on a DNA chip and stored in a storage device to obtain analyzable data. A program comprising:

Calculating the relative time that the observation point has been immersed in the solution at the observation point set on the DNA chip in accordance with the rotation of the chamber in the hybridization.

Normalizing the data values comprising the array data;

Dividing the standardized data value by the relative time of the observation point corresponding to the data value; and

This is achieved by a gene expression data processing program characterized in that the step of storing the divided data value in the storage device is executed.

[0032] In addition, the object of the present invention is also achieved by a gene expression data processing program for processing array data obtained based on the expression levels of genes on a DNA chip to obtain analyzable data, the program causing the computer to execute:

Storing in the storage device information on the small areas when the DNA chip is divided into a plurality of small areas;

Normalizing the data values comprising the array data;

For each small area, calculating an average value of the data values belonging to the small area;

Setting first to nth significance levels so that the significance level becomes gradually stricter; and

For each of the small areas, determining, based on the first to nth significance levels, whether or not the small area is affected by unevenness, and storing in the storage device information indicating whether or not the small area is affected by unevenness.

[0033] Furthermore, an object of the present invention is to process gene expression data obtained by processing array data obtained based on the expression level of genes on a DNA chip and stored in a storage device to obtain analyzable data. A method,

Assuming that the data values constituting the array data are normally distributed, calculating the representative value Mb and the scale value Sb;

Calculating a standard value Zbi, using the individual background value Xbi, which is the actually measured background value stored in the storage device in association with the data value of each spot on the DNA chip, by

Zbi = (Xbi - Mb) / Sb;

and, when the standard value Zbi is larger than a set rejection level, determining that the data value of the spot on which the calculation of that Zbi is based should be rejected, and storing in the storage device information indicating that the data of the spot are rejected. The object is also achieved by a method for processing gene expression data comprising these steps.
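The rejection rule based on Zbi = (Xbi - Mb) / Sb can be illustrated as follows. Taking Mb as the median and Sb as 1.4826 times the median absolute deviation of the background values is an assumption made for this sketch, as are the function name and the default rejection level.

```python
import statistics


def reject_by_background(backgrounds, rejection_level=3.0):
    """Return one flag per spot: True if its background is too high.

    backgrounds: individual background values Xbi, one per spot.
    Mb and Sb are estimated robustly here (median and scaled MAD), which is
    an assumption; the text only requires a representative and a scale value.
    """
    mb = statistics.median(backgrounds)
    mad = statistics.median([abs(x - mb) for x in backgrounds])
    sb = 1.4826 * mad
    z = [(x - mb) / sb for x in backgrounds]  # Zbi = (Xbi - Mb) / Sb
    return [zi > rejection_level for zi in z]
```

A robust estimate is convenient in this setting because the contaminated spots that should be rejected would otherwise inflate Mb and Sb and mask themselves.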

[0034] In another embodiment of determining whether or not to reject the spot data value, the array data obtained based on the expression level of the gene on the DNA chip and stored in the storage device is processed and analyzed. The method of processing gene expression data to obtain possible data is:

Zbi = (Xbi-Mb) / Sb

A step of calculating by

Sorting all the standard values Zbi and comparing them, on a normal probability plot, with the theoretical values of the normal distribution to identify the range in which they agree with the theoretical values to within a predetermined tolerance; and, if a standard value Zbi is larger than the upper limit of the identified range, determining that the data value of the spot on which the calculation of that Zbi is based should be rejected, and storing in the storage device information indicating that the data of the spot are rejected.
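A possible reading of the normal probability plot comparison is sketched below. The plotting positions (i + 0.5) / n, the tolerance value, and the rule of walking upward from the middle of the sorted values until the first disagreement are all assumptions chosen for the illustration.

```python
import statistics


def upper_limit_from_probability_plot(z_values, tolerance=0.5):
    """Largest sorted Z value that still agrees with the normal quantiles.

    Starting from the middle of the sorted values, move upward while the
    observed value stays within `tolerance` of the theoretical quantile.
    """
    ordered = sorted(z_values)
    n = len(ordered)
    normal = statistics.NormalDist()
    theory = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    limit_index = n // 2
    for i in range(n // 2, n):
        if abs(ordered[i] - theory[i]) > tolerance:
            break
        limit_index = i
    return ordered[limit_index]


def reject_above_limit(z_values, tolerance=0.5):
    """Flag every Z value above the upper limit of the agreeing range."""
    limit = upper_limit_from_probability_plot(z_values, tolerance)
    return [z > limit for z in z_values]
```

Compared with a fixed rejection level, this variant lets the data themselves indicate where the background values stop behaving like a normal distribution.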

[0035] Further, an object of the present invention is to process gene expression data obtained by processing array data obtained based on gene expression levels on a DNA chip and stored in a storage device to obtain analyzable data. A method,

Calculating a standard value Zbi, using the individual background value Xbi of each peripheral spot, which is the actually measured background value stored in the storage device in association with the data value of a peripheral spot located within a certain range around an individual spot on the DNA chip, by

Zbi = (Xbi - Mb) / Sb;

Calculating a distance r between the individual spot and a peripheral spot;

Calculating (Zbi / r²) for each of the peripheral spots, and calculating the sum B of the calculated values;

Determining that the data value of the individual spot should be rejected when the sum B is greater than a set rejection level, and storing in the storage device information indicating that the data of the spot is to be rejected; It is also achieved by a method for processing gene expression data characterized by comprising:
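The neighborhood-based rejection can be sketched as below, under the assumption that each peripheral spot contributes its standard value weighted by the inverse square of its distance r from the individual spot. The function names and the coordinate representation are hypothetical.

```python
import math


def neighbourhood_score(spot, neighbours):
    """Sum B of the distance-weighted standard values of the peripheral spots.

    spot: (x, y) position of the individual spot.
    neighbours: list of ((x, y), z) pairs, z being the standard value Zbi of
    the peripheral spot. The 1 / r**2 weighting is an assumption.
    """
    total = 0.0
    for (nx, ny), z in neighbours:
        r = math.hypot(nx - spot[0], ny - spot[1])
        if r > 0:  # skip the spot itself to avoid division by zero
            total += z / (r * r)
    return total


def should_reject(spot, neighbours, rejection_level):
    """True if the sum B exceeds the set rejection level."""
    return neighbourhood_score(spot, neighbours) > rejection_level
```

With this weighting, a strongly contaminated background next to the spot counts far more than an equally contaminated background several spots away.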

[0036] Further, an object of the present invention is to process gene expression data obtained based on the expression level of a gene on a DNA chip and processing array data stored in a storage device to obtain analyzable data. A method,

Calculating the two-dimensional position (Xc, Xw) of the spot in the c-th row and w-th column of the DNA chip;

(A) calculating a minimum value MINc of spot data values in a predetermined range of rows adjacent to the c-th row or the c-th row;

Calculating a continuous function f (Xc) that approximates the MINc by Xc; Subtracting the f (Xc) from the data value of the spot belonging to the c-th row on the DNA chip; and

Storing the data value obtained by subtracting f (Xc) in a storage device;

(B) calculating a minimum value MINw of spot data values in the w-th column or a predetermined range of columns adjacent to the w-th column;

Calculating a continuous function g (Xw) that approximates the MINw by Xw;

Subtracting g (Xw) from the data value of the spot belonging to the w column in the DNA chip; and

Storing the data value obtained by subtracting g (Xw) in a storage device;

(C) For each of the spot data values of the DNA chip, calculating the background value γ according to

z = (log(x - γ) - μ) / σ

(in the above equation, γ is the calculated background value, μ is a characteristic value of central tendency, and σ is a characteristic value of dispersion);

Subtracting the background value γ from the data value of the spot;

Calculating the median value MEDc of the data values of the spots in the c-th row or in a predetermined range of rows adjacent to the c-th row;

Calculating a continuous function h (Xc) approximating the MEDc by Xc;

Dividing the data value of the spots belonging to the c-th row in the DNA chip by the h (Xc); and

Storing the data value divided by h (Xc) in a storage device;

(D) For each of the data values of the spots on the DNA chip, calculating the background value γ according to

z = (log(x - γ) - μ) / σ

(in the above equation, γ is the calculated background value, μ is a characteristic value of central tendency, and σ is a characteristic value of dispersion);

Subtracting the background value γ from the data value of the spot;

Calculating the median value MEDw of the data values of the spots in the w-th column or in a predetermined range of columns adjacent to the w-th column;

Calculating a continuous function j (Xw) approximating the MEDw by Xw;

Dividing the data value of the spot belonging to the w-th column in the DNA chip by the j (Xw); and

Storing the data value divided by j (Xw) in a storage device;

Including one or more of (A) to (D),

Comparing each of the one or more execution results of (A) to (D) with a lognormal distribution model, and selecting the execution result that most closely approximates the model; and storing the selected execution result in the storage device. The object is also achieved by a method for processing gene expression data comprising these steps.
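Correction (A) can be illustrated with a short sketch. A straight line fitted by least squares stands in for the continuous function f(Xc); this choice, the use of each row's own minimum as MINc, and the function names are assumptions for the example, since the text does not fix the form of the function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys): returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def subtract_row_minimum_trend(grid, row_positions):
    """Subtract f(Xc), a line fitted to the per-row minima, from every row.

    grid: 2-D list of spot data values, one inner list per row.
    row_positions: position Xc of each row on the chip.
    """
    minima = [min(row) for row in grid]  # MINc for each row
    slope, intercept = fit_line(row_positions, minima)
    return [[v - (slope * xc + intercept) for v in row]
            for row, xc in zip(grid, row_positions)]
```

Corrections (B) to (D) follow the same pattern with columns in place of rows, or with medians and division in place of minima and subtraction.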

[0037] Further, the object of the present invention is also achieved by a computer-readable gene expression data processing program for processing array data obtained based on the expression levels of genes on a DNA chip and stored in a storage device to obtain analyzable data, the program causing the computer to execute:

Calculating a standard value Zbi by

Zbi = (Xbi - Mb) / Sb; and

When the standard value Zbi is larger than a set rejection level, determining that the data value of the spot on which the calculation of that Zbi is based should be rejected, and storing in the storage device information indicating that the data of the spot are rejected.

Alternatively, the object of the present invention is achieved by a computer-readable gene expression data processing program for processing array data obtained based on the expression levels of genes on a DNA chip and stored in a storage device to obtain analyzable data, the program causing the computer to execute:

Calculating a standard value Zbi by

Zbi = (Xbi - Mb) / Sb;

Sorting all the standard values Zbi and comparing them, on a normal probability plot, with the theoretical values of the normal distribution to identify the range in which they agree with the theoretical values to within a predetermined tolerance; and

If a standard value Zbi is larger than the upper limit of the identified range, determining that the data value of the spot on which the calculation of that Zbi is based should be rejected, and storing in the storage device information indicating that the data of the spot are rejected.

The object of the present invention is also achieved by a computer-readable gene expression data processing program for processing array data obtained based on the expression levels of genes on a DNA chip and stored in a storage device to obtain analyzable data, the program causing the computer to execute:

Calculating a standard value Zbi by

Zbi = (Xbi - Mb) / Sb;

Calculating a distance r between the individual spot and a peripheral spot;

Calculating (Zbi / r²) for each of the peripheral spots, and calculating the sum B of the calculated values; and

Determining that the data value of the individual spot should be rejected when the sum B is greater than a set rejection level, and storing in the storage device information indicating that the data of the spot are to be rejected.

Furthermore, the present invention provides a computer-readable gene expression data processing program for processing array data obtained based on the expression levels of genes on a DNA chip and stored in a storage device to obtain analyzable data, the program comprising:

Calculating a continuous function f (Xc) that approximates the MINc by Xc;

Subtracting the f (Xc) from the data value of the spot belonging to the c-th row on the DNA chip; and

Storing the data value obtained by subtracting f (Xc) in a storage device;

Calculating a continuous function g (Xw) that approximates the MINw by Xw;

Storing the data value obtained by subtracting g (Xw) in a storage device;

(C) For each of the spot data values of the DNA chip, calculating the background value γ according to

z = (log(x - γ) - μ) / σ;

Subtracting the background value γ from the data value of the spot;

Calculating the median value MEDc of the data values of the spots in the c-th row or in a predetermined range of rows adjacent to the c-th row;

Calculating a continuous function h (Xc) approximating the MEDc by Xc;

Storing the data value divided by h (Xc) in a storage device;

(D) For each of the data values of the spots on the DNA chip, calculating the background value γ according to

z = (log(x - γ) - μ) / σ

(in the above equation, γ is the calculated background value, μ is a characteristic value of central tendency, and σ is a characteristic value of dispersion);

Subtracting the background value γ from the data value of the spot, and calculating the median value MEDw of the data values of the spots in the w-th column or in a predetermined range of columns adjacent to the w-th column;

Calculating a continuous function j (Xw) approximating the MEDw by Xw;

Storing the data value divided by j (Xw) in a storage device;

Including one or more of (A) to (D),

Comparing each of the one or more execution results of (A) to (D) with a lognormal distribution model, and selecting the execution result that most closely approximates the model; and storing the selected execution result in the storage device. The object is achieved by a gene expression data processing program that causes a computer to execute these steps.

The invention's effect

According to the present invention, it is possible to provide a method and a program for appropriately detecting unevenness of data on a DNA chip and correcting the unevenness if possible.

BEST MODE FOR CARRYING OUT THE INVENTION

[0042] [System overview]

Embodiments of the present invention will be described below with reference to the accompanying drawings. FIG. 1 is a hardware configuration diagram of a gene expression data processing apparatus (hereinafter simply referred to as the “processing apparatus”) according to an embodiment of the present invention. As shown in FIG. 1, the processing device 10 includes a CPU 12, an input device 14 such as a mouse and a keyboard, a display device 16 such as a CRT, a RAM (Random Access Memory) 18, a ROM (Read Only Memory) 20, a portable storage medium driver 22 for accessing a portable storage medium 23 such as a CD-ROM or a DVD-ROM, a hard disk device 24, and an interface (I/F) 26 for controlling data exchange with the outside. As can be understood from FIG. 1, a personal computer or the like can be used as the processing apparatus 10 according to the present embodiment.

[0043] The I/F 26 is connected to a communication circuit or to a reader or scanner (not shown) that measures the light emission of the spots on the hybridized DNA chip and generates data based on the measured light emission. The communication circuit is in turn connected to an external network (for example, the Internet).

[0044] In the present embodiment, the portable storage medium 23 stores a program that receives data from the reader or scanner and executes on them the data conversion processing described later, and a program that analyzes the processed data. The portable storage medium driver 22 reads these programs from the portable storage medium 23 and stores them in the hard disk device 24; starting them allows the personal computer to operate as the processing device 10. Alternatively, the programs may be downloaded via an external network such as the Internet.

[0045] The reader or scanner photographs the DNA chip with a CCD camera or the like and outputs array data in which the signal intensity is integrated for each spot. Alternatively, the reader or scanner may determine a background value from the image data captured by the CCD camera, subtract it from the signal intensity of each pixel, integrate the signal intensity for each spot from the background-corrected image data, and output the result as array data. In this embodiment, either unprocessed array data or data that have already been corrected (background correction) by the reader, the scanner, or the accompanying software can be used. In this specification, the array data of per-spot integrated signals transmitted from the reader or scanner, that is, the data on which the background processing of this embodiment is performed, are referred to as original data.

[0046] FIG. 2 is a functional block diagram of the main part of the processing device 10 according to the present embodiment. FIG. 2 shows the components that execute the processing for deriving the analysis result of gene expression data. As shown in FIG. 2, the processing device 10 includes a data buffer 30; a standardization processing unit 31 that reads the data (original data) temporarily stored in the data buffer 30 and standardizes them; an unevenness detection unit 32 that detects unevenness of hybridization based on the original data; a correction processing unit 34 that, when the unevenness detection unit 32 determines that unevenness is present, determines whether the unevenness can be corrected and corrects the original data; an image generation unit 36 that generates an image based on the corrected data; and a result storage unit 38 that stores data corresponding to the processing results generated by the correction processing unit 34 and the other units.

[0047] The function of the data buffer 30 is realized by the RAM 18 and, in some cases, the hard disk device 24. The data buffer 30 temporarily stores the data (original data) indicating the amount of light emitted from each spot, as transmitted from the reader or scanner. The data buffer 30 also stores information generated during, or as the final result of, the processing of the unevenness detection unit 32, the correction processing unit 34, the image generation unit 36, and so on: information on the small areas (for example, their shape and arrangement), information indicating that the array data of a DNA chip are uneven, information indicating that the data value of a specific spot is affected by unevenness, function values, standard deviations, medians, averages, standardized data values, corrected data values, and the like. Although not stated each time below, values calculated during processing are temporarily stored in the data buffer 30. Parameters input by the operator (for example, a rejection level) are likewise stored in the data buffer 30.

[0048] [Principle of unevenness detection]

In the present embodiment, the chip contents are arranged almost randomly on the chip. The data values are therefore independent of the position on the DNA chip, and the values within any small area of the chip are also random. Unevenness of hybridization is detected as a breakdown of this randomness. The breakdown is detected as an increase in the standard deviation of the averages (or medians) calculated for each small area on the chip.

[0049] The increase is detected by the following method.

(1) Judgment is based on the chi-square (χ²) distribution of the standard deviation.

(2) Judgment is based on the normal distribution of the standard deviation.

(3) Judgment is made by using the fact that the standard deviation obtained when the original data are standardized by the method described later is stable between experiments, and detecting an increase in that standard deviation.

(4) Judgment is based on the normal distribution of the measured values of the background.

[0050] First, the first method will be described.

[0051] As described above, the data (signal intensity) is independent of the position on the chip. Therefore, when focusing on a small area on the chip (for example, n-spot X n-spot area), every small area is expected to have the same average value. The expected variance of the average value can be predicted by the central limit theorem.

[0052] If, however, hybridization is non-uniform and the data are therefore uneven, the average value (or median value) differs between the small areas. For example, when bubbles form during hybridization, the average signal intensity of the group of small areas where the bubbles lie is measured low. As a result, the averages of the small areas scatter and their variance becomes large.

[0053] [Processing for the first method]

Hereinafter, more specifically, the process relating to the first technique executed by the unevenness detection unit 32 of the processing apparatus that works according to the present embodiment will be described with reference to FIG. The parameters used in the first method are as follows.

[0054] x_j (j = 1, ..., N): signal intensity of each spot

μ_mi (i = 1, ..., M): average of small area i

σ_ms: standard deviation of μ_m

N: number of spots included in a small area

M: number of small areas

As shown in FIG. 3, the processing in the unevenness detection unit 32 comprises standardization of the original data (step 300), calculation of the standard deviation σ_ms (step 310), and determination of the validity of the standard deviation σ_ms (step 320).

First, the standardization of the original data (step 300) will be described. FIG. 6 and FIG. 7 are flowcharts showing in more detail the standardization processing executed by the standardization processing unit 31 according to the present embodiment. This technique is the same as that described in Patent Document 1.

In the present embodiment, the following formula is used when z-standardization is performed on the original data (signal intensity) X.

[0057] $$ z = \frac{\log(x - \gamma) - \mu}{\sigma} $$

In this equation, γ is the calculated background value, μ is the characteristic value of the central tendency, and σ is the characteristic value of the fluctuation. FIG. 6 is a flowchart showing in more detail the background calculation within the standardization process.

[0058] As shown in FIG. 6, the unevenness detection unit 32 determines the range of background candidate values according to input from an operator (in some cases also referred to as an “experimenter”) or the like, and then determines a plurality of background candidate values within that range (step 601). For example, if the operator specifies a start point (for example, “0 (zero)”) and an end point (for example, the median or the first quartile) for the background candidate values, a predetermined number of evenly spaced (or similarly spaced) values are determined. When “0” and the median are specified, for instance, 8 equally spaced values are taken between them, giving 10 background candidate values including the start and end points. The background candidate values are stored in the data buffer 30 and are read and updated as necessary.

Next, the unevenness detection unit 32 subtracts one of the background candidate values from each extracted original data value (that is, a signal intensity value) (step 602), and logarithmically transforms the result (step 603). The logarithmically transformed values are also stored in the data buffer 30 for use in later processing. Steps 602 and 603 are performed for all selected background candidate values (for example, 10).

[0060] Next, the logarithmically transformed values for a given background candidate value are compared with the corresponding standard values, which are calculated by the following method and stored in the data buffer 30, and an index indicating the difference between them is calculated (step 604). In the present embodiment, the standard values are obtained as follows.

[0061] Since the quantiles have a range, the following numerical values are calculated in order to correct the statistical median value.

[0062] $$ m(i) = \frac{i - 0.3175}{n + 0.365} $$

where n is the number of data and i is a natural number from 1 to n.

Next, the inverse function F^{-1} of the normal distribution function is applied to each m(i) obtained. Each resulting value is the standard value corresponding to a data value.
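The calculation of the standard values can be sketched as follows. This is only an illustration: it uses the standard normal distribution from the Python standard library as the normal distribution function F, and the function name `standard_values` is hypothetical.

```python
from statistics import NormalDist

def standard_values(n):
    """Return F^{-1}(m(i)) for i = 1..n, with m(i) = (i - 0.3175) / (n + 0.365)."""
    normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [normal.inv_cdf((i - 0.3175) / (n + 0.365)) for i in range(1, n + 1)]

values = standard_values(5)
```

Because m(i) + m(n + 1 - i) = 1, the standard values are symmetric about zero, and for odd n the middle value is zero.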

[0063] For each background candidate value, the unevenness detection unit 32 calculates, for example, the sum of the absolute values of the differences or the sum of the squares of the differences. The value obtained is the index representing the difference for that background candidate value. Of course, the “r” of the least-squares method may also be used as the index of the difference. In practice, using the least-squares “r” is desirable for obtaining a highly accurate background value.

[0064] Next, in accordance with an instruction from the unevenness detection unit 32, the image generation unit 36 generates, for example, a graph with the background candidate value on the horizontal axis and the index indicating the difference on the vertical axis, and displays it on the screen of the display device 16 (step 605).

The operator refers to the graph displayed on the screen of the display device 16 and selects a desired range of background candidate values or a background value (step 606). If the selected value is considered satisfactory as the background value (Yes in step 607), the process ends. Otherwise, a predetermined number of new background candidate values are determined from the newly selected, narrower range of background candidate values (step 608), and the processing of steps 602 to 607 is repeated. The new background candidate values may be obtained by dividing the interval between the start and end points of the range at equal intervals, or in an equivalent manner. The finally obtained background value is stored in the result storage unit 38.
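The search over background candidate values (steps 601 to 606) can be sketched as below. Several points are assumptions made only for this illustration: the least-squares “r” is read as the correlation coefficient between the sorted logarithmic values and the standard values, the index of the difference is taken as 1 - r, and the operator's visual choice is replaced by picking the candidate with the smallest index. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def standard_values(n):
    normal = NormalDist()
    return np.array([normal.inv_cdf((i - 0.3175) / (n + 0.365))
                     for i in range(1, n + 1)])

def difference_index(x, gamma):
    """1 - r between sorted log(x - gamma) and the normal standard values."""
    x = np.sort(np.asarray(x, dtype=float))
    q = standard_values(x.size)           # one standard value per rank
    keep = x > gamma                      # the logarithm is undefined otherwise
    if keep.sum() < 3:
        return float("inf")               # too few points to judge
    r = np.corrcoef(np.log(x[keep] - gamma), q[keep])[0, 1]
    return 1.0 - r

def best_background(x, candidates):
    """Assumed automation of step 606: take the candidate with the smallest index."""
    scores = [difference_index(x, g) for g in candidates]
    return candidates[int(np.argmin(scores))], scores

# Synthetic data that are exactly lognormal once a background of 50 is removed.
x = 50.0 + np.exp(standard_values(201))
candidates = np.linspace(0.0, 50.0, 11)   # 0, 5, ..., 50
gamma, scores = best_background(x, candidates)
```

In this synthetic case the index is essentially zero at the true background and larger for every other candidate. Narrowing the range and repeating, as in step 608, would amount to calling `best_background` again with a finer grid around the selected value.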

Next, the calculation of the remaining parameters will be described. In general, for a lognormal distribution, the mean of the logarithmic data is used as μ (the characteristic value of the central tendency) and the standard deviation as σ (the characteristic value of the fluctuation). However, in data obtained from a DNA chip, large signal intensities (relatively large data values) are measured accurately, whereas small signal intensities (relatively small data values) contain relatively large noise. Because values that become negative due to noise cannot be used, many of these weak signals are discarded. In such a case, the conventional calculation method cannot be used.

[0067] Usually, the characteristic value of the central tendency is obtained as an average. However, the average is not a so-called robust estimator; in particular, when weak signals are selectively dropped, it is calculated too high. In such cases the median is known to be more effective.

Likewise, the characteristic value of the fluctuation is usually expressed as a standard deviation. However, the standard deviation is not robust either: when weak signals are selectively dropped as described above, it is calculated too small. A known robust alternative is iqr, which obtains the characteristic value of the fluctuation from the interquartile range.

[0069] However, the median is obtained from a single point in the data, and iqr from only two points. This becomes a serious problem particularly when the data come from a small number of spots or when the number of data available for correction is limited. Therefore, the present embodiment adopts the following method, which calculates the parameters with high accuracy even when the number of data is relatively limited.

FIG. 7 is a flowchart showing the parameter calculation process according to the present embodiment. As shown in FIG. 7, the unevenness detection unit 32 obtains the ideal values and the data values (signal intensities) from which the calculated background value has been subtracted (step 701). The ideal values correspond to the standard values obtained in step 604 above.

Next, in accordance with an instruction from the unevenness detection unit 32, the image generation unit 36 generates a graph with the ideal value on the horizontal axis and the data value based on the actual measurement (that is, the data value (signal intensity) from which the background value has been subtracted) on the vertical axis, and displays it on the screen of the display device (step 702). If the data values based on the actual measurements were exactly lognormally distributed, this graph would almost coincide with y = x. In practice, however, the graph obtained by plotting the measured values has a slope (a) and a y-intercept (b) that differ from those of y = x, and it loses linearity as x decreases.

[0072] Nevertheless, part of this graph can be recognized as a substantially straight line (for example, the portion where x is positive is often straight). Therefore, in this embodiment, when the operator refers to the graph and operates the input device to specify a range judged to be linear (step 703), the relation between the measured values and the ideal values in the specified range is fitted with a linear expression using, for example, the least-squares method. In the resulting linear expression “ax + b”, the slope “a” corresponds to the characteristic value “σ” of the fluctuation, and the y-intercept “b” corresponds to the characteristic value “μ” of the central tendency (step 704).

[0073] For example, the image generation unit 36 of the processing device 10 may use the obtained “a” and “b” to generate a graph with the ideal value on the horizontal axis and the corrected measured value z = (log(x - γ) - μ) / σ on the vertical axis, and display it on the screen of the display device. If the operator is not satisfied with the displayed graph (No in step 705), the process returns to the range specification in the original graph, and the processing from step 703 onward is executed again.

[0074] On the other hand, if the graph is satisfactory (Yes in step 705), the previously obtained “γ” as the background value, the intercept “b” as “μ”, and the slope “a” as “σ” are associated with information specifying the DNA chip and stored in the result storage unit 38 (step 706). Using the parameters obtained in this way, each data value obtained from the DNA chip can be standardized with the equation

$$ z = \frac{\log(x - \gamma) - \mu}{\sigma} $$

The unevenness detection unit 32 standardizes the data values (signal intensities) of the DNA chip using the calculated parameters “γ”, “μ”, and “σ”, and stores the standardized data values in the data buffer 30 (step 707).
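The parameter calculation of steps 701 to 707 can be sketched as follows. As an assumption for this illustration, the range that the operator would judge to be linear is fixed to the portion where the ideal value is positive (the text notes that this portion is often straight), and the least-squares fit is done with `numpy.polyfit`. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def ideal_values(n):
    normal = NormalDist()
    return np.array([normal.inv_cdf((i - 0.3175) / (n + 0.365))
                     for i in range(1, n + 1)])

def fit_mu_sigma(x, gamma):
    """Fit log(x - gamma) = sigma * ideal + mu on the part where ideal > 0."""
    x = np.sort(np.asarray(x, dtype=float))
    ideal = ideal_values(x.size)
    linear = (ideal > 0) & (x > gamma)       # assumed linear range (step 703)
    sigma, mu = np.polyfit(ideal[linear], np.log(x[linear] - gamma), 1)
    return mu, sigma                         # intercept b, slope a (step 704)

def standardize(x, gamma, mu, sigma):
    """z = (log(x - gamma) - mu) / sigma; values with x <= gamma are not handled."""
    x = np.asarray(x, dtype=float)
    return (np.log(x - gamma) - mu) / sigma

# Synthetic chip with gamma = 20, mu = 3.0 and sigma = 0.8.
x = 20.0 + np.exp(3.0 + 0.8 * ideal_values(101))
mu, sigma = fit_mu_sigma(x, 20.0)
z = standardize(x, 20.0, mu, sigma)
```

Fitting only the upper part of the plot is what makes the estimate insensitive to weak signals that are noisy or missing.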

[0075] As described above, calculating an appropriate background value eliminates the influence of noise, and obtaining the characteristic values of the central tendency and of the fluctuation from the straight-line portion of the graph of measured values makes the standardization more robust. The unevenness detection unit 32 calculates the characteristic value σ of the fluctuation for each DNA chip and stores it in the data buffer 30. These values serve as the standard deviation σ_n of the data values of each DNA chip.

[0077] In the next step 310, the unevenness detection unit 32 reads the standardized data of a DNA chip stored in the data buffer 30 (step 311). Next, the average μ_mi (i = 1, ..., M) of the standardized signal intensities of the spots in each small area is calculated (step 312). This yields M averages. Alternatively, the position of the small area may be shifted by one spot (or several spots) at a time in the vertical or horizontal direction, and the average of each shifted small area (a moving average) may be obtained. Next, the unevenness detection unit 32 calculates the standard deviation σ_ms of the obtained averages μ_mi according to the following formula (step 313).

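[0078] $$ \sigma_{ms}^2 = \frac{1}{M-1} \sum_{i=1}^{M} \left( \mu_{mi} - \bar{\mu}_m \right)^2 = \frac{1}{M-1} \left[ \sum_{i=1}^{M} \mu_{mi}^2 - M \bar{\mu}_m^2 \right] $$

where \bar{\mu}_m denotes the average of the μ_mi over i = 1 to M.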
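Steps 312 and 313 can be sketched as follows, assuming that the standardized data are held in a 2-D NumPy array and that the small areas are non-overlapping square blocks; incomplete blocks at the edges are simply dropped in this sketch. The function name `block_mean_sd` is hypothetical.

```python
import numpy as np

def block_mean_sd(z, block):
    """Return (block means, sigma_ms) for non-overlapping block x block areas."""
    z = np.asarray(z, dtype=float)
    rows = (z.shape[0] // block) * block     # drop incomplete edge areas
    cols = (z.shape[1] // block) * block
    tiles = z[:rows, :cols].reshape(rows // block, block, cols // block, block)
    means = tiles.mean(axis=(1, 3)).ravel()  # mu_mi, i = 1..M
    sigma_ms = means.std(ddof=1)             # divisor M - 1, as in the formula
    return means, sigma_ms

z = np.array([[1, 1, 2, 2],
              [1, 1, 2, 2],
              [3, 3, 4, 4],
              [3, 3, 4, 4]])
means, sigma_ms = block_mean_sd(z, 2)
```

For this toy array the four block means are 1, 2, 3 and 4, so both forms of the formula give a variance of 5/3. The moving-average variant mentioned in the text would replace the reshape with a sliding window.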

Next, the unevenness detection unit 32 determines whether the standard deviation σ_ms calculated in step 313 is appropriate (step 320). Here, the experimenter sets a criterion for a significant difference, approximated by a chi-square (χ²) distribution with M degrees of freedom, and inputs it to the processing device 10. For example, the acceptable probability of a one-sided test (for example, 5%) can be used as the criterion. The criterion is stored in a storage device (for example, the hard disk device 24). The unevenness detection unit 32 reads the criterion from the storage device (step 321) and determines the significance level (χ²) of the chi-square distribution with M degrees of freedom at the probability corresponding to the one-sided criterion (step 322). Next, the unevenness detection unit 32 compares σ_ms with χ²; if “σ_ms > χ²” (Yes in step 323), it determines that unevenness is present and stores a value (for example, a flag) indicating that the original data contain unevenness in, for example, the data buffer 30 in association with the original data or the standardized data (step 324). If the answer in step 323 is No, the process continues with the standardized data of the next DNA chip (see step 325).

[0079] In this way, it can be determined whether the original data of a DNA chip contain unevenness. The correction of data in which unevenness is present will be described later.

[0080] [Processing for the second method]

Next, a second technique for detecting unevenness will be described. In the second method, as in the first, the standard deviation σ_ms of the average signal intensity μ_m is calculated. The distribution of σ_ms is then approximated by a normal distribution, and the validity of the standard deviation σ_ms is judged. In FIG. 4, the original data standardization process (step 400) and the standard deviation σ_ms calculation process (step 410) are the same as steps 300 and 310 in FIG. 3, respectively.

[0081] Parameters used in the second method are as follows.

[0082] E(σ_ms): expected value of the standard deviation σ_ms

Δσσ: E(σ_ms) - σ_ms (expected value - calculated value)

σ_σms: standard deviation of the σ_ms of each DNA chip

N: number of spots included in a small area

σ_ρ: standard deviation of the logarithm of the mRNA population

In the second method, unevenness is determined by determining whether or not the difference (Δ σ σ) between the expected value of the standard deviation σ ms and the calculated standard deviation (measured value) σ ms is greater than a predetermined reference value. The presence or absence of is detected.

In the determination of the validity of the calculated standard deviation σ_ms (step 420), the unevenness detection unit 32 first calculates the expected value E(σ_ms) of the standard deviation σ_ms (step 421). By the central limit theorem, E(σ_ms) can be calculated from the standard deviation σ_ρ of the logarithm of the population and the number of spots N according to the following formula.

[0084] $$ E(\sigma_{ms}) = \frac{\sigma_{\rho}}{N^{1/2}} $$

Next, the unevenness detection unit 32 calculates Δσσ (= E(σ_ms) - σ_ms) (step 422). Thereafter, the unevenness detection unit 32 calculates the expected value E(σ_σms) of σ_σms according to the following equation (step 423).

[0085] $$ E(\sigma_{\sigma ms}) = \frac{\sigma_{\rho}}{N^{1/2}\,\pi^{1/4}} $$

The unevenness detection unit 32 compares Δσσ calculated in step 422 with a predetermined reference value, for example 2 * E(σ_σms). If “Δσσ > 2 * E(σ_σms)” (Yes in step 424), it determines that unevenness is present and stores a value (for example, a flag) indicating that the original data contain unevenness in the data buffer 30 in association with the standardized data or the original data (step 425). If the answer in step 424 is No, the process continues with the standardized data of the next DNA chip (see step 426).
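The two expected values and the comparison of steps 421 to 424 can be sketched as below. One point is an assumption of this sketch and not taken from the text: the difference is computed as the measured σ_ms minus its expected value, so that an increase of the standard deviation (the sign of unevenness named in the principle above) gives a positive value. The text itself writes Δσσ as E(σ_ms) - σ_ms. The function name and the multiplier argument `k` are hypothetical.

```python
import math

def second_method(sigma_ms, sigma_pop, n_spots, k=2.0):
    """Return (uneven?, E(sigma_ms), E(sigma_sigma_ms)) for one DNA chip."""
    expected = sigma_pop / math.sqrt(n_spots)    # E(sigma_ms), step 421
    expected_sd = expected / math.pi ** 0.25     # E(sigma_sigma_ms), step 423
    # ASSUMPTION: sign chosen so that an increase of sigma_ms is positive.
    excess = sigma_ms - expected
    return excess > k * expected_sd, expected, expected_sd

uneven, expected, expected_sd = second_method(0.70, sigma_pop=1.0, n_spots=16)
even, _, _ = second_method(0.30, sigma_pop=1.0, n_spots=16)
```

With σ_ρ = 1 and N = 16 the expected value is 0.25 and the reference value 2 * E(σ_σms) is about 0.376, so a measured σ_ms of 0.70 is flagged and 0.30 is not.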

[0086] [Principle of the third method]

Next, a third technique for detecting unevenness will be described. The present inventors have found that, when the original data of a DNA chip are standardized by the processing shown in FIGS. 6 and 7, the data follow a lognormal distribution whose standard deviation is constant between experiments (see Patent Document 1 and Non-Patent Document 3). This standard deviation is constant and characteristic of cells in all states; moreover, most of an experiment is performed on the same tissue (which contains particular cells in particular proportions, and whose composition does not change), so the standard deviation is expected to be constant.

[0087] However, if hybridization is non-uniform and the data are therefore uneven on the chip, the distribution is distorted. For example, if unevenness halves the signal intensity over half the area of the chip, the distribution of values is expected to be bimodal, with two peaks. As a result, the calculated standard deviation becomes larger.

[0088] When a standard deviation larger than the expected value (a robust average of the standard deviations of the data from the other chips) is calculated, a parametric test can be used as a guideline for judging whether it arose by chance or from unevenness. That is, when the distribution of the variance can be predicted, the probability that the calculated variance would occur can be calculated. When this probability falls below a predetermined probability, the value is judged to be caused by unevenness rather than by chance.

[0089] [Process for third method]

The parameters used in the third method are as follows.

[0090] σ_n: standard deviation of the data values of each DNA chip; it corresponds to σ_ρ in the second method.

[0091] Med σ: median of the standard deviations in the experiment

E(σ_n): expected value of σ_n, considered equal to Med σ in this method

[0092] σ_σn: standard deviation of σ_n between chips

E(σ_σn): expected value of σ_σn

NC: number of spots on the DNA chip

FIG. 5 is a flowchart showing the processing executed by the unevenness detection unit 32 based on the third method. As shown in FIG. 5, the unevenness detection unit 32 first reads the original data of the DNA chip to be processed from the data buffer 30 (step 501). Next, the unevenness detection unit 32 standardizes the original data (step 502). Standardization can be realized by the processes shown in FIGS. 6 and 7.

Subsequently, the unevenness detection unit 32 obtains the median Med σ of all the obtained standard deviations σ_n (step 503).

After the standard deviations σ_n and the median Med σ have been obtained, it is determined whether the standard deviation σ_n obtained for each DNA chip is appropriate. In the third method, the distribution of σ_n is approximated by a normal distribution. The expected value E(σ_n) of its mean is considered equal to Med σ. The expected value E(σ_σn) of its standard deviation can be obtained by the following formula.

[0095] $$ E(\sigma_{\sigma n}) = \frac{\mathrm{Med}\,\sigma}{NC^{1/2}\,\pi^{1/4}} $$

The unevenness detection unit 32 obtains the expected value E(σ_σn) of the standard deviation using the above formula (step 504) and thereafter calculates, for each DNA chip,

$$ \Delta\sigma_n = \sigma_n - \mathrm{Med}\,\sigma $$

(step 505). The unevenness detection unit 32 compares Δσ_n of the DNA chip with a predetermined value (2 * E(σ_σn) in this embodiment), and if “Δσ_n > 2 * E(σ_σn)” (Yes in step 506), it determines that unevenness is present. In this case, the unevenness detection unit 32 stores a value (for example, a flag) indicating that the original data of the DNA chip contain unevenness in, for example, the data buffer 30 in association with the original data (step 507). If the answer in step 506 is No, the process returns to step 505 to examine the next DNA chip (see step 508).

In the present embodiment, 2 * E(σ_σn) is set as the predetermined value for determining the presence or absence of unevenness. With this value, as with “2σ” in a normal distribution, values on the large side and on the small side each fall outside the limit with a probability of about 2.3 percent. For this reason, about 5 out of 100 DNA chips can be expected to be judged uneven.

On the other hand, when 3 * Ε (σ σ η) is set as the predetermined value, it becomes about 0.1 percent on the large value side and the small value side, respectively. Therefore, it is desirable that the predetermined value can be set as desired by the operator.
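The third method (steps 503 to 507) can be sketched as follows, with the multiplier left as a parameter so that the operator can choose 2, 3, or another value. The function name `flag_uneven_chips` and the sample values are hypothetical.

```python
import math
from statistics import median

def flag_uneven_chips(sigma_n, n_spots, k=2.0):
    """Flag chips whose sigma_n exceeds Med sigma by more than k * E(sigma_sigma_n)."""
    med = median(sigma_n)                                       # Med sigma, step 503
    expected_sd = med / (math.sqrt(n_spots) * math.pi ** 0.25)  # step 504
    flags = [s - med > k * expected_sd for s in sigma_n]        # steps 505 and 506
    return flags, med, expected_sd

# Five hypothetical chips with 10,000 spots each; the last one is too broad.
flags, med, expected_sd = flag_uneven_chips([1.00, 1.01, 0.99, 1.00, 1.20], 10000)
```

Here Med σ is 1.00 and 2 * E(σ_σn) is about 0.015, so only the last chip is flagged.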

[0098] [Process for the fourth method (Figure 24 and Figure 25)]

Further, in the present embodiment, dirty data can be searched for by using the normal distribution of the background readings, or of measurement values prepared for that purpose.

[0099] For this purpose, for example, the following measurement values are prepared (see FIG. 24 and FIG. 25).

[0100] A place where a square matrix with the same pitch is inserted at an oblique position between spots when the spot is a square matrix.

[0101] Circumference that surrounds the spot with sufficient distance from the spot.

[0102] When setting the safety level, it is better to calculate it properly from the number of spots than to use the customary 2σ. Next, while checking normality, look for outliers and visualize whether each value follows the distribution or departs from it. A probability plot or Q-Q plot is used for this. As a robust way of standardizing the values, the values may be sorted in ascending order and the points needed for standardization found by counting from the low side. In addition, spots that are physically close to a spot rejected by the above judgment are often affected by the same dirt. To reject these data as well, a margin is set. Specifically, spot data within a certain distance of a spot rejected in the above judgment, for example within a distance of two spots, are also rejected.
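The margin around rejected spots can be sketched as below. The sketch assumes that the rejected spots are given as a 2-D boolean NumPy array and that the distance is counted in spots along rows and columns (a square neighbourhood); the text does not specify the distance measure. The function name `expand_rejection` is hypothetical.

```python
import numpy as np

def expand_rejection(mask, margin):
    """Also reject every spot within `margin` spots of a rejected spot."""
    mask = np.asarray(mask, dtype=bool)
    rows, cols = mask.shape
    padded = np.pad(mask, margin, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together all shifted copies of the mask within the square window.
    for dr in range(2 * margin + 1):
        for dc in range(2 * margin + 1):
            out |= padded[dr:dr + rows, dc:dc + cols]
    return out

rejected = np.zeros((5, 5), dtype=bool)
rejected[2, 2] = True                      # one spot rejected as dirty
with_margin_1 = expand_rejection(rejected, 1)
with_margin_2 = expand_rejection(rejected, 2)
```

A margin of one spot turns the single rejected spot into a 3 x 3 block, and a margin of two spots covers the whole 5 x 5 example.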

[0103] [Current methods and their problems]

Dust is unavoidable in experiments, and unless the affected data are removed at an early stage of the analysis, they hinder the data analysis work. For this reason, a method of trimming the top and bottom ends of the data themselves has conventionally been adopted. This method is of course meaningless, because genuinely strong and weak data exist.

[0104] Alternatively, a method of rejecting values that change greatly across several repeated experiments is also used. In this case, however, the judgment can be made only after many experiments have been performed. Furthermore, this approach is likely to reject clones whose expression has changed greatly and that are therefore of high biological interest, which defeats the effort to find such clones.

[0105] Furthermore, it is common to visualize the data and look for dirt by eye. In this case, even obvious dirt is often overlooked, because no objective criterion can be prepared.

[0106] [Dirt-finding principle]

In this embodiment, dirt is judged from the distribution of so-called “background measurement values”. The background measurement value here is a value actually measured, as distinct from the background γ of the three-parameter lognormal distribution in FIGS. 6 and 7.

[0107] For example, with an Agilent DNA chip, a portion that does not overlap any spot is cut out using an algorithm called a cookie cutter. The background measurement is taken at a location between DNA spots that does not overlap any spot; a location exactly at the center between diagonally adjacent spots overlaps no spot. As a result, there are usually about as many background measurements as spots.

[0108] These measured background values are normally almost identical. However, if shining dirt is present, the measured value rises by the contribution of that dirt. Since the background is normally distributed, a location with an abnormally high value can be found because it departs from the distribution.

[0109] The background is considered to be a noise component. A background measurement is made up of a number of pixels, each influenced by the reader and by the hybridization, and these influences can be regarded as random. The background measurement can therefore be regarded as the sum of these random effects, and a sum of random values is normally distributed. In fact, background values are roughly normally distributed.

[0110] Dust adhering where there is no DNA, or hybridization solution remaining on the surface, disturbs this randomness and biases the values upward, which disturbs the normal distribution. Dirt can thus be discovered as something that deviates from, or disturbs, the distribution.

[0111] [Actual method]

The first method is to set a rejection level that takes the number of spots into account. 2σ is often used as such a level, but this value causes too much data to be rejected. Because the background values are normally distributed, the probability that a given value occurs can be predicted. For example, for a DNA chip with 20,000 data values, to reject an expected 10 spots, let f⁻¹ be the inverse function of the probability density function of the normal distribution and calculate f⁻¹(10/20000).

[0112] Rejecting only what exceeds this value avoids losing data unnecessarily. In fact, data that would be rejected at 2σ show acceptable reproducibility.
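The cutoff of the first method can be sketched as follows. This is only an illustration: it interprets f⁻¹ as the inverse cumulative distribution function of the normal distribution (an assumption, since the text speaks of the density function), and the function name `background_rejection_threshold` is hypothetical.

```python
from statistics import NormalDist


def background_rejection_threshold(mean, sd, n_values, expected_rejections):
    """Return the upper cutoff that about `expected_rejections` of
    `n_values` normally distributed background values exceed by chance."""
    tail_probability = expected_rejections / n_values
    # Only abnormally high values are of interest, so use the upper tail.
    return NormalDist(mean, sd).inv_cdf(1.0 - tail_probability)


# Figures from the text: 20,000 data values, about 10 expected rejections.
z_cut = background_rejection_threshold(0.0, 1.0, 20000, 10)
# Fraction of values a plain 2-sigma cutoff would reject on the high side.
two_sigma_tail = 1.0 - NormalDist().cdf(2.0)
```

With standardized background values the cutoff lands well above 2σ, which is why far fewer values are rejected than with the conventional level.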

[0113] The second method is to strictly check the normality of the distribution using probability plots (see Fig. 25).

[0114] Find the number of data to be rejected by trial and error. Let this number be X.

[0115] For (number of measured data - X) data, prepare the theoretically expected values as z values. The background data are then standardized in a robust manner and sorted.

[0116] Sort the data, count (number of measured data - X) values from the bottom, and take these as the Z2 values. Compute the IQR or MAD from the bottom in the same way.

[0117] Since outliers are always high, the robust statistics are computed by counting from the low side. The normal probability plot is made by aligning the zero of the theoretical values with the zero point of the standardized data. Judge whether the plot is sufficiently linear, and change X if necessary.

[0118] Check the distribution, and reject the part that deviates from it by more than a certain amount.
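The linearity check of the second method might be sketched as below. The correlation coefficient used as the linearity measure, the MAD scaling constant 1.4826, and the names `probability_plot_linearity` and `_pearson` are assumptions made for illustration; the text only asks for a judgment of whether the plot is sufficiently linear.

```python
from statistics import NormalDist, median


def probability_plot_linearity(values, n_reject):
    """Correlation of a normal probability plot after dropping the
    `n_reject` highest values (X in the text). Near 1.0 means linear."""
    n_total = len(values)
    kept = sorted(values)[: n_total - n_reject]
    # Robust standardization: median and MAD instead of mean and SD.
    center = median(kept)
    mad = median(abs(v - center) for v in kept)
    observed = [(v - center) / (1.4826 * mad) for v in kept]
    # Theoretical z values for the lowest ranks of the full sample.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n_total)
                   for i in range(len(kept))]
    return _pearson(theoretical, observed)


def _pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Synthetic background: 195 clean values plus 5 bright "dirt" values.
clean = [NormalDist(100.0, 5.0).inv_cdf((i + 0.5) / 200) for i in range(195)]
background = clean + [200.0] * 5
r_without_rejection = probability_plot_linearity(background, 0)
r_with_rejection = probability_plot_linearity(background, 5)
```

In this toy data the plot becomes linear once X equals the number of contaminated values, which is the condition the operator looks for when adjusting X.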

[0119] [Cut off]

As with any parametric method, the choice of significance level for judging outliers from the distribution of background values is arbitrary. In general, if the significance level is set high, the number of rejected data decreases, and if it is set low, the number of rejected data increases. However, a high rejection rate does not necessarily mean greater safety. Lowering the level rejects a larger amount of data, yet the non-reproducible data are not effectively reduced. This is thought to be because some dirt affects the spots without affecting the background. In that case, many data with low reproducibility lie just outside a group of spots with high background, at locations where the background values themselves show few problems. What matters in this case is whether there are rejected spots in the vicinity.

[0120] Therefore, by finding a cluster with a high background and also rejecting the data just outside the cluster that touch it, non-reproducible data can be avoided (see Figure 26).
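Rejecting a high-background cluster together with the data that touch it can be sketched as a one-cell expansion of a Boolean mask. The function name `reject_with_margin` is hypothetical, and treating diagonal neighbors as "touching" is an assumption.

```python
def reject_with_margin(high_background):
    """Mark every high-background cell and each cell that touches one.
    `high_background` is a rectangular list of lists of booleans."""
    n_rows, n_cols = len(high_background), len(high_background[0])
    rejected = [[False] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if not high_background[r][c]:
                continue
            # Reject the cell itself and its eight neighbors.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        rejected[rr][cc] = True
    return rejected


# One high-background location in a 4 x 4 layout of spots.
mask = [[False] * 4 for _ in range(4)]
mask[1][1] = True
rejected = reject_with_margin(mask)
n_rejected = sum(sum(row) for row in rejected)
```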

[0121] As described above, whether or not the DNA chip is uneven can be detected by the first to third methods. The fourth method is dedicated to detecting unevenness caused by dirt, together with the location of the dirt. The problem can therefore be solved by rejecting all the data at the relevant location.

[0122] Next, a description is given of how to identify the form and position of unevenness other than dirt, and of how to cope with (correct) the unevenness once its form and position have been identified.

[0123] [In the case of unevenness with gradation]

A DNA chip may show unevenness that has a gradation. This is known to be caused by, for example, fading of fluorescent dyes. Unevenness with a gradient can also arise from the physical accuracy of the DNA chip, such as curvature of the cover glass. For a DNA chip with such unevenness, small areas are set in a shape that conforms to the gradation (for example, an elongated shape). The unevenness quantified for each small area can then be grasped as a “continuous bending of a plane” or a “curved surface”. The unevenness can be corrected by generating an equation representing this curved surface and obtaining its inverse function (see FIGS. 18 and 19).

[0124] To quantify the gradation, a representative value must be found for each small area. It is desirable that the shape of the small area follow the gradation; otherwise, the gradation is averaged out. Making the area smaller eliminates the averaging problem, but instead increases the risk of being affected by accidental fluctuations. For this reason, it is desirable to set as small a region as possible, and the shape is important.

[0125] Ideally, the small region should have a shape that takes in as much data of the same tendency as possible, and the shape should be expressed as a function of position on the DNA chip. For example, a plurality of position functions are prepared, and an appropriate one is selected from the prepared candidates. The parameters of the function are also tuned. The standard deviation of the data values in a small area can be used as an index for this purpose. The shape that tends to give the smallest standard deviation can be regarded as the enclosure that contains the most data with the same tendency.

[0126] [Detection and correction of unevenness with gradation (using pixel data values)]

Hereinafter, detection of unevenness with gradation and the correction processing in that case will be described. When pixel data are acquired as raw data, the value of one spot is composed of several pixels by several pixels (for example, 4 pixels x 4 pixels or 5 pixels x 5 pixels), so processing using the data value of each pixel is possible.

[0127] The parameters used in the processing using the pixel data value are as follows. In the following description, “blank” means that the data value itself is missing.

[0128] NN: Number of pixels in the small area

Here, in a small area, it is desirable to include at least nine spots.

[0129] μ NN: Median value of data values excluding pixels in the small area whose value is blank

Here, it is calculated only when the small area contains at least 9 non-blank pixels. If the number of non-blank pixels is less than 9, the average value of μ NN of all small areas adjacent to the small area is used.

[0130] σ NN: Standard deviation of the data value excluding pixels in the small area whose value is blank

Again, this is only calculated if the small area contains at least 9 non-blank pixels. If the number of non-blank pixels is less than 9, σ NN is treated as blank.

[0131] Med σ NN: Median of σ NN, excluding σ NN treated as blank

As shown in FIG. 8, first, the unevenness detection unit 32 reads the original data as pixel data values from the data buffer 30 (step 801). The unevenness detection unit 32 keeps the data values only of the pixels that correspond to spots in the original data, and leaves the data values of the remaining pixels blank (step 802). The pixel data values are then standardized (step 803). Standardization is similar to that described with reference to Figs. Pixel data values are generally expected to follow the three-parameter lognormal distribution described above.

[0132] The operator selects, for example, one having a desired shape from the small region candidates displayed on the screen of the display device 16 of the processing device 10 (step 804). The shape of this small region candidate may be a square or a rectangle. It is desirable that the size can be changed freely through parameter settings made by the operator.

[0133] For example, if the pixel data values are displayed on the screen of the display device 16 of the processing device 10 as changes in color, the operator can refer to the color changes and choose the small region candidate shape that seems appropriate. Furthermore, the small region candidates can be arranged so as to cover the DNA chip at a slight inclination. It is desirable that the operator orient the small region candidates along the gradient (that is, along the direction perpendicular to the direction in which the shading changes), so that the pixels within a small area keep a roughly constant data value. Such functions will be described in detail later.

[0134] The unevenness detection unit 32 arranges small region candidates of the shape, size, and orientation set by the operator so that they cover the DNA chip without gaps, and identifies the pixels belonging to each small region candidate (step 805). Next, the unevenness detection unit 32 calculates the standard deviation σ NN of the standardized pixel data values for each small region candidate (step 806). As described above, σ NN is calculated only when the small region candidate includes at least 9 non-blank pixels. If the number of non-blank pixels is less than 9, the average value of μ NN of all small region candidates adjacent to the small region candidate is used.

[0135] When the standard deviation σ NN has been calculated for all the small region candidates, their median Med σ NN is calculated (step 807). The processing of steps 804 to 807 is repeated as many times as the operator desires (step 808). The operator sets small region candidates of various shapes, sizes, or orientations, and obtains the median Med σ NN for each of them.

[0136] The unevenness detection unit 32 compares the medians Med σ NN of the respective small region candidates, identifies the small region candidate with the smallest Med σ NN (step 809), and stores its shape, size, and orientation in the result storage unit 38 (step 810).
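The selection of steps 805 to 809 can be sketched as follows. This simplified illustration assumes axis-aligned rectangular candidates described only by (height, width) in pixels, so orientation is not modeled; blank pixels are represented by `None`, tiles with fewer than 9 values are simply skipped, and the names `med_sigma` and `best_candidate` are hypothetical.

```python
from statistics import median, stdev


def med_sigma(grid, tile_h, tile_w, min_count=9):
    """Median over tiles of the per-tile standard deviation.
    `grid` holds standardized values, with None for blank pixels."""
    n_rows, n_cols = len(grid), len(grid[0])
    sigmas = []
    for r0 in range(0, n_rows, tile_h):
        for c0 in range(0, n_cols, tile_w):
            values = [grid[r][c]
                      for r in range(r0, min(r0 + tile_h, n_rows))
                      for c in range(c0, min(c0 + tile_w, n_cols))
                      if grid[r][c] is not None]
            # Tiles with fewer than `min_count` values are treated as blank.
            if len(values) >= min_count:
                sigmas.append(stdev(values))
    return median(sigmas) if sigmas else None


def best_candidate(grid, candidates):
    """Return the (tile_h, tile_w) candidate with the smallest Med sigma."""
    scored = [(med_sigma(grid, h, w), (h, w)) for h, w in candidates]
    scored = [item for item in scored if item[0] is not None]
    return min(scored)[1]


# Toy chip whose value changes only along the column direction.
chip = [[float(c) for c in range(12)] for _ in range(12)]
chip[0][0] = None  # a blank pixel
best = best_candidate(chip, [(12, 1), (1, 12), (3, 4)])
```

In the toy chip the candidate that runs along the direction of constant value wins, which matches the guidance that the small area should follow the gradation.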

[0137] [Detection and correction of unevenness with gradation (using spot data values)]

When the data indicating the signal intensity for each spot is acquired as the original data, the processing proceeds in units of spots. In such data, the value corresponding to the signal intensity is set as the data value corresponding to the spot, and the data value corresponding to the other space of the DNA chip is blank. In this case, the parameters used in the processing are as follows.

[0138] ND: Number of spots in small area

Here, in a small area, it is desirable to include at least nine spots.

[0139] μ ND: Median of the spot data values in the small area, excluding those that are blank

Here, it is calculated only when the small area contains at least 9 non-blank data values. If the number of non-blank data values is less than 9, the average value of μ ND of all small areas adjacent to the small area is used.

[0140] σ ND: Standard deviation of data values excluding blank data values in the small area

Again, this is calculated only when the small area contains at least 9 non-blank data values. If the number of non-blank data values is less than 9, σ ND is treated as blank.

[0141] Med σ ND: Median of σ ND, excluding σ ND treated as blank

As shown in FIG. 9, first, the unevenness detection unit 32 reads the data values of the original data from the data buffer 30 (step 901). The unevenness detection unit 32 standardizes the data values of the original data (step 902). Standardization is similar to that described with reference to FIGS. As a result, the data values are generally expected to follow the three-parameter lognormal distribution described above.

[0142] The operator selects, for example, one having a desired shape from the small region candidates displayed on the screen of the display device 16 of the processing device 10 (step 903). The shape of this small region candidate may be a square or a rectangle. It is desirable that the size can be changed freely through parameter settings made by the operator.

[0143] As in the method described with reference to FIG. 8, if, for example, the data values are displayed on the screen of the display device 16 of the processing device 10 as changes in color, the operator can refer to the color changes and choose the small region candidate shape that seems appropriate. Furthermore, the small region candidates can be arranged so as to cover the DNA chip at a slight inclination. It is desirable that the operator orient the small region candidates along the gradient (that is, along the direction perpendicular to the direction in which the shading changes), so that the data values within a small area stay roughly constant.

[0144] The unevenness detection unit 32 arranges small region candidates of the shape, size, and orientation set by the operator so that they cover the DNA chip without gaps, and identifies the spots belonging to each small region candidate (step 904). Next, the unevenness detection unit 32 calculates the standard deviation σ ND of the standardized data values for each small region candidate (step 905).

[0145] When the standard deviation σ ND has been calculated for all the small region candidates, their median Med σ ND is calculated (step 906). The processing of steps 903 to 906 is repeated as many times as the operator desires (step 907). The operator sets small region candidates of various shapes, sizes, or orientations, and obtains the median Med σ ND for each of them.

[0146] The unevenness detection unit 32 compares the medians Med σ ND of the respective small region candidates, identifies the small region candidate with the smallest Med σ ND (step 908), and stores its shape, size, and orientation in the result storage unit 38 (step 909).

[0147] When the small regions covering the DNA chip have been determined, the correction processing unit 34 determines a spatial correction function as shown in FIG. 10 (step 1001). In determining the spatial correction function, the small area function is referred to. The small area function can be thought of as a function that specifies the XY plane of the DNA chip, so a z-axis is introduced into this function to extend it into a function that represents a curved surface.

[0148] For example, the function representing the curved surface can be represented by the following general formula.

[0149] f (x) + g (y) + h (z) = c

Here, z is a correction term representing the distortion of space, and h (z) is a function that satisfies the formula as much as possible.

[0150] If the function cannot be estimated from the shape and placement of the small areas, several functions that specify the space may be prepared, and the operator may select one of them and adjust its parameters. In this case, as shown in FIG. 11, the correction processing unit 34 creates an equation for z using the median of the range of the constant c for each small region (step 1101). As a result, as many equations for z are obtained as there are small regions. The correction processing unit 34 obtains the function h (z) that best satisfies the obtained equations for z, together with the parameters of that function (step 1102). Next, the correction processing unit 34 uses the previously calculated NN or ND as the variable z of h (z) (step 1103).

[0151] When the spatial correction function has been obtained in this way, the correction processing unit 34 solves the spatial correction function for z (step 1002). This expresses z as a function of x and y. Next, the correction processing unit 34 calculates the value of z corresponding to each spot or pixel from the XY coordinates that specify its position (step 1003). After that, the correction processing unit 34 subtracts the z value corresponding to the position from the standardized spot or pixel data value (step 1004). In other words, this z is the correction coefficient. Each data value from which the corresponding correction coefficient has been subtracted is stored in the result storage unit 38 (step 1005).
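Steps 1002 to 1004 can be sketched under the simplest possible assumption: f, g, and h are all linear, so that the surface is a plane z = a + b·x + c·y fitted by least squares to the small-area medians. That choice, and the names `fit_plane`, `_det3`, and `subtract_surface`, are assumptions made for illustration; the text leaves the form of the functions open.

```python
def fit_plane(points):
    """Least-squares plane z = a + b*x + c*y through (x, y, z) points."""
    n = len(points)
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    sxz = sum(x * z for x, _, z in points)
    syz = sum(y * z for _, y, z in points)
    matrix = [[n, sx, sy], [sx, sxx, sxy], [sy, sxy, syy]]
    rhs = [sz, sxz, syz]
    det = _det3(matrix)
    coeffs = []
    for k in range(3):  # Cramer's rule on the normal equations
        m = [row[:] for row in matrix]
        for i in range(3):
            m[i][k] = rhs[i]
        coeffs.append(_det3(m) / det)
    return tuple(coeffs)


def _det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def subtract_surface(values, plane):
    """Subtract the fitted z(x, y) from each standardized data value."""
    a, b, c = plane
    return {(x, y): v - (a + b * x + c * y) for (x, y), v in values.items()}


# Small-area medians lying on a tilted plane (toy gradient).
medians = [(x, y, 0.2 + 0.1 * x - 0.05 * y) for x in range(3) for y in range(3)]
plane = fit_plane(medians)
corrected = subtract_surface({(x, y): z for x, y, z in medians}, plane)
```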

[0152] The unevenness detection unit 32 uses, for example, the above-described first method for unevenness detection (see Fig. 3) or the second method (see Fig. 4) to determine whether or not unevenness exists in the data values of the DNA chip (step 1006). If it is determined that unevenness exists (Yes in step 1007), the unevenness detection unit 32 determines whether or not the unevenness has improved compared with the state before correction (step 1008). In step 1008, it is determined whether or not Δ σ ms obtained by the first method and Δ σ σ obtained by the second method are smaller than before correction.

[0153] If it is determined NO in step 1008, the data on the DNA chip cannot be corrected, and these data are discarded (step 1009). On the other hand, if it is determined YES in step 1008, data indicating the possibility of remaining unevenness (for example, an unevenness remaining flag) is associated with the corrected data, so that an operator using the data can be notified that some unevenness may remain.

[0154] On the other hand, if it is determined that there is no unevenness (No in step 1007), the operator may use the corrected data for analysis processing.

[0155] [In the case of unevenness that appears at a certain position with a certain strength]

Unevenness may appear at specific positions on the DNA chip, where the signal intensity appears strong or weak. For example, it is known that bubbles in the hybridization solution may weaken or completely inhibit the signal of the portion in contact with a bubble. When the place where such unevenness has occurred can be identified and its influence can be predicted, the data can be corrected. Even if the influence is unpredictable, problems caused by incorrect data can be avoided by rejecting that part of the data.

[0156] The following method is preferably applied when unevenness is considered likely to exist at a specific position because of the principle of, for example, the hybridization procedure.

[0157] [Principles such as DNA chip agitation]

For example, consider the case where there is a bubble in a constant-volume chamber for hybridization of a DNA chip, and the chamber 1200 is rotated around a horizontal axis 1201 as shown in Fig. 12 (a).

[0158] In such a case, the relative effective time of the hybridization at each position on the DNA chip may be calculated. The following variables are used to identify the location of the unevenness.

[0159] Θ: Angle formed by the reference plane 1210 of the chamber and the horizontal plane 1211 (see Fig. 12 (b))

(x, y): Position on the DNA chip

b: Variable used in the equation specifying the interface in the chamber

va: Volume of bubbles in the chamber

vs: Volume of solution in the chamber

vm: Capacity of the chamber

xoi (i = 1, ...): Variables representing each observation point

The chamber can be regarded as a very thin rectangular parallelepiped. Therefore, in the processing, the thickness of the chamber is ignored and the chamber is treated as a plane (rectangle). As shown in FIG. 13, the unevenness detection unit 32 sets one side of the chamber (one long side in the example of FIG. 12) as a reference line (step 1301), and sets an observation point at each position corresponding to a pixel or spot of the placed DNA chip (step 1302). The observation points are therefore arranged in a matrix.

[0160] Next, the unevenness detection unit 32 divides the chamber with a horizontal line so that the ratio of the sizes of the upper and lower parts is va : vs. This ratio remains constant when the chamber is rotated about its axis into various orientations.

[0161] Next, the parameter i is initialized (step 1303) and Θ is initialized (step 1304). Θ is then changed by a constant increment, and for each Θ it is determined whether the observation point xoi is located in the bubble region or in the solution region of the chamber (step 1305). The determination result is stored in the data buffer 30 (step 1306). Steps 1305 and 1306 are repeated until Θ has gone from 0 to 2π (steps 1307 and 1308). As a result, the time during which the observation point xoi was in the solution is obtained. Steps 1304 to 1308 are executed for all observation points (see steps 1309 and 1310). In step 1305, a dividing line is assumed such that, for the inclination Θ of the chamber, the ratio of the sizes of the upper and lower sides is va : vs. Depending on whether the observation point xoi is below the dividing line, it can be determined whether the observation point is in the solution or in the bubble.

[0162] If the rotation speed of the chamber is not constant, the increment of Θ may be determined so as to be inversely proportional to the angular speed at each time point in order to reproduce the non-constant state as much as possible.

[0163] By executing such processing, the time during which each observation point was in the solution during the hybridization can be obtained. In the present embodiment, the relative time in the solution, that is, (time in the solution) / (hybridization time), is finally stored in the result storage unit 38 for each observation point.
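The loop of steps 1303 to 1310 might be sketched as below. Several things here are assumptions made for illustration: the chamber rectangle is taken to rotate in its own plane about its center at constant speed, the dividing line is located numerically from a grid of sample points tiling the chamber rather than analytically, and the function name `relative_time_in_solution` is hypothetical.

```python
import math


def relative_time_in_solution(points, width, height, va, vs,
                              n_steps=360, n_grid=40):
    """Fraction of one rotation that each observation point spends in
    the solution. Coordinates are relative to the chamber center."""
    bubble_fraction = va / (va + vs)
    # Sample points tiling the chamber, used to locate the dividing line.
    samples = [(((i + 0.5) / n_grid - 0.5) * width,
                ((j + 0.5) / n_grid - 0.5) * height)
               for i in range(n_grid) for j in range(n_grid)]
    n_solution = round(len(samples) * (1.0 - bubble_fraction))
    counts = [0] * len(points)
    for k in range(n_steps):
        theta = 2.0 * math.pi * k / n_steps
        s, c = math.sin(theta), math.cos(theta)
        # Vertical height of every sample at this inclination.
        heights = sorted(x * s + y * c for x, y in samples)
        if n_solution == 0:
            level = -math.inf          # chamber holds only bubble
        elif n_solution == len(heights):
            level = math.inf           # chamber holds only solution
        else:
            level = (heights[n_solution - 1] + heights[n_solution]) / 2.0
        for idx, (x, y) in enumerate(points):
            if x * s + y * c <= level:  # below the dividing line
                counts[idx] += 1
    return [count / n_steps for count in counts]


# Chamber of 4 x 2 units with a bubble taking 10% of the capacity.
observation_points = [(0.0, 0.0), (1.9, 0.9)]
times = relative_time_in_solution(observation_points, 4.0, 2.0, va=1.0, vs=9.0)
no_bubble = relative_time_in_solution(observation_points, 4.0, 2.0, va=0.0, vs=10.0)
```

A point near the center stays in the solution throughout the rotation, while a point near a corner is periodically covered by the bubble and receives a relative time below 1.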

[0164] In addition, vs and va in the chamber may not be known. In such a case, the following processing is executed.

[0165] As shown in Figure 14, va is increased from 0 to vm (see steps 1401, 1407, 1408), and for each va the relative time that each observation point was in the solution is calculated (step 1402). The unevenness detection unit 32 removes each observation point xoi whose relative time is “0” (step 1403), and for each remaining observation point xoi divides the data value corresponding to the observation point (the normalized signal intensity) by the relative time obtained (step 1404).

[0166] Next, the unevenness detection unit 32 calculates the standard deviation σ ms of the values obtained by the division (step 1405), and temporarily stores va and the calculated standard deviation σ ms (step 1406).

[0167] While va is changed, the standard deviation σ ms is calculated for each va, and the two are stored in association with each other. The unevenness detection unit 32 then finds the va with the smallest standard deviation σ ms, and stores the relative time of each observation point associated with that va in the result storage unit 38 (step 1409).
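The search over va in steps 1401 to 1409 can be sketched as follows. The callable `relative_time_for`, which returns the relative time of every observation point for a given va, is a hypothetical stand-in for the rotation calculation, and the toy model used in the example is invented purely to exercise the code.

```python
from statistics import pstdev


def best_bubble_volume(signals, vm, relative_time_for, n_candidates=20):
    """Scan va from 0 to vm and return (va, sigma_ms, relative_times)
    for the va whose time-corrected signals vary the least."""
    best = None
    for k in range(n_candidates + 1):
        va = vm * k / n_candidates
        relative_times = relative_time_for(va)
        # Step 1403: drop points that never touched the solution.
        ratios = [s / t for s, t in zip(signals, relative_times) if t > 0]
        if len(ratios) < 2:
            continue
        sigma_ms = pstdev(ratios)
        if best is None or sigma_ms < best[1]:
            best = (va, sigma_ms, relative_times)
    return best


def toy_relative_time(va):
    """Invented model: points lose solution time in proportion to va."""
    return [1.0, 1.0 - 0.5 * va, 1.0 - 0.9 * va]


# Signals generated with a "true" bubble volume of 0.4 (capacity 1.0).
observed = [10.0 * t for t in toy_relative_time(0.4)]
va_best, sigma_best, times_best = best_bubble_volume(observed, 1.0, toy_relative_time)
```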

[0168] Next, with reference to the relative time obtained for each observation point, it is determined whether or not the data values are correctable. More specifically, as shown in FIG. 15, the correction processing unit 34 divides the data value at the observation point xoi by the corresponding relative time. This process is executed for all observation points (step 1501). The data values divided by the relative time are temporarily stored in the data buffer 30 (step 1502). Next, the unevenness processing unit 34 uses the first method (see FIG. 3) or the second method (see FIG. 4) to check, using the divided data values, whether or not the unevenness has been eliminated (step 1503). If it is determined that unevenness exists (Yes in step 1504), the unevenness detection unit 32 determines whether the unevenness has improved compared with before correction (step 1505).

[0169] If it is determined NO in step 1505, the corrected data (that is, the series of divided data values) are discarded (step 1506). On the other hand, if it is determined YES in step 1505, the corrected data are retained. If it is determined in step 1504 that there is unevenness, the method of identifying the position of unevenness from the data itself, described below, is applied regardless of the determination in step 1505.

[0170] On the other hand, if it is determined that there is no unevenness (No in step 1504), the operator may use the corrected data for analysis processing.

[0171] [Method of identifying the position of unevenness of data itself]

When the signal intensity of the spot is independent of the position on the DNA chip, focusing on the small area on the DNA chip, it is expected that the average value of the signal intensity will be the same for any small area. The average value is normally distributed by the central limit theorem. Also, the expected value of the variance can be predicted.

[0172] When the hybridization is uneven and the data is uneven on the DNA chip, there is a difference in the average signal intensity between the small areas. For example, if bubbles exist, the average value of the signal intensity of each small region group corresponding to the portion where the bubbles are located is measured low.

[0173] As described above, since the average value of the signal intensity in a small region is normally distributed and its variance can be predicted, the probability that a given average value occurs can be predicted. When that probability falls below a predetermined value, the value is determined to be caused by unevenness rather than by chance.

[0174] However, a DNA chip generally yields a large amount of data, for example tens of thousands of values per chip. For this reason, when “2 σ”, which is often used in parametric tests, is set as a sensitivity of 5%, the absolute number of data that exceed the sensitivity by chance becomes large (α error). For example, if the DNA chip has 30,000 spots and each group of 9 (3 x 3) spots is treated as a small area, there are more than 3,000 small areas, and even at 5% the data values of about 170 small areas, that is, about 1,500 spots, exceed the sensitivity by chance.
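The arithmetic behind these figures can be checked with a few lines. The function name `expected_false_positives` is hypothetical; the numbers are those given in the text.

```python
def expected_false_positives(n_spots, spots_per_area, sensitivity):
    """Expected number of small areas, and of the spots they contain,
    that exceed the sensitivity purely by chance."""
    n_areas = n_spots // spots_per_area
    areas_by_chance = n_areas * sensitivity
    spots_by_chance = areas_by_chance * spots_per_area
    return n_areas, areas_by_chance, spots_by_chance


# 30,000 spots, 3 x 3 small areas, 5% sensitivity.
n_areas, areas_by_chance, spots_by_chance = expected_false_positives(30000, 9, 0.05)
```

This gives 3,333 small areas, of which roughly 167 (about 1,500 spots) would be flagged even on a perfectly even chip.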

[0175] Conversely, if the sensitivity is set to be less sensitive, unevenness can no longer be detected (β error).

[0176] In general, it is not possible to set a sensitivity that reduces both the α error and the β error. In this embodiment, the above problem is solved by providing a plurality of levels.

[0177] First level

The first level sets the sensitivity very low (e.g., the probability corresponding to 1/(number of small areas)). A small region exceeding the sensitivity range is determined to be uneven.

[0178] Second level

The second level is set to be more sensitive (for example, 0.05). Small areas that exceed the sensitivity range are judged to have a high possibility of unevenness.

[0179] When the small regions under observation are smaller than a bubble or than the range over which a bubble moves, a small region adjacent to an affected small region (for example, if the small region is rectangular, one adjoining a side of the small region, or the like) is also considered to be affected by the bubble. The probability that both of these small regions have, by chance, a median at a low signal intensity level that occurs only with a probability of 0.05 or less is 0.05² or less.

[0180] This does not apply when one small region deviates toward low signal and the other deviates toward high signal, for example when one small area is rejected on the high side and the other small area is rejected on the low side.

[0181] Therefore, it is determined that two adjacent small areas that are likely to be uneven are both uneven.

[0182] nth level

The n-th level is set with higher sensitivity (for example, (1/(number of small areas))^(1/n)). It is determined that a small area exceeding the sensitivity range is highly likely to be uneven.

[0183] When the small regions under observation are smaller than a bubble or than the range over which a bubble moves, the small region group in the vicinity of such a small region (for example, the small regions contained in a circle whose radius spans M small regions) is also considered to have a high possibility of unevenness. If m small regions in the small region group exceed the sensitivity range, the probability of this occurring by chance is (probability of the set sensitivity)^m or less.

[0184] If this probability is lower than a preset level, it is determined that all the small regions of the group are uneven.

[0185] In rare cases, a bubble may move along almost the same course at a roughly fixed rate during hybridization. In such a case as well, the position of the unevenness can be identified by using the levels described above.

[0186] [Specific processing for identifying the location of unevenness from the data itself]

As shown in FIG. 16, the unevenness detection unit 32 accepts input such as the shape of the small region according to the operator's instruction (Step 1601), places the small region on the DNA chip, and places the small region in each small region. Identify the data value to which it belongs (step 1602). The small area may be set in the analysis of unevenness with gradation. Alternatively, a small area where the standard deviation of the data value of the small area is as small as possible may be set separately.

[0187] When this processing is executed after step 1505 in Fig. 15 without passing through step 1506, the corrected data values are used as the data values.

[0188] The unevenness detection unit 32 obtains the median of the data values of each small region (step 1603). Next, it is determined how far the median of each small area is from the expected value. For example, if the first level is reached (Yes in step 1604), it is determined that the small area is uneven (step 1605). If the second level has been reached (Yes in step 1606), it is determined whether an adjacent small area has also reached the second level. If the adjacent small area has also reached the second level (Yes in step 1607), it is determined whether the product of these probabilities is less than or equal to the first level. If the product of the probabilities is less than or equal to the first level (Yes in step 1608), it is determined that the small area is uneven (step 1605).

[0189] Such processing is repeated for the third, fourth, ..., and n-th levels. For example, at the n-th level, if the small region has reached the n-th level (Yes in step 1609), an adjacent small region has also reached the n-th level (Yes in step 1610), and the product of the probabilities is less than or equal to the first level (Yes in step 1611), the small area is determined to be uneven (step 1605). This processing is repeated for all small regions (steps 1612, 1613).
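The level-by-level decision of Fig. 16 might be sketched as follows. It assumes that the small-area medians follow a normal distribution with known expected value and standard deviation, considers only deviations toward low signal (the bubble case), and combines a small area with one neighbor at a time; these simplifications and the names `level_probability` and `uneven_areas` are assumptions, not the patent's definitive procedure.

```python
from statistics import NormalDist


def level_probability(n_areas, level):
    """Sensitivity of a level: (1 / n_areas) ** (1 / level)."""
    return (1.0 / n_areas) ** (1.0 / level)


def uneven_areas(medians, expected, sd, neighbors, max_level=3):
    """Return the ids of small areas judged uneven.
    medians: {area_id: median}; neighbors: {area_id: [adjacent ids]}."""
    dist = NormalDist(expected, sd)
    first_level = level_probability(len(medians), 1)
    # Chance probability of a median at least this low.
    p_low = {area: dist.cdf(m) for area, m in medians.items()}
    uneven = set()
    for area, p in p_low.items():
        if p <= first_level:                 # steps 1604, 1605
            uneven.add(area)
            continue
        for level in range(2, max_level + 1):
            threshold = level_probability(len(medians), level)
            if p > threshold:
                continue
            for other in neighbors.get(area, ()):
                # Both areas reach this level and their joint chance
                # probability is no larger than the first level.
                if p_low[other] <= threshold and p * p_low[other] <= first_level:
                    uneven.add(area)
    return uneven


# 100 small areas in a row; standardized medians, mostly at 0.
medians = {a: 0.0 for a in range(100)}
medians[5] = -4.0                 # extreme on its own
medians[20] = medians[21] = -2.0  # moderate, but adjacent to each other
medians[50] = -2.0                # moderate and isolated
chain = {a: [b for b in (a - 1, a + 1) if 0 <= b < 100] for a in range(100)}
flagged = uneven_areas(medians, expected=0.0, sd=1.0, neighbors=chain)
```

The isolated moderate deviation is left alone, while the two adjacent moderate deviations are flagged together, which is how the multiple levels keep the α error down without losing sensitivity to clustered unevenness.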

As a result of the processing in Fig. 16, all data in the small areas determined to be under the influence of unevenness are rejected (step 1701). In addition, for a small area adjacent to a small area determined to be under the influence of unevenness, the half of its data that touches the affected area is also rejected (step 1702). Furthermore, for a small area diagonally adjacent to a small area determined to be under the influence of unevenness, the half of its data closest to the small area whose data were rejected is rejected (step 1703).

[0190] Rejecting data in the vicinity provides a safety margin. In this embodiment, half of the data is rejected. This is because, if more than half of a small area were uneven, its median would also be affected and the data of that small area itself would be expected to be rejected. When a small area adjoins multiple rejected small areas, the influence of each rejected small area is considered separately.

[0191] After rejecting the data in steps 1701 to 1703, the unevenness processing unit 34 uses the first method (see Fig. 3) or the second method (see Fig. 4) to check whether the unevenness has disappeared (step 1704). If it is determined that unevenness exists (Yes in step 1705), the unevenness detection unit 32 determines whether the unevenness has improved compared with before correction (step 1706).

[0192] If it is determined No in step 1706, the unevenness detection unit 32 discards the entire data (step 1707). On the other hand, if the answer is yes in step 1706, the data is retained. On the other hand, if it is determined that there is no unevenness (No in step 1705), the operator may use the remaining data from which some data has been rejected for analysis processing as described above.

[0193] This method is effective when the position and appearance of the unevenness cannot be specified in advance, for example when a hybridization chamber containing bubbles is swirled in the horizontal direction for stirring and the bubbles are moved by the stirring, or when a hybridization chamber containing bubbles is stirred in the vertical direction.

[0194] [Function example]

Next, the function mentioned in FIG. 8 will be described.

[0195] For any distortion, one approach is to make the brightness uniform using a spline function. However, since the brightness of the chip image inherently varies from one part of the chip to another, such a conversion inevitably introduces new distortion.

[0196] With respect to position on the chip, unevenness acts as a linear function. For any position on the DNA chip, there are parameters a and b of the linear function “(read data) = (true value) * b + a”. These parameters are expressed as functions of position on the DNA chip, and those functions are assumed to be smooth.

[0197] The parameters a and b can be estimated from the minimum and median values of the data in each small area, respectively. The smoothness of the functions is used to reduce the noise that results from using small areas. These functions are used to correct the parameters γ and μ, and unevenness is corrected through the correction of these parameters. The parameter σ is obtained from the corrected data. The parameters γ, μ, and σ are the data standardization parameters calculated in Figs. 6 and 7.

[0198] [Principle of curvilinear distortion]

The distortion that arises at the washing stage is linear. The main causes of unevenness during washing are thought to be differences in temperature and water flow. This unevenness is thought to act through different mechanisms on the signal component, which varies with the amount of hybridization, and on the background component, which varies with the amount of non-specifically bound dye.

[0199] In other words, it is expressed as “raw data (original data) value = signal component + background component”.

[0200] Hybridization is considered to be an equilibrium between binding and dissociation, whereas washing is an almost one-way dissociation reaction. Because the volume of solution is large, a dissociated probe is unlikely to form a new bond. In general, washing is performed at a salt concentration that is more “stringent” than that used for hybridization, that is, one at which the equilibrium shifts toward dissociation.

[0201] Therefore, even if the probe concentration is kept high, the reaction proceeds in the dissociation direction. The rate of dissociation is then proportional to the concentration of the probe.

[0202] The rate at which a probe dissociates is thought to be proportional to the concentration of the probe, because dissociation is expected to occur as a stochastic process with a fixed probability for each probe molecule. In that case, a first-order reaction is expected for each type of probe.

[0203] From commonly used assumptions and mathematical models, consider the following equations.

[0204] v = -d[probe] / dt = k[probe]

(1 / [probe]) d[probe] = -k dt

Integrating, where t is the time elapsed since the start of washing:

[probe] = [probe0] exp(-kt)

[probe0]: concentration at the start of washing

Here, when there is a temperature difference between parts of the chip, or a difference in water flow, these differences affect the constant k, which represents the binding strength of the probe. If [probe0] is the same, the signal intensities of parts of the chip with constants k1 and k2 differ by a factor of exp{-(k1 - k2)t}. Since t is constant, this means that the unevenness acts “multiplicatively” on the signal intensity due to the probe.
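As a minimal numerical sketch of this first-order washing model (the rate constants, washing time, and starting concentrations below are made-up illustration values, not taken from the specification), the ratio between two chip positions comes out the same whatever the starting concentration:

```python
import math

def remaining_probe(probe0, k, t):
    """First-order dissociation: [probe] = [probe0] * exp(-k * t)."""
    return probe0 * math.exp(-k * t)

# Hypothetical values for illustration only.
k1, k2 = 0.020, 0.015   # dissociation constants at two chip positions
t = 60.0                # washing time, identical for the whole chip

ratios = []
for probe0 in (10.0, 1000.0, 50000.0):
    ratio = remaining_probe(probe0, k1, t) / remaining_probe(probe0, k2, t)
    ratios.append(ratio)

# Every ratio equals exp(-(k1 - k2) * t): a purely multiplicative effect.
expected_ratio = math.exp(-(k1 - k2) * t)
```

Because the factor does not depend on the starting concentration, it can in principle be removed by a single division per chip position.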

[0205] The background component can be considered as follows. The signal does not consist solely of hybridization of the probe to the nucleic acid on the chip; it also includes low-specificity binding of free dye and probe to the DNA chip surface. This is the so-called background component of the signal. Such low-specificity binding reactions are expected to reach completion in a very short time. It is therefore considered that an equilibrium exists during hybridization and that the system shifts quickly to the next equilibrium during subsequent handling.

[0206] Such a component should have a constant value at each position on the chip. This value is common to that part and to other parts under the same conditions. As a result, this component affects the signal intensity due to the probe “additively”, regardless of the other component (how much probe has hybridized). The above suggests that this kind of unevenness changes the data values in the form of a linear transformation.

[0207] If so, the data can be corrected with an inverse function of a linear expression. By subtracting the appropriate numerical value a and dividing the result by the appropriate numerical value b, the effect of unevenness due to changes in k can be canceled. The following describes how to obtain these values.
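The inverse of the linear expression can be sketched as follows. This is only an illustration of “subtract a, then divide by b”; the function name and the numbers are hypothetical.

```python
def correct_linear(read_value, a, b):
    """Invert read = true * b + a to recover the true value."""
    if b == 0:
        raise ValueError("b must be non-zero")
    return (read_value - a) / b

# Hypothetical example: a true value of 200 distorted with a = 30, b = 1.5.
true_value = 200.0
a, b = 30.0, 1.5
read_value = true_value * b + a
recovered = correct_linear(read_value, a, b)
```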

[0208] [Additive effects]

The additive effect raises the values of all data. It acts on γ, the background value of the three-parameter lognormal distribution (the distribution that becomes normal when the data are standardized by the processing shown in Figs. 6 and 7). This effect is most apparent in the smallest data values. Microarray data are essentially lognormally distributed, and a characteristic of this distribution is that the frequency distribution is concentrated at small data values close to γ. In other words, these small values are almost equal to γ.

[0209] In the data group of an arbitrary small region on the DNA chip, the data range “f” to “h” in which the frequency distribution is concentrated, or the minimum value “j”, is slightly larger than the required γ. These values can be used to estimate γ.
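A small simulation can illustrate why the minimum of a region sits slightly above γ and moves closer to it as the region contains more spots. The sketch assumes data of the form x = γ + exp(N(μ, σ)); the parameter values and the helper name `simulate_region` are hypothetical.

```python
import math
import random

def simulate_region(gamma, mu, sigma, n, rng):
    """Draw n values x such that log(x - gamma) ~ Normal(mu, sigma)."""
    return [gamma + math.exp(rng.gauss(mu, sigma)) for _ in range(n)]

rng = random.Random(0)
gamma = 50.0  # hypothetical background value
values = simulate_region(gamma, mu=3.0, sigma=1.5, n=10000, rng=rng)

min_small = min(values[:100])  # a small region: the first 100 spots
min_large = min(values)        # a large region: all 10000 spots
```

The larger region can only give a minimum that is equal to or closer to γ, which is the trade-off between region size and sensitivity to local unevenness.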

[0210] However, this raises the following problem: how small is “small”? This is a matter of probability. The more data there are in the area, the closer the smallest values come to γ. The expected value therefore depends on the number of data.

[0211] As an estimate of γ, the median of “f” to “h”, or “j”, contains noise that follows some probability distribution, and with a certain probability it will be smaller than γ. Because of this noise, using the median of “f” to “h”, or “j”, is problematic. The problem can be reduced by using a larger small region. However, enlarging the small region leads to unevenness being overlooked. Regarding the size of the small region, there is a trade-off between robustness to noise and the fidelity with which unevenness is reflected. To resolve this dilemma to some extent, we assume the following.

[0212] Suppose that the γ value of each section connects smoothly to that of each adjacent section. Temperature unevenness, for example, is considered to vary smoothly, so this seems a reasonable assumption. Under this assumption, the additive effect in each section, and on smaller scales, is expressed as a function of position on the chip, and the γ value is corrected based on that function. The γ value is calculated from the data as a whole; in that calculation, the function is used to weight each chip position.

[0213] [Outline of the method (see Figure 18 to Figure 21)]

(1) For each small region on the DNA chip, calculate the median of “f” to “h”, or “j”.

(2) Find a smooth function of the physical position on the DNA chip (or of the spot position) that fits the median of “f” to “h”, or the “j” value, of each of these small regions.

(3) When calculating γ, for each spot, the value of the function at the position of that spot is reflected in γ.

(4) γ therefore differs from one part of the DNA chip to another: γ increases where the minimum value rises due to the additive effect. This corrects for the additive effect.

[0214] [Specific processing]

Consider an Agilent DNA chip. When there is a process of changing the temperature when washing the DNA chip, a temperature gradient is generated on the chip surface by the method of holding the chip or by the water flow. Until this gradient is resolved, the hybridization proceeds in different states, resulting in unevenness.

[0215] This unevenness was treated by dividing it into two components, in the vertical and horizontal directions of the DNA chip. This avoids the difficulty of defining a function over the plane; the unevenness is simplified as two curve functions, one vertical and one horizontal.

[0216] It was assumed that a similar curve could be obtained by cutting at an arbitrary position in the vertical direction, and a similar curve could be obtained by cutting at an arbitrary position in the horizontal direction. Such a surface is shaped like a heel.

[0217] Under the above assumption, the following small regions can be set.

[0218] Three rows of data in the vertical direction

Data group for one row in the horizontal direction

As shown in FIG. 19, the correction processing unit 34 first sets the small regions described above, finds the minimum value of each data group (step 1901), and plots the chip position on the x axis and the minimum value on the y axis (step 1902). Next, the correction processing unit 34 performs linear approximation by the least squares method for each plot. In this embodiment, however, the coefficients are calculated in a moving-window manner, like a moving average: for each data point, the linear approximation is based on the 10 data points before and after it. As a result, the function is expressed as a curve composed of many straight line segments. By such processing, a vertical position function and a horizontal position function are obtained.
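The moving-window linear approximation can be sketched as below. This is one possible reading of “linear approximation based on 10 data before and after”, not necessarily the exact procedure of the embodiment; the function name `local_linear_fit` and the sample data are hypothetical.

```python
def local_linear_fit(xs, ys, half_window=10):
    """Fit a least-squares line to the points around each index and
    return the fitted value at that index (a piecewise-linear smoother)."""
    n = len(xs)
    fitted = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        wx, wy = xs[lo:hi], ys[lo:hi]
        m = len(wx)
        mean_x = sum(wx) / m
        mean_y = sum(wy) / m
        sxx = sum((x - mean_x) ** 2 for x in wx)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(wx, wy))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (xs[i] - mean_x))
    return fitted

# Hypothetical minima that drift linearly across the chip.
positions = [float(i) for i in range(30)]
minima = [100.0 + 2.0 * x for x in positions]
smooth = local_linear_fit(positions, minima)
```

On exactly linear input the smoother reproduces the input; on noisy minima it averages out the noise within each window while still following a gradual drift.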

[0219] Every data value thus has two function values, vertical and horizontal. When calculating γ for a spot, the values of the two functions are added to the originally calculated constant γ, as follows:

[0220] (specific γ for a spot) = γ (constant value) + longitudinal function value + transverse function value

[Multiplicative effects (see Figure 18, Figure 19, Figure 22, and Figure 23)]

Multiplicative effects affect only the amount of hybridization. This effect acts on μ (the median of the three-parameter lognormal distribution on the log scale) and is evident in the median value of the data. The median of the DNA chip data from which γ has already been subtracted varies depending on the position on the chip.

[0221] To capture this variation, small areas are considered as in the case of γ. The median value is calculated for each area and used to correct the data. The trade-off between the size of the area, robustness to noise, and the fidelity of the unevenness correction is exactly the same as for the additive effect, and the same solution can be used.

[0222] [Specific processing]

As in the case of the additive effect, the unevenness was divided into two components, in the vertical and horizontal directions of the DNA chip. This avoids the difficulty of defining a function over the plane; the unevenness is simplified as two curve functions, one vertical and one horizontal.

[0223] It was assumed that a similar curve could be obtained at any position in the vertical direction, and a similar curve could be obtained at any position in the horizontal direction.

[0224] Under the above assumption, the following small regions can be set.

[0225] Three rows of data in the vertical direction

Data group for one row in the horizontal direction

Accordingly, the correction processing unit 34 sets the small areas, finds the median value of the data in each, and plots the chip position on the x axis and the median value on the y axis. Next, the correction processing unit 34 finds functions that smoothly connect the plotted data in the vertical direction and in the horizontal direction. More specifically, the correction processing unit 34 performs linear approximation by the least squares method for each plot. In the present embodiment, however, the coefficients are calculated in a moving-window manner, like a moving average: for each data point, the linear approximation is based on the 10 data points before and after it. As a result, the function is expressed as a curve composed of many straight line segments. By such processing, a function of the vertical position and a function of the horizontal position are obtained.

[0226] All data has two functions, vertical and horizontal. When calculating μ for a certain spot, the values of the two functions should be added to the fixed values originally calculated as follows.

[0227] (specific μ for a spot) = μ (constant value) + longitudinal function value + transverse function value

μ is the median value of log(xi - γ).

[0228] [Detailed description of processing in FIGS. 18 and 19]

Hereinafter, the processing shown in FIGS. 18 and 19 will be described in more detail. The unevenness detection unit 32 of the processing device 10 reads the original data from the data buffer 30 (step 1801), and rejects data determined to be dirty by the fourth method of unevenness detection (FIGS. 24 and 25) (step 1802). Next, the unevenness detection unit 32 executes at least one of the first to third methods (FIGS. 3 to 5) of unevenness detection (step 1803). Here, the operator decides which methods are to be executed and inputs them by operating the input device.

If unevenness is not detected (No in step 1804), the process ends. On the other hand, when unevenness is detected (Yes in step 1804), the correction processing unit 34 checks how the original data map onto the matrix (step 1805). In this embodiment, the spot data values are stored one-dimensionally in the data buffer as a single column for convenience, whereas the spots are physically arranged in a matrix on the DNA chip. Accordingly, in step 1805, the information needed to handle the spot data values as a two-dimensional array is acquired.
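Handling the one-dimensional buffer as a two-dimensional array can be sketched as follows. The storage order is not stated in the text; row-major order is assumed here purely for illustration, and both helper names are hypothetical.

```python
def to_matrix(values, n_rows, n_cols):
    """Reshape a flat list of spot values into n_rows x n_cols."""
    if len(values) != n_rows * n_cols:
        raise ValueError("length does not match the chip layout")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def position_of(index, n_cols):
    """Return (row, column) for a flat index, assuming row-major order."""
    return divmod(index, n_cols)

# Hypothetical 2 x 3 chip stored as six consecutive values.
matrix = to_matrix([0, 1, 2, 3, 4, 5], n_rows=2, n_cols=3)
row, col = position_of(4, n_cols=3)
```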

[0230] Next, the correction processing unit 34 refers to the original data and acquires, for each spot of the DNA chip, its row and column on the original DNA chip (step 1806). The correction processing unit 34 obtains the physical position Xc on the chip corresponding to the c-th row of the DNA chip (step 1807) and the physical position Xw on the chip corresponding to the w-th column (step 1808). In the present embodiment, a row denotes a group of spots that are contiguous in the vertical direction on the DNA chip, and a column denotes a group of spots that are contiguous in the horizontal direction. Thus, for example, the position of the spot in the c-th row and the w-th column can be specified by (Xc, Xw).

Next, as shown in FIG. 19, the correction processing unit 34 executes at least one of the correction methods A to D described below in accordance with an instruction from the operator (step 1809).

FIG. 20, FIG. 21, FIG. 22, and FIG. 23 are flowcharts showing the correction methods A to D in detail, respectively. FIG. 20 relates to correction of additive unevenness in the vertical direction, and FIG. 21 to correction of additive unevenness in the horizontal direction. FIG. 22 relates to correction of multiplicative unevenness in the vertical direction, and FIG. 23 to correction of multiplicative unevenness in the horizontal direction.

In the processing of FIG. 20, the correction processing unit 34 obtains the minimum value MINc of the spot data values for the vertical c-th row, or for several rows including the c-th row (for example, three rows: the (c-1)-th, c-th, and (c+1)-th rows), and stores it in the data buffer 30 (step 2001). The minimum value MINc is determined for all rows and stored in the data buffer 30.

[0234] Next, the correction processing unit 34 obtains a smooth and continuous function f(Xc) that approximates MINc in terms of Xc, the x coordinate of the spot (step 2002). Next, f(Xc), evaluated at the x coordinate Xc corresponding to the c-th row, is subtracted from each data value of the spots in the c-th row (step 2003). The correction processing unit 34 stores the data values from which f(Xc) has been subtracted in the data buffer 30 (step 2004).
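Steps 2001 to 2004 can be sketched roughly as follows. A plain moving average stands in for the smooth function f(Xc), which is an assumption of this sketch, as are the function names and the sample chip. Each inner list holds the spots of one vertical line (a “row” in the terminology of this embodiment).

```python
def window_minima(lines, half_width=1):
    """MINc: minimum over line c and its neighbours (three lines by default)."""
    n = len(lines)
    minima = []
    for c in range(n):
        lo, hi = max(0, c - half_width), min(n, c + half_width + 1)
        minima.append(min(v for line in lines[lo:hi] for v in line))
    return minima

def moving_average(ys, half_window=2):
    """Stand-in for the smooth function f(Xc); an assumption of this sketch."""
    n = len(ys)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(ys[lo:hi]) / (hi - lo))
    return out

def correct_additive(lines):
    """Subtract the smoothed minimum f(Xc) from every spot on line c."""
    f = moving_average(window_minima(lines))
    return [[v - f[c] for v in line] for c, line in enumerate(lines)]

# Hypothetical chip: the same three signals on every line, plus an
# additive offset that grows by 2 from one line to the next.
lines = [[10.0 + 2.0 * c, 40.0 + 2.0 * c, 90.0 + 2.0 * c] for c in range(6)]
corrected = correct_additive(lines)

before = [min(line) for line in lines]
after = [min(line) for line in corrected]
```

In this toy case the line-to-line spread of the minima shrinks after correction but does not vanish, because windowed minima and averaging lag behind a steady gradient.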

[0235] The processing in Fig. 21 is basically the same as the processing in Fig. 20. Whereas the process of Fig. 20 uses the vertical c-th row, the process of Fig. 21 uses the horizontal w-th column: a continuous function g(Xw) that approximates MINw in terms of Xw, the y coordinate of the spot, is determined (see step 2102). The correction processing unit 34 subtracts g(Xw), evaluated at the y coordinate Xw corresponding to the w-th column, from each of the spot data values in the w-th column (step 2103), and stores the resulting data values in the data buffer (step 2104).

[0236] As shown in FIG. 22, when correcting multiplicative unevenness, the correction processing unit 34 determines whether correction of additive unevenness (the correction shown in FIGS. 20 and 21) should be executed (step 2201). This may be determined by referring to the type of correction to be executed, which is included in the correction instruction from the operator. If NO in step 2201, the standardization processing unit 31 reads the original data stored in the data buffer 30, standardizes the data, and obtains the background value γ (step 2202). This γ is the one used in the above-mentioned formula z = (log(x - γ) - μ) / σ.

Next, the standardization processing unit 31 subtracts the calculated background value γ from the data value of the original data, and stores the subtracted data value in the data buffer 30 (step 2203). Next, the correction processing unit 34 obtains the median value MEDc of the data values of the spots in the vertical c-th row or several rows including the c-th row and stores them in the data buffer 30 (step 2204). The median value MEDc is obtained for all rows and stored in the data buffer 30.

Next, the correction processing unit 34 obtains a smooth and continuous function h(Xc) that approximates MEDc in terms of Xc, the x coordinate of the spot (step 2205). Next, each data value of the spots in the c-th row (the data value from which the background value γ has been subtracted) is divided by h(Xc), evaluated at the Xc corresponding to the c-th row (step 2206). The correction processing unit 34 stores the quotients thus obtained in the data buffer 30 (step 2207).

The correction process (FIG. 23) for the row is basically the same as the process of FIG. A smooth and continuous function j (Xw) that approximates the MEDw obtained by the same process with Xw that is the y coordinate of the spot is obtained, and the data value of the spot is divided by j (Xw), The divided value is stored in the data buffer 30.
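The multiplicative correction can be sketched as follows. For brevity the sketch divides each line by its own median and omits the smoothing step that yields h(Xc) or j(Xw); that simplification, the function names, and the sample values are assumptions.

```python
from statistics import median

def line_medians(lines):
    """MEDc for each line (a single line per window in this sketch)."""
    return [median(line) for line in lines]

def correct_multiplicative(lines, gamma):
    """Subtract the background gamma, then divide line c by its median."""
    shifted = [[v - gamma for v in line] for line in lines]
    h = line_medians(shifted)
    if any(m <= 0 for m in h):
        raise ValueError("median must be positive after subtracting gamma")
    return [[v / h[c] for v in line] for c, line in enumerate(shifted)]

# Hypothetical chip: identical signals scaled by a different gain per line,
# sitting on a common background gamma.
gamma = 30.0
base = [20.0, 100.0, 500.0]
gains = [1.0, 1.2, 1.5]
lines = [[gamma + g * s for s in base] for g in gains]
corrected = correct_multiplicative(lines, gamma)
```

After the division every line has the same values, so the position-dependent gain has been removed; the overall scale is then fixed again by the subsequent standardization.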

[0239] After one or more of the processes shown in Figs. 20 to 23 are executed, the standardization processing unit 31 standardizes the corrected data values (step 1810). Standardization here uses the following formula as described above.

[0240] z = (log(x - γ) - μ) / σ

This is because the data values corrected by the processes in FIGS. 20 to 23 are not standardized.
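As a rough sketch of this standardization: μ is taken as the median of log(x - γ), as stated above, while σ is estimated here from the interquartile range, which is an assumption of the sketch (the actual procedure is the one of Figs. 6 and 7). Values not exceeding γ are simply skipped, which is also an assumption.

```python
import math
from statistics import median, quantiles

def standardize(values, gamma):
    """z = (log(x - gamma) - mu) / sigma, with robust mu and sigma."""
    logs = [math.log(x - gamma) for x in values if x > gamma]
    mu = median(logs)
    q1, _, q3 = quantiles(logs, n=4)
    sigma = (q3 - q1) / 1.349  # the IQR of a normal distribution is about 1.349 sigma
    return [(v - mu) / sigma for v in logs], mu, sigma

# Hypothetical data: 41 values whose logs (after subtracting gamma)
# are spread evenly between -2 and 2.
gamma = 30.0
values = [gamma + math.exp(k / 10) for k in range(-20, 21)]
z, mu, sigma = standardize(values, gamma)
```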

[0241] Further, the unevenness detection unit 32 executes the first to third methods (FIGS. 3 to 5) of the unevenness detection to confirm the effect of the correction (Step 1811). In this embodiment, any one of correction methods A to D or a combination of two or more is used. Thus, each corrected data is obtained (step 1812).

[0242] When unevenness is found in the corrected data (when it is determined that the unevenness exceeds the allowable range) (No in step 1813), the original data are rejected (step 1816). On the other hand, if it is determined that there is no unevenness (it is within the allowable range) (Yes in step 1813), the corrected and standardized data values are stored in the data buffer 30 (step 1815). Thereafter, the corrected and standardized data values are used for the analysis. If multiple groups of corrected data values exist because of multiple combinations of the correction methods A to D, the correction processing unit 34 selects the combination of correction methods for which the distribution of the corrected, standardized data values most closely approximates the lognormal distribution (step 1814), and the data values corrected by that combination are stored in the data buffer 30.

[0243] [Detailed description of flowchart for fourth method of unevenness detection]

The fourth method of unevenness detection includes the method shown in FIG. 24 and the method shown in FIG. 25. These will be described in more detail with reference to the flowcharts.

As shown in FIG. 24, the unevenness detection unit 32 determines the rejection level based on, for example, input by the operator. For example, 2σ may be used as the rejection level. Next, the unevenness detection unit 32 reads the original background data of the DNA chip to be processed from the data buffer 30 (step 2401). This original background data is explained below.

[0245] For example, for an Agilent DNA chip, a portion that does not overlap a spot is cut out using a so-called cookie cutter algorithm. A place between the spots on the DNA chip that does not overlap any spot is selected, and the data value measured there becomes the background measurement value. For example, the midpoint between diagonally adjacent spots does not overlap any spot. Usually, as many background measurements are taken as there are spots. In this embodiment as well, the same number of background measurement values as the number of spots is stored in the data buffer 30, and these constitute the original background data of step 2401. Each background measurement stored in the data buffer 30 is associated with a spot.

[0246] The actually measured background values are almost all the same. However, if there is dust or some other object on the chip that emits light at a given position, the background value there will naturally increase under the influence of that light. Since background values are normally distributed, a spot whose associated background measurement value is abnormally elevated can be considered to deviate from the distribution as well.

[0247] The unevenness detection unit 32 calculates the representative value Mb and the scale value Sb on the assumption that the original background data are normally distributed (step 2403). As the representative value Mb, a robustly obtained average such as the median can be used. As the scale value Sb, a standard deviation calculated from the IQR or the MAD can be used.

[0248] Next, the unevenness detection unit 32 reads an individual background value Xbi, the background measurement value for each spot, from the data buffer 30 (step 2404), and standardizes it according to the following formula (step 2405).

[0249] Zbi = (Xbi - Mb) / Sb

The unevenness detection unit 32 determines whether or not the calculated Zbi is below the rejection level (step 2406). If NO in step 2406, that is, if Zbi is greater than the rejection level, the spot data value associated with the individual background value being processed is rejected (step 2407).

In practice, a flag indicating that the data value is determined to be rejected in the data buffer 30 may be added. The data value with the flag added is no longer used for data analysis.

[0250] Steps 2405 to 2407 are executed for the individual background values of all spot data (see step 2408).
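The process of FIG. 24 can be sketched as follows, using the median as Mb and a MAD-based standard deviation as Sb. The constant 1.4826, the function names, and the sample readings are choices made for this sketch, not values given in the specification.

```python
from statistics import median

def robust_center_scale(values):
    """Median as Mb and a MAD-based standard deviation as Sb."""
    mb = median(values)
    mad = median(abs(v - mb) for v in values)
    return mb, 1.4826 * mad  # 1.4826 rescales the MAD to sigma for normal data

def flag_by_background(backgrounds, rejection_level=2.0):
    """True for each spot whose standardized background Zbi exceeds the level."""
    mb, sb = robust_center_scale(backgrounds)
    if sb == 0:
        return [False] * len(backgrounds)
    return [(xb - mb) / sb > rejection_level for xb in backgrounds]

# Hypothetical background readings; the 95.0 mimics dust on the chip.
backgrounds = [50.0, 51.0, 49.0, 50.5, 49.5, 95.0, 50.2, 49.8]
flags = flag_by_background(backgrounds)
```

Returning flags rather than deleting values mirrors the remark above that a rejection flag may be attached to the data value in the data buffer.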

Next, the process of FIG. 25 will be described in more detail. In FIG. 25, steps 2501 and 2502 correspond to steps 2402 and 2403 in FIG. 24, respectively. The unevenness detection unit 32 reads out the individual background value Xbi (step 2503) and standardizes it by the following formula (step 2504).

[0252] Zbi = (Xbi - Mb) / Sb

This standardization is the same as the standardization in step 2405 of FIG. 24. In this method, all individual background values of the DNA chip are read out and standardized. The unevenness detection unit 32 sorts the calculated Zbi in descending order (step 2505). Further, the unevenness detection unit 32 creates a normal probability plot according to the theoretical values of the normal distribution model (step 2506). The unevenness detection unit 32 compares the Zbi arranged in descending order with the normal probability plot, and detects the range in which the values of Zbi match the theoretical values (step 2507). The match need not be exact; a range of Zbi that agrees within a certain error may be detected.

[0253] The unevenness detection unit 32 reads the standardized individual background value Zbi (step 2508) and determines whether the read Zbi is less than or equal to the upper limit of the range that matches the theoretical values (step 2509). If NO in step 2509, the spot data associated with the individual background value Xbi from which that Zbi was calculated are rejected (step 2510). The processing of steps 2509 and 2510 is performed on the values Zbi calculated from the individual background values for all spot data (see step 2511).
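One way to sketch the comparison with a normal probability plot is shown below. The plotting positions (i + 0.5) / n, the tolerance, the ascending sort order, and the rule of walking upward from the middle of the plot are all assumptions of this sketch; the specification only requires finding the range where Zbi agrees with the theoretical values within some error.

```python
from statistics import NormalDist

def upper_limit_from_probability_plot(z_values, tolerance=0.5):
    """Largest z that still follows the theoretical normal quantiles."""
    z_sorted = sorted(z_values)
    n = len(z_sorted)
    theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    limit = z_sorted[n // 2]
    # Walk up from the middle of the plot until the points leave the line.
    for z, q in zip(z_sorted[n // 2:], theory[n // 2:]):
        if abs(z - q) > tolerance:
            break
        limit = z
    return limit

# Hypothetical standardized backgrounds: 18 well-behaved values plus two
# abnormally high ones.
z_values = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(18)] + [6.0, 8.0]
limit = upper_limit_from_probability_plot(z_values)
rejected = [z > limit for z in z_values]
```

Unlike the fixed 2σ level of FIG. 24, the cut-off here is derived from the data themselves.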

[0254] [Process for diagnosis of critical spots]

As described above, if a cluster with high background is found and the data just outside the cluster that touch it are rejected, non-reproducible data can be avoided (FIG. 26). The processing shown in FIG. 26 will be described in more detail.

[0255] First, the unevenness detection unit 32 determines a rejection level based on input by the operator (step 2601). For example, 2σ may be used as the rejection level; other levels may of course be adopted. Next, the unevenness detection unit 32 reads out from the data buffer 30 the standardized background values Zbi of the peripheral area of the spot to be processed (step 2602). The standardized background value Zbi is obtained by applying Zbi = (Xbi - Mb) / Sb to the individual background value Xbi in the same manner as described for FIGS. 24 and 25. As the peripheral area of the spot to be processed, the unevenness detection unit 32 can use a 5 × 5 matrix, a 7 × 7 matrix, a 3 × 3 matrix, or the like centered on the spot.

[0256] Spots to be processed are also referred to as individual spots, and spots in the peripheral area are also referred to as peripheral spots. Next, the unevenness detection unit 32 detects the positions of the individual spots on the matrix and the surrounding spots. With reference to the position on each matrix, the distance r between the individual spot and the peripheral spot is obtained according to the following equation (step 2603).

[0257] r = (x^2 + y^2)^0.5

Here, x and y are the distance in the row direction and the distance in the column direction, respectively. As the distance information, the difference in row number and the difference in column number can be used.

[0258] After that, the unevenness detection unit 32 calculates (Zbi / r^2) from the standardized background value Zbi of each peripheral spot and its distance r, assigns the sum B of these values to the individual spot, and stores it in the data buffer 30 (step 2604). The unevenness detection unit 32 determines whether or not the calculated value B is less than the rejection level (step 2605). If NO in step 2605, the data value of the individual spot to which the sum B was assigned is rejected (step 2606). For example, a flag indicating that the data value has been determined to be rejected may be added in the data buffer 30. A data value with the flag added is no longer used for data analysis.

[0259] The unevenness detection unit 32 executes the processing of Step 2602 to Step 2606 for all spots (see Step 2607).
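The diagnosis of steps 2602 to 2606 can be sketched as follows, using differences in row and column numbers as distances. Excluding the individual spot itself (where r = 0), the 5 × 5 window, and the sample grid are assumptions of this sketch.

```python
def neighbourhood_score(z_grid, row, col, half_size=2):
    """Sum B of Zbi / r^2 over the peripheral spots around (row, col)."""
    n_rows, n_cols = len(z_grid), len(z_grid[0])
    total = 0.0
    for dy in range(-half_size, half_size + 1):
        for dx in range(-half_size, half_size + 1):
            if dx == 0 and dy == 0:
                continue  # the individual spot itself has r = 0
            r, c = row + dy, col + dx
            if 0 <= r < n_rows and 0 <= c < n_cols:
                total += z_grid[r][c] / (dx * dx + dy * dy)
    return total

def flag_critical_spots(z_grid, rejection_level=2.0, half_size=2):
    """True where the sum B is not below the rejection level."""
    return [[neighbourhood_score(z_grid, r, c, half_size) >= rejection_level
             for c in range(len(z_grid[0]))]
            for r in range(len(z_grid))]

# Hypothetical 5 x 5 chip with one very high background in the centre.
z_grid = [[0.0] * 5 for _ in range(5)]
z_grid[2][2] = 6.0
flags = flag_critical_spots(z_grid)
```

In this example the eight spots touching the contaminated centre are flagged, while spots two positions away are not; the centre itself would be caught by the process of FIG. 24 or FIG. 25.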

[0260] The present invention is not limited to the above-described embodiments; various modifications are possible within the scope of the invention described in the claims, and it goes without saying that these are also included in the scope of the present invention.

[0261] For example, in the above embodiment, unevenness detection is performed (see, for example, FIGS. 3 to 7); for a DNA chip determined to have unevenness, the type of unevenness (for example, unevenness with gradation) is determined, and in some cases some or all of the data are corrected or rejected. However, the present invention is not limited to this, and unevenness detection, determination of the type of unevenness, correction, or data rejection may each be executed on its own.

Industrial applicability

[0262] The present invention can be used for the analysis of DNA chips. In particular, unevenness on the DNA chip is detected, the spot data values are corrected, and, if correction is impossible, it is determined that part or all of the DNA chip data should be rejected. More accurate and appropriate analysis is therefore possible.

Brief Description of Drawings

[FIG. 1] FIG. 1 is a hardware configuration diagram of a gene expression data processing apparatus according to an embodiment of the present invention.

[FIG. 2] FIG. 2 is a functional block diagram of the main part of the processing apparatus according to the present embodiment.

[FIG. 3] FIG. 3 is a flowchart showing a process according to the first method executed by the unevenness detection unit.

[FIG. 4] FIG. 4 is a flowchart showing a process according to the second method executed by the unevenness detection unit.

[FIG. 5] FIG. 5 is a flowchart showing a process according to the third method executed by the unevenness detection unit.

[FIG. 6] FIG. 6 is a flowchart showing in more detail the standardization process executed by the standardization processing unit according to the present embodiment.

[FIG. 7] FIG. 7 is a flowchart showing in more detail the standardization process executed by the standardization processing unit according to the present embodiment.

[FIG. 8] FIG. 8 is a flowchart showing an example of detection processing for unevenness with gradation in the present embodiment.

[FIG. 9] FIG. 9 is a flowchart showing another example of detection processing for unevenness with gradation in the present embodiment.

[FIG. 10] FIG. 10 is a flowchart showing an example of correction processing for unevenness with gradation in the present embodiment.

[FIG. 11] FIG. 11 is a flowchart showing an example of a process of calculating a spatial correction function according to the present embodiment.

[FIG. 12] FIG. 12 is a diagram showing an example of a chamber used for hybridization.

[FIG. 13] FIG. 13 is a flowchart showing an example of unevenness detection processing based on the principle of hybridization.

[FIG. 14] FIG. 14 is a flowchart showing another example of unevenness detection processing based on the principle of hybridization.

[FIG. 15] FIG. 15 is a flowchart showing an example of unevenness correction processing based on the principle of hybridization.

[FIG. 16] FIG. 16 is a flowchart showing an example of processing for detecting the position of unevenness.

[FIG. 17] FIG. 17 is a flowchart showing an example of processing for detecting the position of unevenness.

[FIG. 18] FIG. 18 is a flowchart showing an example of a spatial correction function calculation process according to the present embodiment.

[FIG. 19] FIG. 19 is a flowchart showing an example of a spatial correction function calculation process according to the present embodiment.

[FIG. 20] FIG. 20 is a flowchart showing an example of correcting background irregularity in the background correction function calculation process according to the present embodiment.

[FIG. 21] FIG. 21 is a flowchart showing another example of correcting background irregularity in the spatial correction function calculation process according to the present embodiment.

[FIG. 22] FIG. 22 is a flowchart showing an example of a method of correcting multiplicative unevenness in the spatial correction function calculation process according to the present embodiment.

[FIG. 23] FIG. 23 is a flowchart showing an example of a method of correcting multiplicative unevenness in the spatial correction function calculation process according to the present embodiment.

[FIG. 24] FIG. 24 is a flowchart showing a process according to the fourth method executed by the unevenness detection unit.

[FIG. 25] FIG. 25 is a flowchart showing an alternative, in which the level is not set in advance, among the processes according to the fourth method executed by the unevenness detection unit.

[FIG. 26] FIG. 26 is a flowchart showing the determination of a margin in the process according to the fourth method executed by the unevenness detection unit.

Explanation of symbols

10 Gene expression data processor

12 CPU

14 Input device

16 Display device

RAM

ROM

Data notifier

Standardization processing unit

Unevenness detection unit

Correction processing unit

Image generation unit

Claims

[1] A method for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on the expression levels of genes on a DNA chip and stored in a storage device, the method comprising:

Normalizing data values comprising the array data;

Calculating a standard deviation of the representative value distribution;

and detecting the presence of unevenness in the array data of the DNA chip based on an increase in the standard deviation.

[2] The method according to claim 1, wherein the step of detecting the presence of the unevenness comprises:

Calculating a parameter of a chi-square (χ²) distribution according to a predetermined criterion;

Calculating a significance level from the chi-square distribution; and

Comparing the standard deviation with the significance level, and determining that there is unevenness in the data if the standard deviation is greater than the significance level.

[3] The method according to claim 2, wherein the value of the significance level is an upper significance level of the chi-square (χ²) distribution based on a predetermined confidence coefficient with respect to a variance that is the square of the standard deviation.
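The detection steps of claims 1 to 3 can be illustrated with a minimal sketch. It is not the implementation described in the specification. It makes several assumptions: consecutive runs of `block_size` spots stand in for the small regions, the representative value is the block mean, the expected variance of a block mean of z-scores is `1 / block_size`, and the chi-square critical value comes from the Wilson-Hilferty approximation. The names `chi2_upper_critical` and `has_unevenness` are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def chi2_upper_critical(df, confidence=0.99):
    """Approximate upper critical value of a chi-square distribution.

    Uses the Wilson-Hilferty approximation (an assumption of this sketch,
    chosen to avoid third-party dependencies).
    """
    z = NormalDist().inv_cdf(confidence)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3


def has_unevenness(values, block_size, confidence=0.99):
    """Return True if block means vary more than chance would explain."""
    # Standardize all data values to z-scores.
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    z = [(v - mean) / sd for v in values]

    # Representative value (here: the mean) of each complete block.
    block_means = [
        statistics.fmean(z[i:i + block_size])
        for i in range(0, len(z) - block_size + 1, block_size)
    ]

    # Assumption: without unevenness, a block mean has variance 1/block_size.
    k = len(block_means)
    statistic = (k - 1) * statistics.variance(block_means) * block_size
    return statistic > chi2_upper_critical(k - 1, confidence)
```

The test compares the variance of the block means, not the standard deviation itself, against the chi-square limit, which matches the wording of claim 3.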

[4] The step of detecting the presence of the unevenness comprises:

Calculating a normal distribution parameter according to a predetermined criterion, and calculating a significance level from the normal distribution,

[5] The method according to claim 4, wherein the value of the significance level is an upper significance level of a normal distribution based on a predetermined confidence coefficient with respect to the standard deviation.

[6] A method for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on the expression levels of genes on a DNA chip and stored in a storage device, the method comprising:

Calculating an expected value of the standard deviation;

Comparing the difference with a predetermined second value based on the expected value of the standard deviation, and determining that there is unevenness in the data for the DNA chip when the difference is greater than the predetermined second value.

[7] A method for processing gene expression data obtained by processing array data obtained based on the expression level of a gene on a DNA chip and stored in a storage device to obtain analyzable data.

Storing the information of the small area when the DNA chip is divided into a plurality of small areas in the storage device;

Normalizing data values comprising the array data;

Calculating an average value or a median value of data values of array data belonging to the small area for each small area;

Accepting first to n-th significance levels set so that the confidence coefficient gradually increases;

For each of the small areas, determining whether the small area is affected by unevenness based on the average value or the median value and the first to n-th significance levels; and

Storing in the storage device information indicating that the small area is affected by unevenness.

[8] A method for processing gene expression data that obtains analyzable data by processing array data obtained based on gene expression levels on a DNA chip and stored in a storage device.

Considering the distribution of background values measured at or near the spot as a normal distribution,

Calculating a representative value of the distribution;

Calculating a standard deviation of the distribution;

Calculating a difference between each signal intensity and the representative value;

Comparing the difference with a predetermined value based on the standard deviation;

Determining that the spot data is under the influence of unevenness if the difference is greater than the predetermined value; and

Storing in the storage device information indicating that the spot is under the influence of unevenness.

[9] The method according to claim 8, wherein the predetermined value is twice the standard deviation.
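As a rough illustration of claims 8 and 9, the sketch below flags spots whose background value exceeds the representative value by more than twice the standard deviation. It assumes the representative value is the median and that only positive deviations count, since the claim speaks of a difference being "greater than" the limit. The function name `flag_background_outliers` and the parameter `factor` are hypothetical.

```python
import statistics


def flag_background_outliers(background, factor=2.0):
    """Flag spots whose background exceeds the median by more than factor * SD."""
    center = statistics.median(background)  # representative value of the distribution
    sd = statistics.stdev(background)       # standard deviation of the distribution
    # Signed difference: only unusually high background values are flagged.
    return [(b - center) > factor * sd for b in background]
```

For example, `flag_background_outliers([100, 102, 98, 101, 99, 100, 103, 97, 100, 180])` marks only the last spot.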

[10] The method for processing gene expression data according to claim 8, further comprising:

Rearranging the differences in order of magnitude;

Preparing a sequence of theoretical values, equal in number to the differences, obtained from a normal distribution model;

Plotting the theoretical values and the sorted differences in one-to-one correspondence, and detecting a difference larger than the value beyond which a linear relationship can no longer be obtained from the plot; and

Determining that the data value of the spot at which the large difference is detected is under the influence of unevenness, and storing in the storage device information indicating that the spot is under the influence of unevenness.
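The quantile comparison of claim 10 can be sketched as follows. The claim does not say how the end of the linear relationship is found. This sketch assumes a least-squares line fitted to the central half of the plot, and flags points that lie above the line by more than `tolerance` times the slope. The function name `qq_outliers`, the plotting positions `(rank + 0.5) / n`, and the `tolerance` parameter are illustrative assumptions. At least four values are required.

```python
from statistics import NormalDist, fmean


def qq_outliers(differences, tolerance=1.0):
    """Return the indices of differences that leave the normal Q-Q line upward."""
    n = len(differences)
    # Sort the differences while remembering each value's original position.
    order = sorted(range(n), key=lambda i: differences[i])
    observed = [differences[i] for i in order]
    # Theoretical normal quantiles, one per sorted difference.
    theoretical = [NormalDist().inv_cdf((r + 0.5) / n) for r in range(n)]

    # Fit a straight line to the central half of the plot (assumed robust region).
    lo, hi = n // 4, n - n // 4
    xs, ys = theoretical[lo:hi], observed[lo:hi]
    x_bar, y_bar = fmean(xs), fmean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar

    flagged = set()
    for rank, idx in enumerate(order):
        residual = observed[rank] - (intercept + slope * theoretical[rank])
        if residual > tolerance * slope:
            flagged.add(idx)
    return flagged
```

The slope of the fitted line estimates the standard deviation of the well-behaved differences, so the tolerance is expressed in those units.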

[11] A method for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the method comprising:

Calculating a representative value of the distribution;

Calculating a geometric mean of ratio values obtained by dividing the standard deviation of background values of a plurality of DNA chips by their median values;

Multiplying the median of the background values of the DNA chip by the geometric mean value of the ratio values to calculate the expected value of the standard deviation;

Calculating a difference between the background value of the spot and the median value;

Comparing the difference with a predetermined value based on the expected value of the standard deviation;

Determining that the spot data is under the influence of unevenness when the difference is greater than the predetermined value; and

Storing in the storage device information indicating that the spot is under the influence of unevenness.

[12] The method according to claim 11, wherein the predetermined value is twice the expected value of the standard deviation.
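The expected standard deviation of claims 11 and 12 can be sketched as below. This is a minimal sketch. It assumes that background values are supplied as one list per reference chip and that only positive deviations are flagged. The names `expected_sd` and `flag_spots` are hypothetical.

```python
import statistics


def expected_sd(reference_chips, target_background):
    """Expected SD = median of the target chip * geometric mean of SD/median ratios."""
    ratios = [statistics.stdev(bg) / statistics.median(bg) for bg in reference_chips]
    return statistics.median(target_background) * statistics.geometric_mean(ratios)


def flag_spots(reference_chips, target_background, factor=2.0):
    """Flag spots whose background exceeds the median by factor * expected SD."""
    center = statistics.median(target_background)
    limit = factor * expected_sd(reference_chips, target_background)
    return [(b - center) > limit for b in target_background]
```

Because the spread is predicted from other chips, a chip whose own standard deviation is inflated by unevenness does not loosen its own threshold.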

[13] A method for processing gene expression data obtained by processing array data obtained based on the expression level of a gene on a DNA chip and stored in a storage device to obtain analyzable data.

Normalizing data values comprising the array data;

And a step of storing the function value in the storage device.

[14] The method according to claim 13, wherein obtaining the spatial correction function comprises approximating a specific rank number of the data in the small area as a function of the position of the small area on the chip.

[15] The method according to claim 13, wherein obtaining the spatial correction function comprises approximating a mode value of the data in the small area as a function of the position of the small area on the chip.

[16] The method according to claim 13, wherein obtaining the spatial correction function comprises obtaining an independent single-variable equation for each of two orthogonal axes representing physical positions on the DNA chip, and the two equations are treated independently.

[17] The method according to claim 16, wherein obtaining the equation comprises, for the two orthogonal axes representing physical positions on the chip, finding many equations along one of the axes and scanning along the other axis, thereby determining a function of position.

[18] The method according to any one of claims 13 to 17, wherein determining the shape and arrangement of the small regions comprises:

Calculating a standard deviation of the data values belonging to each small region candidate;

Calculating a median of the standard deviations of the small areas;

Repeating the steps of dividing the data, calculating the standard deviations, and calculating the median;

Determining the small region candidate that gives the minimum median as the small region; and

Storing information on the determined small region in the storage device.
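The selection rule of claim 18 can be illustrated with a short sketch. It assumes the chip is a rectangular list of rows, that each candidate is a block shape given as `(height, width)`, and that incomplete blocks at the edges are ignored. Each block must contain at least two spots. The names `median_sd` and `choose_region_shape` are hypothetical.

```python
import statistics


def median_sd(chip, height, width):
    """Median of the within-block standard deviations for one block shape."""
    sds = []
    for r0 in range(0, len(chip) - height + 1, height):
        for c0 in range(0, len(chip[0]) - width + 1, width):
            block = [
                chip[r][c]
                for r in range(r0, r0 + height)
                for c in range(c0, c0 + width)
            ]
            sds.append(statistics.stdev(block))
    return statistics.median(sds)


def choose_region_shape(chip, candidates):
    """Pick the candidate block shape with the smallest median standard deviation."""
    return min(candidates, key=lambda shape: median_sd(chip, *shape))
```

On a chip whose values change only from column to column, a tall and narrow block such as `(4, 1)` wins over `(1, 4)`, because each block then contains nearly identical values.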

[19] A method for processing gene expression data obtained by processing array data obtained based on the expression level of a gene on a DNA chip and stored in a storage device to obtain analyzable data.

Setting the volume of bubbles and the volume of solution in the chamber when the DNA chip is subjected to hybridization;

Calculating, for an observation point set on the DNA chip, the relative time during which the observation point has been immersed in the solution as the chamber rotates during the hybridization;

Normalizing data values comprising the array data;

and storing the divided data value in the storage device.

[20] A computer-readable program for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Normalizing the data values comprising the array data;

and detecting the presence of unevenness in the array data of the DNA chip based on an increase in the standard deviation.

[21] The program according to claim 20, wherein, in the step of detecting the presence of the unevenness, the program causes the computer to execute a step of comparing the standard deviation with the significance level of the chi-square (χ²) distribution and determining that there is unevenness in the data if the standard deviation is greater than the significance level of the chi-square distribution.

[22] The program according to claim 20, wherein, in the step of detecting the presence of the unevenness, the program causes the computer to execute:

Calculating a difference between the standard deviation and an expected value of an average value of the standard deviation;

Calculating an expected value of the standard deviation; and

Comparing the difference with a predetermined value based on the expected value of the standard deviation, and determining that there is unevenness in the data when the difference is larger than the predetermined value.

[23] The program according to claim 22, wherein the computer is operated so that the predetermined value is twice the expected value of the standard deviation.

[24] A computer-readable program for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Calculating an expected value of the standard deviation;

Comparing the difference with a predetermined second value based on the expected value of the standard deviation, and determining that there is unevenness in the data for the DNA chip if the difference is greater than the predetermined second value.

[25] A computer-readable program for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Storing the information on the small area in the storage device when the DNA chip is divided into a plurality of small areas;

Normalizing the data values comprising the array data;

Calculating a specific rank number of data values of array data belonging to the small area for each small area;

Accepting first to n-th significance levels set so that the confidence coefficient gradually increases;

For each of the small areas, determining whether the small area is affected by unevenness based on the average value or the median value and the first to n-th significance levels; and

Storing in the storage device information indicating that the small area is affected by unevenness.

[26] A computer-readable program for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Calculating a representative value of the distribution;

Calculating the standard deviation of the distribution,

Calculating the difference between the individual signal intensity and the representative value;

Comparing the difference with a predetermined value based on the standard deviation;

Determining that the spot data is under the influence of unevenness if the difference is greater than the predetermined value; and

Storing in the storage device information indicating that the spot is under the influence of unevenness.

[27] The program according to claim 26, wherein the predetermined value is twice the standard deviation.

[28] The program according to claim 26, wherein the program further causes the computer to execute:

Rearranging the differences in order of magnitude;

Preparing a sequence of theoretical values, equal in number to the differences, obtained from a normal distribution model;

Plotting the theoretical values and the sorted differences in one-to-one correspondence, and detecting a difference larger than the value beyond which a linear relationship can no longer be obtained from the plot; and

Determining that the data value of the spot at which the large difference is detected is under the influence of unevenness, and storing in the storage device information indicating that the spot is under the influence of unevenness.

[29] A computer-readable program for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Calculating a representative value of the distribution;

Calculating a geometric mean of the ratio values obtained by dividing the standard deviation of the background values of a plurality of DNA chips by their median values;

Calculating the difference between the background value of the spot and the median, and

Comparing the difference with a predetermined value based on the expected value of the standard deviation;

Determining that the spot data is under the influence of unevenness if the difference is larger than the predetermined value; and

Storing in the storage device information indicating that the spot is under the influence of unevenness.

[30] The program according to claim 29, wherein the predetermined value is twice the expected value of the standard deviation.

[31] A gene expression data processing program for obtaining analyzable data by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Normalizing the data values comprising the array data;

Determining a spatial correction function representing the arranged small region group;

Calculating, for each small region, a function value by the spatial correction function for the data values belonging to the small region; and

Storing the function value in the storage device.

[32] The program according to claim 31, wherein, in order to obtain the spatial correction function, the program causes the computer to execute a step of approximating a specific rank number of the data in the small area as a function of the position of the small area on the chip.

[33] The program according to claim 31, wherein, in order to obtain the spatial correction function, the program causes the computer to execute a step of approximating a mode value of the data in the small area as a function of the position of the small area on the chip.

[34] The program according to claim 31, wherein, in order to obtain the spatial correction function, the program causes the computer to execute a step of obtaining an independent single-variable equation for each of two orthogonal axes representing physical positions on the DNA chip, and to handle the two equations independently.

[35] The program according to claim 34, wherein, in the step of obtaining the equation, the program causes the computer to find many equations along one of the two orthogonal axes representing physical positions on the chip and to scan along the other axis, thereby obtaining a function of position.

[36] The program according to any one of claims 31 to 35, wherein, in the step of determining the shape and arrangement of the small region, the program causes the computer to execute:

Storing information on small region candidates obtained when the DNA chip is divided into a plurality of small region candidates;

Calculating a median of the standard deviations of the small areas;

Repeating the steps of dividing, calculating a standard deviation, and calculating a median; and

Determining the small region candidate that gives the smallest median as the small region, and storing information on the determined small region in the storage device.

[37] A gene expression data processing program for obtaining analyzable data by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Normalizing the data values comprising the array data;

and storing the divided data value in the storage device.

[38] A method for processing gene expression data obtained by processing array data obtained based on the expression level of a gene on a DNA chip and stored in a storage device to obtain analyzable data.

Calculating a standard value Zbi by

Zbi = (Xbi-Mb) / Sb;

Determining, if the standard value Zbi is greater than a set rejection level, that the data value of the spot on which the calculation of Zbi is based should be rejected; and

Storing in the storage device information indicating that the data of the spot is rejected.

[39] A method for processing gene expression data that obtains analyzable data by processing array data obtained based on the expression level of a gene on a DNA chip and stored in a storage device.

Calculating a standard value Zbi by

Zbi = (Xbi-Mb) / Sb;

Determining, if the standard value Zbi is larger than the upper limit of a specified range, that the data value of the spot on which the calculation of Zbi is based should be rejected; and

Storing in the storage device information indicating that the data of the spot is rejected.

[40] A method for processing gene expression data that obtains analyzable data by processing array data obtained based on gene expression levels on a DNA chip and stored in a storage device.

Calculating a standard value Zbi by

Zbi = (Xbi-Mb) / Sb;

Calculating a distance r between the individual spot and a peripheral spot;

Determining that the data value of the individual spot should be rejected when the sum B is greater than a set rejection level, and storing in the storage device information indicating that the data of the spot is rejected.

[41] A method for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on the expression levels of genes on a DNA chip and stored in a storage device, the method comprising:

Calculating a continuous function f (Xc) that approximates the MINc by Xc;

Storing the data value obtained by subtracting f (Xc) in a storage device;

Calculating a continuous function g (Xw) that approximates the MINw by Xw;

Storing the data value obtained by subtracting g (Xw) in a storage device;

(C) For each of the DNA chip spot data values,

z = (log(x - γ) - μ) / σ

(In the above equation, γ is the calculated background value, μ is the characteristic value of the central tendency, and σ is a specific value of fluctuation)

Calculating the background value γ according to the above equation,

Subtracting the background value γ from the data value of the spot,

Calculating MEDc,

Calculating a continuous function h (Xc) approximating the MEDc by Xc;

Storing the data value divided by h (Xc) in a storage device;

(D) Each of the data values of the spots on the DNA chip is

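z = (log(x - γ) - μ) / σ

(In the above equation, γ is the calculated background value, μ is the characteristic value of the central tendency, and σ is a specific value of fluctuation)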

And calculating the background value γ according to

Calculating a continuous function j (Xw) approximating the MEDw by Xw;

Storing the data value divided by j (Xw) in a storage device;

Including one or more of (A) to (D),

Comparing each of the one or more execution results of (A) to (D) with a lognormal distribution model, and selecting the execution result that most closely approximates the model; and

Storing the selected result in the storage device.
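The additive correction that option (A) appears to describe, approximating MINc by a continuous function f(Xc) and subtracting it, can be sketched as follows. The claim text is incomplete, so several points are assumptions: MINc is taken to be the minimum data value of row c, Xc is taken to be the row index, and f is taken to be a straight line fitted by least squares. The names `fit_line` and `subtract_row_minimum_trend` are hypothetical.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def subtract_row_minimum_trend(chip):
    """Subtract a linear trend fitted to the per-row minima from every spot."""
    rows = list(range(len(chip)))
    minima = [min(row) for row in chip]       # assumed meaning of MINc
    slope, intercept = fit_line(rows, minima)  # assumed form of f(Xc)
    return [
        [value - (intercept + slope * c) for value in row]
        for c, row in enumerate(chip)
    ]
```

The multiplicative options (C) and (D) would follow the same pattern with row or column medians and a division in place of the subtraction.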

[42] A computer-readable program for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Calculating a representative value Mb and a scale value Sb on the assumption that the data values constituting the array data follow a normal distribution;

Calculating a standard value Zbi by

Zbi = (Xbi-Mb) / Sb;

Determining, when the standard value Zbi is larger than a set rejection level, that the data value of the spot on which the calculation of Zbi is based should be rejected; and

Storing in the storage device information indicating that the data of the spot is rejected.

[43] A computer-readable program for processing gene expression data, in which array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device is processed, the program causing a computer to execute:

Calculating a standard value Zbi by

Zbi = (Xbi-Mb) / Sb;

Determining, if the standard value Zbi is larger than the upper limit of a specified range, that the data value of the spot on which the calculation of Zbi is based should be rejected; and

Storing in the storage device information indicating that the data of the spot is rejected.

[44] A computer-readable program for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Calculating a standard value Zbi by

Zbi = (Xbi-Mb) / Sb;

Calculating a distance r between the individual spot and a peripheral spot;

Determining that the data value of the individual spot should be rejected when the sum B is greater than a set rejection level, and storing in the storage device information indicating that the data of the spot is rejected.

[45] A computer-readable program for processing gene expression data, in which analyzable data is obtained by processing array data that has been obtained based on gene expression levels on a DNA chip and stored in a storage device, the program causing a computer to execute:

Calculating a continuous function f (Xc) that approximates the MINc by Xc;

Subtracting the f (Xc) from the data value of the spot belonging to the c-th row on the DNA chip; and

Storing the data value obtained by subtracting f (Xc) in a storage device;

Calculating a continuous function g (Xw) that approximates the MINw by Xw;

Storing the data value obtained by subtracting g (Xw) in a storage device;

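(C) For each of the DNA chip spot data values,

z = (log(x - γ) - μ) / σ

(In the above equation, γ is the calculated background value, μ is the characteristic value of the central tendency, and σ is a specific value of fluctuation)

Calculating the background value γ according to the above equation,

Subtracting the background value γ from the data value of the spot,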

Calculating MEDc,

Calculating a continuous function h (Xc) approximating the MEDc by Xc;

Storing the data value divided by h (Xc) in a storage device;

(D) Each of the data values of the spots on the DNA chip is

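z = (log(x - γ) - μ) / σ

(In the above equation, γ is the calculated background value, μ is the characteristic value of the central tendency, and σ is a specific value of fluctuation)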

And calculating the background value γ according to

Calculating a continuous function j (Xw) approximating the MEDw by Xw;

Dividing the data value of the spot belonging to the c-th row in the DNA chip by the j (Xw); and

Storing the data value divided by j (Xw) in a storage device;

Including one or more of (A) to (D),

Comparing each of the one or more execution results of (A) to (D) with a lognormal distribution model, and selecting the execution result that most closely approximates the model; and

Storing the selected result in the storage device.