| PLNT4610/PLNT7690
Bioinformatics Lecture 12, part 2 of 3 |
| Microarray data must be put through a data "funnel" before they can be used
for making inferences about biological systems. At each step of
refinement, some data are necessarily discarded, so that only
significant data remain at the end.
Acquisition - based on the quality of the images, some slides must be thrown out.
Filter - spots with signals below the level of accurate measurement, or spots that are not of a uniform shape, are discarded.
Normalize - slide-to-slide variation must be normalized using internal control spots.
Validate - choose genes whose expression is reproducible, as determined by correlation between experiments, and for which significant change occurs between experimental treatments. |

The most fundamental type of normalization is to subtract the signal of the negative
controls from the signal for each gene. Negative controls are DNAs
whose sequences are not present in the mRNA population. For example, an array of plant
genes might contain several spots of vector DNA.
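A minimal sketch of negative-control subtraction in Python. Using the median of the negative-control spots as the background estimate, and clipping results below zero to zero, are assumptions made here for illustration. The function name, gene names and counts are hypothetical.

```python
from statistics import median

def subtract_background(raw_counts, negative_controls):
    """Subtract a background estimate from the raw counts of every gene.

    raw_counts: dict mapping gene name -> raw signal for that spot
    negative_controls: raw signals of the negative-control spots
    """
    # Assumption: the median of the negative-control spots is the background.
    background = median(negative_controls)
    # Assumption: signals that fall below background are clipped to zero.
    return {gene: max(count - background, 0.0)
            for gene, count in raw_counts.items()}

# Hypothetical counts for three genes and three vector (negative-control) spots
raw = {"geneA": 500.0, "geneB": 120.0, "geneC": 40.0}
vector_spots = [45.0, 50.0, 55.0]
corrected = subtract_background(raw, vector_spots)
print(corrected)  # {'geneA': 450.0, 'geneB': 70.0, 'geneC': 0.0}
```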
To allow comparison of genes from one filter to the next, it is often useful to spike the labeling reaction with some foreign RNA or DNA that is not normally in the RNA population. While in principle some presumably "constitutive" genes like actin, tubulin, or ubiquitin might serve, careful experiments often show that these genes are not really constitutive. Therefore, foreign gene sequences, known not to be present in the species being studied, are better controls. For example, a human RNA population might be spiked with plant RNA, and plant genes used as positive controls on the array. The signal s_i for gene i would therefore be the raw counts g_i divided by the median of the counts for the control spots.
Normalization of the signal for each gene to a ratio makes it possible to compare ratios between experiments, provided that the spiked controls are the same in all experiments.
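The calculation of s_i described above can be sketched as follows. The function name and all counts are hypothetical. The second array is constructed so that every raw count is doubled, to show that the normalized signals remain comparable.

```python
from statistics import median

def normalize_to_spike(raw_counts, spike_counts):
    """Return s_i = g_i / median(spike-in control counts) for every gene i."""
    control_median = median(spike_counts)
    if control_median <= 0:
        raise ValueError("spike-in controls gave no measurable signal")
    return {gene: g / control_median for gene, g in raw_counts.items()}

# Hypothetical arrays: the second was labeled twice as efficiently overall,
# so every raw count (genes and spiked controls alike) is doubled.
array1 = normalize_to_spike({"geneA": 2500.0, "geneB": 500.0},
                            [800.0, 1000.0, 1200.0])
array2 = normalize_to_spike({"geneA": 5000.0, "geneB": 1000.0},
                            [1600.0, 2000.0, 2400.0])
print(array1)  # {'geneA': 2.5, 'geneB': 0.5}
print(array2)  # {'geneA': 2.5, 'geneB': 0.5}
```

The normalized values agree only because the same spiked controls were used on both arrays, which is the condition stated above.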
Normalization to a positive control is typically used in single-label
experiments. Comparison of one experiment to another can be done either
by plotting the signal s_i directly on a graph, or by converting the signals
from two experiments into a ratio, usually by choosing
one treatment as a control. For example, in a time course, the 0 hour
timepoint might be chosen, and the signal from all other timepoints divided by the
signal for the 0 hour timepoint, to give a ratio.
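As a small illustration of the time course case, the sketch below divides each normalized signal by the 0 hour signal. The timepoint labels and values are hypothetical, and raising an error for a zero baseline is a choice made here, not something prescribed by the lecture.

```python
def ratios_to_baseline(signals, baseline="0h"):
    """Divide the signal at every timepoint by the signal at the baseline."""
    base = signals[baseline]
    if base == 0:
        # No ratio can be calculated if the gene is undetectable at baseline.
        raise ValueError("baseline signal is zero; ratio is undefined")
    return {timepoint: s / base for timepoint, s in signals.items()}

# Hypothetical normalized signals s_i for one gene across a time course
timecourse = {"0h": 0.5, "1h": 1.0, "4h": 2.0, "8h": 0.25}
ratios = ratios_to_baseline(timecourse)
print(ratios)  # {'0h': 1.0, '1h': 2.0, '4h': 4.0, '8h': 0.5}
```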
Example
cDNA probes were made from human fibroblasts treated with serum
(Cy5-dUTP) or from serum-deprived cells (Cy3-dUTP). Three replicate arrays were scanned,
and the same regions from the replicate arrays are shown. Because the serum-treated
sample carries the red Cy5 label, serum-inducible genes appear red, while
serum-suppressible genes appear green. 1 - protein disulfide
isomerase-related protein P5; 2 - IL-8 precursor; 3 - EST AA057170; 4 -
vascular endothelial growth factor.
from Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JCF, Trent
JM, Staudt LM, Hudson J Jr, Boguski MS, Lashkari D, Shalon D, Botstein
D, Brown PO (1999) The transcriptional program in the response of human
fibroblasts to serum. Science 283(5398):83-87.
Double-label experiments using a mixed reference probe make it possible to compare many treatments
One of the problems with double-label experiments is that you need to
do a separate hybridization experiment for every comparison you want to
make. For example, if you want
to measure the ratio of expression in cells treated with three
different drugs, compared to a control and to each other, you need to do the
following experiments:
Ratios of expression for a hypothetical gene from cells given different drug treatments.

| Cy5 (rows) / Cy3 (columns) | control | drug1 | drug2 | drug3 |
|---|---|---|---|---|
| control | 1.0 | 1.5 | 1.0 | 3.0 |
| drug1 | 0.667 | 1.0 | 0.667 | 2.0 |
| drug2 | 1.0 | 0.667 | 1.0 | 3.0 |
| drug3 | 3.0 | 2.0 | 3.0 | 1.0 |

In general, if there are n treatments, n²
experiments are required for all possible comparisons.
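A brief sketch of where the n² figure comes from: every treatment can be paired, as the Cy5 sample, with every treatment as the Cy3 sample. The helper name is hypothetical, and counting the off-diagonal cells separately is an addition made here for illustration.

```python
from itertools import product

def all_pairings(treatments):
    """Every (Cy5, Cy3) pairing, i.e. every cell of the n x n table."""
    return list(product(treatments, repeat=2))

treatments = ["control", "drug1", "drug2", "drug3"]
pairings = all_pairings(treatments)
print(len(pairings))  # 16 cells for n = 4

# Cells on the diagonal pair a sample with itself.
informative = [(cy5, cy3) for cy5, cy3 in pairings if cy5 != cy3]
print(len(informative))  # 12
```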
One possible alternative would be to simply compare all treatments to
the control. The problem is that there will always be genes that have
no detectable transcripts in one or more of the treatments or in the
control. For those genes there is no way to calculate a ratio.
The solution is to create
a reference probe, containing labeled cDNA derived from all treatments.
For example, equal amounts of mRNA from each of the 4 treatments might
be mixed to create a reference probe. In each experiment, the control
or drug mRNA would be labeled with Cy3, and the reference probe labeled
with Cy5.
Ratios of expression for a hypothetical gene from cells given different drug treatments.

| Cy5 (rows) / Cy3 (columns) | control | drug1 | drug2 | drug3 |
|---|---|---|---|---|
| control + drug1 + drug2 + drug3 | 1.625 | 2.43 | 1.625 | 4.78 |

All genes that are present
in any mRNA population will be present in the reference probe. Thus for
n treatments there are n labeling/hybridization experiments.
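The sketch below illustrates why n hybridizations against a common reference are enough: the ratio between any two treatments can be recovered indirectly by dividing their ratios to the reference. The expression levels are hypothetical, and modeling the reference as the simple mean of the treatment levels (equal amounts of mRNA mixed) is an assumption. The sketch is not meant to reproduce the figures in the table above.

```python
from statistics import fmean

def ratios_to_reference(levels):
    """Return the reference level and each treatment's ratio to it.

    Assumption: the reference probe behaves like the mean of all treatments.
    """
    reference = fmean(levels.values())
    return reference, {t: level / reference for t, level in levels.items()}

# Hypothetical expression levels of one gene in each mRNA population
levels = {"control": 1.0, "drug1": 1.5, "drug2": 1.0, "drug3": 3.0}
reference, to_ref = ratios_to_reference(levels)

# drug3 versus drug1, without ever hybridizing the two against each other
indirect = to_ref["drug3"] / to_ref["drug1"]
print(round(indirect, 3))  # 2.0

# A gene with no transcripts in the control still gets a defined ratio,
# because the reference contains transcripts from the other treatments.
silent_ref, silent = ratios_to_reference(
    {"control": 0.0, "drug1": 2.0, "drug2": 0.0, "drug3": 2.0})
print(silent["control"], silent["drug1"])  # 0.0 2.0
```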
| One common method for assessing
the reproducibility of the results is a
scatter plot. A scatter plot compares two replicates of an experiment.
Ideally, each datapoint in one replicate should be identical to the corresponding datapoint in
the other. In the example at right, dye-swap technical replicates have been performed. In one experiment, an RNA population has been labeled with Cy5. In the dye swap, the same population was labeled with Cy3. If experimental conditions are well controlled, results for each gene (dots) should be the same for both dyes, resulting in a perfect diagonal line, where the intensity in one labeling is directly proportional to the intensity in the other labeling (yellow dots). Where there is a labeling bias, genes will fall off the diagonal (red, green dots). |
![]() from Wishart D, Van Domselaar G (2004) Analyzing 2 Colour Microarrays. Applied Computational Genomics Course Presentation. |
![]() After outliers are removed, a new slope is calculated. |
![]() Using the new slope, the data are searched again for datapoints that deviate from the linear function by more than 2 standard deviations. The process is repeated until all remaining data fit the linear function. |
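The iterative procedure can be sketched as below, assuming ordinary least squares for the line and the standard deviation of the residuals as the measure of deviation. The function names, the stopping tolerance, the minimum of three retained points and the iteration cap are safeguards added here. The data are hypothetical.

```python
from statistics import fmean, pstdev

def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) points."""
    xs, ys = zip(*points)
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return slope, my - slope * mx

def iterative_fit(points, n_sd=2.0, tol=1e-9, max_iter=50):
    """Refit the line, dropping points more than n_sd s.d. from it each round."""
    points = list(points)
    for _ in range(max_iter):
        slope, intercept = fit_line(points)
        residuals = [y - (slope * x + intercept) for x, y in points]
        sd = pstdev(residuals)
        if sd < tol:                      # remaining points already fit the line
            break
        kept = [p for p, r in zip(points, residuals) if abs(r) <= n_sd * sd]
        if len(kept) == len(points) or len(kept) < 3:
            break                         # nothing removed, or too few points left
        points = kept
    return slope, intercept, points

# Hypothetical dye-swap intensities: y = 2x, with two spots pushed off the line
data = [(float(x), 2.0 * x) for x in range(1, 21)]
data[4] = (5.0, 40.0)
data[14] = (15.0, 0.0)
slope, intercept, kept = iterative_fit(data)
print(round(slope, 6), round(intercept, 6), len(kept))
```

With real, noisy data a fixed 2 s.d. cutoff removes some points in nearly every round, so a stopping rule such as the iteration cap used here is worth keeping.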
| The greatest variation from
replicate to replicate falls at either extreme of the intensity scale.
Strongly expressed transcripts are underestimated
because of signal saturation (burn out) during detection. The detection
of weakly expressed genes is poorly reproducible because of the low
sensitivity of detection. A Loess function is analogous to a linear regression, except that it represents the data using a curvilinear function. |
| The
data are then corrected using the Loess function, so that they are
represented as if the Cy5/Cy3 ratios were not affected by the average
signal. |
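A simplified sketch of this kind of correction, written in plain Python. It assumes the common convention of working with M = log2(Cy5/Cy3) and A = the average log2 intensity, and it uses a basic locally weighted straight-line fit with a tricube kernel as a stand-in for a full Loess implementation. The function names, the window fraction and the data are all hypothetical.

```python
import math

def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_fit(xs, ys, x0, frac):
    """Weighted straight-line fit around x0 (a point of xs), evaluated at x0."""
    k = min(len(xs), max(3, math.ceil(frac * len(xs))))  # points in the window
    bandwidth = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    w = [tricube((x - x0) / bandwidth) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def loess_normalize(cy5, cy3, frac=0.5):
    """Return (A, corrected M): M with the intensity-dependent trend removed."""
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(cy5, cy3)]
    m = [math.log2(r) - math.log2(g) for r, g in zip(cy5, cy3)]
    trend = [local_fit(a, m, a0, frac) for a0 in a]
    return a, [mi - ti for mi, ti in zip(m, trend)]

# Hypothetical dye bias: the log ratio M drifts with the average intensity A
true_a = [6.0 + 0.5 * i for i in range(19)]
bias = [0.2 * ai - 2.0 for ai in true_a]
cy5 = [2 ** (ai + bi / 2) for ai, bi in zip(true_a, bias)]
cy3 = [2 ** (ai - bi / 2) for ai, bi in zip(true_a, bias)]
a, m_corrected = loess_normalize(cy5, cy3)
print(max(abs(v) for v in m_corrected))  # close to zero: the trend is removed
```

In this toy dataset every spot carries only dye bias, so the corrected log ratios collapse to zero. In a real experiment, genuinely changed genes would remain as deviations from the fitted curve.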

