PLNT4610/PLNT7690 Bioinformatics
Lecture 12, part 2 of 3

B. Experimental design and normalization


Microarray data must be put through a data "funnel" before they can be used for making inferences about biological systems. At each step of refinement, some data are necessarily discarded, so that only significant data remain at the end.

Acquisition - based on the quality of the images, some slides must be thrown out

Filter - spots with signals below the level of accurate measurement, or spots that are not of a uniform shape, are discarded

Normalize - slide to slide variation must be normalized using internal control spots

Validate - choose genes whose expression is reproducible, as determined by correlation between experiments, and for which significant change occurs between experimental treatments


2. Normalization

The raw intensities of signal from each spot on the array are not directly comparable. Depending on the types of experiments done, a number of different approaches to normalization may be needed. Not all types of normalization are appropriate in all experiments. Some experiments may use more than one type of normalization.

Note: For normalization, the median is preferable to the mean, because the mean is strongly affected by outlying datapoints.
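A quick way to see why the median is preferred is to compare the two statistics on a small set of control spots that contains one outlier. The intensities below are made-up values chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical control-spot intensities; the last value is an outlier
# (for example, a dust speck on the slide).
control_spots = [100, 102, 98, 101, 99, 1000]

center_mean = mean(control_spots)      # pulled far upward by the single outlier
center_median = median(control_spots)  # stays close to the bulk of the spots

print(center_mean, center_median)
```

Five of the six spots lie near 100, and only the median reflects that.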

a. Subtraction of negative controls from gene signals


The most fundamental type of normalization is to subtract negative controls from the signal for each gene. Negative controls are DNAs which are not present in the mRNA population. For example, an array of plant genes might contain several vector spots.
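A minimal sketch of this subtraction, assuming the background is summarized as the median of the negative-control spots and that corrected values below zero are clipped to zero (both are choices made for this example, not prescribed above). The gene names and intensities are hypothetical.

```python
from statistics import median

def subtract_background(gene_signals, negative_controls):
    """Subtract the median negative-control signal from each gene signal.

    Corrected values below zero are clipped to zero (an assumption of
    this sketch; other conventions exist).
    """
    background = median(negative_controls)
    return {gene: max(raw - background, 0.0)
            for gene, raw in gene_signals.items()}

# Hypothetical raw intensities and vector (negative control) spots.
raw = {"geneA": 520.0, "geneB": 75.0, "geneC": 1310.0}
vector_spots = [60.0, 72.0, 65.0, 480.0]  # one spot is an outlier

corrected = subtract_background(raw, vector_spots)
print(corrected)
```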
 
 

b. Ratio of signal to positive control


 

To allow comparison of genes from one filter to the next, it is often useful to spike the labeling reaction with some foreign RNA or DNA that is not normally in the RNA population. While in principle some presumably "constitutive" genes like actin, tubulin, or ubiquitin might serve, careful experiments often show that these genes are not really constitutive. Therefore, foreign gene sequences, known not to be present in the species being studied, are better controls. For example, a human RNA population might be spiked with plant RNA, and plant genes used as positive controls on the array. The signal s_i for gene i would therefore be the raw counts g_i divided by the median of the counts for the control spots.

Normalization of the signal for each gene to a ratio makes it possible to compare ratios between experiments, provided that the spiked controls are the same in all experiments.
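The calculation of s_i can be sketched as follows. The function name and the counts are hypothetical; the two filters are given different overall brightness to show that the normalized signals become comparable.

```python
from statistics import median

def normalize_to_spike(raw_counts, control_counts):
    """Return s_i = g_i / median(control spots) for every gene i."""
    control_median = median(control_counts)
    return {gene: g / control_median for gene, g in raw_counts.items()}

# Hypothetical data: filter 2 is twice as bright as filter 1 overall.
filter1 = normalize_to_spike({"geneA": 400.0}, [190.0, 200.0, 210.0])
filter2 = normalize_to_spike({"geneA": 800.0}, [390.0, 400.0, 410.0])

print(filter1["geneA"], filter2["geneA"])
```

Although the raw counts differ twofold, the ratio to the spiked control is the same on both filters.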

Normalization to a positive control is typically used in single-label experiments. Comparison of one experiment to another can be done either by plotting the signal s_i directly on a graph, or by converting the signals from two experiments into a ratio, usually by choosing one treatment as a control. For example, in a timecourse, a 0 hour timepoint might be chosen, and the signal from all other timepoints divided by the signal for the 0 hour timepoint, to give a ratio.
 

c. Ratio of each gene to its control level

Because of the many sources of variation from experiment to experiment, a good control is to choose some experimental condition as a baseline, to use as a control against all other experimental conditions or treatments. For example, the level of expression in a wild type organism might be the baseline, for comparison with expression levels in mutants. An excellent control can then be implemented by labeling the control RNA population with one dye (e.g. Cy3) and all other RNA populations with a different dye (e.g. Cy5). Each labeled experimental population is then mixed with an equal quantity of the labeled control RNA, and the mixed probe is hybridized with a gene array. The array is scanned at the wavelength for each dye, and the ratio of the experiment to the control is the ratio of the background-corrected intensities for the two dyes. This approach is illustrated below:

 

Example
cDNA probes were made from human fibroblasts treated with serum (Cy5-dUTP) or serum-deprived cells (Cy3-dUTP). 3 replicate arrays were scanned, and the same regions from the replicate arrays are shown. Serum-inducible genes appear red (Cy5), while serum-suppressible genes appear green (Cy3). 1 - protein disulfide isomerase-related protein P5; 2 - IL-8 precursor; 3 - EST AA057170; 4 - vascular endothelial growth factor.

from Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JCF, Trent JM, Staudt LM, Hudson J Jr, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283(5398):83-87.

Double-label experiments using a mixed reference probe make it possible to compare many treatments
One of the problems with double-label experiments is that you need to do a separate hybridization experiment for every comparison you want to make.

For example, if you want to measure the ratio of expression in cells treated with three different drugs, compared to a control, you need to do the following experiments:

Ratios of expression for a hypothetical gene from cells given different drug treatments.

                        Cy3
Cy5         control   drug1   drug2   drug3
control     1.0       1.5     1.0     3.0
drug1       0.667     1.0     0.667   2.0
drug2       1.0       0.667   1.0     3.0
drug3       3.0       2.0     3.0     1.0

In general, if there are n treatments, n² experiments are required for all possible comparisons.
One possible alternative would be to simply compare all treatments to the control. The problem is that there will always be genes that have no detectable transcripts in one or more of the treatments or in the control. For those genes there is no way to calculate a ratio.

The solution is to create a reference probe, containing labeled cDNA derived from all treatments. For example, equal amounts of mRNA from each of the 4 treatments might be mixed to create a reference probe. In all experiments, one control or drug mRNA would be labeled with Cy3, and the reference probe labeled with Cy5.

Ratios of expression for a hypothetical gene from cells given different drug treatments

                                                Cy3
Cy5                                 control   drug1   drug2   drug3
control + drug1 + drug2 + drug3     1.625     2.43    1.625   4.78

All genes that are present in any mRNA population will be present in the reference probe. Thus for n treatments there are n labeling/hybridization experiments. 
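The sketch below illustrates why a pooled reference helps. It assumes the reference level of a gene can be modelled as the mean of its levels in the pooled populations. The expression levels are hypothetical and are not meant to reproduce the table above. Note that the gene is undetectable in the control, so a ratio against the control alone could not be calculated.

```python
from statistics import mean

# Hypothetical expression levels of one gene in four RNA populations.
# The gene has no detectable transcript in the control.
levels = {"control": 0.0, "drug1": 300.0, "drug2": 200.0, "drug3": 500.0}

# Equal amounts of each population are pooled, so the reference level is
# modelled here as the mean of the individual levels (an assumption).
reference = mean(list(levels.values()))

# One hybridization per treatment: treatment (Cy3) against reference (Cy5).
ratios = {name: level / reference for name, level in levels.items()}

def compare(a, b):
    """Ratio of treatment a to treatment b, derived from reference ratios."""
    return ratios[a] / ratios[b]

print(ratios)
print(compare("drug3", "drug1"))
```

Because every array shares the same denominator, any pairwise comparison between treatments can be derived afterwards from the n reference ratios, as long as the gene is detectable in the treatment used as denominator.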

d. Normalized ratios are usually expressed as logs

To facilitate mathematical handling of the data, as well as comparisons over a wide range of expression levels, ratios are usually expressed as logs. For example, if a gene is expressed at a 100-fold greater level in the control than in the mutant, the mutant/control ratio is 1/100 and log10 (1/100) = -2. A log ratio of 0 is therefore indicative of a gene whose expression is the same in both conditions or treatments.
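A short sketch of the log transformation, using hypothetical signal values. It shows that equal fold changes up and down become symmetric around 0.

```python
import math

def log10_ratio(experiment, control):
    """Return log10(experiment / control); both signals must be > 0."""
    return math.log10(experiment / control)

# Hypothetical background-corrected signals (mutant, wild type control).
down = log10_ratio(1.0, 100.0)   # 100-fold lower in the mutant
up = log10_ratio(100.0, 1.0)     # 100-fold higher in the mutant
same = log10_ratio(50.0, 50.0)   # unchanged

print(down, up, same)
```

On the raw ratio scale the same two changes would be 0.01 and 100, which are hard to compare by eye or to average.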

e. Error due to low-level genes can affect the entire experiment

One problem with ratios in double label experiments is that low intensity spots have the greatest error. Therefore, even if a spot has a high level of expression in one treatment, the error of the ratio will be governed by the variation in measurement of expression in a low-level condition. If a gene is expressed at low levels in both treatments, the error will be compounded.
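This can be made concrete with the standard first-order propagation of error for a ratio. The symbols are introduced here for illustration: a and b are the two measured intensities, with standard deviations sigma_a and sigma_b, and the two measurement errors are assumed to be independent.

```latex
R = \frac{a}{b},
\qquad
\left(\frac{\sigma_R}{R}\right)^2 \approx
\left(\frac{\sigma_a}{a}\right)^2 + \left(\frac{\sigma_b}{b}\right)^2
```

If the absolute measurement noise is roughly the same for all spots, the relative error sigma_b / b becomes large when b is small, so the low-level condition dominates the error of the ratio no matter how precisely the high-level condition is measured.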

f. Quality assurance
One common method for assessing the reproducibility of the results is a scatter plot. A scatter plot compares two replicates of an experiment. Ideally, each datapoint in one replicate should be identical to that in the other experiment.

In the example at right, dye-swap technical replicates have been performed. In one experiment, an RNA population has been labeled with Cy5. In the dye swap, the same population was labeled with Cy3. If experimental conditions are well controlled, results for each gene (dots) should always be the same for both dyes, resulting in a perfect diagonal line, where the intensity in one labeling is directly proportional to the intensity in the other labeling (yellow dots). Where there is a labeling bias, genes will fall off the diagonal (red, green dots).



from Wishart D, Van Domselaar G (2004) Analyzing 2 Colour Microarrays. Applied Computational Genomics Course Presentation.

This is only one example of the many comparisons done using scatter plots. They may all look alike, but read the X and Y axes carefully to make sure you know what is being compared.


g. Removal of outliers

Outliers are datapoints with poor reproducibility. One approach to removal of outliers is to iteratively  remove datapoints whose variation exceeds 2 standard deviations from the linear regression of all datapoints. By definition, points falling far from the line (y = mx + b) are not reproducible.




After outliers are removed, a new slope is calculated.

Using the new slope, the data are searched again for datapoints that deviate from the linear function by more than 2 s.d. The process is repeated until all data fit the linear function.
from Wishart D, Van Domselaar G (2004) Analyzing 2 Colour Microarrays. Applied Computational Genomics Course Presentation.
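The iterative procedure can be sketched as below. This is one possible reading of the method, not the cited authors' implementation: it uses an ordinary least-squares line, the population standard deviation of the residuals, and a hypothetical `max_rounds` safety limit. The replicate data are made up.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope m, intercept b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)  # assumes the x values are not all equal
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, my - m * mx

def remove_outliers(points, n_sd=2.0, max_rounds=20):
    """Repeatedly drop points whose residual exceeds n_sd standard deviations."""
    kept = list(points)
    for _ in range(max_rounds):
        xs, ys = zip(*kept)
        m, b = fit_line(xs, ys)
        residuals = [y - (m * x + b) for x, y in kept]
        sd = pstdev(residuals)
        inliers = [p for p, r in zip(kept, residuals) if abs(r) <= n_sd * sd]
        # Stop when nothing was removed, or too few points would remain.
        if len(inliers) == len(kept) or len(inliers) < 3:
            break
        kept = inliers
    m, b = fit_line(*zip(*kept))
    return kept, m, b

# Hypothetical replicate intensities (replicate 1, replicate 2).
replicates = [(1, 1.1), (2, 1.9), (3, 3.1), (4, 3.9), (5, 15.0),
              (6, 5.9), (7, 7.1), (8, 7.9), (9, 9.1), (10, 9.9)]
kept, slope, intercept = remove_outliers(replicates)
print(len(kept), slope, intercept)
```

In this example the point (5, 15.0) is removed in the first round, and the refitted slope moves close to 1, as expected for reproducible replicates.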

h. Loess (Lowess) normalization


The greatest variation from replicate to replicate falls on either extreme of the intensity scale. The signal from strongly expressed transcripts is underestimated, because of signal saturation (burn out) during detection. The detection of weakly expressed genes is poorly reproducible, because of low sensitivity of detection.

A Loess function is analogous to a linear regression. It represents the data using a curvilinear function.
The data are corrected to correspond to the Loess function. The data are represented as if the Cy5/Cy3 ratios were not affected by the average signal.
from Wishart D, Van Domselaar G (2004) Analyzing 2 Colour Microarrays. Applied Computational Genomics Course Presentation.
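A simplified sketch of the idea follows. It assumes the common convention of fitting the log ratio M = log2(Cy5/Cy3) against the average log intensity A = 0.5 * log2(Cy5 * Cy3), and it uses a basic local linear fit with tricube weights rather than a full Loess implementation (no robustness iterations). The function names, the `frac` parameter, and the intensities are hypothetical.

```python
import math

def loess_fit(xs, ys, frac=0.5):
    """Local linear regression with tricube weights; fitted y at each x."""
    n = len(xs)
    k = max(3, int(math.ceil(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in xs:
        # The k nearest neighbours of x0 along the x axis (x0 itself included).
        near = sorted((abs(x - x0), x, y) for x, y in zip(xs, ys))[:k]
        h = near[-1][0] or 1.0            # bandwidth; avoid division by zero
        w = [(1 - (d / h) ** 3) ** 3 for d, _, _ in near]
        sw = sum(w)
        mx = sum(wi * x for wi, (_, x, _) in zip(w, near)) / sw
        my = sum(wi * y for wi, (_, _, y) in zip(w, near)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, (_, x, _) in zip(w, near))
        sxy = sum(wi * (x - mx) * (y - my) for wi, (_, x, y) in zip(w, near))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalize(cy5, cy3, frac=0.5):
    """Return log2 ratios with the intensity-dependent trend subtracted."""
    m = [math.log2(r / g) for r, g in zip(cy5, cy3)]        # log ratio
    a = [0.5 * math.log2(r * g) for r, g in zip(cy5, cy3)]  # mean log intensity
    trend = loess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

# Hypothetical spots with a dye bias that grows with intensity.
cy3 = [2.0 ** i for i in range(4, 14)]
cy5 = [g * 2.0 ** (0.2 * i) for i, g in zip(range(4, 14), cy3)]
normalized = loess_normalize(cy5, cy3)
print(normalized)
```

Before correction the log ratios in this example rise steadily with intensity; after the fitted trend is subtracted they are all close to 0, as if the Cy5/Cy3 ratio did not depend on the average signal.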

i. Which genes show a "significant" difference between treatments?
 
Many microarray programs use statistical methods such as ANOVA to calculate a p value, p(x > α), indicating the probability that the observed value x would exceed α by random chance alone. In many experiments, we accept p values less than 0.05 or 0.01 as being significant. Remember, however, that this means that 1 out of 20 times, or 1 out of 100 times, we get a false positive! Since each gene array can be thought of as thousands of experiments, we expect to see dozens or even hundreds of false positives that appear to be significant, but are not. Therefore, the threshold p values for assessing significance must be adjusted. The page linked below presents some of the ways of correcting p values in microarray experiments.

http://www.bea.ki.se/staff/reimers/Web.Pages/Statistical.Significance.htm
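As an illustration, the sketch below first computes the expected number of false positives for a hypothetical array of 10,000 genes, then applies two widely used corrections, Bonferroni and Benjamini-Hochberg. These two methods are chosen here as examples and are not necessarily the ones described on the linked page. The p values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p values that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Flag p values that pass the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    significant = [False] * m
    for i in order[:cutoff]:
        significant[i] = True
    return significant

# With 10,000 genes tested at p < 0.05 and no real differences,
# about 500 genes are expected to look "significant" by chance.
expected_false_positives = 10000 * 0.05

# Hypothetical p values for ten genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(expected_false_positives)
print(sum(p_i < 0.05 for p_i in p))   # genes passing the uncorrected threshold
print(sum(bonferroni(p)))             # genes passing Bonferroni
print(sum(benjamini_hochberg(p)))     # genes passing Benjamini-Hochberg
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected fraction of false positives among the genes called significant.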


Fuhrman S, Liang S, Wen X, Somogyi R (2002) Zeroing in on essential gene expression data. In Grigorenko EV (ed.) DNA Arrays: Technologies and experimental strategies. CRC Press.  pp129 - 140.

In gene array experiments, we are most interested in genes whose expression differs between two treatments, or over a timecourse of expression. Where only two treatments are involved (e.g. tumor vs. non-tumor cells) a simple threshold measure (e.g. > 2 s.d. change in log of expression) is adequate to identify the most important genes.

Where many treatments or conditions are being compared, one approach is to use Shannon's Entropy function as a measure of the information content of the data for each gene.


Normalized expression levels for 5 genes from rat hippocampus at different stages of development: Embryonic 15 d, 18 d (E); Postnatal 0 - 25 d (P); Adult (A). TCP = T-complex protein; NT3 = neurotrophin 3; GAP43 = growth-associated protein 43; CNTFR = ciliary neurotrophic factor receptor.

Entropy is calculated as

H = -\sum_{i} p_i \log_2 p_i

where p_i is the probability (frequency) with which expression level i occurs for the gene across the treatments.

Entropy is a measure of information content. Genes whose expression doesn't change have 0 entropy (eg. CNTFR). The more expression responds to different conditions, the more information is contained in its expression pattern. Genes assayed in a set of conditions can be ranked based on entropy. The genes with the highest entropy are the ones that respond most to a series of treatments or conditions. These are the "interesting" genes.
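A minimal sketch of the ranking, assuming expression values have already been normalized to the range 0 to 1 and are binned into three equal-width levels (the number of bins and the gene names are assumptions of this example). The entropy is written as the sum of p * log2(1/p), which equals the usual Shannon form, minus the sum of p * log2(p).

```python
import math
from collections import Counter

def expression_entropy(levels, n_bins=3):
    """Shannon entropy (bits) of binned expression levels for one gene.

    levels: normalized expression values in [0, 1], one per condition.
    """
    bins = [min(int(v * n_bins), n_bins - 1) for v in levels]
    n = len(levels)
    return sum((c / n) * math.log2(n / c) for c in Counter(bins).values())

# Hypothetical normalized expression across nine conditions.
genes = {
    "flat_gene": [0.9] * 9,
    "switch_gene": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9],
    "varied_gene": [0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.9, 0.9, 0.9],
}

entropy = {name: expression_entropy(v) for name, v in genes.items()}
ranked = sorted(entropy, key=entropy.get, reverse=True)
print(entropy)
print(ranked)
```

The gene whose expression never changes has an entropy of 0, and the gene that visits all three levels equally often ranks highest.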


