December 1, 2009
GENE ARRAYS
GENE ARRAY LINKS
Bibliography on Microarray Data Analysis [http://www.nslij-genetics.org/microarray/]
A. Gene Array Technology
1. How do gene arrays work?
2. Types of experiments
3. Types of data
4. What are we trying to learn from gene arrays?
B. Experimental design and normalization
1. Sources of experimental variation
2. Normalization
C. Grouping genes with similar expression patterns
1. Cluster analysis
2. Self-organizing maps
A. Gene Array Technology
It has become common in many model systems to sequence large numbers of cDNAs from an organism. Craig Venter at the NIH realized that the different classes of genes present in an mRNA population could be surveyed rapidly by doing single sequencing reactions on clones from a cDNA library. Although sequence data from a single reaction is likely to contain errors, the error rate of automated sequencing methods is now far less than one error per hundred bases, more than good enough to identify a gene from several hundred bases of sequence. Sequences derived from one-pass sequencing of cDNA libraries are referred to as Expressed Sequence Tags, or ESTs. The existence of large sets of ESTs opens the door for studying gene expression on a large scale.
Animation: Microarray Tutorial at University Health Network Microarray Centre, Toronto.
http://www.microarrays.ca/info/tutorials.html
1. How do gene arrays work?
Gene array experiments are
sometimes referred to as "reverse Northerns".
In Northern blots, RNA is blotted onto a filter and hybridized with a
probe
to detect a particular species of mRNA as a distinct band or spot. In
gene
array hybridization, cDNAs are spotted onto a filter or slide and
hybridized with
a probe made from an mRNA population. Usually, probes are made by
reverse-transcribing
mRNA into single-stranded cDNA in the presence of labeled nucleotides.
The labeled probe, therefore, is a population of cDNA molecules
representing
the original mRNA population. Probes are hybridized with filters
containing
cDNAs spotted in a 2-dimensional array. The amount of hybridization to
a given clone represents the amount of mRNA present for the
corresponding
gene.
Gene array technology measures mRNA levels for thousands of genes
- in any cell or tissue type
- at any point in development
- in response to any stimulus
Gene
arrays
consist of hundreds or thousands of cDNAs spotted onto microscope
slides
(microarray) or nylon filters (macroarray). cDNAs are chosen from EST
collections,
so the sequences, and usually the identities of genes in the array are
known.
In a gene array experiment, an mRNA population is isolated from
cells.
The population is labeled by synthesizing complementary cDNAs using
reverse
transcriptase and labeled nucleotides. The resulting cDNA population is
then
hybridized to the array.
gene x - strongly expressed; high abundance transcript
gene y - moderately expressed; medium abundance transcript
gene z - weakly expressed; low abundance transcript
Each transcript base pairs with the complementary DNA for its
corresponding
gene on the array.
Signal strength is proportional to the abundance of each mRNA.
WARNING!
Each one of these steps contributes to experimental variation.
a. Gene arrays
Each gene on an array is represented as either
- cDNAs (usually from an EST project; now obsolete)
To minimize the risk of
errors, it is essential to keep bacterial
cultures
in an array format, corresponding to the intended positions of the
genes
on the array. Cultures can be stored as glycerol cell stocks at
-80°C.
Since all clones usually come from a single library, or several
libraries
using the same cloning vector, it is usually possible to amplify any
insert
by PCR using primers specific for the multiple cloning site in the
vector.
Typically, a small number of bacterial cells are transferred to microtiter plates containing PCR reaction components, and inserts are amplified by direct PCR on bacterial cells. The result is a PCR product consisting of only the insert and a small region of the vector. These DNAs can be directly spotted onto either nylon filters or glass microarray slides.
- oligonucleotides - size range usually 25 - 70 nt. Smaller oligonucleotides tend to bind probe less efficiently, but have greater specificity. Beyond about 70 nt, there is little increase in signal. Oligonucleotides are synthesized de novo, so there is no chance of contamination from other sequences, as there is with cDNAs.
b. cDNA probes
Gene array experiments typically attempt to compare gene expression levels in different tissues or conditions, or at different times after a treatment. RNA is extracted from each tissue, condition, or treatment, and RNA samples are diluted so that each sample contains the same concentration of RNA. To create a single-stranded probe, RNA is added to a reaction mix containing oligo dT primers, which can base pair with the polyA tail on mRNA, reverse transcriptase (RNA-dependent DNA polymerase), and labeled nucleotides. Commonly, nucleotides are tagged either with fluorescent labels such as Cy3 and Cy5, or with digoxigenin (DIG), which can be detected by chemiluminescence. In principle, for every mRNA molecule in the original RNA population, a single-stranded labeled cDNA will be produced, complementary to the mRNA. The higher the concentration of a particular mRNA, the more of the corresponding cDNA will be present.
c. Hybridization and washing
Incorporation of label
into each probe is quantified, and probes are diluted so that all are
at
an equal concentration. Usually, a duplicate filter or microarray is
prepared
for each probe to be assayed. Probes are hybridized separately with
each
array. Filter arrays are incubated with probe and washed in much the
same
way as is done for Southern or Northern blotting. For glass
microarrays,
hybridization is done under a coverslip, and slides are washed by
dipping
into wash solutions. Commercially-produced arrays come in cassettes, in
which
hybridization, washing, and detection are done.
d. Data acquisition
Hybridized probe is detected by UV fluorescence in a slide reader using confocal laser microscopy. The raw intensity of each spot is measured by a CCD camera, and the data are acquired as a TIFF image.
2. Types of experiments
Single label experiment
The simplest type of gene
array experiment is the single label experiment. Duplicate arrays are
hybridized
with probes made using a single label. To allow comparison between
treatments,
controls must be included in the probes and on the arrays to act as
hybridization
standards.
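One way such hybridization standards might be used is sketched below: every spot intensity on an array is divided by the signal from a control spot, on the assumption that the same amount of control RNA was added to every probe. The control name `spike_in`, the gene names, and all intensity values are hypothetical, and this is only one of several possible scaling schemes.

```python
# Minimal sketch: rescale single-label arrays using a hybridization standard.
# Assumption: the same amount of a control RNA ("spike_in", a hypothetical
# name) was added to every probe, so its spot should give the same signal
# on every array if labeling and hybridization were equally efficient.

def normalize_to_control(intensities, control="spike_in"):
    """Divide every spot intensity by the control spot intensity."""
    ref = intensities[control]
    if ref <= 0:
        raise ValueError("control spot has no usable signal")
    return {gene: value / ref for gene, value in intensities.items()}

# Hypothetical raw intensities from two duplicate arrays.
minus_heat_shock = {"spike_in": 1000.0, "gene_x": 4000.0, "gene_y": 500.0}
plus_heat_shock = {"spike_in": 2000.0, "gene_x": 8000.0, "gene_y": 4000.0}

norm_minus = normalize_to_control(minus_heat_shock)
norm_plus = normalize_to_control(plus_heat_shock)

for gene in ("gene_x", "gene_y"):
    fold = norm_plus[gene] / norm_minus[gene]
    print(f"{gene}: fold change after normalization = {fold:.1f}")
```

In this made-up example the raw signal for gene_x doubles, but so does the control, so after scaling gene_x is unchanged while gene_y is induced.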

From Mark Schena, Dari Shalon, Renu Heller, Andrew Chai, Patrick O. Brown, and Ronald W. Davis (1996) Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. USA 93(20): 10614-10619.
Expression of human genes
was measured in RNA populations from cells
grown at 37°C (-Heat shock) or 43°C (+Heat shock). White boxes:
genes whose expression changes with heat shock. Red boxes: genes
activated
by heat shock. Green boxes: genes suppressed by heat shock.
Double label experiments
Another approach to comparing expression between two conditions is the double label experiment. For example, in work from Patrick Brown's lab at Stanford, cDNA probes were made from yeast cells grown in the presence of either galactose or glucose. To distinguish between signals from the two probes, a different fluorescently-tagged nucleotide, either Cy3 or Cy5, was added during reverse transcription of each. Cy3 has emission maxima at 565 and 615 nm, while Cy5 has an emission peak at 670 nm. Replicate experiments were done in which the dyes were switched. By scanning the arrays twice, once for Cy3 and once for Cy5, a composite image can be generated in which the ratio of the two dyes, and hence the ratio of transcripts in the two growth conditions, can be measured. In pseudocolor images, spots representing genes that are more strongly expressed in the presence of galactose are shown in green, and spots representing genes more strongly expressed in the presence of glucose are shown in red.
http://www.pnas.org/cgi/content/full/94/24/13057/F1
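The ratio measurement and the dye switch can be illustrated with a small sketch. It assumes that any dye-specific bias is a constant multiplicative factor, so that averaging the log ratios from an array and its dye-swapped replicate cancels it. The function name, the dye assignments, and the intensities are hypothetical; the source does not say how the replicate arrays were actually combined.

```python
import math

def log2_gal_over_glu(cy3, cy5, galactose_dye):
    """log2(galactose signal / glucose signal) for one spot on one array."""
    if galactose_dye == "Cy3":
        return math.log2(cy3 / cy5)
    if galactose_dye == "Cy5":
        return math.log2(cy5 / cy3)
    raise ValueError("galactose_dye must be 'Cy3' or 'Cy5'")

# Hypothetical background-corrected intensities for one gene.
# Array 1: galactose probe labeled with Cy3, glucose probe with Cy5.
array_1 = log2_gal_over_glu(cy3=400.0, cy5=200.0, galactose_dye="Cy3")
# Array 2 (dye swap): galactose probe labeled with Cy5, glucose probe with Cy3.
array_2 = log2_gal_over_glu(cy3=100.0, cy5=800.0, galactose_dye="Cy5")

# Averaging the two log ratios cancels a constant dye-specific bias.
mean_log_ratio = (array_1 + array_2) / 2
print(f"array 1: {array_1:.2f}, array 2: {array_2:.2f}, "
      f"dye-swap average: {mean_log_ratio:.2f} "
      f"(about {2 ** mean_log_ratio:.1f}-fold higher in galactose)")
```

Here the two arrays disagree (2-fold versus 8-fold) because the made-up Cy5 signal is twice as bright as it should be; the dye-swap average recovers the 4-fold difference built into the example.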
3. Types of data
Gene array studies tend to generate two different types of data. Studies in which two or more conditions are compared at a single time generate discrete state data. Often, however, it is critical to follow the expression of a gene over time after a treatment. In timecourse experiments, the expression of each gene in response to two or more treatments is measured over time. For example, in the timecourse at right, the solid blue and red dashed curves might represent the expression levels for a gene in response to two different drugs.

There is a whole family of problems in normalization of data and controlling for components of experimental variation. To put things into perspective, if the experiment were repeated 4 times, the timecourse above would represent 2 treatments x 6 times x 4 replicates = 48 probes hybridized to 48 duplicate arrays to generate the data. Although the data from the replicates are averaged, there is often a great deal of variation in the results, which can potentially negate any meaning. Therefore, extraordinary measures must be taken to minimize experimental variation at each step in the procedure, to minimize the overall variation.
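The count of hybridizations can be checked by simply enumerating the design. The treatment names and sampling times below are hypothetical placeholders; only the numbers of treatments, timepoints and replicates come from the example above.

```python
from itertools import product

treatments = ["drug_A", "drug_B"]      # 2 treatments (hypothetical names)
timepoints_h = [0, 1, 2, 4, 8, 16]     # 6 sampling times (hypothetical hours)
replicates = [1, 2, 3, 4]              # 4 replicates

# One probe, and one array, is needed for every combination.
hybridizations = list(product(treatments, timepoints_h, replicates))
print(len(hybridizations), "probes and arrays")
```

Listing the combinations this way also gives a ready-made sample sheet for tracking which probe goes onto which array.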
4. What are we trying to learn from gene arrays?
The primary goal of gene array experiments is to generate expression information for every gene in the array, under some set of conditions. Expression may be studied in
- different tissues
- different developmental stages
- different genotypes
- different treatments
- different times after a treatment.
The kind of
results that are sought in gene array
experiments can be illustrated as follows:

In the example, timecourse
data are generated for each gene in an array.
The raw data consist of a series of expression curves for timecourses,
or histograms where other types of treatments are being compared. The
goal
is usually to find which groups of genes have the most similar
expression
patterns. In the example, two genes in the array (hatched background)
show
a gradual induction over the period of the timecourse. Two other genes
(shaded background) show a biphasic response with two distinct periods
of strong expression.
Key questions:
- Which genes are expressed differentially between condition A and condition B?
- How can genes be grouped according to similarities in expression patterns?
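As a rough illustration of the second question, similarity between two expression patterns can be scored with a correlation coefficient. The sketch below uses Pearson correlation, which is only one of many possible similarity measures; the gene names and timecourse values are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical expression levels at six timepoints.
profiles = {
    "gene_1": [1, 2, 3, 4, 5, 6],      # gradual induction
    "gene_2": [2, 4, 6, 8, 10, 12],    # gradual induction, higher level
    "gene_3": [1, 6, 2, 1, 6, 2],      # biphasic response
    "gene_4": [2, 10, 3, 2, 11, 4],    # biphasic response
}

genes = sorted(profiles)
for i, g1 in enumerate(genes):
    for g2 in genes[i + 1:]:
        r = pearson(profiles[g1], profiles[g2])
        print(f"{g1} vs {g2}: r = {r:+.2f}")
```

Pairs with the same shape score close to +1 even when their absolute expression levels differ, while a gradual and a biphasic profile score near zero.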
B. Experimental design and normalization
It is critical to realize
that every experimental step in a procedure
contributes to the final experimental error. Therefore, one should
conceptualize the data as a set of observations each with a measurable
amount of variation. In the figure, error bars represent the standard
error
of each measurement. The goal can then be restated as that of setting
up
the experiment in such a way as to minimize the final standard error in
the observations. For some timepoints in which there is little true
difference,
a difference can only be detected when the standard error for both
treatments
is small. For other timepoints where the differences are large, higher
standard errors will still allow the detection of the difference
between
two treatments.
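The relationship between the size of a difference and the standard errors can be made concrete with a small sketch. It expresses the difference between two treatment means in units of the standard error of that difference, assuming the two measurements are independent. The cutoff of 2 is only a rough rule of thumb chosen for illustration, and all numbers are hypothetical.

```python
import math

def difference_in_se_units(mean_a, se_a, mean_b, se_b):
    """Absolute difference between two means, divided by the standard error
    of the difference (assuming independent measurements)."""
    return abs(mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Hypothetical timepoints: (mean A, SE A, mean B, SE B).
cases = {
    "small difference, large SE": (10.0, 2.0, 12.0, 2.0),
    "small difference, small SE": (10.0, 0.3, 12.0, 0.3),
    "large difference, large SE": (10.0, 4.0, 30.0, 4.0),
}

THRESHOLD = 2.0  # illustrative rule of thumb, not a formal test

for label, values in cases.items():
    score = difference_in_se_units(*values)
    verdict = "detectable" if score > THRESHOLD else "not detectable"
    print(f"{label}: {score:.2f} SE units, {verdict}")
```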
1. Sources of experimental variation
Making a list of factors that contribute to experimental error is essentially the same as making a list of steps in the gene array experiment. However, several points are worth highlighting.
- Treatments
  - Experimental conditions
  - Tissue preparation
- Probes
  - RNA isolation - use identical amounts of tissue, identical extraction methods; use minimum number of steps; measure amount of RNA and normalize concentration
  - labeling - measure incorporation of label and normalize samples to same concentration
  - amount - add same amount of label to each hybridization
- Arrays
  - PCR products - amplify directly from bacterial cells, rather than isolated plasmids; add same amount of product to each spot on filter
  - Uniformity of spotting - use arraying tool for filter arrays or robot for microarrays.
  - treatment of filters or slides
- Hybridization and washing
  - Long hybridization to ensure that hybridization goes to completion.

In any hybridization experiment, the time required for hybridization to go to completion is inversely proportional to the concentration of the probe. As illustrated, high abundance transcripts will hybridize to completion in very short times, so the signal should be roughly the same regardless of how long hybridization is done. For moderately abundant transcripts, it takes longer for hybridization to proceed to completion, so the amount of that mRNA will be underestimated unless a long hybridization time is used. Finally, for rare transcripts, the hybridization curve will still be in the linear phase after a long time. For example, at the time indicated by the dotted line on the X-axis, the moderately abundant transcript would be estimated with only a small error, while the abundance of the rare transcript would probably be greatly underestimated. (See http://www.umanitoba.ca/afs/plant_science/courses/PLNT3140/l14/cot.html for more details on reassociation kinetics).
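The effect of abundance on hybridization time can be sketched with a simple model. The code assumes pseudo-first-order kinetics, in which the fraction of the final signal reached after time t is 1 - exp(-k c t) for a probe species at concentration c. In this model the time needed to reach a given fraction of completion scales as 1/c. The rate constant and the relative concentrations are arbitrary illustrative values, not measurements.

```python
import math

# Assumption: pseudo-first-order kinetics with an arbitrary rate constant.
K = 0.02  # hypothetical, per (relative concentration unit x hour)

def fraction_hybridized(relative_conc, hours, k=K):
    """Fraction of the final signal reached after `hours` of hybridization."""
    return 1.0 - math.exp(-k * relative_conc * hours)

def hours_to_reach(fraction, relative_conc, k=K):
    """Hybridization time needed to reach a given fraction of completion."""
    return -math.log(1.0 - fraction) / (k * relative_conc)

# Hypothetical relative concentrations of three transcripts in the probe.
transcripts = {"high abundance": 100.0, "moderate abundance": 10.0, "rare": 1.0}

for name, conc in transcripts.items():
    short = fraction_hybridized(conc, hours=2)
    long_ = fraction_hybridized(conc, hours=16)
    print(f"{name:>18}: {short:.2f} complete at 2 h, {long_:.2f} at 16 h, "
          f"95% complete after {hours_to_reach(0.95, conc):.1f} h")
```

With these made-up values the high abundance transcript is essentially complete within 2 hours, the moderately abundant one is close to complete by 16 hours, and the rare one is still far from complete, so its abundance would be underestimated.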
  - Washing - For genes that are members of multigene families, hybridization results could vary depending on hybridization and washing stringency. Hybridization under low stringency conditions might allow cross-hybridization between members of a gene family, and all members would be expected to give roughly the same signal. Hybridization under high stringency conditions would allow for more discrimination between genes, because each transcript would hybridize only with its own gene on the array.
- Data acquisition
  - Image acquisition - The acquisition of the image data carries built-in sources of variation similar to those of hybridization. Within a certain intensity range, the amount of signal detected is linearly proportional to the time of exposure. For microarrays, data acquisition is done by scanning the slide in a confocal laser scanner. Data are saved as a TIFF image, where the intensity of a given pixel is proportional to the amount of signal coming from the corresponding part of the filter or slide. For highly abundant transcripts, beyond a certain amount of signal there may be little increase in intensity per unit time, and the spot will be saturated in the image. Moderately expressed genes may yield signal within the linear part of the camera's detection range. For rare transcripts, it may not be possible to expose long enough for signal from the transcript to stand out above background. It is important to recognize that these errors of detection are compounded on top of the errors associated with hybridization time!
  - Spot and background detection - Software has to delineate each spot in the array, and to choose areas outside of spots for background estimation. Spot diameter can vary, and spot morphology may be irregular, rather than perfectly circular.
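The two data acquisition issues above, saturation and background estimation, can be sketched as follows. The code assumes 16-bit pixel values (maximum 65535), uses the median as a summary that is less sensitive to irregular spot shapes, and takes as given that some other software has already assigned pixels to spot and background. The function name and all pixel values are hypothetical.

```python
from statistics import median

SATURATION = 65535  # assumption: 16-bit TIFF pixel values

def spot_signal(spot_pixels, background_pixels, saturation=SATURATION):
    """Return (background-subtracted signal, saturated flag) for one spot."""
    foreground = median(spot_pixels)
    background = median(background_pixels)
    # A spot with any pixel at the detector maximum is outside the linear range.
    saturated = max(spot_pixels) >= saturation
    # Signal below background is reported as zero rather than negative.
    return max(foreground - background, 0), saturated

# Hypothetical pixel values.
local_background = [200, 210, 190, 205]
moderate_spot = [1200, 1300, 1250, 1280]
abundant_spot = [65535, 65535, 64000, 65535]
rare_spot = [210, 190, 205, 200]

for name, pixels in [("moderate", moderate_spot),
                     ("abundant", abundant_spot),
                     ("rare", rare_spot)]:
    signal, saturated = spot_signal(pixels, local_background)
    print(f"{name}: signal = {signal}, saturated = {saturated}")
```

In this example only the moderate spot gives a usable measurement: the abundant spot is flagged as saturated and the rare spot cannot be distinguished from background.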
BIOLOGICAL REPLICATES ARE THE SINGLE MOST EFFECTIVE WAY TO GET GOOD GENE EXPRESSION RESULTS!
In
the next section we will see that there is an almost endless list of
ways to massage the data. The most heroic analytical methods are no
substitute for the simple step of doing several biological replicates.
- In each biological replicate, the entire experiment is repeated: the different treatments of a batch of cells, plants or animals, the sampling of different tissues from different conditions, and the extraction of RNA.
- The RNA samples from different biological replicates are NOT mixed for a single hybridization. Rather, a separate labeling and hybridization is carried out for EACH REPLICATE.
- Technical replicates, in which the same RNA sample is labeled and hybridized more than once, only control for differences in handling. Biological replicates include all sources of biological and experimental variation. Therefore, they are more realistic.
- As the number of biological replicates increases, the uncertainty in the averaged result (its standard error) decreases.
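The benefit of additional replicates can be illustrated with the standard error of the mean, which is the standard deviation among replicates divided by the square root of the number of replicates. The expression values and the standard deviation used below are hypothetical.

```python
import math
from statistics import stdev

def standard_error(values):
    """Standard error of the mean of a list of replicate measurements."""
    return stdev(values) / math.sqrt(len(values))

# Hypothetical log2 expression values for one gene in four biological replicates.
replicates = [8.0, 10.0, 12.0, 10.0]
print(f"standard error from 4 replicates: {standard_error(replicates):.2f}")

# For a fixed replicate-to-replicate standard deviation, the standard error
# shrinks with the square root of the number of replicates.
sd = 1.0  # hypothetical standard deviation, log2 units
for n in (2, 4, 8, 16):
    print(f"n = {n:2d}: standard error = {sd / math.sqrt(n):.2f}")
```

Because of the square root, the first few replicates give the largest improvement: going from 2 to 4 replicates helps more than going from 8 to 10.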
Gene
chips are getting cheaper all the time, often less than $100 per chip.
The excuse that you can't do biological replicates because it is too
expensive no longer obtains.