NimbleGen 12-Plex arrays
The original source data for probe design for this array was part of FlyBase D. melanogaster annotation release 5.25. Using proprietary probe selection algorithms, NimbleGen determined unique, high-quality 60-mer probes for transcripts and exons. The sequence files used for probe design may be downloaded below.
dmel-all-transcript-r5.25.fasta.gz
Re-Annotation of DGRC-1 Amplicon Microarrays from Drosophila melanogaster Annotation version 3.1 to version 4.1
DNA microarrays used to interrogate gene expression are fabricated with arrangements of DNA elements corresponding to transcribed sequences of genes. The DNA elements can be generated in various ways, such as PCR amplified cDNA clone inserts (cDNA), DNA fragments PCR amplified from a genomic DNA template using specific primer pairs (amplicons), spotted synthetic oligonucleotides, or oligonucleotides synthesized directly on the microarray slide. By definition cDNA clones must correspond to transcribed sequences. On the other hand amplicon primers and oligonucleotides are selected on the basis of the annotated transcribed sequences in a genome. Genome annotation is a rapidly developing field and the genome annotations are regularly updated to accommodate new experimental data and refined gene prediction algorithms. In this process some gene models may be retired, merged, split, or have the intron/exon boundaries modified. The changes to gene models from Drosophila melanogaster annotation version 2.0 (released Oct. 2000) to annotation version 4.1 (released Feb. 2005) are summarized in Table 1. The time cycle of developing transcriptome microarrays is generally longer than the cycle time of annotation updates. As a consequence, each individual array element must be periodically mapped to the most recent version of the annotation. Here we describe the logic of a general algorithm for mapping amplicons to annotated transcribed sequences and its application to re-annotating the Drosophila Genomics Resource Center (DGRC) amplicon transcriptome microarrays with respect to the version 4.1 annotation of the Drosophila melanogaster genome.
Our objective is twofold: first, to map each amplicon on the DGRC transcriptome microarray platform to the most recently annotated transcript(s) of a gene; second, to flag all amplicons that would be expected to give anomalous results when used as a probe on a microarray. Anomalies associated with oligonucleotide primer selection include: failed PCR amplification, amplification of multiple and disparate regions of the genome, elements amplified from non-transcribed DNA, amplicons corresponding to multi-copy transcribed sequences such as transposable elements, and amplicons capable of hybridizing to transcripts from more than one annotated gene. The re-annotation algorithm must therefore evaluate each oligonucleotide primer pair on the DGRC transcriptome microarray platform in terms of serving as a valid PCR primer pair, and then evaluate the amplicons in terms of serving as a valid transcriptome microarray element. Since gene identifiers and sequence coordinates are not necessarily stable between annotation versions, we made no attempt to track the changes in gene models between old and new annotations. Instead, we have used the sequences of the oligonucleotide primers to perform a de novo mapping of the oligonucleotide primer pairs and corresponding amplicons to sequence scaffolds and annotated features. The re-annotation algorithm is described below.
Step 1: Evaluate oligonucleotide primers
We define valid primer pairs as those predicted to amplify a single PCR product when used to prime from a genomic DNA template. In order to serve as an effective PCR primer, an oligonucleotide must base pair with the template at a unique site. In order for a pair of primers to prime amplification, they must be in the correct orientation on opposite strands and lie within an amplifiable range (defined below). To evaluate the oligonucleotides with respect to these criteria, we aligned the primer pair sequences to the genome using BLAST (Altschul, Gish et al. 1990) (see Table 2 for BLAST parameters), then parsed and filtered the results on the following criteria:
Primers must base pair with the template
- If the percent identity for the primer to its matching section of the genome was less than 80%, it was eliminated as a possible primer for PCR.
- If more than 2 nt at the 3' end of the primer did not match the genomic sequence, it was eliminated as an effective PCR primer. DNA polymerase extends a primer from its 3' end, so a "flap" of unpaired nucleotides at the 3' end interferes with amplification.
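The two base-pairing criteria above can be sketched as a small filter. This is a minimal illustration, not the code used for the re-annotation: the `PrimerHit` record and its field names are hypothetical, and it assumes the alignment coordinates are reported along the primer in its 5' to 3' orientation, so that unaligned bases after the last aligned position form the 3' flap. Mismatches inside the aligned region are only reflected through the percent identity.

```python
from dataclasses import dataclass

MIN_PERCENT_IDENTITY = 80.0   # threshold stated in the text
MAX_UNPAIRED_3PRIME = 2       # threshold stated in the text


@dataclass
class PrimerHit:
    """One alignment of a primer to the genome (hypothetical record layout)."""
    primer_length: int        # full primer length in nt
    query_start: int          # 1-based first aligned primer position
    query_end: int            # 1-based last aligned primer position
    percent_identity: float   # identity over the aligned region, 0-100


def unpaired_3prime(hit: PrimerHit) -> int:
    # The primer is read 5'->3', so bases past query_end are an unpaired 3' flap.
    return hit.primer_length - hit.query_end


def base_pairs_with_template(hit: PrimerHit) -> bool:
    # Keep the hit only if it passes both the identity and the 3'-end criteria.
    return (hit.percent_identity >= MIN_PERCENT_IDENTITY
            and unpaired_3prime(hit) <= MAX_UNPAIRED_3PRIME)
```

For example, a 20 nt primer aligned over positions 1 to 17 leaves a 3 nt flap and would be rejected, while one aligned over positions 1 to 18 would pass.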
Primers must be unique in the genome
- To serve as an effective PCR primer, an oligonucleotide must prime DNA polymerase from a single site in the template, which here means a single site in the entire genome. To evaluate whether an oligonucleotide could prime from more than one site, we examined the second best BLAST hit. An oligonucleotide was considered to be an effective primer if the E-value of the second best BLAST hit was at least 3 orders of magnitude larger than that of the best match. For example, if a 20mer has a single, perfect match with an E-value of 4.9e-5 and the second best BLAST hit, which corresponds to a different part of the genome, has an E-value of 4.9e-2 (1,000 times greater than 4.9e-5), then the oligonucleotide was accepted as a viable primer. If the E-value of the second best match was 4.9e-3 (only 100 times greater than 4.9e-5), the oligonucleotide was deemed capable of priming at more than one site in the genome and was rejected.
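The uniqueness test can be illustrated with a short function. It is a sketch under stated assumptions: the function name and the handling of edge cases (no hits, a single hit, an E-value of zero) are not described in the text and are choices made here for illustration. Following the worked example, a ratio of exactly 1,000 is treated as passing, and a small floating-point tolerance is applied so that 4.9e-2 / 4.9e-5 is not rejected by rounding error.

```python
import math

MIN_FOLD = 1000.0  # three orders of magnitude, as in the worked example


def is_unique_primer(evalues, min_fold=MIN_FOLD):
    """Decide uniqueness from the E-values of one primer's BLAST hits.

    Each E-value is assumed to belong to a hit at a distinct genomic site.
    """
    ranked = sorted(evalues)
    if not ranked:
        return False          # assumption: a primer with no hit cannot prime
    if len(ranked) == 1:
        return True           # only one site in the genome
    best, second = ranked[0], ranked[1]
    if best == 0.0:
        return second > 0.0   # assumption: two zero E-values are not distinguishable
    ratio = second / best
    return ratio > min_fold or math.isclose(ratio, min_fold)
```

With the values from the example, `is_unique_primer([4.9e-5, 4.9e-2])` accepts the primer and `is_unique_primer([4.9e-5, 4.9e-3])` rejects it.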
Primer pairs must be capable of producing an amplicon
- If the primer pairs were located on different chromosomes, they were eliminated on the basis of not being able to produce an amplicon.
- If the primer pairs were not in the correct orientation on opposite DNA strands, they were eliminated on the basis of not being able to produce an amplicon. One oligonucleotide needs to match the "+" strand, while the other needs to match the "-" strand. Lastly, the 3' ends of each oligonucleotide need to be facing each other, otherwise PCR will fail.
- An amplifiable range is limited to primer pairs located less than 1,000 bp apart, so if the coordinates of the primer pairs were more than 1,000 bp apart, they were eliminated as a primer set candidate that could effectively and efficiently produce an amplicon.
Oligonucleotide pairs that passed the previous criteria were determined to be valid primer pairs. The genomic sequence lying between the outside ends of the best sequence matches was parsed and defined as the amplicon sequence for evaluation in the following step.
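The pairing rules and the definition of the amplicon can be sketched as follows. The `Site` record, the function name, and the coordinate conventions are assumptions for illustration: coordinates are 1-based with `start` as the leftmost genomic position, the two matches are required not to overlap, and the 1,000 bp limit is applied to the distance between the outside ends of the two matches, since the text does not say exactly how the distance was measured.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_SPAN_BP = 1000  # amplifiable range stated in the text


@dataclass
class Site:
    """Best genomic match of one primer (hypothetical record layout)."""
    chromosome: str
    strand: str   # "+" or "-"
    start: int    # leftmost genomic coordinate of the match
    end: int      # rightmost genomic coordinate of the match


def amplicon_interval(a: Site, b: Site) -> Optional[Tuple[str, int, int]]:
    """Return (chromosome, start, end) of the predicted amplicon, or None."""
    if a.chromosome != b.chromosome:
        return None                      # primers on different chromosomes
    if {a.strand, b.strand} != {"+", "-"}:
        return None                      # primers must lie on opposite strands
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    # The 3' ends face each other only if the "+" primer lies to the left
    # of the "-" primer on the genomic coordinate axis.
    if plus.end >= minus.start:
        return None
    if minus.end - plus.start + 1 > MAX_SPAN_BP:
        return None                      # outside the amplifiable range
    # Amplicon = sequence between the outside ends of the two matches.
    return (plus.chromosome, plus.start, minus.end)
```

A pair such as `Site("2L", "+", 100, 119)` and `Site("2L", "-", 581, 600)` yields the interval `("2L", 100, 600)`; swapping the two strands makes the primers point away from each other and the pair is rejected.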
Step 2: Evaluate and map amplicons to annotated features
To serve as an effective transcriptome microarray element, a DNA sequence must, at a minimum, hybridize only to transcripts from a single gene. To evaluate this, we aligned the sequence of each amplicon to the genome sequence using BLAST (Altschul, Gish et al. 1990; see Table 3 for BLAST parameters) and evaluated the sequence matches with respect to the annotated transcribed sequences of genes. This was done by comparing the physical coordinates of the amplicon sequence match with the physical coordinates of the annotated genes and transcripts. The amplicons were then flagged as follows:
- If the amplicon had multiple BLAST hits to different positions on the genome, then it was flagged as an amplicon that hits multiple places in the genome.
- If the amplicon matched sequences in the genome that did not overlap any annotated gene features, it was flagged as an amplicon that does not hit a gene feature.
- If the amplicon matched a gene region but lay entirely within an intron of a transcript, it was flagged as an amplicon that hits an intronic region.
- Lastly, we flagged those amplicons that matched transcribed sequences that are not traditional genes. These include transposable elements, pseudogenes, and non-coding RNAs, and were flagged accordingly.
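The flagging rules above amount to comparing coordinate intervals. The sketch below is a simplified illustration with hypothetical names (`Feature`, `flag_amplicon`, and the flag strings are not from the original analysis). It assumes each annotated feature carries a single list of exon intervals, whereas the real annotation has one exon structure per transcript.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    """An annotated feature on the genome (hypothetical record layout)."""
    name: str
    chromosome: str
    start: int
    end: int
    kind: str = "gene"  # or "transposable_element", "pseudogene", "non_coding_RNA"
    exons: List[Tuple[int, int]] = field(default_factory=list)


def overlaps(a_start, a_end, b_start, b_end):
    # Two closed intervals overlap if each starts before the other ends.
    return a_start <= b_end and b_start <= a_end


def flag_amplicon(hits, features):
    """hits: list of (chromosome, start, end) genomic matches of one amplicon."""
    flags = set()
    if len(hits) > 1:
        flags.add("multiple_genomic_hits")
    for chrom, start, end in hits:
        matched = [f for f in features
                   if f.chromosome == chrom and overlaps(start, end, f.start, f.end)]
        if not matched:
            flags.add("no_gene_feature")
            continue
        for f in matched:
            if f.kind != "gene":
                flags.add(f.kind)        # transposon, pseudogene, or ncRNA
            elif f.exons and not any(overlaps(start, end, s, e) for s, e in f.exons):
                flags.add("intronic")    # inside the gene but touching no exon
    return sorted(flags)
```

An amplicon that returns an empty list has a single genomic match overlapping an exon of an ordinary gene, which is the case the microarray element is meant to represent.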
Mapping amplicons to Drosophila melanogaster version 4.1 genome annotation
The Drosophila ORF Primer Set was designed by Incyte Genomics against annotation version 1.0 of the Drosophila melanogaster genome, and contains 15,168 primer pairs. These primers have been used to fabricate the Incyte FlyGem microarrays (Johnston, Wang et al. 2004) and the Drosophila Genomics Resource Center (DGRC) amplicon transcriptome microarrays. In June 2003, Brian Oliver's group updated the annotation of the amplicons to release 3.0 of the Drosophila genome sequence and release 3.1 of the genome annotation; this annotation is available from Gene Expression Omnibus as Platform number GPL20. We applied the above algorithm to map the amplicons from the Drosophila ORF Primer Set with respect to Drosophila melanogaster genome release 4.0 and annotation version 4.1. The sources of the sequence files and the BLAST parameter settings are described in Tables 2-4, and the numbers of primer pairs passing the various filters are summarized in Table 5.
DNA microarrays used to interrogate gene expression can be constructed in many different forms. The DGRC-1 amplicon transcriptome microarrays are spotted with DNA fragments PCR amplified from a genomic DNA template using specific primer pairs (amplicons). The fragments of DNA that are printed on the microarray slides are static and will not change unless new primer pairs are designed and the entire PCR amplification process is carried out again. Even though the DNA amplicons remain unchanged, the annotation information associated with the genome is always evolving. The Drosophila melanogaster genome annotation has progressed from version 1.0 in March of 2000 to 4.1 in February of 2005. Since the DGRC will continue to print the same amplicons designed by Incyte Genomics based on the version 1.0 annotation of the Drosophila melanogaster genome, the annotation information associated with each amplicon must be updated with each subsequent annotation release. As of July 2005, the most recent version of the annotated Drosophila melanogaster genome is 4.1, and here we describe the process and reasoning behind updating this information in the online version of the DGRC microarray files.
All of the data used in this analysis were taken from FlyBase, so the updated annotation information should correspond directly to the information one can find in FlyBase.
Before the annotation version 4.1 revision, the DGRC microarrays were annotated under the version 3.1 platform. Of the 13,449 genes annotated in version 4.1 of the Dmel genome, the DGRC microarrays represent transcripts from 11,880 unique genes, which is roughly 88% of the version 4.1 annotated genes. Out of the 15,168 primer pairs/amplicons from the Incyte set, 14,427 (95%) have the same annotation information in version 3.1 and version 4.1. There are 58 amplicons that did not have a gene annotation in version 3.1, but do have a gene annotation in version 4.1. There are 513 amplicons that did have a gene annotation in version 3.1, but do not have a gene annotation in version 4.1. Lastly, there are 170 amplicons that were annotated with a gene in version 3.1, but whose gene annotation changed to represent another gene in version 4.1.
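As a quick consistency check on these counts, the four categories of amplicons should add up to the full primer set, and the quoted percentages should follow from the raw numbers. The snippet below uses only figures stated in the paragraph above; the variable and function names are illustrative.

```python
TOTAL_AMPLICONS = 15168
GENES_IN_R41 = 13449
GENES_ON_ARRAY = 11880

# Change in annotation status from version 3.1 to version 4.1
changes = {
    "unchanged": 14427,
    "annotation added": 58,
    "annotation deleted": 513,
    "gene changed": 170,
}


def percent(part, whole):
    # Percentage rounded to one decimal place
    return round(100.0 * part / whole, 1)


# The four categories account for every primer pair in the set.
assert sum(changes.values()) == TOTAL_AMPLICONS

unchanged_pct = percent(changes["unchanged"], TOTAL_AMPLICONS)
coverage_pct = percent(GENES_ON_ARRAY, GENES_IN_R41)
print(unchanged_pct, coverage_pct)
```

The two printed values round to the 95% and 88% quoted in the text.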
Out of the 13,801 amplicons that match a transcript of a gene, 11,880 match a unique gene, 10,252 (~74%) match a single transcript of a particular gene and 3,537 (~26%) amplicons match more than one transcript of a gene. Within the set of 13,801 amplicons that match a transcript of a particular gene, there are several that hit other genomic features such as transposons (20 amplicons), pseudogenes (12 amplicons), and non-coding RNAs (11 amplicons). The amplicons that match any of the aforementioned features were also specially noted.
All of the information associated with the DGRC microarrays, including the version 3.1 and 4.1 annotation information, can be downloaded from the DGRC website at: http://dgrc.bio.indiana.edu/microarrays/
Table 1. Summary of the re-annotated genomic sequence of Drosophila melanogaster, updated on April 29, 2005. http://www.flybase.net/annot/dmel-release4-notes.html
|Feature||Release 2||Release 3.1||Release 3.1||Release 3.1 (total)||Release 3.2||Release 3.2||Release 3.2 (total)||Release 4.0||Release 4.1|
|Peptides unchanged from r2 to 3.1||-||8769||53||8822||-||-||-||-||-|
|Peptides unchanged from r3.1 to 3.2||-||-||-||-||16902||256||17158||-||-|
|Peptides unchanged from r3.2 to 4.0||-||-||-||-||-||-||-||18720||-|
|Peptides unchanged from r4.0 to 4.1||-||-||-||-||-||-||-||-||18483|
|Natural transposon insertions||0||1572||9||1581||1572||6189||7761||1571||1571|
|Misc. non-protein-coding RNA||0||38||8||46||45||13||58||45||64|
|New compared to previous||336||576||226||802||211||56||267||1||77|
|Deleted from previous||114||284||61||345||41||23||64||0||4|
|Mergers of previous||-||695||12||707||31||6||37||1||61|
|Splits of previous||-||675||1||676||26||0||26||0||4|
Table 2. BLAST parameters used for aligning the oligonucleotide primer pairs to the Drosophila melanogaster genome. The dmel-all-chromosome-r4.1.fasta file was taken from FlyBase's FTP website: ftp://flybase.net/genomes/Drosophila_melanogaster/current/fasta/
|BLAST Parameter||Parameter value|
|Query File||Fasta formatted file containing each individual primer|
|Word Size||10 - All primer pairs have a length between 18 and 23 na. A primer with a matching word size of less than 11 is not adequate for a successful PCR reaction.|
|Expectation value||.01 - A cutoff value of .01 for the e-value was set to remove erroneous BLAST hits.|
|Remaining BLAST parameters||default values|
Table 3. BLAST parameters used for aligning the amplicons to the Drosophila melanogaster genome. The dmel-all-chromosome-r4.1.fasta file was taken from FlyBase's FTP website: ftp://flybase.net/genomes/Drosophila_melanogaster/current/fasta/
|BLAST Parameter||Parameter value|
|Query File||Fasta formatted file containing all amplicons produced from primer pairs that passed the criteria outlined in section "Step 1: Evaluate Oligonucleotide Primers".|
|Word Size||30 - All amplicons have a length of over 100 na. An amplicon with a word size of less than 30 will not reliably hybridize to the spotted cDNA on the microarray.|
|Expectation value||.01 - A cutoff value of .01 for the e-value was set to remove amplicons with very low probability of hybridizing to the microarray.|
|Remaining BLAST parameters||default values|
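For readers who want to reproduce searches with the word size and expectation value from Tables 2 and 3, the sketch below builds a command line for the NCBI BLAST+ `blastn` program. This is an assumption: the text does not say which BLAST implementation or version was used, and the defaults of BLAST+ may differ from those of the tool used in the original analysis. The file and database names are hypothetical, and the database is assumed to have been built beforehand from the chromosome FASTA file.

```python
def blastn_command(query_fasta, database, word_size, evalue=0.01):
    """Build (but do not run) a BLAST+ blastn command as an argument list."""
    return [
        "blastn",
        "-task", "blastn",            # classic blastn rather than megablast
        "-query", query_fasta,
        "-db", database,
        "-word_size", str(word_size),
        "-evalue", str(evalue),
        "-outfmt", "6",               # tabular output, one hit per line
    ]


# Table 2 settings: primers, word size 10 (hypothetical file names)
primer_cmd = blastn_command("primers.fasta", "dmel-all-chromosome-r4.1", 10)

# Table 3 settings: amplicons, word size 30 (hypothetical file names)
amplicon_cmd = blastn_command("amplicons.fasta", "dmel-all-chromosome-r4.1", 30)

print(" ".join(primer_cmd))
print(" ".join(amplicon_cmd))
```

Returning the arguments as a list keeps the sketch free of side effects; the list can be passed to `subprocess.run` on a machine where BLAST+ is installed.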
Table 4. Files used in the re-annotation process. All files were taken from FlyBase at: ftp://flybase.net/genomes/Drosophila_melanogaster/current/fasta/
|File name||The file's use in the re-annotation process|
|dmel-all-chromosome-r4.1.fasta||File was used as the reference for all the coordinates in each chromosome.|
|dmel-all-gene-r4.1.fasta||File was used to identify the genes that an amplicon represents.|
|dmel-all-transcript-r4.1.fasta||File was used to identify the transcripts of a particular gene that an amplicon represents.|
|dmel-all-miscRNA-r4.1.fasta||File was used to identify any non-coding RNA genome regions that an amplicon may also represent.|
|dmel-all-transposon-r4.1.fasta||File was used to identify any transposable elements that an amplicon may also represent.|
|dmel-all-pseudogene-r4.1.fasta||File was used to identify any pseudogenes that an amplicon may also represent.|
Table 5. Summary of the re-annotated DGRC microarrays
|DGRC microarray features||Annotation version 3.1||Annotation version 4.1|
|Unique Protein-coding genes||12242||11880|
|Unique Protein-coding transcripts||15098||12352|
|Unchanged from 3.1 to 4.1||-||14427|
|Annotation added from 3.1 to 4.1||-||58|
|Annotation deleted from 3.1 to 4.1||-||513|
|Gene matching a single transcript||10822||10252|
|Gene matching multiple transcripts||3290||3537|
|Primers failed re-annotation||-||269|
|Amplicon failed re-annotation||-||546|
|Annotation Work Flow||Description of Re-Annotation in PDF format.