NimbleGen 12-Plex arrays

The original source data for probe design for this array was part of FlyBase D. melanogaster annotation release 5.25. Using proprietary probe selection algorithms, NimbleGen determined unique, high-quality 60-mer probes for transcripts and exons. The sequence files used for probe design may be downloaded below.

dmel-all-transcript-r5.25.fasta.gz
dmel-all-ncRNA-r5.25.fasta.gz
dmel-all-pseudogene-r5.25.fasta.gz
dmel-all-miRNA-r5.25.fasta.gz
dmel-all-miscRNA-r5.25.fasta.gz
dmel-all-tRNA-r5.25.fasta.gz

Re-Annotation of DGRC-1 Amplicon Microarrays from Drosophila melanogaster Annotation version 3.1 to version 4.1

Introduction

DNA microarrays used to interrogate gene expression are fabricated with arrangements of DNA elements corresponding to transcribed sequences of genes. The DNA elements can be generated in various ways, such as PCR-amplified cDNA clone inserts (cDNA), DNA fragments PCR-amplified from a genomic DNA template using specific primer pairs (amplicons), spotted synthetic oligonucleotides, or oligonucleotides synthesized directly on the microarray slide. By definition, cDNA clones must correspond to transcribed sequences. Amplicon primers and oligonucleotides, on the other hand, are selected on the basis of the annotated transcribed sequences in a genome. Genome annotation is a rapidly developing field, and genome annotations are regularly updated to accommodate new experimental data and refined gene prediction algorithms. In this process some gene models may be retired, merged, or split, or have their intron/exon boundaries modified. The changes to gene models from Drosophila melanogaster annotation version 2.0 (released Oct. 2000) to annotation version 4.1 (released Feb. 2005) are summarized in Table 1. The development cycle of transcriptome microarrays is generally longer than the cycle of annotation updates. As a consequence, each individual array element must be periodically mapped to the most recent version of the annotation. Here we describe the logic of a general algorithm for mapping amplicons to annotated transcribed sequences and its application to re-annotating the Drosophila Genomics Resource Center (DGRC) amplicon transcriptome microarrays with respect to the version 4.1 annotation of the Drosophila melanogaster genome.

Our objective is twofold: first, to map each amplicon on the DGRC transcriptome microarray platform to the most recently annotated transcript(s) of a gene; second, to flag all amplicons that would be expected to give anomalous results when used as a probe on a microarray. Anomalies associated with oligonucleotide primer selection include failed PCR amplification, amplification of multiple and disparate regions of the genome, elements amplified from non-transcribed DNA, amplicons corresponding to multi-copy transcribed sequences such as transposable elements, and amplicons capable of hybridizing to transcripts from more than one annotated gene. The re-annotation algorithm must therefore evaluate each oligonucleotide primer pair on the DGRC transcriptome microarray platform as a valid PCR primer pair, and then evaluate the amplicons as valid transcriptome microarray elements. Since gene identifiers and sequence coordinates are not necessarily stable between annotation versions, we made no attempt to track the changes in gene models between old and new annotations. Instead, we used the sequences of the oligonucleotide primers to perform a de novo mapping of the primer pairs and corresponding amplicons to sequence scaffolds and annotated features. The re-annotation algorithm is described below.

Step 1: Evaluate oligonucleotide primers

We define valid primer pairs as those predicted to amplify a single PCR product when used to prime from a genomic DNA template. To serve as an effective PCR primer, an oligonucleotide must base pair with the template at a unique site. For a pair of primers to prime amplification, they must be in the correct orientation on opposite strands and lie within an amplifiable range (defined below). To evaluate the oligonucleotides with respect to these criteria, we aligned the primer sequences to the genome using BLAST (Altschul, Gish et al. 1990; see Table 2 for BLAST parameters), then parsed and filtered the results on the following criteria:

Primers must base pair with the template

Primers must be unique in the genome

Primer pairs must be capable of producing an amplicon

Oligonucleotide pairs that passed the previous criteria were determined to be valid primer pairs. The genomic sequence lying between the outside ends of the best sequence matches was parsed and defined as the amplicon sequence for evaluation in the following step.
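The three criteria above can be sketched in Python as a filter over parsed BLAST hits. This is only an illustration under stated assumptions, not the DGRC implementation: the `Hit` record, the function name, the reason strings, and especially the `MAX_AMPLICON_LEN` value are hypothetical, since the text does not give the amplifiable range.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed upper bound on amplicon length (bp). The source does not give
# its "amplifiable range", so this number is purely illustrative.
MAX_AMPLICON_LEN = 5000


@dataclass(frozen=True)
class Hit:
    """One genome match of a primer: 1-based inclusive coordinates, start <= end."""
    scaffold: str
    start: int
    end: int
    strand: str  # "+" or "-"


def evaluate_primer_pair(
    fwd_hits: List[Hit],
    rev_hits: List[Hit],
    max_len: int = MAX_AMPLICON_LEN,
) -> Tuple[bool, str, Optional[Tuple[str, int, int]]]:
    """Return (is_valid, reason, amplicon) for one primer pair."""
    # Criterion 1: both primers must base pair with the template.
    if not fwd_hits or not rev_hits:
        return False, "no genome match", None
    # Criterion 2: both primers must be unique in the genome.
    if len(fwd_hits) > 1 or len(rev_hits) > 1:
        return False, "primer not unique", None
    a, b = fwd_hits[0], rev_hits[0]
    # Criterion 3: the pair must be able to produce an amplicon.
    if a.scaffold != b.scaffold:
        return False, "different scaffolds", None
    if a.strand == b.strand:
        return False, "same strand", None
    plus, minus = (a, b) if a.strand == "+" else (b, a)
    if plus.end >= minus.start:
        return False, "primers do not face each other", None
    if minus.end - plus.start + 1 > max_len:
        return False, "amplicon too long", None
    # The amplicon spans the outside ends of the two matches.
    return True, "ok", (plus.scaffold, plus.start, minus.end)


fwd = [Hit("2L", 100, 119, "+")]
rev = [Hit("2L", 581, 600, "-")]
print(evaluate_primer_pair(fwd, rev))  # (True, 'ok', ('2L', 100, 600))
```

Returning a reason string alongside the verdict makes it easy to tally how many primer pairs fail each filter.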

Step 2: Evaluate and map amplicons to annotated features

To serve as an effective transcriptome microarray element, a DNA sequence must, at a minimum, hybridize only to transcripts from a single gene. To evaluate this, we aligned the sequence of each amplicon to the genome sequence using BLAST (Altschul, Gish et al. 1990; see Table 3 for BLAST parameters) and evaluated the sequence matches with respect to the annotated transcribed sequences of genes. This was done by comparing the physical coordinates of the amplicon sequence match with the physical coordinates of the annotated genes and transcripts. The amplicons were then flagged as follows:

  • If the amplicon had multiple BLAST hits to different positions in the genome, it was flagged as an amplicon that hits multiple places in the genome.
  • If the amplicon matched sequences in the genome that did not overlap any annotated gene features, it was flagged as an amplicon that does not hit a gene feature.
  • If the amplicon matched a gene region but lay entirely within an intron of a transcript, it was flagged as an amplicon that hits an intronic region.
  • Lastly, amplicons that matched transcribed sequences other than traditional genes (transposable elements, pseudogenes, and non-coding RNAs) were flagged accordingly.
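The flagging rules above amount to interval-overlap tests between amplicon matches and annotated features. The following Python sketch shows one way to express them; it is an illustration only. The function names and flag strings are hypothetical, all coordinates are assumed to lie on a single scaffold, hits are assumed to be sorted best first, and "intronic" is simplified to mean "overlaps a gene but none of the supplied exons".

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) on a single scaffold


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def flag_amplicon(
    hits: List[Interval],
    genes: List[Interval],
    exons: List[Interval],
    other: Dict[str, List[Interval]],
) -> List[str]:
    """Return the list of flags raised for one amplicon (empty if clean)."""
    if not hits:
        return ["no genome hit"]
    flags: List[str] = []
    # Rule 1: more than one match position in the genome.
    if len(hits) > 1:
        flags.append("multiple genome hits")
    best = hits[0]  # assumes hits are sorted best first
    # Rule 2: no overlap with any annotated gene.
    if not any(overlaps(best, g) for g in genes):
        flags.append("no gene feature")
    # Rule 3: inside a gene, but touching none of its exons.
    elif not any(overlaps(best, e) for e in exons):
        flags.append("intronic")
    # Rule 4: overlap with transposons, pseudogenes, or non-coding RNAs.
    for kind, intervals in other.items():
        if any(overlaps(best, iv) for iv in intervals):
            flags.append(kind)
    return flags


genes = [(100, 1000)]
exons = [(100, 200), (800, 1000)]
print(flag_amplicon([(300, 400)], genes, exons, {}))  # ['intronic']
```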

Mapping amplicons to Drosophila melanogaster version 4.1 genome annotation

The Drosophila ORF Primer Set was designed by Incyte Genomics against annotation version 1.0 of the Drosophila melanogaster genome and contains 15,168 primer pairs. These primers have been used to fabricate the Incyte FlyGem microarrays (Johnston, Wang et al. 2004) and the Drosophila Genomics Resource Center (DGRC) amplicon transcriptome microarrays. In June 2003, Brian Oliver's group updated the annotation of the amplicons to release 3.0 of the Drosophila genome sequence and release 3.1 of the genome annotation; this annotation is available from Gene Expression Omnibus as platform number GPL20. We applied the above algorithm to map the amplicons from the Drosophila ORF Primer Set with respect to Drosophila melanogaster genome release 4.0 and annotation version 4.1. The sources of the sequence files and the BLAST parameter settings are described in Tables 2-4, and the numbers of primer pairs passing the various filters are summarized in Table 5.

Summary

DNA microarrays used to interrogate gene expression can be constructed in many different forms. The DGRC-1 amplicon transcriptome microarrays are spotted with DNA fragments PCR-amplified from a genomic DNA template using specific primer pairs (amplicons). The fragments of DNA printed on the microarray slides are static and will not change unless new primer pairs are designed and the entire PCR amplification process is repeated. Even though the DNA amplicons remain unchanged, the annotation information associated with the genome is always evolving. The Drosophila melanogaster genome annotation has progressed from version 1.0 in March 2000 to version 4.1 in February 2005. Since the DGRC will continue to print the same amplicons designed by Incyte Genomics based on the version 1.0 annotation of the Drosophila melanogaster genome, the annotation information associated with each amplicon must be updated with each subsequent annotation release. As of July 2005, the most recent version of the annotated Drosophila melanogaster genome is 4.1, and here we describe the process and reasoning behind updating this information in the online version of the DGRC microarray files.

All of the data used in this analysis were taken from FlyBase, so the updated annotation information should correspond directly to the information found in FlyBase.

Before the annotation version 4.1 revision, the DGRC microarrays were annotated against version 3.1. Of the 13,449 genes that have been annotated in version 4.1 of the Dmel genome, the DGRC microarrays represent transcripts from 11,880 unique genes, roughly 88% of the version 4.1 annotated genes. Of the 15,168 primer pairs/amplicons in the Incyte set, 14,427 (95%) have the same annotation information in versions 3.1 and 4.1. There are 58 amplicons that did not have a gene annotation in version 3.1 but do have one in version 4.1. There are 513 amplicons that had a gene annotation in version 3.1 but do not have one in version 4.1. Lastly, there are 170 amplicons that were annotated with a gene in version 3.1 but whose annotation changed to a different gene in version 4.1.
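As a quick consistency check on the figures in this paragraph, the four amplicon categories should add up to the full primer set, and the quoted percentages should follow from the counts. A minimal Python sketch (variable names are illustrative; the numbers are taken from the text):

```python
total_pairs = 15168  # primer pairs/amplicons in the Incyte set
unchanged, added, deleted, changed = 14427, 58, 513, 170

# The four categories partition the full primer set.
assert unchanged + added + deleted + changed == total_pairs

genes_on_array, genes_in_r41 = 11880, 13449
gene_coverage = genes_on_array / genes_in_r41
unchanged_fraction = unchanged / total_pairs

print(f"gene coverage: {gene_coverage:.1%}")               # 88.3%
print(f"unchanged annotation: {unchanged_fraction:.1%}")   # 95.1%
```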

Of the 13,801 amplicons that match a transcript of a gene, 10,252 (~74%) match a single transcript of a particular gene and 3,537 (~26%) match more than one transcript of a gene; together these amplicons represent 11,880 unique genes. Within the set of 13,801 amplicons that match a transcript of a particular gene, several also hit other genomic features such as transposons (20 amplicons), pseudogenes (12 amplicons), and non-coding RNAs (11 amplicons). Amplicons that match any of these additional features were specially noted.

All of the information associated with the DGRC microarrays, including the version 3.1 and 4.1 annotation information, can be downloaded from the DGRC website at: http://dgrc.bio.indiana.edu/microarrays/



Table 1. Summary of the re-annotated genomic sequence of Drosophila melanogaster, updated on April 29, 2005. http://www.flybase.net/annot/dmel-release4-notes.html

| Feature | Release 2 Total | Release 3.1 Euchrom. | Release 3.1 Hetero. | Release 3.1 Total | Release 3.2 Euchrom. | Release 3.2 Hetero. | Release 3.2 Total | Release 4.0 Euchrom. | Release 4.1 Euchrom. |
|---|---|---|---|---|---|---|---|---|---|
| Protein-coding genes | 13474 | 13369 | 290 | 13659 | 13472 | 320 | 13792 | 13472 | 13449 |
| Protein-coding transcripts | 14335 | 18109 | 396 | 18505 | 18746 | 430 | 18906 | 18746 | 18941 |
| Unique peptides | 13922 | 15848 | 353 | 16201 | 16356 | 390 | 16746 | 16356 | 16471 |
| Peptides unchanged from r2 to 3.1 | - | 8769 | 53 | 8822 | - | - | - | - | - |
| Peptides unchanged from r3.1 to 3.2 | - | - | - | - | 16902 | 256 | 17158 | - | - |
| Peptides unchanged from r3.2 to 4.0 | - | - | - | - | - | - | - | 18720 | - |
| Peptides unchanged from r4.0 to 4.1 | - | - | - | - | - | - | - | - | 18483 |
| tRNAs | 0 | 288 | 0 | 288 | 288 | 0 | 288 | 288 | 295 |
| rRNAs | 0 | 6 | 6 | 0 | 96 | 6 | 102 | 96 | 96 |
| Pseudogenes | 0 | 17 | 1 | 18 | 39 | 1 | 40 | 40 | 39 |
| microRNAs | 0 | 23 | 0 | 23 | 23 | 0 | 23 | 23 | 66 |
| snRNAs/snoRNAs | 0 | 56 | 0 | 56 | 56 | 0 | 56 | 56 | 57 |
| Natural transposon insertions | 0 | 1572 | 9 | 1581 | 1572 | 6189 | 7761 | 1571 | 1571 |
| Misc. non-protein-coding RNA | 0 | 38 | 8 | 46 | 45 | 13 | 58 | 45 | 64 |
| New compared to previous | 336 | 576 | 226 | 802 | 211 | 56 | 267 | 1 | 77 |
| Deleted from previous | 114 | 284 | 61 | 345 | 41 | 23 | 64 | 0 | 4 |
| Mergers of previous | - | 695 | 12 | 707 | 31 | 6 | 37 | 1 | 61 |
| Splits of previous | - | 675 | 1 | 676 | 26 | 0 | 26 | 0 | 4 |


Table 2. BLAST parameters used for aligning the oligonucleotide primer pairs to the Drosophila melanogaster genome. The Dmel-all-chromosome-r4.1.fasta file was taken from FlyBase's FTP website: ftp://flybase.net/genomes/Drosophila_melanogaster/current/fasta/

| BLAST parameter | Parameter value |
|---|---|
| Database file | Dmel-all-chromosome-r4.1.fasta |
| Query file | FASTA-formatted file containing each individual primer |
| Word size | 10 - All primers have a length between 18 and 23 nt. A primer with a matching word size of less than 11 is not adequate for a successful PCR reaction. |
| Expectation value | 0.01 - A cutoff value of 0.01 for the e-value was set to remove erroneous BLAST hits. |
| Remaining BLAST parameters | default values |


Table 3. BLAST parameters used for aligning the amplicons to the Drosophila melanogaster genome. The Dmel-all-chromosome-r4.1.fasta file was taken from FlyBase's FTP website: ftp://flybase.net/genomes/Drosophila_melanogaster/current/fasta/

| BLAST parameter | Parameter value |
|---|---|
| Database file | Dmel-all-chromosome-r4.1.fasta |
| Query file | FASTA-formatted file containing all amplicons produced from primer pairs that passed the criteria outlined in section "Step 1: Evaluate oligonucleotide primers". |
| Word size | 30 - All amplicons have a length of over 100 nt. An amplicon with a word size of less than 30 will not reliably hybridize to the spotted cDNA on the microarray. |
| Expectation value | 0.01 - A cutoff value of 0.01 for the e-value was set to remove amplicons with very low probability of hybridizing to the microarray. |
| Remaining BLAST parameters | default values |
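
For readers who want to try the settings in Tables 2 and 3 with a current toolchain, the sketch below builds equivalent command lines for the NCBI BLAST+ `blastn` program. This is an assumption-laden illustration: the original analysis may well have used the older BLAST release available in 2005, the query file names `primers.fasta` and `amplicons.fasta` are hypothetical, and the database name assumes the chromosome FASTA file has already been indexed (for example with `makeblastdb`). The commands are only constructed and printed, not executed.

```python
import shlex
from typing import List


def blastn_command(query: str, db: str, word_size: int, evalue: float = 0.01) -> List[str]:
    """Build a BLAST+ blastn argument list; all other options stay at defaults."""
    return [
        "blastn",
        "-query", query,
        "-db", db,
        "-word_size", str(word_size),
        "-evalue", str(evalue),
        "-outfmt", "6",  # tabular output, convenient for parsing hits
    ]


# Hypothetical file and database names.
primer_cmd = blastn_command("primers.fasta", "dmel-all-chromosome-r4.1", word_size=10)
amplicon_cmd = blastn_command("amplicons.fasta", "dmel-all-chromosome-r4.1", word_size=30)

print(shlex.join(primer_cmd))
print(shlex.join(amplicon_cmd))
```

Building the argument list separately from running it keeps the two parameter sets side by side and makes them easy to inspect before passing them to `subprocess.run`.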


Table 4. Files used in the re-annotation process. All files were taken from FlyBase at: ftp://flybase.net/genomes/Drosophila_melanogaster/current/fasta/

| File name | The file's use in the re-annotation process |
|---|---|
| dmel-all-chromosome-r4.1.fasta | Used as the reference for all the coordinates in each chromosome. |
| dmel-all-gene-r4.1.fasta | Used to identify the genes that an amplicon represents. |
| dmel-all-transcript-r4.1.fasta | Used to identify the transcripts of a particular gene that an amplicon represents. |
| dmel-all-miscRNA-r4.1.fasta | Used to identify any non-coding RNA genome regions that an amplicon may also represent. |
| dmel-all-transposon-r4.1.fasta | Used to identify any transposable elements that an amplicon may also represent. |
| dmel-all-pseudogene-r4.1.fasta | Used to identify any pseudogenes that an amplicon may also represent. |


Table 5. Summary of the re-annotated DGRC microarrays

| | DGRC microarray features, Release 3.1 | DGRC microarray features, Release 4.1 |
|---|---|---|
| Annotated amplicons | 13833 | 13801 |
| Unique Protein-coding genes | 12242 | 11880 |
| Unique Protein-coding transcripts | 15098 | 12352 |
| Unchanged from 3.1 to 4.1 | - | 14427 |
| Annotation added from 3.1 to 4.1 | - | 58 |
| Annotation deleted from 3.1 to 4.1 | - | 513 |
| Gene matching a single transcript | 10822 | 10252 |
| Gene matching multiple transcripts | 3290 | 3537 |
| Primers failed re-annotation | - | 269 |
| Amplicon failed re-annotation | - | 546 |


Annotation Work Flow Description of Re-Annotation in PDF format.

AnnotationWorkFlow.pdf Version: September 2005

Citing the DGRC

When publishing experiments using materials obtained from the DGRC please cite the Drosophila Genomics Resource Center, supported by NIH grant 2P40OD010949, in the acknowledgments. Your cooperation helps us when we need to renew our grant.