Question

Pipeline To Map 60-Mers To Genes

2

12.6 years ago

David Quigley 11k

I'm working with an older Agilent microarray platform, and I need to update the annotation. The problem: given a list of 10,000 sequences of length 60, identify those which unambiguously map within the coding region or UTR of a well-annotated gene, and record the Entrez ID for that gene. By "unambiguous" I mean a single gene matches all 60 bases with 100% identity. Cases where gene FOO maps perfectly but gene BAR has 100% identity for 27 bases should be rejected as ambiguous. Since the query sequences were generated from ESTs, I need to accept results where there is 100% identity but the alignment spans exons of the same gene.

The brute force solution is to feed a local BLAT instance the hg19 build and the sequences, parse the output for start-stop loci, and match those against exon bounds for the whole genome pulled from UCSC. That's not a fun way to spend an afternoon.

Can you think of a method that requires less effort?

annotation blat microarray • 2.9k views

updated 12.6 years ago by Eric Fournier ★ 1.4k • written 12.6 years ago by David Quigley 11k

1

Is there a reason not to align to mRNA?

Reply • 12.6 years ago by Sean Davis 26k

0

Good point, since I only care about perfect matches to mRNA, ideally RefSeq.

Reply • 12.6 years ago by David Quigley 11k

Ram · Answer 1 · 2011-09-14

"An afternoon?" Ah ha ha ha! :-) Seriously, it's a bigger, uglier can of worms than you'd expect.

I found the Agilent 4x44k human oligoarray has updated annotation on the GEO platform (April 2011) but not on the Agilent website. You may want to check GEO to see if the annotations are updated sufficiently for your uses before you decide to embark on a potentially perilous journey...

My suggestion is to look at a pipeline designed for this purpose. Take a look at a comparison, for example http://www.biomedcentral.com/1753-6561/3/S4/S1 (there are other reviews, of course, but that one's pretty good).

I found sigReannot to do pretty well, providing enough extra info that you can spend your time ranking heuristics rather than mapping, then re-mapping, then mapping again (with successively more permissive search spaces.)

Mapping to mRNAs sounds great, and is great for those probes that align to annotated mRNAs, but there are a ton that don't align to mRNAs: some sit just downstream of an annotated gene, and some are "in the middle of nowhere." I've searched around for updated annotation sets for these types of arrays (Agilent, for example) and they're oddly non-existent. I figure the reason is that nobody wants to put potentially incorrect annotations out there.

score 3 · Answer 2 · 2011-09-14

12.6 years ago

Sean Davis 26k

A simple workflow might look like:

1. Align to RefSeq using blat
2. Use pslReps to choose the single best hit
3. Use a simple Perl, Python, or even awk script to choose only alignments that meet your criteria.
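
For the last step, here is a minimal Python sketch of what such a filter could look like. It assumes the input is standard 21-column PSL output (from blat, optionally after pslReps) against RefSeq mRNA, and that you have built a `transcript_to_gene` dict mapping RefSeq accessions to Entrez IDs yourself (for example from NCBI's gene2refseq file). The function names and that dict are hypothetical, made up for illustration.

```python
from collections import defaultdict


def parse_psl_line(line):
    """Pull the PSL columns needed for a perfect-match test."""
    f = line.rstrip("\n").split("\t")
    return {
        "matches": int(f[0]),
        "mismatches": int(f[1]),
        "rep_matches": int(f[2]),
        "q_num_insert": int(f[4]),
        "t_num_insert": int(f[6]),
        "q_name": f[9],
        "q_size": int(f[10]),
        "t_name": f[13],
    }


def is_perfect(hit):
    """True if every query base matches, with no mismatches or gaps."""
    return (
        hit["matches"] + hit["rep_matches"] == hit["q_size"]
        and hit["mismatches"] == 0
        and hit["q_num_insert"] == 0
        and hit["t_num_insert"] == 0
    )


def perfect_single_gene_probes(psl_lines, transcript_to_gene):
    """Return {probe: gene} for probes whose perfect hits all fall in one gene.

    transcript_to_gene is a hypothetical accession -> Entrez ID mapping
    that the caller supplies.
    """
    genes_by_probe = defaultdict(set)
    for line in psl_lines:
        fields = line.split("\t")
        # Skip the PSL header and any malformed lines.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        hit = parse_psl_line(line)
        if is_perfect(hit):
            gene = transcript_to_gene.get(hit["t_name"])
            if gene is not None:
                genes_by_probe[hit["q_name"]].add(gene)
    return {
        probe: next(iter(genes))
        for probe, genes in genes_by_probe.items()
        if len(genes) == 1
    }
```

Several transcripts of the same gene count as one hit here, since they collapse to a single Entrez ID. Note that this sketch only looks at perfect hits; partial matches to a second gene are not considered, and pslReps will usually have dropped them anyway, so a stricter ambiguity test would need the unfiltered blat output.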


score 1 · Answer 3 · 2011-09-14

I had to do something exceedingly similar with another organism. BLATing probes to the whole set of RefSeq sequences for my organism was the only sensible solution I came up with. Whatever else I tried to do ended up being a huge time sink with only marginal benefits.

Also, be careful about limiting your search space to coding regions. A lot of Agilent probes are designed to hybridize to the 3'UTR of genes, which has the advantage of removing a lot of dT-primer amplification bias.

As an aside, unless you have compelling reasons to do so, do not limit your annotations to alignments with 100% identity on 100% of the length. Agilent 60-mers can have significant hybridization with up to four mismatches, depending on the location of those mismatches within the probe. I personally use a cutoff of 58 matches for annotating probes, with an additional cutoff of 56 matches to determine specificity (i.e., if a probe has a 60nt match with one transcript, but a 56nt match with another transcript, I discard the probe as non-specific).
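
A rough Python sketch of that two-cutoff rule, in case it helps. It assumes the hits for each probe have already been reduced to `(gene_id, n_matches)` pairs, and it assumes that hits to several transcripts of the same gene should not count against specificity; that second point is my reading of the rule rather than something spelled out above. The function and constant names are made up for illustration.

```python
ANNOTATE_MIN = 58     # matches needed to annotate a probe with a gene
SPECIFICITY_MIN = 56  # matches at which a second gene makes the probe non-specific


def annotate_probe(hits):
    """Return the gene for a probe, or None if unannotated or non-specific.

    hits: iterable of (gene_id, n_matches) pairs for a single probe.
    """
    # Keep only the best match count seen for each gene.
    best_by_gene = {}
    for gene, n_matches in hits:
        best_by_gene[gene] = max(n_matches, best_by_gene.get(gene, 0))

    # Genes matching well enough to plausibly cross-hybridize.
    competitors = [g for g, n in best_by_gene.items() if n >= SPECIFICITY_MIN]
    if len(competitors) != 1:
        return None  # no usable hit, or more than one gene in range

    gene = competitors[0]
    return gene if best_by_gene[gene] >= ANNOTATE_MIN else None
```

With these cutoffs, a probe with a 60nt match to one gene and a 56nt match to another is dropped, while the same probe with only a 55nt second hit is kept.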