I have a set of about 200,000 50-mer probe sequences from an Illumina custom SNP array. I'd like to map the probe sequences to the human genome (e.g. hg19) in order to directly determine the SNP positions (I have reason to suspect that some of the positions listed in the manifest file could be incorrect). I have managed to do this using BLAT with the "-fastmap" option, which according to the documentation is intended for "fast DNA/DNA remapping - not allowing introns, requiring high %ID."
Here is the specific command I used:
blat -fastMap -minIdentity=100 -out=psl hg19.fa query.fa output.psl
where hg19.fa is a fasta file with the hg19 human genome build and query.fa is a fasta file with all of the probe sequences.
To restrict results to "perfect matches" (meaning that all 50 of the bases in the query sequence align identically to the reference sequence), I filtered BLAT output for qStart==0, qEnd==qSize (50 bases), and misMatches==0. For the most part this works (and is pretty fast). The majority of probes have a single perfect match, which can be used to define the position of the SNP targeted by that probe.
However, I get about 4,000 probes mapping perfectly to multiple locations, even after excluding results with 'random' contigs etc. Here are two examples:
EXAMPLE 1 (this one has over 500 matches across several chromosomes):
qname qsize qstart qend matches mismatches tname tstart tend rs13427123 50 0 50 50 0 chr1 50345575 50345625 rs13427123 50 0 50 50 0 chr1 58596872 58596922 rs13427123 50 0 50 50 0 chr1 58739424 58739474 rs13427123 50 0 50 50 0 chr1 58843594 58843644 rs13427123 50 0 50 50 0 chr1 60118167 60118217 ...
qname qsize qstart qend matches mismatches tname tstart tend rs1050190 50 0 50 50 0 chr2 60961709 60961759 rs1050190 50 0 50 50 0 chr3 141644808 141644858
Maybe this isn't surprising, but I'm not sure how to handle it. In the manifest file, the probe from example 2 is annotated as mapping to rs1050190/chr3:141644808, which corresponds to the second match. In these cases, I could just take the match that is consistent with the manifest. But it isn't clear to me how Illumina decided that DNA hybridizing to this probe in its assay is from chr3:141644808-141644858 and not chr2:60961709-60961759. I would have expected this to be handled in the design of the probes (i.e. avoid sequences with multiple alignments).
- Am I missing something in my approach to running BLAT or interpreting its output?
- Could an alternative aligner help to differentiate between "matches"? The decision to use BLAT was influenced by these answers: A: What Should I Use Blat For? C: Locating A Sequence In A Fasta File.
Technical note: I realize there are two probe design systems for Illumina SNP arrays (Infinium I and II). For simplicity, I am showing just Infinium II designs, where the 3' end of the probe is adjacent to the polymorphism (i.e. the 50-mer probe sequence does not contain the polymorphism itself). For the Infinium I design probes, which contain the polymorphic SNP as the last (3') nucleotide of the probe (and thus there are two probes, one for each allele), I remove this last base from each of the sequences and do the alignments based on the resulting non-polymorphic 49-mers. I have the issue with multiple perfect alignments for both the Infinium I (49-mers) and II (50-mers) probes, so I don't think this is a (primary) source of the issue.