Question

Aligning SNP array probes to human genome with BLAT -- how to prioritize matches?

0

Entering edit mode

5.1 years ago

croberts377 • 0

I have a set of about 200,000 50-mer probe sequences from an Illumina custom SNP array. I'd like to map the probe sequences to the human genome (e.g. hg19) in order to directly determine the SNP positions (I have reason to suspect that some of the positions listed in the manifest file could be incorrect). I have managed to do this using BLAT with the "-fastmap" option, which according to the documentation is intended for "fast DNA/DNA remapping - not allowing introns, requiring high %ID."

Here is the specific command I used:

blat -fastMap -minIdentity=100 -out=psl hg19.fa query.fa output.psl

where hg19.fa is a fasta file with the hg19 human genome build and query.fa is a fasta file with all of the probe sequences.

To restrict results to "perfect matches" (meaning that all 50 of the bases in the query sequence align identically to the reference sequence), I filtered BLAT output for qStart==0, qEnd==qSize (50 bases), and misMatches==0. For the most part this works (and is pretty fast). The majority of probes have a single perfect match, which can be used to define the position of the SNP targeted by that probe.

However, I get about 4,000 probes mapping perfectly to multiple locations, even after excluding results with 'random' contigs etc. Here are two examples:

EXAMPLE 1 (this one has over 500 matches across several chromosomes):

 qname  qsize  qstart  qend  matches  mismatches tname    tstart      tend
rs13427123     50       0    50       50           0  chr1  50345575  50345625
rs13427123     50       0    50       50           0  chr1  58596872  58596922
rs13427123     50       0    50       50           0  chr1  58739424  58739474
rs13427123     50       0    50       50           0  chr1  58843594  58843644
rs13427123     50       0    50       50           0  chr1  60118167  60118217
...

EXAMPLE 2:

            qname  qsize  qstart  qend  matches  mismatches tname     tstart       tend
rs1050190     50       0    50       50           0  chr2   60961709   60961759
rs1050190     50       0    50       50           0  chr3  141644808  141644858

Maybe this isn't surprising, but I'm not sure how to handle it. In the manifest file, the probe from example 2 is annotated as mapping to rs1050190/chr3:141644808, which corresponds to the second match. In these cases, I could just take the match that is consistent with the manifest. But it isn't clear to me how Illumina decided that DNA hybridizing to this probe in its assay is from chr3:141644808-141644858 and not chr2:60961709-60961759. I would have expected this to be handled in the design of the probes (i.e. avoid sequences with multiple alignments).

Questions:

Am I missing something in my approach to running BLAT or interpreting its output?
Could an alternative aligner help to differentiate between "matches"? The decision to use BLAT was influenced by these answers: A: What Should I Use Blat For? C: Locating A Sequence In A Fasta File.

Technical note: I realize there are two probe design systems for Illumina SNP arrays (Infinium I and II). For simplicity, I am showing just Infinium II designs, where the 3' end of the probe is adjacent to the polymorphism (i.e. the 50-mer probe sequence does not contain the polymorphism itself). For the Infinium I design probes, which contain the polymorphic SNP as the last (3') nucleotide of the probe (and thus there are two probes, one for each allele), I remove this last base from each of the sequences and do the alignments based on the resulting non-polymorphic 49-mers. I have the issue with multiple perfect alignments for both the Infinium I (49-mers) and II (50-mers) probes, so I don't think this is a (primary) source of the issue.

blat SNP array probe alignment • 1.3k views

ADD COMMENT • link 5.1 years ago by croberts377 • 0