Mapping all 32-mers within Hamming Distance = 3
2
1
Entering edit mode
23 months ago
giova34 ▴ 10

There's no shortage of posts on Biostars regarding ...

I read through all of them and learned a lot. But I'm struggling to find all alignment locations of ~ 1 000 000 x 32-mers, within a Hamming distance of 3, against a reference genome.

Wondering if folks can suggest either a parameter change, or a tool of choice! Here's what I've tried, and what I observed:

  • With BLAT I find that the k-mer alignments are underreported.
    • Using a Bloom filter, I find the counts of k-mers exceed the alingment locations using BLAT. Furthermore, the aligned locations of BLAT never exceeds ~ 200 or so...!
      • blat -t=dna -minMatch=1 -stepSize=1 -minScore=29 -tileSize=8 -repMatch=10000 -fine -out=pslx hg38.fa kmers.fa kmers.psl
  • With bowtie (version 1) the resultant alignments are all perfect matches! The CIGAR scores are all 32M suggesting perfect alignement... despite indicating mismatches, why?
    • bowtie --sam -v 3 -p 8 --all -f --norc hg38 kmers.fa kmers_aln.sam
  • With bbduk I keep running out of memory (never even runs), despite allocating 500 GB (!) Is allowing a Hamming distance = 3 too computationally expensive?
    • bbduk.sh -Xmx500g in=kmers.fa ref=hg38.fa outm=kmers_aln.fq k=32 mm=f hdist=3 threads=8
hamming alignment kmer ngs short • 1.3k views
ADD COMMENT
1
Entering edit mode

If you are mapping then why are you using bbduk.sh? This way it is running in filtermode by default. You should use bbmap.sh instead. Why are you exporting the result as a .fq file? You also need to use a smaller k to allow for initial matches.

Try

bbmap.sh -Xmx40g in=kmers.fa out=align.sam ref=hg38.fa k=14 threads=8 local=t minid=0.6 ambig=all

You will need to play with minid parameter to get 3 mismatches.

ADD REPLY
0
Entering edit mode

GenoMax Thank you so much for this response! I think I misspoke (and learned the difference btwn alignment vs. mapping Alignment and mapping). I would like to do alignment, and merely followed your suggestion to another user (kmer alignment with mismatch). The .fq was based on reading bbduk documentation.

Still a little confused as I took the past few days to try BLAST and bbmap. I expect nearly uniform genome coverage but the kmers I have only map to half of the genomic regions? Not sure if it is by default masking some regions.

ADD REPLY
0
Entering edit mode

Can you post an example of some of the kmers you are working with? Aligners will generally have an upper limit in terms of alignments they report. Have you tried to align with bbmap? You will need to add ambig=all option to report all alignments.

ADD REPLY
0
Entering edit mode
23 months ago

This may or may not be a direct answer to your question. It depends on your genome, and genmap will allow you to investigate this further at this kmer size and hamming distance. It is, however, not an aligner (only takes a genome in FASTA format as input).

I have been using genmap a lot for kmer mappability purposes and think it likely fits the bill for your question. https://github.com/cpockrandt/genmap

If you have a really big genome, eg wheat, you might need >500-800 GB RAM for the linux sorting step prior to making a bigwig from the bedgraph, but this is not genmap per se.

ADD COMMENT
0
Entering edit mode
4 weeks ago

It sounds like Bowtie is what you want. The "M" operator in a CIGAR string can represent either a match or a mismatch (see section 1.4 of the SAM specification). I don't think bowtie supports CIGAR strings using the =/X format. If this is necessary, you might consider requesting that feature.

ADD COMMENT
0
Entering edit mode

Agree. The "MD" tag in the SAM output may help identify the mismatches.

ADD REPLY

Login before adding your answer.

Traffic: 1564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6