Question

Efficiently run blat for long sequences

5.9 years ago

Jautis ▴ 580

Hello, I'm using blat to align genes from one genome to another. This works well for short sequences (<10 kb), but longer sequences have been running for more than a day with no sign of finishing. This seems to be especially true for sequences of 35 kb or more, and some of them are close to 200 kb.

Does anybody have suggestions for increasing the efficiency? I've thought about running blat on 10 kb intervals of the genes, but that would pose problems if some intervals fail to map or fail to map uniquely. I've pasted below the code that I'm currently using to run blat given the target genome and sequence, requiring that at least 90% of the sequence matches, with at least 97% identity. Thanks!

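 # Length of the last sequence in sequence.fa
 f=$(awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' sequence.fa | tail -1)
 # Minimum score: 90% of that length (integer arithmetic)
 a=$(( 9*f/10 ))
 blat target.2bit sequence.fa psl/sequence.psl -tileSize=15 -minScore=$a -minIdentity=97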
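
The snippet above takes the length of the last record only, so it assumes one sequence per FASTA file. As a rough sketch of the same 90% threshold computed for every record in a multi-sequence file, something like the following could work. The function name `fasta_min_scores` is made up for illustration and is not part of blat.

```shell
# Hypothetical helper: print "name<TAB>length<TAB>min_score" for every record
# in a FASTA file. min_score is 90% of the record length, rounded down,
# mirroring a=$(( 9*f/10 )) from the snippet above.
fasta_min_scores() {
    awk '
        /^>/ {
            # Emit the previous record before starting a new one
            if (name != "") printf "%s\t%d\t%d\n", name, len, int(9 * len / 10)
            name = substr($1, 2)   # sequence ID: header text up to the first whitespace
            len = 0
            next
        }
        { len += length($0) }      # accumulate residues across wrapped lines
        END {
            if (name != "") printf "%s\t%d\t%d\n", name, len, int(9 * len / 10)
        }
    ' "$1"
}
```

Calling `fasta_min_scores sequence.fa` prints one line per sequence, and the third column is the value that would be passed to `-minScore` for that sequence.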

blat alignment sequence assembly • 1.4k views

updated 5.8 years ago by Vitis ★ 2.6k • written 5.9 years ago by Jautis ▴ 580

score 0 · Answer 1 · 2019-09-17

Are you aligning spliced transcripts to genome assemblies, which requires opening large gaps (for introns)? If not, I'd suggest trying the minimap2 aligner: https://github.com/lh3/minimap2. It has an option for dealing with substantially diverged sequences. If you are mapping spliced transcripts (which are typically quite small) to genome assemblies, I don't think BLAT would have a problem.
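
As a rough sketch of how the 90% coverage / 97% identity requirement from the question might carry over to minimap2, the helper below filters PAF output with awk. The function name `filter_paf`, the file names, and the choice of the `asm20` preset are assumptions for illustration. Identity is approximated here as matching bases divided by alignment block length (PAF columns 10 and 11), which is not the same as BLAT's score. Note that minimap2 reads a FASTA (or its own index) as the target rather than a .2bit file.

```shell
# Hypothetical helper: read PAF on stdin and keep records that cover at least
# min_cov of the query and have at least min_id approximate identity.
# PAF columns used: 2 = query length, 3 = query start, 4 = query end,
# 10 = number of matching bases, 11 = alignment block length.
filter_paf() {
    awk -F '\t' -v min_cov="${1:-0.90}" -v min_id="${2:-0.97}" '
        $11 > 0 && ($4 - $3) / $2 >= min_cov && $10 / $11 >= min_id
    '
}

# Example invocation (not run here; file names are placeholders):
#   minimap2 -x asm20 target.fa sequence.fa | filter_paf 0.90 0.97 > sequence.filtered.paf
```

Both thresholds are optional arguments, so `filter_paf` on its own applies the 0.90 and 0.97 defaults.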