Conversion Of Blat Output To Sam/Bam
3
8
Entering edit mode
11.7 years ago

I have some short reads aligned using BLAT, the output is in tabular psl format (including the sequences) for each alignment. Is it possible to convert the blat output to SAM/BAM format. Myself, I would think it is not because of the lack of some data fields in psl which is required for SAM format (mainly the CIGAR string), but please proove me wrong! Normally I would advise myself to use a different tool (bwa, bowtie, lastz) and align getting a SAM file, but what if that wasn't an option (say because you really want to use BLAT or you don't have the input) is there a way to do the conversion. I can possibly code that in perl and share it if someone had an idea how to do it.

blat format conversion bam sam • 12k views
ADD COMMENT
10
Entering edit mode
11.7 years ago

(not tested): there is a script named psl2sam.pl in:

samtools/misc/psl2sam.pl
ADD COMMENT
0
Entering edit mode

Yeah, thanks a lot. Tested, that seems to work. Even without the sequences included. I wonder what more nice things are there hidden in samtools. btw, the script has some options (-a,-b, -q, -r) but they are seemingly ignored.

ADD REPLY
0
Entering edit mode

This script would not add the header to the sam file. Here is how to add header to your sam file.

ADD REPLY
0
Entering edit mode

If you want some additions on top of that (for example to view in IGV) I just did the following oneliner:

REF=reffasta.fa
QUERY=queryfasta.fa
RESULT=outputbase
blat $REF $QUERY $RESULT.pslx -out=pslx -noTrimA -extendThroughN && psl2sam.pl $RESULT.pslx | python add_seq_info_to_psl2sam.py $QUERY | samtools view -bST $REF - > $RESULT.bam && samtools sort $RESULT.bam $RESULT.sorted && samtools index $RESULT.sorted.bam

where this script is add_seq_info_to_psl2sam.py:

import sys
from string import maketrans
sam=sys.stdin
transtab = maketrans('ACTG','TGAC')

name2seq = {}
fa=open(sys.argv[1])
for line in fa:
    name=line.strip()[1:]
    seq = fa.next().strip()
    name2seq[name]=seq
fa.close()

for line in sam:
    vals = line.split("\t")
    assert vals[0] in name2seq
    if vals[1]=='0':
        vals[-3]=name2seq[vals[0]]
    else:
        assert vals[1]=='16'
        vals[-3]=name2seq[vals[0]][::-1].translate(transtab)
    vals[5] = 'S'.join(vals[5].split('H'))
    sys.stdout.write("\t".join(vals))
ADD REPLY
5
Entering edit mode
6.1 years ago

This psl2sam.pl solution no longer works very well. I verified this by uploading the input psl and the output sam (converted to bam) to the UCSC genome browser.

Since this page comes up as the first hit on Google, here's a better solution using bedtools, which you can apt-get on ubuntu

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/pslToBed

(or look around on that site for your operating system)

Get a list of chromosome sizes, for instance

wget https://genome.ucsc.edu/goldenpath/help/hg38.chrom.sizes

Now run

pslToBed input.psl input.bed
bedtools bedtobam -bed12 -i input.bed -g hg38.chrom.sizes > output.bam
samtools view output.bam > output.sam

Using this, my input.psl matches my output.bam file when uploaded as a track.

ADD COMMENT
0
Entering edit mode

This works well, but is there a way to include the sequences in the generated SAM or one has to do that manually?

ADD REPLY
1
Entering edit mode

Did check the samtools approach? I think it could have included the sequences.

ADD REPLY
0
Entering edit mode

It does not work unfortunately, the sequences are not included. Also the SAM output is invalid as there are often dashes in the CIGAR string like this: 2H-574M609H

ADD REPLY
0
Entering edit mode

I have tried a different approach: I have converted the PSL to BAM using the bamtools method above and cooked up a script to augment the BAM file with the sequences from the original fasta file. However the CIGAR string is inconsistent with the sequence length, for example:

Read  0     
REF   1953    255     15M6N30M        *       0       0      
ACGAGGGGGCGACCGGTGTTCAGCCCCCGTCGACTTGAGTCCGTTTGCAATGCGTCTATCGTTCACGGTCAAGATGTTCCACCTTGCTTCACTTGTGT
CCTTTCGCGAAGGCGCCAAGTTCGGTGCCTAGCAGTGCATAACGTCGCCTATCGCTCTGGATTGTCTGGCCACAAGTATAGCCTGCTGTGGTGTTTATAG
TTCGTAGTCCAATCTGTATCGTGCCGCTAAGTATAGCATGTCGCTGCCTGATGCAATAGTGCTTCGCACTCATTGGTAGGCGGCAACTAGTTGCCATAAGTT
GCATGGTGGTGAGGTGGATACGATAAGTTGGGGACGGTGACTGACACTGTCGTCTAGTTGGGAGCAACTTATTGCCCGGAGTATAGATAAGACGTCGTG
ATGCGCATGAGATCGCGGACTGTACTGTTCACTTTTGTAGTTTTCGACTGCTTGCGCCCAAAGGCTACTTATGTAGTATGGTATAGCATGTCTGCGATGCTTT
GTTTGCTCCTGAGTGTGTTGACGATGGATGAAGTATTGATTGTTCAACGGGGAAGACAGTGGTGCCGCAATGAAAAGTTGAGGTGCAAATAGAGCGGGC
CGGCATTCGTTTTCAAACGCCTTTTATCCGGTGCAGATTGCGTACGGATACACTAAATTTGTTTTTACTTTCGTACTAGCGTTACGGTGATAAACGATTTCTTT
TCATGACACGAAACATAATAGTCTTGCCGTCCGATTGCCGATATTGTCTGTAATGGAGCAGCGATGATCAAGAGCAATCAACTTCCGGCTATGCAGTGTGTT
TGGTAATCGACATCGCGCGCCTTAATCGCTTTTCATCGCGCGGAACAGAAGAAGACCTATTTTTGTAGTCGCATAAGCGCCGTGGAATGTGATTGCTTAGTT
GCATCTACAAGGTGACTAGGTGGAAAAATCACTCGTGTTTCGGATGTGGTGACGTCTCGCGGTTATAATTTGTCTGTCTTAGTGGATGTCATGTCCTGGTTT
AGGTGGAATCAAGTTAACTTTTTGTTGCCATGTCAGGTGCAAACCAACGACAGTGATGACATTGCGTCTAGACTATTGGAGCTAGTGGTATCTAAAATGGGAT
CTCTGTGTGTTCCGGTCTTAGGTTTTAAATATTGGTCGCCAGGTGCAATAAAAACCGGATCACGATGAACATTGCCCCGCGACTAGTTTAGTAGTCTTTAAGA
TGACGCTAGTGATGGAGAAAAACGCCCCCTCCCCACCCCCATCTTACTCTACTACCGCCCTCCATTCATATTTTTCTAACCCTCTCCATACTTACTCATCCATCC
GCCCCCTACACACCTTACGTCCCTTACAACCTCCCTTTTCGAAAAATAACTGA
*

I guess this means that something went wrong with the PSL -> BED -> SAM conversion. Note that the CIGAR string has no hard clip operations in it. Does anybody have any clue about this?

ADD REPLY
0
Entering edit mode

I am also interested in this, how did you populate your sam with the sequences?

ADD REPLY
1
Entering edit mode
5.7 years ago
Botond Sipos ★ 1.7k

I wrote a conversion tool in Python: Uncle PSL: BLAT to SAM conversion in Python

ADD COMMENT
0
Entering edit mode

Hi! We are trying to use your tool but we have problems installing it even. Could you provide help?

ADD REPLY

Login before adding your answer.

Traffic: 667 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6