I created this file by aligning the protein-coding sequences of two isolates of the same organism with standalone BLAST, using the following command line:
blastn -query fasta1.fasta -subject fasta2.fasta -dust no -parse_deflines -evalue 1e-10 -max_target_seqs 1 -out BTOP
It returned a text file like this:
Query= Sequence_1
Length=6624
Score E
Sequences producing significant alignments: (Bits) Value
Sequence_5 1528 0.0
>Sequence_5
Length=6645
Score = 1528 bits (827), Expect = 0.0
Identities = 943/1000 (94%), Gaps = 3/1000 (0%)
Strand=Plus/Plus
Query 5326 ACCATCCCTTTTGGTATTGCTTTCGCTTTAGGATCTATTGCTTTTTTATTTTTGAAGAAA 5385
|||||||| ||||| || ||| | || ||| || || | ||||||||||||||||||
Sbjct 5227 ACCATCCCCTTTGGAATAGCTATTGCGTTAACTTCGATAGTGTTTTTATTTTTGAAGAAA 5286
Query 5386 AAAACCAAATCTACTATTGATCTTTTGCGTGTTATTAATATCCCCAAAAGTGATTATGAT 5445
|||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||
Sbjct 5287 AAAACCAAATCTACTATTGATCTTTTGCGTGTCATTAATATCCCCAAAAGTGATTATGAT 5346
Query 5446 ATACCGACAAAACTTTCACCCAATAGATATATACCTTATACTAGTGGTAAATACAGAGGC 5505
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 5347 ATACCGACAAAACTTTCACCCAATAGATATATACCTTATACTAGTGGTAAATACAGAGGC 5406
Query 5506 AAACGGTACATTTACCTTGAAGGAGATAGTGGAACTGATAGTGGTTACACCGATCATTAT 5565
||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||
Sbjct 5407 AAACGGTACATTTACCTTGAAGGAGATAGTGGAACAGATAGTGGTTACACCGATCATTAT 5466
Unfortunately the formatting is partly lost when copying and pasting from the original file, but I think the meaning of "|" and "-" is clear. So how can I analyze this type of file? The NCBI page I took the command line from talks about traceback operations (BTOP). I have little experience with this type of file, and I want to understand what format it is and how to read it. The NCBI page (https://www.ncbi.nlm.nih.gov/books/NBK279682/) also mentions SAM files, so would simply reading it with a function for parsing SAM files be fine? Thanks in advance.
PS: if needed I can upload a partial file with masked IDs, since the data is sensitive.
Please clearly explain what you want to achieve or what your exact issue is.
What you see here is the default BLAST output (the pairwise alignment output). It shows the start and end coordinates of every alignable region of the sequences you provided as input. Between those coordinates the alignment itself is depicted (| stands for a match, for instance).
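To make the description of the pairwise output a bit more concrete, here is a minimal Python sketch that pulls only the per-alignment summary lines (Score, Expect, Identities, Gaps) out of such a file. The function name and the dictionary keys are my own choices, not anything BLAST defines, and the regular expressions are fitted to the layout in the output posted above, so they may need adjusting for other BLAST programs or versions.

```python
import re

# Patterns for the summary lines that precede each alignment block in
# pairwise (-outfmt 0) output, as seen in the example posted above.
SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*Expect\s*=\s*([^\s,]+)"
)
IDENT_RE = re.compile(
    r"Identities\s*=\s*(\d+)/(\d+)\s*\(\d+%\),\s*Gaps\s*=\s*(\d+)/(\d+)"
)


def pairwise_summaries(text):
    """Return one dict per alignment found in pairwise BLAST text.

    Hypothetical helper: it ignores the alignment rows themselves and
    only collects the score and identity statistics.
    """
    hits = []
    for line in text.splitlines():
        m = SCORE_RE.search(line)
        if m:
            # A "Score = ..." line starts a new alignment record.
            hits.append({
                "bits": float(m.group(1)),
                "raw_score": int(m.group(2)),
                "evalue": float(m.group(3)),
            })
            continue
        m = IDENT_RE.search(line)
        if m and hits:
            # The "Identities = ..." line belongs to the latest record.
            hits[-1].update(
                identities=int(m.group(1)),
                align_len=int(m.group(2)),
                gaps=int(m.group(3)),
            )
    return hits
```

Calling `pairwise_summaries(open("BTOP").read())` on the file from the question should give a list with one entry per alignment. It does not reconstruct the aligned sequences, which is the genuinely awkward part of this format.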
What you are referring to is the SAM-like output that BLAST can also produce if you specifically ask for it. Unfortunately you did not: the command on the NCBI help page is different from what you executed. To get that output you need to add
-outfmt "6 qseqid sseqid btop"
What you added, -out BTOP, will only redirect the output to a file called BTOP.
Yes, I know what -out does. So it is SAM output that is returned. I'll try MATLAB's samread and see what comes out!
No, it is not.
What it returned, as in the result you posted, is the default BLAST output (pairwise output, -outfmt 0), not SAM format. To get the SAM-like output you need to add the -outfmt parameter and options as I indicated above.
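For what the third column of -outfmt "6 qseqid sseqid btop" looks like once you have it, here is a small Python sketch that summarizes a BTOP string. It rests on my reading of the NCBI manual: a number is a run of that many matches, and a pair of characters is one column where the first character comes from the query and the second from the subject, with "-" marking a gap. Treat that pair order as an assumption to check against the NCBI page; the function name and the result keys are made up for illustration.

```python
import re

# A BTOP token is either a run of digits (that many matching columns)
# or a two-character pair: query symbol, then subject symbol.
# ASSUMPTION: pair order is query-then-subject, '-' marks a gap.
BTOP_TOKEN = re.compile(r"(\d+)|([A-Za-z*-]{2})")


def summarize_btop(btop):
    """Count matches, mismatches and gap columns in one BTOP string."""
    counts = {"matches": 0, "mismatches": 0, "query_gaps": 0, "subject_gaps": 0}
    pos = 0
    for m in BTOP_TOKEN.finditer(btop):
        if m.start() != pos:
            # Something between two tokens could not be interpreted.
            raise ValueError(f"unexpected text at offset {pos}: {btop!r}")
        pos = m.end()
        if m.group(1) is not None:
            counts["matches"] += int(m.group(1))
        else:
            query_char, subject_char = m.group(2)
            if query_char == "-":
                counts["query_gaps"] += 1      # gap in the query row
            elif subject_char == "-":
                counts["subject_gaps"] += 1    # gap in the subject row
            else:
                counts["mismatches"] += 1
    if pos != len(btop):
        raise ValueError(f"trailing text at offset {pos}: {btop!r}")
    return counts
```

Under those assumptions, `summarize_btop("7AG39")` would read as 7 matches, one A/G mismatch, then 39 more matches. Note that this is a different thing from a SAM record, so a SAM reader would not be expected to understand it.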
From what I read at https://open.oregonstate.education/computationalbiology/chapter/command-line-blast/, the pairwise output is not parseable. Is that correct?
Well, "not parseable" is a bit strong, I would say, but it is clearly difficult, and starting down that road is probably not advisable.
Especially since there are much better suited output formats that do allow easy parsing. The tabular format is the most obvious one (-outfmt 6 or 7); the XML output is also reasonably parseable. And I guess the SAM-like output is parseable too (though I have never used it myself).
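As an illustration of why the tabular format is the easy route, here is a minimal Python sketch that reads -outfmt 6 or 7 output into a list of dictionaries, one per hit, which is roughly the kind of structure you would get back from an alignment function. The column list is the usual default set of twelve fields (newer BLAST+ releases label the first two qaccver and saccver); the function name is hypothetical, and if you request custom columns you have to pass the same names yourself.

```python
# Usual default columns of BLAST tabular output (-outfmt 6 / 7).
DEFAULT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def read_blast_tabular(lines, columns=DEFAULT_COLUMNS):
    """Turn tab-separated BLAST lines into a list of dicts.

    `lines` can be an open file or any iterable of strings. Comment
    lines (the '#' headers written by -outfmt 7) and blank lines are
    skipped.
    """
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        record = dict(zip(columns, line.split("\t")))
        # Convert the known numeric columns; anything else stays text.
        for key in record:
            if key in INT_FIELDS:
                record[key] = int(record[key])
            elif key in FLOAT_FIELDS:
                record[key] = float(record[key])
        records.append(record)
    return records
```

For a file produced with -outfmt "6 qseqid sseqid btop", the call would be `read_blast_tabular(open("hits.tsv"), columns=["qseqid", "sseqid", "btop"])`, where `hits.tsv` is just a placeholder file name.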
It is not easily parsable, but it can be done. You would want to try
-outfmt 6
(or 7) (LINK). If you need SAM format output, then magicblast (LINK) is what you want.
What does that mean?
For example, the MATLAB swalign function (https://it.mathworks.com/help/bioinfo/ref/swalign.html) returns a structure in which, for each alignment, the score and the alignment are stored separately. I would like to understand what format was returned to me so that I can choose a function or strategy for importing the data into my development environment (that is, doing the reverse).