Tools to obtain sequence from bam file
2
0
Entering edit mode
19 months ago
Ankit ▴ 500

Hi everyone,

I would like to extract sequence of reads from a bam file in such a way that indels are also incoproored based on information obtained from flags/penalty information. Basically I want something like pairwise sequence alignment of all my reads ..1,2,3...N against my refence sequence of interest .like this (Just imagine each row is a read from Sam/bam file obtained after alignment of data) and showing wherever a deletion was present as '-'. Insertion as read adjusted according to reference.

Do you know any Unix or R based tools?

Thanks in advance

sequencing Bam indels • 2.2k views
ADD COMMENT
1
Entering edit mode

samtools tview ?

ADD REPLY
1
Entering edit mode

This is what you probably need: https://github.com/mlafave/sam2pairwise

ADD REPLY
0
Entering edit mode

Thank you for the response. I tried sam2pairwise. But I am not getting the required output.

May be I should explain it better.

Please see this image.enter image description here

I want to focus only on the output. So if an insertion or deletion occur the reference + dashes (adjusted to insertion in reads) will acts as a superset and reads will be the variables of different combinations (the biological sequence). So the dimension of the matrix will depends on reference +dashes (due to insertion in reads). Lets says 100 reads (rows) X 100 bases (columns) (adjusted by the dashes). I want to take out only the reads not the reference. To add on here I want to give an example of samtools fasta function, which converts a BAM to a FASTA. The issue is, it is not giving me the indels information (DASHES) and this basically shift the alignment of base position and also changes the dimension of matrix.

I hope my query is clear to you. If I can get only the sequence of reads as in image attached or even the fasta (same dimension), that will do the job.

Please suggest if you know some way to do that.

ADD REPLY
0
Entering edit mode

such a tool does not exist - or shall I say is not a mainstream and common need.

you could write one yourself, and the way it would work is that you read and parse the CIGAR string with a library such as simplesam (if you use Python) and then apply the transformations to both the reads and reference.

I will say that the problem could end up being more difficult in the sense that I am not sure you would want to add all insertions into the reference, only the ones that are valid insertions supported by the majority of the reads. Otherwise, you risk creating a wonky reference with lots of spurious inserts.

Basically you would want some sort of consensus to be built as well.

ADD REPLY
0
Entering edit mode

Basically I want something like pairwise sequence alignment of all my reads ..1,2,3...N against my refence sequence of interest .

Is that requirement not being satisfied by sam2pairwise?

ADD REPLY
0
Entering edit mode

sam2pairwise will do the alignments pairwise, and represents that single alignment

it will not insert bases into the reference based on another overlapping read, that is what the OP seems to want

ADD REPLY
1
Entering edit mode
19 months ago

I wrote https://lindenb.github.io/jvarkit/TViewCmd.html a java version of samtools tview. As far as I remember, I included insertions and deletions.

https://lindenb.github.io/jvarkit/TViewCmd.html

ADD COMMENT
1
Entering edit mode
19 months ago
cmdcolin ★ 3.8k

samtools tview does this. it is one of few tools that do this that I know

see screenshot

enter image description here

the red asterisks indicate positions where the reads have an insertion relative to the reference

there are not many other tools I know that do this, and they are often text based, here are two other tools that have similar functionality (Hapviz and VizALN from the HipSTR package, but they may be more difficult to install than samtools)

https://cmdcolin.github.io/awesome-genome-visualization/?latest=true&selected=%23Hapviz&tag=Text%20based https://cmdcolin.github.io/awesome-genome-visualization/?latest=true&selected=%23VizAln-from-HipSTR&tag=Text%20based

ADD COMMENT
0
Entering edit mode

Thank you for the suggestions.

I am wondering how to get the all bases rather than dots at the No-mismatch sites. It seems you are getting this with samtools tview. What is that option in samtools tview?

ADD REPLY
1
Entering edit mode

press ?in tview to get help

ADD REPLY
0
Entering edit mode

Thanks Pierre.

I have two questions:

  1. I noticed some asterisk getting inserted in both my reads and reference genome sequence. What are they? In IGV I do not see them as insertion.

  2. How to deal with paired end bam file which has some portion of reads overlapping ? Do I need to split bam file to read1 and read2 ? My main purpose is to get the same alignment as looking like in IGV (with show all bases option) as an output text file. But from IGV it is not possible? to get such alignment pattern for all the reads.

Thank you

ADD REPLY
0
Entering edit mode

Any further suggestions?

ADD REPLY
0
Entering edit mode

the asterisk is the insertion. the reference site does not have the inserted base, and only some reads will. I am not sure how to handle the pairs, perhaps you could collapse them using an external program https://www.google.com/search?q=merge+overlapping+paired+end+reads

ADD REPLY

Login before adding your answer.

Traffic: 2790 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6