How To Get All Contig Boundaries In A Bam File.
4
1
Entering edit mode
10.2 years ago
dustar1986 ▴ 350

Hi everyone,

Is there a smart way to get all contig boundaries after assembly? I know samtools can get number of sequences covering a defined region. Then I could write a pipeline to find each contig's boundary by looking for a region flank by coverage of 0. It seems this costs a lot of time. Is there an alternative and quicker way to know the start and end position of each contig?

Thanks a lot.

Dadi

next-gen sequencing contigs • 5.0k views
ADD COMMENT
0
Entering edit mode

I'm not sure I understand your question - are you aligning reads to already assembled contigs or assembling contigs from reads?

ADD REPLY
0
Entering edit mode

I am talking about assembling contigs from reads. Sorry to confuse you.

ADD REPLY
4
Entering edit mode
10.2 years ago

You could use samtools mpileup and awk to get the contigs:

samtools  mpileup file.bam |\
cut -d '        ' -f 1,2 |\
awk -F '        ' 'BEGIN {chr="";start=-1;end=-1} {if(chr!=$1 || int($2)!=end+1) { if(chr!="") {printf("%s:%d-%d\n",chr,start,end);} chr=$1;start=int($2);end=int($2);} else { end=end+1;}} END {if(chr!="") {printf("%s:%d-%d\n",chr,start,end); } }'
chr1:10044-10300
chr1:10330-10468
chr1:10504-10557
chr1:10585-10633
chr1:11302-11355
chr1:11594-11647
chr1:11696-11823
chr1:11827-13107
chr1:13122-13840
chr1:13874-13973
(...)
ADD COMMENT
0
Entering edit mode

Thank you very much. Extremely helpful. That's exact what I wanna get.

ADD REPLY
0
Entering edit mode

update 10 years later: just pipe into `bedtools merge|

ADD REPLY
1
Entering edit mode
8.1 years ago

You could use a sam2bed conversion step with a bedops --merge operation:

$ sam2bed < reads.sam | bedops --merge - > mergedReadCoordinates.bed

Note that this conversion from SAM to BED will change 1-indexed coordinates to 0-indexed coordinates.

ADD COMMENT
0
Entering edit mode
10.2 years ago
Pablo ★ 1.9k

I don't know if I got your question right, but if you have a BAM file, doing this:

samtools view -H file.bam| grep "^@SQ"

you will get the length of each contig in the "LN" sub-field (the start is always 1).

Apologies in advanced if I misunderstood you...

ADD COMMENT
0
Entering edit mode

I'm sorry maybe I didn't make it clear. If there are only 3 reads within the sequencing data. Sequence 1: chr1:100-150; Sequence 2: chr1:110-155; Sequence 3: chr1:130-175. Then it should be assembled into a contig:chr1:100-175. I am asking if I can get all the contigs position after reads assembly.

ADD REPLY
0
Entering edit mode
9.4 years ago

if you do in samtools idxstats with your file.bam

samtools idxstats file.bam > file.stats.txt

you will get output three columns with Contigname, contiglenth, no of reads

ADD COMMENT
0
Entering edit mode

how to obtain contig sequence?

ADD REPLY

Login before adding your answer.

Traffic: 2453 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6