Question: How Can I Know The Length Of Mapped Reads From Bam File?
3
gravatar for Jordan
6.7 years ago by
Jordan1.1k
Pittsburgh
Jordan1.1k wrote:

Hi,

I have a bam files produced from mapping by LifeScope Tool (mapping of SOLiD reads). I would like to know the length of the reads that were mapped to the reference genome based on the information present in bam files. Is there a tool which can give me the stats of such things from bam files?

Thanks!

bam mapping • 26k views
ADD COMMENTlink modified 2.6 years ago by doyourun120 • written 6.7 years ago by Jordan1.1k

Do you want the length after the read is aligned, including insertions and deletions?

ADD REPLYlink written 6.7 years ago by Zev.Kronenberg11k

I don't need to include insertions and deletions.

ADD REPLYlink written 6.7 years ago by Jordan1.1k

Hi Zev, what would the syntax be if I include indels and gaps in the reads?

Many thanks,

Reem Alhamidi

ADD REPLYlink written 7 months ago by alhamidi.reem10
14
gravatar for Chris Miller
6.7 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

This will take the first million reads and give you counts of each read length. To look at more of the bam, get rid of the "head" command:

samtools view file.bam | head -n 1000000 | cut -f 10 | perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
ADD COMMENTlink written 6.7 years ago by Chris Miller21k
14

little shorter version: samtools view test.bam | awk '{print length($10)}' | head -1000 | sort -u

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by Rm7.9k

That was simple as well. Thanks!

ADD REPLYlink written 6.7 years ago by Jordan1.1k

Hi, how can I count the number of reads with certain length based on the above command? Thanks in advance!

ADD REPLYlink written 3.1 years ago by SpreeFU0

Ah! That was very simple to follow and clever use of unix commands.. Thanks for that!

ADD REPLYlink written 6.7 years ago by Jordan1.1k

and do not use only sort but sort -n

ADD REPLYlink written 3.2 years ago by Korsocius130

This command outputs the length distribution of both mapped and unmapped reads right? I thought the question was to ask the length distribution output of only mapped reads.

ADD REPLYlink written 4 months ago by xiaoleiusc0
1

Sure - make it samtools view -F 4 file.bam ... if you want only mapped

ADD REPLYlink written 4 months ago by Chris Miller21k

thanks a lot! so if I only want unmapped reads length distribution, I just need to make the "F" case lower "f" right? for example samtools view -f 4 file.bam. I tried "samtools view file.bam", "samtools view -F 4 file.bam" and "samtools view -f file.bam", I found that the reads count in a certain length in -f output and -F output did not add up to the same length counts in "samtools view file.bam", I am confused, I thought the mapped and unmapped read counts should added up to the total counts of a certain length in "samtools view file.bam" right?

ADD REPLYlink written 4 months ago by xiaoleiusc0
3
gravatar for Rm
6.7 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

if you need to know alignment length of the read on the reference after it aligned: Using bioperl modules

##mapped_read_Length.pl 
#!/usr/bin/perl -w

use strict;
use warnings;  
use Bio::DB::Sam;

my $usage = "Usage $0 <input.bam>\n";
my $infile = shift or die $usage;
my $sam = Bio::DB::Sam->new(-bam  => $infile);
my @chromosomes = $sam->seq_ids;   
print "Read_ID\tChr\tStart\tEnd\tMapped_Length\tCIGAR_String\n";

foreach my $chr (@chromosomes){
   my @alignments = $sam->get_features_by_location(-seq_id => $chr);
   for my $a (@alignments) {
      my $id  = $a->name;
      my $cigar  = $a->cigar_str;
      my $start  = $a->start;
      my $end    = $a->end;
      my $mapped_length = $end-$start;
     print "$id\t$chr\t$start\t$end\t$mapped_length\t$cigar\n";    
}
}
__END__

OUTPUT:

Read_ID    Chr    Start    End    Mapped_Length    CIGAR_String
BMW-S105R:17:H02AWADXX:2:1109:4198:22439    chr1    564455    564503    48    6M1I43M
BMW-S105R:17:H02AWADXX:2:1109:4309:28023    chr1    564682    564732    50    5M1D45M
BMW-S105R:17:H02AWADXX:2:1109:15415:28127    chr1    564689    564737    48    37M1I12M
ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Rm7.9k
1
gravatar for Zev.Kronenberg
6.7 years ago by
United States
Zev.Kronenberg11k wrote:

There are many ways to skin a cat. Below is a c++ implementation of BamTools that will produce what your want. You can get much more information out the the al BamTools::Alignment object.

More about BamTools https://github.com/pezmaster31/bamtools

#include  "api/api_global.h"
#include  "api/BamReader.h"

using namespace std;
using namespace BamTools;


int main(int argc, const char * argv[])
{
   std::string filename = argv[1];
   BamTools::BamReader reader
   BamTools::BamAlignment al;
      while(reader.GetNextAlignment(al)){
            std::cout << al.Length << "\n";

        }
}
ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Zev.Kronenberg11k

I don't have experience with C++. Can it be implemented in a different language? I have used samtools.1 view to get the basic info. I think the output is in sam format. Probably I can write a code to parse the column 6 - CIGAR and output the length. But I was wondering if there is a tool already to do it. I didn't want to mistakes while computing length from CIGAR.

Looked into bamtools, but like I said I'm not good at C++.

ADD REPLYlink modified 6.7 years ago • written 6.7 years ago by Jordan1.1k

I know that this comment is old, but since it is on the first page of google...

You forgot to open a file in your example.

ADD REPLYlink written 2.0 years ago by Tg310
1
gravatar for Alex Reynolds
6.7 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

BEDOPS includes a bam2bed conversion script, which filters unmapped reads in the course of converting to BED. You can therefore get mapped read lengths by piping the sequence field of the converted data to awk, e.g.:

$ bam2bed < foo.bam \
    | cut -f 12 \
    | awk '{print length($0)}' \
    > mapped_read_lengths.txt

If you use samtools directly with alternatives suggested in other answers, you might first want to prefilter your data to mapped reads with GNU awk, before piping to downstream utilities:

$ samtools view foo.bam | gawk '(! and(4, $2))' > mapped_foo.sam
$ samtools view -S mapped_foo.sam | ...

Or, more simply:

$ samtools view foo.bam | gawk '(! and(4, $2))' | ...

Where ... denotes the use of downstream utilities.

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Alex Reynolds29k

The bam2bed will fail if you have spliced alignments. But he didn't specify if DNA or RNA alignment in the question so the answer is still valid.

ADD REPLYlink written 10 weeks ago by opplatek30
0
gravatar for doyourun
2.6 years ago by
doyourun120
United States
doyourun120 wrote:

If you are given bam files and want to know the READ length that was used in sequencing, this will work. replace bamfile by your bam file (and make sure it was indexed .. e.g. a bamfile.bam.bai exists)

Typically, I need to know this to run a pipeline.

All on one line.

samtools view bamfile.bam | head -n 1000 | gawk '{print length($10)}' | sort | uniq -c | perl -ane '$_ =~ s/^[ ]+//g;print $_' | sort -k 1nr,1nr | head -1 | cut -f2 -d " "

ADD COMMENTlink written 2.6 years ago by doyourun120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2153 users visited in the last hour