How Can I Know The Length Of Mapped Reads From Bam File?
6
6
Entering edit mode
9.5 years ago
Jordan ★ 1.2k

Hi,

I have a bam files produced from mapping by LifeScope Tool (mapping of SOLiD reads). I would like to know the length of the reads that were mapped to the reference genome based on the information present in bam files. Is there a tool which can give me the stats of such things from bam files?

Thanks!

bam mapping • 44k views
0
Entering edit mode

Do you want the length after the read is aligned, including insertions and deletions?

0
Entering edit mode

I don't need to include insertions and deletions.

0
Entering edit mode

Hi Zev, what would the syntax be if I include indels and gaps in the reads?

20
Entering edit mode
9.5 years ago

This will take the first million reads and give you counts of each read length. To look at more of the bam, get rid of the "head" command:

samtools view file.bam | head -n 1000000 | cut -f 10 | perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c  ADD COMMENT 0 Entering edit mode Ah! That was very simple to follow and clever use of unix commands.. Thanks for that! ADD REPLY 0 Entering edit mode and do not use only sort but sort -n ADD REPLY 0 Entering edit mode This command outputs the length distribution of both mapped and unmapped reads right? I thought the question was to ask the length distribution output of only mapped reads. ADD REPLY 1 Entering edit mode Sure - make it samtools view -F 4 file.bam ... if you want only mapped ADD REPLY 0 Entering edit mode thanks a lot! so if I only want unmapped reads length distribution, I just need to make the "F" case lower "f" right? for example samtools view -f 4 file.bam. I tried "samtools view file.bam", "samtools view -F 4 file.bam" and "samtools view -f file.bam", I found that the reads count in a certain length in -f output and -F output did not add up to the same length counts in "samtools view file.bam", I am confused, I thought the mapped and unmapped read counts should added up to the total counts of a certain length in "samtools view file.bam" right? ADD REPLY 0 Entering edit mode Hi, thank you for your answer. Do you know how to modify your code if I want to specify the reads in a certain region, for example, want to know the reads length distribution that are only located in chr2? ADD REPLY 16 Entering edit mode 9.5 years ago Rm 8.2k A little shorter version: samtools view test.bam | awk '{print length($10)}' | head -1000 | sort -u

1
Entering edit mode

What does 10 represent here? ADD REPLY 1 Entering edit mode The awk command is printing the string length of the tenth field (segment sequence) out of the SAM-formatted output from samtools view. ADD REPLY 0 Entering edit mode That was simple as well. Thanks! ADD REPLY 0 Entering edit mode Hi, how can I count the number of reads with certain length based on the above command? Thanks in advance! ADD REPLY 3 Entering edit mode 9.5 years ago Rm 8.2k if you need to know alignment length of the read on the reference after it aligned: Using bioperl modules ##mapped_read_Length.pl #!/usr/bin/perl -w use strict; use warnings; use Bio::DB::Sam; myusage = "Usage $0 <input.bam>\n"; my$infile = shift or die $usage; my$sam = Bio::DB::Sam->new(-bam  => $infile); my @chromosomes =$sam->seq_ids;

foreach my chr (@chromosomes){ my @alignments =sam->get_features_by_location(-seq_id => $chr); for my$a (@alignments) {
my $id =$a->name;
my $cigar =$a->cigar_str;
my $start =$a->start;
my $end =$a->end;
my $mapped_length =$end-$start; print "$id\t$chr\t$start\t$end\t$mapped_length\t$cigar\n"; } } __END__  OUTPUT: Read_ID Chr Start End Mapped_Length CIGAR_String BMW-S105R:17:H02AWADXX:2:1109:4198:22439 chr1 564455 564503 48 6M1I43M BMW-S105R:17:H02AWADXX:2:1109:4309:28023 chr1 564682 564732 50 5M1D45M BMW-S105R:17:H02AWADXX:2:1109:15415:28127 chr1 564689 564737 48 37M1I12M  ADD COMMENT 1 Entering edit mode 9.5 years ago There are many ways to skin a cat. Below is a c++ implementation of BamTools that will produce what your want. You can get much more information out the the al BamTools::Alignment object. More about BamTools https://github.com/pezmaster31/bamtools #include "api/api_global.h" #include "api/BamReader.h" using namespace std; using namespace BamTools; int main(int argc, const char * argv[]) { std::string filename = argv[1]; BamTools::BamReader reader BamTools::BamAlignment al; while(reader.GetNextAlignment(al)){ std::cout << al.Length << "\n"; } }  ADD COMMENT 0 Entering edit mode I don't have experience with C++. Can it be implemented in a different language? I have used samtools.1 view to get the basic info. I think the output is in sam format. Probably I can write a code to parse the column 6 - CIGAR and output the length. But I was wondering if there is a tool already to do it. I didn't want to mistakes while computing length from CIGAR. Looked into bamtools, but like I said I'm not good at C++. ADD REPLY 0 Entering edit mode I know that this comment is old, but since it is on the first page of google... You forgot to open a file in your example. ADD REPLY 1 Entering edit mode 9.5 years ago BEDOPS includes a bam2bed conversion script, which filters unmapped reads in the course of converting to BED. You can therefore get mapped read lengths by piping the sequence field of the converted data to awk, e.g.: $ bam2bed < foo.bam \
| cut -f 12 \
| awk '{print length($0)}' \ > mapped_read_lengths.txt  If you use samtools directly with alternatives suggested in other answers, you might first want to prefilter your data to mapped reads with GNU awk, before piping to downstream utilities: $ samtools view foo.bam | gawk '(! and(4, $2))' > mapped_foo.sam$ samtools view -S mapped_foo.sam | ...


Or, more simply:

$samtools view foo.bam | gawk '(! and(4,$2))' | ...


Where ... denotes the use of downstream utilities.

0
Entering edit mode

The bam2bed will fail if you have spliced alignments. But he didn't specify if DNA or RNA alignment in the question so the answer is still valid.

1
Entering edit mode
5.3 years ago
doyourun ▴ 180

If you are given bam files and want to know the READ length that was used in sequencing, this will work. replace bamfile by your bam file (and make sure it was indexed .. e.g. a bamfile.bam.bai exists)

Typically, I need to know this to run a pipeline.

# All on one line.

samtools view bamfile.bam | head -n 1000 | gawk '{print length($10)}' | sort | uniq -c | perl -ane '$_ =~ s/^[ ]+//g;print \$_' | sort -k 1nr,1nr | head -1 | cut -f2 -d " "