Question: Is It Possible To Get Fragment Length, Read Length And Number Of Fragments From A Bam/Sam File
4.7 years ago by
samsara530
The Earth
samsara530 wrote:

Can I get the following information from a BAM/SAM file?

  • fragment length
  • read length
  • number of fragments

Or do I need to ask the people who sequenced the samples?

next-gen bam • 19k views
modified 11 months ago by jerry10 • written 4.7 years ago by samsara530
4.4 years ago by
Fidel1.8k
Germany
Fidel1.8k wrote:

Sure you can. The BAM/SAM format contains all that information.

To get the fragment length directly you need paired-end sequencing: for paired-end reads it is given in the 9th column, TLEN (see the SAM Format Specification). For single-end data you will need a tool that processes the BAM file and estimates the fragment length (the peak caller MACS does that).
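As a rough illustration of reading column 9, here is a minimal Python sketch that collects TLEN once per properly paired fragment from SAM text lines. The function name `template_lengths` and the demo records are made up for this example, and it assumes you already have SAM text (for a BAM file you would convert it first or use a BAM library).

```python
import statistics

def template_lengths(sam_lines):
    """Return column 9 (TLEN) once for each properly paired fragment."""
    lengths = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        proper_pair = (flag & 0x1) and (flag & 0x2)
        extra_alignment = flag & 0x900  # secondary (0x100) or supplementary (0x800)
        # The leftmost mate carries the positive TLEN, so each pair is counted once.
        if proper_pair and not extra_alignment and tlen > 0:
            lengths.append(tlen)
    return lengths

# Hypothetical records; SEQ and QUAL are "*" to keep the lines short.
demo = [
    "@SQ\tSN:chr10\tLN:135534747",
    "pairA\t163\tchr10\t116212135\t60\t41M\t=\t116212198\t164\t*\t*",
    "pairA\t83\tchr10\t116212198\t60\t101M\t=\t116212135\t-164\t*\t*",
    "pairB\t99\tchr10\t116211652\t60\t101M\t=\t116211721\t170\t*\t*",
    "pairB\t147\tchr10\t116211721\t60\t101M\t=\t116211652\t-170\t*\t*",
]

lengths = template_lengths(demo)
print(lengths, statistics.median(lengths))
```

Restricting to properly paired primary alignments is a design choice of this sketch, not a rule of the format; discordant pairs can carry very large or zero TLEN values that would distort a summary.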

The read length can be obtained either by looking at the CIGAR field (column 6) or by counting the length of the sequence (column 10). The CIGAR field encodes the differences between the read sequence and the reference genome, but most reads usually map without indels or clipping, so the CIGAR field can quickly tell you the read length. For a 100 bp read such a CIGAR looks like this: 100M. In other words, if you see just a number followed by M and nothing else, that number is the read length. When other operations are present (insertions, deletions, clipping), the number before M is no longer the read length, and you should count the sequence in column 10 instead.
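For CIGAR strings that are not a plain `100M`, the read length can still be derived by summing the operations that consume read bases (M, I, S, = and X). The sketch below is one way to do that in Python; the function name `read_length_from_cigar` is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume bases of the read sequence stored in column 10.
QUERY_OPS = set("MIS=X")

def read_length_from_cigar(cigar):
    """Length of the stored read sequence implied by a CIGAR string."""
    if cigar == "*":
        return None  # CIGAR unavailable, e.g. for an unmapped read
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in QUERY_OPS)

print(read_length_from_cigar("100M"))
print(read_length_from_cigar("5S90M2I3M"))
```

Hard-clipped bases (H) are not stored in column 10, so they are left out here; if you want the length of the read as it came off the sequencer, you would need to add them back.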

The number of fragments is the number of alignment lines (not counting header lines) in the BAM/SAM file for single-end data, or half that number for paired-end data. Here one must be careful, so it is probably easier to count the reads in the FASTQ file instead. The BAM/SAM file may have been filtered to remove all fragments that did not map, in which case a count based on it will underestimate the total. Some aligners also write secondary or supplementary alignments for a read as extra lines, which would inflate the count.
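If you do count from the SAM text, a slightly more careful count than "lines divided by two" can use the FLAG field (column 2). The following Python sketch counts one record per fragment; `count_fragments` and the demo records are hypothetical, and the choice to count the first mate only is an assumption that undercounts if first mates were filtered out.

```python
def count_fragments(sam_lines):
    """Count fragments: one primary record per single-end read or per pair."""
    count = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) alignment
        if (flag & 0x1) and (flag & 0x80):
            continue  # second mate of a pair; the first mate is counted instead
        count += 1
    return count

# Hypothetical records; SEQ and QUAL are "*" to keep the lines short.
demo = [
    "@HD\tVN:1.6\tSO:coordinate",
    "pairA\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "pairA\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "pairA\t355\tchr2\t900\t0\t50M\t=\t300\t0\t*\t*",   # secondary alignment
    "single1\t0\tchr1\t500\t60\t50M\t*\t0\t0\t*\t*",    # unpaired read
]

print(count_fragments(demo))
```

This still only counts what is in the file, so the caveat about filtered or unmapped fragments applies unchanged.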

modified 4.4 years ago • written 4.4 years ago by Fidel1.8k
3.8 years ago by
danrdanny60
danrdanny60 wrote:

To clarify, the fragment length in the SAM file can be a little misleading because of the way some genome browsers interpret it.

Field 9 (using 1-based, not 0-based counting) is the fragment length, not the insert size:

SEB9BZKS1:207:H8D40ADXX:2:2111:11030:73639    163    chr10    116212135    60    41M    =    116212198    164    TTAGAAAGGTTAAAACAATTAATGTATTTTTTTCAACAAAT    <>))=1>>7@<?9><)=38>???@>>?==<=86<?<<>7<?    NM:i:2    AS:i:31    XS:i:21
SEB9BZKS1:207:H8D40ADXX:2:2111:11030:73639    83    chr10    116212198    60    101M    =    116212135    -164    TAAATGCCAATTACACTGACACCAGGAAACACACATCTAGGGCCAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGAGAGGCTGAGGCAGGCGGAT    >5>>A:CCCB@>5:;;--,55@;6@A;7667?ACHFFHHGD=9;=C@8.@EFFCIGD@?HD;HHFD?>@F<?DB<2<+CGIIIIHGHFCDADDFDDDD?=1    NM:i:0    AS:i:101    XS:i:57

So, in the above example, the fragment size is 164 bp. The insert size is much smaller than that. Some browsers will tell you the insert size is 164. It is not.

The insert size in this case is 164 - (101 + 41) = 22 bp: the fragment length minus the sum of the two read lengths. (The 101 and 41 come from field 6.)

This gets a little odd when pairs overlap:

SEB9BZKS1:207:H8D40ADXX:1:2110:9214:15411    99    chr10    116211652    60    101M    =    116211721    170    AGAAAGAAGAAAAGTAGGGGAGGGGAGAGGGGAGAAAGAGAGGAGAAAAAATATTAATAATAATGTTGAAAAGGACAGTATGATGATGACATATGCTGACT    =?@DFDF?FCDFBGAEFHGICEFHIGIIIIIG;AFHIIIIEHGCEBDFEC>B@CBDCDECDECDC@>CCCCCCB<8?BC>@DC@CDDCDCCCCDC>@ACCC    NM:i:0    AS:i:101    XS:i:0
SEB9BZKS1:207:H8D40ADXX:1:2110:9214:15411    147    chr10    116211721    60    101M    =    116211652    -170    AAAGGACAGTATGATGATGACATATGCTGACTTTGCTAAGCACTCTATGCATATTTACTTTAACTCAGGAGGCAGTGCTTAAGAGCTCAAGCTCTGGAATG    CCEEECEBC;>>DHEHEC=7DIIJIIHD@4>IGHCHCGIJHFIIIIIJJIHFBGGHBIJIGC9HCIIGHHCJIJIIHEFFJIGIGGIIHHHHD4FFFDCCC    NM:i:1    AS:i:96    XS:i:20

Above are two reads, each 101 bp long, with a fragment size of 170 bp. Since 170 - (101 + 101) = -32, the reads overlap by 32 bp, so there is no insert. But browsers like IGV will gladly tell you that the insert size in this case is 170 bp.
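The arithmetic in the two examples above can be written as a small Python helper. The name `gap_between_mates` is made up for illustration, and the calculation assumes both CIGARs are plain matches (41M, 101M), so that the read length equals the number of reference bases each read covers.

```python
def gap_between_mates(tlen, read_len_1, read_len_2):
    """Field 9 (sign ignored) minus the two read lengths.

    A positive result is the unsequenced gap between the mates;
    a negative result means the mates overlap by that many bases.
    """
    return abs(tlen) - (read_len_1 + read_len_2)

# First example: fields 9 and 6 are 164, 41M and 101M.
print(gap_between_mates(164, 41, 101))
# Overlapping example: fields 9 and 6 are -170, 101M and 101M.
print(gap_between_mates(-170, 101, 101))
```

With clipping or indels in the CIGAR, the read lengths would have to be replaced by the reference span of each alignment for the result to remain meaningful.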

Hope this helps–I was quite confused at first as well.

modified 3.8 years ago • written 3.8 years ago by danrdanny60

Paired-end reads are a neat molecular biology trick. Remember that "insert" refers to the DNA fragment between the adaptors, and not the gap between R1 and R2. Instead we refer to that as the "inner mate distance".

Source: Torsten Seemann

modified 11 months ago • written 11 months ago by Carlo Yague3.3k
4.7 years ago by
Sangwoo Kim340
UC San Diego
Sangwoo Kim340 wrote:

I am not sure whether FastQC gives fragment length (as opposed to read length) statistics. I know Picard has a CollectInsertSizeMetrics program: http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics

written 4.7 years ago by Sangwoo Kim340

You are, of course, correct.

written 4.7 years ago by Sean Davis23k
4.7 years ago by
Sean Davis23k
National Institutes of Health, Bethesda, MD
Sean Davis23k wrote:

Take a look at using FastQC on your BAM file.

written 4.7 years ago by Sean Davis23k

I did not get any information regarding fragment length using FastQC.

written 4.7 years ago by samsara530
4.4 years ago by
venkateshr89680
United States
venkateshr89680 wrote:

Hi,

You can get the read length from FastQC, which gives you statistics about your reads, bases, base composition, etc., from raw FASTQ files or from a BAM/SAM file. You can get the fragment size from the Qualimap tool or GATK's Depth of Coverage tool. I personally prefer Qualimap because of its ease of use and the nice histograms it creates.

Good luck

written 4.4 years ago by venkateshr89680
11 months ago by
jerry10
United States
jerry10 wrote:

Quick note on fragment length as applied to RNA-Seq, since I've been working on it and haven't found an answer yet.

For paired-end reads from RNA-Seq, column 9 of the SAM file gives you the genomic distance from the beginning of the R1 read to the end of the R2 read. Note that this is NOT necessarily the fragment length, because you could have splicing between the paired-end reads. In other words, the paired-end (R1 and R2) reads could be very far apart from each other because there is a huge intron in between. Take a look at a typical RNA-Seq SAM file, and you will see that the values in column 9 can be quite large (several hundred kb). Again, this is most likely due to the presence of large introns in between.

Obtaining the original fragment length requires knowledge of splice events that might occur in between your paired-end reads, which is not a trivial problem. I have yet to find any tools or scripts to correctly obtain the fragment length that take splicing into account. If anyone knows of any, please share here. Thanks!

written 11 months ago by jerry10

It can't be done. Even without splicing events it can't be done from the 9th column (TLEN) alone, since TLEN spans from the first MAPPED position to the last MAPPED position. You also have to read the CIGAR string, check for indels, clipped bases, etc. And even then, who knows what trimming was done beforehand, or what differences from the reference occurred in the unsequenced portion of the fragment?

The BAM doesn't care about fragments. It's all about alignments. If you want fragment lengths, either sequence the whole fragment or, for the global distribution, use a Bioanalyzer.
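To see how much of a spliced alignment's genomic span comes from introns, one can total the CIGAR operations that consume reference bases and separate out the N (skipped region) operations. This Python sketch only illustrates that bookkeeping for a single read; the function names and the example CIGAR are hypothetical, and it does not recover the true fragment length, for the reasons given above.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference bases.
REFERENCE_OPS = set("MDN=X")

def reference_span(cigar):
    """Number of reference bases covered by the alignment, introns included."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REFERENCE_OPS)

def skipped_length(cigar):
    """Total length of N operations (skipped regions such as introns)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "N")

cigar = "50M2000N51M"  # hypothetical read spliced across a 2000 bp intron
print(reference_span(cigar), skipped_length(cigar))
print(reference_span(cigar) - skipped_length(cigar))
```

An intron lying entirely between the two mates is not recorded in either CIGAR, which is why this kind of correction cannot be completed from the BAM alone.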

written 11 months ago by John12k