Question

Depth of coverage for specific target from bam file

0

Entering edit mode

8.8 years ago

jfjiang ▴ 10

Hi all,

We used multiple PCR to prepare the amplicons and loaded on MiSeq to get the sequencing data.

We now want to get the average depth of coverage for targeted amplicon.

However, it seems that if the amplicons have overlapped regions, GATK will merge the interval region first, then report the depth. And GATK can not disable this "merging".

Is there anyway to figure out this accurate depth of coverage, since if we dump the depth from BAM by samtools "depth" command, we will have some regions with overlapped depths.

Best,
Junfeng

coverage depth • 9.6k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.8 years ago by jfjiang ▴ 10

Ram · Answer 1 · 2015-07-06

8

Entering edit mode

8.8 years ago

Pierre Lindenbaum 161k

EDIT:

OK I wrote this experimental tools: https://github.com/lindenb/jvarkit/wiki/PcrSliceReads , it reads a SAM and a BED and tries to associate a short read to a PCR fragment. I'm missing some real data so it's difficult to test this program. Each time a pair of reads is associated to a PCR fragment, the read-group-id is changed from 'ID' to 'ID_fragmentname'. You can later use GATK/DepthOfCoverage using --partitionType readgroup to get the depth for each region.

$ head coverage.read_group_summary
sample_id       total   mean    granular_third_quartile granular_median granular_first_quartile %_bases_above_15
mysample_rg_myid_PCR2   1106    0.35    1       1       1       0.2
mysample_rg_myid_PCR3   674     0.21    1       1       1       0.0
mysample_rg_myid_PCR1   51      0.02    1       1       1       0.0
mysample_rg_myid_PCR6   1279    0.40    1       1       1       0.1
mysample_rg_myid_PCR7   1554    0.49    1       1       1       1.3
mysample_rg_myid_PCR4   1147    0.36    1       1       1       0.8
mysample_rg_myid_PCR5   1738    0.55    1       1       1       1.5
mysample_rg_myid_PCR9   1260    0.40    1       1       1       0.8
mysample_rg_myid_PCR8   1930    0.61    1       1       1       2.1

I wrote this tool https://github.com/lindenb/jvarkit/wiki/BamStats04 ; It prints the coverage for each region of a bed.

$ java -jar dist/bamstats04.jar \
    BED=data.bed \
    I=f.bam
#chrom  start   end length  mincov  maxcov  mean    nocoveragepb    percentcovered
1   429665  429785  120 42  105 72.36666666666666   0   100
1   430108  430144  36  9   9   9.0 0   100
1   439811  439904  93  0   36  3.6451612903225805  21  77
1   550198  550246  48  1325    1358    1344.4583333333333  0   100
1   629855  629906  51  223 520 420.70588235294116  0   100
1   689960  690029  69  926 1413    1248.9420289855072  0   100
1   690852  690972  120 126 193 171.24166666666667  0   100
1   787283  787406  123 212 489 333.9756097560976   0   100
1   789740  789877  137 245 688 528.6715328467153   0   1

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 8.8 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Brilliant answer thank you worked perfectly. Couple of points for anyone wanting to use the tool though, I was using it with a bed file generated from hg19 alignments using STAR, Chr 23 = X, Chr 24 = Y. Also make sure your bed coordinates are sorted so that start < end. I had a few alignments which were on the anti-sense strand so the coordinates were the wrong way around.

ADD REPLY • link 3.8 years ago by mahmoudbassal84 ▴ 10

0

Entering edit mode

Thanks.

However, I think it the two regions are overlapped, this overlapped region will have larger depth of coverage than the others, which is not exactly the real depth of the target region.

Say, the two regions are 200 bp each, of which 100 bp region is overlapped, each of the region is sequenced 100X,

then the depth of the overlapped 100 bp region will be 200X while the other is 100X.

And the original thought is to get the sequenced depth.

I still can not find the best solution to this,

1) mapping to each target region only, discard those mapped reads less than 50 bp? then calculate the depth of this reigon.

2) use samtools to dump the bam to depth of each position, then get the median value as the sequence depth.

However, none of these ideas is the best I considered.

ADD REPLY • link 8.8 years ago by jfjiang ▴ 10

0

Entering edit mode

if you have two 200bp regions sequenced at 100x with a 100bp overlap, then you have a 300bp region sequenced at 100x

ADD REPLY • link 8.8 years ago by Jorge Amigo 14k

0

Entering edit mode

update 2021: there is now samtools ampliconclip samtools-ampliconclip http://www.htslib.org/doc/samtools-ampliconclip.html

Clip reads in a SAM compatible file based on data from a BED file.

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 161k

Ram · Answer 2 · 2015-07-06

0

Entering edit mode

8.8 years ago

Jorge Amigo 14k

here's a little bash script to get coverage information for each bed file region:

for region in `cat regions.bed | sed 's/\t/:/' | sed 's/\t/-/'`; do
  echo -n "$region "
  samtools depth -r $region reads.bam | awk '{c++;s+=$3;if($3>M)M=$3}END{print s/c, M}'
done

it'll print the average and the maximum coverage, but playing with the last awk section you can add any other metric you may be interested in.

ADD COMMENT • link 8.8 years ago by Jorge Amigo 14k

0

Entering edit mode

Thanks, this can indeed get the average depth of coverage for target region, however, it is not the real sequencing depth.

I added the comment on Pierre Lindenbaumthread.

ADD REPLY • link 8.8 years ago by jfjiang ▴ 10

1

Entering edit mode

sorry, but this code really gives you the real sequencing depth for each independent region defined on the bed file, which is what you were asking for. the for loop reads each line, transforms it into a chr:start-end format, calls samtools depth to get the coverage of each base of each individual region, and the awk section does the calculation for each individual region. hence, the output of the script is a line per bed file region, followed by its average and maximum coverages. if there are any regions that overlap it won't affect the coverage calculation of each individual region.

I think you're not making yourself clear enough. a bam file is linear: you have reads mapped to a genome that provide a depth of coverage value for each individual base of that genome. in your example, if you have in your bam file two 200bp regions sequenced at 100x with a 100bp overlap, it means that there's a 300bp region sequenced at 100x in your bam file. there's no other calculation you need to do. if by any reason there's any other thing you want to do with bed file regions and overlaps then I would suggest you to look at bedtools, but if you need the coverage for each particular region then Pierre's solution as well as this one is perfectly valid.

ADD REPLY • link 8.8 years ago by Jorge Amigo 14k

0

Entering edit mode

Thanks, Jorge,

I think I did not make myself clear, what I really want to know is not the depth of coverage, but the original reads depth for one region.

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.8 years ago by jfjiang ▴ 10

0

Entering edit mode

I thought so. then I would still recommend you to look at bedtools.

ADD REPLY • link 8.8 years ago by Jorge Amigo 14k

Ram · Answer 3 · 2015-07-07

if you have your coverage regions in bedgraph format (chr\tstart\tend\coverage), a way you could work it out is by taking advantage of the unionbedg function from bedtools:

while IFS='' read -r line || [[ -n $line ]]; do
  (( n++ ))
  echo "$line" > temp.$n.bedgraph
done <> regions.bedgraph
bedtools unionbedg -i temp.*.bedgraph \
| perl -lane '
foreach $f (@F[3..$#F]){$s+=$f}
print join "\t", @F[0..2], $s;
' \
> regions.joined.txt
rm temp.*.bedgraph

I haven't tested it, but I hope you get the idea.