Question: Depth of coverage for specific target from bam file
0
gravatar for jfjiang
3.3 years ago by
jfjiang10
China
jfjiang10 wrote:

Hi all,

We used multiple PCR to prepare the amplicons and loaded on MiSeq to get the sequencing data.

We now want to get the average depth of coverage for targeted amplicon.

However, it seems that if the amplicons have overlapped regions, GATK will merge the interval region first, then report the depth. And GATK can not disable this "merging".

Is there anyway to figure out this accurate depth of coverage, since if we dump the depth from BAM by samtools "depth" command, we will have some regions with overlapped depths.

 

Best,

Junfeng

depth coverage • 3.8k views
ADD COMMENTlink modified 3.3 years ago by Jorge Amigo11k • written 3.3 years ago by jfjiang10
7
gravatar for Pierre Lindenbaum
3.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:

EDIT:

OK I wrote this experimental tools: https://github.com/lindenb/jvarkit/wiki/PcrSliceReads , it reads a SAM and a BED and tries to associate a short read to a PCR fragment. I'm missing some real data so it's difficult to test this program. Each time a pair of reads is associated to a PCR fragment, the read-group-id is changed from 'ID' to 'ID_fragmentname'. You can later use GATK/DepthOfCoverage using *--partitionType readgroup * to get the depth for each region.

$ head coverage.read_group_summary
sample_id       total   mean    granular_third_quartile granular_median granular_first_quartile %_bases_above_15
mysample_rg_myid_PCR2   1106    0.35    1       1       1       0.2
mysample_rg_myid_PCR3   674     0.21    1       1       1       0.0
mysample_rg_myid_PCR1   51      0.02    1       1       1       0.0
mysample_rg_myid_PCR6   1279    0.40    1       1       1       0.1
mysample_rg_myid_PCR7   1554    0.49    1       1       1       1.3
mysample_rg_myid_PCR4   1147    0.36    1       1       1       0.8
mysample_rg_myid_PCR5   1738    0.55    1       1       1       1.5
mysample_rg_myid_PCR9   1260    0.40    1       1       1       0.8
mysample_rg_myid_PCR8   1930    0.61    1       1       1       2.1

 

 


I wrote this tool  https://github.com/lindenb/jvarkit/wiki/BamStats04 ; It prints the coverage for each region of a bed.

 

$ java -jar dist/bamstats04.jar \
    BED=data.bed \
    I=f.bam
#chrom  start   end length  mincov  maxcov  mean    nocoveragepb    percentcovered
1   429665  429785  120 42  105 72.36666666666666   0   100
1   430108  430144  36  9   9   9.0 0   100
1   439811  439904  93  0   36  3.6451612903225805  21  77
1   550198  550246  48  1325    1358    1344.4583333333333  0   100
1   629855  629906  51  223 520 420.70588235294116  0   100
1   689960  690029  69  926 1413    1248.9420289855072  0   100
1   690852  690972  120 126 193 171.24166666666667  0   100
1   787283  787406  123 212 489 333.9756097560976   0   100
1   789740  789877  137 245 688 528.6715328467153   0   1
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Pierre Lindenbaum112k

Thanks.

However, I think it the two regions are overlapped, this overlapped region will have larger depth of coverage than the others, which is not exactly the real depth of the target region.

Say, the two regions are 200 bp each, of which 100 bp region is overlapped, each of the region is sequenced 100X,

then the depth of the overlapped 100 bp region will be 200X while the other is 100X.

And the original thought is to get the sequenced depth.

I still can not find the best solution to this,

1) mapping to each target region only, discard those mapped reads less than 50 bp? then calculate the depth of this reigon.

2) use samtools to dump the bam to depth of each position, then get the median value as the sequence depth.

However, none of these ideas is the best I considered.

ADD REPLYlink written 3.3 years ago by jfjiang10

if you have two 200bp regions sequenced at 100x with a 100bp overlap, then you have a 300bp region sequenced at 100x

ADD REPLYlink written 3.3 years ago by Jorge Amigo11k
0
gravatar for Jorge Amigo
3.3 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

here's a little bash script to get coverage information for each bed file region:

for region in `cat regions.bed | sed 's/\t/:/' | sed 's/\t/-/'`; do
  echo -n "$region "
  samtools depth -r $region reads.bam | awk '{c++;s+=$3;if($3>M)M=$3}END{print s/c, M}'
done

it'll print the average and the maximum coverage, but playing with the last awk section you can add any other metric you may be interested in.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Jorge Amigo11k

Thanks, this can indeed get the average depth of coverage for target region, however, it is not the real sequencing depth.

I added the comment on Pierre Lindenbaumthread.

ADD REPLYlink written 3.3 years ago by jfjiang10
1

sorry, but this code really gives you the real sequencing depth for each independent region defined on the bed file, which is what you were asking for. the for loop reads each line, transforms it into a chr:start-end format, calls samtools depth to get the coverage of each base of each individual region, and the awk section does the calculation for each individual region. hence, the output of the script is a line per bed file region, followed by its average and maximum coverages. if there are any regions that overlap it won't affect the coverage calculation of each individual region.

I think you're not making yourself clear enough. a bam file is linear: you have reads mapped to a genome that provide a depth of coverage value for each individual base of that genome. in your example, if you have in your bam file two 200bp regions sequenced at 100x with a 100bp overlap, it means that there's a 300bp region sequenced at 100x in your bam file. there's no other calculation you need to do. if by any reason there's any other thing you want to do with bed file regions and overlaps then I would suggest you to look at bedtools, but if you need the coverage for each particular region then Pierre's solution as well as this one is perfectly valid.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Jorge Amigo11k

Thanks, Jorge,

I think I did not make myself clear, what I really want to know is not the depth of coverage, but the original reads depth for one region.

 

ADD REPLYlink written 3.3 years ago by jfjiang10

I thought so. then I would still recommend you to look at bedtools.

ADD REPLYlink written 3.3 years ago by Jorge Amigo11k
0
gravatar for Jorge Amigo
3.3 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

if you have your coverage regions in bedgraph format (chr\tstart\tend\coverage), a way you could work it out is by taking advantage of the unionbedg function from bedtools:

while IFS='' read -r line || [[ -n $line ]]; do
  (( n++ ))
  echo "$line" > temp.$n.bedgraph
done < regions.bedgraph
bedtools unionbedg -i temp.*.bedgraph \
| perl -lane '
foreach $f (@F[3..$#F]){$s+=$f}
print join "\t", @F[0..2], $s;
' \
> regions.joined.txt
rm temp.*.bedgraph

I haven't tested it, but I hope you get the idea.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Jorge Amigo11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2013 users visited in the last hour