Question: Obtaining depth for all positions for different exonic ranges from mpileup file
0
gravatar for kspata
4 months ago by
kspata40
Chicago
kspata40 wrote:

Hi,

I am analyzing 96 samples sequenced using NextSeq PE 150 reads. The goal is to study variants and per exon coverage of 10 genes. I have obtained the bed file for all the exonic regions of the 10 genes from UCSC genome browser. I am using trim_galore -> bowtie2 -> samtools pipeline for calling variants. I am using hg38 human genome assembly for alignment.

I calculated per exon coverage using bedtools intersect command as follows:

bedtools coverage -abam Sample-1.sorted.bam -b sorted.intervals.bed -counts > PerExonCoverage.txt

This gives me an output file in following format:

Chromosome    Start         End         Coverage   
      chr1    161677476     161677553   6561
     chr20    1895448       1895526     2377
     chr20    1915099       1915455    12081

I want to know the coverage of each nucleotide position for all exons for e.g as follows:

chr1 161677476   13
chr1 161677477  120
chr1 161677478  130

and so on..

I have an intermediate mpileup file which I have used to extract coverage of all positions like this:

cut -f1-4 Sample-1.mpileup > Sample-1.depth

This gives me coverage for all positions of the reference genome and not for the positions of selected exons.

The above command also results in a file size of almost 60GB per sample which is very huge for the resources available to me.

Is there a way in which I can get the coverage of all positions but only for exons listed in BED using either mpileup or BAM file.

Help will be appreciated. Thanks in advance!!

ADD COMMENTlink modified 4 months ago by Pierre Lindenbaum119k • written 4 months ago by kspata40
1
gravatar for Pierre Lindenbaum
4 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:
samtools depth -b sorted.intervals.bed Sample-1.sorted.bam
ADD COMMENTlink written 4 months ago by Pierre Lindenbaum119k

Thanks Pierre for quick reply. I performed the above command and got the output but in the output the first position for every exon is excluded. For example for the exon range:

chr3    46372903    46373961

The coverage starts from the 46372904 position instead of 46372903. The length of the exon is 1059 bp but coverage is reported for 1058 bp. Should I subtract 1 from every start position and calculate again? Will this be a correct approach?

Also, I changed the range from chr3 46372903 46373961 to chr3 46372895 46373961 and it gives me coverage for the region outside exon as well. What is the reason for obtaining coverage outside the exonic regions? Shouldn't the coverage of regions outside exon be 0 or very low? The exonic region was captured using IDT predesinged gene capture pool.

ADD REPLYlink modified 4 months ago • written 4 months ago by kspata40

it's a bed file. It's ZERO based and uses half-open intervals. The output of samtools is ONE-based.

ADD REPLYlink modified 4 months ago • written 4 months ago by Pierre Lindenbaum119k

see Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems

ADD REPLYlink written 4 months ago by Pierre Lindenbaum119k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 686 users visited in the last hour