Question: Finding Gaps in Targeted NGS Coverage Incorporating Replicates
gravatar for Robert Sicko
3.2 years ago by
Robert Sicko630
United States
Robert Sicko630 wrote:

I'm trying to identify regions of a targeted NGS panel that lack 20x coverage. I've figured out how to get my desired output (described below) per sequencing run. I'm running into trouble figuring out how to use sample replicates and only output regions that are missing in both replicates.

Example: I have run1, run2 and each run has sample1, sample2, sample3 I want to output regions for sample1 that are under 20x in both run1 and run2

My process for a single sequencing run that is working:

I used the LRG as described in my other question to create a ROI bed file with annotations. I used mosdepth to get per-base output per sample, then bedtools to intersect my ROI.bed to get a file that has columns

Chr Start End Region Size Strand Chr2 Start2 End2 Coverage Overlap

Where Chr Start End are from my ROI.bed and Chr2 Start2 End2 are from the per-base.bed from mosdepth. I create a file that combines all samples and all regions under 20x using awk.

awk -F $'\t' '{if($10 <20) {print FILENAME"\t"$0}}' /coverage/*_gene_final_roi.txt > /run1/all_samples_under_20x.txt

Which has columns

Sample Chr Start End Region Size Strand Chr2 Start2 End2 Coverage Overlap

Finally, I use that file and pandas to generate the tables I need (count of regions, sum of bases and recurrently missed regions) - my script is on my SO answer if it's helpful.

To handle the replicates I tried:




cat $file1 $file2 | awk -v OFS="\t" 'NR==1{print $0,"NEW";next}{print $0,$1"_"$5}' | sort -k13,13 | uniq -D | cut --complement -f 13 > $file_collapsed

python --input /coverage/20x_5bp_collapsed_replicates_r1_r2.txt --output /coverage/Illumina/Illumina_r1_r2_20x_5bp

My thought was create a new field combining sample ($1) and region ($5), sort based on it and use uniq -D to keep only duplicate values (regions by sample present in file1 and file2). However, I'm missing regions that are definitely under 20x in both replicates.

I've searched SO for pandas combine functions, but the per-base output can vary in depth across a single ROI, leading to multiple rows in /run1/all_samples_under_20x.txt where fields $1 and $5 are duplicated, which screws things up when I try to use the $1 $5 combination as a key in combining across replicates.

awk pandas ngs python bedtools • 1.1k views
ADD COMMENTlink modified 3.1 years ago • written 3.2 years ago by Robert Sicko630

I'm not sure of the logic of the approach - apologies. If I am correct in assuming what you want to do, I would:

  1. output per-base read-depth of each replicate separately
  2. find common bases < 20 read-depth
  3. overlap these common bases with ROI.bed
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Kevin Blighe69k

thanks, I figured I was over complicating things. Comparing the per-base output to find the common regions under 20x does seem like a better approach. I'll have to guess and check on finding the regions in common in the per-base output that are under 20x. My initial thought is combine the replicate per-base files, sort, then bedtools merge -i combined.bed -c 4 -o max However, since the intervals can differ in size between replicates, I'm thinking I might miss some regions under 20x that are partially overlapped by a region in the replicate that is over 20x. I'll have to test this approach out.

ADD REPLYlink written 3.2 years ago by Robert Sicko630

Okay, well, I think that bedtools merge has quite a few command-line parameters that should help you to get what you need. If not, you could use an awk script. I have a very complex awk script that does something along the same lines of logic (i.e., does one bp co-ordinate fall within a BED region, and also then checks the end BP), but it could take a while to adapt to your situation.

In fact, here is that script: separating bed file into separate files for every 100000 base pairs

ADD REPLYlink written 3.2 years ago by Kevin Blighe69k
gravatar for Robert Sicko
3.1 years ago by
Robert Sicko630
United States
Robert Sicko630 wrote:

Thanks Kevin for pointing me in right direction. I was able to accomplish what I need doing exactly what you say.

For anyone who may run across a similar problem. I used bedtools unionbedg to output a combined per-base file with depth of each replicate in columns 4 and 5. Then I could awk the regions under 20x.

bedtools unionbedg -i rep1.per-base.bed.gz rep2.per-base.bed.gz | \
awk -F $'\t' 'BEGIN {OFS = FS}{if($4 >=$5) {print $1,$2,$3,$4} else if($5>$4) {print $1,$2,$3,$5}}' - > sample.per-base_collapsed_repl.bed

I then combined all samples from a run with

awk -F $'\t' '{if($10 <20) {print FILENAME"\t"$0}}' /coverage/*_gene_final_roi.txt > /run1/all_samples_under_20x.txt

Finally, I used the python script I mentioned in my question to generate tables.

ADD COMMENTlink written 3.1 years ago by Robert Sicko630

Great to see that you found your own solution, and thank for the acknowledgement. Best of luck.


ADD REPLYlink written 3.1 years ago by Kevin Blighe69k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 985 users visited in the last hour