Question

Finding Gaps in Targeted NGS Coverage Incorporating Replicates

0

Entering edit mode

6.4 years ago

Robert Sicko ▴ 630

I'm trying to identify regions of a targeted NGS panel that lack 20x coverage. I've figured out how to get my desired output (described below) per sequencing run. I'm running into trouble figuring out how to use sample replicates and only output regions that are missing in both replicates.

Example: I have run1, run2 and each run has sample1, sample2, sample3 I want to output regions for sample1 that are under 20x in both run1 and run2

My process for a single sequencing run that is working:

I used the LRG as described in my other question to create a ROI bed file with annotations. I used mosdepth to get per-base output per sample, then bedtools to intersect my ROI.bed to get a file that has columns

Chr Start End Region Size Strand Chr2 Start2 End2 Coverage Overlap

Where Chr Start End are from my ROI.bed and Chr2 Start2 End2 are from the per-base.bed from mosdepth. I create a file that combines all samples and all regions under 20x using awk.

awk -F $'\t' '{if($10 <20) {print FILENAME"\t"$0}}' /coverage/*_gene_final_roi.txt > /run1/all_samples_under_20x.txt

Which has columns

Sample Chr Start End Region Size Strand Chr2 Start2 End2 Coverage Overlap

Finally, I use that file and pandas to generate the tables I need (count of regions, sum of bases and recurrently missed regions) - my script is on my SO answer if it's helpful.

To handle the replicates I tried:

file1=/run1/all_samples_under_20x.txt 

file2=/run2/all_samples_under_20x.txt

file_collapsed=/coverage/20x_5bp_collapsed_replicates_r1_r2.txt

cat $file1 $file2 | awk -v OFS="\t" 'NR==1{print $0,"NEW";next}{print $0,$1"_"$5}' | sort -k13,13 | uniq -D | cut --complement -f 13 > $file_collapsed

python mosdepth_count.py --input /coverage/20x_5bp_collapsed_replicates_r1_r2.txt --output /coverage/Illumina/Illumina_r1_r2_20x_5bp

My thought was create a new field combining sample ($1) and region ($5), sort based on it and use uniq -D to keep only duplicate values (regions by sample present in file1 and file2). However, I'm missing regions that are definitely under 20x in both replicates.

I've searched SO for pandas combine functions, but the per-base output can vary in depth across a single ROI, leading to multiple rows in /run1/all_samples_under_20x.txt where fields $1 and $5 are duplicated, which screws things up when I try to use the $1 $5 combination as a key in combining across replicates.

pandas awk ngs python bedtools • 1.8k views

ADD COMMENT • link 6.4 years ago by Robert Sicko ▴ 630

1

Entering edit mode

I'm not sure of the logic of the approach - apologies. If I am correct in assuming what you want to do, I would:

output per-base read-depth of each replicate separately
find common bases < 20 read-depth
overlap these common bases with ROI.bed

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

thanks, I figured I was over complicating things. Comparing the per-base output to find the common regions under 20x does seem like a better approach. I'll have to guess and check on finding the regions in common in the per-base output that are under 20x. My initial thought is combine the replicate per-base files, sort, then bedtools merge -i combined.bed -c 4 -o max However, since the intervals can differ in size between replicates, I'm thinking I might miss some regions under 20x that are partially overlapped by a region in the replicate that is over 20x. I'll have to test this approach out.

ADD REPLY • link 6.4 years ago by Robert Sicko ▴ 630

0

Entering edit mode

Okay, well, I think that bedtools merge has quite a few command-line parameters that should help you to get what you need. If not, you could use an awk script. I have a very complex awk script that does something along the same lines of logic (i.e., does one bp co-ordinate fall within a BED region, and also then checks the end BP), but it could take a while to adapt to your situation.

In fact, here is that script: separating bed file into separate files for every 100000 base pairs

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

score 1 · Accepted Answer · 2017-11-28

Thanks Kevin for pointing me in right direction. I was able to accomplish what I need doing exactly what you say.

For anyone who may run across a similar problem. I used bedtools unionbedg to output a combined per-base file with depth of each replicate in columns 4 and 5. Then I could awk the regions under 20x.

bedtools unionbedg -i rep1.per-base.bed.gz rep2.per-base.bed.gz | \
awk -F $'\t' 'BEGIN {OFS = FS}{if($4 >=$5) {print $1,$2,$3,$4} else if($5>$4) {print $1,$2,$3,$5}}' - > sample.per-base_collapsed_repl.bed

I then combined all samples from a run with

awk -F $'\t' '{if($10 <20) {print FILENAME"\t"$0}}' /coverage/*_gene_final_roi.txt > /run1/all_samples_under_20x.txt

Finally, I used the python script I mentioned in my question to generate tables.