Entering edit mode
5.2 years ago
loveleaves102
•
0
Hi! I'm trying to generate regions of mm9 with matched length and GC-content to my existing file. An existing file would look like this:
Chromosome Start pos Stop pos Name GC content
chr10 117037832 117039194 peak_850 0.447137
chr10 117042940 117045812 peak_190 0.508008
chr10 117056118 117056783 peak_419 0.526316
chr10 117060829 117061104 peak_1627 0.414545
chr10 117061666 117062300 peak_617 0.441640
I'm very new to Bash and HPC. This is what I have written so far:
peakFile="directory/existing.file"
module load bedtools2
while read line;
do
echo $line | cut -d ' ' -f1-4 > temp1.bed
sed -i 's/ /\t/g' temp1.bed # to make temp1.bed become a standard BED file
gc=`echo $line | cut -d ' ' -f5`
gc1=`printf "%.2g" $gc` # round up GC content to 2 decimal places
count="0"
while [ $count -lt 3 ] # generate 3 regions for each existing regions
do
bedtools shuffle -i temp1.bed -g mm9.size-file -excl excluding-regions-file -noOverlapping > temp2.bed
bedtools nuc -fi mm9-genome.fa -bed temp2.bed > temp3.bed
tail -n +2 temp3.bed > temp4.bed # an ugly way to skip the column names from output of bedtools nuc
num=$(awk -F '\t' '{print $6}' temp4.bed)
num1=`printf "%.2g" $num` # get the new region's GC content and round up to 2 decimal places
if (( $(echo "$num1 == $gc1" | bc -l) ))
then
cat temp4.bed >> background.bed
count=$((count+1))
fi
done
rm temp*
done < $peakFile
Some of my output look like this (output of bedtools nuc):
chr17 44310586 44311064 peak_1098 0.533473 0.466527 140 120 103 115 0 0 478
chr5 26875969 26876447 peak_1098 0.531381 0.468619 138 103 121 116 0 0 478
chr9 13818501 13818979 peak_1098 0.533473 0.466527 128 95 128 127 0 0 478
chr13 95085276 95087304 peak_1953 0.584813 0.415187 571 384 458 615 0 0 2028
chr4 100988825 100990853 peak_1953 0.579389 0.420611 538 402 451 637 0 0 2028
chr3 99295326 99297354 peak_1953 0.583333 0.416667 576 421 424 607 0 0 2028
My code is very slow now, taking 30 hrs to process 200+ regions. Can you help suggesting me some way to improve the running time? Thank you so much!
I think the bottleneck here is the GC content comparison. The odds that GC content matches to 2 decimal places is pretty low. Can you try
awk
to pick matching GC content instead of doing anif
line by line? This script is forcingbash
to work like a programming language, which it's not supposed to do.The reason you're using bash seems to be because
bedtools
is involved. I'd recommend switching to python or R (or perl, which would be really good here) for the programming part and using system calls for the bedtools operations.