Generating genome regions with matched length and GC-content
0
0
Entering edit mode
5.2 years ago

Hi! I'm trying to generate regions of mm9 with matched length and GC-content to my existing file. An existing file would look like this:

Chromosome     Start pos      Stop pos     Name       GC content
 chr10   117037832       117039194       peak_850        0.447137
 chr10   117042940       117045812       peak_190        0.508008
 chr10   117056118       117056783       peak_419        0.526316
 chr10   117060829       117061104       peak_1627       0.414545
 chr10   117061666       117062300       peak_617        0.441640

I'm very new to Bash and HPC. This is what I have written so far:

peakFile="directory/existing.file"

module load bedtools2

while read line;
do
        echo $line | cut -d ' ' -f1-4 > temp1.bed
        sed -i 's/ /\t/g' temp1.bed       # to make temp1.bed become a standard BED file
        gc=`echo $line | cut -d ' ' -f5`
        gc1=`printf "%.2g" $gc`         # round up GC content to 2 decimal places
        count="0"

        while [ $count -lt 3 ]        # generate 3 regions for each existing regions
        do
                bedtools shuffle -i temp1.bed -g mm9.size-file -excl excluding-regions-file -noOverlapping > temp2.bed
                bedtools nuc -fi mm9-genome.fa -bed temp2.bed > temp3.bed
                tail -n +2 temp3.bed > temp4.bed         # an ugly way to skip the column names from output of bedtools nuc
                num=$(awk -F '\t' '{print $6}' temp4.bed)
                num1=`printf "%.2g" $num`                   # get the new region's GC content and round up to 2 decimal places
                if (( $(echo "$num1 == $gc1" | bc -l) ))
                then
                        cat temp4.bed >> background.bed
                        count=$((count+1))
                fi
        done
        rm temp*
done < $peakFile

Some of my output look like this (output of bedtools nuc):

chr17   44310586        44311064        peak_1098       0.533473        0.466527        140     120     103     115     0       0       478
chr5    26875969        26876447        peak_1098       0.531381        0.468619        138     103     121     116     0       0       478
chr9    13818501        13818979        peak_1098       0.533473        0.466527        128     95      128     127     0       0       478
chr13   95085276        95087304        peak_1953       0.584813        0.415187        571     384     458     615     0       0       2028
chr4    100988825       100990853       peak_1953       0.579389        0.420611        538     402     451     637     0       0       2028
chr3    99295326        99297354        peak_1953       0.583333        0.416667        576     421     424     607     0       0       2028

My code is very slow now, taking 30 hrs to process 200+ regions. Can you help suggesting me some way to improve the running time? Thank you so much!

genome • 1.1k views
ADD COMMENT
1
Entering edit mode

I think the bottleneck here is the GC content comparison. The odds that GC content matches to 2 decimal places is pretty low. Can you try awk to pick matching GC content instead of doing an if line by line? This script is forcing bash to work like a programming language, which it's not supposed to do.

The reason you're using bash seems to be because bedtools is involved. I'd recommend switching to python or R (or perl, which would be really good here) for the programming part and using system calls for the bedtools operations.

ADD REPLY

Login before adding your answer.

Traffic: 1012 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6