Question

Generating genome regions with matched length and GC-content

0

Entering edit mode

5.2 years ago

loveleaves102 • 0

Hi! I'm trying to generate regions of mm9 with matched length and GC-content to my existing file. An existing file would look like this:

Chromosome     Start pos      Stop pos     Name       GC content
 chr10   117037832       117039194       peak_850        0.447137
 chr10   117042940       117045812       peak_190        0.508008
 chr10   117056118       117056783       peak_419        0.526316
 chr10   117060829       117061104       peak_1627       0.414545
 chr10   117061666       117062300       peak_617        0.441640

I'm very new to Bash and HPC. This is what I have written so far:

peakFile="directory/existing.file"

module load bedtools2

while read line;
do
        echo $line | cut -d ' ' -f1-4 > temp1.bed
        sed -i 's/ /\t/g' temp1.bed       # to make temp1.bed become a standard BED file
        gc=`echo $line | cut -d ' ' -f5`
        gc1=`printf "%.2g" $gc`         # round up GC content to 2 decimal places
        count="0"

        while [ $count -lt 3 ]        # generate 3 regions for each existing regions
        do
                bedtools shuffle -i temp1.bed -g mm9.size-file -excl excluding-regions-file -noOverlapping > temp2.bed
                bedtools nuc -fi mm9-genome.fa -bed temp2.bed > temp3.bed
                tail -n +2 temp3.bed > temp4.bed         # an ugly way to skip the column names from output of bedtools nuc
                num=$(awk -F '\t' '{print $6}' temp4.bed)
                num1=`printf "%.2g" $num`                   # get the new region's GC content and round up to 2 decimal places
                if (( $(echo "$num1 == $gc1" | bc -l) ))
                then
                        cat temp4.bed >> background.bed
                        count=$((count+1))
                fi
        done
        rm temp*
done < $peakFile

Some of my output look like this (output of bedtools nuc):

chr17   44310586        44311064        peak_1098       0.533473        0.466527        140     120     103     115     0       0       478
chr5    26875969        26876447        peak_1098       0.531381        0.468619        138     103     121     116     0       0       478
chr9    13818501        13818979        peak_1098       0.533473        0.466527        128     95      128     127     0       0       478
chr13   95085276        95087304        peak_1953       0.584813        0.415187        571     384     458     615     0       0       2028
chr4    100988825       100990853       peak_1953       0.579389        0.420611        538     402     451     637     0       0       2028
chr3    99295326        99297354        peak_1953       0.583333        0.416667        576     421     424     607     0       0       2028

My code is very slow now, taking 30 hrs to process 200+ regions. Can you help suggesting me some way to improve the running time? Thank you so much!

genome • 1.1k views

ADD COMMENT • link 5.2 years ago by loveleaves102 • 0

1

Entering edit mode

I think the bottleneck here is the GC content comparison. The odds that GC content matches to 2 decimal places is pretty low. Can you try awk to pick matching GC content instead of doing an if line by line? This script is forcing bash to work like a programming language, which it's not supposed to do.

The reason you're using bash seems to be because bedtools is involved. I'd recommend switching to python or R (or perl, which would be really good here) for the programming part and using system calls for the bedtools operations.

ADD REPLY • link 5.2 years ago by Ram 44k