Question: Generating genome regions with matched length and GC-content
0
gravatar for loveleaves102
16 days ago by
loveleaves1020 wrote:

Hi! I'm trying to generate regions of mm9 with matched length and GC-content to my existing file. An existing file would look like this:

Chromosome     Start pos      Stop pos     Name       GC content
 chr10   117037832       117039194       peak_850        0.447137
 chr10   117042940       117045812       peak_190        0.508008
 chr10   117056118       117056783       peak_419        0.526316
 chr10   117060829       117061104       peak_1627       0.414545
 chr10   117061666       117062300       peak_617        0.441640

I'm very new to Bash and HPC. This is what I have written so far:

peakFile="directory/existing.file"

module load bedtools2

while read line;
do
        echo $line | cut -d ' ' -f1-4 > temp1.bed
        sed -i 's/ /\t/g' temp1.bed       # to make temp1.bed become a standard BED file
        gc=`echo $line | cut -d ' ' -f5`
        gc1=`printf "%.2g" $gc`         # round up GC content to 2 decimal places
        count="0"

        while [ $count -lt 3 ]        # generate 3 regions for each existing regions
        do
                bedtools shuffle -i temp1.bed -g mm9.size-file -excl excluding-regions-file -noOverlapping > temp2.bed
                bedtools nuc -fi mm9-genome.fa -bed temp2.bed > temp3.bed
                tail -n +2 temp3.bed > temp4.bed         # an ugly way to skip the column names from output of bedtools nuc
                num=$(awk -F '\t' '{print $6}' temp4.bed)
                num1=`printf "%.2g" $num`                   # get the new region's GC content and round up to 2 decimal places
                if (( $(echo "$num1 == $gc1" | bc -l) ))
                then
                        cat temp4.bed >> background.bed
                        count=$((count+1))
                fi
        done
        rm temp*
done < $peakFile

Some of my output look like this (output of bedtools nuc):

chr17   44310586        44311064        peak_1098       0.533473        0.466527        140     120     103     115     0       0       478
chr5    26875969        26876447        peak_1098       0.531381        0.468619        138     103     121     116     0       0       478
chr9    13818501        13818979        peak_1098       0.533473        0.466527        128     95      128     127     0       0       478
chr13   95085276        95087304        peak_1953       0.584813        0.415187        571     384     458     615     0       0       2028
chr4    100988825       100990853       peak_1953       0.579389        0.420611        538     402     451     637     0       0       2028
chr3    99295326        99297354        peak_1953       0.583333        0.416667        576     421     424     607     0       0       2028

My code is very slow now, taking 30 hrs to process 200+ regions. Can you help suggesting me some way to improve the running time? Thank you so much!

genome • 123 views
ADD COMMENTlink written 16 days ago by loveleaves1020
1

I think the bottleneck here is the GC content comparison. The odds that GC content matches to 2 decimal places is pretty low. Can you try awk to pick matching GC content instead of doing an if line by line? This script is forcing bash to work like a programming language, which it's not supposed to do.

The reason you're using bash seems to be because bedtools is involved. I'd recommend switching to python or R (or perl, which would be really good here) for the programming part and using system calls for the bedtools operations.

ADD REPLYlink written 16 days ago by RamRS22k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2014 users visited in the last hour