Question: Faster way to get GC content and Coverage from .fasta
0
gravatar for danilo.tatoni
3 months ago by
danilo.tatoni60 wrote:

Guys, I've got a .fasta file , with the usual scheme

>NODE XX _ lenght XX_cov.xx.xx
ACGACGACGACACTCTACAGATACGACAT
>NODE YY_lenght YY_cov.YY.YY
CGACGACATCAGCATACAGATACTA

I wrote a small script to get the GC content of each contig sequence, and so on, to convert the "cov." value expressed into the nucleotidic coverage. Here's the code

    #! /bin/bash

## 
## This script needs to do few things: 
##
## First of all, conversion of cov with nucleotidic one

# Parameters options

display_help() {
    echo "Usage: $0 [file.fasta] [option..]" >&2
    echo
    echo "  -k, --kmer  kmer last value"
    echo "  -l, --lenght    read lenght"
    echo 
    echo "  example: $0 file.fasta -k 77 -l 150"
    echo 
    exit 1
}



case "$1" in 
    -h | --help )
        display_help
        exit 0
        ;;
esac 

case "$2" in
    -k | --kmer )
        KMER=($3)
        ;;
esac 

case "$4" in            
    -l | --lenght)
        READLENGHT=($5)
        ;;
esac 



# Check if the format is correct
if [[ $1 != *.fasta ]]; then
    echo "The input file $1 it's not in fasta format!"
    echo ""    
    echo "Usage: ./nomescript.sh NO_HIT.fasta"
  exit 1
fi



# calculate the coefficient used to convert coverage. Final is Y 
# calculate GC content of each sequence 

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
    echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
    echo "scale=4;${odd##*_} / $Y" | bc 
    read even
    echo "${even##}" &&
    ACOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "A")  
    GCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "G")
    CCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "C")
    TCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "T")
    TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
    GCCONT=$(($GCOUNT+$CCOUNT))
    printf "GC_CONT: " 
    echo "scale=2;$GCCONT / $TOTALBASES *100" | bc  
done < "$1"

It actually does the work, stamping a file like:

>NODE 
nucleotide_cov:
ACGACGAATCATACGACATAA
GC_CONT: 
>NODE
nucleotide_cov
GCAGCATACATCAGATACGATAC
GC_CONT

My problem is that is extremely slow when it works against .fasta > 200Mb.

I'm looking for a way to increase its speed, or doing the same work with other kind of script. Maybe.. python?

ADD COMMENTlink modified 3 months ago by cpad01126.3k • written 3 months ago by danilo.tatoni60
1
gravatar for danilo.tatoni
3 months ago by
danilo.tatoni60 wrote:

As a lesson to everyone, I've found the solution by using just awk, instead of creating a while read loop. Here's the work-around:

awk -F '_' -v Y="$Y" '{ if(NR%2==1) {
    printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n",$1,$2,$3,$4,$5, ($6 / Y)
} else {
    x=gsub(/[AT]/,""); 
    z=gsub(/[GC]/,""); 
    printf "GC_CONT : %.2f%%\n", (z*100)/(x+z)
    }
 }' large_file

Explanation: 1) " NR%2==1 " is used to use just odd lines. By using underscore as a field separator, it's easy to call each part of the line, by using $1, $2 , etc.. "%s" is used, as expected, to keep parts of interest into each odd line. 2) "x" and "z" are variables defined into the awk subshell: they define the number of match of both the letters inside parenthesis. I think this is the faster way (absolutely, also in comparison with other language) to count the GC content and the AT content of a sequence.

Credits to @amisax, from unix.stackexchange.com

ADD COMMENTlink written 3 months ago by danilo.tatoni60
1
gravatar for Selenocysteine
3 months ago by
Dublin, Ireland
Selenocysteine460 wrote:

I can suggest you this script from one of my colleagues ;) https://github.com/oXis/gtool

ADD COMMENTlink written 3 months ago by Selenocysteine460
0
gravatar for cpad0112
3 months ago by
cpad01126.3k
cpad01126.3k wrote:

output (with seqkit):

$ seqkit fx2tab  --name  --gc test.fa 
NODE XX _ lenght XX_cov.xx.xx           48.28
NODE YY_lenght YY_cov.YY.YY         44.00

output with infoseq from emboss: (there is a space in fasta header. Hence emboss program is output is different. If there are no spaces, then name would print entire header). If your sequence has Ns, emboss infoseq is not useful in calculating %GC.

$ infoseq test.fa  -only -desc -name -length -pgc
Display basic information about sequences
Name           Length %GC    Description 
NODE           29     48.28  XX _ lenght XX_cov.xx.xx
NODE           25     44.00  YY_lenght YY_cov.YY.YY

input:

$ cat test.fa 
>NODE XX _ lenght XX_cov.xx.xx
ACGACGACGACACTCTACAGATACGACAT
>NODE YY_lenght YY_cov.YY.YY
CGACGACATCAGCATACAGATACTA
ADD COMMENTlink modified 3 months ago • written 3 months ago by cpad01126.3k

Man, please explain me the reason why you wrote this answer, except to promote your sample code.

ADD REPLYlink written 3 months ago by danilo.tatoni60

When you ask a question it can have multiple answers and all of them can be correct. There is no one (or only) way of doing things in bioinformatics. This just happens to be another option for someone else looking for a similar need in future.

ADD REPLYlink written 3 months ago by genomax49k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1679 users visited in the last hour