Faster way to get GC content and Coverage from .fasta
4
0
Entering edit mode
3.9 years ago
Shred ▴ 460

Guys, I've got a .fasta file , with the usual scheme

>NODE XX _ lenght XX_cov.xx.xx
ACGACGACGACACTCTACAGATACGACAT
>NODE YY_lenght YY_cov.YY.YY
CGACGACATCAGCATACAGATACTA

I wrote a small script to get the GC content of each contig sequence, and so on, to convert the "cov." value expressed into the nucleotidic coverage. Here's the code

    #! /bin/bash

## 
## This script needs to do few things: 
##
## First of all, conversion of cov with nucleotidic one

# Parameters options

display_help() {
    echo "Usage: $0 [file.fasta] [option..]" >&2
    echo
    echo "  -k, --kmer  kmer last value"
    echo "  -l, --lenght    read lenght"
    echo 
    echo "  example: $0 file.fasta -k 77 -l 150"
    echo 
    exit 1
}



case "$1" in 
    -h | --help )
        display_help
        exit 0
        ;;
esac 

case "$2" in
    -k | --kmer )
        KMER=($3)
        ;;
esac 

case "$4" in            
    -l | --lenght)
        READLENGHT=($5)
        ;;
esac 



# Check if the format is correct
if [[ $1 != *.fasta ]]; then
    echo "The input file $1 it's not in fasta format!"
    echo ""    
    echo "Usage: ./nomescript.sh NO_HIT.fasta"
  exit 1
fi



# calculate the coefficient used to convert coverage. Final is Y 
# calculate GC content of each sequence 

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
    echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
    echo "scale=4;${odd##*_} / $Y" | bc 
    read even
    echo "${even##}" &&
    ACOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "A")  
    GCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "G")
    CCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "C")
    TCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "T")
    TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
    GCCONT=$(($GCOUNT+$CCOUNT))
    printf "GC_CONT: " 
    echo "scale=2;$GCCONT / $TOTALBASES *100" | bc  
done < "$1"

It actually does the work, stamping a file like:

>NODE 
nucleotide_cov:
ACGACGAATCATACGACATAA
GC_CONT: 
>NODE
nucleotide_cov
GCAGCATACATCAGATACGATAC
GC_CONT

My problem is that is extremely slow when it works against .fasta > 200Mb.

I'm looking for a way to increase its speed, or doing the same work with other kind of script. Maybe.. python?

sequencing next-gen genome bash fasta • 9.8k views
ADD COMMENT
3
Entering edit mode
3.9 years ago
Shred ▴ 460

As a lesson to everyone, I've found the solution by using just awk, instead of creating a while read loop. Here's the work-around:

awk -F '_' -v Y="$Y" '{ if(NR%2==1) {
    printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n",$1,$2,$3,$4,$5, ($6 / Y)
} else {
    x=gsub(/[AT]/,""); 
    z=gsub(/[GC]/,""); 
    printf "GC_CONT : %.2f%%\n", (z*100)/(x+z)
    }
 }' large_file

Explanation: 1) " NR%2==1 " is used to use just odd lines. By using underscore as a field separator, it's easy to call each part of the line, by using $1, $2 , etc.. "%s" is used, as expected, to keep parts of interest into each odd line. 2) "x" and "z" are variables defined into the awk subshell: they define the number of match of both the letters inside parenthesis. I think this is the faster way (absolutely, also in comparison with other language) to count the GC content and the AT content of a sequence.

Credits to @amisax, from unix.stackexchange.com

ADD COMMENT
0
Entering edit mode

I tried running the above command, but I am getting this error

"awk: cmd. line:2: (FILENAME=10s.fasta FNR=1) fatal: division by zero attempted"

ADD REPLY
0
Entering edit mode

Probably because the fasta file used by the author have a very sepcific formatting of the tilte line.

ADD REPLY
2
Entering edit mode
3.9 years ago

output (with seqkit):

$ seqkit fx2tab  --name  --gc test.fa 
NODE XX _ lenght XX_cov.xx.xx           48.28
NODE YY_lenght YY_cov.YY.YY         44.00

output with infoseq from emboss: (there is a space in fasta header. Hence emboss program is output is different. If there are no spaces, then name would print entire header). If your sequence has Ns, emboss infoseq is not useful in calculating %GC.

$ infoseq test.fa  -only -desc -name -length -pgc
Display basic information about sequences
Name           Length %GC    Description 
NODE           29     48.28  XX _ lenght XX_cov.xx.xx
NODE           25     44.00  YY_lenght YY_cov.YY.YY

input:

$ cat test.fa 
>NODE XX _ lenght XX_cov.xx.xx
ACGACGACGACACTCTACAGATACGACAT
>NODE YY_lenght YY_cov.YY.YY
CGACGACATCAGCATACAGATACTA
ADD COMMENT
0
Entering edit mode

Man, please explain me the reason why you wrote this answer, except to promote your sample code.

ADD REPLY
1
Entering edit mode

When you ask a question it can have multiple answers and all of them can be correct. There is no one (or only) way of doing things in bioinformatics. This just happens to be another option for someone else looking for a similar need in future.

ADD REPLY
1
Entering edit mode
3.9 years ago

I can suggest you this script from one of my colleagues ;) https://github.com/oXis/gtool

ADD COMMENT
1
Entering edit mode
13 months ago

Hello,

Based on Shared's answer, using awk you can do:

awk '!/^>/{gc+=gsub(/[gGcC]/,""); at+=gsub(/[aAtT]/,"");} END{ printf "%.2f%%\n", (gc*100)/(gc+at) }' file.fasta

I recommend checking out biobash github there is a script that does this. It is a tool set for common bioinformatics tasks we are working on, and it is written in bash.

ADD COMMENT
1
Entering edit mode

Now that this has been revisited, here's a perl one-liner that would print the overall GC content too:

perl -lane 'unless (/^>/) { $l += length(); $gc++ while /[GC]/ig } END { print $gc/$l}' file.fasta

If GC content by contig is required, then a little bit of code is needed to print GC content after reading each one:

perl -lane 'if (/^>(.+)/) {
 printf "%s %d %.2f\n", $n, $l, $gc/$l if $l;
 $n = $1; $l = 0; $gc = 0;
} else {
 $l += length($_);
 $gc++ while /[GC]/ig;
} END {
 printf "%s %d %.2f\n", $n, $l, $gc/$l if $l;
}' file.fasta
ADD REPLY
0
Entering edit mode

Thanks for writing something sensible.

ADD REPLY

Login before adding your answer.

Traffic: 2247 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6