Question: Percentage of bases in a fasta sequence
0
gravatar for banerjeeshayantan
12 days ago by
banerjeeshayantan50 wrote:

I have the hg38 reference sequence. I wanted to calculate the percentage of each of the bases in the entire genome. So how do I find out the %A,%G,%C,%T in the hg38.fa file? Is this information already available? If not, can someone please guide me through the steps required to get this information?
Thanks!

sequencing • 202 views
ADD COMMENTlink modified 12 days ago by ATpoint5.6k • written 12 days ago by banerjeeshayantan50
2

It is a reference sequence and once you get the numbers for each base, you can calculate the frequency easy. Refer to a fast solution in SU forum (copy/pasted here):

echo 'int cache[256],x,y;char buf[4096],letters[]="tacgn-"; int main(){while((x=read(0,buf,sizeof buf))>0)for(y=0;y<x;y++)cache[(unsigned char)buf[y]]++;for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -w -xc -; ./a.out < file; rm a.out;

replace file with hg38.fa. Up vote OP. Unfortunately it is not case insensitive.

Example run time on an i5-6200 with 8 gb ram:

time (echo 'int cache[256],x,y;char buf[4096],letters[]="ATGCNatgcn-"; int main(){while((x=read(0,buf,sizeof buf))>0)for(y=0;y<x;y++)cache[(unsigned char)buf[y]]++;for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -w -xc -; ./a.out <  reference/hg38/hg38.fa; rm a.out;)

A: 434444996
T: 435086702
G: 295683846
C: 295469343
N: 159967178
a: 463840726
t: 465881444
g: 330651380
c: 328258454
n: 3313
-: 0

real    0m8.474s
user    0m7.855s
sys 0m0.614s
ADD REPLYlink modified 12 days ago • written 12 days ago by cpad01127.7k

Thanks for your reply. So if I want to calculate the %A, then should I include "#a" in the calculation too? Or just stick to the upper case A. (Same goes for other nucleotides)

ADD REPLYlink written 8 days ago by banerjeeshayantan50
1

All. They might be soft-masked as repetitive region nucleotides, but are still part of the genome.

ADD REPLYlink written 8 days ago by ATpoint5.6k

Definitely not very fast and is ram intensive but this is what I use grep -v ">" input.fa |grep -o . |sort |uniq -c

Gives counts which can then be converted to percentages

ADD REPLYlink modified 12 days ago • written 12 days ago by microfuge920
4
gravatar for ATpoint
12 days ago by
ATpoint5.6k
Germany
ATpoint5.6k wrote:

Super-easy with seqtk:

## Get seqtk from conda or from Git:
conda install -c bioconda seqtk

## This will give you the output (see below) for each chromosome in hg38.fa
seqtk comp hg38.fa

Output format: chr, length, #A, #C, #G, #T, #2, #3, #4, #CpG, #tv, #ts, #CpG-ts

## Or for the entire fasta:
seqtk comp hg38.fa | awk 'OFS="\t" {sumA+=$3; sumC+=$4; sumG+=$5; sumT+=$6} END {print "A:"sumA,"C:"sumC,"G:"sumG,"T:"sumT}'

If you want fractions, just modify the snippet to divide by the sum of the four columns or the entire number of characters in hg38.fa.

By the way, the numbers for hg38, including the EBV decoy, including random and U contigs, excluding any ALT sequences are:

A:866386023 C:598632365 G:600803779 T:868882461

ADD COMMENTlink modified 12 days ago • written 12 days ago by ATpoint5.6k
0
gravatar for genomax
12 days ago by
genomax51k
United States
genomax51k wrote:

Previous threads: Percentage of bases in sequence in R
BBTools option: stats.sh in=sequences.fasta gc=gc.txt

ADD COMMENTlink written 12 days ago by genomax51k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1267 users visited in the last hour