How to get number of characters (nucleotides) in fasta file
Entering edit mode
5 weeks ago


I have a bunch of fasta files from refseq and I want to know how many bases are on each one. I want to count all characters after the first line of each sequence. I have something among these lines:

for file in *.faa; do echo $file; grep -c ">" "$file" | wc -c; done; 

But this gets all characters after > including the headers of the sequences. I just want the number of nucleotides.

I have tried some other grep options but they don't seem to make it better, I would very much appreciate some help.

count fasta linux characters commands • 300 views
Entering edit mode

Just use seqkit stats

seqkit stats *.faa

file                    format  type  num_seqs    sum_len  min_len  avg_len  max_len
tests/seqs4amplicon.fa  FASTA   DNA          2         33       16     16.5       17
tests/hairpin.fa        FASTA   RNA     28,645  2,949,871       39      103    2,354
tests/contigs.fa        FASTA   DNA          9         54        2        6       10
tests/hsa.fa            FASTA   DNA          4         40       10       10       10

For a large number of files:

seqkit stats --infile-list <(find seq-dir/ -name "*.faa") --tabular -j 12 -o stats.tsv
Entering edit mode
5 weeks ago

grep -c will do a counting of all the '>' occurrences in the file.

if you want to count but omit the header line from the counting do something like this:

for file in *.faa; do echo $file; sed '1d' "$file" | wc -c; done; 

the sed '1d' part will delete the first line of <file> and pass the rest of the file through to wc

if you want to use grep, you should use grep -v '>' (this will grep any line that does not match '>' )

You should also keep in mind that wc will count any character in the file (even the ones you can't see, such as line breaks etc), but that will give a good approximation of the total count.

Entering edit mode
5 weeks ago

Assuming that sequences are in single line in each record:

for file in *.fa; do awk -v OFS="\t" '/^>/ {getline seq}; {gsub(/>/,"",$0); print FILENAME, $0, length(seq)}' $file; done

This would print: file name, fasta header (without >), sequence length (count)

Entering edit mode
5 weeks ago
boczniak767 ▴ 790

Maybe grep -v '>' file.fa | sed -e 's/\(.\)/&\n/g' | wc -l

It removes the line with name, then translates any character to the same character followed by the newline character and finally counts lines.

It works ok only if you have only one sequence in the file.


Login before adding your answer.

Traffic: 1603 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6