How to get number of characters (nucleotides) in fasta file
4
0
Entering edit mode
2.8 years ago

Hi,

I have a bunch of fasta files from refseq and I want to know how many bases are on each one. I want to count all characters after the first line of each sequence. I have something among these lines:

for file in *.faa; do echo $file; grep -c ">" "$file" | wc -c; done; 

But this gets all characters after > including the headers of the sequences. I just want the number of nucleotides.

I have tried some other grep options but they don't seem to make it better, I would very much appreciate some help.

count fasta linux characters commands • 9.5k views
ADD COMMENT
1
Entering edit mode

Just use seqkit stats

seqkit stats *.faa

file                    format  type  num_seqs    sum_len  min_len  avg_len  max_len
tests/seqs4amplicon.fa  FASTA   DNA          2         33       16     16.5       17
tests/hairpin.fa        FASTA   RNA     28,645  2,949,871       39      103    2,354
tests/contigs.fa        FASTA   DNA          9         54        2        6       10
tests/hsa.fa            FASTA   DNA          4         40       10       10       10

For a large number of files:

seqkit stats --infile-list <(find seq-dir/ -name "*.faa") --tabular -j 12 -o stats.tsv
ADD REPLY
1
Entering edit mode
2.8 years ago

grep -c will do a counting of all the '>' occurrences in the file.

if you want to count but omit the header line from the counting do something like this:

for file in *.faa; do echo $file; sed '1d' "$file" | wc -c; done; 

the sed '1d' part will delete the first line of <file> and pass the rest of the file through to wc

if you want to use grep, you should use grep -v '>' (this will grep any line that does not match '>' )

You should also keep in mind that wc will count any character in the file (even the ones you can't see, such as line breaks etc), but that will give a good approximation of the total count.

ADD COMMENT
1
Entering edit mode
2.8 years ago

Assuming that sequences are in single line in each record:

for file in *.fa; do awk -v OFS="\t" '/^>/ {getline seq}; {gsub(/>/,"",$0); print FILENAME, $0, length(seq)}' $file; done

This would print: file name, fasta header (without >), sequence length (count)

ADD COMMENT
1
Entering edit mode
2.8 years ago
boczniak767 ▴ 850

Maybe grep -v '>' file.fa | sed -e 's/\(.\)/&\n/g' | wc -l

It removes the line with name, then translates any character to the same character followed by the newline character and finally counts lines.

It works ok only if you have only one sequence in the file.

ADD COMMENT
0
Entering edit mode
23 months ago
bio_driven • 0

If you really love grep there might be an easier-to-read solution - though it may have drawbacks.

grep -v ">" fasta_file.fa | grep -E -o "G|C|T|A|N" | wc -l

Assuming you have only standard FASTA/FASTQ sequence characters (i.e. only G, C, T, A, and N) then you can pipe two greps together. The first grep -v removes the header line (as suggested by Lieven and boczniak767) and then the subsequent grep -E -o uses multiple pattern matching (and catching multiple matches per line) and a pipe to wc -l to count the number of bases.

You could try putting the above in a loop. I haven't tested the performance relative to other approaches though and multiple comparisons will likely be slower than regex. Also, if the genome is quite large then there may possibly be some RAM issues but again - I haven't tested that. In any case purpose-built applications (like seqkit stats as suggested by shenwei356) would be the go with more memory-intense genomes.

I'm interested to hear if anyone sees any holes/issues with this approach, long-term learning is the best!

ADD COMMENT

Login before adding your answer.

Traffic: 1994 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6