How to generate sequence length distribution from Fasta file
2
2
Entering edit mode
22 months ago
2822462298 ▴ 70

I got a fasta file of miRNAs with sequence length ranging from 17-32. How can I generate a table of length distribution like:

length      count
17            x
18            y
19            z


Also, any idea if I use fastq files? Do I need to convert fastq to fasta first?

RNA-Seq sRNA-seq fatsa • 2.4k views
2
Entering edit mode
22 months ago
Fatima ▴ 960

You can use a script like this to find the lengths

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'


https://www.danielecook.com/generate-fasta-sequence-lengths/

And this script should give you length and count

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,10) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | awk '{print$2}' | sort | uniq -c | awk '{print $1,$2}'


I changed the substr($0,2,100) to substr($0,2,10) because in my case my headers had tabs and that would mess up things, so I only got characters 2-10 of the header (character 1 was > ). You can play with this based on your headers or even use $1 instead of substr($0,2,100) or even get rid of it (see the next command)

Simpler version:

cat file.fa | awk '$0 ~ ">" {print c;}$0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'

0
Entering edit mode

Thanks Fatima, this works well for me!

1
Entering edit mode
22 months ago
Strand NGS ▴ 40

If you are comfortable in R, you can use the following code. Note: You will have to install the package seqinr before trying this

library(seqinr)
table(getLength(f))

1
Entering edit mode

Thank you! This is useful but it really took a long time to read the fasta file in r