How to generate sequence length distribution from Fasta file
2
2
Entering edit mode
2.8 years ago
2822462298 ▴ 110

I got a fasta file of miRNAs with sequence length ranging from 17-32. How can I generate a table of length distribution like:

length      count
17            x
18            y
19            z


Also, any idea if I use fastq files? Do I need to convert fastq to fasta first?

RNA-Seq sRNA-seq fatsa • 4.1k views
3
Entering edit mode
2.8 years ago
Fatima ▴ 1000

You can use a script like this to find the lengths

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'


https://www.danielecook.com/generate-fasta-sequence-lengths/

And this script should give you length and count

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,10) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | awk '{print$2}' | sort | uniq -c | awk '{print $1,$2}'


I changed the substr($0,2,100) to substr($0,2,10) because in my case my headers had tabs and that would mess up things, so I only got characters 2-10 of the header (character 1 was > ). You can play with this based on your headers or even use $1 instead of substr($0,2,100) or even get rid of it (see the next command)

Simpler version:

cat file.fa | awk '$0 ~ ">" {print c;}$0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'

0
Entering edit mode

Thanks Fatima, this works well for me!

0
Entering edit mode

Wandered here from Google, a revision for others who land here: The 'Simpler version' has a bug, it needs to reset c=0 at each sequence header. If you do not add this, your code will print cumulative lengths rather than the length of each read.

cat file.fa | awk '$0 ~ ">" {print c;c=0}$0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'

1
Entering edit mode
2.8 years ago
Strand NGS ▴ 40

If you are comfortable in R, you can use the following code. Note: You will have to install the package seqinr before trying this

library(seqinr)
table(getLength(f))

1
Entering edit mode

Thank you! This is useful but it really took a long time to read the fasta file in r

Traffic: 2678 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.