How to generate sequence length distribution from Fasta file
2
2
Entering edit mode
22 months ago
2822462298 ▴ 70

I got a fasta file of miRNAs with sequence length ranging from 17-32. How can I generate a table of length distribution like:

length      count
17            x
18            y
19            z

Also, any idea if I use fastq files? Do I need to convert fastq to fasta first?

RNA-Seq sRNA-seq fatsa • 2.4k views
ADD COMMENT
2
Entering edit mode
22 months ago
Fatima ▴ 960

You can use a script like this to find the lengths

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

https://www.danielecook.com/generate-fasta-sequence-lengths/

And this script should give you length and count

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,10) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | awk '{print $2}' | sort | uniq -c | awk '{print $1,$2}'

I changed the substr($0,2,100) to substr($0,2,10) because in my case my headers had tabs and that would mess up things, so I only got characters 2-10 of the header (character 1 was > ). You can play with this based on your headers or even use $1 instead of substr($0,2,100) or even get rid of it (see the next command)

Simpler version:

cat file.fa | awk '$0 ~ ">" {print c;} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'
ADD COMMENT
0
Entering edit mode

Thanks Fatima, this works well for me!

ADD REPLY
1
Entering edit mode
22 months ago
Strand NGS ▴ 40

If you are comfortable in R, you can use the following code. Note: You will have to install the package seqinr before trying this

library(seqinr)
f <- read.fasta("input.fa")
table(getLength(f))
ADD COMMENT
1
Entering edit mode

Thank you! This is useful but it really took a long time to read the fasta file in r

ADD REPLY

Login before adding your answer.

Traffic: 1369 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6