Question

Sequence length from Fasta

5

Entering edit mode

9.5 years ago

bongbang ▴ 80

Is there a ready-made command-line program that can output sequence lengths from a fasta file? Like so:

YKR054C,4092
YLR106C,4910
...

I've seen examples of people trying to roll their own using awk, but I would rather not do that if I can avoid it. Between Emboss, Biopython, Bioperl, etc there must be something that can do this, right? What I'm looking for would be the equivalent of what blast_formatter can do for blast hits, something that can extract simple information that's already there in a format specified by user. Thank you.

fasta • 70k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 9.5 years ago by bongbang ▴ 80

1

Entering edit mode

"but I would rather not do that if I can avoid it" why ?

ADD REPLY • link 9.5 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Because I looked at the syntax for awk and it seemed rather messy (it didn't help that the name reminds me of "awkward"). I thought I was going to save myself time, but now it appears that awk is probably the most efficient way.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by bongbang ▴ 80

0

Entering edit mode

see also: Code Golf: Mean Length Of Fasta Sequences

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 161k

43

Entering edit mode

9.4 years ago

Matt Shirley 10k

Why not samtools faidx sample.fa and then cut -f1-2 sample.fa.fai? The fasta index file will give you sequence name and length in the first two columns, and you probably already have this file indexed.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 9.4 years ago by Matt Shirley 10k

9

Entering edit mode

5.5 years ago

mg ▴ 250

Using SeqKit's fx2tab:

seqkit fx2tab --length --name --header-line  foo.fasta

This prints sequence name and length with empty columns for sequences and qualities.

ADD COMMENT • link 5.5 years ago by mg ▴ 250

0

Entering edit mode

seqkit really is the best

ADD REPLY • link 16 months ago by O.rka ▴ 710

4

Entering edit mode

9.5 years ago

Renesh ★ 2.2k

You can make your own code for this,

Copy and paste following code in fasta_len.py file:

import sys
from Bio import SeqIO

FastaFile = open(sys.argv[1], 'rU')

for rec in SeqIO.parse(FastaFile, 'fasta'):
    name = rec.id
    seq = rec.seq
    seqLen = len(rec)
    print name, seqLen

FastaFile.close()

Make this file executable by:

chmod u+x fasta_len.py

and run as,

./fasta_len.py file.fasta

Note: you should have installed biopython

ADD COMMENT • link updated 16 months ago by Ram 43k • written 9.5 years ago by Renesh ★ 2.2k

1

Entering edit mode

import sys
from Bio.SeqIO.FastaIO import SimpleFastaParser

FastaFile = open(sys.argv[1], 'rU')
LenFile = open('./data/lengths.csv', 'w')

for name, seq in SimpleFastaParser(FastaFile):
    seqLen = len(seq)
    LenFile.write(name + ',' + str(seqLen) + '\n')

FastaFile.close()
LenFile.close()

This is somewhat more efficient for large files http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:low-level-fasta-fastq

ADD REPLY • link updated 16 months ago by Ram 43k • written 6.5 years ago by khimulya ▴ 10

0

Entering edit mode

I was hoping to avoid something like that (which is really not too burdensome at all, I know). I now realize that sometimes writing your own script can actually take less time than looking for a canned solution. (Doesn't always work out that way, though.)

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 9.4 years ago by bongbang ▴ 80

0

Entering edit mode

Works like a charm. Thank you!

ADD REPLY • link 6.4 years ago by chrystine.yan • 0

3

Entering edit mode

9.4 years ago

hpmcwill ★ 1.2k

Assuming you already have EMBOSS installed you could use 'infoseq' to get the format you desire. For example:

$ infoseq -auto -nocolumns -delimiter ',' -only -noheading -name -length sequences.fa
2AAA_HUMAN,589
AAPK1_PIG,385
AAPK1_MOUSE,559
AAPK1_PONAB,554
AAPK1_HUMAN,559
AAPK1_RAT,559
...

Depending on the identifier format used in your source fasta file you may need to choose between the appropriate -name, -accession, -gi or -seqversion option for the identifier column, or use the "-sformat pearson" option to maintain the original source identifier. For example to return the original identifiers when they are in db|acc|id format:

$ infoseq -auto -nocolumns -delimiter ',' -only -noheading -name -length -sformat pearson sequences.fa
sp|P30153|2AAA_HUMAN,589
sp|Q09136|AAPK1_PIG,385
sp|Q5EG47|AAPK1_MOUSE,559
sp|Q5RDH5|AAPK1_PONAB,554
sp|Q13131|AAPK1_HUMAN,559
sp|P54645|AAPK1_RAT,559
...

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 9.4 years ago by hpmcwill ★ 1.2k

0

Entering edit mode

9.4 years ago

onuralp ▴ 190

Granted that this is a ridiculous way to do this, I thought it might be useful to demonstrate the algorithmic steps and how they are implemented using Perl. You can think about how to modify and improve various steps, and write your own scripts.

Make sure that you have Perl installed on your computer.

C:\Users\bongbang\Desktop\perl -v

Copy and paste the following code into a text file and save it as calculateLength.pl

#!/usr/bin/perl
use warnings;
use strict;

# Point to where your input fasta file is located
my $fasta_file = "/Users/bongbang/Desktop/my_sequences.fasta";

# Declare that you want to read this file, and print the error message
# in case that the input fasta file does not exist at the specified path
open(FASTA, "<", $fasta_file) or die("Probably wrong path: $fasta_file\n");

# We will linearize the sequences into this hash
my %singleLineSequences;

# Initialize a variable to store sequence ids
my $sequence_id;

# Read the fasta file line by line
while(<FASTA>){
    my $line = $_; chomp($line);

    # if this is a new sequence with the header at the beginnning
    # Extract sequence id using regular expression (\S+) and
    # store it into the variable $sequence_id

    if ($line =~ m/^>(\S+)/){

        $sequence_id = $1; # e.g., YKR054C            

        # Reserve an entry in your hash
        # $sequence_id = YKR054C
        # $singleLineSequences{'YKR054C'} = ""
        # No sequence has been added yet, hence the empty value
        $singleLineSequences{$sequence_id} = ""; 

        }

    # if the line is not a header but part of a sequence, 
    # append this part to the corresponding sequence id in
    # your hash entry

    else {

        # paste the current line to the end of the sequence
        $singleLineSequences{$sequence_id} = $singleLineSequences{$sequence_id} . $line;

        }
   }

# Now that your hash contains single line sequences, you can simply
# loop over each sequence in your hash, determine the sequence length,
# and print it out

foreach my $sequence_entry (keys %singleLineSequences){

    # grab a hash entry and store it into a variable
    my $currentSequence = $singleLineSequences{$sequence_entry};

    # determine length of the sequence
    my $lengthSequence = length($currentSequence);

    # print the result: id,length
    print $sequence_entry . "," . $lengthSequence . "\n";
}

Go into the directory including your script, and execute it as follows:

C:\Users\bongbang\Desktop\perl calculateLength.pl

ADD COMMENT • link updated 5.8 years ago by Ram 43k • written 9.4 years ago by onuralp ▴ 190

0

Entering edit mode

Why not use BioPerl to achieve both module reuse as well as algorithm elicitation?

ADD REPLY • link 9.4 years ago by Ram 43k

0

Entering edit mode

Hi,

I aim to identify the number of alanines as a function of the sequence length. Is there a way to count the number of alanines in each sequence along with the sequence length? An output that looks something like:

sp|P0A9Z1|GLNB_ECOLI,112,A=10



sp|P62517|OPGH_ECOLI,847,A=23

sp|P76290|CMOA_ECOLI,247,A=41

Thanks in advance.

ADD REPLY • link 4.5 years ago by d.kaur • 0

0

Entering edit mode

What have you tried?

ADD REPLY • link 4.5 years ago by Ram 43k

Ram · Accepted Answer · 2014-11-09

25

Entering edit mode

9.4 years ago

Ram 43k

Heng Li's bioawk might be a good compromise between the simplicity of awk and readability of programming languages.

bioawk -c fastx '{ print $name, length($seq) }' < sequences.fa

ADD COMMENT • link 5.6 years ago by Ram 43k

1

Entering edit mode

That's exactly what I was looking for. If you make that an answer, with the missing quotation mark added, I'll accept it. Thanks.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 9.4 years ago by bongbang ▴ 80

0

Entering edit mode

It's done :)

ADD REPLY • link 9.4 years ago by Ram 43k