Question: Sequence length from Fasta
3
gravatar for bongbang
3.7 years ago by
bongbang60
United States
bongbang60 wrote:

Is there a ready-made command-line program that can output sequence lengths from a fasta file? Like so:

YKR054C,4092
YLR106C,4910
...

I've seen examples of people trying to roll their own using awk, but I would rather not do that if I can avoid it. Between Emboss, Biopython, Bioperl, etc there must be something that can do this, right? What I'm looking for would be the equivalent of what blast_formatter can do for blast hits, something that can extract simple information that's already there in a format specified by user. Thank you.


 

fasta • 18k views
ADD COMMENTlink modified 3.7 years ago by hpmcwill1.1k • written 3.7 years ago by bongbang60
1

"but I would rather not do that if I can avoid it" why ?

ADD REPLYlink written 3.7 years ago by Pierre Lindenbaum110k
1

Because I looked at the syntax for awk and it seemed rather messy (it didn't help that the name reminds me of "awkward"). I thought I was going to save myself time, but now it appears that awk is probably the most efficient way. 
 

ADD REPLYlink written 3.7 years ago by bongbang60

see also: Code Golf: Mean Length Of Fasta Sequences

ADD REPLYlink written 3.7 years ago by Pierre Lindenbaum110k
11
gravatar for Ram
3.7 years ago by
Ram16k
Houston, TX
Ram16k wrote:

bioawk might be a good compromise between the simplicity of awk and readability of programming languages.

bioawk -c fastx '{ print $name, length($seq) }' <sequences.fa 

bioawk here: https://github.com/lh3/bioawk

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Ram16k
1

That's exactly what I was looking for. If you make that an answer, with the missing quotation mark added, I'll accept it. Thanks.
 

ADD REPLYlink written 3.7 years ago by bongbang60

It's done :)

ADD REPLYlink written 3.7 years ago by Ram16k
22
gravatar for Matt Shirley
3.7 years ago by
Matt Shirley8.4k
Cambridge, MA
Matt Shirley8.4k wrote:

Why not samtools faidx sample.fa and then cut -f1-2 sample.fa.fai? The fasta index file will give you sequence name and length in the first two columns, and you probably already have this file indexed. 

ADD COMMENTlink written 3.7 years ago by Matt Shirley8.4k
3
gravatar for Renesh
3.7 years ago by
Renesh1.3k
United States
Renesh1.3k wrote:

You can make your own code for this,

copy and paste following code in fasta_len.py file

import sys
from Bio import SeqIO

FastaFile = open(sys.argv[1], 'rU')

for rec in SeqIO.parse(FastaFile, 'fasta'):
    name = rec.id
    seq = rec.seq
    seqLen = len(rec)
    print name, seqLen

FastaFile.close()

make this file executable by,

chmod u+x fasta_len.py

and run as,

./fasta_len.py file.fasta

Note: you should have installed biopython

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Renesh1.3k

I was hoping to avoid something like that (which is really not too burdensome at all, I know). I now realize that sometimes writing your own script can actually take less time than looking for a canned solution. (Doesn't always work out that way, though.)  
 

ADD REPLYlink written 3.7 years ago by bongbang60
import sys
from Bio.SeqIO.FastaIO import SimpleFastaParser

FastaFile = open(sys.argv[1], 'rU')
LenFile = open('./data/lengths.csv', 'w')

for name, seq in SimpleFastaParser(FastaFile):
    seqLen = len(seq)
    LenFile.write(name + ',' + str(seqLen) + '\n')

FastaFile.close()
LenFile.close()

This is somewhat more efficient for large files http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:low-level-fasta-fastq

ADD REPLYlink modified 8 months ago • written 8 months ago by khimulya0

Works like a charm. Thank you!

ADD REPLYlink written 7 months ago by chrystine.yan0
1
gravatar for hpmcwill
3.7 years ago by
hpmcwill1.1k
United Kingdom
hpmcwill1.1k wrote:

Assuming you already have EMBOSS installed you could use 'infoseq' to get the format you desire. For example:

$ infoseq -auto -nocolumns -delimiter ',' -only -noheading -name -length sequences.fa 
2AAA_HUMAN,589
AAPK1_PIG,385
AAPK1_MOUSE,559
AAPK1_PONAB,554
AAPK1_HUMAN,559
AAPK1_RAT,559
...

Depending on the identifier format used in your source fasta file you may need to choose between the appropriate "-name", "-accession", "-gi" or "-seqversion" option for the identifier column, or use the  "-sformat pearson" option to maintain the original source identifier. For example to return the original identifiers when they are in "db|acc|id" format:

$ infoseq -auto -nocolumns -delimiter ',' -only -noheading -name -length -sformat pearson sequences.fa 
sp|P30153|2AAA_HUMAN,589
sp|Q09136|AAPK1_PIG,385
sp|Q5EG47|AAPK1_MOUSE,559
sp|Q5RDH5|AAPK1_PONAB,554
sp|Q13131|AAPK1_HUMAN,559
sp|P54645|AAPK1_RAT,559
...

 

ADD COMMENTlink written 3.7 years ago by hpmcwill1.1k
0
gravatar for onuralp
3.7 years ago by
onuralp180
Turkey
onuralp180 wrote:

Granted that this is a ridiculous way to do this, I thought it might be useful to demonstrate the algorithmic steps and how they are implemented using Perl. You can think about how to modify and improve various steps, and write your own scripts.

Make sure that you have Perl installed on your computer.

C:\Users\bongbang\Desktop\perl -v

Copy and paste the following code into a text file and save it as calculateLength.pl

#!/usr/bin/perl
use warnings;
use strict;

# Point to where your input fasta file is located
my $fasta_file = "/Users/bongbang/Desktop/my_sequences.fasta";

# Declare that you want to read this file, and print the error message
# in case that the input fasta file does not exist at the specified path
open(FASTA, "<", $fasta_file) or die("Probably wrong path: $fasta_file\n");

# We will linearize the sequences into this hash
my %singleLineSequences;

# Initialize a variable to store sequence ids
my $sequence_id;

# Read the fasta file line by line
while(<FASTA>){
    my $line = $_; chomp($line);

    # if this is a new sequence with the header at the beginnning
    # Extract sequence id using regular expression (\S+) and
    # store it into the variable $sequence_id

    if ($line =~ m/^>(\S+)/){

        $sequence_id = $1; # e.g., YKR054C            

        # Reserve an entry in your hash
        # $sequence_id = YKR054C
        # $singleLineSequences{'YKR054C'} = ""
        # No sequence has been added yet, hence the empty value
        $singleLineSequences{$sequence_id} = ""; 

        }

    # if the line is not a header but part of a sequence, 
    # append this part to the corresponding sequence id in
    # your hash entry

    else {

        # paste the current line to the end of the sequence
        $singleLineSequences{$sequence_id} = $singleLineSequences{$sequence_id} . $line;

        }
   }

# Now that your hash contains single line sequences, you can simply
# loop over each sequence in your hash, determine the sequence length,
# and print it out

foreach my $sequence_entry (keys %singleLineSequences){

    # grab a hash entry and store it into a variable
    my $currentSequence = $singleLineSequences{$sequence_entry};

    # determine length of the sequence
    my $lengthSequence = length($currentSequence);

    # print the result: id,length
    print $sequence_entry . "," . $lengthSequence . "\n";
}

Go into the directory including your script, and execute it as follows:

C:\Users\bongbang\Desktop\perl calculateLength.pl
ADD COMMENTlink modified 28 days ago by Ram16k • written 3.7 years ago by onuralp180

Why not use BioPerl to achieve both module reuse as well as algorithm elicitation?

ADD REPLYlink written 3.7 years ago by Ram16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1401 users visited in the last hour