Question: Sequence length from Fasta
4
gravatar for bongbang
5.7 years ago by
bongbang70
United States
bongbang70 wrote:

Is there a ready-made command-line program that can output sequence lengths from a fasta file? Like so:

YKR054C,4092
YLR106C,4910
...

I've seen examples of people trying to roll their own using awk, but I would rather not do that if I can avoid it. Between Emboss, Biopython, Bioperl, etc there must be something that can do this, right? What I'm looking for would be the equivalent of what blast_formatter can do for blast hits, something that can extract simple information that's already there in a format specified by user. Thank you.


 

fasta • 33k views
ADD COMMENTlink modified 22 months ago by mg100 • written 5.7 years ago by bongbang70
1

"but I would rather not do that if I can avoid it" why ?

ADD REPLYlink written 5.7 years ago by Pierre Lindenbaum129k
1

Because I looked at the syntax for awk and it seemed rather messy (it didn't help that the name reminds me of "awkward"). I thought I was going to save myself time, but now it appears that awk is probably the most efficient way. 
 

ADD REPLYlink written 5.7 years ago by bongbang70

see also: Code Golf: Mean Length Of Fasta Sequences

ADD REPLYlink written 5.7 years ago by Pierre Lindenbaum129k
17
gravatar for RamRS
5.7 years ago by
RamRS28k
Houston, TX
RamRS28k wrote:

Heng Li's bioawk might be a good compromise between the simplicity of awk and readability of programming languages.

bioawk -c fastx '{ print $name, length($seq) }' < sequences.fa
ADD COMMENTlink modified 22 months ago • written 5.7 years ago by RamRS28k
1

That's exactly what I was looking for. If you make that an answer, with the missing quotation mark added, I'll accept it. Thanks.

ADD REPLYlink modified 10 months ago by RamRS28k • written 5.7 years ago by bongbang70

It's done :)

ADD REPLYlink written 5.7 years ago by RamRS28k
32
gravatar for Matt Shirley
5.7 years ago by
Matt Shirley9.4k
Cambridge, MA
Matt Shirley9.4k wrote:

Why not samtools faidx sample.fa and then cut -f1-2 sample.fa.fai? The fasta index file will give you sequence name and length in the first two columns, and you probably already have this file indexed.

ADD COMMENTlink modified 22 months ago by RamRS28k • written 5.7 years ago by Matt Shirley9.4k
4
gravatar for Renesh
5.7 years ago by
Renesh1.9k
United States
Renesh1.9k wrote:

You can make your own code for this,

Copy and paste following code in fasta_len.py file:

import sys
from Bio import SeqIO

FastaFile = open(sys.argv[1], 'rU')

for rec in SeqIO.parse(FastaFile, 'fasta'):
    name = rec.id
    seq = rec.seq
    seqLen = len(rec)
    print name, seqLen

FastaFile.close()

Make this file executable by:

chmod u+x fasta_len.py

and run as,

./fasta_len.py file.fasta

Note: you should have installed biopython

ADD COMMENTlink modified 22 months ago by RamRS28k • written 5.7 years ago by Renesh1.9k
1
import sys
from Bio.SeqIO.FastaIO import SimpleFastaParser

FastaFile = open(sys.argv[1], 'rU')
LenFile = open('./data/lengths.csv', 'w')

for name, seq in SimpleFastaParser(FastaFile):
    seqLen = len(seq)
    LenFile.write(name + ',' + str(seqLen) + '\n')

FastaFile.close()
LenFile.close()

This is somewhat more efficient for large files http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:low-level-fasta-fastq

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by khimulya10

I was hoping to avoid something like that (which is really not too burdensome at all, I know). I now realize that sometimes writing your own script can actually take less time than looking for a canned solution. (Doesn't always work out that way, though.)

ADD REPLYlink modified 22 months ago by RamRS28k • written 5.7 years ago by bongbang70

Works like a charm. Thank you!

ADD REPLYlink written 2.7 years ago by chrystine.yan0
1
gravatar for hpmcwill
5.7 years ago by
hpmcwill1.1k
United Kingdom
hpmcwill1.1k wrote:

Assuming you already have EMBOSS installed you could use 'infoseq' to get the format you desire. For example:

$ infoseq -auto -nocolumns -delimiter ',' -only -noheading -name -length sequences.fa
2AAA_HUMAN,589
AAPK1_PIG,385
AAPK1_MOUSE,559
AAPK1_PONAB,554
AAPK1_HUMAN,559
AAPK1_RAT,559
...

Depending on the identifier format used in your source fasta file you may need to choose between the appropriate -name, -accession, -gi or -seqversion option for the identifier column, or use the "-sformat pearson" option to maintain the original source identifier. For example to return the original identifiers when they are in db|acc|id format:

$ infoseq -auto -nocolumns -delimiter ',' -only -noheading -name -length -sformat pearson sequences.fa
sp|P30153|2AAA_HUMAN,589
sp|Q09136|AAPK1_PIG,385
sp|Q5EG47|AAPK1_MOUSE,559
sp|Q5RDH5|AAPK1_PONAB,554
sp|Q13131|AAPK1_HUMAN,559
sp|P54645|AAPK1_RAT,559
...
ADD COMMENTlink modified 22 months ago by RamRS28k • written 5.7 years ago by hpmcwill1.1k
1
gravatar for mg
22 months ago by
mg100
mg100 wrote:

Using SeqKit's fx2tab:

seqkit fx2tab --length --name --header-line  foo.fasta

This prints sequence name and length with empty columns for sequences and qualities.

ADD COMMENTlink modified 22 months ago • written 22 months ago by mg100
0
gravatar for onuralp
5.7 years ago by
onuralp190
Turkey
onuralp190 wrote:

Granted that this is a ridiculous way to do this, I thought it might be useful to demonstrate the algorithmic steps and how they are implemented using Perl. You can think about how to modify and improve various steps, and write your own scripts.

Make sure that you have Perl installed on your computer.

C:\Users\bongbang\Desktop\perl -v

Copy and paste the following code into a text file and save it as calculateLength.pl

#!/usr/bin/perl
use warnings;
use strict;

# Point to where your input fasta file is located
my $fasta_file = "/Users/bongbang/Desktop/my_sequences.fasta";

# Declare that you want to read this file, and print the error message
# in case that the input fasta file does not exist at the specified path
open(FASTA, "<", $fasta_file) or die("Probably wrong path: $fasta_file\n");

# We will linearize the sequences into this hash
my %singleLineSequences;

# Initialize a variable to store sequence ids
my $sequence_id;

# Read the fasta file line by line
while(<FASTA>){
    my $line = $_; chomp($line);

    # if this is a new sequence with the header at the beginnning
    # Extract sequence id using regular expression (\S+) and
    # store it into the variable $sequence_id

    if ($line =~ m/^>(\S+)/){

        $sequence_id = $1; # e.g., YKR054C            

        # Reserve an entry in your hash
        # $sequence_id = YKR054C
        # $singleLineSequences{'YKR054C'} = ""
        # No sequence has been added yet, hence the empty value
        $singleLineSequences{$sequence_id} = ""; 

        }

    # if the line is not a header but part of a sequence, 
    # append this part to the corresponding sequence id in
    # your hash entry

    else {

        # paste the current line to the end of the sequence
        $singleLineSequences{$sequence_id} = $singleLineSequences{$sequence_id} . $line;

        }
   }

# Now that your hash contains single line sequences, you can simply
# loop over each sequence in your hash, determine the sequence length,
# and print it out

foreach my $sequence_entry (keys %singleLineSequences){

    # grab a hash entry and store it into a variable
    my $currentSequence = $singleLineSequences{$sequence_entry};

    # determine length of the sequence
    my $lengthSequence = length($currentSequence);

    # print the result: id,length
    print $sequence_entry . "," . $lengthSequence . "\n";
}

Go into the directory including your script, and execute it as follows:

C:\Users\bongbang\Desktop\perl calculateLength.pl
ADD COMMENTlink modified 2.1 years ago by RamRS28k • written 5.7 years ago by onuralp190

Why not use BioPerl to achieve both module reuse as well as algorithm elicitation?

ADD REPLYlink written 5.7 years ago by RamRS28k

Hi,

I aim to identify the number of alanines as a function of the sequence length. Is there a way to count the number of alanines in each sequence along with the sequence length? An output that looks something like:

sp|P0A9Z1|GLNB_ECOLI,112,A=10

sp|P62517|OPGH_ECOLI,847,A=23

sp|P76290|CMOA_ECOLI,247,A=41

Thanks in advance.

ADD REPLYlink modified 9 months ago • written 9 months ago by d.kaur0

What have you tried?

ADD REPLYlink written 9 months ago by RamRS28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1468 users visited in the last hour