Granted that this is a ridiculous way to do this, I thought it might be useful to demonstrate the algorithmic steps and how they are implemented using Perl. You can think about how to modify and improve various steps, and write your own scripts.
Make sure that you have Perl installed on your computer.
C:\Users\bongbang\Desktop\perl -v
Copy and paste the following code into a text file and save it as calculateLength.pl
#!/usr/bin/perl
use warnings;
use strict;
# Point to where your input fasta file is located
my $fasta_file = "/Users/bongbang/Desktop/my_sequences.fasta";
# Declare that you want to read this file, and print the error message
# in case that the input fasta file does not exist at the specified path
open(FASTA, "<", $fasta_file) or die("Probably wrong path: $fasta_file\n");
# We will linearize the sequences into this hash
my %singleLineSequences;
# Initialize a variable to store sequence ids
my $sequence_id;
# Read the fasta file line by line
while(<FASTA>){
my $line = $_; chomp($line);
# if this is a new sequence with the header at the beginnning
# Extract sequence id using regular expression (\S+) and
# store it into the variable $sequence_id
if ($line =~ m/^>(\S+)/){
$sequence_id = $1; # e.g., YKR054C
# Reserve an entry in your hash
# $sequence_id = YKR054C
# $singleLineSequences{'YKR054C'} = ""
# No sequence has been added yet, hence the empty value
$singleLineSequences{$sequence_id} = "";
}
# if the line is not a header but part of a sequence,
# append this part to the corresponding sequence id in
# your hash entry
else {
# paste the current line to the end of the sequence
$singleLineSequences{$sequence_id} = $singleLineSequences{$sequence_id} . $line;
}
}
# Now that your hash contains single line sequences, you can simply
# loop over each sequence in your hash, determine the sequence length,
# and print it out
foreach my $sequence_entry (keys %singleLineSequences){
# grab a hash entry and store it into a variable
my $currentSequence = $singleLineSequences{$sequence_entry};
# determine length of the sequence
my $lengthSequence = length($currentSequence);
# print the result: id,length
print $sequence_entry . "," . $lengthSequence . "\n";
}
Go into the directory including your script, and execute it as follows:
C:\Users\bongbang\Desktop\perl calculateLength.pl
"but I would rather not do that if I can avoid it" why ?
Because I looked at the syntax for awk and it seemed rather messy (it didn't help that the name reminds me of "awkward"). I thought I was going to save myself time, but now it appears that awk is probably the most efficient way.
see also: Code Golf: Mean Length Of Fasta Sequences