Question

How can I assign sequence length to an ID in a FASTA file using Perl?

7.5 years ago

SaltedPork ▴ 170

I took the sequences from a FASTA file and concatenated them to form one big sequence, which was the basis of my research. I now have a series of coordinates (inside this concatenated sequence) that I am interested in.

I want to be able to find the original IDs of the sequences that match the coordinates inside this concatenated sequence. I am currently writing a Perl script. Does anyone have any suggestions?

#!/usr/bin/perl
use strict;
use warnings;
use Cwd;

my $input = $ARGV[0];
open(my $INPUT, '<', $input) or die "unable to open $input: $!";
while (my $line = <$INPUT>) {
    if ($line =~ /^[AGCT]/) {
        # sequence line: length handling still to be written
    }
}
close $INPUT;

Obviously my program isn't finished, but I think I will try the length function in Perl and assign the results to an array.

Perl fasta • 1.8k views

ADD COMMENT • link updated 7.5 years ago by Jorge Amigo 14k • written 7.5 years ago by SaltedPork ▴ 170

Showing some example input and output would be helpful.

ADD REPLY • link 7.5 years ago by venu 7.1k

Input would be a standard FASTA file, and a file with coordinates in two columns (start-stop). Output would be a list of IDs that match a set of coordinates I input.
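
For readers following along, reading such a coordinate file might look like the sketch below. It assumes the two columns are separated by whitespace or a tab, which the post does not specify; the sub name parse_coords and the example values are made up for illustration.

```perl
use strict;
use warnings;

# Parse two-column coordinate lines (start, stop) separated by whitespace.
# Returns a list of [start, stop] pairs; blank lines and '#' comments are skipped.
sub parse_coords {
    my ($fh) = @_;
    my @coords;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;
        my ($start, $stop) = split ' ', $line;
        die "malformed coordinate line: '$line'\n"
            unless defined $stop && $start =~ /^\d+$/ && $stop =~ /^\d+$/;
        # Store pairs with start <= stop, whichever order they were given in
        ($start, $stop) = ($stop, $start) if $start > $stop;
        push @coords, [$start, $stop];
    }
    return @coords;
}

# Hypothetical example data, read through an in-memory filehandle
my $example = "3\t10\n25\t40\n";
open(my $fh, '<', \$example) or die "cannot open in-memory file: $!";
my @coords = parse_coords($fh);
close $fh;
print join('-', @$_), "\n" for @coords;    # prints 3-10 and 25-40
```

With a real file, the in-memory handle would be replaced by a normal three-argument open on the coordinate file's path.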

ADD REPLY • link 7.5 years ago by SaltedPork ▴ 170

score 1 · Answer 1 · 2016-11-21

7.5 years ago

Jorge Amigo 14k

If you concatenate all the sequences into one, you lose the original ID information. What your assignment probably wanted you to do is store the lengths of the original sequences and then work out which ID contains which position. If that is the case, there are many things you could try, depending on how skilful you are in Perl. I would build a hash of cumulative lengths, keyed by the running total and pointing to each sequence's ID, from the sequences previously stored in an %idSeq hash (visited in the same order used for the concatenation). Once all this information is stored, you only need to loop through the requested positions and, for each one, find the first sorted cumulative length that is greater than or equal to it:

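# Assumes %idSeq maps ID => sequence, @ids lists the IDs in concatenation
# order, and @positions holds the 1-based positions to look up.
my ($totalLength, %lengthId) = (0);
foreach my $id (@ids) {
    $totalLength += length($idSeq{$id});
    $lengthId{$totalLength} = $id;    # cumulative end coordinate => ID
}
foreach my $pos (@positions) {
    foreach my $length (sort { $a <=> $b } keys %lengthId) {
        if ($length >= $pos) { print "$pos\t$lengthId{$length}\n"; last }
    }
}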
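
The cumulative-length idea above can be sketched end to end as follows. This is only one possible reading of it: it assumes the sequences were concatenated in file order with no separators, that coordinates are 1-based and inclusive, and that the ID is the first word of each header line. The sub names and the toy FASTA data are hypothetical.

```perl
use strict;
use warnings;

# Read FASTA records from a filehandle, keeping file order.
# Returns a list of [id, length] pairs (sequences may span several lines).
sub read_fasta_lengths {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;                      # drop newline and trailing whitespace
        if ($line =~ /^>(\S+)/) {
            push @records, [$1, 0];             # new record, ID = first word of header
        } elsif (@records) {
            $records[-1][1] += length $line;    # add this line to the current record
        }
    }
    return @records;
}

# Turn [id, length] pairs into [id, start, end] in concatenated coordinates.
# Empty records occupy no coordinates, so they are dropped.
sub cumulative_ranges {
    my $offset = 0;
    return map {
        my ($id, $len) = @$_;
        my $range = [$id, $offset + 1, $offset + $len];
        $offset += $len;
        $range;
    } grep { $_->[1] > 0 } @_;
}

# IDs of all sequences overlapping the interval start..stop.
sub ids_for_interval {
    my ($ranges, $start, $stop) = @_;
    return map { $_->[0] }
           grep { $_->[1] <= $stop && $_->[2] >= $start } @$ranges;
}

# Hypothetical toy input: seqA covers 1-6, seqB 7-10, seqC 11-12
my $fasta = ">seqA\nACGT\nAC\n>seqB\nGGGG\n>seqC\nTT\n";
open(my $in, '<', \$fasta) or die "cannot open in-memory file: $!";
my @ranges = cumulative_ranges(read_fasta_lengths($in));
close $in;
print join(',', ids_for_interval(\@ranges, 5, 8)), "\n";    # prints seqA,seqB
```

Storing a start and an end per ID, rather than a hash keyed by cumulative length, lets a start-stop interval that spans a sequence boundary report every ID it touches.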

ADD COMMENT • link 7.5 years ago by Jorge Amigo 14k

Thanks for replying! I would assign the sequence to $seq and the IDs to $id, and then use those inside a hash, with the hash key as $lengthId?

ADD REPLY • link 7.5 years ago by SaltedPork ▴ 170

That would be the idea. You can either store every pair of ID and sequence in a hash and then loop through it, or you could evaluate sequence lengths on the fly, checking whether any requested position falls inside each particular sequence.
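
The on-the-fly variant might look roughly like the sketch below, which keeps only a running end coordinate instead of storing the sequences. It makes the same assumptions as before (file-order concatenation, 1-based positions, ID taken as the first word of the header); locate_positions and the toy data are invented for illustration.

```perl
use strict;
use warnings;

# Stream a FASTA filehandle and return a hash of position => ID
# without keeping any sequence in memory.
sub locate_positions {
    my ($fh, @positions) = @_;
    my @pending = sort { $a <=> $b } @positions;
    my (%idAt, $id);
    my $end = 0;                       # concatenated coordinate of the last base read
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;
        if ($line =~ /^>(\S+)/) { $id = $1; next }
        next unless defined $id;       # ignore anything before the first header
        $end += length $line;
        # Every pending position up to $end lies in the current record
        while (@pending && $pending[0] <= $end) {
            my $pos = shift @pending;
            $idAt{$pos} = $id;
        }
        last unless @pending;          # stop early once everything is resolved
    }
    return %idAt;
}

# Hypothetical toy input: seqA covers 1-6, seqB 7-10, seqC 11-12
my $fasta = ">seqA\nACGT\nAC\n>seqB\nGGGG\n>seqC\nTT\n";
open(my $in, '<', \$fasta) or die "cannot open in-memory file: $!";
my %found = locate_positions($in, 12, 2, 7);
close $in;
print "$_\t$found{$_}\n" for sort { $a <=> $b } keys %found;
```

Positions beyond the end of the concatenated sequence are simply absent from the returned hash, so the caller can detect them with exists.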

ADD REPLY • link 7.5 years ago by Jorge Amigo 14k