Question: How can I assign sequence length to an ID in a FASTA file using Perl?
SaltedPork wrote, 4.0 years ago:

I took the sequences from a FASTA file and concatenated them to form one big sequence, which was the basis of my research. I now have a series of coordinates (inside this concatenated sequence) that I am interested in.

I want to be able to find the original IDs of the sequences that match the coordinates inside this concatenated sequence. I am currently writing a Perl script. Does anyone have any suggestions?

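#!/usr/bin/perl
use strict;
use warnings;
use Cwd;

my $input = $ARGV[0];
# Three-argument open with a lexical filehandle; $! reports why it failed
open(my $INPUT, '<', $input) or die "unable to open $input: $!";
while (my $line = <$INPUT>) {
    chomp $line;
    if ($line =~ /^[AGCT]/) {
        # sequence line: its length would be recorded here
    }
}
close $INPUT;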

Obviously my program isn't finished, but I think I will try Perl's length function and assign the results to an array.

fasta perl • 965 views

Showing some example input and output would be helpful.

written 4.0 years ago by venu

Input would be a standard FASTA file and a file with coordinates in two columns (start and stop). Output would be a list of the IDs that match the set of coordinates I supply.

written 4.0 years ago by SaltedPork
Jorge Amigo (Santiago de Compostela, Spain) wrote, 4.0 years ago:

If you concatenate all the sequences into one, you lose the original ID information. What your assignment probably wanted you to do is store the lengths of the original sequences and then work out which ID contains which position. If that is the case, there are many things you could try, depending on how skilful you are in Perl. I would build a hash of cumulative lengths that maps each running total to its sequence ID, taking the sequences from an %idSeq hash filled beforehand. Once all this information is stored, you only need to loop through the requested positions and, for each one, find the first sorted cumulative length that is greater than or equal to it:

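# Assumed here: @ids holds the IDs in concatenation order, @positions the 1-based positions
my (%lengthId, $totalLength);
foreach my $id (@ids) {
    $totalLength += length($idSeq{$id});
    $lengthId{$totalLength} = $id;    # cumulative end position -> ID
}
my @lengths = sort { $a <=> $b } keys %lengthId;
foreach my $pos (@positions) {
    foreach my $length (@lengths) {
        if ($length >= $pos) { print "$pos\t$lengthId{$length}\n"; last }
    }
}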
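
For completeness, here is one way the idea above might look as a whole script. It is only a sketch under several assumptions that are not stated in the thread: the ID is the first word after '>' in each header, coordinates are 1-based positions in the concatenated sequence, the coordinates file has whitespace-separated start and stop columns, and sequence lines contain no internal whitespace. The sub names (read_fasta_lengths, cumulative_ends, id_at) are made up for illustration. It keeps [end, id] pairs in an array instead of a hash so that file order is preserved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return [id, length] pairs in file order for the FASTA data on $fh.
sub read_fasta_lengths {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;                      # strip newline and trailing blanks
        if ($line =~ /^>\s*(\S*)/) {            # header: ID is the first word after '>'
            push @records, [$1, 0];
        } elsif (@records) {
            $records[-1][1] += length $line;    # a sequence may span several lines
        }
    }
    return @records;
}

# Turn [id, length] pairs into [end, id] pairs, where end is the 1-based
# position of the sequence's last base in the concatenated sequence.
sub cumulative_ends {
    my @records = @_;
    my $total = 0;
    my @ends;
    for my $rec (@records) {
        $total += $rec->[1];
        push @ends, [$total, $rec->[0]];
    }
    return @ends;
}

# Return the ID of the sequence containing 1-based position $pos,
# or nothing if $pos lies outside the concatenated sequence.
sub id_at {
    my ($pos, @ends) = @_;
    return if $pos < 1;
    for my $end (@ends) {
        return $end->[1] if $end->[0] >= $pos;
    }
    return;
}

# Runs only when called as: script.pl sequences.fasta coordinates.txt
if (@ARGV == 2) {
    my ($fasta, $coords) = @ARGV;
    open(my $fa, '<', $fasta) or die "unable to open $fasta: $!";
    my @ends = cumulative_ends(read_fasta_lengths($fa));
    close $fa;

    open(my $co, '<', $coords) or die "unable to open $coords: $!";
    while (my $line = <$co>) {
        next unless $line =~ /^\s*(\d+)\s+(\d+)/;   # skip lines without two integers
        my ($start, $stop) = ($1, $2);
        my $first = id_at($start, @ends) // 'NA';
        my $last  = id_at($stop,  @ends) // 'NA';
        print join("\t", $start, $stop, $first, $last), "\n";
    }
    close $co;
}
```

Each output line lists the start, the stop, the ID containing the start and the ID containing the stop, since an interval in the concatenated sequence may cross a boundary between two original sequences.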

Thanks for replying! So I would assign the sequence to $seq and the IDs to $id, and then use those inside a hash, with the hash key as $lengthId?

written 4.0 years ago by SaltedPork

That would be the idea. You can either store every pair of ID and sequence in a hash and then loop through it, or you could evaluate sequence lengths on the fly, checking whether any requested position falls inside each particular sequence.
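
The on-the-fly alternative could be sketched roughly as follows. This is an illustration, not code from the thread: the sub name assign_on_the_fly is hypothetical, positions are assumed to be 1-based in the concatenated sequence, and the ID is assumed to be the first word after '>'. The requested positions are sorted once, and each is assigned as soon as the running total of sequence length reaches it, so no sequence needs to be kept in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stream the FASTA data on $fh and return a hash of position -> ID for the
# requested 1-based positions in the concatenated sequence.
sub assign_on_the_fly {
    my ($fh, @positions) = @_;
    my @pending = sort { $a <=> $b } grep { $_ >= 1 } @positions;
    my (%id_for, $id);
    my $total = 0;                              # bases seen so far
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;                      # strip newline and trailing blanks
        if ($line =~ /^>\s*(\S*)/) {            # header: remember the current ID
            $id = $1;
            next;
        }
        next unless defined $id;                # ignore anything before the first header
        $total += length $line;
        # Every pending position up to the running total lies in this sequence.
        while (@pending && $pending[0] <= $total) {
            my $pos = shift @pending;
            $id_for{$pos} = $id;
        }
    }
    return %id_for;
}

# Runs only when called as: script.pl sequences.fasta POS [POS ...]
if (@ARGV >= 2) {
    my ($fasta, @positions) = @ARGV;
    open(my $fh, '<', $fasta) or die "unable to open $fasta: $!";
    my %id_for = assign_on_the_fly($fh, @positions);
    close $fh;
    for my $pos (@positions) {
        print join("\t", $pos, $id_for{$pos} // 'NA'), "\n";
    }
}
```

Positions beyond the end of the concatenated sequence are simply left out of the returned hash and are printed as NA.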

written 4.0 years ago by Jorge Amigo