Extracting A Sequence By Position Using Perl
11.2 years ago
Jimmyk ▴ 20

Hi guys, how do I extract sequences from a FASTA file using the start and end positions from a gene prediction file? In the example below, the file with the ORF statistics is my prediction file; for example, the start position of the first ORF is 65 and the end is 213. The FASTA file in which I want to look up those positions is the other one.

My prediction file looks like this:

>Seq1 [organism=S.burgodofry...
orf00001       65      213  +1     2.93
orf00002      799     2328  +1     7.09
orf00003     2331     3437  +3     6.09
orf00004     3457     4044  +1     6.15
>Seq2 [organism=S.burgodofry...
orf00001       55      317  +1     2.17
orf00002      206      610  +2     5.28
orf00003      747     2408  +3     4.85


and my FASTA sequence file looks like this:

>Seq1 [organism=S.burgodofry]...
ACTGTAGATGACATGACCAGTACGATACAGAT...
....
........
>Seq2 [organism=.....]
ATGTCGTGACTAGTACGATCAGATCAGAT
.........................
..............
...

perl fasta sequence retrieval • 4.3k views

You don't say which fields in your gene prediction file correspond to start and end. And that isn't a Fasta file, my friend. What have you tried so far?


My bad. The file with the ORF statistics is my prediction file; for example, the start position of the first ORF is 65 and the end is 213. The FASTA file in which I want to look up those positions is the other one.

11.2 years ago
David L. ▴ 110

If you don't mind using BioPerl, you can index your fasta file with Bio::Index::Fasta or Bio::DB::Fasta. You can retrieve the sequence as a Bio::Seq object from the index and use the subseq method to extract the sequence between start and end position.

The BioPerl Tutorial has a section about Bio::Index::Fasta/Bio::DB::Fasta with sample code.


The thing is, I'm new to programming, apart from some reading about Perl.


Time to put that reading into practice then :-)

11.2 years ago

I've had a similar problem before when I had to extract gene predictions from a GFF3 file. You can try to adapt the answers given in the FriendFeed thread, though the answer uses BioPython (oh, the times before BioStar...).

11.2 years ago
lexnederbragt ★ 1.3k

"Beginner's" perl way:

1. Reading the table using the split function to get the column values in a list (do you need to adjust for the frame or is the starting position given 'in-frame'?)
2. Put start and stop positions in a hash (to keep things simple you could use SeqX_orf0000Y as keys)
3. Parsing fasta files with perl: see these answers
4. Getting the relevant portion of the sequence using the substr function

More complicated ways involve complex data structures, BioPerl etc.
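
For what it's worth, the four steps above could be sketched roughly like this in plain Perl. This is only a minimal sketch under a few assumptions that are not confirmed in the thread: the second and third columns of the prediction file are taken to be 1-based, inclusive start and end positions that are already in-frame; only forward-strand ORFs (start <= end) are handled; and the first word after ">" is taken to be the sequence ID in both files. The sub names (read_fasta, read_predictions, extract_orf) and the toy data are made up for illustration.

```perl
use strict;
use warnings;

# Step 3: read a FASTA filehandle into a hash: first word of header => sequence.
sub read_fasta {
    my ($fh) = @_;
    my (%seq_for, $id);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;          # skip blank lines
        if ($line =~ /^>(\S+)/) {          # header line, e.g. ">Seq1 [organism=...]"
            $id = $1;
            $seq_for{$id} = '';
        }
        elsif (defined $id) {
            $line =~ s/\s+//g;             # drop stray whitespace
            $seq_for{$id} .= $line;        # join multi-line sequences
        }
    }
    return \%seq_for;
}

# Steps 1 and 2: split each table row and store start/end in a hash
# keyed "SeqX_orf0000Y".
sub read_predictions {
    my ($fh) = @_;
    my (%orf, $id);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {          # a new sequence block starts
            $id = $1;
            next;
        }
        next unless defined $id;
        my @cols = split ' ', $line;       # split on any run of whitespace
        next unless @cols >= 3;
        my ($name, $start, $end) = @cols;  # assumption: columns 2 and 3 are start and end
        $orf{"${id}_$name"} = { seq_id => $id, start => $start, end => $end };
    }
    return \%orf;
}

# Step 4: substr is 0-based and takes a length, so convert from
# 1-based inclusive coordinates.
sub extract_orf {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Toy data standing in for the two files (made up, not from the question).
my $fasta_text = ">Seq1 toy record\nACGTACGTAC\nGGGTTTAAAC\n"
               . ">Seq2 toy record\nTTTTCCCCGGGGAAAA\n";
my $pred_text  = ">Seq1 toy record\n"
               . "orf00001   3   8  +3  1.00\n"
               . "orf00002  11  16  +2  1.00\n"
               . ">Seq2 toy record\n"
               . "orf00001   5  12  +2  1.00\n";

# In-memory filehandles; with real data, open the files by name instead.
open my $fasta_fh, '<', \$fasta_text or die "Cannot read FASTA data: $!";
open my $pred_fh,  '<', \$pred_text  or die "Cannot read prediction data: $!";
my $seqs = read_fasta($fasta_fh);
my $orfs = read_predictions($pred_fh);

for my $key (sort keys %$orfs) {
    my $orf = $orfs->{$key};
    my $seq = $seqs->{ $orf->{seq_id} } or next;   # no matching FASTA record
    next if $orf->{start} > $orf->{end};           # reverse strand: not handled in this sketch
    print ">$key $orf->{start}-$orf->{end}\n";
    print extract_orf($seq, $orf->{start}, $orf->{end}), "\n";
}
```

To run it on real files, replace the two in-memory handles with something like open my $fasta_fh, '<', 'sequences.fasta' (the file name is a placeholder). Whole sequences are kept in memory here, which is fine for small genomes; for large files the indexed BioPerl route mentioned in the other answer is probably the better fit.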