Question: Perl Code for Sequence Extraction
0
4.3 years ago by
csmpresent20
India
csmpresent20 wrote:

Hi,

Please let me know how to write code that extracts each sequence into a separate file from a single file (*.txt or *.fa) containing multiple sequences in FASTA format.

Let me elaborate:

Suppose I have a file MultiFasta.txt which contains these three sequences in FASTA format:

> Species1
GTTGATGTAGCTTAAACTTAAAGCAAGGCA... ...AACAGACTTACACATGCAAGCATCCACGCCCCGGTGAG

> Species2
CGCTTAACCACACCC... ... ...CCATAA

> Species3
ATTAGATACCCC... ...TATATACCGCCATCTTCAGCAAACCC

Now I want each of these sequences extracted into a separate file (either text or FASTA format). Please let me know what the code for this should be. I have been trying, but all in vain.

Thanks in advance.


modified 4.3 years ago by iraun3.5k • written 4.3 years ago by csmpresent20
3

There are about 100 questions dealing with something similar. See: similar posts. Do any of those help you?

For using BioPerl see the SeqIO documentation: http://www.bioperl.org/wiki/HOWTO:SeqIO

What specifically have you tried?

modified 4.3 years ago • written 4.3 years ago by Michael Dondrup46k
1

As a side note, in contrast to the code presented in the tutorial, your program should always start like this:

#!/usr/bin/env perl

use strict;
use warnings;
use diagnostics; # mandatory for a beginner

If you get any error message from such a script, you may post it here; otherwise not ;) (because that means you already know what you are doing)

modified 4.3 years ago • written 4.3 years ago by Michael Dondrup46k

I am not an expert in Perl. I have Googled for code. A number of scripts are available, but I could not work out where they stumble.

Thanks for the help.

written 4.3 years ago by csmpresent20

One becomes good at something only by trying repeatedly and reducing failure at each step. Saying "I'm not good at XYZ" is no good if that skill is crucial to one's profession or passion.

Also, you could always use a different programming language. And when people give you suggestions that they know will help beginners, it is to your benefit to accept and try them.

written 4.3 years ago by RamRS21k
2
4.3 years ago by
iraun3.5k
Norway
iraun3.5k wrote:

I know that you're looking for a Perl solution, but here is a possible awk one-liner.

awk '/^>/{f=substr($1,2);s=f".fasta"} {print > s}' yourfile.fasta

This awk command will produce one FASTA file for each sequence stored in your file. Each output file is named after the header of the corresponding FASTA record in your multi-FASTA file. I don't know if you are familiar with awk, but just in case:

  • /^>/ ---> if the line starts with ">" (header line of a FASTA record).
  • f=substr($1,2) ---> Remove the ">" from the first field of the header and save the rest in the variable f (this will be the base of the output filename).
  • s=f".fasta" ---> The output filename is the content of the variable f (header) concatenated with the extension ".fasta".
  • print > s ---> write the current line (header or sequence) to the file named by the variable s.
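One caveat, with a slightly more defensive sketch. The headers in the question have a space after the ">" (for example "> Species1"). In that case $1 is just ">" and substr($1,2) is empty, so every record would be written to a single file named ".fasta". The variant below works on the whole header line instead: it strips the ">" and any whitespace after it, keeps the first word as the ID, skips blank lines, and closes each output file before opening the next. The function name split_fasta, the optional output directory argument, the "unnamed_N" fallback, and the filename sanitising are assumptions made for illustration, not part of the original one-liner.

```shell
#!/bin/sh
# split_fasta is a hypothetical helper name, not part of awk or any toolkit.
# Usage: split_fasta input.fasta [existing_output_dir]
split_fasta() {
    input=$1
    outdir=${2:-.}    # output directory must already exist; defaults to "."
    awk -v dir="$outdir" '
        /^>/ {
            if (out != "") close(out)            # release the previous file handle
            name = $0
            sub(/^>[ \t]*/, "", name)            # drop ">" and any spaces after it
            sub(/[ \t].*$/, "", name)            # keep only the first word as the ID
            gsub(/[^A-Za-z0-9._-]/, "_", name)   # make the ID safe to use as a filename
            if (name == "") name = "unnamed_" (++n)   # assumed fallback for empty headers
            out = dir "/" name ".fasta"
            print > out                          # header line starts the new file
            next
        }
        out != "" && NF { print > out }          # sequence lines; blank lines are skipped
    ' "$input"
}
```

For the file in the question, `split_fasta MultiFasta.txt .` should write Species1.fasta, Species2.fasta and Species3.fasta to the current directory. Note that two records with the same ID would overwrite each other, since each file is reopened with ">".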
written 4.3 years ago by iraun3.5k
1

Try Heng Li's bioawk. It is awk extended for biological data formats, which makes parsing easier :)

https://github.com/lh3/bioawk

written 4.3 years ago by RamRS21k