Question: Extract sequence number "X" from multifasta file
1
gravatar for user230613
5.3 years ago by
user230613280
Europe
user230613280 wrote:

Hi there,
I know there are different ways to extract some "wanted" sequences from multifasta file. In general, having one or more IDs, you can retrieve the sequence from multifasta file using bioperl script or another kind of software as ones pointed out here (Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences). In my case I know that I want the sequence number 1, 5 and 10 from a multifasta file. Any idea how can I achieve this using perl/bioperl preferably?

Thanks in advance.
P.D: I'm a beginner in bioinformatics word, sorry if it is a very basic question.

fasta • 2.5k views
ADD COMMENTlink modified 3.1 years ago by Biostar ♦♦ 20 • written 5.3 years ago by user230613280
3
gravatar for SES
5.3 years ago by
SES8.3k
Vancouver, BC
SES8.3k wrote:

Here is a BioPerl example since you asked:

I put this in a script so it is easier to read, but in practice, I would just do this at the command line with a one-liner (that's how I wrote it):

perl -MBio::SeqIO -M'List::MoreUtils qw(any)' -E '@ids = (1, 5, 10); $num = 0; $seqio = Bio::SeqIO->new(-fh => \*STDIN); while ($seq = $seqio->next_seq) { $num++; say join "\n", ">".$seq->id, $seq->seq if any { $_ == $num } @ids; }' < seqs.fas

That will save you from opening an editor. It's not as readable but it works the same. :)

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by SES8.3k

Thank you a lot SES. One last question, could you explain me a little bit the meaning of this last two lines?:

say join "\n", ">".$seq->id, $seq->seq
if any { $_ == $num } @ids;

And another thing, using "say" is possible to print to a file like with "print?

Thanks again.

 

ADD REPLYlink written 5.3 years ago by user230613280
1

say was added in perl version 5.10, it just adds a newline to your print, so it saves you some typing. If you have something as old as 5.10, you can enable it with use feature 'say'; or you can enable all new features at the command line with -E.

join should be self-explanatory, it just adds a newline after the sequence ID and the sequence itself. In Perl, statements are terminated by a semi-colon, so that last statement is one line. That style is called post-fix, with the conditional at the end, and this makes the code much more readable (esp. when put on separate lines) by removing unnecessary braces. That statement should be easy to read, it says, "print the record if any IDs match those in the list." And that sentence is almost working code! That is a good reason to use the functions from List::Util or List::MoreUtils, they are very expressive and they make the code easy to read (and they are fast, as they are all written in C).

Note that I imported the any function from the package List::MoreUtils. All of the functions in that package are now part of List::Utils, which is shipped with Perl (the lastest version includes these functions). I used the List::MoreUtils package so the code will work with any version of Perl, not just the latest. I think it will simplify things going forward now that all the functions are in one package and in core Perl.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by SES8.3k
1
gravatar for Brian Bushnell
5.3 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

There is a program in BBTools which will do that:

getreads.sh in=data.fasta out=filtered.fasta id=1,5,10

 

ADD COMMENTlink written 5.3 years ago by Brian Bushnell17k
1
gravatar for Pierre Lindenbaum
5.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum127k wrote:

"I want the sequence number 1, 5 and 10 from a multifasta file"

 

awk '/^>/ {N++;} {if(N==1 || N==5 ||N==10) print;}' in.fa
ADD COMMENTlink written 5.3 years ago by Pierre Lindenbaum127k

I might be mistaken, but I think this only prints the lines containing the sequence identifiers, correct?

ADD REPLYlink written 5.3 years ago by Matt Shirley9.3k

no, there is no 'next' statement in

{N++; }

so awk continues to scan the matching patterns. It would only print the headers with

{N++; next;}

 

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Pierre Lindenbaum127k
1

Thank you Pierre. I'm still impressed by how concise most awk scripts can be.

ADD REPLYlink written 5.3 years ago by Matt Shirley9.3k
1
gravatar for Matt Shirley
5.3 years ago by
Matt Shirley9.3k
Cambridge, MA
Matt Shirley9.3k wrote:

I know you are looking for a Perl solution, but you could use my pyfaidx module for this. In the latest version I've added numeric indexing, so you can do:

 

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Matt Shirley9.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 969 users visited in the last hour