Question: Extract sequences from fasta NOT present in list of IDs
0
gravatar for iraun
5.8 years ago by
iraun3.8k
Norway
iraun3.8k wrote:

Hello all,
I've a list of IDs, and I would like to extract those sequences from a fasta file, that are not included in my list. Anyone know how to do it with perl/bioperl? Anyway to modify this script in order to get the sequences NOT present in my list?

use Bio::DB::Fasta;

my $fastaFile = shift;
my $queryFile = shift;

my $db = Bio::DB::Fasta->new( $fastaFile );
open (IN, $queryFile);
while (<IN>){
    chomp;
    $seq = $_;
    my $sequence = $db->seq($seq);
    if  (!defined( $sequence )) {
            die "Sequence $seq not found. \n"
    }  
    print ">$seq\n", "$sequence\n";
}

 

Thanks in advance.

fasta • 2.8k views
ADD COMMENTlink modified 5.8 years ago by Brian Bushnell17k • written 5.8 years ago by iraun3.8k
2
gravatar for RamRS
5.8 years ago by
RamRS30k
Baylor College of Medicine, Houston, TX
RamRS30k wrote:

My script might help: https://github.com/RamRS/myPerlScripts/blob/master/filterMulFa.pl

Just negate the if(exists)

ADD COMMENTlink written 5.8 years ago by RamRS30k

Thanks a lot RamRS. I would take a look. So, if I negate the sentence, I would get all the sequences of fasta file that are not included in my ID list, it's correct?

ADD REPLYlink written 5.8 years ago by iraun3.8k

That is correct. It creates a hash of all IDs in the ID file and then, iterates through the FA file seq by seq. If the ID of the current sequence is not in the hash, it should print that seq out.

ADD REPLYlink written 5.8 years ago by RamRS30k

Hi ram,

I am trying to use your script but I am getting this message:

perl filterMulFa.pl ids fasta -> Getopt::ArgParse: Option fasta is required.

To have this module I used cpan. Second point, how its possible to negate the sentence in perl? many thanks in advance.

pd. I am just starting with perl

ADD REPLYlink modified 10 months ago by RamRS30k • written 5.0 years ago by ericpintof0
1

The syntax to use the script is

perl filterMulFa.pl -f input_file.fasta

To negate the functionality, change the if(exists(...)) at line 62 to unless(exists(...))

In Perl, you can negate if($condition) with either if( ! $condition) or unless($condition)

ADD REPLYlink modified 10 months ago • written 5.0 years ago by RamRS30k

Many thanks for your answer. Actually still I cannot run the script. I created a fake sequence.fasta:

>a
CCCDDD
>b
AAAAAA
>c

list.id:

b
c
perl filterMulFa.pl -f sequence.fasta -i list.id

and the return is this:

perl filterMulFa.pl: extract subset of sequences from FASTA file

usage: 03_pick_IDs_fasta.pl --ids|-i --fasta|-f

This script reads a FASTA file with multiple \            sequences, a
text file with a list of IDs and picks from the \           FASTA file
sequences that have an ID match in the IDs file. \             These
sequences are then printed to STDOUT

required named arguments:

  --ids, -i IDS          file with IDs of sequences to be picked
  --fasta, -f FASTA      input FASTA file

I dont know what is going wrong. Thanks in advance

ADD REPLYlink modified 10 months ago by RamRS30k • written 5.0 years ago by ericpintof0
1

Hmmm. I was working on fixing a bug in this script, and now I see what it could be. For the moment, could you just comment out or remove lines 24-28 please? The script should work fine after that.

ADD REPLYlink written 5.0 years ago by RamRS30k
2
gravatar for Brian Bushnell
5.8 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

You can do this with BBTools:

filterbyname.sh in=reads.fasta out=filtered.fasta names=file include=f

include toggles between including the specified reads, or excluding the specified reads. The names file can either be a list of names (one name per line) or another fasta/fastq/sam file.

ADD COMMENTlink modified 10 months ago by RamRS30k • written 5.8 years ago by Brian Bushnell17k

Many thanks, It worked perfect. 
I am surprise about this java option, people says that for bioinformatic solutions perl and python are the most useful languages. Should I learn java then? I ask this because I decided to learn one of this languages 

ADD REPLYlink written 5.0 years ago by ericpintof0

It depends on the use-case.  Perl and Python are scripting languages, better suited to interacting with the shell.  Java and C are compiled languages, better suited for processing large amounts of data quickly.  It's useful to know one of each.

ADD REPLYlink written 5.0 years ago by Brian Bushnell17k
0
gravatar for PoGibas
5.8 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

Quick and dirty.

cat dummy.fasta
>one
AAAAAAAAAAAAAAAAAAAA
>two
CCCCCCCCCCCCCCCCCCCC
>tree
GGGGGGGGGGGGGGGGGGGG
>four
TTTTTTTTTTTTTTTTTTTT

cat dummy.ID
two

grep '>' dummy.fasta | grep -v -F -f dummy.ID - | grep -A1 -w -F -f - dummy.fasta | grep -v '^--'

>one
AAAAAAAAAAAAAAAAAAAA
>tree
GGGGGGGGGGGGGGGGGGGG
>four
TTTTTTTTTTTTTTTTTTTT

 

ADD COMMENTlink modified 5.8 years ago • written 5.8 years ago by PoGibas4.8k

But how does this deal with multi-line sequences?

ADD REPLYlink written 5.8 years ago by RamRS30k

It doesn't, I just wanted to show the way how to solve this problem using basic unix commands. If OP has multi-line fasta (that wasn't mentioned) there are great answers here on biostars how to convert multi-line fasta to single line.

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by PoGibas4.8k

Ah. On a side note, have you tried bioawk? It is awesome in dealing with '>' as the delimiter for FASTA input :)

ADD REPLYlink written 5.8 years ago by RamRS30k

Some time ago there was thread on multi-line vs single line fasta: How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns :)

ADD REPLYlink modified 5.8 years ago • written 5.8 years ago by PoGibas4.8k

Yes, but multi-line FASTA exists and needs to be accounted into most solutions. Which is why I usually go with Bio- packages and let them deal with these quirks!

ADD REPLYlink written 5.8 years ago by RamRS30k

Thanks Pgibas, but actually I'm looking a way to do it in perl...  :)

ADD REPLYlink written 5.8 years ago by iraun3.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1396 users visited in the last hour