Question: Extract sequences from fasta NOT present in list of IDs
0
gravatar for iraun
4.9 years ago by
iraun3.6k
Norway
iraun3.6k wrote:

Hello all,
I've a list of IDs, and I would like to extract those sequences from a fasta file, that are not included in my list. Anyone know how to do it with perl/bioperl? Anyway to modify this script in order to get the sequences NOT present in my list?

use Bio::DB::Fasta;

my $fastaFile = shift;
my $queryFile = shift;

my $db = Bio::DB::Fasta->new( $fastaFile );
open (IN, $queryFile);
while (<IN>){
    chomp;
    $seq = $_;
    my $sequence = $db->seq($seq);
    if  (!defined( $sequence )) {
            die "Sequence $seq not found. \n"
    }  
    print ">$seq\n", "$sequence\n";
}

 

Thanks in advance.

fasta • 2.4k views
ADD COMMENTlink modified 4.9 years ago by Brian Bushnell16k • written 4.9 years ago by iraun3.6k
2
gravatar for RamRS
4.9 years ago by
RamRS24k
Houston, TX
RamRS24k wrote:

My script might help: https://github.com/RamRS/myPerlScripts/blob/master/filterMulFa.pl

Just negate the if(exists)

ADD COMMENTlink written 4.9 years ago by RamRS24k

Thanks a lot RamRS. I would take a look. So, if I negate the sentence, I would get all the sequences of fasta file that are not included in my ID list, it's correct?

ADD REPLYlink written 4.9 years ago by iraun3.6k

That is correct. It creates a hash of all IDs in the ID file and then, iterates through the FA file seq by seq. If the ID of the current sequence is not in the hash, it should print that seq out.

ADD REPLYlink written 4.9 years ago by RamRS24k

Hi ram, 

I am trying to use your script but I am getting this message: 

perl filterMulFa.pl ids fasta -> Getopt::ArgParse: Option fasta is required.  To have this module I used cpan. Second point, how its possible to negate the sentence in perl? many thanks in advance. 

pd. I am just starting with perl  

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by ericpintof0
1

The syntax to use the script is 

perl filterMulFa.pl -f input_file.fasta

To negate the functionality, change the if(exists(...)) at line 62 to unless(exists(...))

In Perl, you an negate if($condition) with either if( ! $condition) or unless($condition)

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by RamRS24k

Many thanks for your answer. Actually still I cannot run the script. I created a fake sequence.fasta: 
>a

CCCDDD

>b

AAAAAA
>c

list.id:

b

c

perl filterMulFa.pl -f sequence.fasta -i list.id  and the return is this:
perl filterMulFa.pl: extract subset of sequences from FASTA file

 

usage: 03_pick_IDs_fasta.pl --ids|-i --fasta|-f

 

This script reads a FASTA file with multiple \            sequences, a

text file with a list of IDs and picks from the \           FASTA file

sequences that have an ID match in the IDs file. \             These

sequences are then printed to STDOUT

 

required named arguments:

  --ids, -i IDS          file with IDs of sequences to be picked

  --fasta, -f FASTA      input FASTA file

I dont know what is going wrong. Thanks in advance

ADD REPLYlink written 4.2 years ago by ericpintof0
1

Hmmm. I was working on fixing a bug in this script, and now I see what it could be. For the moment, could you just comment out or remove lines 24-28 please? The script should work fine after that.

ADD REPLYlink written 4.2 years ago by RamRS24k
2
gravatar for Brian Bushnell
4.9 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can do this with BBTools:

filterbyname.sh in=reads.fasta out=filtered.fasta names=file include=f

"include" toggles between including the specified reads, or excluding the specified reads.  The names file can either be a list of names (one name per line) or another fasta/fastq/sam file.

ADD COMMENTlink written 4.9 years ago by Brian Bushnell16k

Many thanks, It worked perfect. 
I am surprise about this java option, people says that for bioinformatic solutions perl and python are the most useful languages. Should I learn java then? I ask this because I decided to learn one of this languages 

ADD REPLYlink written 4.2 years ago by ericpintof0

It depends on the use-case.  Perl and Python are scripting languages, better suited to interacting with the shell.  Java and C are compiled languages, better suited for processing large amounts of data quickly.  It's useful to know one of each.

ADD REPLYlink written 4.2 years ago by Brian Bushnell16k
0
gravatar for PoGibas
4.9 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

Quick and dirty.

cat dummy.fasta
>one
AAAAAAAAAAAAAAAAAAAA
>two
CCCCCCCCCCCCCCCCCCCC
>tree
GGGGGGGGGGGGGGGGGGGG
>four
TTTTTTTTTTTTTTTTTTTT

cat dummy.ID
two

grep '>' dummy.fasta | grep -v -F -f dummy.ID - | grep -A1 -w -F -f - dummy.fasta | grep -v '^--'

>one
AAAAAAAAAAAAAAAAAAAA
>tree
GGGGGGGGGGGGGGGGGGGG
>four
TTTTTTTTTTTTTTTTTTTT

 

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by PoGibas4.8k

But how does this deal with multi-line sequences?

ADD REPLYlink written 4.9 years ago by RamRS24k

It doesn't, I just wanted to show the way how to solve this problem using basic unix commands. If OP has multi-line fasta (that wasn't mentioned) there are great answers here on biostars how to convert multi-line fasta to single line.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by PoGibas4.8k

Ah. On a side note, have you tried bioawk? It is awesome in dealing with '>' as the delimiter for FASTA input :)

ADD REPLYlink written 4.9 years ago by RamRS24k

Some time ago there was thread on multi-line vs single line fasta: How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns :)

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by PoGibas4.8k

Yes, but multi-line FASTA exists and needs to be accounted into most solutions. Which is why I usually go with Bio- packages and let them deal with these quirks!

ADD REPLYlink written 4.9 years ago by RamRS24k

Thanks Pgibas, but actually I'm looking a way to do it in perl...  :)

ADD REPLYlink written 4.9 years ago by iraun3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1605 users visited in the last hour