Extract sequences from fasta NOT present in list of IDs
3
0
Entering edit mode
9.4 years ago
iraun 6.2k

Hello all,

I've a list of IDs, and I would like to extract those sequences from a fasta file, that are not included in my list. Anyone know how to do it with perl/bioperl? Anyway to modify this script in order to get the sequences NOT present in my list?

use Bio::DB::Fasta;

my $fastaFile = shift;
my $queryFile = shift;

my $db = Bio::DB::Fasta->new( $fastaFile );
open (IN, $queryFile);
while (<IN>){
    chomp;
    $seq = $_;
    my $sequence = $db->seq($seq);
    if  (!defined( $sequence )) {
            die "Sequence $seq not found. \n"
    }  
    print ">$seq\n", "$sequence\n";
}

Thanks in advance

fasta • 4.5k views
ADD COMMENT
2
Entering edit mode
9.4 years ago
Ram 43k

My script might help: https://github.com/RamRS/myPerlScripts/blob/master/filterMulFa.pl

Just negate the if(exists)

ADD COMMENT
0
Entering edit mode

Thanks a lot RamRS. I would take a look. So, if I negate the sentence, I would get all the sequences of fasta file that are not included in my ID list, it's correct?

ADD REPLY
0
Entering edit mode

That is correct. It creates a hash of all IDs in the ID file and then, iterates through the FA file seq by seq. If the ID of the current sequence is not in the hash, it should print that seq out.

ADD REPLY
0
Entering edit mode

Hi ram,

I am trying to use your script but I am getting this message:

perl filterMulFa.pl ids fasta -> Getopt::ArgParse: Option fasta is required.

To have this module I used cpan. Second point, how its possible to negate the sentence in perl? many thanks in advance.

pd. I am just starting with perl

ADD REPLY
1
Entering edit mode

The syntax to use the script is

perl filterMulFa.pl -f input_file.fasta

To negate the functionality, change the if(exists(...)) at line 62 to unless(exists(...))

In Perl, you can negate if($condition) with either if( ! $condition) or unless($condition)

ADD REPLY
0
Entering edit mode

Many thanks for your answer. Actually still I cannot run the script. I created a fake sequence.fasta:

>a
CCCDDD
>b
AAAAAA
>c

list.id:

b
c
perl filterMulFa.pl -f sequence.fasta -i list.id

and the return is this:

perl filterMulFa.pl: extract subset of sequences from FASTA file

usage: 03_pick_IDs_fasta.pl --ids|-i --fasta|-f

This script reads a FASTA file with multiple \            sequences, a
text file with a list of IDs and picks from the \           FASTA file
sequences that have an ID match in the IDs file. \             These
sequences are then printed to STDOUT

required named arguments:

  --ids, -i IDS          file with IDs of sequences to be picked
  --fasta, -f FASTA      input FASTA file

I dont know what is going wrong. Thanks in advance

ADD REPLY
1
Entering edit mode

Hmmm. I was working on fixing a bug in this script, and now I see what it could be. For the moment, could you just comment out or remove lines 24-28 please? The script should work fine after that.

ADD REPLY
2
Entering edit mode
9.4 years ago

You can do this with BBTools:

filterbyname.sh in=reads.fasta out=filtered.fasta names=file include=f

include toggles between including the specified reads, or excluding the specified reads. The names file can either be a list of names (one name per line) or another fasta/fastq/sam file.

ADD COMMENT
0
Entering edit mode

Many thanks, It worked perfect.

I am surprise about this java option, people says that for bioinformatic solutions perl and python are the most useful languages. Should I learn java then? I ask this because I decided to learn one of this languages

ADD REPLY
0
Entering edit mode

It depends on the use-case. Perl and Python are scripting languages, better suited to interacting with the shell. Java and C are compiled languages, better suited for processing large amounts of data quickly. It's useful to know one of each.

ADD REPLY
0
Entering edit mode
9.4 years ago
PoGibas 5.1k

Quick and dirty.

cat dummy.fasta
>one
AAAAAAAAAAAAAAAAAAAA
>two
CCCCCCCCCCCCCCCCCCCC
>tree
GGGGGGGGGGGGGGGGGGGG
>four
TTTTTTTTTTTTTTTTTTTT

cat dummy.ID
two

grep '>' dummy.fasta | grep -v -F -f dummy.ID - | grep -A1 -w -F -f - dummy.fasta | grep -v '^--'
>one
AAAAAAAAAAAAAAAAAAAA
>tree
GGGGGGGGGGGGGGGGGGGG
>four
TTTTTTTTTTTTTTTTTTTT
ADD COMMENT
0
Entering edit mode

But how does this deal with multi-line sequences?

ADD REPLY
0
Entering edit mode

It doesn't, I just wanted to show the way how to solve this problem using basic unix commands. If OP has multi-line fasta (that wasn't mentioned) there are great answers here on biostars how to convert multi-line fasta to single line.

ADD REPLY
0
Entering edit mode

Ah. On a side note, have you tried bioawk? It is awesome in dealing with '>' as the delimiter for FASTA input :)

ADD REPLY
0
Entering edit mode

Some time ago there was thread on multi-line vs single line fasta: How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns :)

ADD REPLY
0
Entering edit mode

Yes, but multi-line FASTA exists and needs to be accounted into most solutions. Which is why I usually go with Bio- packages and let them deal with these quirks!

ADD REPLY
0
Entering edit mode

Thanks Pgibas, but actually I'm looking a way to do it in perl... :)

ADD REPLY

Login before adding your answer.

Traffic: 1349 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6