Question

Is There Such A Way/ Program To Do This...

0

Entering edit mode

10.1 years ago

Biogeek ▴ 470

Hi guys,

I followed the Trinity/ RSEM/ EdgeR pipeline and have produced some FPKM spreadsheets for genes and isoforms.

I am currently analysing the isoforms:

I have 3 sample conditions and I am comparing two conditions.So far I have worked out Log2 fold changes between conditions to get the change in level of expression between 2 samples for all of the isoforms.

I now have a .FASTA file generated from Trinity and I wan't to extract the isoforms I am interested in (top 1000 increases in expression) between 2 samples. I have the list of sequences on Excel. Is there anyway I can pick these out of the .FASTA file and BLAST them against NCBI in one step??

So far I'm using CLCBio to load up my .Fasta file and I am filter searching for the sequences of interest then blasting the lists. As you can imagine you need to copy/paste the isoforms of interest names from excel into CLCBio search filter one by one and this takes a long time.

Ideally I'm looking for a quick and simple step or automated program which will do this for me, or is there anyway I can use the spreadsheet for a program to read off and find the sequences of interest. Copying and pasting one by one is taking too long - especially to copy and paste 1000 sequence names...

If anyone knows of a program or way to do this easier, I would love to know.

Many thanks.

fasta blast rna-seq • 2.6k views

ADD COMMENT • link updated 10.1 years ago by Sam ★ 4.7k • written 10.1 years ago by Biogeek ▴ 470

1

Entering edit mode

By the way, you really ought to change the Title of your question to something meaningful/informative, such as "retrieving fasta from id list, batch blast". People who know the answer will be much more likely to find your question and answer it, and going forward, the question and answer will have value for those with the same question in the future. (Professional courtesy in the digital age: meaningful titles, meaningful subject lines...)

ADD REPLY • link 10.1 years ago by seidel 11k

score 2 · Answer 1 · 2014-03-03

You can use the following perl script:

#bin/perl

use strict;

open(SEQLIST, $ARGV[0]);
open(FASTA, $ARGV[1]);

my %index;
for my $line (<SEQLIST>){
    chomp($line);
    $index{$line} = 1;
}
close(SEQLIST);

print {*STDERR} "Sequence Name read\n";

my $ok = 0;
for my $line (<FASTA>){
    chomp($line);
    if($line =~ /^>/){
        my @getName = split(/ /, $line);
        my $name = substr $getName[0], 1;
        if(exists($index{$name})){
            $ok = 1;
            print "$line\n";
        }
        else{
            $ok = 0;
        }
    }
    elsif($ok == 1){
        print "$line\n";
    }
}
close(FASTA);

Where the first argument will be your sequence name file and the second will be the fasta file. Please note that the file should be of the following format:

Sequence list file:
Name1
Name2

Fasta File:
>Name1 descriptions
NNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNN

>Name2 descriptions
NNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNN

score 1 · Answer 2 · 2014-03-03

You can use fastacmd to retrieve the fasta sequences with your list. However, you'll have to make a blastdb out of your fasta file using formatdb and the -o option. Calling fastacmd -i and your list of identifiers that match the entires in your fasta file should work. You can then blast the resulting file in batch using a variety of methods (either via the web, or your command line).