Question: Is There Such A Way/ Program To Do This...
0
gravatar for Biogeek
6.8 years ago by
Biogeek400
Biogeek400 wrote:

Hi guys,

I followed the Trinity/ RSEM/ EdgeR pipeline and have produced some FPKM spreadsheets for genes and isoforms.

I am currently analysing the isoforms:

I have 3 sample conditions and I am comparing two conditions.So far I have worked out Log2 fold changes between conditions to get the change in level of expression between 2 samples for all of the isoforms.

I now have a .FASTA file generated from Trinity and I wan't to extract the isoforms I am interested in (top 1000 increases in expression) between 2 samples. I have the list of sequences on Excel. Is there anyway I can pick these out of the .FASTA file and BLAST them against NCBI in one step??

So far I'm using CLCBio to load up my .Fasta file and I am filter searching for the sequences of interest then blasting the lists. As you can imagine you need to copy/paste the isoforms of interest names from excel into CLCBio search filter one by one and this takes a long time.

Ideally I'm looking for a quick and simple step or automated program which will do this for me, or is there anyway I can use the spreadsheet for a program to read off and find the sequences of interest. Copying and pasting one by one is taking too long - especially to copy and paste 1000 sequence names...

If anyone knows of a program or way to do this easier, I would love to know.

Many thanks.

fasta blast rna-seq • 2.0k views
ADD COMMENTlink modified 6.8 years ago by Sam3.3k • written 6.8 years ago by Biogeek400
1

By the way, you really ought to change the Title of your question to something meaningful/informative, such as "retrieving fasta from id list, batch blast". People who know the answer will be much more likely to find your question and answer it, and going forward, the question and answer will have value for those with the same question in the future. (Professional courtesy in the digital age: meaningful titles, meaningful subject lines...)

ADD REPLYlink written 6.8 years ago by seidel7.1k
2
gravatar for Sam
6.8 years ago by
Sam3.3k
New York
Sam3.3k wrote:

You can use the following perl script:

#bin/perl

use strict;

open(SEQLIST, $ARGV[0]);
open(FASTA, $ARGV[1]);

my %index;
for my $line (<SEQLIST>){
    chomp($line);
    $index{$line} = 1;
}
close(SEQLIST);

print {*STDERR} "Sequence Name read\n";

my $ok = 0;
for my $line (<FASTA>){
    chomp($line);
    if($line =~ /^>/){
        my @getName = split(/ /, $line);
        my $name = substr $getName[0], 1;
        if(exists($index{$name})){
            $ok = 1;
            print "$line\n";
        }
        else{
            $ok = 0;
        }
    }
    elsif($ok == 1){
        print "$line\n";
    }
}
close(FASTA);

Where the first argument will be your sequence name file and the second will be the fasta file. Please note that the file should be of the following format:

Sequence list file:
Name1
Name2

Fasta File:
>Name1 descriptions
NNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNN

>Name2 descriptions
NNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNN
ADD COMMENTlink written 6.8 years ago by Sam3.3k
1
gravatar for seidel
6.8 years ago by
seidel7.1k
United States
seidel7.1k wrote:

You can use fastacmd to retrieve the fasta sequences with your list. However, you'll have to make a blastdb out of your fasta file using formatdb and the -o option. Calling fastacmd -i and your list of identifiers that match the entires in your fasta file should work. You can then blast the resulting file in batch using a variety of methods (either via the web, or your command line).

ADD COMMENTlink written 6.8 years ago by seidel7.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1164 users visited in the last hour