Question: Extract Cds Fastas From A Gff Annotation + Reference Sequence
15
gravatar for Federico Giorgi
7.4 years ago by
Columbia University
Federico Giorgi540 wrote:

Dear all,

my task is very simple, yet I was wondering if some biop* library already exists to carry it out. I have a gff annotation file, e.g.

chr10    JIGSAWGAZE    gene    17991    30631    .    +    .    ID=VIT_10s0116g00040
chr10    JIGSAWGAZE    mRNA    17991    30631    .    +    .    ID=VIT_10s0116g00040.t01;Parent=VIT_10s0116g00040
chr10    JIGSAWGAZE    UTR    17991    18031    .    +    .    Parent=VIT_10s0116g00040.t01
chr10    JIGSAWGAZE    CDS    18032    18049    .    +    .    Parent=VIT_10s0116g00040.t01
chr10    JIGSAWGAZE    CDS    18300    18378    .    +    .    Parent=VIT_10s0116g00040.t01
chr10    JIGSAWGAZE    CDS    30600    30631    .    +    .    Parent=VIT_10s0116g00040.t01
chr10    JIGSAWGAZE    gene    31111    32814    .    -    .    ID=VIT_10s0116g00060
chr10    JIGSAWGAZE    mRNA    31111    32814    .    -    .    ID=VIT_10s0116g00060.t01;Parent=VIT_10s0116g00060
chr10    JIGSAWGAZE    UTR    31111    31290    .    -    .    Parent=VIT_10s0116g00060.t01
chr10    JIGSAWGAZE    CDS    31291    31835    .    -    .    Parent=VIT_10s0116g00060.t01
chr10    JIGSAWGAZE    CDS    31897    31968    .    -    .    Parent=VIT_10s0116g00060.t01
chr10    JIGSAWGAZE    CDS    32106    32814    .    -    .    Parent=VIT_10s0116g00060.t01

And a genome fasta, containing the chromosome sequences. I need to extract the full length, intron-free CDSs, one per gene. The header of the fasta should be the gene id itself. Is there an automated script that does this? Thank you! :-)

Edit: this is the best solution so far:

gff2fasta.pl

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::Fasta;

$| = 1;    # Flush output
my $outfile_cds = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].cds.fasta" );
my $outfile_pep = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].pep.fasta" );
my $outfile_cdna = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].cdna.fasta" );
my $outfile_gene = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].gene.fasta" );
my $outfile_upstream3000 = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].upstream3000.fasta" );

###### Output type description ######
# cds - translated sequence (starting with ATG and ending with a stop codon included)
# cdna - transcribed sequence (devoid of introns, but containing untranslated exons)
# protein - cds translated (includes a * as the stop codon)
# gene - the entire gene sequence (including UTRs and introns)
# upstream3000 - the 3000 upstream region of the gene (likely including the promoter)

### First, index the genome
my $file_fasta = $ARGV[0];
my $db = Bio::DB::Fasta->new($file_fasta);
print ("Genome fasta parsed\n");



### Second, parse the GFF3
my %CDS;
my %CDNA;
my $mRNA_name;
my $frame;
open GFF, "<$ARGV[1]" or die $!;
while ( my $line = <GFF> ) {
    chomp $line;
    my @array = split( "\t", $line );
    my $type = $array[2];

    if ($type eq 'gene') {
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        my $gene_name = $attrs[0];
        my $gene_start = $array[3];
        my $gene_end = $array[4];
        my $gene_seq = $db->seq( $array[0], $gene_start, $gene_end );
        my $output_gene = Bio::Seq->new(
            -seq        => $gene_seq,
            -id         => $gene_name,
            -display_id => $gene_name,
            -alphabet   => 'dna',
        );


        # The upstream 3000
        my $upstream_start;
        my $upstream_end;
        if($array[6] eq '+') {
            $upstream_start=$gene_start-3000;
            $upstream_end=$gene_start-1;
        }
        elsif ($array[6] eq '-') {
            $upstream_start=$gene_end+1;
            $upstream_end=$gene_end+3000;
        }
        my $upstream_seq = $db->seq( $array[0], $upstream_start, $upstream_end );
        my $output_upstream3000 = Bio::Seq->new(
            -seq        => $upstream_seq,
            -id         => $gene_name."_upstream3000",
            -display_id => $gene_name."_upstream3000",
            -alphabet   => 'dna',
        );

        # Reverse Complement if the frame is minus
        if($array[6] eq '+') {
        }
        elsif ($array[6] eq '-') {
            $output_gene = $output_gene->revcom();
            $output_upstream3000 = $output_upstream3000->revcom();
        }
        else {
            die "Unknown frame! At line $. of the GFF\n";
        }
        $outfile_gene->write_seq($output_gene);
        $outfile_upstream3000->write_seq($output_upstream3000);
    }


    if ( ( $type eq 'mRNA' ) and ( $. > 2 ) ) {
        # CDS: Collect CDSs and extract sequence of the previous mRNA
        my $mergedCDS_seq;
        # WARNING we must sort by $cds_coord[1]


        foreach my $key (sort {$a <=> $b} keys %CDS) { # Ascending numeric sort of the starting coordinate
            my $coord = $CDS{$key};
            my @cds_coord = split( " ", $coord );
            my $cds_seq = $db->seq( $cds_coord[0], $cds_coord[1], $cds_coord[2] );
            $mergedCDS_seq .= $cds_seq;
        }

        my $output_cds = Bio::Seq->new(
            -seq        => $mergedCDS_seq,
            -id         => $mRNA_name,
            -display_id => $mRNA_name,
            -alphabet   => 'dna',
        );
        if ($frame eq '-') {
            $output_cds = $output_cds->revcom();
        }
        my $output_pep = $output_cds->translate();
        $outfile_cds->write_seq($output_cds);
        $outfile_pep->write_seq($output_pep);


        # CDNA: Collect UTRs and CDSs and extract sequence of the previous mRNA
        my $mergedCDNA_seq;
        foreach my $key (sort {$a <=> $b} keys %CDNA) { # Ascending numeric sort of the starting coordinate
            my $coord = $CDNA{$key};
            my @cds_coord = split( " ", $coord );
            my $cds_seq = $db->seq( $cds_coord[0], $cds_coord[1], $cds_coord[2] );
            $mergedCDNA_seq .= $cds_seq;
        }

        my $output_cdna = Bio::Seq->new(
            -seq        => $mergedCDNA_seq,
            -id         => $mRNA_name,
            -display_id => $mRNA_name,
            -alphabet   => 'dna',
        );
        if ($frame eq '-') {
            $output_cdna = $output_cdna->revcom();
        }
        $outfile_cdna->write_seq($output_cdna);



        # Now initialize the next mRNA
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        $mRNA_name = $attrs[0];
        $frame=$array[6];
        %CDS = (); %CDNA = (); # Empty the chunk arrays
    }
    elsif ( $type eq 'mRNA' ) {    # First mRNA
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        $mRNA_name = $attrs[0];
        $frame=$array[6];
    }
    elsif ( $type eq 'CDS' ) {
        my $cds_coord = $array[0] . " " . $array[3] . " " . $array[4];
        $CDS{$array[3]}=$cds_coord;
        $CDNA{$array[3]}=$cds_coord;
    }
    elsif ($type eq 'UTR' ) {
        my $utr_coord = $array[0] . " " . $array[3] . " " . $array[4];
        $CDNA{$array[3]}=$utr_coord;
    }

}

close GFF;

Federico

gff cds • 30k views
ADD COMMENTlink modified 3.1 years ago by sutturka150 • written 7.4 years ago by Federico Giorgi540

The script is good, but needs adjustment when the 'phase' (8-th column of GFF3) is non-zero. http://www.sequenceontology.org/gff3.shtml

I found a script gff3_file_to_proteins.pl in the EVidenceModeler package.

ADD REPLYlink modified 28 days ago by RamRS24k • written 4.8 years ago by fumihiko0

The gff2fasta.pl has a bug: the last gene record in gff3 file won't be output. Because it doesn't have the next gene.

ADD REPLYlink written 4.2 years ago by kevinchjp10

This is great.  I have created a github version of this gff2fasta.pl script and made some modifications to write only sequences that have non-zero length. I will be adding an output for exons as well. 

https://github.com/ISUgenomics/common_scripts/blob/master/gff2fasta.pl

ADD REPLYlink written 4.1 years ago by severin0
8
gravatar for Malcolm.Cook
7.4 years ago by
Malcolm.Cook1.1k
kansas, usa
Malcolm.Cook1.1k wrote:

BioPerl's Bio::DB::SeqFeatureStore module can parse the GFF into hierarchical gene models, provide random access to the fasta (which is indexed on demand).

What remains is to write a function to splice an mRNA's CDS.

This yields a substantially shorter script, one whose logic has (hopefully) been tested/proved during its development, and that can take an Bio::DB::SeqFeatureStore compatible format in addition to gff and fasta.

The new script, called like this (I downloaded chromosome 10 of grape genome for testing against your GFF)

 gff2cdsfa.pl -fasta t/VV10.gdna -gff t/t1.gff

prints output like:

>VIT_10s0116g00040.t01
ATGCCGTTTTCGTTTTTGATTGGGGTTGGAATTTTGCTCATGAATTGGTTTGACATTTGG
AGACAGACGTTGAATAAAGCTTGGGAATCTGATCTTGGATTGTTGCATCTTCAGGACATC
TTATCTTAAATGCCGTTTTCGTTTTTG
>VIT_10s0116g00060.t01
TAATAAATCGATATTCAAAATCATTCAATTCGAGATCTCTCGCTCTATCAAAGCATCGGA
TTTCTAAATTCTCATACGGCCCTCCCGGAATTCCTTTCAGAGCCTGTTGAATAATTTTTA
TAAATGCCATCATTTCACCAATTCGTACTAAATAACGAGATAATGAATCTCCTTCTTTTT
GCCATTGGACTTCCCAATCAAATTCGTCATAACACTCATAATGATCAACTTTAAGAAGAT
CCCATTGTATTCCGGAAGCTTGTAGCATTGGTCCTGACAAACCCCAATTTATTGCTTCTT
CTACACCAATAATGCCTACTCCCTCAACTCGTTCTAAAAAAATAGGATTACGCATAATCA
GTTTTTGATATTCAGCAACCCCTGTTAAAAAATAATCGCAGAAATCCAAACATTTATCTA
TCCATCCATGCGGTAGATCAACAACTACTCCTCCTATACGAAAATAATTATGCATCATTC
TCATACCGGTGGTAGCTTCGAATAGATCATATACTAACTCTCTTTCTCTGAAAATATAGA
AGAAAGGAGTCTGTACACCAATATCTGCCATAAAAGGGCCAAGCCATAATAAATGAGAAG
CTATACGACTCAACTCCAACATAATCACTCTAATATAGCTGGCTCTTTTAGGTACTTGAA
TATTTCCCAATTGCTCTGGTGCATTTGCGGTTATTGCTTCTGTGAACATCTAGTATCGTC
ATAATGTCAGCCAATTTCATTCTTTTAACTAACTGAGGAAGAATTTGCAAATTGATAAAA
CTCAATGAATCGCATGTAGAGATATTGATAACATGTATAAAGTTAATGGTATTTCATAAC
TAATTGATTGAGCAGCAACTCGTAGACCACCTAAAAAGGAATATTTATTATTTGATCCAT
ATCCTAACATAAGAAATCCAATGGGAGCAATACATGAAATGGCGATCCATAAAAAAACAC
CGATATTGAGATCGGATAGAACAAGGTTAGAGCCAAAAGGAATTATTAAATAACTTAGTA
GAATTGATATGACTGCTATGGATGGTCCGATACTGAATAAATGAGTATCTCCTCTAGATG
GAAGAAGATTCTCTTTGAAAAGTAGTTTTGTGCCATCTGCTAGAGCTTGAAGAATTCCCA
AAGGACCGGCATATTCAGGTCCAATACGTTGTTGTATCCCTGCGGATATTTCTCTTTCTA
ACCACACAATTGATAGTACACCTATTATGATTCCCAATACAGGAGTAAAAATAGGTACAA
GCATCCATATGATCCCATAAACCTCTTTTAAGTATTCCAATTGGGAAAAAGAATTGATAG
CTTGTA

Here's the code:

#!/usr/bin/env perl
# PURPOSE: for each mRNA declared in -gff, write to STDOUT its spliced CDS sequence taken from -fasta
# USAGE: gff2cdsfa.pl -fasta t/VV10.fa -gff t/t1.gff
# AUTHOR: malcolm.cook@gmail.com

use strict;
use warnings;

use Bio::SeqIO;
use Bio::DB::SeqFeature::Store;
use Bio::SeqFeatureI;
sub Bio::SeqFeatureI::spliced_cds {
    ## PURPOSE: for a (presumed) mRNA SeqFeature, create new Bio::Seq
    ## holding its spliced CDS, reverse complemented as needed, ided
    ## after the display_id, if available, else the load_id.
    my ($self,$db) = @_;
    my @CDS = $self->CDS;
    @CDS=sort {$a cmp $b} @CDS;    # don't depend upon input GFF lines being sorted 
    my $seqstr;
    $seqstr .= $_->seq->seq for @CDS;    
    my $seq = Bio::Seq->new( -id => $self->display_id || $self->load_id,
                 -seq => $seqstr);
    $self->strand == -1 ? $seq->revcom : $seq;
}

my %ARGV=@ARGV;
$ARGV{-adaptor}||='memory';
my $db = Bio::DB::SeqFeature::Store->new(%ARGV) or die $!;

my $ss = $db->get_seq_stream(-type =>  'mRNA');
my $seqout = Bio::SeqIO->new(-format=>'fasta');
while (my $mRNA = $ss->next_seq) {
    $seqout->write_seq($mRNA->spliced_cds)
}

0;
ADD COMMENTlink written 7.4 years ago by Malcolm.Cook1.1k

Dear Malcolm, thanks a lot! Just to be sure, because I see here some wrong genes (no ATG), where did you download the reference vitis fasta from?

ADD REPLYlink written 7.4 years ago by Federico Giorgi540
1

I took from http://www.plantgdb.org/ but I probably took the old assembly. I was just guessing anyway. Where should it come from?

ADD REPLYlink modified 7.4 years ago • written 7.4 years ago by Malcolm.Cook1.1k

Wouldn't it be easier and less error-prone to just use Bio::SeqFeature::spliced_seq on each RNA feature? edit: it would but it (still) doesn't work, see the other answer.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Michael Dondrup46k
6
gravatar for Federico Giorgi
7.4 years ago by
Columbia University
Federico Giorgi540 wrote:

And here's the answer!

Usage:

perl gff2fasta.pl file.fasta file.gff3 output

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::Fasta;

$| = 1;    # Flush output
my $outfile_cds = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].cds.fasta" );
my $outfile_pep = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].pep.fasta" );



### First, index the genome
my $file_fasta = $ARGV[0];
my $db                   = Bio::DB::Fasta->new($file_fasta);
print ("Genome fasta parsed\n");



### Second, parse the GFF3
my @mRNA;
my $mRNA_name;
my $frame;
open GFF, "<$ARGV[1]" or die $!;
while ( my $line = <GFF> ) {
    chomp $line;
    my @array = split( "\t", $line );
    my $type = $array[2];
    next if ( $type eq 'gene' or $type eq 'UTR' );

    if ( ( $type eq 'mRNA' ) and ( $. > 2 ) ) {
        # Collect CDSs and extract sequence of the previous mRNA
        my $mRNA_seq;
        foreach my $coord (@mRNA) {
            my @cds_coord = split( " ", $coord );
            my $cds_seq = $db->seq( $cds_coord[0], $cds_coord[1], $cds_coord[2] );
            $mRNA_seq .= $cds_seq;
        }

        my $output_nucleotide = Bio::Seq->new(
            -seq        => $mRNA_seq,
            -id         => $mRNA_name,
            -display_id => $mRNA_name,
            -alphabet   => 'dna',
        );
        if ($frame eq '-') {
            $output_nucleotide = $output_nucleotide->revcom();
        }
        my $output_protein = $output_nucleotide->translate();
        $outfile_cds->write_seq($output_nucleotide);
        $outfile_pep->write_seq($output_protein);

        # Now initialize the next mRNA
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        $mRNA_name = $attrs[0];
        $frame=$array[6];
        @mRNA = (); # Empty the mRNA
    }
    elsif ( $type eq 'mRNA' ) {    # First mRNA
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        $mRNA_name = $attrs[0];
        $frame=$array[6];
    }
    elsif ( $type eq 'CDS' ) {
        my $cds_coord = $array[0] . " " . $array[3] . " " . $array[4];
        push( @mRNA, $cds_coord );
    }
}

close GFF;

EDIT: somehow the forum parser changes the code. I mean < GFF > at line 26, not [HTML] of course.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by Federico Giorgi540

Hi Federico, I am new to this and trying to extract the longest CDS in the right reading frame using gff3 file. This script works great to extract CDS. However, I am not able to tell if it is extracting the longest transcript or any one transcript for every gene. I see *s in the corresponding peptide sequences some of which I am guessing are stop codons. Can you guide me as to how I could use this code to extract the longest CDS starting with ATG and ending in stop codon? (some of the sequences generated now do not begin with ATG) Thank you.

ADD REPLYlink modified 20 months ago • written 20 months ago by shreyasibiswas8830
4
gravatar for Christian
7.4 years ago by
Christian2.8k
Cambridge, US
Christian2.8k wrote:

I hoped to squeeze this problem into only a few lines of code using Bio::DB::SeqFeature::Store and Bio::DB::SeqFeature:

use Bio::DB::SeqFeature::Store;
my $db = Bio::DB::SeqFeature::Store->new(-adaptor => 'memory', -gff => '/dir/containing/both/gff/and/fasta/');
foreach my $tr ($db->get_features_by_type('mRNA')) { 
    print ">",$tr->load_id,"\n",$tr->spliced_seq->seq,"\n"; 
}

Produces the following output:

------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Abstract method "Bio::SeqFeatureI::entire_seq" is not implemented by package Bio::DB::SeqFeature.
This is not your fault - author of Bio::DB::SeqFeature should be blamed!

STACK: Error::throw
STACK: Bio::Root::Root::throw /Bio/Root/Root.pm:472
STACK: Bio::Root::RootI::throw_not_implemented /Bio/Root/RootI.pm:748
STACK: Bio::SeqFeatureI::entire_seq /Bio/SeqFeatureI.pm:324
STACK: Bio::SeqFeatureI::spliced_seq /Bio/SeqFeatureI.pm:457
----------------------------------------------------------------

So unfortunately Bio::DB::SeqFeature::spliced_seq() was never properly implemented. A BioPerl developer discussion around this issue between Chris Fields and Jason Stajich dates back two years ago and can be found here:

http://bioperl.org/pipermail/bioperl-l/2010-March/032608.html

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by Christian2.8k

...and depending on how/if it ever gets implemented, it might well have spliced in the UTR... which is why i implemented spliced_cds, above.

ADD REPLYlink written 7.4 years ago by Malcolm.Cook1.1k
4
gravatar for mister.neopolitan
6.0 years ago by
mister.neopolitan40 wrote:

Bumping an old thread because I just had the same problem.

The reasons the grape sequences didn't look right in Malcolm's example is the CDS segments weren't being sorted properly, resulting in a concatenation of incorrectly ordered CDS segments.

I made my own fasta and gff for testing. Gff file:

Test    blat    gene    1    280    1.0000    +    .    ID=Test0001;Name=Test0001
Test    blat    mRNA    79    235    1.0000    +    .    ID=Test0001;Name=Test0001
Test    blat    CDS    79    96    100    +    .    Parent=Test0001
Test    blat    CDS    118    134    100    +    .    Parent=Test0001
Test    blat    CDS    163    188    100    +    .    Parent=Test0001
Test    blat    CDS    218    235    100    +    .    Parent=Test0001
Test    blat    gene    1    280    1.0000    -    .    ID=Test0002;Name=Test0002
Test    blat    mRNA    79    235    1.0000    -    .    ID=Test0002;Name=Test0002
Test    blat    CDS    79    96    100    -    .    Parent=Test0002
Test    blat    CDS    118    134    100    -    .    Parent=Test0002
Test    blat    CDS    163    188    100    -    .    Parent=Test0002
Test    blat    CDS    218    235    100    -    .    Parent=Test0002

Fasta:

>Test
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

I cannibalized Malcolm's script. The result is a bit messy, but I found this worked better on my test pieces:

#!/usr/bin/perl -w
# PURPOSE: for each gene declared in -gff, write its spliced CDS sequence taken from -fasta
# USAGE: gff2cdsv2.pl file.fa file.gff

use strict;
use warnings;

use File::Basename;
use Bio::SeqIO;
use Bio::Seq;
use Bio::DB::SeqFeature::Store;
use Bio::SeqFeatureI;

my $fasta = $ARGV[0];
my $outfile = basename($fasta, ".fa") . "_cds.fa"; #formats name of infile for outfile

my $gff = $ARGV[1];

open (OUT, ">$outfile") or die "Can't open file for writing: $!"; 

my $db = Bio::DB::SeqFeature::Store->new(    -adaptor => 'memory',
                                            -fasta => $fasta,
                                            -gff => $gff) or die $!; # This is where gff and fasta are loaded

my $ss = $db->get_seq_stream(-type => 'gene'); #looks at gff, find genes and their associated information

while (my $gene = $ss->next_seq) #while looking at each gene
{
    print OUT ">" , $gene -> display_name , "\n"; #fasta id for CDS

    #print OUT $gene -> start , "\t", $gene -> end , "\t", $gene -> strand, "\n"; #test to see if script was reading gff correctly

    my @segments = $gene -> segments(-type => 'CDS');

    my %START;
    my %END;
    foreach my $segment(@segments)
    {
        $START{$segment} = $segment -> start; #create hash of CDS id and start nucleotide
        $END{$segment} = $segment -> end; #create hash of CDS id and end nucleotide
    }

    my @CDS;
    foreach my $segment (sort{$START{$a} <=> $START{$b}} keys %START) #sorts CDS segments by order of starting nucleotide
    {
        my $cds_for_seq = $db->fetch_sequence($gene -> seq_id , $START{$segment}, $END{$segment}); #extracts CDS segment sequence
        push(@CDS, $cds_for_seq); #puts CDS segment sequence to end of array

        #print OUT $segment , "\t" , $START{$segment} , "\t", $END{$segment} ,"\n"; #test to see if CDS segments are ordered correctly
        #print OUT $cds_for_seq , "\n";
    }

    my $CDS_seq = join('', @CDS); #concatenate array of CDS segment sequences

    my $for_obj = Bio::Seq -> new(-seq => $CDS_seq , -alphabet => 'dna'); #load sequence string as sequence object
    my $rev_obj = $for_obj -> revcom; #reverse complement sequence object
    my $rev_seq = $rev_obj -> seq; #convert sequence object to sequence string

    if ($gene -> strand == +1) #check if gene was on plus strand
    {
        print OUT $CDS_seq , "\n";
    }
    elsif ($gene -> strand == -1) #check if gene was on minus strand
    {
        print OUT $rev_seq , "\n";
    }
}

exit;

I haven't tested this on gff files with multiple mRNA predictions per gene. Switching the annotation filter to 'mRNA', rather than 'gene' works on the Brapa genome I've looked at, although it doesn't seem to work on my test gff file.

I still have a few bugs to work out.

ADD COMMENTlink written 6.0 years ago by mister.neopolitan40

I couldn't make it work. I got this error with my own data.

Possible precedence issue with control flow operator at /usr/share/perl5/Bio/DB/IndexedBase.pm line 845.

ADD REPLYlink written 5 weeks ago by dr3lostorage0
0
gravatar for lili3122300
4.5 years ago by
Japan
lili31223000 wrote:

I just want to ask a question, I have the same task as yours, but I can not use Bioperl, so if you see this comment , could you tell me how to deal with this task ?

ADD COMMENTlink written 4.5 years ago by lili31223000

Re-post this as a comment to original question and use bedtools getfasta.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by PoGibas4.8k
0
gravatar for sutturka
3.1 years ago by
sutturka150
USA
sutturka150 wrote:

Try bedtools getfasta : http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

ADD COMMENTlink written 3.1 years ago by sutturka150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1151 users visited in the last hour