How To Fix A Tophat "GList Error" That Is Likely Caused By Incorrect Sequence Naming
9.5 years ago
Jirapong ▴ 20

I'm trying to map a sample against the Xenopus laevis genome from Xenbase.org (latest version, 6.0). They provide a GFF3 file and a FASTA file, which look like the following:

FASTA

> 27051543
ATGGCGGATGTGAAGGTCTCGTTCCAGTGCCCAGGCCGGATGTACAGCCCCGCGTGGGTGGCACCTGAGGCGCTGCAGAA
ACGCCCAGAGGATATTAACCGTCGCTCTGCTGACATGTGGAGTTTTGCCGTTCTGCTTTGGGAGCTGGTGACCCGCGAGG
TTCCATTTGCCGACCTCTCAAACATGGAGATTGGCATGAAGGTTTCCCTTGAAGGCCTCCGTCCCACCATCCCCCCCGGG
ATCTCGCCCCATATCTGCAAGTTGATGAAGATTTGTATGAACGAAGACCCTGCCAAGCGACCCAAGTTTGATATGATCGC
CCCCATCCTGGAGAAGATGCAGGAGAAATAA
> 27051545
TTTGGACTGTGCGTGAATTTAAAGAAAGCAGACAAATTCTTCCCGCGTTGCTATAACCTGGCGGATAAAACAGGGAGAAT
GTTATTCACTGATGACTTCATGAAAACTGCAGCGTATAGTATCATAAAATGGGTTGTAACAAGAAACAGTACGCCTATTA
AAGCAGAAGCCAATGTAATTTTAATGGCTTTTATGGTCTGCAAAATGTTCATGATTCCCTCAGTAAATAAGGACATAGAC


GFF3

##gff-version 3
Scaffold100041    JGI_gene    gene    2092    20066    .    +    .    ID=XeXenL6RMv10000001m.g;Name=XeXenL6RMv10000001m.g
Scaffold100041    JGI_gene    mRNA    2092    20066    .    +    .    ID=PAC:27060736;Name=XeXenL6RMv10000001m;pacid=27060736;longest=1;Parent=XeXenL6RMv10000001m.g
Scaffold100041    JGI_gene    five_prime_UTR    2092    2223    .    +    .    ID=PAC:27060736.five_prime_UTR.1;Parent=PAC:27060736;pacid=27060736
Scaffold100041    JGI_gene    five_prime_UTR    2490    2505    .    +    .    ID=PAC:27060736.five_prime_UTR.2;Parent=PAC:27060736;pacid=27060736
Scaffold100041    JGI_gene    CDS    2506    2585    .    +    0    ID=PAC:27060736.CDS.1;Parent=PAC:27060736;pacid=27060736
Scaffold100041    JGI_gene    CDS    4114    4216    .    +    1    ID=PAC:27060736.CDS.2;Parent=PAC:27060736;pacid=27060736
Scaffold100041    JGI_gene    CDS    4370    4449    .    +    0    ID=PAC:27060736.CDS.3;Parent=PAC:27060736;pacid=27060736
Scaffold100041    JGI_gene    CDS    6233    6422    .    +    1    ID=PAC:27060736.CDS.4;Parent=PAC:27060736;pacid=27060736
Scaffold100041    JGI_gene    CDS    7542    7700    .    +    0    ID=PAC:27060736.CDS.5;Parent=PAC:27060736;pacid=27060736


So the GFF3 uses PAC:XXXXXXX as the ID, but the FASTA does not. The Tophat2 mapping step fails with this error:

[samopen] SAM header is present: 43025 sequences.
GList error (GList.hh:981):Invalid list index: 27078510
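
One way to test the suspected naming mismatch is to compare the FASTA header names with the mRNA IDs in the GFF3, both as written and with the "PAC:" prefix removed. The following is a minimal sketch in Python, assuming a tab-separated GFF3 and FASTA headers like those shown above. The function names are made up for illustration, and the second FASTA record in the demo is hypothetical (it does not appear in the excerpt). It only checks the hypothesis stated in this post; it does not show what Tophat itself compares.

```python
def fasta_names(lines):
    """Return the first word of every FASTA header line ('>' lines)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gff3_mrna_ids(lines):
    """Return the ID attribute of every mRNA feature in GFF3 text."""
    ids = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "mRNA":
            continue
        for pair in cols[8].split(";"):
            key, _, value = pair.partition("=")
            if key == "ID":
                ids.add(value)
    return ids


def compare_names(fasta_lines, gff_lines, prefix="PAC:"):
    """Return (matches as written, matches after removing the prefix)."""
    names = fasta_names(fasta_lines)
    ids = gff3_mrna_ids(gff_lines)
    stripped = {i[len(prefix):] if i.startswith(prefix) else i for i in ids}
    return names & ids, names & stripped


# Demo: the first FASTA record is from the post, the second is hypothetical.
fasta = ["> 27051543\n", "ATGGCGGATG\n", "> 27060736\n", "ATGGCGGATG\n"]
gff = [
    "##gff-version 3\n",
    "Scaffold100041\tJGI_gene\tmRNA\t2092\t20066\t.\t+\t.\t"
    "ID=PAC:27060736;Name=XeXenL6RMv10000001m;pacid=27060736;"
    "longest=1;Parent=XeXenL6RMv10000001m.g\n",
]
as_is, without_prefix = compare_names(fasta, gff)
print(sorted(as_is), sorted(without_prefix))
```

If the first set is empty and the second is not, the prefix is at least one difference between the two files.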


When I tried to convert the GFF3 to GTF, I got the following error:

Can't locate object method "display_text" via package "Bio::Annotation::SimpleValue" at /usr/local/share/perl5/Bio/SeqFeature/Annotated.pm line 703, <GEN0> line 2.

The conversion code looks like this:

#!/usr/bin/perl

use strict;
use warnings;
use lib '/local/ensembl/bioperl-live';
use Bio::FeatureIO;

# Read the GFF3 annotation and write every feature back out as GTF.
my $in = Bio::FeatureIO->new(
    -file   => "/tmp/Simbiot_HSS/index/Scaffold10.nucleotide.gff3",
    -format => 'GFF',
);
my $out = Bio::FeatureIO->new(
    -file   => ">/tmp/Simbiot_HSS/index/test.gtf",
    -format => 'GTF',
);

while ( my $feature = $in->next_feature() ) {
    $out->write_feature($feature);
}

exit(0);


Am I missing something?

gff fasta tophat2
9.5 years ago

First and foremost, please note that a GTF file is not the same as a GFF file, so that is one possible problem.

Then, if all you need is to transform a file from GFF to GTF while removing the PAC prefix, you should post a question on just that, without even mentioning Tophat.
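
For illustration only, here is a rough Python sketch of that kind of transformation, assuming a tab-separated GFF3 laid out like the excerpt in the question (gene, then mRNA, then child features whose Parent is the mRNA ID). The helper names are hypothetical. It emits only the child features (CDS, UTR and so on) with `gene_id` and `transcript_id` attributes. Whether the result satisfies Tophat has not been tested here; for example, Tophat may expect exon features, which the excerpt does not contain.

```python
def strip_pac(value, prefix="PAC:"):
    """Remove the leading 'PAC:' from an ID, if present."""
    return value[len(prefix):] if value.startswith(prefix) else value


def parse_attrs(field):
    """Turn 'key=value;key=value' into a dict."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def gff3_to_gtf(lines):
    """Convert child features of mRNAs to GTF lines (a sketch, not a full converter)."""
    rows = [l.rstrip("\n").split("\t") for l in lines
            if l.strip() and not l.startswith("#")]
    rows = [r for r in rows if len(r) == 9]

    # First pass: remember which gene each transcript belongs to.
    gene_of = {}
    for r in rows:
        if r[2] == "mRNA":
            attrs = parse_attrs(r[8])
            gene_of[attrs["ID"]] = attrs.get("Parent", attrs["ID"])

    # Second pass: rewrite column 9 for features whose Parent is a known mRNA.
    out = []
    for r in rows:
        parent = parse_attrs(r[8]).get("Parent")
        if parent not in gene_of:
            continue  # gene rows, mRNA rows, and orphans are skipped
        attr = 'gene_id "%s"; transcript_id "%s";' % (gene_of[parent], strip_pac(parent))
        out.append("\t".join(r[:8] + [attr]))
    return out


# Demo with two lines taken from the GFF3 excerpt in the question.
gff = [
    "##gff-version 3\n",
    "Scaffold100041\tJGI_gene\tmRNA\t2092\t20066\t.\t+\t.\t"
    "ID=PAC:27060736;Name=XeXenL6RMv10000001m;pacid=27060736;"
    "longest=1;Parent=XeXenL6RMv10000001m.g\n",
    "Scaffold100041\tJGI_gene\tCDS\t2506\t2585\t.\t+\t0\t"
    "ID=PAC:27060736.CDS.1;Parent=PAC:27060736;pacid=27060736\n",
]
for line in gff3_to_gtf(gff):
    print(line)
```

The two-pass layout is used because a CDS line names only its transcript; the gene name has to be looked up from the mRNA line.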

To avoid delays, the best approach is to first make sure that the process would work if you had a GTF file with the correct names. To do that, create a copy containing only the first few lines of the GFF and edit them manually into a GTF file with the correct names. Then run the pipeline on this data.
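
Making the small test copy can be scripted. This is a minimal Python sketch, assuming the annotation has already been read into a list of lines; `head_of_gff` is a made-up helper name and the default of five features is arbitrary. It keeps the leading `#` directives plus the first few feature lines, which can then be edited by hand.

```python
def head_of_gff(lines, n_features=5):
    """Keep '#' lines and the first n_features feature lines, in order."""
    kept = []
    seen = 0
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
        elif line.strip():
            if seen == n_features:
                break  # enough features collected
            kept.append(line)
            seen += 1
    return kept


# Demo with shortened placeholder feature lines (not real annotation).
sample = ["##gff-version 3\n", "feature 1\n", "feature 2\n", "feature 3\n"]
print(head_of_gff(sample, n_features=2))
```

In practice the input would come from something like `open(path).readlines()` on the GFF3 file, and the result would be written to a new file before editing.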


@Istvan Thank you very much. Tophat itself picks up the GFF file. I only provide the prefix path, like "/tmp/Simbiot_HSS/index/Scaffold10.nucleotide", and it automatically picks up the GFF (maybe version 2 or 3). I also tried to convert that GFF3 to GTF but got an error; see above.