RNAseq - GBFF cannot be converted to GTF
Entering edit mode
8 weeks ago

Hello everyone,

I'm pretty new on the genomic and transcriptomic world, so any kind of help is much appreciated. My problem in a nutshell is that I can't use featureCounts with the GBFF file provided in NCBI for Periplaneta americana. I've tried to convert the GBFF file to GFF and GTF but always get error. I can align transcripts to FASTA genome, and generate BAM and SAM files but that's all. Any idea if the GBFF file is corrupt or what I'm doing wrong? Here below a more detailed explanation of the problem and command lines I'm using.

Many thanks!

I'm trying to do a differential expression analysis from RNAseq samples of cockroach Periplaneta americana. I did it before using a de novo transcriptome with Salmon without problems. Nevertheless, I'd like to do the same analysis using the genome of this species (available here, Li et al 2018).

I can align transcripts and generate SAM and BAM files using the genome (FASTA format, I downloaded here and I merged all bits together) with hisat2 and samtools following this tutorial (more or less):

example for a given sample.

hisat2-build genome_fasta.fa index/genome_hisat hisat2 -x index/genome_hisat -1 my_fastafile1 -2 my_fastafile2 -S aligned_reads/aligned_out.sam

samtools view -bS aligned_reads/aligned_out.sam > aligned_reads/aligned_out.bam

samtools sort aligned_reads/aligned_out.bam -o aligned_reads/aligned_out.sorted.bam

samtools index aligned_reads/aligned_out.sorted.bam

samtools flagstat aligned_reads/aligned_out.sorted.bam

samtools idxstats aligned_reads/aligned_out.sorted.bam

Nevertheless, I get stack after this step. I cannot use featureCounts or anything to generate counts. I tried to download the GTF file using NCBI Assembly (here), which downloads some json and jsonl files but no GTF file. So the only option is to convert the GBFF file they provide to GFF. I did it with bp-genbank2gff3.pl, and the resulting file looks a bit weird (see image below). I also tried the conversion from this GFF to GTF file using gffread but gives an empty output file.

Image of GFF file after conversion

I used this command line to generate a GFF file that looks 'better' but it also gives empty output file when converting to GTF.

cut -d';' -f1 GCA_002939525.1_ASM293952v1_genomic.gbff.gff > genome.gtf

GFF file after 'cut' command line

I've tried to run featureCounts with both GFF files but I always get this error:

ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option.. The program has to terminate.

The command line I use is this one, although I also tried to add several options (e.g., -F GFF, -t ID, ... ) but nothing worked:

featureCounts -a genome.gff -o featureCounts/sample1.counts aligned_reads/aligned_out.sorted.bam

Any idea of what I'm doing wrong? Is it even possible to do this analysis with the GBFF file provided or is there some format problem? I can provide a BAM/SAM file or any other file if they can help by email (sorry, they might be too heavy to upload them here).

Thanks a lot!

Genome GBFF GTF NCBI RNAseq • 358 views
Entering edit mode

This is not working because you likely have no gene annotations in your GBFF file. These are simply notifications of gaps in genome. Do you know if the gene models for this organism are done? Without gene, exon designations in the third column of a GFF/GTF file no counting program is going to work.

Entering edit mode

Hi, I was afraid there would be some problem like this. I don't think there's any gene model for this species (I was looking for it now, and nothing came up). Authors that published the genome did a general blast and compared to other insect species genomes, but there are no official annotations.

Actually, I have a question, which may seem a bit silly. If in my analysis with the transcriptome we provide start-end-length of transcript, sequence and annotation, can this be considered a gene model? I guess not, because it's the transcriptome and not the genome, so we cannot provide a proper exon... is it?


Entering edit mode

Within the limits of your transcriptome assembly those can be considered expressed transcripts. You will need to map them back on to the chromosomal assembly and see if you can locate the exon boundaries. This can then be followed up by creating a GTF file with gene models for reference assembly. As you can imagine this would be a lot of work.

If you are happy with your salmon results then just go with that.

Entering edit mode

That was very helpful, thanks a lot!


Login before adding your answer.

Traffic: 1216 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6