Question: Warning encountered while transcript abundance estimation using stringtie
0
gravatar for Vijay Lakhujani
15 months ago by
Vijay Lakhujani2.8k
India
Vijay Lakhujani2.8k wrote:

I am trying to run stringtie for transcript abundance estimation using below command:

for i in *.sorted.bam; do sample_name=`echo $i| awk -F "." '{print $1}'`; stringtie -e -B -p 60 -G stringtie-merged.gtf -o $sample_name.gtf $i;done

Warning encountered:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences.

The assembly was done using below commands:

#stringtie assembly

for i in *sorted.bam; do

sample_name=`echo $i| awk -F "." '{print $1}'`

date && time stringtie -p 60 -G reference.gff3 -o $sample_name.gtf $i -A $sample_name.gene.adundance -C $sample_name.fullycovered 

done

#merging assemblies
date && time stringtie --merge -p 60 -G reference.gff3 -o stringtie-merged.gtf merge_list.txt

where merge_list.txt contains the list of the gtf files produces as an output from stringtie asembly.

abundance rna-seq stringtie • 1.2k views
ADD COMMENTlink modified 6 months ago by geo.pertea60 • written 15 months ago by Vijay Lakhujani2.8k

Have you checked your gff3 file looks valid? Also was this the reference.gff3 used in your original alignment?

ADD REPLYlink written 15 months ago by andrew.j.skelton735.1k

Have you checked your gff3 file looks valid?

gffread -E reference.gff3

GFF: discarding unrecognized feature "chromosome" ID=chromosome:1
GFF: discarding unrecognized feature "chromosome" ID=chromosome:10
GFF: discarding unrecognized feature "chromosome" ID=chromosome:10_random
GFF: discarding unrecognized feature "chromosome" ID=chromosome:11
GFF: discarding unrecognized feature "chromosome" ID=chromosome:11_random

Additionally, no problem found with the stringtie-merged.gtf file

Also was this the reference.gff3 used in your original alignment?

Alignment was ran using hisat2 without using the gff file.

ADD REPLYlink modified 15 months ago • written 15 months ago by Vijay Lakhujani2.8k

Are the entries in the gff file synonymous with the annotation of the fasta files you used to build the HISAT index?

ADD REPLYlink written 15 months ago by andrew.j.skelton735.1k

Yes, absolutely! I confirmed the same by fetching the sequence from the reference fasta file from the gf file using bedtools getfasta

ADD REPLYlink written 15 months ago by Vijay Lakhujani2.8k

Are your GFF3 transcript lines noted as mRNA or as transcript?

ADD REPLYlink written 15 months ago by Macspider2.4k

"transcript"

#!genome-build-accession GCA_000003745.2
1       Genoscope       chromosome      1       23037639        .       .       .       ID=chromosome:1;Alias=FN597015.1
###
1       iggp    gene    10731   27597   .       +       .       ID=gene:VIT_01s0011g00010;biotype=protein_coding;description=Putative uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:F6HF07];gene_id=VIT_01s0011g00010;logic_name=iggp12x.v1
1       iggp    mRNA    10731   27597   .       +       .       ID=transcript:VIT_01s0011g00010.t01;Parent=gene:VIT_01s0011g00010;biotype=protein_coding;transcript_id=VIT_01s0011g00010.t01
1       iggp    exon    10731   10945   .       +       .       Parent=transcript:VIT_01s0011g00010.t01;Name=VIT_01s0011g00010.t01.exon1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=VIT_01s0011g00010.t01.exon1;rank=1
1       iggp    five_prime_UTR  10731   10945   .       +       .       Parent=transcript:VIT_01s0011g00010.t01
1       iggp    five_prime_UTR  12798   12836   .       +       .       Parent=transcript:VIT_01s0011g00010.t01
1       iggp    exon    12798   12859   .       +       .       Parent=transcript:VIT_01s0011g00010.t01;Name=VIT_01s0011g00010.t01.exon2;constitutive=1;ensembl_end_phase=2;ensembl_phase=-1;exon_id=VIT_01s0011g00010.t01.exon2;rank=2
1       iggp    CDS     12837   12859   .       +       0       ID=CDS:VIT_01s0011g00010.t01;Parent=transcript:VIT_01s0011g00010.t01;protein_id=VIT_01s0011g00010.t01
ADD REPLYlink written 15 months ago by Vijay Lakhujani2.8k

They're actually noted as mRNA:

1       iggp    mRNA    10731   27597 ....

In the 9th field you have:

ID=transcript:VIT_01s0011g00010.t01;Parent=gene:VIT_01s0011g00010;biotype=protein_coding;transcript_id=VIT_01s0011g00010.t01

I don't think that the transcript names are correct in the ID field. They're like transcript:VIT_01s0011g00010.t01 but I assume they should be only VIT_01s0011g00010.t01, because the ":" could disrupt the program's analysis. The same can be said for the gene.

Basically, what you have in the Transcript_ID field should be in the ID field, because that is what these programs read.

ADD REPLYlink modified 15 months ago • written 15 months ago by Macspider2.4k

Should I then give it a try by removing the ":" ? Or the whole thing along with the colon sign?

ADD REPLYlink written 15 months ago by Vijay Lakhujani2.8k

I suggest you edit genes and transcripts in their ID and Parent fields removing the gene: or the transcript: part, like:

ID=gene:VIT_01s0011g00010
ID=transcript:VIT_01s0011g00010.t01;Parent=gene:VIT_01s0011g00010

Become:

ID=VIT_01s0011g00010
ID=VIT_01s0011g00010.t01;Parent=VIT_01s0011g00010
ADD REPLYlink modified 15 months ago • written 15 months ago by Macspider2.4k

Looking at the code here, the error is seen when:

if (guided && no_ref_used) {
        GMessage("WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!\n"
                "Please make sure the -G annotation file uses the same naming convention for the genome sequences.\n");
 }

where guided

if (args.getOpt('G')) {
       guidegff=args.getOpt('G');
       if (fileExists(guidegff.chars())>1)
            guided=true;
       else GError("Error: reference annotation file (%s) not found.\n",
                 guidegff.chars());
}

where no_ref_used

if (gseq_id>=refseqCount) {
                     if (verbose)
                         GMessage("WARNING: no reference transcripts found for genomic sequence \"%s\"! (mismatched reference names?)\n",
                                refseqName);
                 }
                 else no_ref_used=false;
ADD REPLYlink modified 15 months ago • written 15 months ago by Vijay Lakhujani2.8k
0
gravatar for geo.pertea
6 months ago by
geo.pertea60
geo.pertea60 wrote:

The (belated) answer to the original problem reported here can be found in this post: A: stringtie cannot recognize Gencode GTF

ADD COMMENTlink written 6 months ago by geo.pertea60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 470 users visited in the last hour