Question

Warning encountered while transcript abundance estimation using stringtie

0

Entering edit mode

8.2 years ago

lakhujanivijay 5.9k

I am trying to run stringtie for transcript abundance estimation using below command:

for i in *.sorted.bam; do sample_name=`echo $i| awk -F "." '{print $1}'`; stringtie -e -B -p 60 -G stringtie-merged.gtf -o $sample_name.gtf $i;done

Warning encountered:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences.

The assembly was done using below commands:

#stringtie assembly

for i in *sorted.bam; do

sample_name=`echo $i| awk -F "." '{print $1}'`

date && time stringtie -p 60 -G reference.gff3 -o $sample_name.gtf $i -A $sample_name.gene.adundance -C $sample_name.fullycovered 

done

#merging assemblies
date && time stringtie --merge -p 60 -G reference.gff3 -o stringtie-merged.gtf merge_list.txt

where merge_list.txt contains the list of the gtf files produces as an output from stringtie asembly.

stringtie RNA-Seq abundance • 7.7k views

ADD COMMENT • link updated 7.5 years ago by geo.pertea ▴ 130 • written 8.2 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Have you checked your gff3 file looks valid? Also was this the reference.gff3 used in your original alignment?

ADD REPLY • link 8.2 years ago by andrew.j.skelton73 6.6k

0

Entering edit mode

Have you checked your gff3 file looks valid?

gffread -E reference.gff3

GFF: discarding unrecognized feature "chromosome" ID=chromosome:1
GFF: discarding unrecognized feature "chromosome" ID=chromosome:10
GFF: discarding unrecognized feature "chromosome" ID=chromosome:10_random
GFF: discarding unrecognized feature "chromosome" ID=chromosome:11
GFF: discarding unrecognized feature "chromosome" ID=chromosome:11_random

Additionally, no problem found with the stringtie-merged.gtf file

Also was this the reference.gff3 used in your original alignment?

Alignment was ran using hisat2 without using the gff file.

ADD REPLY • link 8.2 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Are the entries in the gff file synonymous with the annotation of the fasta files you used to build the HISAT index?

ADD REPLY • link 8.2 years ago by andrew.j.skelton73 6.6k

0

Entering edit mode

Yes, absolutely! I confirmed the same by fetching the sequence from the reference fasta file from the gf file using bedtools getfasta

ADD REPLY • link 8.2 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Are your GFF3 transcript lines noted as mRNA or as transcript?

ADD REPLY • link 8.2 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

"transcript"

#!genome-build-accession GCA_000003745.2
1       Genoscope       chromosome      1       23037639        .       .       .       ID=chromosome:1;Alias=FN597015.1
###
1       iggp    gene    10731   27597   .       +       .       ID=gene:VIT_01s0011g00010;biotype=protein_coding;description=Putative uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:F6HF07];gene_id=VIT_01s0011g00010;logic_name=iggp12x.v1
1       iggp    mRNA    10731   27597   .       +       .       ID=transcript:VIT_01s0011g00010.t01;Parent=gene:VIT_01s0011g00010;biotype=protein_coding;transcript_id=VIT_01s0011g00010.t01
1       iggp    exon    10731   10945   .       +       .       Parent=transcript:VIT_01s0011g00010.t01;Name=VIT_01s0011g00010.t01.exon1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=VIT_01s0011g00010.t01.exon1;rank=1
1       iggp    five_prime_UTR  10731   10945   .       +       .       Parent=transcript:VIT_01s0011g00010.t01
1       iggp    five_prime_UTR  12798   12836   .       +       .       Parent=transcript:VIT_01s0011g00010.t01
1       iggp    exon    12798   12859   .       +       .       Parent=transcript:VIT_01s0011g00010.t01;Name=VIT_01s0011g00010.t01.exon2;constitutive=1;ensembl_end_phase=2;ensembl_phase=-1;exon_id=VIT_01s0011g00010.t01.exon2;rank=2
1       iggp    CDS     12837   12859   .       +       0       ID=CDS:VIT_01s0011g00010.t01;Parent=transcript:VIT_01s0011g00010.t01;protein_id=VIT_01s0011g00010.t01

ADD REPLY • link 8.2 years ago by lakhujanivijay 5.9k

0

Entering edit mode

They're actually noted as mRNA:

1       iggp    mRNA    10731   27597 ....

In the 9th field you have:

ID=transcript:VIT_01s0011g00010.t01;Parent=gene:VIT_01s0011g00010;biotype=protein_coding;transcript_id=VIT_01s0011g00010.t01

I don't think that the transcript names are correct in the ID field. They're like transcript:VIT_01s0011g00010.t01 but I assume they should be only VIT_01s0011g00010.t01, because the ":" could disrupt the program's analysis. The same can be said for the gene.

Basically, what you have in the Transcript_ID field should be in the ID field, because that is what these programs read.

ADD REPLY • link 8.2 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

Should I then give it a try by removing the ":" ? Or the whole thing along with the colon sign?

ADD REPLY • link 8.2 years ago by lakhujanivijay 5.9k

0

Entering edit mode

I suggest you edit genes and transcripts in their ID and Parent fields removing the gene: or the transcript: part, like:

ID=gene:VIT_01s0011g00010
ID=transcript:VIT_01s0011g00010.t01;Parent=gene:VIT_01s0011g00010

Become:

ID=VIT_01s0011g00010
ID=VIT_01s0011g00010.t01;Parent=VIT_01s0011g00010

ADD REPLY • link 8.2 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

Looking at the code here, the error is seen when:

if (guided && no_ref_used) {
        GMessage("WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!\n"
                "Please make sure the -G annotation file uses the same naming convention for the genome sequences.\n");
 }

where guided

if (args.getOpt('G')) {
       guidegff=args.getOpt('G');
       if (fileExists(guidegff.chars())>1)
            guided=true;
       else GError("Error: reference annotation file (%s) not found.\n",
                 guidegff.chars());
}

where no_ref_used

if (gseq_id>=refseqCount) {
                     if (verbose)
                         GMessage("WARNING: no reference transcripts found for genomic sequence \"%s\"! (mismatched reference names?)\n",
                                refseqName);
                 }
                 else no_ref_used=false;

ADD REPLY • link 8.2 years ago by lakhujanivijay 5.9k

score 0 · Answer 1 · 2018-01-12

0

Entering edit mode

7.5 years ago

geo.pertea ▴ 130

The (belated) answer to the original problem reported here can be found in this post: A: stringtie cannot recognize Gencode GTF

ADD COMMENT • link 7.5 years ago by geo.pertea ▴ 130