Question: stringtie cannot recognize Gencode GTF
dec986 (United States) wrote, 3 months ago:

I aligned many fastq files with HISAT2 to grch38. This proceeded without problems.

In the next step I run StringTie with the Gencode v27 GTF, trying to find novel transcripts and their counts:

stringtie Donor1_IL2OKT3ZA.HISAT2.sort.bam -G /illumina/runs/RNASeq/Gencode27/gencode.v27.annotation.gtf -A try.tab -p 4 > stringtie.out 2> stringtie.err

However, I get this warning:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
Please make sure the -G annotation file uses the same naming convention for the genome sequences.

Why doesn't StringTie recognize the Gencode annotation? Do I have to do something to the Gencode data?

Update: StringTie works with STAR output, but not with the HISAT2 output. Strange.

The third column of StringTie's output table (try.tab), listed with cut -f 3 try.tab | sort | uniq, looks like this:

703404669@ssxfisctimga004:~/RNASeq_benchmark/GSE96075/HISAT2$ cut -f 3 try.tab | sort | uniq
1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9
GL000008.2
GL000009.2
GL000194.1
GL000205.2
GL000214.1
GL000218.1
GL000219.1
GL000220.1
GL000221.1
GL000224.1
KI270442.1
KI270706.1
KI270711.1
KI270713.1
KI270721.1
KI270733.1
KI270734.1
KI270742.1
KI270744.1
KI270745.1
MT
Reference
X
Y
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY

UPDATE: I have found several other instances of this error, but no one ever addressed how to solve this:

https://github.com/gpertea/stringtie/issues/113

Warning encountered while transcript abundance estimation using stringtie

Tags: gencode, rna-seq, stringtie

Did you use the appropriate gencode HISAT2 index during mapping? In other words did you create a HISAT2 index using the gencode reference genome fasta file?

Written 3 months ago by Sinji
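One way to check whether the index and the annotation agree, without re-running the alignment, is to compare the sequence names in the BAM header with those in column 1 of the GTF. The following is a minimal sketch, assuming samtools is installed; the function names and the temporary file names in the comments are made up for illustration.

```shell
# Print the sequence names declared in a SAM/BAM header read on stdin
# (the @SQ lines, field SN:).
sam_header_seqnames() {
  awk -F '\t' '$1 == "@SQ" {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^SN:/) print substr($i, 4)
  }' | sort -u
}

# Print the sequence names used in column 1 of a GTF/GFF read on stdin,
# skipping "#" comment lines.
gtf_seqnames() {
  awk -F '\t' '!/^#/ && NF > 1 { print $1 }' | sort -u
}

# Print names that occur in the first file but not in the second.
# Both files must hold one sorted name per line.
only_in_first() {
  comm -23 "$1" "$2"
}

# Possible usage (bam_names.txt and gtf_names.txt are hypothetical file names):
#   samtools view -H Donor1_IL2OKT3ZA.HISAT2.sort.bam | sam_header_seqnames > bam_names.txt
#   gtf_seqnames < gencode.v27.annotation.gtf > gtf_names.txt
#   only_in_first bam_names.txt gtf_names.txt
```

If the last command prints essentially every name from the BAM header, the two files do not share a naming convention.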

Which option lets me link the Gencode GTF? None of the options I can see offers this.

Written 3 months ago by dec986

Are you sure the path to the Gencode GTF annotation is specified correctly?

Written 3 months ago by EagleEye (modified 3 months ago)

Yes, the GTF file exists and is readable.

Written 3 months ago by dec986

Please paste the output of cut -f 3 yourfile.gff | sort | uniq.

Written 3 months ago by Macspider

@Macspider thanks I've updated the question

Written 3 months ago by dec986

That doesn't look like a GFF file at all. The third column should contain feature types such as:

  • gene
  • mRNA
  • exon
  • intron
  • CDS

or similar.

Written 3 months ago by Macspider

Hi Macspider, I'm getting the same error with the Gencode GFF as with the Gencode GTF.

Written 3 months ago by dec986

Yes, but what you pasted is not at all the third column of a GFF/GTF file!

http://www.ensembl.org/info/website/upload/gff.html#fields

Written 3 months ago by Macspider

Hi Macspider, that's the output from stringtie, not the input. The input GTF and GFF were downloaded from Gencode.

Written 3 months ago by dec986
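Since the listing in the question is the third column of try.tab, it is worth noting that it contains both 1 and chr1, both MT and chrM, and so on. Presumably one set of names comes from the alignments and the other from the annotation. A quick tally makes such a mix easy to spot. This is only a sketch: the function name is hypothetical, and it assumes the header value in that column is the literal word Reference, as in the listing.

```shell
# Tally sequence names read on stdin (one per line) by naming convention.
tally_chr_prefix() {
  awk '
    $0 == "" || $0 == "Reference" { next }   # skip blank lines and the column header
    /^chr/ { prefixed++; next }              # names that start with "chr"
    { bare++ }                               # everything else
    END { printf "chr_prefixed=%d unprefixed=%d\n", prefixed, bare }
  '
}

# Possible usage:
#   cut -f 3 try.tab | sort -u | tally_chr_prefix
```

Two non-zero counts in the same table suggest that two naming conventions are in play.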
geo.pertea wrote, 3 months ago:

StringTie gave you some advice there. Did you follow up on it?

Please make sure the -G annotation file uses the same naming convention for the genome sequences.

In case that message is unclear: it is about the chromosome names. The Gencode annotation uses chromosome names like chr1, chr2, chr3, ..., while the grch38 alignments have 1, 2, 3, ... instead. Hence that WARNING message. It may be obvious to you that "1" is the same as "chr1", but StringTie is made to work on any assembly or genome, not just the human genome, and it is not going to second-guess mismatched chromosome names like this.

I would suggest using the UCSC hg38 genome instead. If you don't want to re-run hisat2, either find an annotation that uses the 1, 2, 3, ... naming convention for the chromosomes, or find a reliable way to convert the Gencode genomic sequence names to the grch38 naming convention. This might be as simple as removing the "chr" prefix from the chromosome names, but that might only work for the main chromosome sequences; the naming convention for the additional/alternate contigs might be different.
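The prefix-removal idea could be sketched as below. This is an illustration under stated assumptions, not a verified conversion: it renames only chr1 to chr22, chrX and chrY by dropping the prefix, maps chrM to MT (both names appear in the listing in the question), and leaves every other sequence name untouched. The function name and the output file name are hypothetical.

```shell
# Rewrite column 1 of a GTF read on stdin from chr-style names to bare names.
# ASSUMPTION: only chr1..chr22, chrX, chrY and chrM (-> MT) are renamed;
# all other names and all "#" comment lines pass through unchanged.
gencode_to_bare_names() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }
    $1 == "chrM" { $1 = "MT"; print; next }
    $1 ~ /^chr([0-9]+|X|Y)$/ { $1 = substr($1, 4) }
    { print }
  '
}

# Possible usage (the output file name is made up):
#   gencode_to_bare_names < gencode.v27.annotation.gtf > gencode.v27.bare_names.gtf
```

The converted file would then be passed to stringtie with -G. Any contig names that still differ from the BAM header would need their own mapping.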
