stringtie cannot recognize Gencode GTF
3.9 years ago
dec986

I aligned many fastq files with HISAT2 to grch38. This proceeded without problems.

In the next step, I am running StringTie with the Gencode 27 GTF to find novel transcripts and their counts:

stringtie Donor1_IL2OKT3ZA.HISAT2.sort.bam -G /illumina/runs/RNASeq/Gencode27/gencode.v27.annotation.gtf -A try.tab -p 4 > stringtie.out 2> stringtie.err


However, I get this warning:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
Please make sure the -G annotation file uses the same naming convention for the genome sequences.


Why doesn't StringTie recognize the Gencode annotation? Do I have to do something to the Gencode data?

Update: STAR output works with this StringTie command, but HISAT2 output doesn't. Strange.

Running cut -f 3 try.tab | sort | uniq on StringTie's output table gives:

703404669@ssxfisctimga004:~/RNASeq_benchmark/GSE96075/HISAT2$ cut -f 3 try.tab | sort | uniq
1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9
GL000008.2
GL000009.2
GL000194.1
GL000205.2
GL000214.1
GL000218.1
GL000219.1
GL000220.1
GL000221.1
GL000224.1
KI270442.1
KI270706.1
KI270711.1
KI270713.1
KI270721.1
KI270733.1
KI270734.1
KI270742.1
KI270744.1
KI270745.1
MT
Reference
X
Y
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY


UPDATE: I have found several other instances of this error, but no one ever addressed how to solve this:

Warning encountered while transcript abundance estimation using stringtie

Did you use the appropriate Gencode HISAT2 index during mapping? In other words, did you create a HISAT2 index using the Gencode reference genome FASTA file?

With which option can I link the Gencode GTF? None of the options I can see offers this.

Are you sure the path to the Gencode GTF annotation is specified correctly?

Yes, the GTF file exists and is readable.

Please, paste the output of cut -f 3 yourfile.gff | sort | uniq!

@Macspider thanks I've updated the question

That doesn't look at all like a GFF file, it should contain:

• gene
• mRNA
• exon
• intron
• CDS

or other stuff like that!

Hi Macspider, I'm getting the same error with the Gencode GFF as I am with the Gencode GTF.

Yes, but what you pasted is not at all the third column of a GFF/GTF file!

Hi Macspider, that's the output from stringtie, not the input. The input GTF and GFF were downloaded from Gencode.

3.9 years ago
geo.pertea

StringTie gave you some advice there; did you follow up on it?

Please make sure the -G annotation file uses the same naming convention for the genome sequences.

In case you don't understand that message: it's about the chromosome names. The Gencode annotation uses chromosome names like chr1, chr2, chr3, ... while the grch38 alignments have 1, 2, 3, ... instead. Hence that WARNING message. Sure, for you it might be obvious that "1" is the same as "chr1", but StringTie is made to work on any assembly/genome data, not just the human genome, and it's not going to second-guess your use of mismatching chromosome names like this.

I would suggest using the UCSC hg38 genome instead. Or, if you don't want to re-run hisat2, find an annotation that uses the 1, 2, 3, ... naming convention for the chromosomes, or find a reliable way to convert the Gencode genomic sequence names to the grch38 naming convention. This might be as simple as removing the "chr" prefix from the chromosome names, but that might only work for the main chromosome sequences; the naming convention for the additional/alternate contigs might be different.
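
For what it's worth, the prefix-stripping idea could be sketched in shell roughly as follows. This is only a sketch under assumptions: the function names `gtf_seqnames` and `strip_chr_prefix` are made up for illustration, and the chrM to MT mapping is assumed from the `MT`/`chrM` pair visible in the `try.tab` listing in the question. It only handles the main chromosomes; any alternate contigs in an annotation would need their own mapping.

```shell
# Hypothetical helper: list the distinct sequence names (column 1) of a
# GTF read from stdin, skipping '#' header lines.
gtf_seqnames() {
    grep -v '^#' | cut -f 1 | sort -u
}

# Hypothetical helper: rewrite column 1 of a GTF read from stdin from
# "chr"-prefixed names to bare names (chr1 -> 1, chrX -> X, chrM -> MT).
# Assumption: the alignments use 1..22, X, Y, MT, as in the question.
strip_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }    # pass header/comment lines through unchanged
         {
             if ($1 == "chrM") { $1 = "MT" }   # mitochondrion is named differently
             else { sub(/^chr/, "", $1) }      # drop a leading "chr" if present
             print
         }'
}

# Example usage (the output file name is an assumption):
#   gtf_seqnames < gencode.v27.annotation.gtf
#   samtools view -H Donor1_IL2OKT3ZA.HISAT2.sort.bam | grep '^@SQ' | cut -f 2
#   strip_chr_prefix < gencode.v27.annotation.gtf > gencode.v27.nochr.gtf
```

Comparing the first two listings shows whether the names in the annotation and in the BAM header agree. If they do after conversion, the converted file can be passed to `-G` in place of the original GTF.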