Question: stringtie cannot recognize Gencode GTF
dec986 (United States) wrote:

I aligned many fastq files with HISAT2 to grch38. This proceeded without problems.

In the next step, I ran StringTie with the Gencode v27 GTF to find novel transcripts and their counts:

stringtie Donor1_IL2OKT3ZA.HISAT2.sort.bam -G /illumina/runs/RNASeq/Gencode27/gencode.v27.annotation.gtf -A try.tab -p 4 > stringtie.out 2> stringtie.err

However, I get this error:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
Please make sure the -G annotation file uses the same naming convention for the genome sequences.

Why doesn't StringTie recognize the Gencode annotation? Do I have to do something to the Gencode data?

Update: STAR output works with this StringTie, but HISAT2 output doesn't. Strange.

The output of cut -f 3 try.tab | sort | uniq, run on StringTie's output, looks like this:

703404669@ssxfisctimga004:~/RNASeq_benchmark/GSE96075/HISAT2$ cut -f 3 try.tab | sort | uniq
1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9
GL000008.2
GL000009.2
GL000194.1
GL000205.2
GL000214.1
GL000218.1
GL000219.1
GL000220.1
GL000221.1
GL000224.1
KI270442.1
KI270706.1
KI270711.1
KI270713.1
KI270721.1
KI270733.1
KI270734.1
KI270742.1
KI270744.1
KI270745.1
MT
Reference
X
Y
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY

UPDATE: I have found several other instances of this error, but no one ever addressed how to solve this:

https://github.com/gpertea/stringtie/issues/113

Warning encountered while transcript abundance estimation using stringtie

Tags: gencode, rna-seq, stringtie
written 9 days ago by dec986 • modified 6 days ago by geo.pertea

Did you use the appropriate Gencode HISAT2 index during mapping? In other words, did you create a HISAT2 index from the Gencode reference genome FASTA file?

written 9 days ago by Sinji

With which option can I link the Gencode GTF? None of the options I can see offers this.

written 9 days ago by dec986

Are you sure the path to the Gencode GTF annotation is specified correctly?

written 8 days ago by EagleEye • modified 8 days ago

Yes, the GTF file exists and is readable.

written 8 days ago by dec986

Please paste the output of cut -f 3 yourfile.gff | sort | uniq!

written 8 days ago by Macspider

@Macspider thanks, I've updated the question.

written 8 days ago by dec986

That doesn't look at all like a GFF file. The third column should contain:

  • gene
  • mRNA
  • exon
  • intron
  • CDS

or other stuff like that!

written 7 days ago by Macspider

Hi Macspider, I'm getting the same error with the Gencode GFF as with the Gencode GTF.

written 7 days ago by dec986

Yes, but what you pasted is not at all the third column of a GFF/GTF file!

http://www.ensembl.org/info/website/upload/gff.html#fields

written 7 days ago by Macspider

Hi Macspider, that's the output from stringtie, not the input. The input GTF and GFF were downloaded from Gencode.

written 7 days ago by dec986
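
To inspect the input side directly, the sequence names in the first column of the GTF can be compared with the names in the BAM header. The following is a minimal sketch and not part of the original thread: the function names (gtf_seqnames, sam_header_seqnames, shared_seqnames) and the temporary file names gtf.names and bam.names are hypothetical, and it assumes samtools is installed for reading the BAM header.

```shell
# Print the unique sequence names in column 1 of a GTF/GFF file,
# skipping '#' comment lines.
gtf_seqnames() {
    awk -F '\t' '!/^#/ && NF > 1 { print $1 }' "$1" | sort -u
}

# Read SAM header text on stdin and print the unique names found in the
# SN: field of each @SQ line.
sam_header_seqnames() {
    awk -F '\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }' | sort -u
}

# Print the names that appear in both sorted name lists.
# Empty output means the two files share no sequence names.
shared_seqnames() {
    comm -12 "$1" "$2"
}

# Possible usage (file names taken from the question, output names hypothetical):
#   gtf_seqnames /illumina/runs/RNASeq/Gencode27/gencode.v27.annotation.gtf > gtf.names
#   samtools view -H Donor1_IL2OKT3ZA.HISAT2.sort.bam | sam_header_seqnames > bam.names
#   shared_seqnames gtf.names bam.names
```

If the last command prints nothing, the annotation and the alignments use different names for every sequence, which would be consistent with the warning quoted in the question.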
geo.pertea wrote:

StringTie gave you some advice there; did you follow up on it?

Please make sure the -G annotation file uses the same naming convention for the genome sequences.

In case you don't understand that message: it's about the chromosome names. The Gencode annotation uses chromosome names like chr1, chr2, chr3, ... while the grch38 alignments have 1, 2, 3, ... instead. Hence that WARNING message. Sure, for you it might be obvious that "1" is the same as "chr1". But StringTie is made to work on any assembly/genome data, not just the human genome, and it's not going to second-guess your use of mismatching chromosome names like this.

I would suggest using the UCSC hg38 genome instead. Or, if you don't want to re-run HISAT2, either find an annotation which uses the 1, 2, 3, ... naming convention for the chromosomes, or find a reliable way to convert the Gencode genomic sequence names to the grch38 naming convention. This might be as simple as removing the "chr" prefix from the chromosome names, but that might only work for the main chromosome sequences; the naming convention for the additional/alternate contigs might be different.

written 6 days ago by geo.pertea
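
The prefix-removal idea from the answer above could look roughly like the sketch below. It is an illustration under stated assumptions, not a tested conversion for every Gencode release: the function name gtf_chr_to_plain and the output file name are hypothetical, the chrM to MT mapping is assumed from the MT entry in the output listed in the question, and any sequence name that is not chr1 to chr22, chrX, chrY or chrM is passed through unchanged and would need to be checked by hand.

```shell
# Read a GTF on stdin and write it to stdout with column 1 renamed:
#   chrM              -> MT
#   chr<number>, chrX, chrY -> the same name without the "chr" prefix
# Comment lines and all other sequence names are left unchanged.
gtf_chr_to_plain() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }
         $1 == "chrM" { $1 = "MT"; print; next }
         $1 ~ /^chr([0-9]+|X|Y)$/ { $1 = substr($1, 4) }
         { print }'
}

# Possible usage (output file name is hypothetical):
#   gtf_chr_to_plain < gencode.v27.annotation.gtf > gencode.v27.annotation.nochr.gtf
#   stringtie Donor1_IL2OKT3ZA.HISAT2.sort.bam -G gencode.v27.annotation.nochr.gtf -A try.tab -p 4
```

After converting, it is worth confirming that the names in the new GTF actually match those in the BAM header before re-running StringTie, since a silent mismatch on the extra contigs would drop those reference transcripts.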