Question

Stringtie with Gencode gtf file doesn't give any gene_ids or transcript_ids?

0

Entering edit mode

7.5 years ago

Biologist ▴ 290

Hi,

I'm using Hisat2, Stringtie, Stringtie-Merge, gffcompare and ballgown for RNAseq analysis. I wanted to detect differential expressed lncRNAs.

For Hisat2, I downloaded GRCh38 genome index "genome_snp_tran" from Hisat2 website Hisat2 Indexes. After running the hisat2 command with the available fastq files it gave a bam file as output. Using samtools sorted the bam files and used sorted.bam files as input for Stringtie.

stringtie -p 8 -G gencode_Release27_GRCh38.p10_PRI/gencode.v27.primary_assembly.annotation.gtf -o /dir/A133/A133_GRCh38.gtf -l /A133/A133 /dir/A133/A133.sorted.bam"

This gave the output A133_GRCh38.gtf file. It looks like following:

1   StringTie   transcript  14007   16679   1000    -   .   gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; cov "7.175807"; FPKM "0.578671"; TPM "1.022094";
1   StringTie   exon    14007   15038   1000    -   .   gene_id "/dir/STB133/STB133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "1"; cov "7.036415";
1   StringTie   exon    15796   15947   1000    -   .   gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "2"; cov "9.265032";
1   StringTie   exon    16607   16679   1000    -   .   gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "3"; cov "4.796226";

I don't understand why there are no gene_id and transcript_id? Why it shows PATH there?

I used the same command on the other samples A765, A987, A456. It gave A765_GRCh38.gtf, A987_GRCh38.gtf, A456_GRCh38.gtf files as outputs.

Next I used Stringtie-Merge. I wrote the path of all the Stringtie output gtf files in text file "mergelist.txt"

stringtie --merge -p 8 -G gencode_Release27_GRCh38.p10_PRI/gencode.v27.primary_assembly.annotation.gtf -o /stringtiemerge/stringtie_merged.gtf mergelist.txt”

Output "stringtie_merged_gtf" is like following:

# StringTie version 1.3.3
1       StringTie       transcript      13989   26963   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
1       StringTie       exon    13989   15038   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
1       StringTie       exon    15796   15947   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "2"; 
1       StringTie       exon    16607   16765   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "3"; 
GL000218.1  StringTie   transcript  51867   54893   1000    -   .   gene_id "MSTRG.15857"; transcript_id "ENST00000612565.1"; gene_name "AL354822.1"; ref_gene_id "ENSG00000278384.1"; 
GL000218.1  StringTie   exon    51867   54893   1000    -   .   gene_id "MSTRG.15857"; transcript_id "ENST00000612565.1"; exon_number "1"; gene_name "AL354822.1"; ref_gene_id "ENSG00000278384.1"; 
chr1    ENSEMBL exon    1529223 1529331 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000536055.5"; exon_number "15"; gene_name "ATAD3A";
chr1    ENSEMBL exon    1533926 1534687 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000536055.5"; exon_number "16"; gene_name "ATAD3A";
chr1    HAVANA  transcript  1517070 1517775 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; gene_name "ATAD3A"; ref_gene_id "ENSG00000197785.13";
chr1    HAVANA  exon    1517070 1517412 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; exon_number "1"; gene_name "ATAD3A";
chr1    HAVANA  exon    1517716 1517775 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; exon_number "2"; gene_name "ATAD3A";
chr1    HAVANA  transcript  1517621 1522780 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000429957.1"; gene_name "ATAD3A"; ref_gene_id "ENSG00000197785.13";
chr6    ENSEMBL exon    32034860    32034992    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000425700.3"; exon_number "39"; gene_name "C4B";
chr6    ENSEMBL exon    32035136    32035417    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000425700.3"; exon_number "40"; gene_name "C4B";
chr6    HAVANA  transcript  32015581    32016297    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; gene_name "C4B"; ref_gene_id "ENSG00000224389.8";
chr6    HAVANA  exon    32015581    32015657    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; exon_number "1"; gene_name "C4B";
chr6    HAVANA  exon    32015867    32016297    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; exon_number "2"; gene_name "C4B";
chr6    HAVANA  transcript  32015619    32017136    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000485543.1"; gene_name "C4B"; ref_gene_id "ENSG00000224389.8";

What are those "MSTRG" for gene_id and transcript_id?
Why in the second column I see Stringtie, HAVANA, ENSEMBL?
In the first column what is "GL000218.1"?
Why the first column"Chromosome" is 1 sometimes and chr1 sometimes?

Please help me to understand these outputs. Did I do anything wrong? Is there any problem with gtf or genome index?

RNA-Seq stringtie lncRNA gtf gencode • 6.4k views

ADD COMMENT • link updated 5.6 years ago by yiren ▴ 10 • written 7.5 years ago by Biologist ▴ 290

score 1 · Answer 1 · 2018-02-04

1

Entering edit mode

7.5 years ago

Devon Ryan 105k

They're sequential identifiers, since GTF files aren't compliant withoutthat.
That's there in the original GTF from Gencode
That's an unplaced contig, there are a LOT of them.
This is actually the source of your problems. You seem to be mixing a fasta file with Ensembl chromosome names with a GTF file having Gencode or UCSC names (or something like that). Your best bet is to convert the GTF file to use the same chromosome names as your BAM files (you can do most of this by stripping chr from the beginning of each line).

ADD COMMENT • link 7.5 years ago by Devon Ryan 105k

0

Entering edit mode

Thank Devon. What about A133_GRCh38.gtf file output. Why there is no gene_id, transcript_id or genename? Why there is path?

ADD REPLY • link 7.5 years ago by Biologist ▴ 290

0

Entering edit mode

I have no idea what you're talking about. Try posting a few lines of the relevant file.

ADD REPLY • link 7.5 years ago by Devon Ryan 105k

0

Entering edit mode

It was already mentioned in the above question. I gave a A133_GRCh38.gtf file output.

ADD REPLY • link 7.5 years ago by Biologist ▴ 290

0

Entering edit mode

They're just arbitrary names, they have no importance in and of themselves.

ADD REPLY • link 7.5 years ago by Devon Ryan 105k

0

Entering edit mode

Hello Devon,

Sorry for asking you again. The "MSTRG" in the above stringtie_merged.gtf file are the novel transcripts? because I see some with Ensembl gene id and transcript id. I think that if stringtie doesn't find any ref gene locations from gtf file but finds rna-seq alignments in that location stringtie constructs that as novel transcripts? Am I right?

Another question: After stringtie-merge, I used gffcompare to examine how the transcripts compare with reference annotation. the reference annotation is with both protein coding and lncRNAs. How will I know which is protein coding gene and known lncRNAs from stringtie_merged.gtf file?

Sequencing is strand-specific (RF). How to get the counts for the pc genes and known lncRNAs from all samples? With counts I would like to do differential analysis.

ADD REPLY • link 7.4 years ago by Biologist ▴ 290

1

Entering edit mode

It's not entirely clear if they're purely novel transcripts or whether modified transcripts are also included.

You'd need to filter the GTF files for lncRNAs or protein coding genes to know for sure, I don't think gffcompare will report that.

You can use featureCounts to get counts (allegedly stringTie can produce some sort of counts too...but I'd be hesitant to trust that).

ADD REPLY • link 7.4 years ago by Devon Ryan 105k

0

Entering edit mode

I would like to use htseq-counts.

htseq-count -f bam -q sample1.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt

Is this right for strand-specific (RF)? Is bam fine or I need to give sorted.bam?

And could you please tell how to strip "chr" from gtf file in linux.

ADD REPLY • link 7.4 years ago by Biologist ▴ 290

1

Entering edit mode

You'll want -s reverse I think. You can strip chr characters with sed.

ADD REPLY • link 7.4 years ago by Devon Ryan 105k

0

Entering edit mode

I tried with this:

htseq-count –s reverse –f bam –q sample1.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt

I have this error:

/path/soft/apps/HTSeq/0.6.1p1-goolf-1.7.20-Python-2.7.11/bin/htseq-count: Error: Please provide two arguments.

Call with '-h' to get usage information.

ADD REPLY • link 7.4 years ago by Biologist ▴ 290

1

Entering edit mode

There's presumably an extra space somewhere (or in that version maybe -q is doing something different).

ADD REPLY • link 7.4 years ago by Devon Ryan 105k

0

Entering edit mode

Could you please tell something about my previous comment.

ADD REPLY • link 7.4 years ago by Biologist ▴ 290

score 0 · Answer 2 · 2019-12-11

0

Entering edit mode

5.6 years ago

yiren ▴ 10

Hi,biologist

did you solve this question? the question is in my pipeline for RNAseq,please what can I do next ? you can view my question from this link,C: Stringtie output gtf file only contains STRG,no original annotations

Thank you very much!

ADD COMMENT • link 5.6 years ago by yiren ▴ 10