Question: Stringtie with Gencode gtf file doesn't give any gene_ids or transcript_ids?
0
gravatar for Biologist
10 months ago by
Biologist110
Biologist110 wrote:

Hi,

I'm using Hisat2, Stringtie, Stringtie-Merge, gffcompare and ballgown for RNAseq analysis. I wanted to detect differential expressed lncRNAs.

For Hisat2, I downloaded GRCh38 genome index "genome_snp_tran" from Hisat2 website Hisat2 Indexes. After running the hisat2 command with the available fastq files it gave a bam file as output. Using samtools sorted the bam files and used sorted.bam files as input for Stringtie.

stringtie -p 8 -G gencode_Release27_GRCh38.p10_PRI/gencode.v27.primary_assembly.annotation.gtf -o /dir/A133/A133_GRCh38.gtf -l /A133/A133 /dir/A133/A133.sorted.bam"

This gave the output A133_GRCh38.gtf file. It looks like following:

1   StringTie   transcript  14007   16679   1000    -   .   gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; cov "7.175807"; FPKM "0.578671"; TPM "1.022094";
1   StringTie   exon    14007   15038   1000    -   .   gene_id "/dir/STB133/STB133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "1"; cov "7.036415";
1   StringTie   exon    15796   15947   1000    -   .   gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "2"; cov "9.265032";
1   StringTie   exon    16607   16679   1000    -   .   gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "3"; cov "4.796226";
  1. I don't understand why there are no gene_id and transcript_id? Why it shows PATH there?

I used the same command on the other samples A765, A987, A456. It gave A765_GRCh38.gtf, A987_GRCh38.gtf, A456_GRCh38.gtf files as outputs.

Next I used Stringtie-Merge. I wrote the path of all the Stringtie output gtf files in text file "mergelist.txt"

stringtie --merge -p 8 -G gencode_Release27_GRCh38.p10_PRI/gencode.v27.primary_assembly.annotation.gtf -o /stringtiemerge/stringtie_merged.gtf mergelist.txt”

Output "stringtie_merged_gtf" is like following:

# StringTie version 1.3.3
1       StringTie       transcript      13989   26963   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; 
1       StringTie       exon    13989   15038   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1"; 
1       StringTie       exon    15796   15947   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "2"; 
1       StringTie       exon    16607   16765   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "3"; 
GL000218.1  StringTie   transcript  51867   54893   1000    -   .   gene_id "MSTRG.15857"; transcript_id "ENST00000612565.1"; gene_name "AL354822.1"; ref_gene_id "ENSG00000278384.1"; 
GL000218.1  StringTie   exon    51867   54893   1000    -   .   gene_id "MSTRG.15857"; transcript_id "ENST00000612565.1"; exon_number "1"; gene_name "AL354822.1"; ref_gene_id "ENSG00000278384.1"; 
chr1    ENSEMBL exon    1529223 1529331 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000536055.5"; exon_number "15"; gene_name "ATAD3A";
chr1    ENSEMBL exon    1533926 1534687 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000536055.5"; exon_number "16"; gene_name "ATAD3A";
chr1    HAVANA  transcript  1517070 1517775 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; gene_name "ATAD3A"; ref_gene_id "ENSG00000197785.13";
chr1    HAVANA  exon    1517070 1517412 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; exon_number "1"; gene_name "ATAD3A";
chr1    HAVANA  exon    1517716 1517775 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; exon_number "2"; gene_name "ATAD3A";
chr1    HAVANA  transcript  1517621 1522780 .   +   .   gene_id "ENSG00000197785.13"; transcript_id "ENST00000429957.1"; gene_name "ATAD3A"; ref_gene_id "ENSG00000197785.13";
chr6    ENSEMBL exon    32034860    32034992    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000425700.3"; exon_number "39"; gene_name "C4B";
chr6    ENSEMBL exon    32035136    32035417    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000425700.3"; exon_number "40"; gene_name "C4B";
chr6    HAVANA  transcript  32015581    32016297    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; gene_name "C4B"; ref_gene_id "ENSG00000224389.8";
chr6    HAVANA  exon    32015581    32015657    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; exon_number "1"; gene_name "C4B";
chr6    HAVANA  exon    32015867    32016297    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; exon_number "2"; gene_name "C4B";
chr6    HAVANA  transcript  32015619    32017136    .   +   .   gene_id "ENSG00000224389.8"; transcript_id "ENST00000485543.1"; gene_name "C4B"; ref_gene_id "ENSG00000224389.8";
  1. What are those "MSTRG" for gene_id and transcript_id?
  2. Why in the second column I see Stringtie, HAVANA, ENSEMBL?
  3. In the first column what is "GL000218.1"?
  4. Why the first column"Chromosome" is 1 sometimes and chr1 sometimes?

Please help me to understand these outputs. Did I do anything wrong? Is there any problem with gtf or genome index?

ADD COMMENTlink modified 10 months ago by Devon Ryan86k • written 10 months ago by Biologist110
1
gravatar for Devon Ryan
10 months ago by
Devon Ryan86k
Freiburg, Germany
Devon Ryan86k wrote:
  1. They're sequential identifiers, since GTF files aren't compliant withoutthat.
  2. That's there in the original GTF from Gencode
  3. That's an unplaced contig, there are a LOT of them.
  4. This is actually the source of your problems. You seem to be mixing a fasta file with Ensembl chromosome names with a GTF file having Gencode or UCSC names (or something like that). Your best bet is to convert the GTF file to use the same chromosome names as your BAM files (you can do most of this by stripping chr from the beginning of each line).
ADD COMMENTlink written 10 months ago by Devon Ryan86k

Thank Devon. What about A133_GRCh38.gtf file output. Why there is no gene_id, transcript_id or genename? Why there is path?

ADD REPLYlink written 10 months ago by Biologist110

I have no idea what you're talking about. Try posting a few lines of the relevant file.

ADD REPLYlink modified 10 months ago • written 10 months ago by Devon Ryan86k

It was already mentioned in the above question. I gave a A133_GRCh38.gtf file output.

ADD REPLYlink written 10 months ago by Biologist110

They're just arbitrary names, they have no importance in and of themselves.

ADD REPLYlink written 10 months ago by Devon Ryan86k

Hello Devon,

Sorry for asking you again. The "MSTRG" in the above stringtie_merged.gtf file are the novel transcripts? because I see some with Ensembl gene id and transcript id. I think that if stringtie doesn't find any ref gene locations from gtf file but finds rna-seq alignments in that location stringtie constructs that as novel transcripts? Am I right?

Another question: After stringtie-merge, I used gffcompare to examine how the transcripts compare with reference annotation. the reference annotation is with both protein coding and lncRNAs. How will I know which is protein coding gene and known lncRNAs from stringtie_merged.gtf file?

Sequencing is strand-specific (RF). How to get the counts for the pc genes and known lncRNAs from all samples? With counts I would like to do differential analysis.

ADD REPLYlink written 10 months ago by Biologist110
1

It's not entirely clear if they're purely novel transcripts or whether modified transcripts are also included.

You'd need to filter the GTF files for lncRNAs or protein coding genes to know for sure, I don't think gffcompare will report that.

You can use featureCounts to get counts (allegedly stringTie can produce some sort of counts too...but I'd be hesitant to trust that).

ADD REPLYlink written 10 months ago by Devon Ryan86k

I would like to use htseq-counts.

htseq-count -f bam -q sample1.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt

Is this right for strand-specific (RF)? Is bam fine or I need to give sorted.bam?

And could you please tell how to strip "chr" from gtf file in linux.

ADD REPLYlink modified 10 months ago • written 10 months ago by Biologist110
1

You'll want -s reverse I think. You can strip chr characters with sed.

ADD REPLYlink written 10 months ago by Devon Ryan86k

I tried with this:

htseq-count –s reverse –f bam –q sample1.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt

I have this error:

/path/soft/apps/HTSeq/0.6.1p1-goolf-1.7.20-Python-2.7.11/bin/htseq-count: Error: Please provide two arguments.

Call with '-h' to get usage information.

ADD REPLYlink written 10 months ago by Biologist110
1

There's presumably an extra space somewhere (or in that version maybe -q is doing something different).

ADD REPLYlink written 10 months ago by Devon Ryan86k

Could you please tell something about my previous comment.

ADD REPLYlink written 10 months ago by Biologist110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 808 users visited in the last hour