Question: How to quantify gene expression when stringtie considers two adjacent genes as single gene?
0
gravatar for kousi31
3 months ago by
kousi310
kousi310 wrote:

Hi all,
I am following hisat2, stringtie and Deseq2 pipeline. I obtained gene counts using HTSeq by aligning reads against stringtie merged gtf. I aligned reads with HISAT2 without reference GFF file (to avoid biasing alignments to annotated splice junctions). While mapping using Stringtie I used reference gff file.

When I see the results I could see that stringtie has assigned transcripts of two adjacent genes to same MSTRG ID.

For eg;

Genome GFF file from NCBI

NC_037550.1 Gnomon lnc_RNA 107569536 107577599 . - . ID=rna-XR_003110259.1;Parent=gene-LOC112585719;Dbxref=GeneID:112585719,Genbank:XR_003110259.1;Name=XR_003110259.1;gbkey=ncRNA;

NC_037550.1 Gnomon gene 107580147 107581316 . - . ID=gene-C6H1orf122;Dbxref=GeneID:102404938;Name=C6H1orf122;gbkey=Gene;gene=C6H1orf122;gene_biotype=protein_coding

Stringtie_merged GTF

NC_037550.1 StringTie transcript 107569504 107581424 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.1";

So 1. Why does stringtie assign same ID to adjacent loci? Is that because I did not use reference annotation while aligning?

  1. How to quantify, when stringtie assigns reads of two adjacent loci to same ID?
rna-seq de deseq2 stringtie htseq • 296 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by kousi310
2
gravatar for kristoffer.vittingseerup
3 months ago by
European Union
kristoffer.vittingseerup2.5k wrote:

One way of solving this is by for each transcript/isoform annotating which gene it belong to and then for each of the two genes sum the expression/count of those isoforms to get the new divided gene expression/counts.

This also means you will have to requantify with isoform resolution. You can read more about considerations and tools for isoform level quantification here.

ADD COMMENTlink written 3 months ago by kristoffer.vittingseerup2.5k

Thank you for the link. I used prepDE.py -i ballgown -g gene_count_matrix.csv -t transcript_count_matrix.csv to calculate transcript-level expression count.

Now I guess I should use summarizeToGene of tximport package to summarize the counts to each gene. Can you please tell or attach a link on how to import transcript_count_matrix.csv into tximport to summarize the transcript counts?. There are no examples on how to import stringtie generated transcript-count-matrix.

ADD REPLYlink modified 3 months ago • written 3 months ago by kousi310
2

It is a 4 step opperation: 1) read the "transcript_count_matrix.csv" file into R with read.table() or read_tsv(). 2) Read in the annotation (either using the same function as in step 1 or you can use rtracklayer::import() to import the merged GTF file. 3) Manually modify the annotation to split up the "merged" genes by giving them new gene-ids. 4) Use IsoformSwitchAnalyzeR::isoformToGeneExp() or tximport::summarizeToGene() to summarize the isoform expression to gene expression.

ADD REPLYlink written 3 months ago by kristoffer.vittingseerup2.5k

Doubt on assigning ids while manually modifying annotation to split the merged genes. For eg; for the transcript mentioned in my question, I planned to give MSTRG.10147 for transcripts falling between 107569536-107577599 and MSTRG.10147a for transcripts falling between 107580147-107581316

But there seems to be overlapping transcripts also in the merged gtf and I don't know how to name them.

NC_037550.1 StringTie transcript 107569504 107581424 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.1";

NC_037550.1 StringTie exon 107569504 107569993 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.1"; exon_number "1";

NC_037550.1 StringTie exon 107576917 107580625 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.1"; exon_number "2";

NC_037550.1 StringTie exon 107580892 107581093 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.1"; exon_number "3";

NC_037550.1 StringTie exon 107581325 107581424 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.1"; exon_number "4";

NC_037550.1 StringTie transcript 107569536 107577599 1000 - . gene_id "MSTRG.10147"; transcript_id "rna27666"; gene_name "LOC112585719";
ref_gene_id "gene12951";

NC_037550.1 StringTie exon 107569536 107569993 1000 - . gene_id "MSTRG.10147"; transcript_id "rna27666"; exon_number "1"; gene_name
"LOC112585719"; ref_gene_id "gene12951";

NC_037550.1 StringTie exon 107576917 107577599 1000 - . gene_id "MSTRG.10147"; transcript_id "rna27666"; exon_number "2"; gene_name "LOC112585719"; ref_gene_id "gene12951";

NC_037550.1 StringTie transcript 107574189 107581316 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.3";

NC_037550.1 StringTie exon 107574189 107574379 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.3"; exon_number "1";

NC_037550.1 StringTie exon 107575840 107580625 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.3"; exon_number "2";

NC_037550.1 StringTie exon 107580892 107581093 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.3"; exon_number "3";

NC_037550.1 StringTie exon 107581184 107581316 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.3"; exon_number "4";

NC_037550.1 StringTie transcript 107575841 107579877 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.4";

NC_037550.1 StringTie exon 107575841 107575912 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.4"; exon_number "1";

NC_037550.1 StringTie exon 107576917 107579877 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.4"; exon_number "2";

NC_037550.1 StringTie transcript 107580147 107581316 1000 - . gene_id "MSTRG.10147"; transcript_id "rna27667"; gene_name "C6H1orf122"; ref_gene_id "gene12952";

NC_037550.1 StringTie exon 107580147 107580625 1000 - . gene_id "MSTRG.10147"; transcript_id "rna27667"; exon_number "1"; gene_name "C6H1orf122"; ref_gene_id "gene12952";

NC_037550.1 StringTie exon 107580892 107581093 1000 - . gene_id "MSTRG.10147"; transcript_id "rna27667"; exon_number "2"; gene_name "C6H1orf122"; ref_gene_id "gene12952";

NC_037550.1 StringTie exon 107581184 107581316 1000 - . gene_id "MSTRG.10147"; transcript_id "rna27667"; exon_number "3"; gene_name "C6H1orf122"; ref_gene_id "gene12952";

Pl. suggest me how to handle those transcripts

ADD REPLYlink modified 3 months ago • written 3 months ago by kousi310

I do not know how to handle transcripts that span both of them - which is the reason why StringTie merged the genes in the first place. There are two possibilities: random assignment or removal but before you start doing something like that I'd suggest you take a look at the raw data. A efficient way of doing that is loading it into a genome browser - there are many out there but I usually use the IGV browser since it is local and directly supports loading of bam and GTF files so you can take a closer look at the data to figure out if it is a problem with StringTie or the data actually supports this novel transcript. Also note that StringTie2 was just released and should be a good deal better so re-running with StringTie2 is also a possibility.

ADD REPLYlink written 3 months ago by kristoffer.vittingseerup2.5k

There is a same geneid for a region spanning 53kb.

There are different gene_ids in individual sample gtfs. Only in the merged gtf file the MSTRG IDS are same for few loci.

Previously I had used the following commands to generate sample gtfs and merged gtf. Kindly tell me if the options I used are ok.

StringTie version 1.3.5 Read length 101 bp.

./stringtie -G UOA_WB_1_genomic.gff sam_sorted.bam -o output.gtf -A gene_abund.tab -C cov_ref.gtf -v

./stringtie --merge -G UOA_WB_1_genomic.gff -c 3 -T 1 -o stringtiemerged.gtf mergelist.txt

ADD REPLYlink modified 3 months ago • written 3 months ago by kousi310

Loos good to me - the only thing you could try was increasing -f to 0.05 in the --merge run. But again it might be that there are RNAseq reads joining the two regions together - you will not know unless you look at the raw - which also mean that you still actually don't know if it is an error or not - could simply be the known annotation that was lacking...

ADD REPLYlink written 3 months ago by kristoffer.vittingseerup2.5k

Thank you for your advise. I viewed the raw data in IGV. I found that there are splice junctions connecting the two loci and in some cases there are overlapping genes. There are around 121 such loci. Is it ok to perform the differential expression analysis and report these merged loci with co-ordinates if they are differentially expressed?

ADD REPLYlink modified 3 months ago • written 3 months ago by kousi310

Nice you figured it out. And yes that is okay :-)

ADD REPLYlink written 3 months ago by kristoffer.vittingseerup2.5k

Thank you for your patient suggestions.

ADD REPLYlink written 3 months ago by kousi310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1148 users visited in the last hour