Hello People,
The following is a problem that I am trying really hard for long time to resolve but can't come up to reasonable solution.
I execute the new tuxedo pipeline to analyze my RNA-seq samples. [Hisat2 -> StringTie2 -> TxImport -> Deseq2]
Now I extract the TPM counts for each transcript from TxImport's object txi$abundance then total them for each gene. I find that a few genes are in duplicate
Gene_ID Treated1    Treated2    Treated3    Contrl01    Contrl02    Contrl03  
ENSG00000233639:PANTR1  0.177826    0.023043    0.270347    0.027134    0.116557    0.165914 
ENSG00000233639:PANTR1  0   0   0   0   0   0 
ENSG00000235652:AL356599.1  1.218245    0.161862    1.835312    0.662362    0.355804    0.938479 
ENSG00000235652:AL356599.1  0   0   0   0   0   0  
ENSG00000242759:LINC00882   0.249292    0.367449    0.152916    0.162572    0.096571    0.120887 
ENSG00000242759:LINC00882   0   0   0   0   0   0  
ENSG00000253853:AC246817.1  1.071109    1.144176    0.726447    1.027518    0.747923    0.72971 
ENSG00000253853:AC246817.1  0.197055    0.056572    0.085629    0.214114    0.139897    0.172577 
ENSG00000257761:AC078860.2  0.888787    0.967599    0.529618    4.099847    3.220867    2.677969 
ENSG00000257761:AC078860.2  0.231985    0.190595    0.022277    0.052071    0.021828    0.070146
Well this might not be a big issue, because I can manually remove them as they are just counts.
Now I perform Deseq's analysis on MSTRG IDs...but on extracting results, I convert the MSTRG IDs to gene names and the problem starts
GeneID      Gene_names              baseMean          log2FoldChange       P-Value
MSTRG.10967 ENSG00000054654:SYNE2   5950.04530060011    0.012667982845789   0.959394280265825
MSTRG.10968 ENSG00000054654:SYNE2   1.41078707122937    -4.23070771833164   0.070293522379558
MSTRG.10175 ENSG00000068650:ATP11A  1833.81055560052    1.56556217491457    8.77220310293335E-10
MSTRG.10176 ENSG00000068650:ATP11A  4.61787244133165    -2.5207751366497    0.038542082326133
MSTRG.30937 ENSG00000070614:NDST1   658.6970054024  1.41142315136311    3.01295144450564E-14
MSTRG.30938 ENSG00000070614:NDST1   1.56573096638691    0.033948989808379   0.985488588961825
MSTRG.11506 ENSG00000100865:CINP    439.00088940909 0.091933520534186   0.786005173471876
MSTRG.11508 ENSG00000100865:CINP    5.63935284665922    -2.05575795178324   0.038659140045149
MSTRG.17621 ENSG00000101558:VAPA    2059.87434998168    0.005182706679134   0.982051464465322
MSTRG.17622 ENSG00000101558:VAPA    3.36185575110659    -2.37245905503554   0.090177894161586
MSTRG.27933 ENSG00000125388:GRK4    148.13570263131 0.922617617272806   0.004572310096685
MSTRG.27935 ENSG00000125388:GRK4    20.9983000186827    -2.74804985745791   2.69737977243546E-05
MSTRG.25158 ENSG00000128254:C22orf24    1.89318586903034    2.02010767441085    0.382214666408296
MSTRG.25159 ENSG00000128254:C22orf24    10.4722908231843    -1.6212485675741    0.026702940921691
MSTRG.15380 ENSG00000141510:TP53    2011.12079411078    1.01082518533637    6.02405522650678E-06
MSTRG.15381 ENSG00000141510:TP53    40.6951300890702    0.578072468755891   0.262835426053252
MSTRG.6817  ENSG00000149256:TENM4   902.877209604941    1.41048092112734    1.2491105757664E-05
MSTRG.6819  ENSG00000149256:TENM4   0.433135937501165   -2.51416321098229   0.532866041255987
MSTRG.4173  ENSG00000150867:PIP4K2A 715.089818003131    -0.452948374471478  0.099058893543589
MSTRG.4174  ENSG00000150867:PIP4K2A 1.59372578144667    -4.40786190776114   0.093908906761754
How can we resolve these redundancy events (as these involve some very important genes)? What is the popular practice?
Thanks in advance
As a comment, if you want gene-level differential analysis then simply skip this unnecessarily-complicated pipeline and use something like
salmonfor quantification of reads against a transcriptome, followed by tximport. Unless you specifically look for novel transcripts you do not need stringtie. It just makes the analysis more complicated, creating issues like this. Don't let yourself be impressed by the fact that it was published in a notable journal, for standard gene-level DE analysis this pipeline is imho not recommended. salmon-tximport is what I would recommend.