Ambiguous results of RNA-Seq data using stringtie with edgeR
1
0
Entering edit mode
3.8 years ago

Hi, I have human RNA-Seq (Total RNA) data of 24 paired samples. My samples are labelled as S1,S2,S3 and so on. S1-S2 form a pair and S3-S4 for another (odd represents Normal samples and even diseased samples). I have used STAR to align the reads and have obtained an overall alignment of > 90% for most cases. I have then used Stringtie for transcript assembly using the following command:

stringtie -p 3 -e -B -G reference_annotation_file -l S1 --rf -o ..S1.transcripts.gtf S1Aligned.sortedByCoord.out.quality.filtered.reads.bam

This command has been used individually for all the samples.

Then I have used the online script prepDE.py to convert the fpkm values of reads to raw counts. (http://ccb.jhu.edu/software/stringtie/dl/prepDE.py)

Thereafter I have used edgeR to perform paired differential analysis for the data.

I am facing several ambiguities with the data:

  1. The annotation file was downloaded from Gencode (hg19 version). I had used the complete annotation including contigs, scaffolds, patch. Now, I have seen that there are several genes which have names but different gene IDs, let's say gene1_ID1 and gene1_ID2. In case of some of these genes, I find that both gene1_ID1 and gene1_ID2 are differential, but their logFC and FDR values are differing significantly.

  2. My set of 24 samples have been classified into 2 groups, based on an additional factor being present and absent in the patients. Differential analysis was performed separately in the two groups and the results compared. In these I am finding that for some genes, gene_ID1 is differential in the total set (24 samples), while the same gene with different ID (gene_ID2) is found to be uniquely differential in one of the two groups (of 12 samples). How can this be?

  3. In case of a gene with 2 unique IDs, I find that one is upregulated and the other is downregulated, though both are significant. I shall give the example here and provide the links for the gene with 2 IDs.

    3.a. https://grch37.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000273266;r=12:54447661-54449814;t=ENST00000430889

    3.b. https://grch37.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000198353;r=12:54410715-54449813

  4. One important gene that is known to be upregulated in my disease, as supported by numerous publications, has been upregulated in my sample set. However the logFC of this gene is only 0.5.

These results have left me confused and wondering whether I have made some mistakes in my workflow. I am relatively new to RNA-Seq data. In case I have not been able to clarify my doubts properly, please feel free to probe me further.

RNA-Seq Stringtie edgeR • 1.5k views
ADD COMMENT
1
Entering edit mode

Could you clarify if you did any transcript assembly with StringTie? Also how did you take into account the paired nature of your samples in your DE analysis? Could you post a genome browser screenshot of the HOX4 gene problem?

ADD REPLY
0
Entering edit mode

Yes I did transcript assembly with StringTie using the command mentioned above. This generated the gtf files for each sample which were used as inputs for converting the fpkms to raw counts using PrepDE.py. Was I required to do anything else in this step?

The paired nature of the samples was known prior to designing the experiment. So there is no issue regarding that.

As for the Hoxc4 genes, right now I do not have access to my bam files as I am stuck at home due to lockdown. However I can state that the two HOXC4 genes (links given above) had logFC (-1.23 and 2.49) with FDRs (0.0038 and 0.048) respectively.

ADD REPLY
0
Entering edit mode

Is stringtie assembly still recommended for human genomes where you aren't expecting to find novel transcripts?

ADD REPLY
0
Entering edit mode

Did you run StringTie --merge as well? You can read more about the workflow here. And yes if you have a suspicion you migh have novel transcripts I recommend running StringTie.

ADD REPLY
2
Entering edit mode
3.6 years ago

Ensembl gene ENSG00000273266 is no longer in the database but it has been mapped to 1 deprecated identifier.

So I'd ignore it. Consider getting updated references. Yours are 10 years old.

ADD COMMENT

Login before adding your answer.

Traffic: 2004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6