Question: What are good packages to find Differentially Expressed Genes from RNASeq RSEM and log2(x+1) transformed data?
1
gravatar for J. Smith
6 months ago by
J. Smith50
J. Smith50 wrote:

Hello.

I have obtained TCGA RNAseq - IlluminaHiSeq data from UCSC Xena. They have downloaded Level_3 data from TCGA DCC (file names: *.rsem.genes.normalized_results) and log2(x+1) transformed.

I want to find differentially expressed genes for particular cancer, say, for LIHC using primary solid tumour and solid tissue normal samples using the data obtained from there.

I know that packages like DESeq2 and EdgeR are widely used to find differentially expressed genes for RNASeq data. But they require raw counts data. I have also used R packages like limma and SAM to find differentially expressed genes from DNA Microarray data.

But I don't know which packages or methods are good to find differentially expressed genes from RNASeq rsem and log2(x+1) transformed data as I have collected above.

If anyone can help me out.

Thanks in advance.

ADD COMMENTlink modified 6 months ago by Kevin Blighe67k • written 6 months ago by J. Smith50
1

For RSEM check this Biostars post out. There is yet another link in that post where the creator of DESeq2 answers a similar question.

And for the log2(x+1) transformed ones, why can't you just exponentiate (and subtract 1)?

ADD REPLYlink written 6 months ago by Haci370

Ok. Thanks a lot. I got it...

ADD REPLYlink written 6 months ago by J. Smith50

This user also posted here: C: How to analyze TCGA data downloaded by TCGA2STAT using TCGAbiolinks package

ADD REPLYlink written 6 months ago by Kevin Blighe67k
2
gravatar for Kevin Blighe
6 months ago by
Kevin Blighe67k
Republic of Ireland
Kevin Blighe67k wrote:

J. Smith, if you want to use the HTseq data from Xena, then please follow the advice here:

For RSEM, follow the linked thread by Haci.

kevin

ADD COMMENTlink modified 6 months ago • written 6 months ago by Kevin Blighe67k
1

Thanks so much Kevin.

ADD REPLYlink written 6 months ago by J. Smith50

Hello Dr. Blighe. Actually I can download TCGA RNASeq data directly from the TCGA portal. But the problem is when I am downloading data using TCGABiolinks, I am getting ~56000 ensemble genes and their corresponding gene expressions. But I want to get gene expressions data with official gene symbols which I am getting from UCSC Xena (There are ~20000 official genes and their corresponding gene expressions). I know that conversion from ensemble gene id to the official gene symbol is possible. But then there will be many to many mapping problems. Can you please suggest me how to get RNASeq gene expression data using the official gene symbol with original counts values directly from TCGA?

ADD REPLYlink written 6 months ago by J. Smith50
1

Mapping between annotations was always an annoying 'stumbling block' for me, and delayed many analyses. Picking up code snippets that allow you to do this programmatically can really help. There are probably 2 main ways (in R):

biomaRt: create a lookup table

require('biomaRt')
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)

annotLookup <- getBM(
  mart = mart,
  attributes = c('ensembl_gene_id', 'entrezgene_id',
  'gene_biotype', 'hgnc_symbol'),
  uniqueRows = TRUE)

head(annotLookup)
  ensembl_gene_id entrezgene_id   gene_biotype hgnc_symbol
1 ENSG00000210049            NA        Mt_tRNA       MT-TF
2 ENSG00000211459            NA        Mt_rRNA     MT-RNR1
3 ENSG00000210077            NA        Mt_tRNA       MT-TV
4 ENSG00000210082            NA        Mt_rRNA     MT-RNR2
5 ENSG00000209082            NA        Mt_tRNA      MT-TL1
6 ENSG00000198888          4535 protein_coding      MT-ND1

Then use whichever commands you need to perform the actual mapping.

AnnotationDBI: easier for direct mapping of specific IDs

library(AnnotationDbi)
library(org.Hs.eg.db)

ens <- c("ENSG00000100601", "ENSG00000178826", "ENSG00000243663", "ENSG00000138231")
mapIds(org.Hs.eg.db, ens, column = 'SYMBOL', keytype = 'ENSEMBL')

ENSG00000100601 ENSG00000178826 ENSG00000243663 ENSG00000138231 
       "ALKBH1"       "TMEM139"              NA          "DBR1"

------------------

I think that I will create a tutorial post tonight about this, as it will greatly assist many users.

ADD REPLYlink written 6 months ago by Kevin Blighe67k

Many many thanks Dr. Blighe... Another question I want to ask, suppose corresponding to multiple ensemble gene ids, I get one official gene symbol. Then how should I calculate gene expression/ counts of that official gene symbol from the gene expressions /counts of multiple ensemble genes? And similarly, if an ensemble gene maps multiple official gene symbols, then what is the way out? If you suggest me some solution.

ADD REPLYlink modified 6 months ago • written 6 months ago by J. Smith50
1

This can happen frequently, too. There is no right or wrong procedure. You could take the average / mean of the cases where there are multi-mappings, or just take the transcript with the max summed value across all samples.

aggregate() function can allow one to summarise in this way, as can limma::avereps()

ADD REPLYlink written 6 months ago by Kevin Blighe67k
1

OK. Thanks a lot Dr. Blighe...

ADD REPLYlink written 6 months ago by J. Smith50

Dear Dr. Blighe, I have obtained TCGA RNASeq raw counts data through TCGAbiolinks for some cancers. In all cancers there are same number of ensemble genes (~56000) for which counts values are available. After following your code using AnnotationDbi, only ~26000 ensemble genes have matching official gene symbols. Around 40 official genes have more than one mapping to ensemble genes. Thus corresponding to ~30,000 ensemble genes, I got NA values for official gene symbols. Now, I don't know whether is it safe to discard these 30,000 rows from the original count matrix download through TCGAbiolinks for subsequent processing like differential gene expression or further downstream analysis? I can encounter those 40 repeating genes using the techniques you mentioned above. But what should I do with those 30000 genes? If I use all of those 56000 ensemble genes and the original count matrix for my subsequent analysis like differential gene expression or further downstream analysis, will it be correct?

In this context, if you please tell me, how UCSC Xena or RTCGA provided RSEM values using official gene symbols.

ADD REPLYlink modified 6 months ago • written 6 months ago by J. Smith50
1

Try the biomaRt approach and you will probably see that 99.99% of these 'discards' are non-protein coding genes (check the gene_biotype column). It is actually normal to exclude these from an analysis if primarily protein coding genes are the focus.

ADD REPLYlink written 6 months ago by Kevin Blighe67k

Thank you again Dr. Blighe. I found that among those 56493 Ensembl genes 19708 are protein-coding genes and 6513 ncRNA genes, 14086 lncRNA genes, 10074 processed_pseudogenes and 2464 unprocessed_pseudogenes. So, may I safely remove those 6513 ncRNA genes, 14086 lncRNA genes, 10074 processed_pseudogenes and 2464 unprocessed_pseudogenes from my subsequent analysis?

ADD REPLYlink modified 6 months ago • written 6 months ago by J. Smith50
1

Yep, you can proceed with just the protein coding, but record this is your methods and make it accessible to reviewers, keeping in mind that there are no standards for this process. Note that many lncRNAs are well-researched and have gene symbols, such as XIST, TSIX, MALAT1, etc. Also, some pseudogenes have been shown to have functions.

Other people would prefer a more 'unsupervised' approach and proceed with the original 50000+ genes as Ensembl gene IDs, and only at the very end annotate these to official gene symbols.

Really are no standards, though.

ADD REPLYlink written 6 months ago by Kevin Blighe67k
1

OK. Many many thanks Dr. Blighe.

ADD REPLYlink written 6 months ago by J. Smith50

Dear Dr. Blighe,

Greetings of the Day!!!

I once again read the description of the RNAseq - IlluminaHiSeq_RNASeqV2 LUAD dataset from the UCSC Xena.

Description:

The gene expression profile was measured experimentally using the Illumina HiSeq 2000 RNA Sequencing platform by the University of North Carolina TCGA genome characterization center. Level 3 data was downloaded from TCGA data coordination center. This dataset shows the gene-level transcription estimates, as in log2(x+1) transformed RSEM normalized count. Genes are mapped onto the human genome coordinates using UCSC Xena HUGO probeMap (see ID/Gene mapping link below for details).

wrangling:

Level_3 data (file names: .rsem.genes.normalized_results*) are downloaded from TCGA DCC, log2(x+1) transformed, and processed at UCSC into Xena repository.

As it is normalized results, I wonder whether it will be suitable for subsequent analysis with DESeq2 imported via tximport.

Also after searching the FireBrowse, I found that there are two types of files for RSEM RNASeqV2 1) illuminahiseq_rnaseqv2-RSEM_genes (MD5) 2) illuminahiseq_rnaseqv2-RSEM_genes_normalized (MD5)

I think out of these two files, 1) illuminahiseq_rnaseqv2-RSEM_genes file is the most suitable for subsequent analysis with DESeq2 imported through tximport.

There are 3 columns associated with each samples in the 1) illuminahiseq_rnaseqv2-RSEM_genes file: "raw_count", "scaled_estimate" and "transcript_id". I feel "raw_count" column will be more appropriate for DESeq2 analysis.

Please comment on whether my observation is right or not. Thanks in advance.

ADD REPLYlink modified 6 months ago • written 6 months ago by J. Smith50
1

Thread continues on Bioconductor: https://support.bioconductor.org/p/131066/#131074

ADD REPLYlink written 6 months ago by Kevin Blighe67k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1063 users visited in the last hour