Question: Rna-Seq Data Normalization With Spike-In Using Deseq
5
gravatar for ivivek_ngs
5.6 years ago by
ivivek_ngs4.8k
Seattle,WA, USA
ivivek_ngs4.8k wrote:

Hi,

I need some inputs in normalizing the RNA-Seq data with spike-ins and using the DESeq to retrieve differentially expressed genes from the samples. I have a condition where I have 7 samples out of which 4 samples are of peripheries that give tumor and 4 are centers of tumor. I want to normalize the raw fragment counts(which you use in DESeq) with spike-in and then compute the DEGs from it. my samples data set looks like

head(m) Sample_118p.0 Sample_132p2.0 Sample_91p.0 Sample_118rz.0 Sample_132rz1.0 Sample_132rz2.0 Sample_91rz.0 XLOC_000001 1534 2603 1764 1057 2889 3830 1684 XLOC_000002 175 304 208 144 428 367 222 XLOC_000003 80 195 109 916 2515 2314 1082 XLOC_000004 49 66 54 51 127 219 94 XLOC_000005 0 0 0 0 0 0 0 XLOC_000006 0 1 0 0 0 0 0

spike-in data set

head(sp) Sample_118p.0 Sample_132p2.0 Sample_91p.0 Sample_118rz.0 Sample_132rz1.0 Sample_132rz2.0 Sample_91rz.0 ERCC-00009 49 66 54 51 127 219 94 ERCC-00025 9 7 6 5 14 21 8 ERCC-00031 0 0 0 0 1 1 0 ERCC-00034 1 3 2 0 6 6 4 ERCC-00035 5 7 7 9 32 38 21 ERCC-00042 43 78 56 73 202 199 98

I am using the spike ins sub category B which have equal concentrations so that the consistency is maintained

Now I want to use this in DESeq.

So what is the best possible way to implement this normalization on my RNA-Seq data and create the Newcountdata set object and then estimate size factors and then the dispersion (per-gene variance) to get the Differentially expressed genes from there. Does anybody have any idea about this? It will be good if anyone has used such scenarios can give me some idea about this problem?

ADD COMMENTlink modified 4.7 years ago by Reza10 • written 5.6 years ago by ivivek_ngs4.8k
2

I'm assuming that you want to use the spike-ins simply for the size normalization, rather than estimating dispersion, correct? If so, you can actually manually set the size factors.

ADD REPLYlink written 5.6 years ago by Devon Ryan89k

What have you tried so far? Where are you getting stuck in the analysis?

ADD REPLYlink written 5.6 years ago by Josh Herr5.6k
1
gravatar for Reza
4.7 years ago by
Reza10
United States
Reza10 wrote:

For normalizing my RNA-seq with Sipke-In I did the Following:

Merged my reference FASTA and GTF files with Spike-In FASTA and GTF file then mapped my files with TopHat or STAR. Then:

TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff

Then I divided my gene FPKMs with the sum of the total Spike-In FPKMs in the Cuffdiff output "gene_exp.diff" file for both the control and the treatment columns, let's call it finalized FPKMs. Then subtracted the finlized FPKM of the treatment from the control to see which lines are up or down regulated and then calculated a fold change for each line.

Now, I want to use the DEseq R package since I with cuffdiff I can do a DE analysis with finalized FPKMs. I am thinking of running DEseq in this way:

TopHat -> HTseq -> merge read_count_tables -> DEseq

But for normalizing by the Spike-In I run DEseq by the Spike-In Size factors and not the experiment size factors. 

I think I am in the right path, if not, can anybody let me know why? 

My codes were:

library(DESeq)
countsTable <- read.delim("merged_counts_SpikeIn.txt",header=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head( countsTable)
conds <- factor( c( "untreated", "untreated", "treated",  "treated", "treated" ) )
spikeincds <- newCountDataSet( countsTable, conds )
spikeincds <- estimateSizeFactors( spikeincds )
sizeFactors( spikeincds )

countsTable <- read.delim("merged_counts.txt",header=TRUE)
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[,-1]
head( countsTable)
conds <- factor( c( "untreated", "untreated", "treated",  "treated", "treated" ) )
cds <- newCountDataSet( countsTable, conds )
sizeFactors( cds ) = sizeFactors( spikeincds )
sizeFactors( cds )
head(counts(cds,normalized=TRUE))
cds <- estimateDispersions( cds )
res = nbinomTest( cds, "untreated", "treated" )
head(res)
write.table((res),file="outputFile_SikeIn_Normolized.txt",sep="\t")

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Reza10
0
gravatar for ivivek_ngs
5.6 years ago by
ivivek_ngs4.8k
Seattle,WA, USA
ivivek_ngs4.8k wrote:

Thanks , I have been able to normalize my RNA-Seq data with spike-ins and then used it to estimateDispersions to calculate the per gene variation and then use the negative binomial test to find the DEGs, but owing to the high complexity in my data set I cannot consider the result of DESeq as they fail the multiple testing correction and just on the basis of uncorrected p-val I dont see using those genes as there is another problem where I can see the read counts for some of my comparison is 0 so the mean is also 0 and hence the fold change despite of being statistically significant , it cant be considered. Have anyone of you faced such situations? I have alreay tried RankProd and Cuffdiff( not good results downstream). I am now trying DESeq, donot know what to do next.

ADD COMMENTlink written 5.6 years ago by ivivek_ngs4.8k

I'm not sure how highly complex your dataset is, it sounds fairly straight-forward. The presence of 0 counts isn't that uncommon, though you're most likely to see those when the counts overall are quite low, so they'll generally have crappy p-values. Without seeing enough of your data or any plots, it's rather difficult to give you any advice on how to proceed. In general, DESeq can deal with the 0-count scenario, but as you mentioned, the fold-change is not always the best metric to go by.

ADD REPLYlink written 5.6 years ago by Devon Ryan89k
0
gravatar for Jonathanjacobs
5.6 years ago by
Rockville, MD
Jonathanjacobs210 wrote:

How are you normalizing your data to ERCC spike in data? normalize.loess?

ADD COMMENTlink written 5.6 years ago by Jonathanjacobs210

normalize.loess is from the affy package, so it'd be pretty odd to use it with DESeq. The normal way to do this would be to use the spike-in data to calculate the size factor.

ADD REPLYlink written 5.6 years ago by Devon Ryan89k

Apologies - I was thinking RPKM, not count based data.

ADD REPLYlink written 5.6 years ago by Jonathanjacobs210

dpryan79 -- Could you elaborate on how you use spike-in data to calculate the size factor? I currently have four conditions (WT vs. EX1 vs. EX2 vs. EX3). What I have been trying is to create the classic MA plot with only the ERCC spike-ins. So, y-axis is log2(EX1/WT) and x-axis is 0.5log2(EX1WT). Then I use loess regression to fit the spike-in data and use that loess line to normalize the counts for the entire data set. Is there a more straightforward way to do this?

ADD REPLYlink written 5.5 years ago by jep0
0
gravatar for ivivek_ngs
5.6 years ago by
ivivek_ngs4.8k
Seattle,WA, USA
ivivek_ngs4.8k wrote:

Thanks I have already identified it and implemented the method. But I am having trouble in edgeR. I want to see my DEGs and compare it with DESEq. Now the manual seems to be a bit complez for me. I have a problem where I would like to understand how to use the same data set that I used in DESeq and then apply the edgeR normalizing the data with spike-ins and then test for DE in edgeR. I am not being able to formulate the script. Can any one help . I am giving some sample data sets below.

I have started using edgeR, I need some input about the method. I have a data set with read counts from two different patients ( lets say it looks like the below) head(m) Sample_118z.0 Sample_132z.0 Sample_118p.0 Sample_132p2.0 XLOC_000001 626 3516 1534 2603 XLOC_000002 82 342 175 304 XLOC_000003 361 2000 80 195 XLOC_000004 30 143 49 66 XLOC_000005 0 0 0 0 XLOC_000006 0 0 0 1

A sample of the data file I also have the spike ins data which should be used to normalized this counts head(sp) Sample_118z.0 Sample_132z.0 Sample_118p.0 Sample_132p2.0 ERCC-00009 30 143 49 66 ERCC-00025 2 13 9 7 ERCC-00031 0 0 0 0 ERCC-00034 1 9 1 3 ERCC-00035 11 35 5 7 ERCC-00042 37 186 43 78

I have used the DESeq earlier with the same experimental conditions and normalized the read counts with spike ins data and then ran the n.binom test for extract DEGs , but I cannot figure out how to proceed the same in case of the edgeR. Can anyone help me?

ADD COMMENTlink written 5.6 years ago by ivivek_ngs4.8k
1

The method for edgeR is pretty similar to that for DESeq. If you look at the help for calcNormFactors, you'll see that it returns the same DGElist object with the normalization factors supplied in object$samples$norm.factors. So, just fill them in there.

ADD REPLYlink written 5.6 years ago by Devon Ryan89k

Hello,

Can you show us your code or the complete method you aplly for normalization.

Sorr y I am new with ERCC and I don't understand the way you are normalizing the ERCC with DESEq.

 

Thanks a lot,

 

 

ADD REPLYlink written 4.2 years ago by k.mamoud10

There are many existing references and public code repositories for RNA-Seq analysis for you to investigate. How about you give the analysis a try on your own and then if you have a clear and defined question then please post it on the site.  We're here to help, but it will help us if you have a defined question.  Thanks!  

ADD REPLYlink written 4.2 years ago by Josh Herr5.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 644 users visited in the last hour