Question: Low Number Of Replicates Deseq
federico.gaiti (Brisbane) wrote, 5.3 years ago:

Hi all,

I am using DESeq for DGE analysis.

I have STRANDED RNA-Seq data for 4 developmental stages with no replicates. Since a reliable DGE analysis needs replicates, I obtained (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage.

Before doing the DGE analysis, I wanted to test the correlation between these samples, just to show that similar samples “cluster” together. If they do, I can then add the unstranded data to my DGE analysis to have more replicates per stage.

I mapped the raw reads to the genome using TopHat, sorted the BAM files by name, and used htseq-count to get the raw read counts for both datasets. For the stranded data I used the option -s yes and for the unstranded data I used -s no.
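As a rough illustration of what the three -s settings mean for a single-end read, here is a toy model in Python. It is only a sketch of the strand rule, not htseq-count's actual code, and the function name `read_assigned` is hypothetical.

```python
# Toy model of the htseq-count -s option for single-end reads.
# read_assigned is a hypothetical helper, not part of HTSeq.

def read_assigned(read_strand: str, gene_strand: str, mode: str) -> bool:
    """Return True if a read overlapping a gene would count toward it."""
    if mode == "no":        # unstranded: the read's strand is ignored
        return True
    if mode == "yes":       # read must lie on the same strand as the gene
        return read_strand == gene_strand
    if mode == "reverse":   # read must lie on the opposite strand
        return read_strand != gene_strand
    raise ValueError(f"unknown mode: {mode}")


# Strands of five made-up reads overlapping a gene on the '+' strand.
reads = ["+", "+", "-", "+", "-"]
for mode in ("yes", "reverse", "no"):
    n = sum(read_assigned(r, "+", mode) for r in reads)
    print(mode, n)
```

In this toy model the -s no count is the sum of the -s yes and -s reverse counts, because reads from both strands are accepted.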

I used DESeq to attach the metadata and normalize the counts, and I removed the genes that have a count of 0 in every sample. I then calculated the correlation, which was really low.
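For reference, the filtering and normalization step can be sketched in plain Python. This is a minimal re-implementation of the median-of-ratios idea behind DESeq's size factors, written for illustration only; the function names and the count table are made up, and real analyses should use DESeq itself.

```python
import math
from statistics import median


def drop_all_zero(counts):
    """Keep only genes with at least one non-zero count."""
    return {g: row for g, row in counts.items() if any(c > 0 for c in row)}


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column)."""
    n_samples = len(next(iter(counts.values())))
    # Only genes without any zero count have a finite log geometric mean.
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    return [
        math.exp(median(math.log(row[j]) - lgm
                        for row, lgm in zip(usable, log_geo_means)))
        for j in range(n_samples)
    ]


def normalize(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}


# Made-up counts: two samples, the second sequenced twice as deeply.
counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [0, 0],
    "geneD": [5, 10],
}
filtered = drop_all_zero(counts)
factors = size_factors(filtered)
normalized = normalize(filtered, factors)
print(factors)
```

After normalization the two samples in this toy table have equal values for every gene, which is what one would expect when they differ only in sequencing depth.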

I then tried to use htseq-count with the option -s reverse for the stranded data and still got really low correlation.

So I reran htseq-count on the stranded data with the option -s no, and this way I got a very similar number of total counts for the unstranded and stranded data (whereas in both previous cases the stranded totals were double). I then included the metadata, estimated the new size factors, normalized, and calculated the new correlation. Both Pearson and Spearman correlations came out pretty well, which was confirmed by both a PCA and a correlogram.
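The two correlation measures can be sketched as follows. The numbers are made-up normalized counts, not my data, and taking log2(x + 1) before Pearson is an assumption added here to keep a few highly expressed genes from dominating.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative normalized counts for five genes in two samples.
stranded = [12.0, 150.0, 3.0, 47.0, 980.0]
unstranded = [15.0, 140.0, 2.0, 60.0, 1100.0]
log_s = [math.log2(v + 1) for v in stranded]
log_u = [math.log2(v + 1) for v in unstranded]
print(pearson(log_s, log_u), spearman(stranded, unstranded))
```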

However, I'd still like to find a way to use the stranded counts. I am not sure whether I lose information by running htseq-count with -s no on the stranded data.

What I had in mind was to use the unstranded data to estimate the level of variation, to get a threshold for DE detection, while still using the stranded data as the expression values. I am not sure I can do that, though, given that one dataset is stranded and the other is not.

I would like to hear from you if you have any thoughts about this.

Let me know if you need more information to better understand the issue.

Thanks a lot Federico

replicates R variation deseq • 2.4k views
Nicolas Rosewick (Belgium, Brussels) wrote, 5.3 years ago:

In my opinion, the best approach is:

  • Count the stranded data with -s yes (or -s reverse, depending on your library type)
  • Count the unstranded data with -s no
  • In DESeq, build an experimental design data frame like this:

    Sample  Condition  LibType
    A       condX      stranded
    B       condY      stranded
    C       condZ      stranded
    D       condX      unstranded
    E       condY      unstranded
    F       condZ      unstranded
    G       condX      unstranded

Then follow section 4 of the DESeq vignette (http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf), which covers multi-factor designs.
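Before fitting anything, one simple sanity check on such a design is that each condition appears under both library types, so the library-type effect is not confounded with condition. A small Python sketch, with placeholder sample names and hypothetical helper functions:

```python
from collections import defaultdict

# Sample sheet mirroring the layout above (sample names are placeholders).
design = [
    ("A", "condX", "stranded"),
    ("B", "condY", "stranded"),
    ("C", "condZ", "stranded"),
    ("D", "condX", "unstranded"),
    ("E", "condY", "unstranded"),
    ("F", "condZ", "unstranded"),
    ("G", "condX", "unstranded"),
]


def lib_types_per_condition(rows):
    """Map each condition to the set of library types it was sequenced with."""
    seen = defaultdict(set)
    for _sample, condition, lib_type in rows:
        seen[condition].add(lib_type)
    return dict(seen)


def is_crossed(rows):
    """True if every condition was sequenced with every library type."""
    all_types = {lib_type for _, _, lib_type in rows}
    per_condition = lib_types_per_condition(rows)
    return all(types == all_types for types in per_condition.values())


print(lib_types_per_condition(design))
print(is_crossed(design))
```

If a condition exists only in one library type, differences for that condition cannot be cleanly separated from the protocol difference.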


Thanks. I'll give it a try and compare it with the approach that jurgnjn suggested as well.

(reply written 5.3 years ago by federico.gaiti)
jurgnjn (United Kingdom) wrote, 5.3 years ago:

There are heaps of small RNAs located on the opposite strand of protein-coding genes (literature: "antisense transcription"). Hence, the discrepancy between the stranded and unstranded expression estimates could be a real biological effect.

You can check this hypothesis by looking at the stranded alignments for a few individual genes with large discrepancies between the stranded and unstranded expression estimates, and checking whether the unstranded coverage conforms to the splicing structure of the protein-coding gene (it shouldn't).
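To pick candidate genes for that inspection, one could rank genes by how much the two estimates disagree. A minimal Python sketch, assuming two per-gene count tables for the same library (one counted stranded, one unstranded); the counts, the function names, and the pseudocount of 1 are illustrative assumptions:

```python
import math


def log2_ratio(unstranded_count, stranded_count, pseudocount=1.0):
    """log2 of unstranded over stranded count; the pseudocount avoids log(0)."""
    return math.log2((unstranded_count + pseudocount)
                     / (stranded_count + pseudocount))


def top_discrepancies(stranded, unstranded, n=3):
    """Return the n genes with the largest absolute log2 ratio."""
    shared = stranded.keys() & unstranded.keys()
    scored = [(g, log2_ratio(unstranded[g], stranded[g])) for g in shared]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:n]


# Made-up counts for four genes.
stranded = {"gene1": 100, "gene2": 50, "gene3": 10, "gene4": 400}
unstranded = {"gene1": 105, "gene2": 300, "gene3": 12, "gene4": 390}

for gene, ratio in top_discrepancies(stranded, unstranded, n=2):
    print(gene, round(ratio, 2))
```

Genes at the top of such a list would be the ones to look at in a genome browser.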

If this really is the (main) reason for the discrepancies, you could use an unstranded alignment of the stranded library in conjunction with the three unstranded libraries for DGE. The caveat with this approach is that the unstranded expression estimates reflect the total expression at the given locus, not only the expression of protein-coding genes. You should also still compare the DGE calls from all four libraries with the DGE calls from the three unstranded libraries as a rough sanity check. After all, they were prepared by different labs, using different protocols...
