Question: MDS plots with R
1
gravatar for Biogeek
3.4 years ago by
Biogeek240
Biogeek240 wrote:

Can someone give me a simple code so that I can plot an MDS plot for the example data below, I want to compare 9 conditions with a total of 125,000 genes (rows) and expression values. I am new to R and have as much clue as a headless chicken. I am however learning at the moment. Thanks.

X                      LOW1      LOW2       LOW3 -------> 9 conditions total
gene1               0.432       0.231          0.3211
gene2
gene3
gene4
gene5
gene6
gene7
gene8


 

statistics rna-seq mds plots • 11k views
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Biogeek240

I've carried out DE with EdgeR but all I want to do is express a simple graph of how different the samples are,regardless of D.E.G's... I'm novice hence I've just followed the Trinity/ RSEM/EdgeR pipeline. Regarding samples,there are no replicates and I have used the 'no replicates' option available in EdgeR.

ADD REPLYlink written 3.4 years ago by Biogeek240

In that case just reload the code you used to run edgeR. Find whatever variable you called  the output of calcNormFactors with and use plotMDS().

e.g.

y <- calcNormFactors(y)

plotMDS(y)

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by gtho123170

Still you need to explain why you have 125,000 rows.

ADD REPLYlink written 3.4 years ago by Michael Dondrup43k

Could they be transcripts, not genes. Isn't 125K transcripts a concievable number given a large genome and de novo assembly.

ADD REPLYlink written 3.4 years ago by gtho123170

Please see below,thanks.

 

ADD REPLYlink written 3.4 years ago by Biogeek240

Thanks for the advice although I'm still trying to understand. I utilised Trinity to perform a de novo transcriptome assemblage, read counts were provided by realigning the reads back to the transcriptome using RSEM, output from RSEM was then put in a matrix,normalised and fed into EdgeR . The methodology is published by Haas et al.

Basically my data is over 3 time points, day x,y,z. There are 3 conditions, so in essence 9 samples in total. EdgeR does Genes but also transcripts... as I am looking at functional annotation downstream I am using transcript expression levels rather than the genes because all the genes end with the identify of comp2132435_seq0 etc... So for this gene there are possibly 7 isoforms(transcripts). Only the transcripts can be identified as the genes have no actual sequence attached to them when I extract them from the .fasta file assembled by trinity. Does this make sense to you guys?

ADD REPLYlink written 3.4 years ago by Biogeek240

Please use the comment function, do not make a new answer.

ADD REPLYlink written 3.4 years ago by Michael Dondrup43k

Identifying DE features: No biological replicates

http://trinityrnaseq.sourceforge.net/analysis/diff_expression_analysis.html

This is the pipeline I have used... Can you please advise me now what file to use if I want to creat MA/MDSplots, do you guys have any good references where I can learn how to use the MDSplot function within Rstudio. Thanks.

 

ADD REPLYlink written 3.4 years ago by Biogeek240

Please look at the links in both answers, edgeR manual contains clear instructions for how to do this.

ADD REPLYlink written 3.4 years ago by Michael Dondrup43k
2
gravatar for gtho123
3.4 years ago by
gtho123170
New Zealand
gtho123170 wrote:

Since this is an RNA-Seq experiment you are probably going to want to do differential expression (DE). When you perform DE with a package like edgeR you can easily draw a MDS plot using the plotMDS() function (which is actually from the limma package).

To do this you need to use tables of counts, not expression values. This could be achieved using samtools.

Give this a read and it should help you out:

http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

I don't really want to write the code for you as this is actually really achievable and rewarding if you follow along with the tutorial/vinegette, especially if you are new. Granted if you really have nine different conditions (i.e. none are replicates) it could get complicated but given that the first three are called LOW* this does not appear to be the case.

Other DE packages you could look at include DESeq and limma using voom:

http://bioconductor.org/packages/release/bioc/vignettes/DESeq/inst/doc/DESeq.pdf

http://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

EDIT:

I notice from your other questions that you have been using edgeR. This is a good example of "always check the manual" (DO THIS!). However you seem confused about what to put into it so here is a quick run down about making count tables (in the terminal not R).

1. Map your reads to your reference (i.e. transcriptome, genome ...) using something like BWA MEM

2. Convert the SAM files to BAM files using samtools view -bS *.sam > *.bam

3. Sort your BAM files using samtools sort *.bam *_sort

4. Index your sorted BAM file using samtools index *_sort.bam

5. Get your count tables using samtools idxstats *_sort.bam >  *.txt

     (Output is gene name - gene length -  mapped reads - unmapped reads)

6. Go into R and combine your mapped reads into one count table

7. Run edgeR and make your MDS plot

Hope this helps.

 

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by gtho123170
1
gravatar for Michael Dondrup
3.4 years ago by
Bergen, Norway
Michael Dondrup43k wrote:

The first approach that comes to mind is to try with R: plot(cmdscale(dist(iris[,1:4])), col=iris[,5])

BUT: I think you might have too many rows to have a readily applicable MDS problem for scatter-plotting. No single organism has 125k genes, so you are possibly using a concatenation of gene expression measurements from different organisms. You should definitely consider that your problem is ill-posed, whether you could try to reduce the number of rows by filtering genes or annotation of known orthologs.

In any case you should consider the use of smoothScatter for the resulting data, because even for only 30k points, using plot just gives a black blob.

Edit: As you are using edgeR, you can use the function plotMDS() as described in the manual: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf pp. 63(data preparation: calcNormFactors)-65

This produces a MDS plot of the colums instead of rows, similar to:

plot(cmdscale(dist(t(iris[,1:4]))), type = "n")
text(cmdscale(dist(t(iris[,1:4]))), labels=colnames(iris)[1:4])

 

 

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Michael Dondrup43k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1243 users visited in the last hour