Could I use the ExpressionSet generated from GEOquery for Monocle3 or does it have to be formatted differently?
2.3 years ago
Pratik ▴ 840

Hello,

I hope you are safe and well.

I understand how to make the ExpressionSet using GEOquery.

Could I use the ExpressionSet generated from GEOquery for Monocle3 or does it have to be formatted differently?

I would really appreciate someone's help. I have been struggling with this for some time.

Very Respectfully, Pratik

RNA-Seq R monocle3 next-gen • 947 views
2.3 years ago

It would be better to follow the guidance and create a cell_data_set object - see here: https://cole-trapnell-lab.github.io/monocle3/docs/starting/

If you are starting from an ExpressionSet object, eset, then you should have the necessary components via:

library(Biobase) # provides exprs(), fData(), pData()

exprs(eset) # expression data (features x samples matrix)
fData(eset) # feature data
pData(eset) # pheno data
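
As a rough sketch of how those three pieces could map onto Monocle3's inputs: `new_cell_data_set()` takes an expression matrix (genes in rows, cells in columns), a cell metadata data frame whose row names match the matrix column names, and a gene metadata data frame whose row names match the matrix row names; the Monocle3 documentation also asks for a `gene_short_name` column. The toy objects below merely stand in for `exprs(eset)`, `pData(eset)` and `fData(eset)`; the cell names and values are made up for illustration, and the final call is only attempted if monocle3 happens to be installed.

```r
# Toy stand-ins for exprs(eset), pData(eset) and fData(eset).
# Cell names and values are illustrative only.
expr <- matrix(
  c(53L, 0L, 136L,
    59L, 2L, 120L),
  nrow = 3,
  dimnames = list(c("PDS5A", "PDSS1", "PDX1"), c("cell_1", "cell_2"))
)
cell_meta <- data.frame(sample = c("A", "B"), row.names = colnames(expr))
gene_meta <- data.frame(id = rownames(expr), row.names = rownames(expr))

# Monocle3 expects a gene_short_name column in the gene metadata.
gene_meta$gene_short_name <- rownames(gene_meta)

# The row names of both metadata tables must line up with the matrix.
stopifnot(
  identical(rownames(cell_meta), colnames(expr)),
  identical(rownames(gene_meta), rownames(expr))
)

# Only attempted when monocle3 is installed.
if (requireNamespace("monocle3", quietly = TRUE)) {
  cds <- monocle3::new_cell_data_set(
    expr,
    cell_metadata = cell_meta,
    gene_metadata = gene_meta
  )
}
```

With a real ExpressionSet, the same checks would apply to the outputs of the three accessor calls before they are passed on.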


However, keep in mind that Monocle is designed for scRNA-seq; thus, your initial data would be more likely to be stored in a SingleCellExperiment object, or Seurat's format specification. Why are you coming from an ExpressionSet object?

Kevin


Hi Kevin,

Thank you very much for responding. I have been struggling with this for days, so I really appreciate your response. Please bear with me, as I'm only beginning in this field.

I am coming from the ExpressionSet because I was told that creating an expression matrix would make my life easier for analyzing data in Monocle3. I could not find any solid information on how to create an expression matrix; the closest answer I found was creating an ExpressionSet object using GEOquery. My thought process was that GEOquery would let me download the series_matrix.txt.gz file from NIH GEO and create an ExpressionSet object to use in Monocle3.

Forgive me for my ignorance, if there is any here. I am sensing there is.

Ideally the data I want to use is here:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110154

I was originally following the guidance at the link you provided (https://cole-trapnell-lab.github.io/monocle3/docs/starting/). However, on the ENA page for the GSE110154 data above (https://www.ebi.ac.uk/ena/browser/view/PRJNA432959), I learned that there are TWO FASTQ files per sample rather than one, and there isn't yet a publication associated with the data that explains how it was processed. My goal is actually to analyze the data before publication and share my results with the Principal Investigator (I wish to join her lab in the future). I learned how to download all the files successfully thanks to this tutorial: Fast download of FASTQ files from the European Nucleotide Archive (ENA). However, I am now stuck on how to turn these files into the three files I need for Monocle3.

My understanding is that I will need to use 10x cellranger to generate the files. My next approach will be studying how Illumina NextSeq 500 works (the platform used for the data) so I can understand what was generated, why there are two files, and perhaps how to use it too.

Unless there is an easier way, such as using the series_matrix.txt.gz file?

Any clues?

Again thank you very much Kevin. I really do appreciate your response.

Very Respectfully, Pratik


I think I'm good. I'm just going to generate the matrices I need using these tutorials:

https://davetang.org/muse/2018/08/09/getting-started-with-cell-ranger/

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ct

Apparently, the first FASTQ file per sample is "read 1 [which] contains the cell barcode and UMI tag", and the second FASTQ file per sample is "read 2 [which] contains cDNA sequence", according to the davetang.org tutorial.

Is scRNA-seq analysis using Monocle 3 possible using the series_matrix.txt.gz file from NIH GEO?

I might ask this question in a new post.

Very Respectfully, Pratik


Sure, if you are okay regenerating the count matrices, that would be ideal. You should check the reads inside the FASTQs just to make sure which is which (barcode+UMI or cDNA sequence).
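
One way to do that check, as a minimal sketch in base R: read the first few records of each file and tabulate the read lengths. The helper name `fastq_read_lengths` and the file names in the comments are hypothetical placeholders, not anything taken from the study.

```r
# Return the lengths of the first n reads in a (possibly gzipped) FASTQ file.
# Each FASTQ record spans four lines; the sequence is the second line.
fastq_read_lengths <- function(path, n = 1000) {
  con <- gzfile(path, open = "rt")  # gzfile() also reads uncompressed files
  on.exit(close(con))
  lines <- readLines(con, n = 4 * n)
  if (length(lines) < 2) return(integer(0))
  nchar(lines[seq(2, length(lines), by = 4)])
}

# Hypothetical file names; substitute the files downloaded from ENA.
# table(fastq_read_lengths("SRRXXXXXXX_1.fastq.gz"))
# table(fastq_read_lengths("SRRXXXXXXX_2.fastq.gz"))
```

For 10x Chromium libraries, the barcode+UMI read is short and fixed-length (26 bases for v2 chemistry, 28 for v3), while the cDNA read is longer. Whether this dataset follows that layout is something to confirm from the reads rather than assume.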

It looks like the count matrices per sample are already provided for that study, though (?)


Hi Kevin,

Thank you for responding again. I do appreciate your wisdom.

Related to the count matrices:

Is the GSE110154_RAW.tar file the set of per-sample count matrices you are referring to?

I opened one of the sample .csv files in the GSE110154_RAW.tar archive and saw genes that I am familiar with, each with a number next to it. I'm assuming that number is either the count of mRNA molecules of that gene detected in the cell or, more generally, the level of that gene's mRNA relative to other genes' transcripts.

Here is a snippet from the file:

PDS5A,53
PDS5B,59
PDSS1,0
PDSS2,0
PDX1,136
PDXDC1,17
PDXDC2P,22
PDXK,0


I will ask the below question in a new post so someone else in my shoes will have a path to follow if those files, indeed, are the count matrices:

How would I use per-sample count matrix files in Monocle3? I thought Monocle3 requires only three input files (expression_matrix, cell_metadata, gene_annotation), or the three files generated by cellranger. How would I use the individual cell files in Monocle3?

Here is the link to the above question: How would I input the count matrices per sample files into Monocle3?

I also have another question related to the two FASTQ files, but I'll make a new post for that one as well.

Very Respectfully, Pratik


I would aggregate the count matrices and then normalise in, e.g., Seurat. Then, from Seurat, transform the normalised data and use this as input to Monocle.
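
The aggregation step could look roughly like the sketch below. It assumes each per-sample file holds headerless `gene,count` rows like the snippet quoted earlier in the thread, and that every file lists the same genes; both are assumptions to verify against the real files. The function names are made up for illustration.

```r
# Read one per-sample file of "gene,count" rows (assumed to have no header)
# into a named vector of counts.
read_sample_counts <- function(path) {
  df <- read.csv(path, header = FALSE, col.names = c("gene", "count"),
                 stringsAsFactors = FALSE)
  stats::setNames(df$count, df$gene)
}

# Combine per-sample files into one genes x samples matrix.
# Assumes every file lists the same genes; rows follow the first file's order.
aggregate_counts <- function(paths, sample_names = basename(paths)) {
  per_sample <- lapply(paths, read_sample_counts)
  genes <- names(per_sample[[1]])
  if (anyDuplicated(genes)) stop("Duplicate gene names in the first file")
  same_genes <- vapply(per_sample,
                       function(x) setequal(names(x), genes), logical(1))
  if (!all(same_genes)) stop("Files do not share the same gene set")
  # Reorder each sample to the reference gene order before binding columns.
  counts <- do.call(cbind, lapply(per_sample, function(x) x[genes]))
  colnames(counts) <- sample_names
  counts
}
```

Reordering by gene name before `cbind()` matters because binding columns silently trusts row order; if two files list genes in a different order, a plain `cbind()` would misalign the counts.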


Awesome. Thank you Kevin.

I appreciate you being direct and to the point.


Sure. You have to be aware of batch effects, but I think that Seurat manages these if you follow the Seurat tutorial pages. From what I can see, this study used a technology that outputs just a single cell per sample. Batch effects are a major issue in scRNA-seq studies, and there is currently no standard way to address them.


Hi Kevin,

So I made some progress. I got as far as aggregating the data using cbind. Any tips on how a Seurat count matrix should look? I posted it in a new question: What is a count matrix for input into Seurat supposed to look like?
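
For reference, a sketch of the shape Seurat generally expects: genes (features) in rows, cells in columns, unique row and column names, and typically raw non-negative integer counts. The helper `check_count_matrix` is a hypothetical name, the toy matrix is made up, and the `CreateSeuratObject()` call is only attempted if Seurat is installed.

```r
# Basic sanity checks for a count matrix before handing it to Seurat.
check_count_matrix <- function(counts) {
  stopifnot(
    is.matrix(counts), is.numeric(counts),
    !is.null(rownames(counts)), !is.null(colnames(counts)),
    !anyDuplicated(rownames(counts)),   # genes in rows, unique names
    !anyDuplicated(colnames(counts)),   # cells in columns, unique names
    !anyNA(counts),
    all(counts >= 0),
    all(counts == round(counts))        # raw counts, not normalised values
  )
  invisible(TRUE)
}

# Toy matrix: 3 genes x 2 cells (values are illustrative only).
counts <- matrix(
  c(53, 0, 136, 59, 2, 120), nrow = 3,
  dimnames = list(c("PDS5A", "PDSS1", "PDX1"), c("cell_1", "cell_2"))
)
check_count_matrix(counts)

# Only attempted when Seurat is installed.
if (requireNamespace("Seurat", quietly = TRUE)) {
  seu <- Seurat::CreateSeuratObject(counts = counts)
}
```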

Very Respectfully, Pratik