Question: Analyzing log2 RPKM data to find DEGs (differentially expressed genes)
5 months ago by
naseerkhan8610 wrote:

I have log2 RPKM expression data that I want to analyze to find differentially expressed genes. In my case I am trying to analyze the GEO dataset GSE102741, which contains both control and subject samples.

My questions are as follows:

1) The data are given as log2 RPKM values. Do I need to convert them to some other format first? Is there a free online tool that can convert them into the format required for gene expression analysis?

2) How can I find DEGs (differentially expressed genes) in this dataset? Is there a free, state-of-the-art online tool that I can use to find DEGs?

I am new to DEGs and RNA-seq datasets, so I apologize if this question is too naive.

What is my GOAL?

My end goal is to cluster the genes in the dataset separately for the control and autism groups and see how many clusters form. I will then dig deeper into these clusters, for example how the cluster sizes vary between the two groups. After that I will cluster the DEGs as well, repeat these operations on other datasets, and compare the DEGs across datasets. If anybody has useful suggestions, please guide me.

Update

At this link a huge SRA dataset is available, but downloading it and running the suggested software on the data is not possible for me due to limited bandwidth, storage, and computational power.

Regards

modified 5 months ago by ATpoint31k • written 5 months ago by naseerkhan8610

Check whether the dataset comes with raw read counts. I would suggest using edgeR, limma, or DESeq2 for differential gene expression analysis. EdgeR / limma normalizes the count matrix based on the library size. It is not wise to use the normalized expression data you mentioned (RPKM is already normalized to gene length and sequencing depth); find the raw count matrix instead and then run the tools mentioned above. If no count matrix is available, download the raw SRA files and run an alignment-and-quantification pipeline to generate your own count matrix.
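To illustrate why RPKM values cannot stand in for raw counts, here is a minimal sketch based on the standard RPKM definition (reads divided by gene length in kilobases and by total mapped reads in millions). The gene length, read counts, and library sizes are made-up numbers, not values from this dataset.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    length_kb = gene_length_bp / 1_000
    library_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * library_millions)


# Two hypothetical measurements of the same 2 kb gene (illustrative numbers only).
shallow = rpkm(read_count=20, gene_length_bp=2_000, total_mapped_reads=10_000_000)
deep = rpkm(read_count=200, gene_length_bp=2_000, total_mapped_reads=100_000_000)

# Both calls give the same RPKM, although one rests on 20 reads and the other on 200.
print(shallow, deep)
```

Count-based tools model the sampling noise of the counts themselves, so 20 reads and 200 reads carry different amounts of evidence. Once the data are reduced to RPKM (and then log2-transformed), that information cannot be recovered without the original library sizes and gene lengths.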

written 5 months ago by c.chakraborty160

The site has a link to SRA, but that dataset is about 650 GB in size. That is impossible for me to download, as I don't have that much computing power or storage.

written 5 months ago by naseerkhan8610

You can do it chunk-wise: download 10 samples, quantify them, delete the fastq files, and repeat until finished. I am not sure what your download bandwidth is, but this is doable in principle. See my answer on how to efficiently download fastq files from ENA.
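The chunk-wise idea can be sketched roughly as follows. This is only an outline of the control flow: `download`, `quantify`, and `cleanup` are hypothetical callables that you would replace with your own download and quantification commands, and the run IDs are placeholders.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def process_in_chunks(
    run_ids: Sequence[str],
    size: int,
    download: Callable[[str], None],
    quantify: Callable[[str], None],
    cleanup: Callable[[str], None],
) -> List[str]:
    """Download, quantify, and delete one chunk at a time to cap disk usage."""
    finished: List[str] = []
    for chunk in batches(run_ids, size):
        for run_id in chunk:
            download(run_id)   # fetch the fastq files of this chunk
        for run_id in chunk:
            quantify(run_id)   # keep only the small quantification output
        for run_id in chunk:
            cleanup(run_id)    # delete the fastq files before the next chunk
        finished.extend(chunk)
    return finished


# Stand-in callables that only record what would happen (no real I/O).
log = []
done = process_in_chunks(
    run_ids=[f"RUN{i:03d}" for i in range(1, 6)],  # placeholder IDs
    size=2,
    download=lambda r: log.append(("download", r)),
    quantify=lambda r: log.append(("quantify", r)),
    cleanup=lambda r: log.append(("delete", r)),
)
print(done)
```

With this layout, at most `size` samples' worth of fastq files sit on disk at any time, and only the per-sample quantification results accumulate.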

modified 5 months ago • written 5 months ago by ATpoint31k

EdgeR / limma normalizes the count matrix based on the library size.

I am not very familiar with limma, but the edgeR default is TMM, where the library size is further corrected with a scaling factor that takes the library composition into account. DESeq2's RLE approach is similar.

written 5 months ago by ATpoint31k

Can you please explain your point? I did not fully get it.

written 5 months ago by naseerkhan8610

Wanted to point out that it is not a naive per-million scaling that edgeR does.
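A small toy example may make the difference concrete. The sketch below contrasts naive per-million scaling with a median-of-ratios size factor, which is the idea behind DESeq2's RLE approach. It is a simplified illustration in plain Python, not the code of either package (edgeR's TMM uses a different estimator with the same goal), and the count matrix is made up.

```python
import math
from statistics import median


def per_million_factors(counts):
    """Naive scaling: each sample's total count divided by one million."""
    n_samples = len(counts[0])
    return [sum(row[j] for row in counts) / 1e6 for j in range(n_samples)]


def median_of_ratios_factors(counts):
    """Per-sample median ratio of each gene to its geometric mean across samples.

    Genes with a zero count in any sample are skipped.
    """
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    n_samples = len(counts[0])
    return [
        math.exp(median(math.log(row[j]) - lgm
                        for row, lgm in zip(usable, log_geo_means)))
        for j in range(n_samples)
    ]


# Rows are genes, columns are two samples (made-up numbers).
# Four genes are unchanged; the fifth is strongly up in the second sample.
counts = [
    [100, 100],
    [200, 200],
    [300, 300],
    [400, 400],
    [1000, 9000],
]

naive = per_million_factors(counts)
robust = median_of_ratios_factors(counts)
print("per-million factors:     ", naive)
print("median-of-ratios factors:", robust)
```

In this toy matrix the single highly expressed gene inflates the second sample's total, so per-million scaling would make the four unchanged genes look about five times lower there. The median-of-ratios factors stay equal for both samples because the median ignores that one outlying gene.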

written 5 months ago by ATpoint31k

Is there some online tool or resource where I can load that huge SRA dataset, analyze it, quantify gene expression across samples, and finally find DEGs?

written 5 months ago by naseerkhan8610

Not that I know of for RNA-seq. For arrays there is GEO2R within the NCBI GEO environment.

written 5 months ago by ATpoint31k

Since they did not deposit raw counts, you had better reanalyze the deposited FASTQ files if you want good results. I had a look at the dataset and it is very interesting. I have one concern, though: since this is an RNA-seq study, I would expect the samples to have been collected within a certain post-mortem interval (PMI) so that RNA degradation is not an issue (also, the PCA in their supplementary figures 2 & 4 does not look convincing). Neither the article nor the GEO repository provides any details on PMI. Verify these details with the authors before you start the analysis.

modified 5 months ago • written 5 months ago by EagleEye6.6k

Thanks for your reply. For this dataset the authors have provided a file "GSE28521_RAW.tar" and a non-normalized file, of size 3.9 MB and 22 MB respectively. Extracting the RAW file gave a file with the .bgx extension. I opened it in Notepad++, but it looks like a metadata file rather than the counts I would expect in a RAW file. It is also not clear to me what the non-normalized file is for. What can I do with these RAW and non-normalized files, or are they useless? Please suggest.

modified 5 months ago • written 5 months ago by naseerkhan8610
5 months ago by
ATpoint31k
Germany
ATpoint31k wrote:

I will never understand why authors do not simply upload the raw count matrix. These RPKM-whatever data are utterly useless. Anyway, you can download the fastq files directly from ENA, see Fast download of FASTQ files from the European Nucleotide Archive (ENA)

From there on I suggest you use a lightweight quantifier such as salmon to get transcript abundance estimates for each sample. This is computationally inexpensive and very fast. Then use tximport to summarize these to the gene level (i.e., to get a count matrix, see https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html ), followed by a DEG tool of your choice for the differential analysis. For inspiration see e.g. https://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html
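To give a feel for what "summarizing to the gene level" means, here is a rough sketch that sums the `NumReads` column of a salmon `quant.sf` table over all transcripts of each gene. It is a deliberately simplified stand-in for tximport, which is an R package and also handles multiple samples and transcript-length information. The transcript names, gene names, and numbers are made up.

```python
import csv
import io
from collections import defaultdict


def gene_level_counts(quant_sf_text, tx2gene):
    """Sum salmon's estimated read counts over all transcripts of each gene."""
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    for row in reader:
        gene = tx2gene.get(row["Name"])
        if gene is None:
            continue  # transcript missing from the mapping: skipped in this sketch
        totals[gene] += float(row["NumReads"])
    return dict(totals)


# A tiny made-up quant.sf table with salmon's column layout.
example_quant_sf = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "txA1\t1500\t1320.0\t10.0\t120.0\n"
    "txA2\t900\t720.0\t5.0\t30.5\n"
    "txB1\t2000\t1820.0\t2.0\t40.0\n"
)

# Hypothetical transcript-to-gene mapping (normally derived from the annotation).
tx2gene = {"txA1": "geneA", "txA2": "geneA", "txB1": "geneB"}

gene_counts = gene_level_counts(example_quant_sf, tx2gene)
print(gene_counts)
```

Repeating this for every sample and collecting the results into one table gives the kind of gene-by-sample count matrix that the DEG tools expect. For a real analysis, tximport itself is the safer route.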

modified 5 months ago • written 5 months ago by ATpoint31k

Thanks for your detailed reply. Thank you very much indeed!

modified 5 months ago • written 5 months ago by naseerkhan8610