Question

What's the criteria to simulate RNA-seq read counts table

0

Entering edit mode

7.6 years ago

statfa ▴ 790

Hi,

As my recent previous posts shows, I'm looking to simulate a read count table using Polyester. Polyester has three different ways to simulate data:

1- You have already a count table from a real study: you simply give this matrix to Polyester as input and Polyester estimates the parameters from the matrix and simulates a count matrix for you. As I know, in this scenario you need to know the expression fold changes of your real count data beforehand. Right? So, since I don't know expression FCs of my genes in my real count data, I can't use this function. Can I?

2- You have already a count table from a real study and the FASTA file from your reference genome: In this case, Polyester simulates RNA read FASTA files and you can then use any known workflow to obtain the simulated count table. Here, you again have to know the expression fold changes of your real count data beforehand.

3- You define a matrix of your desired FCs and give it as input to Polyester. You also provide Polyester with the reference FASTA file of an organism. Polyester then creates RNA read FASTA files using the reference FASTA file and the expression FC matrix. And then you use any known workflow to obtain the simulated count table. This scenario may be helpful to me.

Now, my questions are:

1- In scenario number 3, I don't use any real data set but the main FASTA file of the reference genome. Is simulation correct then? In the papers I have seen, they always use a real dataset to estimate the parameters and obtain a simulated read count table. If I use scenario 3, is it scientifically correct to publish a paper?

2- How should I define the expression FC matrix? I thought that defining 10000 genes with 3000 of them to be equally expressed and the rest 7000 of them to be DE would be good. I want to divide my DE genes into subgroups according to my need. I haven't seen any criteria for that. In each paper they do a different thing.

3- It isn't wrong to post these information on biostars, is it?

I thank you very much.

Polyester is an R package designed to simulate an RNA sequencing experiment. Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads. Simulated reads can be analyzed using any of several downstream analysis tools.

simulation RNA-Seq count table • 3.6k views

ADD COMMENT • link 7.6 years ago by statfa ▴ 790

0

Entering edit mode

Comment 1 : No you can't use this function.
Comments 2 & 3: You appear to have studied Polyester exhaustively. We will take your word for this. Though simulating fasta files seems to contradict the description of polyester package below.
Question1 : Simulation can only be correct as far as the parameters you set and if they match real world data or not. I doubt you can publish a paper on simulated data alone using a published package (polyester is published?)
Question 2: Having 7000 genes changing out of 10000 would be bad. Assumption with microarrays/expression studies is that majority of genes are not changing. There is also the issue of amount of RNA being produced.
Question3 : No it is not.

ADD REPLY • link 7.6 years ago by GenoMax 152k

0

Entering edit mode

Thanks a lot

Yes this package is published here: https://www.ncbi.nlm.nih.gov/m/pubmed/25926345/

I have read this package's manual, vignette, rdocument, etc. for days. I know that you can give it a matrix of your real count table or you can simply define the FCs on your own to simulate data using a reference FASTA file. Please see the link below for an example. You can have a quick review on the main points of this package here:

https://bioconductor.org/packages/release/bioc/vignettes/polyester/inst/doc/polyester.html

I feel so sick and tired. It's been weeks that I'm trying to simulate RNA-seq data and the more I go, the more confused I feel. And unfortunately there is no one expert around me to help me with it. Or at least I don't know if there is anybody. I'm a biostatistics master student and don't have adequate background in bioinformatics and the bioinformaticians I know seem not to know how to sim data either. I study on my own but now I feel like hitting my head against a wall :(( Feel so desperate. But why on earth have they developed this package then? If we knew the FCs from a real study then we didn't need to simulate data. I can estimate parameters from a real data but I still don't find a way to introduce it to this package and define the fold changes by myself. Is there anyone familiar with this package? Or do you know any book or tutorial to help me know how to sim data? Is 2000 DE genes out 10000 genes ok? How can I decide the number?

Thank you

ADD REPLY • link 7.6 years ago by statfa ▴ 790

0

Entering edit mode

Let us go back to the basics and ask what exactly are you trying to do? If this was covered in a past post then link it here.

Polyester appears to be primarily written to simulate RNAseq reads from a reference and associated annotation. Using the simulated sequence data you then align/generate a count table and use it for DE analysis. All other functionality is advanced usage. Ability to provide a counts table is when you want to control how the reads are generated to get output similar to the table. A statistician like you may be interested in this but a common biologist would find this difficult to use.

Let us hear what you have to say for the question above and then we can go from there.

ADD REPLY • link 7.6 years ago by GenoMax 152k

0

Entering edit mode

Is 2000 DE genes out 10000 genes ok?

Remember that DE analysis generates hypothesis that have to be independently tested, experimentally. Testing 2000 genes is well beyond the physical/fiscal means of most. People may be happy with low double digit genes for a start.

Or do you know any book or tutorial to help me know how to sim data?

Simulation is the simple part that has been done by polyester (and other packages) already. What kind of simulated data you need is the question here.

I can estimate parameters from a real data but I still don't find a way to introduce it to this package and define the fold changes by myself.

Isn't the section labeled simulate_experiment_countmat example doing this?

ADD REPLY • link 7.6 years ago by GenoMax 152k

0

Entering edit mode

Thank you very much for kindness to help me. I need to compare some statistical methods for detection of DE genes. So I need to simulate some count data. I want to keep it simple so I wanna simulate a count table of four conditions with 3 replicates in each condition. I also want to divide my DE genes into some groups according to some parameters. That's why I decided to have 7000 DE genes out of 10000. Because then I could have an adequate number of DE genes in each group to further evaluate my methods. In order to simulate data, I downloaded the reference FASTA file from http://www.gencodegenes.org/releases/27.html . Since the FASTA file has the transcript annotation in it, I don't need to download the GTF file as I know so far. Then I wanted to give this FASTA file to Polyester and define the matrix of FCs by my own and simualte RNA reads using Polyester and then use htseq and hisat to obtain the reads from those simulated RNA reads. From this pipeline, It seems that I don't need to have a real dataset for simulation and that is what makes me doubt the steps. Functions simulate_experiment or simulate_experiment_countmat seem to do it. I'm new to scientific papers so maybe I shouldn't have publicized my plan

About your question, get_params can estimate the parameters from a real count table. Or simulate_experiment_empirical can be used to simulate data based on a real dataset. In simulate_experiment_countmat section example, they have made the count matrix using the reference FASTA file not a real dataset.

ADD REPLY • link 7.6 years ago by statfa ▴ 790

0

Entering edit mode

So I need to simulate some count data.

Only so that you would know the truth? Otherwise you could use any one of the hundreds of datasets in NCBI GEO/SRA that could fit the experimental setup you have described. Four conditions x 3 replicates. You could use microarray data instead of RNAseq, if you don't specifically need RNAseq.

define the matrix of FCs by my own and simualte RNA reads using Polyester and then use htseq and hisat to obtain the reads from those simulated RNA reads. From this pipeline, It seems that I don't need to have a real time course data for simulation and that is what makes me doubt the steps.

That seems to be correct. Since you are generating a synthetic count table where you know the truth you will have to define the fold changes on your own for specific genes in the count table. Since you would be normalizing data in DE step it may not be as simple as generating FC X counts but it may be close enough to get a FC you expect.

Have you tried to do this?

ADD REPLY • link 7.6 years ago by GenoMax 152k

0

Entering edit mode

Thanks a lot. Yeah there are many real RNA-seq data which I can use. I just can't use them for simulation. So maybe I have to use Polyester without a real dataset to generate a synthetic count table. And as you said, that'd be correct.

yes I have done this before in a small size and obtained the simulated RNA reads FASTA files. I didn't try HISAT and htseq on those simulated FASTA files though. But to generate 10000 transrcipt, my laptop is not responding.