As my recent posts show, I'm looking to simulate a read count table using Polyester. Polyester has three different ways to simulate data:
1- You already have a count table from a real study: you simply give this matrix to Polyester as input, and Polyester estimates the parameters from the matrix and simulates a count matrix for you. As far as I know, in this scenario you need to know the expression fold changes (FCs) of your real count data beforehand. Is that right? Since I don't know the expression FCs of the genes in my real count data, I assume I can't use this function. Can I?
2- You already have a count table from a real study and the FASTA file of your reference genome: in this case, Polyester simulates RNA-seq read FASTA files, and you can then use any established workflow to obtain the simulated count table. Here, again, you have to know the expression fold changes of your real count data beforehand.
3- You define a matrix of your desired FCs and give it to Polyester as input, together with the reference FASTA file of an organism. Polyester then creates RNA-seq read FASTA files from the reference FASTA file and the expression FC matrix, and you use any established workflow to obtain the simulated count table. This scenario may be helpful to me.
Now, my questions are:
1- In scenario 3, I don't use any real data set, only the FASTA file of the reference genome. Is the simulation still valid? In the papers I have seen, the authors always use a real data set to estimate the parameters and obtain a simulated read count table. If I use scenario 3, is it scientifically sound enough for a publication?
2- How should I define the expression FC matrix? I thought that defining 10000 genes, with 3000 of them equally expressed and the remaining 7000 differentially expressed (DE), would be good. I want to divide my DE genes into subgroups according to my needs. I haven't seen any criteria for that; each paper does something different.
3- It isn't wrong to post this information on Biostars, is it?
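To make question 2 concrete, here is a minimal sketch of the kind of FC matrix I have in mind. It is written in plain Python only to show the layout, and the same table could be built directly in R. The two-group design, the subgroup sizes, the fold-change values (1.5 and 4) and the function name `build_fold_changes` are all assumptions chosen for illustration, not recommendations. Only the 10000 / 3000 / 7000 split comes from my description above.

```python
import random


def build_fold_changes(n_genes, n_equal, subgroups, seed=0):
    """Build a two-column fold-change table, one row per gene.

    subgroups is a list of (count, fold_change, group_index) tuples
    (hypothetical layout): `count` genes get `fold_change` in the column
    `group_index` (0 or 1) and 1.0 in the other column.
    """
    n_de = n_genes - n_equal
    if sum(count for count, _, _ in subgroups) != n_de:
        raise ValueError("subgroup sizes must add up to the number of DE genes")

    # Equally expressed genes: same relative expression in both groups.
    rows = [[1.0, 1.0] for _ in range(n_equal)]

    # DE genes: raise expression in exactly one of the two groups.
    for count, fold_change, group_index in subgroups:
        for _ in range(count):
            row = [1.0, 1.0]
            row[group_index] = fold_change
            rows.append(row)

    # Shuffle so DE genes are not clustered at the end of the table.
    random.Random(seed).shuffle(rows)
    return rows


# Assumed subgroups: 7000 DE genes split into four equal blocks.
subgroups = [
    (1750, 1.5, 0),  # mildly higher in group 1
    (1750, 1.5, 1),  # mildly higher in group 2
    (1750, 4.0, 0),  # strongly higher in group 1
    (1750, 4.0, 1),  # strongly higher in group 2
]
fold_changes = build_fold_changes(10000, 3000, subgroups)
print(len(fold_changes), fold_changes[:3])
```

As far as I understand, a table like this (one row per transcript, one column per group) is what would be passed as the `fold_changes` argument of Polyester's `simulate_experiment()`, so the number of rows would have to match the number of transcripts in the FASTA file.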
Thank you very much.
Polyester is an R package designed to simulate an RNA sequencing experiment. Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads. Simulated reads can be analyzed using any of several downstream analysis tools.