Question

Simulating RNA-seq data using the Polyester package

6.4 years ago

statfa ▴ 760

Hi,

I asked two questions some days ago here and here.

The SimSeq package can take a matrix as input to simulate RNA-seq data, which is what I'm looking for. The problem is that it treats the treatments as independent conditions, so in time-course studies it doesn't take the dependence over time into account, and I can't use it. The Polyester package can take a count matrix as input and simulate time-course studies, which is ideal for me. However, there are some things about it that I don't understand, which is why I'm here to ask a few questions. Believe me, I have searched a lot for this information, but I still have trouble understanding it. So please don't blame me for asking these simple questions, or if you'd like to blame me, go ahead, but please answer my questions too. Thanks a lot.

Please take a look at this excerpt from Polyester's manual:

"Required Input:

You'll need to provide transcript annotation from which reads should be simulated. There are several public data repositories where you can download this annotation. You can simulate reads from any organism for which annotation is available.

Annotation must be provided in one of two formats:

-FASTA: text file containing names and sequences of transcripts from which reads should be simulated. Known transcripts from human chromosome 22 (hg19 build) are available in extdata/chr22.fa.

-GTF format + FASTA sequence files. The GTF file should denote the transcript structures, and you'll need a FASTA file of the full DNA sequence for each chromosome in the GTF file. All the chromosome-specific FASTA files should be in the same directory. "

1- For the second format, is it correct if I download the GTF file and the FASTA file from here? Which one should I download? The first GTF and FASTA files in each section?

Now, take a look at this:

"If you're an experienced user requiring more flexibility, you can use the simulate_experiment_countmat function to directly specify the number of reads you'd like to simulate for each transcript and each replicate in the data set. This function takes a count matrix as an argument.

This function creates FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups (designated via read count matrix)"

2- Now if I give this function the GTF file, FASTA file and the count matrix from a real experiment, it provides me with some FASTA files. How can I use these FASTA files? Should I align them again to the reference genome using HISAT and then obtain read counts using htseq?

Do I understand everything correctly? Thank you.

simulation polyester RNA-seq • 2.1k views

score 1 · Answer 1 · 2017-12-19

6.4 years ago

GenoMax 141k

For number 1: You can get the first files in both categories.

For number 2: Since you are simulating the reads using a pre-defined count matrix you should recover something similar once you do the alignments and count the reads. It would be interesting to see how the matrix correlates with what you get in reality. Perhaps using different aligners (besides HISAT2) would be a good exercise.
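
The comparison suggested above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not part of Polyester or HTSeq: the gene names, the toy counts, and the helper names `pearson` and `per_sample_correlation` are all hypothetical, and it assumes both matrices have been loaded as dictionaries mapping a gene ID to a list of per-sample counts.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def per_sample_correlation(simulated, recovered):
    """Correlate log2(count + 1) for each sample across the shared genes."""
    genes = sorted(set(simulated) & set(recovered))
    n_samples = len(simulated[genes[0]])
    result = []
    for j in range(n_samples):
        sim = [math.log2(simulated[g][j] + 1) for g in genes]
        rec = [math.log2(recovered[g][j] + 1) for g in genes]
        result.append(pearson(sim, rec))
    return result

# Hypothetical toy data: gene -> counts in sample 1 and sample 2.
# "simulated" stands for the matrix given to the simulator,
# "recovered" for the counts obtained after alignment and counting.
simulated = {"geneA": [100, 120], "geneB": [10, 8], "geneC": [500, 450], "geneD": [0, 3]}
recovered = {"geneA": [95, 117], "geneB": [9, 8], "geneC": [470, 430], "geneD": [0, 2]}

correlations = per_sample_correlation(simulated, recovered)
print(correlations)  # one correlation value per sample
```

Repeating this with counts produced by different aligners would give one set of per-sample correlations per aligner, which makes the comparison straightforward.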

Thank you very much for answering my questions. I'm grateful for your patience in reading my long post. May I know please what's the difference between the two formats?

What makes me doubt this approach is that if I run HISAT and htseq on the FASTA files that Polyester produces, I will lose some information, because these two tools are not 100% accurate, and that might affect the accuracy of the simulated count matrix. For example, suppose I set gene X to be DE using Polyester. After Polyester gives me the FASTA files, HISAT and htseq may lose some reads, and gene X may no longer be DE. Could that happen?

And by the way, if I give it the raw count matrix as input, should I normalize the simulated matrix I obtain later for my further analysis?

ADD REPLY • link 6.4 years ago by statfa ▴ 760

May I know please what's the difference between the two formats?

FASTA is plain sequence. A file can contain multiple sequences (called multi-FASTA in that case). GTF files are annotation. The format is described here. The two are linked by using common names for the sequences (chromosomes in this case).
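
To illustrate how the two formats are tied together by sequence names, here is a rough Python sketch that checks whether every sequence name in column 1 of a GTF also appears as a FASTA header. The helper names and the toy file contents are made up for the example, and it assumes the FASTA name is the first word after `>`.

```python
def fasta_names(fasta_text):
    """Collect sequence names: the first word after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">")
    }

def gtf_seqnames(gtf_text):
    """Collect column 1 (the sequence name) from non-comment GTF lines."""
    names = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t")[0])
    return names

# Hypothetical toy inputs; real files would be read from disk instead.
fasta_text = ">chr22 some description\nACGTACGT\n>chrM\nGGCC\n"
gtf_text = (
    "#!genome-build example\n"
    'chr22\tsrc\texon\t1\t8\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    '22\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
)

# Names used in the annotation that have no matching sequence.
missing = gtf_seqnames(gtf_text) - fasta_names(fasta_text)
print(missing)
```

In this toy case the second GTF line uses `22` while the FASTA uses `chr22`, so `22` is reported as missing. A mismatch of this kind (for example, annotation and sequence downloaded from different sources) is a common reason for annotation and sequence files not working together.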

You are right in thinking that you are not likely to recover the simulated matrix 100% after alignment and counting. If the DE level you set for gene X is large enough, you should still be able to recover it, since one would expect biases in alignment to apply to all reads in your data set.
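
A small worked example may help show why a uniform loss of reads should leave the DE signal largely intact. The numbers below are invented, the 10% loss is an assumption for illustration, and `log2_fold_change` is a hypothetical helper rather than what DESeq2 or edgeR actually compute.

```python
import math

def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean counts (group_b over group_a), with a pseudocount."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))

# Hypothetical simulated counts for gene X: three replicates per group,
# with roughly a 4-fold difference between the groups.
simulated_a = [100, 110, 90]
simulated_b = [400, 420, 380]

# Assumption: alignment and counting lose the same fraction of reads
# (10% here) in every sample.
keep = 0.9
recovered_a = [round(c * keep) for c in simulated_a]
recovered_b = [round(c * keep) for c in simulated_b]

lfc_simulated = log2_fold_change(simulated_a, simulated_b)
lfc_recovered = log2_fold_change(recovered_a, recovered_b)
print(lfc_simulated, lfc_recovered)
```

Under this assumption the two fold changes are almost identical. The picture changes if the loss is gene-specific or differs between groups (for instance, reads from one transcript mapping ambiguously), which is where a weak DE signal could disappear.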

You should use the raw count matrix, since that is what you will need for the eventual DESeq2/edgeR analysis.