Question

Difficulty in understanding the experimental design of an RNA-Seq project


7.8 years ago

nazaninhoseinkhan ▴ 520

Dear all,

I am trying to download raw RNASeq data (GSE55005) from SRA (NCBI).

When I open the GSMs of GSE55005, each contains several distinct small SRR files (around 135 MB each).

I expected each GSM to contain a single SRR file of at least 1 gigabyte.

Can anyone help me understand the design of this experiment?

Should I merge these data before starting the analysis?

Thank you in advance

Nazanin

RNA-Seq • 2.0k views

ADD COMMENT • link updated 7.8 years ago by BioinfGuru ★ 1.7k • written 7.8 years ago by nazaninhoseinkhan ▴ 520


Design should be described in the associated publication: http://www.ncbi.nlm.nih.gov/pubmed?LinkName=gds_pubmed&from_uid=200055005

ADD REPLY • link 7.8 years ago by GenoMax 141k

score 0 · Answer 1 · 2016-07-07


7.8 years ago

BioinfGuru ★ 1.7k

Each SRR file is one RNA-seq run. I would use the SRA Toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc) to get the files from the SRA in the required format. DO NOT merge the runs. These are replicates that are important to the experimental design. From http://www.ncbi.nlm.nih.gov/sra?term=SRP037775 I can see that there are 5 groups, each with multiple runs; the paper should tell you the difference between the groups. Within each group there are probably sub-groups (maybe treated vs. controls?).
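As a rough illustration of fetching runs with the SRA Toolkit, the sketch below only builds the command lines and does not execute anything. It assumes the toolkit's `prefetch` and `fastq-dump` programs are installed; the accessions and the `fastq` output directory are placeholders, not real runs from this project.

```python
import shlex

def sra_download_commands(run_ids, outdir="fastq"):
    """Return shell command strings that fetch and convert each run.

    Nothing is executed here; the strings can be reviewed and run by hand.
    """
    commands = []
    for run_id in run_ids:
        # prefetch downloads the .sra archive for one run accession
        commands.append(shlex.join(["prefetch", run_id]))
        # fastq-dump converts the archive to FASTQ;
        # --split-files writes paired reads to separate files
        commands.append(shlex.join(
            ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir, run_id]
        ))
    return commands

# Placeholder accessions (hypothetical)
for cmd in sra_download_commands(["SRR0000001", "SRR0000002"]):
    print(cmd)
```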

First step: Figure out how the runs are grouped.

Second step: Download the runs you require for your analysis.

Third step: Choose the pipeline for the analysis.
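The first step (working out how the runs are grouped) could be sketched as below. This assumes you have exported a run table for the project as CSV and that it has `Run` and `SampleName` columns; check the header of your own file, since the column names here are an assumption. The rows are made-up placeholders, not the real GSE55005 table.

```python
import csv
import io
from collections import defaultdict

def group_runs_by_sample(runinfo_csv_text, sample_col="SampleName", run_col="Run"):
    """Map each sample identifier to the list of run accessions belonging to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(runinfo_csv_text)):
        groups[row[sample_col]].append(row[run_col])
    return dict(groups)

# Hypothetical run table: three runs for one sample, one run for another
example = """Run,spots,SampleName
SRR0000001,4000000,GSM0000001
SRR0000002,4000000,GSM0000001
SRR0000003,3100000,GSM0000001
SRR0000004,4000000,GSM0000002
"""

for sample, runs in group_runs_by_sample(example).items():
    print(sample, runs)
```

Seeing several runs listed under one sample is the cue to read the paper and the sample descriptions before deciding how those runs relate to each other.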

The merging of replicates does not occur until late in the analysis, after each individual replicate has been analysed independently. If you merge before this, you lose the purpose of replicating runs (validation), you introduce noise and variability, and the resulting analysis will have no statistical reliability. The following reviews should point you in the right direction:

http://www.ncbi.nlm.nih.gov/pubmed/26108229

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728800/

The TopHat/Cufflinks pipeline is the traditional approach, but it is slow and there are others that may be more suitable. However, the pipeline is well supported (especially on this forum) if you run into any problems.

ADD COMMENT • link 7.8 years ago by BioinfGuru ★ 1.7k


Theoretically, each SRR is one run. However, in reality, they could be split for a variety of reasons that may or may not make sense.

ADD REPLY • link 7.8 years ago by igor 13k


Thank you so much for your explanation.

I have another question. Suppose I have test (4 technical replicates) and control (7 technical replicates) groups of RNA-Seq runs.

Can I use all 11 of these RNA-Seq runs in Cuffdiff to find differentially expressed genes?

Thank you in advance

Nazanin

ADD REPLY • link 7.8 years ago by nazaninhoseinkhan ▴ 520


I am currently analysing 3 treated replicates and 4 control replicates. For this I ran TopHat separately on each. Then I ran Cufflinks separately on each with a reference genome. Then I used Cuffmerge to merge the 3 treated replicates, and again to merge the 4 control replicates. The next step is to put the 2 merged files into Cuffdiff.
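For reference, Cuffdiff's command line accepts one comma-separated list of alignment files per condition, so replicates can be passed individually rather than as a single merged file. The sketch below only assembles such a command as a string; all file names, the labels, and the `merged.gtf` annotation are hypothetical.

```python
import shlex

def cuffdiff_command(gtf, conditions, outdir="cuffdiff_out"):
    """Build a cuffdiff command string.

    conditions maps a condition label to its list of replicate BAM files;
    dict insertion order decides the order of the conditions.
    """
    labels = ",".join(conditions)
    cmd = ["cuffdiff", "-o", outdir, "-L", labels, gtf]
    for bams in conditions.values():
        # replicates of one condition are joined with commas
        cmd.append(",".join(bams))
    return shlex.join(cmd)

# Hypothetical file names for 4 control and 3 treated replicates
print(cuffdiff_command(
    "merged.gtf",
    {
        "control": ["c1.bam", "c2.bam", "c3.bam", "c4.bam"],
        "treated": ["t1.bam", "t2.bam", "t3.bam"],
    },
))
```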

ADD REPLY • link 7.8 years ago by BioinfGuru ★ 1.7k

score 0 · Answer 2 · 2016-07-07

By default, Illumina FASTQ files are limited to 4M reads each (usually they are not split, because the default CASAVA/bcl2fastq settings have been modified). You can see that all the SRRs (derived from FASTQs) in that project are at exactly 4M reads or lower. The only reason to keep them separate is to keep the file sizes smaller. Otherwise, they should be merged.
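If the runs under one sample do turn out to be 4M-read chunks of the same library, merging them is a plain concatenation of the FASTQ files. A minimal sketch, assuming you have already confirmed which files belong to the same sample (the function name and paths are illustrative only):

```python
import shutil
from pathlib import Path

def concatenate_fastq(parts, merged_path):
    """Append each part to merged_path, byte for byte, in the given order."""
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(merged_path)
```

Byte-level concatenation works for plain FASTQ and also for gzip-compressed FASTQ, since concatenated gzip members form a valid gzip stream. For paired-end data, the mate 1 and mate 2 files must be concatenated in the same chunk order so that the read pairs stay in sync.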