Difficulty in understanding the experimental design of an RNA-Seq project
7.8 years ago

Dear all,

I am trying to download raw RNASeq data (GSE55005) from SRA (NCBI).

When I open the GSMs of GSE55005, each one contains several distinct, small SRR files (around 135 MB each).

I expected each GSM to contain a single SRR file of at least 1 GB.

Can anyone help me understand the design of this experiment?

Should I merge these data before starting the analysis?

Thank you in advance

Nazanin

RNA-Seq • 2.0k views

Design should be described in the associated publication: http://www.ncbi.nlm.nih.gov/pubmed?LinkName=gds_pubmed&from_uid=200055005

7.8 years ago
BioinfGuru ★ 1.7k

Each SRR file is one RNA-seq run. I would use the SRA Toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc) to get the files from the SRA in the required format. DO NOT merge the runs. These are replicates, which are important to the experimental design. From http://www.ncbi.nlm.nih.gov/sra?term=SRP037775 I can see that there are 5 groups, each with multiple runs; the paper should tell you the difference between the groups. Within each group there are probably sub-groups (maybe treated vs. controls?).

First step: figure out how the runs are grouped. Second step: download the runs you require for your analysis. Third step: choose the pipeline for the analysis.
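As a rough illustration of the first step, the sketch below groups runs by sample and totals their read counts. It assumes a run table with `Run`, `SampleName` and `spots` columns (modelled on the kind of CSV the SRA website can export); the accessions and read counts are made up for illustration and are not taken from GSE55005.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a run table; accessions and read counts are invented.
RUN_TABLE = """Run,SampleName,spots
SRR0000001,GSM0000001,4000000
SRR0000002,GSM0000001,4000000
SRR0000003,GSM0000001,1250000
SRR0000004,GSM0000002,4000000
SRR0000005,GSM0000002,3100000
"""


def group_runs_by_sample(table_text):
    """Map each sample accession to its list of runs and its total read count."""
    groups = defaultdict(lambda: {"runs": [], "reads": 0})
    for row in csv.DictReader(io.StringIO(table_text)):
        entry = groups[row["SampleName"]]
        entry["runs"].append(row["Run"])
        entry["reads"] += int(row["spots"])
    return dict(groups)


if __name__ == "__main__":
    for sample, info in group_runs_by_sample(RUN_TABLE).items():
        print(sample, len(info["runs"]), "runs,", info["reads"], "reads")
```

A table like this makes it easier to see which runs belong to the same GSM before deciding what to download.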

The merging of replicates does not occur until late in the analysis, after each individual replicate has been analysed independently. If you merge before this, you lose the purpose of replicating runs (validation) and introduce noise and variability, and the resulting analysis will have zero statistical reliability. The following reviews should point you in the right direction:

http://www.ncbi.nlm.nih.gov/pubmed/26108229

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728800/

The TopHat/Cufflinks pipeline is the traditional approach, but it is slow and there are others that may be more suitable. However, it is well supported (especially on this forum) if you run into any problems.


Theoretically, each SRR is one run. However, in reality, they could be split for a variety of reasons that may or may not make sense.


Thank you so much for your explanation.

I have another question. Suppose I have a test group (4 technical replicates) and a control group (7 technical replicates) of RNA-Seq runs.

Can I use all 11 of these RNA-Seq runs in Cuffdiff to find differentially expressed genes?

Thank you in advance

Nazanin


I am currently analysing 3 treated replicates and 4 control replicates. For this I have run TopHat separately on each. Then I ran Cufflinks separately on each with a reference genome. Then I used Cuffmerge to merge the 3 treated replicates, and again to merge the 4 control replicates. The next step is to put the 2 merged files into Cuffdiff.
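For comparison, Cuffdiff's documented usage takes one transcript annotation (GTF) followed by one comma-separated list of alignment files per condition, so replicates are passed as separate files rather than as pre-merged groups. The sketch below only assembles such a command line; the file names, labels and output directory are hypothetical, and this is not necessarily how the workflow described above was run.

```python
def build_cuffdiff_command(gtf, conditions, out_dir="cuffdiff_out"):
    """Build a cuffdiff argument list.

    conditions maps a condition label to the list of alignment files
    (one per replicate) belonging to that condition, in the desired order.
    """
    if len(conditions) < 2:
        raise ValueError("cuffdiff needs at least two conditions to compare")
    # -L takes the condition labels, comma-separated, in the same order
    # as the alignment-file lists that follow the GTF.
    cmd = ["cuffdiff", "-o", out_dir, "-L", ",".join(conditions), gtf]
    for label, files in conditions.items():
        if not files:
            raise ValueError(f"condition {label!r} has no alignment files")
        # Replicates within one condition are joined with commas.
        cmd.append(",".join(files))
    return cmd


if __name__ == "__main__":
    # Hypothetical file names: 7 control and 4 test replicates.
    conditions = {
        "control": [f"control_{i}.bam" for i in range(1, 8)],
        "test": [f"test_{i}.bam" for i in range(1, 5)],
    }
    print(" ".join(build_cuffdiff_command("merged.gtf", conditions)))
```

The resulting list can be passed to `subprocess.run` once the paths point at real files.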

7.8 years ago
igor 13k

By default, Illumina FASTQs are limited to 4M reads each (in practice, the default CASAVA/bcl2fastq settings are usually changed so that the files are not split). You can see that all the SRRs (derived from FASTQs) in that project are at exactly 4M reads or lower. The only reason to keep them separate is to keep the file sizes smaller. Otherwise, they should be merged.
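If the SRRs of one sample are indeed just chunks of the same library, a minimal way to merge them is to concatenate the gzipped FASTQ files, since back-to-back gzip members still form a valid gzip stream. The sketch below assumes the chunks have already been converted to `.fastq.gz`; the function names are illustrative only.

```python
import gzip
import shutil


def concatenate_fastq_gz(parts, merged_path):
    """Append gzipped FASTQ chunks byte-for-byte into one gzipped file.

    For paired-end data, call this once per mate file and keep the
    chunk order identical for both mates so read pairs stay in sync.
    """
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(fastq_gz):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

Comparing `count_reads` on the merged file with the sum over the chunks is a quick sanity check that nothing was dropped.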
