Question: How and what are the steps for performing RNAseq Time point analysis 3 biological replicates and 4 Technical replicates for each
gravatar for macmath
2.6 years ago by
macmath140 wrote:

Condition A, B, C I have 3 biological replicates, Each biological replicates have 4 Technical replicates

Below is just one Example 1KO-1a.fastq.gz 1KO-1b.fastq.gz 1KO-1c.fastq.gz 1KO-1d.fastq.gz 2KO-1a.fastq.gz 2KO-1b.fastq.gz 2KO-1c.fastq.gz 2KO-1d.fastq.gz 3KO-1a.fastq.gz 3KO-1b.fastq.gz 3KO-1c.fastq.gz 3KO-1d.fastq.gz

I tried to perform this analysis as below:

Genome Alignment : STAR 4 Technical replicates same time for each biological replicate

STAR --genomeDir reference/mm10ens84/STARindex/ --readFilesIn data/1KO-1a.fastq.gz,data/1KO-1b.fastq.gz,data/1KO-1c.fastq.gz,data/1KO-1d.fastq.gz,data/2KO-1a.fastq.gz,data/2KO-1b.fastq.gz,data/2KO-1c.fastq.gz,data/2KO-1d.fastq.gz,data/3KO-1a.fastq.gz,data/3KO-1b.fastq.gz,data/3KO-1c.fastq.gz,data/3KO-1d.fastq.gz --readFilesCommand zcat --outFileNamePrefix alignment_STAR/HFD-KO1 --outFilterMultimapNmax 1 --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --twopassMode Basic --runThreadN 15

Result : 1 Bam aligned file for each biological replicate

HTSEQ : Count

htseq-count -s no -r pos -t exon -f bam -s no -a 0 alignment_STAR/KO1hq.bam reference/mm10ens84/mm10ens84.gtf >KO1hq.counts

Correct me if I am wrong and suggest me how to proceed ahead.

Thank you in advance for your suggestion and time

rna-seq time point • 879 views
ADD COMMENTlink modified 2.6 years ago by Michael Dondrup47k • written 2.6 years ago by macmath140

How does STAR distinguish and group biological and technical replicates?

This is an interesting example, because if that works, it is not clear from the documentation how to achieve this, and should be clarified.

From the manual:

default: Read1 Read2
string(s): paths to files that contain input read1 (and, if needed, read2)

The only mention of that STAR might accept comma separated lists of replicates is found under a different option:

--outSAMattrRGline [...] Comma separated RG lines correspons to different (comma separated) input
files in -readFilesIn. Commas have to be surrounded by spaces, e.g.

I have seen the comma separated input work before, but I think the manual needs some clarification.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Michael Dondrup47k

Yeah, pretty sure OP's example doesn't actually work.

ADD REPLYlink written 2.6 years ago by Devon Ryan93k

Another comment: are you sure that your protocol is not strand specific? -s no -r pos -t exon -f bam -s no also you have doubled the parameter, be careful when trying out different combinations, to specify each parameter only once to avoid confusion, like in -s no -r pos -t exon -f bam -s yes, which one takes precedence?

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Michael Dondrup47k

Thank you for all your valuable insight's. I'm naive in this analysis and with many experts in the field viewing these queries, I would be extremely grateful if any of you can suggest me the workflow with commands to analyse Technical and biological replicates from Fastq -> DEGs in different time points or different links (tutorials) associated with similar situation.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by macmath140

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLYlink written 2.6 years ago by genomax77k

Take a look at DESeq2 vignette for examples of how to analyze RNAseq data.

ADD REPLYlink written 2.6 years ago by genomax77k
gravatar for Michael Dondrup
2.6 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

As we have written in comments, we are not sure if the grouping of replicates will work correctly, it might if the RG are defined well, but you could also make a big mistake already here. You need to experiment a bit, but I think that it is advisable to stick as close as possible to the documented behavior and to run the STAR alignment process separately over each biological replicate, separating input files by commas.

STAR ... --readFilesIn data/1KO-1a.fastq.gz,data/1KO-1b.fastq.gz,data/1KO-1c.fastq.gz,data/1KO-1d.fastq.gz --outFileNamePrefix alignment_STAR/HFD-1KO ... # Assuming 1KO, 2KO, 3KO are biological replicates and a,b,c,d are technical replicates

STAR ... --readFilesIn data/2KO-1a.fastq.gz,data/1KO-2b.fastq.gz,data/2KO-1c.fastq.gz,data/2KO-1d.fastq.gz --outFileNamePrefix alignment_STAR/HFD-2KO ... 


Other remarks:

--outFilterMultimapNmax 1 # is that not a bit strict?
  • Make sure, you are counting from the right strand, most modern stranded-protocols are reverse sense
  • you are lacking any QC, out of painful experiences, you should QC every step otherwise there could be a big blunder. You can use MultiQC for that. Run
    • fastQC on the fastq files, check for adapter contamination, if high you need to trimm adapters in addition
    • include the STAR Log final out files (mapping rates should be >80%)
    • run bamtools stats on the bam files and finally
    • check for that the numbers of aligned sequences in the bam files match with the number of seqs in the input
    • check the assignment rate of reads to genes from HTseq or featureCounts, any low percentage (~5%) should trigger a warning (should be >70%).
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Michael Dondrup47k

Thank you Michael Dondrup for your suggestion and valuable time. I performed fastQC on the data set. Regarding the outFilterMutimapNmax, what is your ideal suggestion while performing the alignment of such files. In short what will you personally do with these files? I am curious to know how can I make it as correct as possible without doing any blunder.

ADD REPLYlink written 2.6 years ago by macmath140

Kindly correct me if I am wrong, I checked my data alignment all were >80% and I could see alignment on both strands. Then I used samtools with the below command

samtools view -h -b -q 20 -F 4 AlignedKO1.bam >KO1hq.bam

HTSEQ : Count

htseq-count -r pos -t exon -f bam -a 0 alignment_STAR/KO1hq.bam reference/mm10ens84/mm10ens84.gtf >KO1hq.counts

Followed by DESeq2

sampleFiles<- c("KO1a.counts", "KO1b.counts", "KO1c.counts", "KO1d.counts", "KO2a.counts", "KO2b.counts", "KO2c.counts", "KO2d.counts", "KO3a.counts", "KO3b.counts", "KO3c.counts", "KO3d.counts", "WT1a.counts", "WT1b.counts", "WT1c.counts", "WT1d.counts", "WT2a.counts", "WT2b.counts", "WT2c.counts", "WT2d.counts", "WT3a.counts", "WT3b.counts", "WT3c.counts", "WT3d.counts")
sampleNames <- c("KO1a", "KO1b", "KO1c", "KO1d", "KO2a", "KO2b", "KO2c", "KO2d", "KO3a", "KO3b", "KO3c", "KO3d", "WT1a", "WT1b", "WT1c", "WT1d", "WT2a", "WT2b", "WT2c", "WT2d", "WT3a", "WT3b", "WT3c", "WT3d")
sampleCondition <- c("KO1", "KO1", "KO1", "KO1", "KO2", "KO2", "KO2", "KO2", "KO3", "KO3", "KO3", "KO3", "WT1", "WT1", "WT1", "WT1", "WT2", "WT2", "WT2", "WT2", "WT3", "WT3", "WT3", "WT3")
sampleTable <- data.frame(sampleName = sampleNames, fileName = sampleFiles, condition = sampleCondition)
treatments <- c("KO1", "KO2", "KO3", "WT1", "WT2", "WT3")
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                       directory = directory,
                                       design = ~ condition)
colData(ddsHTSeq)$condition <- factor(colData(ddsHTSeq)$condition,
                                      levels = treatments)

I am not sure how to merge the technical replicates eg. "KO1a", "KO1b", "KO1c", "KO1d" into KO1 and how should the design look like. Looking forward for your suggestions.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by macmath140
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1782 users visited in the last hour