Hi, I am a bit confused about the different comparisons in my RNA-Seq data analysis.
Here is what I have and what I have already done.
Dataset 1:
I have a mutant strain and its wildtype. These were used to infect plants. Then RNA was extracted and sequenced with Illumina paired-end 150bp sequencing, with 3 replicates each.
Analysis 1: DEG analysis of plant-infection-mutants (A) vs plant-infection-wildtype (B) (A-vs-B)
- Quality check of fastq files with FastQC
- Quality trimming with fastp (I did not perform any quality trimming, as the data already appeared to be quality trimmed)
- Read splitting with BBSplit from the BBMap package to get reference-specific reads (this gave me plant-specific reads and mutant-strain-specific reads)
- Read alignment to the reference genome using HISAT2 version 2.2.1 with paired-end reads
- Quantification using featureCounts with paired-end flags
- DEG analysis using DESeq2
Dataset 2:
Again, a mutant strain and its wildtype. These were grown in flasks. Then RNA was extracted and sequenced with Illumina single-end 75bp sequencing, with 4 replicates each. (This is an older dataset.)
Analysis 2: DEG analysis of flask-grown-mutants (C) vs flask-grown-wildtype (D) (C-vs-D)
- Quality check of fastq files with FastQC
- Quality trimming with fastp (I did not perform any quality trimming, as the data already appeared to be quality trimmed)
- Read alignment to the reference genome using HISAT2 version 2.2.1 with paired-end reads
- Quantification using featureCounts with paired-end flags
- DEG analysis using DESeq2
Now there are two more analyses that I want to do.
Analysis 3: DEG analysis of plant-infection-mutants (A) vs flask-grown-mutants (C) (A-vs-C)
For Analysis 3, this is the approach I have tried so far:
- Extracted the plant-infection-mutants (A) featureCounts data from the Analysis 1 featureCounts matrix, which is based on Illumina paired-end 150bp sequencing (first 6 columns GeneID Chr Start End Strand Length + 3 columns which contain mutant expression values in plants)
- Extracted the flask-grown-mutants (C) featureCounts data from the Analysis 2 featureCounts matrix, which is based on Illumina single-end 75bp sequencing (first 6 columns GeneID Chr Start End Strand Length + 4 columns which contain mutant expression values in flask)
- Merged both based on the GeneIDs (Note: A has 3 replicates, C has 4 replicates)
- DEG analysis using DESeq2, using the same commands as in A-vs-B and C-vs-D
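For reference, the merge step described in the list above can be sketched as follows. This is only a minimal illustration in plain Python, not the commands that were actually run: the two tables are tiny made-up stand-ins, the sample names (A1..A3, C1..C4) and the helpers `read_counts` and `merge_counts` are hypothetical, and the ID column is assumed to be called `Geneid`.

```python
import csv
import io


def read_counts(handle, sample_cols, id_col="Geneid"):
    """Return {gene_id: [counts...]} for the chosen sample columns."""
    # featureCounts-style tables start with a '#' comment line; skip it.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return {row[id_col]: [int(row[c]) for c in sample_cols] for row in reader}


def merge_counts(first, second):
    """Inner join on gene ID: keep only genes present in both tables."""
    return {g: first[g] + second[g] for g in first if g in second}


# Toy stand-ins for the two count matrices (made-up values).
annotation = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]
table_a = "\n".join([
    "# toy table, dataset 1 (paired-end)",
    "\t".join(annotation + ["A1", "A2", "A3"]),
    "\t".join(["g1", "chr1", "1", "100", "+", "100", "10", "12", "9"]),
    "\t".join(["g2", "chr1", "200", "400", "-", "201", "0", "3", "1"]),
    "\t".join(["g3", "chr2", "1", "50", "+", "50", "7", "7", "7"]),
]) + "\n"
table_c = "\n".join([
    "# toy table, dataset 2 (single-end)",
    "\t".join(annotation + ["C1", "C2", "C3", "C4"]),
    "\t".join(["g1", "chr1", "1", "100", "+", "100", "5", "7", "6", "8"]),
    "\t".join(["g2", "chr1", "200", "400", "-", "201", "2", "2", "0", "1"]),
]) + "\n"

counts_a = read_counts(io.StringIO(table_a), ["A1", "A2", "A3"])
counts_c = read_counts(io.StringIO(table_c), ["C1", "C2", "C3", "C4"])
merged = merge_counts(counts_a, counts_c)

for gene, values in merged.items():
    print(gene, values)
```

Note that the inner join silently drops any gene missing from either table (g3 here), so it is worth checking that both matrices were built from the same annotation before merging.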
Analysis 4: DEG analysis of plant-infection-wildtype (B) vs flask-grown-wildtype (D) (B-vs-D)
For this I used the same approach as in Analysis 3.
Questions
- For Analysis 1 and Analysis 2, is my approach correct? As far as I know, DESeq2 itself performs median-ratio normalization (MRN), so I did not perform any other normalization.
- I am confused about Analysis 3 and Analysis 4. Are they correct? Or will the different sequencing type (paired-end vs single-end) and the difference in read length (2x150bp vs 1x75bp) introduce a technical or batch effect?
- If Analyses 3 and 4 are not correct, what should I do? Do I need to normalize them, and if so, by what method?
- Any other thoughts or points you would like to raise.
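To make the first question concrete, the median-of-ratios idea can be sketched in a few lines of plain Python. This is a simplified illustration of the principle with made-up counts, not DESeq2's actual code, and the helper name `size_factors` is hypothetical. Each sample's size factor is the median, over genes with non-zero counts in every sample, of the ratio between that sample's count and the gene's geometric mean across samples.

```python
import math
import statistics


def size_factors(counts):
    """counts: list of per-gene rows, each a list of raw counts (one per sample)."""
    # Genes with a zero in any sample have no defined log geometric mean; skip them.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


# Made-up counts: sample 2 was sequenced exactly twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 4],  # ignored: zero count in sample 1
]
sf = size_factors(toy_counts)
print(sf)  # the second factor is twice the first
```

Size factors of this kind correct for sequencing depth and library composition. They do not model systematic differences between datasets such as library type or read length, which is why Analyses 3 and 4 raise a separate question.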
Your thoughts and suggestions will be really helpful.
Regards
For analyses 3 and 4, you will need to account for batch effects, which I expect to be substantial given the differences between the datasets, e.g., read length, read type (paired-end vs. single-end), sequencing machines, and sample preparation by different technicians.
Hi! Thank you for your response. Can you share some information on how to remove this batch effect?
See the post.
You cannot remove batch effects where they overlap perfectly with a biological condition of interest.
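A small sketch of why this is the case for A-vs-C, assuming a linear model with an intercept, a condition term (plant vs flask) and a batch term (dataset 1 vs dataset 2). All plant samples come from dataset 1 and all flask samples from dataset 2, so the condition and batch columns of the design matrix are identical and the model cannot tell the two effects apart. The `matrix_rank` helper and the "balanced" layout below are hypothetical and exist only for illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(samples):
    """Columns: intercept, condition (plant=1), batch (dataset1=1)."""
    return [[1, int(cond == "plant"), int(batch == "dataset1")] for cond, batch in samples]


# A-vs-C as described in the thread: condition and dataset coincide exactly.
confounded = design_matrix([("plant", "dataset1")] * 3 + [("flask", "dataset2")] * 4)

# Hypothetical layout in which each dataset contains both conditions.
balanced = design_matrix(
    [("plant", "dataset1"), ("flask", "dataset1"),
     ("plant", "dataset2"), ("flask", "dataset2")]
)

print(matrix_rank(confounded))  # 2: fewer than the 3 columns, batch is not estimable
print(matrix_rank(balanced))    # 3: condition and batch can be separated
```

When the rank is lower than the number of columns, no statistical adjustment can separate the batch effect from the biological effect. Only samples that break the overlap, such as the same condition sequenced in both batches, would make the batch term estimable.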