Hello Biostars,
I have been learning RNAseq data analysis for a month, and I referred the answers in this website a l lot, so I really appreciate all the contributors! Now I have some questions that I can't really find the answers.
Here is how I process my data. I started with fastq files (PE, 2*150bp) with adaptor sequence, so I used Trimmamotic pair end mode to trim my files:
./trimmomatic-0.39.jar PE 1_S1_L001_R1_001.fastq.gz 1_S1_L001_R2_001.fastq.gz 1-l1-r1-p.fq.gz 1-l1-r1-up.fq.gz 1-l1-r2-p.fq.gz 1-l1-r2-up.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36.
I got forward and reverse paired and unpaired files. Then I fed the paired trimmed files kallisto to do pseudoalignment. The index was built using Mus_musculus.GRCm38.cdna.all.fa (Ensembl) and then quantified. My run_info of kallisto showed:
{
"n_targets": 119414,
"n_bootstraps": 100,
"n_processed": 11953151,
"n_pseudoaligned": 9887661,
"n_unique": 3751532,
"p_pseudoaligned": 82.7,
"p_unique": 31.4,
"kallisto_version": "0.46.1",
"index_version": 10,
"start_time": "Tue Sep 22 12:53:24 2020",
"call": "./kallisto quant -i ./Mus_musculus.GRCm38.cdna.all.index -o ./16-Result -b 100 16-r1-p.fq.gz 16-r2-p.fq.gz"
}
p_pseudoaligned can be 80-90% or more, but p_unique was usually around 30-40%. Is this normal rate of unique alignment? If not, what can be the issue causing this? How I can improve the situation? And will this affect downstream DEG analysis? Do I need to align unpaired files too?
Next, I imported abundance.h5 files to DEseq2 to do downstream analysis. I have mice with 2 genotypes (mut and wt), 2 ages (a and b) and 2 treatments in mut mice (lv, sham), 3 treatments in wt mice (lv, ld, sham). 5 biological replicates per group. Basically we want to compare groups as followed:
mu_a_lv vs mu_a_sham
mu_a_lv vs wt_a_lv
wt_a_ld vs wt_a_lv
wt_a_lv vs wt_a_sham
wt_a_sham vs mut_a_sham
Same as age b. Also will do comprisons across age when controlling the genotype and treatment.
So I used paste function to combine genotype, age and treatment into a new column, since we actually just compare 1 variable between 2 groups. I constructed all 70 samples in one object using design ~0 + condition, so that I can pick any 2 groups in use constrast function. But few genes had padj < 0.1, the number of DEGs can be less than 20. I also tried to subset 2 groups of samples into a new object and compare. It turned out to be slightly (very slightly) better, maybe 30 or so DEGs. I think I must do something wrong during the process. I hope I interpreted my questions clear enough, please forgive any vagueness. Could somebody give me some advice? Any help would be appreciared!
Thank you for you time and patience!
shiqi
I can't address the DESeq2 analysis, but the kallisto pseudoalignment looks great to me. p_unique looks fine; it's normal that only a small fraction of fragments can be pseudoaligned to a unique transcript (in contrast to genome aligners where you'd usually expect a high fraction of reads uniquely mapped to a genomic region).
Thank you so much! Good to know p_unique is fine! So these multiple mapping reads wont affect downstream analysis?
Correct -- in fact, it is not surprising to have multiple transcripts that a read is 'compatible' with. The whole idea of kallisto is based on having "transcript-compatibility read counts" and then figuring out how to probabilistically estimate "how many reads came from transcript X" from there.
Really appreciate your answer! ☺️☺️☺️