Question

Few genes had padj < 0.1

0

Entering edit mode

5.1 years ago

guangshiqi.csu • 0

Hello Biostars,

I have been learning RNAseq data analysis for a month, and I referred the answers in this website a l lot, so I really appreciate all the contributors! Now I have some questions that I can't really find the answers.

Here is how I process my data. I started with fastq files (PE, 2*150bp) with adaptor sequence, so I used Trimmamotic pair end mode to trim my files:

./trimmomatic-0.39.jar PE 1_S1_L001_R1_001.fastq.gz 1_S1_L001_R2_001.fastq.gz 1-l1-r1-p.fq.gz 1-l1-r1-up.fq.gz 1-l1-r2-p.fq.gz 1-l1-r2-up.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36.

I got forward and reverse paired and unpaired files. Then I fed the paired trimmed files kallisto to do pseudoalignment. The index was built using Mus_musculus.GRCm38.cdna.all.fa (Ensembl) and then quantified. My run_info of kallisto showed:

{
    "n_targets": 119414,
    "n_bootstraps": 100,
    "n_processed": 11953151,
    "n_pseudoaligned": 9887661,
    "n_unique": 3751532,
    "p_pseudoaligned": 82.7,
    "p_unique": 31.4,
    "kallisto_version": "0.46.1",
    "index_version": 10,
    "start_time": "Tue Sep 22 12:53:24 2020",
    "call": "./kallisto quant -i ./Mus_musculus.GRCm38.cdna.all.index -o ./16-Result -b 100 16-r1-p.fq.gz 16-r2-p.fq.gz"
}

p_pseudoaligned can be 80-90% or more, but p_unique was usually around 30-40%. Is this normal rate of unique alignment? If not, what can be the issue causing this? How I can improve the situation? And will this affect downstream DEG analysis? Do I need to align unpaired files too?

Next, I imported abundance.h5 files to DEseq2 to do downstream analysis. I have mice with 2 genotypes (mut and wt), 2 ages (a and b) and 2 treatments in mut mice (lv, sham), 3 treatments in wt mice (lv, ld, sham). 5 biological replicates per group. Basically we want to compare groups as followed:

mu_a_lv vs mu_a_sham

mu_a_lv vs wt_a_lv

wt_a_ld vs wt_a_lv

wt_a_lv vs wt_a_sham

wt_a_sham vs mut_a_sham

Same as age b. Also will do comprisons across age when controlling the genotype and treatment.

So I used paste function to combine genotype, age and treatment into a new column, since we actually just compare 1 variable between 2 groups. I constructed all 70 samples in one object using design ~0 + condition, so that I can pick any 2 groups in use constrast function. But few genes had padj < 0.1, the number of DEGs can be less than 20. I also tried to subset 2 groups of samples into a new object and compare. It turned out to be slightly (very slightly) better, maybe 30 or so DEGs. I think I must do something wrong during the process. I hope I interpreted my questions clear enough, please forgive any vagueness. Could somebody give me some advice? Any help would be appreciared!

Thank you for you time and patience!

shiqi

RNA-Seq kallisto deseq2 padj unique alignment rate • 1.8k views

ADD COMMENT • link updated 5.1 years ago by swbarnes2 15k • written 5.1 years ago by guangshiqi.csu • 0

0

Entering edit mode

I can't address the DESeq2 analysis, but the kallisto pseudoalignment looks great to me. p_unique looks fine; it's normal that only a small fraction of fragments can be pseudoaligned to a unique transcript (in contrast to genome aligners where you'd usually expect a high fraction of reads uniquely mapped to a genomic region).

ADD REPLY • link 5.1 years ago by dsull ★ 7.8k

0

Entering edit mode

Thank you so much! Good to know p_unique is fine! So these multiple mapping reads wont affect downstream analysis?

ADD REPLY • link 5.1 years ago by guangshiqi.csu • 0

0

Entering edit mode

Correct -- in fact, it is not surprising to have multiple transcripts that a read is 'compatible' with. The whole idea of kallisto is based on having "transcript-compatibility read counts" and then figuring out how to probabilistically estimate "how many reads came from transcript X" from there.

ADD REPLY • link 5.1 years ago by dsull ★ 7.8k

0

Entering edit mode

Really appreciate your answer! ☺️☺️☺️

ADD REPLY • link 5.1 years ago by guangshiqi.csu • 0

score 0 · Answer 1 · 2020-10-18

0

Entering edit mode

5.1 years ago

swbarnes2 15k

If your conditions don't differ much, there might not be many changing genes.

It would help if you put up results of PCA, that would give you an idea if there are strong differences between your conditions.

I've never seen a DESeq tutorial or vignette that had the design <-~ 0 + something format, usually it would just be design <- ~ condition, so possibly that's a problem. Or it might not make a difference.

ADD COMMENT • link 5.1 years ago by swbarnes2 15k

0

Entering edit mode

I saw ~ 0 + something in EdgeR vignette, but I changed it back to ~condition. The situation was not better. But I did plot a PCA last night before I went to sleep. They were so bad at clustering, and had some overlaps. So I guess this is the reason! This is the PCA plot.

When this happens, how can I optimize by data? Can I drop some samples? Thank you very much!

ADD REPLY • link 5.1 years ago by guangshiqi.csu • 0