Question: DESeq2 using samples with different unique alignment rates
gravatar for Tash
21 months ago by
Tash0 wrote:

My aim is to do a differential expression analysis using paired end data with two conditions: controls and cases.

Unfortunately, I was having some issues with low unique alignment rates in all my 5 cases (approx. 40% mapping rate) due to rRNA contamination as confirmed by BBduk and BLASTing. I ran DESeq2 with 4 controls that have good alignment rates (90%) versus the 5 cases. I do end up with a fair amount of DEG's and a pathway analysis in IPA shows some promising results, but now I am unsure if I can trust these results. I did not control for anything as we only have females in the same age group. I have tried to change the design to include alignment rates but I get an error:

"Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed".

Since I do have more than double aligned reads in my controls I am wondering if I can do some normalisation to control for this discrepancy in DESeq2 or at an earlier step in the analysis?

My pipeline is:

- Trimmomatic

java -jar /opt/apps/bioinformatics/trimmomatic/0.36/trimmomatic-0.36.jar PE ${INPUTFILE_R1} ${INPUTFILE_R2} ${OUTPUTFILE_PAIRED_R1} ${OUTPUTFILE_UNPAIRED_R1} ${OUTPUTFILE_PAIRED_R2} ${OUTPUTFILE_UNPAIRED_R2} ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE SLIDINGWINDOW:4:15 MINLEN:25


STAR --runThreadN 4 --genomeDir path/genome_dir --readFilesCommand zcat --readFilesIn ${INPUTFILE_R1} ${INPUTFILE_R2} --outReadsUnmapped Fastx --outSAMtype BAM Unsorted --outFileNamePrefix ${OUTPUTFILE}

- htseq

htseq-count -f bam -a 10 -s reverse --idattr=gene_name $INPUTFILE path/gencode.v28.annotation.gtf > $OUTPUTFILE

- DESeq2

dds<-DESeqDataSetFromMatrix(countData, colData, design = ~ condition) 
dds$condition <- relevel(dds$condition, ref = "control")
ddsB <- estimateDispersions(dds)
variancedata <- varianceStabilizingTransformation(ddsB)

results_RNAseq_DEGs = (results(dds,contrast=c("condition", "cases", "control")))
res <- results(dds)
res_filtered=results_RNAseq_DEGs %>% subset(padj<0.05 & abs(log2FoldChange) > 0.5)

I hope that I have provided you with sufficient information.

Thanks in advance.

rna-seq R • 630 views
ADD COMMENTlink written 21 months ago by Tash0

Does a PCA plot indicate any significant difference in the high rRNA samples?

ADD REPLYlink written 21 months ago by Devon Ryan95k

I linked to the PCA plot. There seems to be quite a difference.

ADD REPLYlink written 21 months ago by Tash0

Please color that by rRNA level rather than group.

ADD REPLYlink written 21 months ago by Devon Ryan95k

Thanks for your replies Devon.

I coloured by unique alignment rates since I haven't checked exact rRNA content using bbduk for all the "badly" aligned files, only two. Pretty sure all of the 5 have similar contamination though because when I BLAST the overrepresented sequences they all give hits for rRNA.

It makes sense that this plot looks like the previous one as all the samples who have high RNA content are cases.

ADD REPLYlink written 21 months ago by Tash0

OK, from that it's clear that rRNA contamination is not a main driver of any of the clustering. I wouldn't worry about incorporating that into your design then. If you need to do that in the future then ensuring it's a continuous variable will probably solve the error message you received.

ADD REPLYlink written 21 months ago by Devon Ryan95k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1733 users visited in the last hour