Question: RNASeq - low correlation between samples
gravatar for pentium3-user
3 months ago by
pentium3-user20 wrote:

Hi @ all,
I'm comparing MicroArray based expression data with RNASeq expression data. For this I have data of two fungi strains, one named B36 the other named N402, each with 3 replicates.

I observed quite low correlation between the the RNASeq samples. In the plot you can see the Pearson correlation of the normalized read counts and MA intensities (not log2 transformed). correlation microarray vs rnaseq

The correlation between the MicroArray samples is quite high but between RNASeq samples it is quite low. The N402 RNASeq sample shows an even higher correlation to both MicroArray samples than to B36 RNASeq sample.

I used the same mapping and counting procedure for both samples (STAR mapper + R package bamsignals). I observed a similar number of mapped reads and a similar sum of reads counts in both samples. The correlation between the RNASeq B36 and N402 replicates is near to 1. The fastqc report is quite good for all samples.

Does anyone have an explanation for this? Is my RNASeq B36 data just rubbish? Are there any other quality control steps I can do?
Or do you think the data is ok and RNASeq is much more sensitive to expression changes?

Would be very nice if someone could help :)

rna-seq correlation R • 378 views
ADD COMMENTlink modified 12 weeks ago • written 3 months ago by pentium3-user20

Could you post the code? If you have three replicates for each treatment, why do you have just one cell? Were the RNA samples sequenced to an appropriate depth? What is the mapping rate (look at STAR logs)?

Or do you think the data is ok and RNASeq is much more sensitive to expression changes?

We don't have enough information to know if the RNAseq data is fine, but RNAseq has greater dynamic range than microarrays, provided samples are sequenced to an adequate depth.

By the way, you know STAR can map and quantify simultaneously?

ADD REPLYlink modified 3 months ago • written 3 months ago by h.mon22k

Thanks for your reply. I just took the mean of the replicates. The correlation between the replicates is near to 1.

The sequencing was processed by the same company with the same conditions for all samples.

% of mapped reads is 86-89%.

This is the code for read counting

library (biomaRt)

## load transcript information
database <- useMart('fungi_mart', host='', dataset='aniger_eg_gene')
genes <- getBM(attributes=c("ensembl_gene_id","strand","chromosome_name", "transcript_start", "transcript_end", "transcript_length"), mart = database)
genes$strand <- ifelse (all_genes$strand == 1, "+", "-") # map strand info from 1/-1 to +/-
genes_gr <- makeGRangesFromDataFrame (genes, start.field="transcript_start", end.field="transcript_end", keep.extra.columns=T)

## count reads falling into gene regions
counts <- list()
for (i in list.dirs (".", recursive=FALSE)) # each sample has its own folder
  # count reads and add pseudo count of +1
  counts[[gsub ("./", "", i)]] <- bamCount ( paste (i, "Aligned.out.sorted.bam", sep="/"), genes_gr ) + 1

count_table <- data.frame ( matrix (unlist(counts), nrow=length(counts[[1]]), byrow=FALSE))
colnames (count_table) <- names (counts)
rownames (count_table) <- genes_gr$ensembl_gene_id

And this I use for MicroArray anlysis

library (affy)
library (affycoretools)
library (dsmmanigeraancollcdf)

## read cel files
prefix = "101053-0"
cels <- ReadAffy ( filenames = paste ( paste (prefix, c(4,5,6,1,2,3), sep=""), ".CEL", sep="" ),
                   sampleNames = c("B36_rep_1", "B36_rep_2", "B36_rep_3", "N402_rep_1", "N402_rep_2", "N402_rep_3") )

results_ma <- expresso (cels, bgcorrect.method = "rma", normalize.method = "quantiles", pmcorrect.method = "pmonly", summary.method="medianpolish")
ADD REPLYlink modified 3 months ago • written 3 months ago by pentium3-user20

I don't see a major issue here... one certainly cannot make a conclusion that a sample's data is 'rubbish' based just on a single correlation.

You have a higher correlation with the microarray data because it is log base 2 transformed. If you went ahead to produce regularised log transformations of your RNA-seq data, then I am confident that you would observe a similar high correlation. In fact, your RNA-seq correlation coefficients are in line with what I have seen in the past when correlating such data on the negative binomial distribution, a scale on which raw and normalised RNA-seq counts are measured.

ADD REPLYlink written 3 months ago by Kevin Blighe35k

Hi, thanks for your reply. In the correlation plot I posted neither Microarray (MA) nor RNAseq data are log scaled (used 2^exprs (results_ma)).
I also made a version with log scaled values. In this one RNASeq gets correlation of ~0.8 with RNASeq samples and ~0.6 to the MA samples.

What let me struggle is the fact that 1) the correlation within MA samples is that different than within RNASeq samples and 2) that RNASeq 402 has much better correlation to both MA samples than to RNASeq B36.

I would expect that cor (RNASeq B36, RNASeq N402) and cor (MA B36, MA N402) are similar OR cor (RNASeq B36, MA B36) and cor (RNASeq N402, MA N402) are similar.

ADD REPLYlink modified 3 months ago • written 3 months ago by pentium3-user20

I see. What is the origin of these samples?

ADD REPLYlink written 3 months ago by Kevin Blighe35k

N402 is a wild type of Aspergillus Niger.
B36 a Glycoamylase-overproducer strain of Aspergillus Niger.

For each strain 3 cultures were grown and for each culture a sample was taken for MA and one for RNASeq analysis.

RNASeq data is strand-specific Paired End. NEBNext Ultra Directional RNA Library Prep Kit for Illumina (NEB #7420S/L) and Illumina HiSeq 2500 were used for sequencing.

For MA analysis this platform was used: DSM Aspergillus niger CBS513.88 by Affymetrix

Don't hesitate to ask if you need any further information.

ADD REPLYlink written 3 months ago by pentium3-user20

Thank you. Have you not done anything else in terms of QC? Are you literally paused here because you feel that there is an issue with the quality of the samples? You should at least proceed with your further analyses and then decide later if there may have been some problem.

ADD REPLYlink written 3 months ago by Kevin Blighe35k

Do you have any suggestions for any other quality control steps I can do?

I'm writing my bachelor thesis with the goal to compare MA and RnaSeq. I want to find out if there are systematical biases in one/both methods, if there are classes of genes which one method suggests to be higher/lower expressed, etc.
But I could not find any of the biases described in literature (GC content, transcript length, high amount of SNPs, ...) in my data. They seem to be completely random.
Even the differential expression analysis shows completely different results for both methods.

So I went back and looked for any issues in the data.

ADD REPLYlink modified 3 months ago • written 3 months ago by pentium3-user20
gravatar for pentium3-user
12 weeks ago by
pentium3-user20 wrote:

I finally found the problem(s).

The main issue is the gene GlaA which is highly overexpressed in B36 (and also quite high in N402). If I take it out correlation gets as expected. I guess the MA cannot measure the gene correctly because of the saturation of the probes.

Another issue was the read counting. I incorrectly counted the reads for both strands but it turned out the data is (reverse) stranded. So counting only the reads for the right strand also improved the correlation.

Here you can see the correlation plot without GlaA and with correct read counts enter image description here

Here you can see that GlaA very high expressed in B36 enter image description here

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by pentium3-user20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1942 users visited in the last hour