Question

RNASeq - low correlation between samples

1

Entering edit mode

5.5 years ago

pentium3-user ▴ 30

Hi @ all,
I'm comparing MicroArray based expression data with RNASeq expression data. For this I have data of two fungi strains, one named B36 the other named N402, each with 3 replicates.

I observed quite low correlation between the the RNASeq samples. In the plot you can see the Pearson correlation of the normalized read counts and MA intensities (not log2 transformed). correlation microarray vs rnaseq

The correlation between the MicroArray samples is quite high but between RNASeq samples it is quite low. The N402 RNASeq sample shows an even higher correlation to both MicroArray samples than to B36 RNASeq sample.

I used the same mapping and counting procedure for both samples (STAR mapper + R package bamsignals). I observed a similar number of mapped reads and a similar sum of reads counts in both samples. The correlation between the RNASeq B36 and N402 replicates is near to 1. The fastqc report is quite good for all samples.

Does anyone have an explanation for this? Is my RNASeq B36 data just rubbish? Are there any other quality control steps I can do?
Or do you think the data is ok and RNASeq is much more sensitive to expression changes?

Would be very nice if someone could help :)

RNA-Seq correlation R • 3.4k views

ADD COMMENT • link 5.5 years ago by pentium3-user ▴ 30

0

Entering edit mode

Could you post the code? If you have three replicates for each treatment, why do you have just one cell? Were the RNA samples sequenced to an appropriate depth? What is the mapping rate (look at STAR logs)?

Or do you think the data is ok and RNASeq is much more sensitive to expression changes?

We don't have enough information to know if the RNAseq data is fine, but RNAseq has greater dynamic range than microarrays, provided samples are sequenced to an adequate depth.

By the way, you know STAR can map and quantify simultaneously?

ADD REPLY • link 5.5 years ago by h.mon 35k

1

Entering edit mode

Thanks for your reply. I just took the mean of the replicates. The correlation between the replicates is near to 1.

The sequencing was processed by the same company with the same conditions for all samples.

% of mapped reads is 86-89%.

This is the code for read counting

library(GenomicRanges)
library(bamsignals)
library (biomaRt)

## load transcript information
database <- useMart('fungi_mart', host='fungi.ensembl.org', dataset='aniger_eg_gene')
genes <- getBM(attributes=c("ensembl_gene_id","strand","chromosome_name", "transcript_start", "transcript_end", "transcript_length"), mart = database)
genes$strand <- ifelse (all_genes$strand == 1, "+", "-") # map strand info from 1/-1 to +/-
genes_gr <- makeGRangesFromDataFrame (genes, start.field="transcript_start", end.field="transcript_end", keep.extra.columns=T)

## count reads falling into gene regions
counts <- list()
for (i in list.dirs (".", recursive=FALSE)) # each sample has its own folder
{
  # count reads and add pseudo count of +1
  counts[[gsub ("./", "", i)]] <- bamCount ( paste (i, "Aligned.out.sorted.bam", sep="/"), genes_gr ) + 1
}

count_table <- data.frame ( matrix (unlist(counts), nrow=length(counts[[1]]), byrow=FALSE))
colnames (count_table) <- names (counts)
rownames (count_table) <- genes_gr$ensembl_gene_id

And this I use for MicroArray anlysis

library (affy)
library (affycoretools)
library (dsmmanigeraancollcdf)

## read cel files
prefix = "101053-0"
cels <- ReadAffy ( filenames = paste ( paste (prefix, c(4,5,6,1,2,3), sep=""), ".CEL", sep="" ),
                   sampleNames = c("B36_rep_1", "B36_rep_2", "B36_rep_3", "N402_rep_1", "N402_rep_2", "N402_rep_3") )

results_ma <- expresso (cels, bgcorrect.method = "rma", normalize.method = "quantiles", pmcorrect.method = "pmonly", summary.method="medianpolish")

ADD REPLY • link 5.5 years ago by pentium3-user ▴ 30

0

Entering edit mode

I don't see a major issue here... one certainly cannot make a conclusion that a sample's data is 'rubbish' based just on a single correlation.

You have a higher correlation with the microarray data because it is log base 2 transformed. If you went ahead to produce regularised log transformations of your RNA-seq data, then I am confident that you would observe a similar high correlation. In fact, your RNA-seq correlation coefficients are in line with what I have seen in the past when correlating such data on the negative binomial distribution, a scale on which raw and normalised RNA-seq counts are measured.

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi, thanks for your reply. In the correlation plot I posted neither Microarray (MA) nor RNAseq data are log scaled (used 2^exprs (results_ma)).
I also made a version with log scaled values. In this one RNASeq gets correlation of ~0.8 with RNASeq samples and ~0.6 to the MA samples.

What let me struggle is the fact that 1) the correlation within MA samples is that different than within RNASeq samples and 2) that RNASeq 402 has much better correlation to both MA samples than to RNASeq B36.

I would expect that cor (RNASeq B36, RNASeq N402) and cor (MA B36, MA N402) are similar OR cor (RNASeq B36, MA B36) and cor (RNASeq N402, MA N402) are similar.

ADD REPLY • link 5.5 years ago by pentium3-user ▴ 30

0

Entering edit mode

I see. What is the origin of these samples?

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

N402 is a wild type of Aspergillus Niger.
B36 a Glycoamylase-overproducer strain of Aspergillus Niger.

For each strain 3 cultures were grown and for each culture a sample was taken for MA and one for RNASeq analysis.

RNASeq data is strand-specific Paired End. NEBNext Ultra Directional RNA Library Prep Kit for Illumina (NEB #7420S/L) and Illumina HiSeq 2500 were used for sequencing.

For MA analysis this platform was used: DSM Aspergillus niger CBS513.88 by Affymetrix

Don't hesitate to ask if you need any further information.

ADD REPLY • link 5.5 years ago by pentium3-user ▴ 30

0

Entering edit mode

Thank you. Have you not done anything else in terms of QC? Are you literally paused here because you feel that there is an issue with the quality of the samples? You should at least proceed with your further analyses and then decide later if there may have been some problem.

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Do you have any suggestions for any other quality control steps I can do?

I'm writing my bachelor thesis with the goal to compare MA and RnaSeq. I want to find out if there are systematical biases in one/both methods, if there are classes of genes which one method suggests to be higher/lower expressed, etc.
But I could not find any of the biases described in literature (GC content, transcript length, high amount of SNPs, ...) in my data. They seem to be completely random.
Even the differential expression analysis shows completely different results for both methods.

So I went back and looked for any issues in the data.

ADD REPLY • link 5.5 years ago by pentium3-user ▴ 30

score 0 · Answer 1 · 2018-10-25

I finally found the problem(s).

The main issue is the gene GlaA which is highly overexpressed in B36 (and also quite high in N402). If I take it out correlation gets as expected. I guess the MA cannot measure the gene correctly because of the saturation of the probes.

Another issue was the read counting. I incorrectly counted the reads for both strands but it turned out the data is (reverse) stranded. So counting only the reads for the right strand also improved the correlation.

Here you can see the correlation plot without GlaA and with correct read counts enter image description here

Here you can see that GlaA very high expressed in B36 enter image description here