Question

dUTP sequencing -- different strand-ratios for different samples

4.6 years ago

mschmid ▴ 180

I am analysing RNA-seq data that is supposed to carry strandedness information via dUTP sequencing.

After mapping the Illumina data to the reference genome using HISAT2 (with the parameter that keeps strandedness information), I analysed the data. I calculated the ratio between the flags for plus-strand and minus-strand. I get the following ratios:

[Figure: Ratios]

The groups are: C(control),H(high),L(low). The sample "H1" is off anyway, you can ignore this one.
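For reference, a flag-based plus/minus ratio like the one described above could be computed roughly as follows. This is only a sketch: the function name `strand_ratio` and the `first_mate_only` option are made up for illustration, and it assumes plain-text SAM records (e.g. from `samtools view`). For properly paired data the two mates map to opposite strands, so counting both mates would push the ratio towards 1; in that case one would probably count the first mate only.

```python
def strand_ratio(sam_lines, first_mate_only=False):
    """Return (# forward-mapped records) / (# reverse-mapped records).

    sam_lines: iterable of SAM text lines (header lines are skipped).
    first_mate_only: if True, paired records that are not the first mate
    (flag 0x40 unset) are ignored. Hypothetical option for paired-end data.
    """
    plus = minus = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and the SAM header
        flag = int(line.split("\t")[1])
        # skip unmapped (0x4), secondary (0x100) and supplementary (0x800)
        if flag & (0x4 | 0x100 | 0x800):
            continue
        if first_mate_only and (flag & 0x1) and not (flag & 0x40):
            continue
        if flag & 0x10:  # 0x10 = read mapped to the reverse (minus) strand
            minus += 1
        else:
            plus += 1
    if minus == 0:
        raise ValueError("no reverse-strand records found")
    return plus / minus
```

With a SAM file on disk this would be called as `strand_ratio(open("sample.sam"))`, where `sample.sam` is a placeholder name.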

It is weird that the ratios are so different. They are all above one (ignoring "H1") and go up to almost two. How would you interpret this result? Did anything go wrong during the dUTP treatment?

The other thing is that if I do a PCA, PC1 reflects exactly the ratios above. PC2 is then the treatment (C, H, L).

What would you do in this case? How would you test if there was a problem with dUTP? Or do you see another reason for this picture?

rna-seq dUTP HISAT2 • 961 views

updated 4.6 years ago by Charles Warden 8.2k • written 4.6 years ago by mschmid ▴ 180

score 1 · Answer 1 · 2019-09-17

As you know, the dUTP method ensures that you know the orientation of the read, i.e. whether the read has the same orientation as the sequenced transcript or the reverse. It doesn't tell you whether the read originates from the plus or the minus strand of the genome; that is determined during mapping. In human, genes are split roughly 50:50 between the plus and minus strand, and transcript lengths are similar. So the ratio you mention should be around 1 if every gene were expressed equally. I don't know if that is ever the case, but it seems logical to me that gene expression is not the same everywhere. So in theory, when the active promoters affect more genes on one strand than on the other, you could see higher expression of genes on a specific strand (I saw this in my data once too). Even with problems during library preparation, the effect could not be that large, because I think only around 10% of loci overlap, if we are talking about human data.
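To make the distinction between mapping strand and transcript strand concrete, here is a minimal sketch of how the transcript strand could be derived from a SAM flag. It assumes a standard dUTP library (often called fr-firststrand or "RF"), where read 1 is antisense to the transcript and read 2 is sense. The function name `transcript_strand` is made up for this example.

```python
def transcript_strand(flag, paired=True):
    """Infer the genomic strand of the originating transcript.

    Assumes a dUTP (fr-firststrand / RF) library:
      read 1 (or a single-end read) is antisense to the transcript,
      read 2 is sense to the transcript.
    """
    mapped_reverse = bool(flag & 0x10)            # read aligned to minus strand
    is_second_mate = paired and bool(flag & 0x80)
    if is_second_mate:
        # sense read: transcript is on the strand the read maps to
        return "-" if mapped_reverse else "+"
    # antisense read: transcript is on the opposite strand
    return "+" if mapped_reverse else "-"


# flags 83/163 form a proper pair from a plus-strand transcript,
# flags 99/147 a proper pair from a minus-strand transcript
for f in (83, 163, 99, 147):
    print(f, transcript_strand(f))
```

Counting reads per transcript strand this way, rather than per mapping strand, gives the genome-level plus/minus ratio discussed here; checking reads against the annotated strand of each gene is a separate question.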

So if you have not just one cell line treated differently, but biological replicates from different individuals, the deviations between the replicates could perhaps be explained by individual-specific gene expression, which can differ due to e.g. different SNPs. However, if you measured gene expression in a cell population with similar genetic information across your samples (for example pooled primary cells that are split before the treatment), then the deviations between your samples are way too high (from experience with my data, I would expect around a few percent). For example, I have data from pooled primary human endothelial cells. The ratio you describe ranged from 1.00 to 1.04 in normal cells; after treatment and stress, it ranges from 1.19 to 1.21 within the same treatment.

So basically, what I'm trying to say is that I don't think this ratio always has to be 1. But if you are dealing with a homogeneous set of biological replicates (always the same cell line, or pooled primary cells), the difference between samples of the same group should be much lower! And even with biologically different individuals, I see this ratio range from 0.82 to 1.03 in another data set.
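A quick way to check the within-group spread is sketched below. The sample names are hypothetical and the values are just the range endpoints I quoted above (1.00 to 1.04 untreated, 1.19 to 1.21 treated). Grouping by the first letter of the sample name is an assumption that matches names like C1, H2, L3.

```python
from collections import defaultdict


def within_group_spread(ratios):
    """Return {group: max ratio - min ratio}, grouping by first letter of the name."""
    groups = defaultdict(list)
    for sample, ratio in ratios.items():
        groups[sample[0]].append(ratio)
    return {group: max(values) - min(values) for group, values in groups.items()}


# hypothetical sample names; values are the endpoints mentioned in this answer
example = {"C1": 1.00, "C2": 1.04, "T1": 1.19, "T2": 1.21}
for group, spread in within_group_spread(example).items():
    print(group, round(spread, 2))
```

If the spread within a group is much larger than a few percent for samples from the same cell pool, I would look at library preparation before looking for a biological explanation.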

score 0 · Answer 2 · 2019-09-17

If you use infer_experiment.py from RSeQC (for housekeeping genes), I would usually expect the strand percentages to be within ~3% of each other. For example, the strand fraction for an unstranded library is probably usually between 0.48 and 0.52, and the strand fraction is usually above 0.97 for a stranded library.
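As a rough illustration of those rules of thumb, the two strand fractions reported by RSeQC could be classified like this. The function name `classify_library` and the "ambiguous" category are made up for this sketch, and the thresholds are just the approximate values mentioned above, not anything defined by RSeQC.

```python
def classify_library(frac_a, frac_b, stranded_min=0.97, unstranded_range=(0.48, 0.52)):
    """Classify a library from the two strand fractions of a strandedness check.

    frac_a, frac_b: fraction of reads explained by each of the two
    orientation patterns (they need not sum to 1, since some reads
    cannot be assigned).
    """
    low, high = unstranded_range
    if max(frac_a, frac_b) > stranded_min:
        return "stranded"
    if low <= frac_a <= high and low <= frac_b <= high:
        return "unstranded"
    return "ambiguous"  # e.g. a partly failed strand-specific protocol


print(classify_library(0.98, 0.01))
print(classify_library(0.49, 0.50))
print(classify_library(0.65, 0.33))
```

A sample that lands in the "ambiguous" range would be the one I would be most suspicious about.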

I think you have gotten a good response (in terms of making sure it is the gene and not genome strand you are checking), and here is the documentation for RSeQC:

http://rseqc.sourceforge.net/#infer-experiment-py