Question

DESeq2 rlog-transformed RNA-seq data for machine learning input

2

3.9 years ago

Mariah.Hoffman ▴ 20

I am interested in using rlog-transformed RNA-seq data for machine learning applications. One concern that I have is that rlog does not account for gene length.

Would it be inadvisable to scale rlog-transformed data by log gene length?

A couple notes on my approach:

1) I am planning on using a frozen rlog transformation for my validation/test sets.
2) Because of my deep learning architecture, I would prefer to account for gene length in preprocessing rather than downstream.

Thank you for your time!

rlog deeplearning DESeq2 machinelearning • 2.2k views

ADD COMMENT • link updated 3.9 years ago by Kevin Blighe 89k • written 3.9 years ago by Mariah.Hoffman ▴ 20
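
For reference, rlog values are on the log2 scale, so dividing the underlying abundance by gene length corresponds to subtracting log2(length) from each gene's values. The sketch below illustrates that reading of the question in plain Python. The gene names, values, lengths, and the helper `subtract_log2_length` are all hypothetical, and kilobases are an arbitrary choice of unit; none of this is a DESeq2 function.

```python
import math

def subtract_log2_length(log2_expr, gene_length_bp):
    """Subtract log2(gene length in kb) from each gene's log2-scale values.

    log2_expr: dict mapping gene -> list of log2-scale values, one per sample
    gene_length_bp: dict mapping gene -> length in base pairs
    """
    adjusted = {}
    for gene, values in log2_expr.items():
        # One constant offset per gene, shared by every sample.
        offset = math.log2(gene_length_bp[gene] / 1000.0)
        adjusted[gene] = [v - offset for v in values]
    return adjusted

# Hypothetical rlog-like values for two genes across three samples.
rlog_values = {"geneA": [10.0, 11.0, 9.5], "geneB": [6.0, 6.5, 7.0]}
# Hypothetical gene lengths in base pairs.
lengths = {"geneA": 4000, "geneB": 500}

adjusted = subtract_log2_length(rlog_values, lengths)
```

Because the offset is a constant per gene, differences between samples for a given gene are unchanged; only the relative levels of different genes within a sample shift. The same lengths would need to be reused for validation/test samples to stay consistent with a frozen transformation.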

1

If you happen to use salmon + tximport for your processing, the resulting aggregated gene-level estimates will be corrected for transcript/gene length.

ADD REPLY • link 3.9 years ago by rpolicastro 13k

0

Thank you for the suggestion! I am using data that has already been summarized from the transcript level to the gene level by summation (specifically from the ARCHS4 project), and I would rather not re-align the samples if possible. To your point, though, what exactly I would use as a 'gene length' in this context is another troubling issue.

ADD REPLY • link 3.9 years ago by Mariah.Hoffman ▴ 20

0

> what exactly I would use as a 'gene length' in this context is another troubling issue.

I suppose the length, in base pairs, of the cDNA of the gene / transcript in question.

It seems that you want to have data that is adjusted for both library size (across samples) and gene length (within sample), right?

ADD REPLY • link 3.9 years ago by Kevin Blighe 89k

0

Thank you for your response! With regard to your points:

> I suppose the length, in base pairs, of the cDNA of the gene / transcript in question

I do agree that using the gene length is probably the most straightforward since I do not know how much each transcript is contributing to the total count for the gene, nor do I have any idea of an effective length (per transcript, per experiment).

> It seems that you want to have data that is adjusted for both library size (across samples) and gene length (within sample), right?

I do want to adjust for both, so that the gene expression data going into my model is comparable between samples and also has some meaning within a sample. Of these two, comparability between samples is more critical to my application, but I am wondering whether accounting for both may ultimately lead to a stronger model, particularly for a fully connected deep learning model.

ADD REPLY • link 3.9 years ago by Mariah.Hoffman ▴ 20
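
To make the two adjustments discussed above concrete, here is a minimal sketch of a TPM-style calculation, which is a different technique from rlog: dividing by gene length handles the within-sample part, and rescaling to a fixed total handles sequencing depth. The counts, lengths, pseudocount, and the helper `log2_tpm` are hypothetical, and a single annotated gene length per gene is assumed because transcript-level effective lengths are not available here.

```python
import math

def log2_tpm(counts, gene_length_bp, pseudocount=1.0):
    """Return log2(TPM + pseudocount) for ONE sample.

    counts: dict mapping gene -> raw read count in this sample
    gene_length_bp: dict mapping gene -> length in base pairs
    """
    # Reads per kilobase: the within-sample gene-length adjustment.
    rpk = {g: c / (gene_length_bp[g] / 1000.0) for g, c in counts.items()}
    # Rescale so values sum to one million: the sequencing-depth adjustment.
    scale = sum(rpk.values()) / 1e6
    return {g: math.log2(r / scale + pseudocount) for g, r in rpk.items()}

# Hypothetical gene lengths (bp) and raw counts for one sample.
lengths = {"geneA": 4000, "geneB": 500, "geneC": 1000}
sample = {"geneA": 400, "geneB": 50, "geneC": 100}

result = log2_tpm(sample, lengths)

# The same sample sequenced twice as deeply gives the same values.
deeper = log2_tpm({g: 2 * c for g, c in sample.items()}, lengths)
```

In this toy sample all three genes end up with the same value, because their counts are exactly proportional to their lengths. Note that a simple depth rescaling like this is less robust to composition differences between samples than the size-factor normalisation that DESeq2 applies.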

1

I think that the variance-stabilised expression levels from DESeq2 are as good as anything else. You could further standardise these to Z-scores (or some other scores), if you wished.

This is akin to what tools like glmnet do to your data by default, i.e., standardise the variables to have a mean of 0 and a standard deviation of 1. There's a page on it here: https://statisticaloddsandends.wordpress.com/2018/11/15/a-deep-dive-into-glmnet-standardize/

ADD REPLY • link 3.9 years ago by Kevin Blighe 89k
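
A minimal sketch of the per-gene standardisation suggested above, assuming the mean and standard deviation are estimated on the training samples only and then reused for validation/test samples, in the same spirit as a frozen transformation. The helper names and values are hypothetical, and the use of the population standard deviation is a choice made for this example.

```python
import statistics

def fit_standardiser(train):
    """Estimate per-gene mean and standard deviation from training data.

    train: dict mapping gene -> list of values across training samples
    """
    return {g: (statistics.mean(v), statistics.pstdev(v)) for g, v in train.items()}

def apply_standardiser(params, data):
    """Z-score each gene using previously fitted (mean, sd) pairs."""
    out = {}
    for gene, values in data.items():
        mean, sd = params[gene]
        # Genes with zero variance in training are mapped to 0.0.
        out[gene] = [(v - mean) / sd if sd > 0 else 0.0 for v in values]
    return out

# Hypothetical log2-scale training and test values for one gene.
train = {"geneA": [8.0, 10.0, 12.0]}
test = {"geneA": [11.0]}

params = fit_standardiser(train)
z_train = apply_standardiser(params, train)
z_test = apply_standardiser(params, test)

# Shifting a gene by a constant (such as a log2 gene-length offset)
# before standardising leaves the Z-scores unchanged.
shifted = {"geneA": [v - 2.0 for v in train["geneA"]]}
z_shifted = apply_standardiser(fit_standardiser(shifted), shifted)
```

One consequence worth noting: if each gene is standardised this way, any constant per-gene shift on the log scale, including a gene-length correction, cancels out, so the length adjustment would not change what the model sees.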