21 months ago
Mariah.Hoffman ▴ 20
I am interested in using rlog-transformed RNA-seq data for machine learning applications. One concern that I have is that rlog does not account for gene length.
Would it be inadvisable to scale rlog-transformed data by log gene length?
A couple notes on my approach:
1) I am planning on using a frozen rlog transformation for my validation/test sets.
2) Because of my deep learning architecture, I would prefer to account for gene length in preprocessing rather than downstream.
Thank you for your time!
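For concreteness, here is a minimal sketch (in Python/NumPy, with entirely made-up rlog values and gene lengths) of the scaling I have in mind:

```python
import numpy as np

# Hypothetical rlog-transformed matrix: rows = genes, columns = samples.
rlog = np.array([[8.1, 8.4, 7.9],
                 [5.2, 5.0, 5.3],
                 [10.6, 10.9, 10.7]])

# Hypothetical gene lengths in base pairs (e.g., annotated cDNA length).
gene_length = np.array([2000, 500, 3500])

# Proposed preprocessing: divide each gene's rlog values by its log gene length.
scaled = rlog / np.log2(gene_length)[:, None]
```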
If you happen to use salmon + tximport for your processing, the resulting aggregate gene-level estimates will be corrected for transcript/gene length.
Thank you for the suggestion! I am using data that has already been summarized from transcript to the gene level by summation (specifically from the ArchS4 project), and I would rather not re-align the samples if possible. To your point, though, what exactly I would use as a 'gene length' in this context is another troubling issue.
I suppose the length in base pairs of the cDNA of the gene/transcript in question.
It seems that you want to have data that is adjusted for both library size (across samples) and gene length (within sample), right?
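If so, the classic way to get both adjustments is a TPM-like calculation: divide by gene length within each sample first, then rescale each sample's column to a fixed total. A minimal NumPy sketch with made-up counts and lengths:

```python
import numpy as np

# Hypothetical raw counts: rows = genes, columns = samples.
counts = np.array([[100., 120.],
                   [300., 280.],
                   [ 50.,  60.]])

# Hypothetical gene lengths in kilobases.
lengths_kb = np.array([2.0, 0.5, 3.5])

# Within-sample adjustment: reads per kilobase (divide by gene length).
rpk = counts / lengths_kb[:, None]

# Across-sample adjustment: rescale each column to sum to one million (TPM).
tpm = rpk / rpk.sum(axis=0) * 1e6
```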
Thank you for your response! With regard to your points:
I do agree that using the annotated gene length is probably the most straightforward choice, since I do not know how much each transcript contributes to the total count for the gene, nor do I have an effective length (per transcript, per experiment).
I do want to adjust for both, so that the gene expression data going into my model is comparable between samples and also carries some meaning within samples. Of the two, differences between samples are more critical to my application, but I wonder whether accounting for both may ultimately lead to a stronger model, particularly for a fully connected deep learning model.
I think that the variance-stabilised expression levels from DESeq2 are as good as anything else. You could further standardise these to Z- [or some other] scores, if you wished.
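By "standardise to Z-scores" I mean centring and scaling each gene across samples. A minimal NumPy sketch, with a hypothetical variance-stabilised matrix standing in for real DESeq2 output:

```python
import numpy as np

# Hypothetical variance-stabilised expression: rows = genes, columns = samples.
vst = np.array([[8.1, 8.4, 7.9, 8.6],
                [5.2, 5.0, 5.3, 5.1],
                [10.6, 10.9, 10.7, 10.4]])

# Z-score each gene across samples: subtract its mean, divide by its
# standard deviation, giving mean 0 and standard deviation 1 per gene.
z = (vst - vst.mean(axis=1, keepdims=True)) / vst.std(axis=1, keepdims=True)
```

For a frozen transformation on validation/test sets, the means and standard deviations would be computed on the training samples only and reused.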
This is, incidentally, akin to what tools like glmnet do to your data by default, i.e., standardise the variables to have mean 0 and standard deviation 1. There's a write-up on it here: https://statisticaloddsandends.wordpress.com/2018/11/15/a-deep-dive-into-glmnet-standardize/