I've put this off for a while and trawled Biostars and Bioconductor for a consensus, of which there appears to be none.
The short question: I am exploring expression evolution across a variety of bird species (6-10 species) and need to normalise my read counts and choose an appropriate expression metric for downstream comparisons. What are people's recommendations?
I am getting my counts from paired-end reads using Salmon, identifying orthologs from the CDS with BLAST, and importing transcripts and summarising them to genes using tximport.
Everything I read tells me that TMM will be biased by differences in library composition, gene length, gene content, etc. between species. Likewise, raw RPKM is also contentious. Brawand et al. (2011) use a median-centring method on a subset of orthologs with conserved expression patterns. I have also seen zFPKM used for clustering inter-species expression data.
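To make the median-centring idea concrete, here is a minimal Python sketch of how I understand the general approach. It is not the exact procedure from Brawand et al. (2011): the function names (`average_ranks`, `median_scale`), the choice of "smallest rank spread" as the conservation criterion, the `n_ref` parameter and the toy values are all my own assumptions for illustration. The input is assumed to be length-normalised expression (e.g. TPM) for one-to-one orthologs, in the same order in every sample.

```python
from statistics import median

def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def median_scale(expr, n_ref):
    """Scale samples so a set of rank-conserved orthologs share one median.

    expr  : dict mapping sample name -> list of expression values,
            with orthologs in the same order in every sample.
    n_ref : number of reference orthologs to use (hypothetical parameter).
    """
    samples = list(expr)
    n_genes = len(expr[samples[0]])
    ranks = {s: average_ranks(expr[s]) for s in samples}

    # Assumed conservation criterion: smallest spread of ranks across samples.
    def rank_spread(g):
        r = [ranks[s][g] for s in samples]
        return max(r) - min(r)

    ref = sorted(range(n_genes), key=rank_spread)[:n_ref]
    ref_medians = {s: median(expr[s][g] for g in ref) for s in samples}
    if any(m == 0 for m in ref_medians.values()):
        raise ValueError("reference orthologs have a zero median in some sample")

    # Bring every sample's reference median to the mean of those medians.
    target = sum(ref_medians.values()) / len(samples)
    factors = {s: target / ref_medians[s] for s in samples}
    scaled = {s: [v * factors[s] for v in expr[s]] for s in samples}
    return scaled, factors, ref

# Toy data (made up): five orthologs, two species.
expr = {
    "chicken":     [10.0, 20.0, 30.0, 40.0, 500.0],
    "zebra_finch": [20.0, 40.0, 60.0, 80.0, 5.0],
}
scaled, factors, ref = median_scale(expr, n_ref=4)
print(ref)      # indices of the orthologs used as the reference set
print(factors)  # per-sample scaling factors
```

In this toy case the fifth ortholog changes rank sharply between the two species, so it is left out of the reference set and does not influence the scaling factors.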
I realise this issue is still a bioinformatically complex problem, but thank you for any help that can be offered!
What types of analyses are you planning on doing? Salmon's estimates can be corrected for exactly the problems that you mention -- gene lengths, GC content and so on. Depending on the type of question you want to address, it may also make sense to simply operate on ranks rather than the actual counts/abundance estimates.
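As a rough illustration of what operating on ranks could look like, here is a small Python sketch that converts each sample's ortholog expression to within-sample ranks and compares two samples with a Spearman correlation. The function names and the toy values are made up for the example, and it assumes one-to-one orthologs listed in the same order in both samples.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy values (made up): same four orthologs in two species.
chicken = [5.0, 120.0, 33.0, 0.5]
zebra_finch = [50.0, 900.0, 410.0, 2.0]

chicken_ranks = average_ranks(chicken)
rho = spearman(chicken, zebra_finch)
print(chicken_ranks, rho)
```

Because only the ordering within each sample is used, a global scaling difference between libraries does not change the result. The trade-off is that the magnitude of expression differences is discarded.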
Hi Friederike, thanks for the response.
I'm going to be applying various expression evolution models (OU/BM etc.) to study sexual dimorphism evolution between species.
Thanks for the advice on Salmon, but I'm concerned about how to deal with library composition differences between species, which may not be resolved solely through GC and gene length adjustments.