Question

RNA-Seq analysis across different species

10.2 years ago

Nicolas Rosewick 11k

Hi,

I have to analyze RNA-Seq data from multiple species (human, bovine, ...) and different cell types, to detect differentially expressed genes (DEGs) and to select the DEGs that are common between species. I have two ideas in mind:

1. Align each species to its own genome and count the number of reads per gene using that species' annotation (e.g. Ensembl). Then use DESeq2 to assess differential expression.

2. Or align each sample against a common transcriptome (typically the human transcriptome, so that post-hoc analyses such as GO enrichment can be used).

What do you think? Any advice?

Thanks

RNA-Seq species • 7.3k views

updated 3.0 years ago by Ram 45k • written 10.2 years ago by Nicolas Rosewick 11k

I'm not sure your second strategy would work. I recently ran some simulations to see how many RNA-seq reads from fly would map to the mouse mm10 reference. Answer: less than 1% of fly reads map to mouse.

10.2 years ago by Ryan Dale 5.0k

I'd align each species to its own genome and map genes between species through orthology. Without establishing orthology you can't be sure you're comparing the same genes, and if you're not comparing the same genes, the DE values you calculate aren't relevant: you could be comparing two genes with different functions.
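As a rough illustration of the orthology step, here is a minimal Python sketch. It assumes you already have a table of ortholog pairs from some orthology resource; the gene IDs, the `ORTHOLOG_PAIRS` table and the helper names are all made up for the example. It keeps only one-to-one pairs and re-keys the bovine results with the human gene ID so both species share one key.

```python
from collections import Counter

# Hypothetical ortholog table: (human_gene_id, bovine_gene_id).
# Real pairs would come from an orthology resource; these IDs are invented.
ORTHOLOG_PAIRS = [
    ("HUMAN_G1", "BOVINE_G1"),
    ("HUMAN_G2", "BOVINE_G2"),
    ("HUMAN_G3", "BOVINE_G3a"),
    ("HUMAN_G3", "BOVINE_G3b"),  # one-to-many relationship, dropped below
]


def one_to_one(pairs):
    """Keep only pairs in which each gene appears exactly once on its side."""
    human_seen = Counter(h for h, _ in pairs)
    bovine_seen = Counter(b for _, b in pairs)
    return [(h, b) for h, b in pairs if human_seen[h] == 1 and bovine_seen[b] == 1]


def rekey_by_ortholog(bovine_values, pairs):
    """Re-key a bovine {gene_id: value} table with the human ortholog ID."""
    bovine_to_human = {b: h for h, b in one_to_one(pairs)}
    return {
        bovine_to_human[gene]: value
        for gene, value in bovine_values.items()
        if gene in bovine_to_human  # genes without a 1:1 ortholog are skipped
    }


# Hypothetical bovine log2 fold changes from a within-species DE analysis.
bovine_log2fc = {"BOVINE_G1": 1.8, "BOVINE_G3a": -0.4, "BOVINE_G9": 2.2}
shared = rekey_by_ortholog(bovine_log2fc, ORTHOLOG_PAIRS)
print(shared)  # {'HUMAN_G1': 1.8}
```

Dropping one-to-many and many-to-many relationships is a deliberately conservative choice for the sketch; whether to keep them depends on the question being asked.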

10.2 years ago by pld 5.1k

Hi NicoBxl,

Did you find a solution for this? I'm also going to do a similar type of analysis.

9.8 years ago by ifudontmind_plzz ▴ 200

Ram · Answer 1 · 2015-04-21

This is what I would do for RNA-seq:

1. Map the reads to both genomes (human and bovine).
2. Keep for further analysis only the reads that mapped to both genomes (this removes the bias from insertions and deletions between the genomes and, moreover, restricts your reads to orthologous regions).
3. Count the reads over gene features (featureCounts) and remove the genes that have low counts in all samples (you will lose a lot of them).
4. Assign each gene the mean of the counts over its transcripts, and transform to log scale.
5. You now have genes as row names and samples as column names; merge the data from both species into one data frame.
6. Normalize by quantiles or with surrogate variables.
7. Calculate the relative expression of each gene across the samples (express each sample's value relative to the gene's row mean).
8. Calculate Spearman's correlation between the samples and see which of them form clusters.
9. If they cluster as expected, go on to the differentially expressed genes.
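The last few steps (log transform, relative expression, Spearman's correlation between samples) can be sketched in plain Python as below. This is only an illustration under assumptions: the merged table, its gene IDs and the sample names are invented, a pseudocount of 1 is used before taking logs, and the quantile/surrogate normalization step is left out.

```python
import math


def log_relative_expression(table):
    """log2(count + 1), then subtract each gene's mean across samples."""
    out = {}
    for gene, counts in table.items():
        logged = [math.log2(c + 1) for c in counts]
        mean = sum(logged) / len(logged)
        out[gene] = [x - mean for x in logged]
    return out


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical merged table: rows are shared genes, columns are samples.
samples = ["human_A", "human_B", "bovine_A", "bovine_B"]
counts = {
    "G1": [100, 10, 120, 8],
    "G2": [5, 50, 4, 60],
    "G3": [30, 300, 25, 310],
    "G4": [200, 20, 150, 25],
}

relative = log_relative_expression(counts)
# One vector of relative expression values per sample (column of the table).
columns = {s: [relative[g][i] for g in counts] for i, s in enumerate(samples)}

for a in samples:
    print(a, [round(spearman(columns[a], columns[b]), 2) for b in samples])
```

In this toy table the two cell type A samples correlate positively with each other across species and negatively with the cell type B samples, which is the kind of clustering the final check looks for.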

hth

Ram · Answer 2 · 2015-04-21

I am not an expert on differential expression, but DESeq2 makes the assumption that the expression of the majority of genes stays comparable across samples. This won't always be true in your case...

For the second proposition, well, I never considered you could align transcripts to the genome of a different species. I don't know if it is possible at all, and would appreciate an answer.

For comparing different cell types, you can try a relative quantification approach: pick a few genes with similar expression across all your species and cell types and use them to normalize the others. I am thinking of RT-qPCR here, which is quite precise but only works on one or two dozen target genes per run at most.
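A minimal sketch of what normalizing to reference genes could look like, assuming expression values on a linear, strictly positive scale and using the geometric mean of the reference genes as the per-sample scaling factor (one common choice, not the only one). The gene names and numbers are invented.

```python
import math


def normalize_to_reference(expression, reference_genes):
    """Divide every value by the geometric mean of the reference genes
    measured in the same sample. Values must be strictly positive."""
    n_samples = len(next(iter(expression.values())))
    factors = []
    for s in range(n_samples):
        logs = [math.log(expression[g][s]) for g in reference_genes]
        factors.append(math.exp(sum(logs) / len(logs)))
    return {
        gene: [value / factor for value, factor in zip(values, factors)]
        for gene, values in expression.items()
    }


# Hypothetical values for two samples (columns).
expression = {
    "REF1": [100.0, 200.0],
    "REF2": [400.0, 800.0],
    "TARGET": [50.0, 300.0],
}

normalized = normalize_to_reference(expression, ["REF1", "REF2"])
print([round(v, 3) for v in normalized["TARGET"]])  # [0.25, 0.75]
```

The approach is only as good as the assumption that the chosen reference genes really are stable across all species and cell types in the study.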

You may also consider restricting yourself to a few GO terms before analyzing the expression levels, if you have some expectations.

score 0 · Answer 3 · 2015-04-21

If possible, do a comparison of the cell types within each species, and then do a meta-analysis of those DE lists across species. That way you're not directly comparing samples from different species.
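One simple way to combine per-species DE lists is sketched below, under assumptions: each species' results are already keyed by a shared ortholog ID, each entry is a `(log2_fold_change, adjusted_p_value)` pair, and the 0.05 cutoff is an arbitrary example value. All IDs and numbers are invented. It keeps genes that are significant in every species and change in the same direction.

```python
def common_degs(results_by_species, padj_cutoff=0.05):
    """Genes significant in every species with the same sign of log2 fold change.

    results_by_species: {species: {ortholog_id: (log2fc, padj)}}
    Returns {ortholog_id: "up" or "down"}.
    """
    species = list(results_by_species)
    shared = set.intersection(*(set(results_by_species[s]) for s in species))
    common = {}
    for gene in sorted(shared):
        rows = [results_by_species[s][gene] for s in species]
        significant = all(padj <= padj_cutoff for _, padj in rows)
        all_up = all(lfc > 0 for lfc, _ in rows)
        all_down = all(lfc < 0 for lfc, _ in rows)
        if significant and (all_up or all_down):
            common[gene] = "up" if all_up else "down"
    return common


# Hypothetical within-species results for the same cell type comparison.
results = {
    "human": {
        "OG1": (2.1, 0.001),
        "OG2": (-1.5, 0.01),
        "OG3": (1.2, 0.2),
        "OG4": (-1.0, 0.03),
    },
    "bovine": {
        "OG1": (1.7, 0.004),
        "OG2": (1.1, 0.02),    # significant, but opposite direction
        "OG3": (1.4, 0.01),    # not significant in human
        "OG4": (-0.8, 0.04),
        "OG5": (3.0, 0.0001),  # no human counterpart in the table
    },
}

print(common_degs(results))  # {'OG1': 'up', 'OG4': 'down'}
```

Because each DE test is run within a species, differences in annotation or mappability between genomes do not enter the comparison directly.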

Alternatively, you can try to get a sense of at least which genes are expressed at all by using something like the "UPC" functionality of SCAN.UPC (Bioconductor package), which operates on each sample individually and provides a value on a 0-1 scale indicating confidence that a given transcript is expressed. You can threshold those values to get something like a present/absent call. I've also used the values to get a sense of which genes with known homology in human and mouse are expressed in certain cell types/tissues.
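Thresholding such per-sample values into present/absent calls could look like the sketch below. It does not call SCAN.UPC; it assumes the 0-1 values have already been exported and keyed by a shared ortholog ID. The 0.5 threshold, the IDs and the values are assumptions made for the example.

```python
def present_calls(upc_values, threshold=0.5):
    """Return the set of genes whose 0-1 confidence value meets the threshold."""
    return {gene for gene, value in upc_values.items() if value >= threshold}


# Hypothetical per-sample values keyed by a shared ortholog ID.
human_sample = {"OG1": 0.97, "OG2": 0.12, "OG3": 0.64}
mouse_sample = {"OG1": 0.88, "OG2": 0.55, "OG3": 0.71}

expressed_in_both = present_calls(human_sample) & present_calls(mouse_sample)
print(sorted(expressed_in_both))  # ['OG1', 'OG3']
```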