Question

Genes with identical reads mapping values across all samples - Kallisto


3.0 years ago

andres.firrincieli 3.6k

Dear All,

For the first time, I tried kallisto for gene expression quantification of RNA-seq data from a bacterial strain. I noticed that 5 genes have identical estimated read counts in every sample (n = 36). The same behavior was observed for other tRNA genes:

Gene_1,49,49.4,71.2,80.6,62.4,61.8,52.6,68.2,105.2,118.6,113.2,117.6,98.8,90.8,133.2,102.6,97.2,100.2,115,139,103.2,84,82,59.6,104.8,112,63,67.6,112.8,95.6,87.6,68.2,81
Gene_2,49,49.4,71.2,80.6,62.4,61.8,52.6,68.2,105.2,118.6,113.2,117.6,98.8,90.8,133.2,102.6,97.2,100.2,115,139,103.2,84,82,59.6,104.8,112,63,67.6,112.8,95.6,87.6,68.2,81 
Gene_3,49,49.4,71.2,80.6,62.4,61.8,52.6,68.2,105.2,118.6,113.2,117.6,98.8,90.8,133.2,102.6,97.2,100.2,115,139,103.2,84,82,59.6,104.8,112,63,67.6,112.8,95.6,87.6,68.2,81   
Gene_4,49,49.4,71.2,80.6,62.4,61.8,52.6,68.2,105.2,118.6,113.2,117.6,98.8,90.8,133.2,102.6,97.2,100.2,115,139,103.2,84,82,59.6,104.8,112,63,67.6,112.8,95.6,87.6,68.2,81  
Gene_5,49,49.4,71.2,80.6,62.4,61.8,52.6,68.2,105.2,118.6,113.2,117.6,98.8,90.8,133.2,102.6,97.2,100.2,115,139,103.2,84,82,59.6,104.8,112,63,67.6,112.8,95.6,87.6,68.2,81

This is how the expression matrix was exported:

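library(tximport)

# One abundance.h5 per sample; the names become the count-matrix column labels
files <- file.path(base_dir, "kallisto", samples$sample, "abundance.h5")
names(files) <- paste0("sample", 1:36)
# txOut = TRUE keeps transcript-level estimates (no gene-level summarization)
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE)
write.table(txi.kallisto$counts, file = "countData")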

Each gene encodes a tRNA-Glu. They are found at different chromosomal locations (I have the full genome sequence), and all share the same sequence.

Since I am building a co-expression network, what should I do?

Thank you for your time!

Andrea

gene Kallisto expression • 715 views


score 2 · Accepted Answer · 2021-04-20

If the transcripts have fully identical sequences, Kallisto has no information with which to distinguish them, so in a sense the result is correct as you are presenting it. This is probably a marginal case, so it should be fine to de-duplicate the transcriptome before running Kallisto, or simply to collapse identical sequences before network construction to avoid these perfectly correlated nodes. On the other hand, this probably affects only a handful of genes, and tRNAs are maybe not the most exciting genes either (excuse me if I offended you as a tRNA researcher), so I guess it could simply be ignored. If you get a module with a lot of highly correlated tRNAs, you will know where it comes from.
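
For the "collapse before network construction" option, here is a minimal sketch in base R. It assumes you have the target sequences as a named character vector (`seqs`, hypothetical here; in practice read from the FASTA used to build the index) and the count matrix from `txi.kallisto$counts`. The toy sequences and the extra `Gene_6` row are made up for illustration. Counts within each group of identical sequences are summed, on the assumption that the reads were split among the identical copies.

```r
# Toy inputs (hypothetical): `seqs` stands in for the transcriptome FASTA,
# `counts` stands in for txi.kallisto$counts.
seqs <- c(Gene_1 = "GTCCCCTTCG", Gene_2 = "GTCCCCTTCG",
          Gene_3 = "GTCCCCTTCG", Gene_6 = "ATGACCGGTA")
counts <- rbind(
  Gene_1 = c(49, 49.4, 71.2),
  Gene_2 = c(49, 49.4, 71.2),
  Gene_3 = c(49, 49.4, 71.2),
  Gene_6 = c(10, 12, 9)
)
colnames(counts) <- paste0("sample", 1:3)

# Put the sequences in the same order as the matrix rows, then group rows
# by identical sequence, keeping groups in order of first appearance.
seqs <- seqs[rownames(counts)]
grp <- factor(seqs, levels = unique(seqs))

# Sum the counts within each group of identical sequences.
collapsed <- rowsum(counts, group = as.integer(grp))

# Label each collapsed row with the IDs of all its members.
rownames(collapsed) <- unname(
  vapply(split(rownames(counts), grp), paste, character(1), collapse = "|")
)

print(collapsed)
```

Each group of identical tRNA copies then enters the network as a single node (here `Gene_1|Gene_2|Gene_3`), so the duplicates cannot form a module of their own. If you would rather keep the per-copy scale, taking the first row of each group instead of the sum gives the same correlations.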