Question

Why am I getting different results with kallisto?

0

Entering edit mode

9 months ago

bioinfo ▴ 150

Hello,

I created a reference for my species of interest using the cDNA file from ensembl and then aligned my data to it using kallisto. Then, i was asked to add the ncRNA to the reference and a egfp sequence. I concatenated the 3 files using cat and then created the reference. I aligned my samples again, used tximport to get gene counts and now I have a few questions.

When I aligned to the cDNA only I had gene 18SrRNA-Psi with 150560 counts. After aligning to the cDNA+ncRNA that gene now has 90 counts. However, in the new reference there are more genes that start with 18SrRNA. Could some of the counts be assigned to the other 18srRNA genes and that is why the new reference has less counts? I only checked the first 10 rows of my files and noticed that. The other 9 genes have the same counts.
The txi.kallisto$abundance and txi.kallisto$counts files following the alignment with the cDNA+ncRNA have an extra line at the beginning that has only 0. This was not happening with the reference made only with cDNA. What could be causing that?

Thank you

kallisto • 861 views

ADD COMMENT • link 9 months ago by bioinfo ▴ 150

score 3 · Accepted Answer · 2023-11-09

3

Entering edit mode

9 months ago

dsull ★ 6.8k

This makes sense. This is because a read may end up mapping to multiple 18SrRNAs -- thus the counts need to be divided amongst them somehow. So each individual 18SrRNA will get a smaller count. This isn't a problem -- this enables you to study different isoforms of the 18SrRNA and how they differ in expression (if you don't care about all the different 18SrRNA isoforms, you can just sum the counts up among all of them).
There shouldn't be a line with only 0 in abundance.tsv. The first line will always be target_id length eff_length est_counts tpm. Perhaps show us the first few lines of the abundance.tsv file and the R code you're writing for tximport.

ADD COMMENT • link 9 months ago by dsull ★ 6.8k

0

Entering edit mode

Thank you for the help. The abundance files seem fine. It is the txi.kallisto file that has this issue. The txi.kallisto$counts looks like below when aligning to the cdna+ncRNA: enter image description here

The abundance file looks like this:

enter image description here

When I align just to cDNA i get the below:

enter image description here

My code is below:

## For multiple samples, each named as a folder in the kallisto directory (can be abundance.h5 or abundance.tsv file)
accessions <- list.dirs(full.names=FALSE)[-1]
kallisto.dir<-paste0(accessions)
kallisto.files<-file.path(kallisto.dir,"abundance.tsv") #can also be abundance.tsv
names(kallisto.files)<- accessions
# standard ensembl biomaRt attributes
transcript_id <- "ensembl_transcript_id"
gene_id <- "ensembl_gene_id"
gene_name <- "external_gene_name"




mart <- biomaRt::useMart("ensembl", "dmelanogaster_gene_ensembl", host = "ensembl.org")
t2g2 <- biomaRt::getBM(attributes = c(transcript_id, gene_name), mart = mart)
t2g2 <- dplyr::rename(t2g2, target_id = transcript_id, ext_gene = gene_name)
t2g2[nrow(t2g2) + 1,] <- list('EGFP', 'EGFP')


head(t2g2)
tail(t2g2)
txi.kallisto <- tximport(kallisto.files, type = "kallisto", tx2gene = t2g2[,c(1,2)], ignoreTxVersion = TRUE)
head(txi.kallisto$counts)
gene_counts = round(txi.kallisto$counts)
head(gene_counts)
getwd()
gene_tpm = txi.kallisto$abundance
write.table( gene_counts, file="est_counts_genes_kallisto.txt", sep="\t", row.names=T, quote=F)
write.table( gene_tpm, file="tpm_values_genes_kallisto.txt", sep="\t", row.names=T, quote=F)

The only difference in the code between the cDNA and cDNA+ncRNA alignment is that in the latter I added t2g2[nrow(t2g2) + 1,] <- list('EGFP', 'EGFP') to the code.

ADD REPLY • link 9 months ago by bioinfo ▴ 150

0

Entering edit mode

I think you are right and the issue is with the abundance files. I tried to use the same tximport with the EGFP addition with the abundance files from just the cDNA alignment and that does not produce the line with 0. I did txi.kallisto$length and I actually get a value of 15.666069 for all the samples but when I grep that value in the abundance files I do not find anything. I also noticed that in my t2g2 file there are some target_ids without ext_genes. All of those genes seem to be in the ncRNA fasta file.Could that be causing this?

ADD REPLY • link 9 months ago by bioinfo ▴ 150

0

Entering edit mode

I found the reason and I am adding it here. It seemed to be caused by the target_ids without ext_genes. When Iadded the target_ids as ext_genes for the empty ext_genes I did not get that row on top.

ADD REPLY • link 9 months ago by bioinfo ▴ 150