Question: Using Gencode transcript annotation in Kallisto / Sleuth pipeline
gravatar for Jordan
7 months ago by
Jordan0 wrote:

Hey there,

I am using the Kallisto / Sleuth pipeline for differential expression of a two condition (WT and KO) RNA-seq dataset (mouse) and I am having a problem with performing the gene-level differential expression in Sleuth. Part of the problem is that I am unsure how to integrate the gene_ids and transcript_ids into the Sleuth Object (I can get transcript-level differential expression to work, complete code for that not shown). Any help would be fantastic!


1) For pseudo-alignment using Kallisto (kallisto/0.45.1) I generated a transcriptome.idx file using Gencode gencode.vM23.transcripts.fa (Nucleotide sequences of all transcripts on the reference chromosomes). I used Gencode rather than ENSEMBL because the ENSEMBL cDNA FASTA file didn't contain noncoding RNAs.

2) Pseudo-alignment (Kallisto) for all of the FASTQ files works well (High TPMs were found for the target gene in WT abundance.tsv files and zero TPMs were found for the target gene in the KO abundance.tsv files)


I am having an issue associating transcripts to genes in order to do gene-level quantification in Sleuth (v0.30.0). I think there are a few factors at play:

3) Because I am using the Gencode transcriptome index, the output of my Kallisto run in the abudnance.tsv files contain a column that has gene_id, transcript_id, gene_symbol, biotype, etc all separated by a pipe (example shown below).

target_id ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|

Whereas the Pachter lab tutorial shows the Kallisto output with the first column (header=target_ID) having only ENSEMBL transcript IDs. (

4) When I build the Sleuth Object with the Ensembl Biomart IDs I get an error:

#load libraries

#set base directory
base_dir <- "path_to_base_directory"

#get sample ids which is just the names of kallisto run folders
sample_ids <- dir(file.path(base_dir, "path_to_folders"))

#get directories where each kallisto run is located 
kal_dir <- file.path(base_dir, "path_to_folders", sample_ids)

metadata <- read.table(file.path(base_dir,
     "meta_data/meta_data_hsc.txt"), header = TRUE, stringsAsFactors = FALSE)

#add file paths to the metadata table
metadata <- mutate(metadata, path = kal_dirs)

#check merge of metadata

##associating transcripts to genes
#to perform gene-level analysis sleuth must parse a gene annotation. use using biomaRt and Ensembl
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
     dataset = "mmusculus_gene_ensembl",
     host = "")
ttg <- biomaRt::getBM(
     attributes = c("ensembl_transcript_id", "transcript_version",
     "ensembl_gene_id", "external_gene_name", "description",
      mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
     ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))

##building a sleuth object for gene-level analysis
so <- sleuth_prep(metadata, target_mapping = ttg,
     aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)

Notes from Pachter lab: The sleuth object contains specification of the experimental design, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters. In the example that follows, metadata is the experimental design and target_mapping describes the transcript groupings into genes previously constructed. Furthermore, we provide an aggregation_column, the column name of in ‘target_mapping’ table that is used to aggregate the transcripts. When both ‘target_mapping’ and ‘aggregation_column’ are provided, sleuth will automatically run in gene mode, returning gene differential expression results that came from the aggregation of transcript p-values.

5) Error I get when building the Sleuth Object:

Error in check_target_mapping(tmp_names, target_mapping, !is.null(aggregation_column)) : couldn't solve nonzero intersection
3. stop("couldn't solve nonzero intersection")
2. check_target_mapping(tmp_names, target_mapping, !is.null(aggregation_column))
1. sleuth_prep(metadata, target_mapping = ttg, aggregation_column = "ens_gene", extra_bootstrap_summary = TRUE)
rna-seq gencode kallisto sleuth • 490 views
ADD COMMENTlink modified 6 months ago by brown.annaleigh10 • written 7 months ago by Jordan0
gravatar for ATpoint
7 months ago by
ATpoint38k wrote:

I see two options:

1) -- the more general one: Reformat the initial fasta file to drop every name after the first pipe so that you only have left:

(The Sequence)
(The Sequence)


2) -- the tedious one, manually "parse away" everything than the transcript name in your abundance files. Maybe this can be done directly in R, but I am not a kallisto user so I do not know how exactly these files look. I personally use salmon and they even have a --gencode flag for index generation to do what I described in 1) to avoid exactly this issue (not for Sleuth though, but in general, avoiding super-long names and reducing to only the txname).

ADD COMMENTlink written 7 months ago by ATpoint38k
gravatar for brown.annaleigh
6 months ago by
brown.annaleigh10 wrote:

Was googling around for the same question, found an answer on the Kallisto/Sleuth google users group. Their solution was to remove the extra information from the hd5 files!searchin/kallisto-and-applications/gencode%7Csort:date/kallisto-and-applications/KQ8782UD35E/hbqqMOgGBwAJ


files <- list.files(<insert working="" directory="">, pattern=".h5", recursive=TRUE, full.names=TRUE)

for (currentFile in files) { oldids <- h5read(currentFile, "/aux/ids") newids <- gsub("\|.*", "", oldids) h5write(newids, currentFile, "/aux/ids")


ADD COMMENTlink modified 6 months ago • written 6 months ago by brown.annaleigh10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 825 users visited in the last hour