Analyses

Question

Aberrant splicing in bulk RNAseq

4

Entering edit mode

14 months ago

txema.heredia ▴ 110

Hi,

I need to analyze a project about aberrant splicing in mRNA-seq (3 cases vs 3 controls). We are interested in detecting aberrant transcription events like intron retention, exon skipping, and alternative 5'/3' splice sites.

I've been recently learning about pseudoalignment tools for transcript quantification like kallisto and salmon. However, I couldn't find anywhere information about their usefulness for non-annotated isoforms and aberrant splicing events.

I've found a couple of tools able to analyze aberrant splicing: rMATS (https://rnaseq-mats.sourceforge.net/) and FRASER (https://www.nature.com/articles/s41467-020-20573-7).

Analyses

What we would like to analyze is:

Differentially Expressed Genes
Differentially Expressed Transcripts
Differentially Expressed Exons
Aberrant splicing

Tools:

I know how to analyze DE Genes with STAR+htseq-count+DESeq2, and I am learning how to do it with kallisto+DESeq2 and kallisto+swish.

I'm learning how to do DE Transcripts with kallisto+swish.

I know hot to analyze DE Exons with STAR + htseq-count + DEXSeq. Can DEXSeq do exon analysis using kallisto's input?

Aberrant splicing: from what I read from the manuals, rMATS uses either fastq or aligned bams (STAR), while FRASER uses bams as input.

UMIs

The sequencing that is currently being generated includes UMIs for the bulk RNA-seq. I know how to process them for alignments. However, all I find about UMIs in kallisto only mentions single-cell RNA-seq, and only on single-paired reads.

Robustness

As you see, we have quite a few research questions and analyses. For the sake of robustness, I would like to use the same input files and "compatible"/"equivalent" software whenever it is possible, instead of starting from scratch on each step.

Questions

We are in a hurry because of a competing paper and I am trying to prepare the analyses ahead of time to anticipate pitfalls and minimize delays before we get the sequencing data. I have a few doubts and questions that I would appreciate if someone could address:

What would you recommend as tools to use for each of the analyses?
Do you have a better recommendation for the aberrant splicing analysis?
Is kallisto much better than STAR+htseq-count for gene-level quantification? (I suspect yes)
Can kallisto work with UMIs and paired-end reads (R1, R2, R3.fastq.gz) ? (I don't see such thing in the manual)
Will the DEG results be significantly different when run with/without UMIs?
What would be the "less bad" way to compare DE-genes with DE-transcripts? STAR (with UMIs)+htseq+DESeq2 (genes) vs kallisto(no UMIs)+swish (transcripts) or kallisto(no UMIs)+swish for both genes&transcripts ?
Can Kallisto run exon-level analysis? If not, what is "less bad", DESeq2 (genes, with UMIs) vs swisch (tx, no UMIs) vs DEXSeq (exon, with UMIs), or swish (genes+tx, noUMIs) vs DEXSeq (exon, with UMIs) ?
If I cannot use UMIs on each analysis, should I just ignore them and not use them?

Do you have any extra suggestion on how to improve and speed up this analysis?

Thanks in advance,

Txema

aberrant-splicing transcript isoform exon • 3.0k views

ADD COMMENT • link updated 11 months ago by sbt_gvs • 0 • written 14 months ago by txema.heredia ▴ 110

1

Entering edit mode

I'll be honest: Bulk RNAseq with UMIs is very uncommon and has not been benchmarked (really by any tool). There are many considerations for processing such data.

It is definitely very possible to run kallisto with UMIs on bulk data (it's not documented) but it'll require some work and a few options to consider. Happy to talk more about it. That said, I'm not sure how concordant your results will be between UMI-based quantification and non-UMI-based quantification. I tried this with smart-seq3 (which has UMI-based reads as well as non-UMI reads) and observed poor concordance (though a lot of downstream results were pretty similar, albeit less so than I wanted). It might actually be better to NOT use the UMIs because paired-end bulk RNAseq based quantification (without UMIs) is, albeit not perfect, well-established in the field. Your data actually is really interesting (paired-end bulk with UMIs) and I'd really love to take a closer look myself once you publish it.

You can also check whether UMIs really make a difference (if a ton of reads in your dataset have the same UMI sequence, then there's some amplification bias). It's (like with all of RNAseq) very dataset/library dependent.

Apologies for this sort of nonanswer but the tl;dr is that your analysis is more complicated than one might think. If anything, I'd try out a couple of different tools and a few different options for each tool and check the agreement, and also using different tools for different purposes is not a bad idea either.

ADD REPLY • link 14 months ago by dsull ★ 5.8k

0

Entering edit mode

UMIs in bulk RNA-seq are used differently than in single-cell RNA-seq. (I only learned this last week after asking here in a different post).

In bulk, UMIs are added after fragmentation but before amplification. Thus, each fragment of the same original molecule will carry a different UMI barcode. However, they work wonders to remove PCR duplicates. I have processed data from a few projects with them.

These are the sum of counts in all genes from htseq-count:

before UMI deduplication

after UMI deduplication

The blue-green-gray-black samples are (poly-A enriched) mRNA-seq with target sequencing ~20M. The red-orange-pink are total RNA-seq (rRNA depleted) target ~60M reads.

I was mistaken in my original post. For this project (aberrant splicing) they used a different sequencing facility without UMIs.

However, my point/question stands for the many other projects I am currently working with.

ADD REPLY • link 14 months ago by txema.heredia ▴ 110

1

Entering edit mode

Super interesting -- I've seen that sort of library prep for a few single-cell technologies as well. Makes sense because full-length-spanning reads cannot be 5'/3' end-tagged.

I suppose we should call these unique fragment identifiers (UFIs) instead of UMIs. Nonetheless, I think the UMI counting procedure used by kallisto and other tools should work just fine, but keeping in mind that longer genes will get more counts because they generate more fragments (at least amplification bias will be removed though).

Keep me posted on this, but kallisto can definitely handle this type of data (bustools can do the UMI collapsing and kallisto can leverage the final UMI count matrix to do its traditional quantification with effective length normalization) and this data would be super interesting to look at.

Feel free to reach out to me for this type of data -- I am one of the main contributors to the kallisto project (twitter handle in bio) and we can certainly figure out an optimal workflow to easily handle this type of data. :)

ADD REPLY • link 14 months ago by dsull ★ 5.8k

0

Entering edit mode

Hi,

I'm finally dealing with this type of data: UMI (or Unique Fragment Identifier) bulk RNA. I am using kallisto-bus, but I am having trouble to make it run.

My input samples come split into 1-5 lanes/libraries/fastq files. Each of these blobs comes in 3 fastq.gz files: R1 (first read of pair), R2(UMI), R3 (second read of pair). I used zcat + umitools to collapse all the libraries of my samples into a single "fastq" per sample, and using UMItools to add the UMI to the read header. This way I end up having only a pair of R1 / R2 fastq.gz files per sample.

I tried running kb with:

kb count -i ${refdir}/transcriptome.idx \
     -g ${refdir}/transcripts_to_genes.txt \
     -o ${outdir} \
     -t 8 -m 32G -x BULK --h5ad --report --parity paired  \
     ${indir}/${sample}_R1.merged.withUMIs.fastq.gz \
     ${indir}/${sample}_R2.merged.withUMIs.fastq.gz

However, it gives this error message:

[2023-03-29 11:21:21,579] WARNING [main] FASTQs were provided for technology `BULK`. Assuming multiplexed samples. For demultiplexed samples, provide a batch textfile.
[...]
[2023-03-29 11:21:24,079]    INFO [count]        ....../merged_fastq_withumi/001_R1.merged.withUMIs.fastq.gz
[2023-03-29 11:21:24,079]    INFO [count]        ....../merged_fastq_withumi/001_R2.merged.withUMIs.fastq.gz
[2023-03-29 11:21:25,182]   ERROR [count] 
[bus] Note: Strand option was not specified; setting it to --unstranded for specified technology
Error: Number of files (2) does not match number of input files required by technology BULK (4)

I cannot find anywhere in the documentation of kallisto what are those 4 files supposed to be. Should I be using 2 separate files for the reads and the UMIs? In which order should they be?

Thanks in advance

ADD REPLY • link 13 months ago by txema.heredia ▴ 110

1

Entering edit mode

First thing: kallisto cannot interface with UMI-tools (kallisto doesn't use read headers and therefore you're stuck with "the kallisto way" of handling UMIs). I'm a bit busy with other developments to kallisto right now, but I might consider adding UMI-tools interfacing with kallisto, especially if it's a desired feature. For now, if you really want to use UMI-tools, you should take those read header sequences and put them back into a separate FASTQ file. If you have 3 FASTQ files, that's fine -- kallisto can handle that.

OK, now comes the tricky part if you want to give it a try (I just tried it and it seems to work on a minimal example; I'll document it once I streamline the workflow a bit):

Install kb_python (the "D" version)

pip install git+https://github.com/pachterlab/kb_python@D

Install kallisto (the "D" version) from source:

git clone https://github.com/pachterlab/kallisto-D
cd kallisto-D && mkdir build && cd build && cmake .. && make && cd ../../

Recreate the index using the "D" version of kb-python and of kallisto (you need to supply genome.fasta and genome.gtf):

kb ref --kallisto=kallisto-D/build/src/kallisto -i index.idx -g t2g.txt -f1 cdna.fa genome.fasta genome.gtf

Now, run the quantification (where R1 = first read in pair, R2 = second read in pair, R3 = 8-bp UMI)

kb count --tcc --verbose --overwrite --kallisto=kallisto-D/build/src/kallisto -x " -1,0,0:2,0,8:0,0,0,1,0,0" --tcc --parity=paired -i index.idx -o output_dir/ -g ../../indices/Mus_musculus_aj.t2g R1.fq R2.fq R3.fq

Your results will be in output_dir/counts_unfiltered/. Now let's use the EM algorithm to get estimated counts:

kallisto-D/build/src/kallisto quant-tcc -o final_output_dir/ -g t2g.txt -i output_dir/index.saved -f output_dir/flens.txt -e output_dir/counts_unfiltered/cells_x_tcc.ec.txt output_dir/counts_unfiltered/cells_x_tcc.mtx

Your output of interest will be the file: final_output_dir/matrix.abundance.tpm.mtx (this contains your estimated TPMs as determined by the probablistic EM algorithm). The second column (starting from the third line) of the mtx file contains your transcript numbers with their associated counts (shown in the third column). The transcript numbers correspond to the line numbers (one-indexed) in final_output_dir/transcripts.txt.

You can also get gene-level abundances the same way (look at final_output_dir/matrix.abundance.gene.tpm.mtx) and then look at the gene line numbers in final_output_dir/genes.txt.

Yikes! I will try to streamline this workflow soon! Still a work in progress but I think the output will be consistent -- just need to make it user-friendly. Happy to explain any of the commands that were run or any of the files that were generated. If you notice anything funky, message me.

If it ends up working and you want to describe the methods in the manuscript, just write out the code above and say: "we used kb_python commit number: 73c62d77e894a39f2fd99370878ce895984186b8, and kallisto (version 0.49.0) commit 51a22e92e9c81203c53a596a8e3f95ae85e77865 and bustools (version 0.42.0)" so others can reproduce the results.

ADD REPLY • link 13 months ago by dsull ★ 5.8k

0

Entering edit mode

Thanks!

I finally went around running all this. It worked smoothly.

Could you guide me how to build "traditional" abundance.tsv files parting from "matrix.abundance.tpm.mtx" files so I can load the data with tximport? Is there a built-in method or should I manually build them from the output files?

Thanks!

Edit: is TPM even a valid metric after UMI deduplication? Why should I normalize by total number of reads when the PCR-amplification-bias "should" be gone? Is this just to compare between samples when merging them and then I should skip the normalization steps in DESeq2? Also, is the non-tpm file length-corrected? Why don't they give integer counts?

ADD REPLY • link 12 months ago by txema.heredia ▴ 110

1

Entering edit mode

Glad to hear!

Hmm, you need to manually build abundance.tsv files -- I don't have code written for that yet. Alternatively, you can just load things into a data frame and use tximport (type="none") by manually specifying the columns and data -- that's probably the easiest route to take. You have the estimated counts matrix, you have the TPM matrix, you have the transcript+gene names, and you have the transcript lengths (by looking at the cdna.fa file and counting the number of bp's for each transcript).

ADD REPLY • link 12 months ago by dsull ★ 5.8k

0

Entering edit mode

So... the "matrix.abundance.mtx" file is equivalent to the "est_counts" column in the abundance.tsv file, no?

In the very end, I am interested in comparing the gene counts obtained with this method vs using a log1p transformation on the (DESeq2) normalized counts (which originated from htseq-count).

Can kallisto's estimated counts be (more or less) directly compared with that other measurement? Or are both kallisto's estimated counts and TPM corrected by transcript lenght? (or effective tx length).

Thanks!

ADD REPLY • link 12 months ago by txema.heredia ▴ 110

1

Entering edit mode

Correct, the matrix.abundance.mtx is the estimated counts in abundance.tsv.

kallisto's TPM is corrected by effective transcript length.

kallisto's estimated counts are not length-corrected (but length correction doesn't matter for DESeq2).

tximport is still the best method for gene-level analysis so if you can feed the counts into tximport, that would be best (at least for gene-level analysis).

I can try introducing a new commit into kallisto that recapitulates the abundance.tsv structure from the mtx files sometime this week or next week to make it more usable with tools like tximport/swish/sleuth/etc..

ADD REPLY • link 12 months ago by dsull ★ 5.8k

1

Entering edit mode

Hi! In the latest commit of kallisto-D, I added an options to quant-tcc:

Add --matrix-to-files if you want everything outputted into the abundance TSV format expected by other programs.

Note: I haven't tested this yet (taking several days off right now) -- so if you give this a try and if you encounter any bugs, let me know and I'll fix it as soon as I get back.

ADD REPLY • link 11 months ago by dsull ★ 5.8k

0

Entering edit mode

Thanks! I went back to this project last Friday and, by pure coincidence, I saw you added it on a commit just 10h before I checked that.

It works wonders, thanks!

ADD REPLY • link 11 months ago by txema.heredia ▴ 110

0

Entering edit mode

Hi, sorry to bother you again.

I see now that this last version is not generating bootstrapping files when using "-b 30", neither in .h5 nor in plaintext. Even when the output messages say that it will do it:

[tcc] Matrix dimensions: 1 x 307,408
[tcc] Bootstrapping will be performed and outputted as HDF5
[quant] Running EM algorithm...[quant] Processing sample/cell 0

Is this a bug or is this feature not implemented yet?

ADD REPLY • link 11 months ago by txema.heredia ▴ 110

0

Entering edit mode

Hello again!

In order for it to be outputted as plaintext, supply the --plaintext option when running quant-tcc.

In order for it to be outputted as HDF5, you need to recompile kallisto with HDF5 support: -DUSE_HDF5=ON

i.e.

git clone https://github.com/pachterlab/kallisto-D
cd kallisto-D && mkdir build && cd build && cmake .. -DUSE_HDF5=ON && make && cd ../../

With the --plaintext option, you get abundance TSV files and a bunch of bs_abundance TSV files (for the bootstrap).

Without the --plaintext option, you still get an abundance TSV file but the HDF5 file will only be generated if kallisto was compiled with -DUSE_HDF5=ON. Note that all bootstraps are incorporated into a single HDF5 file.

Nonetheless, I've updated the Github just now to print out more clear messages.

ADD REPLY • link 11 months ago by dsull ★ 5.8k

0

Entering edit mode

Hi, thanks for your swift response.

Unfortunately I am still having trouble with this. I downloaded and compiled the latest version with HDF5 support. However, I am getting the exact same output file structure whenever I try to use --plaintext or not:

Command:

## run for HDF5
quantdir2=${od}/kallisto_quant_new.new_hdf5Compile_b30
mkdir -p ${quantdir2}
${kallistodir}/kallisto-D-hdf5.new/build/src/kallisto quant-tcc -b 30 -o ${quantdir2} -g ${refdir}/transcripts_to_genes.txt -i ${countdir}/index.saved -f ${countdir}/flens.txt -e ${countdir}/counts_unfiltered/cells_x_tcc.ec.txt ${countdir}/counts_unfiltered/cells_x_tcc.mtx

## run for plaintext
quantdir2=${od}/kallisto_quant_new.new_hdf5Compile_plaintext_b30
mkdir -p ${quantdir2}
${kallistodir}/kallisto-D-hdf5.new/build/src/kallisto quant-tcc --plaintext -b 30 -o ${quantdir2} -g ${refdir}/transcripts_to_genes.txt -i ${countdir}/index.saved -f ${countdir}/flens.txt -e ${countdir}/counts_unfiltered/cells_x_tcc.ec.txt ${countdir}/counts_unfiltered/cells_x_tcc.mtx

Output messages:

## run for HDF5
[index] k-mer length: 31
[index] number of targets: 251,121
[index] number of k-mers: 0
[index] number of distinguishing flanking k-mers: 984,861
[index] number of equivalence classes loaded from file: 307,408
[index] not using the distinguishing flanking k-mers
[tcc] Parsing transcript-compatibility counts (TCC) file as a matrix file
[tcc] Matrix dimensions: 1 x 307,408
[tcc] Bootstrapping will be performed and outputted as HDF5
[quant] Running EM algorithm...[quant] Processing sample/cell 0
 done

## run for plaintext
[index] k-mer length: 31
[index] number of targets: 251,121
[index] number of k-mers: 0
[index] number of distinguishing flanking k-mers: 984,861
[index] number of equivalence classes loaded from file: 307,408
[index] not using the distinguishing flanking k-mers
[tcc] Parsing transcript-compatibility counts (TCC) file as a matrix file
[tcc] Matrix dimensions: 1 x 307,408
[tcc] Bootstrapping will be performed and outputted as plaintext
[quant] Running EM algorithm...[quant] Processing sample/cell 0
 done

Output directory:

## run for HDF5
txema@mm:~/kallisto_umi$ ll merged_fastq/001/kallisto_quant_new.new_hdf5Compile_b30/
total 15248
drwxrwxr-x  2 txema txema    4096 mei 11 12:19 ./
drwxrwxr-x 13 txema txema    4096 mei 11 12:19 ../
-rw-rw-r--  1 txema txema 1127835 mei 11 12:19 genes.txt
-rw-rw-r--  1 txema txema  236243 mei 11 12:19 matrix.abundance.gene.mtx
-rw-rw-r--  1 txema txema  308111 mei 11 12:19 matrix.abundance.gene.tpm.mtx
-rw-rw-r--  1 txema txema 1207208 mei 11 12:19 matrix.abundance.mtx
-rw-rw-r--  1 txema txema 1281123 mei 11 12:19 matrix.abundance.tpm.mtx
-rw-rw-r--  1 txema txema 1253012 mei 11 12:19 matrix.efflens.mtx
-rw-rw-r--  1 txema txema      18 mei 11 12:19 matrix.fld.tsv
-rw-rw-r--  1 txema txema 5651896 mei 11 12:19 transcript_lengths.txt
-rw-rw-r--  1 txema txema 4525225 mei 11 12:19 transcripts.txt

## run for plaintext
txema@mm:~/kallisto_umi$ ll merged_fastq/001/kallisto_quant_new.new_hdf5Compile_plaintext_b30/
total 15248
drwxrwxr-x  2 txema txema    4096 mei 11 12:20 ./
drwxrwxr-x 13 txema txema    4096 mei 11 12:19 ../
-rw-rw-r--  1 txema txema 1127835 mei 11 12:20 genes.txt
-rw-rw-r--  1 txema txema  236243 mei 11 12:20 matrix.abundance.gene.mtx
-rw-rw-r--  1 txema txema  308111 mei 11 12:20 matrix.abundance.gene.tpm.mtx
-rw-rw-r--  1 txema txema 1207208 mei 11 12:20 matrix.abundance.mtx
-rw-rw-r--  1 txema txema 1281123 mei 11 12:20 matrix.abundance.tpm.mtx
-rw-rw-r--  1 txema txema 1253012 mei 11 12:20 matrix.efflens.mtx
-rw-rw-r--  1 txema txema      18 mei 11 12:20 matrix.fld.tsv
-rw-rw-r--  1 txema txema 5651896 mei 11 12:20 transcript_lengths.txt
-rw-rw-r--  1 txema txema 4525225 mei 11 12:19 transcripts.txt

As you can see, there is no .h5, nor bs_abundance files in neither of the cases.

I tried to run the same when compiling with/without -DUSE_HDF5=ON and also using the previous commit of the code. In all cases I get the exact same output file structure. No .h5, no bs_abundance

Am I running this with the wrong command? is -b 30 not enough to trigger bootstrapping?

ADD REPLY • link 11 months ago by txema.heredia ▴ 110

1

Entering edit mode

You need to run quant-tcc with the --matrix-to-files option

ADD REPLY • link 11 months ago by dsull ★ 5.8k

0

Entering edit mode

Oh gosh

I glanced over the original message where you first stated that

Thank you so much!

ADD REPLY • link 11 months ago by txema.heredia ▴ 110

0

Entering edit mode

I ran quant-tcc, Using either matrix.abundance.mtx or matrix.abundance.tpm.mtx Can we load it as a normal count matrix and create a seurat object for further filtering, scaling, normalization etc? Which matrix file would give better accuracy on the transcript expression ? I would like to evaluate gene isoforms expression across specific cell clusters. Do I still need to process first with tximport ?

ADD REPLY • link 11 months ago by sbt_gvs • 0

score 1 · Answer 1 · 2023-02-15

1

Entering edit mode

14 months ago

Trivas ★ 1.7k

I've used IsoformSwitchAnalyzeR before https://bioconductor.org/packages/release/bioc/vignettes/IsoformSwitchAnalyzeR/inst/doc/IsoformSwitchAnalyzeR.html

It might be a little overkill since it basically adds extra analyses to the output of DEXseq, but I've been able to use it with any count matrix that you can import with tximeta (in my case Salmon). Very interested to see what other people recommend, since I am curious to follow up on these types of questions.

ADD COMMENT • link 14 months ago by Trivas ★ 1.7k

0

Entering edit mode

Thanks, I'll keep an eye on it and give it a try.

However, for this analysis we are studying the effect of a certain mutation in a splicing-related gene. Thus, we are not expecting "functional changes of isoform", but a total screw-up of transcription with all kind of aberrant events.

ADD REPLY • link 14 months ago by txema.heredia ▴ 110

score 1 · Answer 2 · 2023-02-15

I am by no means neutral, but I think the combo of the nf-core pipelines rnaseq and differentialabundance will carry you a long way.

The pipelines have been written to work together seamlessly and should satisfy the 75% of your goals that start with "Differentially Expressed". For splicing analysis, there is another pipeline in the making called rnasplice, but it has no stable releases yet, and I do know nothing about its development status. However, feel free to join the nf-core Slack and ask the developers directly in the respective channel. You can surely get some inspiration regarding useable tools and parameters there.

The rnaseq pipeline at least runs StringTie to detect novel transcripts and also supports the use of UMIs in the analysis, if the UMIs are embedded in the FastQ header of the reads. In case the sequencing facility delivers them in a separate FastQ file, you can use barcodex-rs to move them there.

In my experience, considering UMIs in the quantification does not really matter, if you used 100ng or more input. At 10ng, I would not trust the results any more without UMIs:

Comparison of the abundances measured with STAR+Salmon and nf-core rnaseq pipeline 3.9

PS: In case, you need to clip the reads additionally, use version 3.9 of the pipeline or specify the values as ext.args to TrimGalore in the modules.config. We just yesterday discovered a stupid bug that results in an outage of the pipeline parameters --clip_r1, --clip_r2, --three_prime_clip_r1 and --three_prime_clip_r2 in the pipeline releases 3.10 and 3.10.1. A bugfix release will be prepared soon.

score 0 · Answer 3 · 2023-02-22

You could try this - https://github.com/gagneurlab/drop

The pipeline focuses on aberrant splicing, aberrant expression and monoallelic expression. Its from the same team which developed FRASER which is integrated with this pipeline along with OUTRIDER. It might take some time to set up and run the analysis, but the outputs could provide insight. Since the input data is low (6 samples in total), you can use counts from control data like GTEx or other resources mentioned in their paper.