Transcript identification and quantification
4 months ago

Hey All!

I need to identify the transcripts in my SPLiT-seq scRNA-seq data from Parse Biosciences. I'm thinking of using a combination of HISAT2 and StringTie2 to do that. Are there any better ways to do what I want? I would love to hear any suggestions; I'm pretty new to the field and working my way through a pretty steep learning curve right now.

I really appreciate any feedback. Thanks!

splitseq transcript scRNA • 1.3k views
4 months ago
dsull ★ 7.7k

I regularly work on combinatorial barcoding data.

If you’re open to a pseudoalignment approach, kallisto (via kb-python) can process Parse data really easily. Say you’re processing human single-cell samples with v2 of Parse’s kit, and your read files are named R1.fastq.gz and R2.fastq.gz. You can do:

pip install kb_python
wget https://raw.githubusercontent.com/Yenaled/barcodes/refs/heads/main/splitseqv2_barcodes.txt
wget https://raw.githubusercontent.com/Yenaled/barcodes/refs/heads/main/splitseqv2_replace.txt
kb ref -d human -i index.idx -g t2g.txt
kb count -x SPLIT-SEQ -w splitseqv2_barcodes.txt -r splitseqv2_replace.txt -i index.idx -g t2g.txt -o output_dir R1.fastq.gz R2.fastq.gz

The matrices will be written to output_dir/counts_unfiltered_modified/ in sparse MatrixMarket (.mtx) format, which you can then load into Python via scipy.io.mmread().
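For the loading step, here is a minimal Python sketch. It assumes the matrix and its label files are named cells_x_genes.mtx, cells_x_genes.barcodes.txt and cells_x_genes.genes.txt inside that directory; check the actual file names in your output and adjust the prefix if they differ. read_labels and load_counts are illustrative helper names, not part of kb-python.

```python
from pathlib import Path


def read_labels(path):
    """Return one label per line (cell barcodes or gene IDs)."""
    return Path(path).read_text().split()


def load_counts(count_dir, prefix="cells_x_genes"):
    """Load a cells-by-genes matrix plus its row and column labels.

    The file naming (prefix + .mtx / .barcodes.txt / .genes.txt) is an
    assumption about the output layout; change `prefix` if yours differs.
    """
    # SciPy is imported here so that read_labels works without it installed.
    from scipy.io import mmread

    count_dir = Path(count_dir)
    matrix = mmread(str(count_dir / f"{prefix}.mtx")).tocsr()
    barcodes = read_labels(count_dir / f"{prefix}.barcodes.txt")
    genes = read_labels(count_dir / f"{prefix}.genes.txt")
    # Rows should be cells and columns genes; fail early if labels disagree.
    if matrix.shape != (len(barcodes), len(genes)):
        raise ValueError(
            f"matrix shape {matrix.shape} does not match "
            f"{len(barcodes)} barcodes x {len(genes)} genes"
        )
    return matrix, barcodes, genes
```

With that in place, something like matrix, barcodes, genes = load_counts("output_dir/counts_unfiltered_modified") should give you a CSR matrix whose rows follow the order of the barcodes list.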

(Note: splitseqv2_barcodes.txt and splitseqv2_replace.txt contain the Parse kit's barcodes and the correspondence between oligo-dT and randO primers, respectively.) If you need any adjustments to the command (for example, to use a different or custom species, quantify nuclear RNA, or use another version of Parse's kit), have questions about the command or output files, or run into problems running the code, let me know.

If you would rather use a STAR aligner-based approach, I can also tell you how to go about it (unfortunately, my code for that lives in a somewhat disorganized Snakemake workflow that I can't share).


Thank you for your help. We’re using v3 chemistry—do you know where I can obtain the barcode sequences? I don’t believe Parse shares them directly with customers, so it would be great if you could share them.

Also, to clarify my original question: I’m interested in identifying the actual transcript that’s translated into protein, rather than just the gene to which the cDNA aligns. Is there a way to retrieve that information?

Thanks again!


Ooh, that's a bit tricky. I worked with v3 of the kit a while back but for some reason I can only find the R2/R3 barcodes. I'll let you know if I can find the R1 barcodes.

Transcript-level analysis is also a bit tricky. Essentially, you'd have to add the --tcc option to the kb count command, which gives you a TCC (transcript compatibility counts) matrix rather than a gene count matrix. You should then manually subset that matrix to only the barcodes that have, say, >500 UMI counts, and afterwards run kallisto quant-tcc on the subsetted .mtx file. (You can view the kallisto documentation, a bioRxiv preprint, here: https://www.biorxiv.org/content/10.1101/2023.11.21.568164v2 and also see its Supplementary Information.)
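The manual subsetting step could look roughly like the sketch below. It is dependency-free Python that works directly on the MatrixMarket text file, and it assumes the matrix is in coordinate format with barcodes as rows and equivalence classes as columns, with one barcode per line in a companion text file. The function names and any file names you pass in are illustrative, not something kb-python defines.

```python
def read_size(path):
    """Return (banner, n_rows, n_cols, nnz) from a coordinate .mtx file."""
    with open(path) as fh:
        banner = fh.readline().rstrip("\n")
        for line in fh:
            if not line.startswith("%") and line.strip():
                n_rows, n_cols, nnz = map(int, line.split())
                return banner, n_rows, n_cols, nnz
    raise ValueError(f"no size line found in {path}")


def iter_entries(path):
    """Yield (row, col, value_string) for each stored entry (1-based)."""
    with open(path) as fh:
        size_seen = False
        for line in fh:
            if line.startswith("%") or not line.strip():
                continue
            if not size_seen:  # first non-comment line is "rows cols nnz"
                size_seen = True
                continue
            row, col, value = line.split()
            yield int(row), int(col), value


def filter_tcc(mtx_in, barcodes_in, mtx_out, barcodes_out, min_umis=500):
    """Keep only barcodes whose total count is greater than min_umis."""
    banner, n_rows, n_cols, _ = read_size(mtx_in)
    # Pass 1: total counts per barcode (row).
    totals = [0.0] * (n_rows + 1)
    for row, _, value in iter_entries(mtx_in):
        totals[row] += float(value)
    # Map each kept row to its new 1-based index, preserving order.
    new_row = {}
    for row in range(1, n_rows + 1):
        if totals[row] > min_umis:
            new_row[row] = len(new_row) + 1
    # Pass 2: collect the entries that belong to kept barcodes.
    kept = [(new_row[r], c, v) for r, c, v in iter_entries(mtx_in) if r in new_row]
    with open(mtx_out, "w") as out:
        out.write(f"{banner}\n{len(new_row)} {n_cols} {len(kept)}\n")
        for row, col, value in kept:
            out.write(f"{row} {col} {value}\n")
    # Write the matching barcode list in the same row order.
    with open(barcodes_in) as fh:
        barcodes = [line.rstrip("\n") for line in fh]
    with open(barcodes_out, "w") as out:
        for row in sorted(new_row):
            out.write(barcodes[row - 1] + "\n")
    return len(new_row)
```

Only rows are dropped, so the column order (and therefore the equivalence-class file that goes with the matrix) stays valid for the filtered output. The 500 threshold is just the example value from above; pick one that suits your data.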

So yeah, there are multiple difficult steps here :( If I had more time, I'd write up a tutorial.

I don't think it's very straightforward with alternative tools either, unfortunately (hopefully someone can prove me wrong on this)...


There are some barcodes provided here --> https://github.com/mortazavilab/parse_pipeline/tree/main/barcodes

Perhaps @dsull knows whether any of the files there can be used. There are multiple v3 files.


AH perfect!! Thanks -- that's exactly right!

Here's the updated command (and links) for v3:

pip install kb_python
wget https://raw.githubusercontent.com/Yenaled/barcodes/refs/heads/main/splitseqv3_barcodes.txt
wget https://raw.githubusercontent.com/Yenaled/barcodes/refs/heads/main/splitseqv3_replace.txt
kb ref -d human -i index.idx -g t2g.txt
kb count -x "1,10,18,1,30,38,1,50,58:1,0,10:0,0,0" -w splitseqv3_barcodes.txt -r splitseqv3_replace.txt -i index.idx -g t2g.txt -o output_dir R1.fastq.gz R2.fastq.gz

Haven't tested it on actual data yet, so hopefully it works.
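In case the -x string looks cryptic: it follows kallisto's custom technology format, barcode:umi:cDNA, where each part is one or more file,start,stop triplets (0-based file index and positions, with a stop of 0 meaning "to the end of the read"). Here is a small Python sketch that just unpacks the string so you can check it against your read layout; Segment and parse_technology are illustrative names, not part of kb-python or kallisto.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    file: int   # 0-based index of the FASTQ file on the command line
    start: int  # 0-based start position within the read
    stop: int   # end position (exclusive); 0 means "to the end of the read"


def parse_technology(spec):
    """Split a 'barcode:umi:cdna' string into lists of Segment triplets."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError("expected three ':'-separated parts")
    layout = {}
    for name, part in zip(("barcode", "umi", "cdna"), parts):
        numbers = [int(x) for x in part.split(",")]
        if len(numbers) % 3 != 0:
            raise ValueError(f"{name} part is not made of file,start,stop triplets")
        layout[name] = [Segment(*numbers[i:i + 3]) for i in range(0, len(numbers), 3)]
    return layout


layout = parse_technology("1,10,18,1,30,38,1,50,58:1,0,10:0,0,0")
for name, segments in layout.items():
    print(name, segments)
```

Read this way, the string says that the second file (R2.fastq.gz) carries a 10-bp UMI at positions 0-10 followed by three 8-bp barcodes at 10-18, 30-38 and 50-58, while the first file (R1.fastq.gz) is used in full as the cDNA read.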

In any case, that's one hurdle out of the way -- the transcript-level analysis is still a bit tricky though.


I will test it with my data and compare output with Parse's output! Thanks!


Thanks for the idea, I will investigate it! I built a pipeline with HISAT2 and StringTie2 to get transcripts from our data; it works, but I haven't yet checked how accurate the output is.

4 months ago
ATpoint 89k

Are there any better ways to do what ...

Parse offers an automated pipeline for that (a bit of a pain, but doable). Contact their support or browse their documentation. It is available either as a standalone command-line tool or as a web-based solution.


Thanks for the reply! I do use the Parse pipeline to get expression matrices, but what I really want is to identify the transcript that is expressed, not just the gene. I hope that makes sense.
