I regularly work on combinatorial barcoding data.
If you’re open to a pseudoalignment approach, kallisto (via kb-python) can process Parse data really easily. Say you’re processing human single-cell samples with v2 of Parse’s kit, and your read files are named R1.fastq.gz and R2.fastq.gz. You can do:
pip install kb_python
wget https://raw.githubusercontent.com/Yenaled/barcodes/refs/heads/main/splitseqv2_barcodes.txt
wget https://raw.githubusercontent.com/Yenaled/barcodes/refs/heads/main/splitseqv2_replace.txt
kb ref -d human -i index.idx -g t2g.txt
kb count -x SPLIT-SEQ -w splitseqv2_barcodes.txt -r splitseqv2_replace.txt -i index.idx -g t2g.txt -o output_dir R1.fastq.gz R2.fastq.gz
The matrices will be written to output_dir/counts_unfiltered_modified/ in sparse MatrixMarket (.mtx) format, which you can then load into Python via scipy.io.mmread().
(Note: splitseqv2_barcodes.txt and splitseqv2_replace.txt contain the Parse kit's barcodes and the correspondence between oligo-dT and randO primers, respectively.) If you need any adjustments to the command (for example, to use a different or custom species, quantify nuclear RNA, or use another version of Parse's kit), have any questions about the command or output files, or run into problems running the code, let me know.
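For loading the output, here is a minimal sketch. It assumes the matrix directory contains cells_x_genes.mtx alongside cells_x_genes.barcodes.txt and cells_x_genes.genes.txt (one label per line); those file names, and the helper name load_kb_counts, are assumptions for illustration, so check what is actually in your output directory before relying on them.

```python
from pathlib import Path

import scipy.io
import scipy.sparse


def load_kb_counts(count_dir, prefix="cells_x_genes"):
    """Load a count matrix plus its row (barcode) and column (gene) labels.

    The file naming (<prefix>.mtx, <prefix>.barcodes.txt, <prefix>.genes.txt)
    is an assumption about the output layout; adjust `prefix` if yours differs.
    """
    count_dir = Path(count_dir)
    # mmread returns a COO matrix; CSR is more convenient for slicing by cell.
    coo = scipy.io.mmread(str(count_dir / f"{prefix}.mtx"))
    matrix = scipy.sparse.csr_matrix(coo)
    barcodes = (count_dir / f"{prefix}.barcodes.txt").read_text().splitlines()
    genes = (count_dir / f"{prefix}.genes.txt").read_text().splitlines()
    # Sanity check: rows should be barcodes, columns should be genes.
    if matrix.shape != (len(barcodes), len(genes)):
        raise ValueError("matrix shape does not match barcode/gene label counts")
    return matrix, barcodes, genes
```

You would call it as load_kb_counts("output_dir/counts_unfiltered_modified"). The shape check is there because a mismatch between the matrix and its label files is an easy way to end up with silently mislabeled cells.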
If you would like to use a STAR aligner-based approach, I can also tell you how to go about it (unfortunately my code for that is in a somewhat disorganized Snakemake workflow that I can't share).
Thank you for your help. We’re using v3 chemistry—do you know where I can obtain the barcode sequences? I don’t believe Parse shares them directly with customers, so it would be great if you could share them.
Also, to clarify my original question: I’m interested in identifying the actual transcript that’s translated into protein, rather than just the gene to which the cDNA aligns. Is there a way to retrieve that information?
Thanks again!
Ooh, that's a bit tricky. I worked with v3 of the kit a while back but for some reason I can only find the R2/R3 barcodes. I'll let you know if I can find the R1 barcodes.
Transcript-level analysis is also a bit tricky. Essentially, you'd have to add the --tcc option to the kb count command, which will give you a TCC (transcript compatibility count) matrix rather than a gene count matrix. You should then manually subset that matrix to only contain the barcodes that have, say, >500 UMI counts. Afterwards, run kallisto quant-tcc on the subsetted .mtx file. (You can view the kallisto documentation here: https://www.biorxiv.org/content/10.1101/2023.11.21.568164v2 -- also see the Supplementary Information.)
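The manual subsetting step could look roughly like the sketch below. It assumes the TCC matrix has one row per barcode and that the barcode labels are already loaded in matching order; the function name filter_barcodes_by_umi and the default threshold are illustrative, not part of kb-python or kallisto.

```python
import numpy as np
import scipy.sparse


def filter_barcodes_by_umi(matrix, barcodes, min_umis=500):
    """Keep only rows (barcodes) whose total UMI count is strictly above min_umis.

    Assumes rows are barcodes and columns are equivalence classes.
    """
    matrix = scipy.sparse.csr_matrix(matrix)
    # Total UMIs per barcode = row sums of the sparse matrix.
    umis_per_barcode = np.asarray(matrix.sum(axis=1)).ravel()
    keep_idx = np.flatnonzero(umis_per_barcode > min_umis)
    kept_barcodes = [barcodes[i] for i in keep_idx]
    return matrix[keep_idx], kept_barcodes
```

The filtered matrix can then be written back out with scipy.io.mmwrite(), together with the matching barcode list, before running kallisto quant-tcc on it. Check the kallisto documentation for the exact inputs quant-tcc expects, since those are not spelled out here.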
So yeah, there's multiple difficult steps here :( If I had more time, I'd write up a tutorial.
I don't think it's very straightforward with alternative tools either unfortunately (hopefully someone can prove me wrong on this)...
There are some barcodes provided here --> https://github.com/mortazavilab/parse_pipeline/tree/main/barcodes
Perhaps @dsull knows if any of them can be used. There are multiple v3 files.
AH perfect!! Thanks -- that's exactly right!
Here's the updated command (and links) for v3:
Haven't tested it on actual data yet, so hopefully it works.
In any case, that's one hurdle out of the way -- the transcript-level analysis is still a bit tricky though.
I will test it with my data and compare output with Parse's output! Thanks!
Thanks for the idea, I will investigate it! I built a pipeline with HISAT2 and StringTie2 to get transcripts from our data. It works, but I haven't yet checked how accurate the output is.