Hi everyone -
I'm trying to make a custom reference for a 10x Genomics v3 single-nuclei RNA-Seq run. According to the instructions on 10x's website (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references) I can use the following commands:
# 1. Download the Ensembl98 release of mm10's genome (primary assembly):
wget ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
# 2. Unzip the genome
gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
# 3. Download the Ensembl98 release of mm10's annotation file (GTF):
wget ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz
# 4. Unzip the .gtf file
gunzip Mus_musculus.GRCm38.98.gtf.gz
# 5. Keep only the lines whose feature type (column 3) is "transcript" and relabel them as "exon". This has the functional effect of creating a pre-mRNA gtf file.
awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{ $3="exon"; print}' Mus_musculus.GRCm38.98.gtf > Mus_musculus.GRCm38.98.premrna.gtf
# 6. Use cellranger mkref to create a reference for downstream analysis
cellranger mkref --genome=mm10 \
--fasta=Mus_musculus.GRCm38.dna.primary_assembly.fa \
--genes=Mus_musculus.GRCm38.98.premrna.gtf \
--ref-version=3.1.0
Step 6 is where my error appears. When I run this command, the following results appear:
+++++++
Creating new reference folder at /scratch/jglab/RYC/191218_snRNASeq_spikein_SPFvGF/CellRanger_References/refdata-cellranger-mm10-3.0.0_premrna/mm10_3.0.0_premrna
...done
Writing genome FASTA file into reference folder...
...done
Computing hash of genome FASTA file...
...done
Indexing genome FASTA file...
...done
Writing genes GTF file into reference folder...
...done
Computing hash of genes GTF file...
...done
Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)...
/opt/htcf/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-5.4.0/cellranger-3.1.0-f4xtbwsorfbhh23ig7ccyjrfgipn5zwj/cellranger-cs/3.1.0/bin/../tenkit/bin/common/_master: line 76: 12996 Killed $SUBCMD "$@"
+++++++
I am currently running cellranger v 3.1.0 through our university's cluster, which stores cellranger at the directory /opt/htcf/spack/opt/spack/linux-ubuntu16.04-x86_64/gcc-5.4.0/cellranger-3.1.0-f4xtbwsorfbhh23ig7ccyjrfgipn5zwj/cellranger-cs/3.1.0/bin/. Looking at the github repository for cellranger, I can see the line that reads $SUBCMD "$@" but do not know functionally what it's doing. I checked the fasta file for the genome as well as the pre-filtered and post-filtered .gtf file and they are all of the correct format (chromosome names match, .gtf file has appropriate column names, both were downloaded from ensembl). I have attempted this pipeline with different versions of cellranger as well as different versions of the ensembl mouse genome/gtf annotations, and I get the same error each time.
Would love your thoughts on how I could move past this, or if anyone has had any similar experiences with these errors. If I can help provide additional information that would be helpful, please let me know. I really appreciate your help!
I suggest you use the pre-made indexes that 10x provides and save yourself the trouble. Is there a reason you are trying to make these yourself? Pre-made indexes for human and mouse genomes are available here. You will need to do a click-through registration.
If you scroll down, you'll see that for single nuclei applications, you need to do the tweaks the OP described to the gtf before mkref, so the OP does have to make their own index locally. (Looks like you relabel 'transcripts' as 'exons' so it won't expect introns to be omitted)
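To see what that relabeling does, here is a small sketch that runs the OP's awk filter on a toy GTF. The file names, the gene/transcript IDs, and the coordinates are all made up for illustration; only the awk line comes from the post.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Toy GTF (made-up coordinates): one gene, one transcript, two exons.
{
    printf '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
    printf '1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf '1\ttoy\texon\t700\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
} > "$workdir/toy.gtf"

# Same filter as step 5 of the original post.
awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{ $3="exon"; print}' \
    "$workdir/toy.gtf" > "$workdir/toy.premrna.gtf"

# Show the feature type and coordinates of what survived the filter.
cut -f3-5 "$workdir/toy.premrna.gtf"
```

The only line left is a single "exon" spanning 100-900, so the region between the two original exons (301-699) is now inside an annotated exon. The gene line and the original exon lines are dropped.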
My apologies for not looking at the exact link included by OP, and thanks for pointing out the single nuclei application. The instructions provided use files from their premade index bundle. I just tried the instructions out. They seem to be working for creation of the modified index on a 32G server. It is taking some time for the operation to complete. Will update when done.
Thanks for testing this out genomax! As swbarnes2 pointed out, I need to make a custom reference to only include "transcripts" converted to "exons" so that introns are retained in the .gtf file. It's reassuring to see that on 32G, creating the reference works. I'll report back as soon as I get access to more system memory.
Hey genomax!
I added additional memory and haven't errored out after ~25 minutes... Looking good so far! I can't believe the fix was (potentially) this simple. Thanks for riding this journey with me..!
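For anyone else who hits a bare "Killed" at the genes index step: a quick way to see how much RAM the node reports before launching mkref is sketched below. It assumes a Linux node with /proc/meminfo, and the helper name kb_to_gib is made up here. On a shared cluster the scheduler may give your job less than the node total, so treat the number as an upper bound.

```shell
# Hypothetical helper: convert a kB count to whole GiB (rounded down).
kb_to_gib() {
    echo $(( $1 / 1024 / 1024 ))
}

# MemTotal in /proc/meminfo is reported in kB (Linux only).
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "Node reports about $(kb_to_gib "$total_kb") GiB of RAM"
fi
```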
To speed this up further you could also add more threads/memory at the end of your mkref command, e.g. --nthreads=8 --memgb=40. By default it uses 1 thread and 16G RAM. This run took about an hour to complete.
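Put together with the OP's step 6, the full call would look roughly like the sketch below. The thread and memory values are just the ones suggested in this thread, and the dry-run wrapper is my own addition so the snippet can be pasted safely on a machine without cellranger or the input files.

```shell
# Values suggested above; adjust to what your job is allowed to use.
nthreads=8
memgb=40

# Same mkref call as step 6, with the two extra flags appended.
set -- cellranger mkref --genome=mm10 \
    --fasta=Mus_musculus.GRCm38.dna.primary_assembly.fa \
    --genes=Mus_musculus.GRCm38.98.premrna.gtf \
    --ref-version=3.1.0 \
    --nthreads="$nthreads" --memgb="$memgb"
mkref_cmd="$*"

# Run for real only if cellranger and both inputs are present; otherwise just print.
if command -v cellranger >/dev/null 2>&1 \
    && [ -f Mus_musculus.GRCm38.dna.primary_assembly.fa ] \
    && [ -f Mus_musculus.GRCm38.98.premrna.gtf ]; then
    "$@"
else
    echo "Dry run: $mkref_cmd"
fi
```

On a cluster you would presumably also need to request at least that much memory and that many cores from whatever scheduler you use, otherwise the job can still be killed.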