I am using bat (Rhinolophus sinicus) snRNA-seq brain samples located here. The associated paper is located here. The samples were prepped with the MGI DNBelab C4 scRNA Preparation Kit and were sequenced on the BGI DNBSEQ™ technology platform. I downloaded all the bat brain tissue samples with parallel-fastq-dump. Below is an example of downloading a single sample.
parallel-fastq-dump --tmpdir . --threads 8 --gzip --readids --split-files --sra-id SRR13528085
There are a total of 12 samples (6 biological samples and 2 technical replicates per biological sample). After running parallel-fastq-dump for each SRR ID, I have a total of 24 fastq files with _1 and _2 for read 1 and read 2, respectively. The barcode sequences are in read 1 and the cDNA sequences are in read 2. I have a function below which I usually use for alignment of scRNA-seq data.
import subprocess

# star, genome, bc_whitelist and out_dir are globals defined outside this function
def sc_star_align(fastq1, fastq2, prefix):
    out_prefix = out_dir + prefix + '_'
    subprocess.run([star,
                    '--runThreadN', '16',
                    '--genomeDir', genome,
                    # flag and value are passed as separate list items
                    '--soloType', 'CB_UMI_Simple',
                    '--soloCBwhitelist', bc_whitelist,
                    '--soloFeatures', 'GeneFull',
                    '--soloCBlen', '16',
                    '--soloUMIstart', '17',
                    '--soloUMIlen', '12',
                    '--soloCBmatchWLtype', '1MM_multi_Nbase_pseudocounts',
                    '--soloBarcodeReadLength', '0',
                    '--clipAdapterType', 'CellRanger4',
                    '--outFilterScoreMin', '30',
                    '--soloUMIdedup', '1MM_CR',
                    '--soloCellFilter', 'EmptyDrops_CR',
                    '--soloUMIfiltering', 'MultiGeneUMI_CR',
                    '--outFileNamePrefix', out_prefix,
                    '--readFilesCommand', 'zcat',
                    # cDNA read first, barcode read second
                    '--readFilesIn', fastq2, fastq1,
                    '--outSAMtype', 'BAM', 'SortedByCoordinate'])
I have a few questions on how to adjust this function for these samples:
- How do I find the barcode whitelist for these samples?
- How do I determine the cell barcode and UMI lengths?
- How do I determine the UMI start site?
- How would I create the STAR index for bat (Rhinolophus sinicus)?
The only information listed in the paper is the read lengths, which are 30bp for read 1 and 100bp for read 2. I even looked at the code for the paper, but they did not include their alignment commands. Any help is very much appreciated.
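On the last question, a STAR index is normally built with --runMode genomeGenerate from a genome FASTA and a matching GTF annotation. The following is only a sketch of how that call could be assembled in the same subprocess style as the function above. The file names (rhiSin.fa, rhiSin.gtf, star_index/) are hypothetical placeholders, and --sjdbOverhang 99 is chosen as read length minus 1 for the 100bp cDNA read.

```python
import subprocess

def build_star_index_cmd(genome_dir, fasta, gtf, read_length=100,
                         threads=16, star="STAR"):
    """Return the STAR genomeGenerate command as an argument list."""
    return [star,
            "--runMode", "genomeGenerate",
            "--runThreadN", str(threads),
            "--genomeDir", genome_dir,
            "--genomeFastaFiles", fasta,
            "--sjdbGTFfile", gtf,
            # read length minus 1 (100bp cDNA read -> 99)
            "--sjdbOverhang", str(read_length - 1)]

# Hypothetical paths; replace with the real assembly and annotation files.
cmd = build_star_index_cmd("star_index/", "rhiSin.fa", "rhiSin.gtf")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually build the index
```

The resulting directory is what would then be passed as genome to --genomeDir in the alignment function.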
You simply need to make a list of barcodes, one per line: https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist
So you will need to reformat the list of barcodes you found in the other thread about DNBSeq.
So I'd take the json file located here and essentially create the whitelist to be all possible combinations of the position 1-10 barcodes and position 11-20 barcodes. The UMI length is 10bp and the cell barcode length after creating the combinations is 20bp. The UMI start site is at 21bp and the index I can create with STAR. Is this correct?
The barcodes are in location 1-10 and you have the actual list there. I don't know for certain what is in position 11-20 (perhaps spacer, check in data you have). 21-30 bp are UMI. Looks like RNA read is read 2.
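To make the layout concrete, here is a rough sketch of how the barcode-related STARsolo arguments could be derived from these positions. It assumes positions 11-20 turn out to be a second cell barcode segment rather than a spacer (cb_len=20); if they are a spacer, cb_len would be 10 and the UMI start would still be 21. The helper name and the whitelist path are made up for illustration.

```python
def dnbelab_solo_args(whitelist, cb_len=20, umi_start=21, umi_len=10,
                      read1_len=30):
    """Build the barcode geometry arguments for --soloType CB_UMI_Simple."""
    # The barcode and the UMI must both fit inside read 1.
    if cb_len > umi_start - 1 or umi_start - 1 + umi_len > read1_len:
        raise ValueError("barcode/UMI layout does not fit read 1")
    return ["--soloType", "CB_UMI_Simple",
            "--soloCBwhitelist", whitelist,
            "--soloCBstart", "1",
            "--soloCBlen", str(cb_len),
            "--soloUMIstart", str(umi_start),
            "--soloUMIlen", str(umi_len)]

# Hypothetical whitelist path.
solo_args = dnbelab_solo_args("dnbelab_whitelist.txt")
print(" ".join(solo_args))
```

These values would replace the 10x-style 16/17/12 settings in the alignment function.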
read 1 is 30bp... which I think you meant to type. If you scroll approximately halfway down the json file I linked to, it has the following information.
I think this means the sequence from 11-20 is also cell barcodes. Which is why I was asking if I should just paste all possible combinations of position 1-10 barcodes with position 11-20 barcodes. Hope that makes more sense now. I'm still not sure if I am supposed to paste a list of all combinations.
Ah, I see that now. I am not sure why MGI has split the whitelist into two sections. You may need to see if you can dig up some info about that. All possible combinations sounds excessive. I looked around on GitHub but did not see an immediate explanation for the split list.
Yeah, I've been digging around the web for a while now. Can't find any information about the barcodes. The total number of combinations is around 2.3 million, which doesn't seem too excessive in comparison to the 10x barcode whitelist files.
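The "all combinations" whitelist discussed here can be built by concatenating every barcode from the first list with every barcode from the second. A minimal Python sketch, assuming the two lists have already been pulled out of the json file; the toy sequences and the output file name are invented for illustration:

```python
import os
import tempfile
from itertools import product

def combine_barcodes(first_half, second_half):
    """Yield each position 1-10 barcode joined to each position 11-20 barcode."""
    for a, b in product(first_half, second_half):
        yield a + b

def write_whitelist(first_half, second_half, path):
    """Write one combined barcode per line and return how many were written."""
    n = 0
    with open(path, "w") as fh:
        for bc in combine_barcodes(first_half, second_half):
            fh.write(bc + "\n")
            n += 1
    return n

# Toy 10bp lists standing in for the two real barcode lists.
first = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
second = ["ACGTACGTAC", "TTTTTTTTTT"]
out_path = os.path.join(tempfile.gettempdir(), "toy_whitelist.txt")
n_written = write_whitelist(first, second, out_path)
print(n_written)  # len(first) * len(second)
```

The whitelist size is simply the product of the two list lengths, which is where a figure in the millions comes from.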
Only thing would be to try them out. See if you can detect them in the data you have.
You could also simply look for unique representatives (convert the sequences to plain text, use sort and uniq, etc.) and see what is present in a real dataset like the one you have.
Yes, that is a very good idea. Thanks for the help!
Let us know when you find out. Would be a useful thing to know what the data looks like.
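As one way of doing the sort and uniq check, the barcode portion of read 1 can be tallied directly from the gzipped FASTQ. This is only a sketch: it assumes the first 20 bases are the candidate cell barcode, and the toy file it writes exists purely to make the example runnable.

```python
import gzip
import os
import tempfile
from collections import Counter

def count_barcodes(fastq_gz, cb_len=20):
    """Count how often each leading cb_len-mer occurs in a gzipped FASTQ."""
    counts = Counter()
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # the sequence is the 2nd line of each 4-line record
                counts[line.rstrip("\n")[:cb_len]] += 1
    return counts

# Toy read 1 file: 20bp barcode followed by a 10bp UMI.
reads = ["A" * 20 + "C" * 10, "A" * 20 + "G" * 10, "T" * 20 + "C" * 10]
toy_path = os.path.join(tempfile.mkdtemp(), "toy_1.fastq.gz")
with gzip.open(toy_path, "wt") as fh:
    for i, seq in enumerate(reads):
        fh.write(f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n")

counts = count_barcodes(toy_path)
print(len(counts), "distinct barcodes")
print(counts.most_common(1))
```

Unlike uniq on unsorted input, a Counter collapses duplicates wherever they occur in the file, and the per-barcode read counts show which sequences are real cells rather than one-off sequencing errors.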
Hmm, okay I'm not sure what is going on, but here is what I did:
parallel-fastq-dump --tmpdir . --threads 8 --gzip --readids --split-files --sra-id SRR13528082
zcat SRR13528082_1.fastq.gz | sed -n '2~4p'| uniq -u | cut -c-20 | uniq -u > Brain2BC.txt
I then generated the whitelist with expand.grid and compared it against the first list of barcodes, Brain2BC.txt.
There are 15,530,488 unique barcodes for the single SRR ID... which seems like a lot. There are 2,359,296 unique barcodes in the Brain2BC.txt file. Only roughly 4% of the 15,530,488 unique barcodes for the SRR ID overlap with the barcode whitelist that I generated in step 3. So I'm not sure the whitelist that I am generating is correct, even though they explicitly state the positions of the barcode sequences in the config file on GitHub.
So I used the barcodes that I created (combos of positions 1-10 and 11-20) and I obtained similar alignment stats to the paper I pulled the samples from. I'm thinking a lot of the barcodes in the reads were only off by a single base pair. STARsolo allows a single base in the read barcode to mismatch the pre-defined barcode whitelist.
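The single-mismatch idea can be checked directly by splitting observed barcodes into exact whitelist matches, barcodes one substitution away from a whitelist entry, and everything else. The sketch below is a simplified check written for this thread, not a reimplementation of STARsolo's own matching logic, and the toy barcodes are invented.

```python
def one_mismatch_neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence that differs from barcode by one substitution."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]

def classify_barcodes(observed, whitelist):
    """Return counts of (exact, one-mismatch, unmatched) observed barcodes."""
    wl = set(whitelist)
    exact = one_mm = unmatched = 0
    for bc in observed:
        if bc in wl:
            exact += 1
        elif any(nb in wl for nb in one_mismatch_neighbors(bc)):
            one_mm += 1
        else:
            unmatched += 1
    return exact, one_mm, unmatched

# Toy data: one exact hit, two single-base errors, one unrelated sequence.
toy_whitelist = ["AAAAACCCCC", "GGGGGTTTTT"]
toy_observed = ["AAAAACCCCC", "AAAAACCCCG", "GGGGGTTTTA", "ACGTACGTAC"]
result = classify_barcodes(toy_observed, toy_whitelist)
print(result)
```

Run on the distinct barcodes from a real sample, a large one-mismatch fraction would support the explanation that most non-matching barcodes are single-base errors rather than evidence of a wrong whitelist.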