Is anyone familiar with demultiplexing (i.e. whitelisting and extracting UMIs and cell barcodes) single-cell RNA-seq data generated with the QIAGEN QIAseq UPX 3' Transcriptome kit?
The only information I have regarding the format of the fastq files generated with this kit can be found in Figure 2 of the kit handbook here.
I think it could be summarised as follows:
read_1: transcript sequence
read_2: cell_index | UMI | ACG | poly-T
I tried to use:
salmon alevin with the chromiumV3 flag, but it discards more than 97% of the reads due to "noisy cellular barcodes".
extract seem to handle only droplet-based single cell RNA-Seq.
The QIAGEN GeneGlobe Data Analysis Center pipeline does not explain in detail how the demultiplexing is done either.
Does anyone know of any other tools able to deal with this kind of fastq file format?
The kit protocol here states that "The UMI is a 12-base fully random sequence", but it does not mention the length of the cell barcode.
However, as genomax mentioned, my R2 reads are indeed 27 bp long (though without the poly-T or the ACG triplet):
$ zcat my_R2.fastq.gz | head -16
@NB551406:25:HKGLJBGX7:1:11101:14400:1057 2:N:0:ATCACG
TATGGAGAACATGGCGCGTTACAAGCN
+
AAAAAEEEEEEEAAEAEEEEEEEE//#
@NB551406:25:HKGLJBGX7:1:11101:17302:1058 2:N:0:ATCACG
TATGGAGAACTGACTTGAGTGCAACAN
+
AAA<AEAEEEEEEEEE6EAEEEEA/E#
@NB551406:25:HKGLJBGX7:1:11101:14122:1059 2:N:0:ATCACG
GCTCGACACATGCGAAGGCTGGAAGAN
+
AAAAAAEEE<EEEAEEEAEEAAAA/A#
@NB551406:25:HKGLJBGX7:1:11101:5220:1059 2:N:0:ATCACG
CTATCCGCTGGCTGTGCTTCGCAAGTT
+
AAAAAEEAE/EEEA/EEEAEEAEA/A/
After filtering out reads in which at least one base has a quality score < 30, I counted the number of unique k-mers starting from the beginning of the read (the problem is that I don't know how many cell IDs were used).
k=1, 4 unique bases
k=2, 16 unique sequences
k=3, 57 unique sequences
k=4, 136 unique sequences
k=5, 197 unique sequences
k=6, 257 unique sequences
k=7, 321 unique sequences
k=8, 372 unique sequences
k=9, 426 unique sequences
k=10, 649 unique sequences
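For anyone who wants to run the same kind of check, here is a minimal Python sketch of the prefix counting described above. It is not the exact code behind the numbers in this post: the function names are made up for illustration, and it assumes Phred+33 quality encoding and a standard 4-line-per-record fastq.

```python
import gzip
from itertools import islice


def read_fastq(handle):
    """Yield (sequence, quality) pairs from a 4-line-per-record FASTQ handle."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record[1].strip(), record[3].strip()


def passes_quality(quality, min_q=30, offset=33):
    """True if every base has a Phred score >= min_q (Phred+33 is assumed)."""
    return all(ord(ch) - offset >= min_q for ch in quality)


def count_unique_prefixes(records, max_k=10, min_q=30):
    """Count distinct read prefixes of length 1..max_k among reads passing the filter."""
    prefixes = {k: set() for k in range(1, max_k + 1)}
    for seq, qual in records:
        if not passes_quality(qual, min_q):
            continue  # drop the read if any base is below min_q
        for k in range(1, min(max_k, len(seq)) + 1):
            prefixes[k].add(seq[:k])
    return {k: len(seen) for k, seen in prefixes.items()}


def report(path, max_k=10):
    """Print the prefix counts for a gzipped FASTQ file (path is an assumption)."""
    with gzip.open(path, "rt") as handle:
        counts = count_unique_prefixes(read_fastq(handle), max_k=max_k)
    for k, n in counts.items():
        print(f"k={k}, {n} unique sequences")
```

Calling `report("my_R2.fastq.gz")` would print one line per k. A sharp change in the count at a given k can hint at where the fixed cell ID ends and the random UMI begins, but sequencing errors inflate the counts, so this is only a rough indication.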
Qiagen sent me the cell_ID sequences (length=10 bases) and confirmed that UMI = 12 bases long.
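With those two lengths, splitting R2 is straightforward. Below is a minimal Python sketch, assuming the cell ID occupies the first 10 bases of R2 and the UMI the next 12 (consistent with the cell_index | UMI layout above, but not confirmed position by position). The function names are hypothetical, and only exact matches to the whitelist are kept, with no error correction.

```python
CELL_ID_LEN = 10  # cell ID length given by Qiagen
UMI_LEN = 12      # UMI length confirmed by Qiagen


def split_r2(seq, cell_len=CELL_ID_LEN, umi_len=UMI_LEN):
    """Split an R2 read into (cell_id, umi); return None if the read is too short.

    Assumes the cell ID starts at the first base and the UMI follows directly.
    """
    if len(seq) < cell_len + umi_len:
        return None
    return seq[:cell_len], seq[cell_len:cell_len + umi_len]


def assign_reads(sequences, whitelist):
    """Group UMIs by cell ID, keeping only reads whose cell ID is in the whitelist."""
    umis_per_cell = {}
    n_discarded = 0
    for seq in sequences:
        parts = split_r2(seq)
        if parts is None or parts[0] not in whitelist:
            n_discarded += 1  # too short, or cell ID not an exact whitelist match
            continue
        cell_id, umi = parts
        umis_per_cell.setdefault(cell_id, set()).add(umi)
    return umis_per_cell, n_discarded
```

For the first read shown above, `split_r2("TATGGAGAACATGGCGCGTTACAAGCN")` returns `("TATGGAGAAC", "ATGGCGCGTTAC")`, and the trailing 5 bases are ignored. The `whitelist` argument would be the set of cell ID sequences provided by Qiagen.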