Huge decrease of BUS records using bustools
0
0
Entering edit mode
8 weeks ago

Hello,

I have a single cell pipeline using kallisto bus for the pseudoalignment and then bustools to correct barcodes and sort them. I have an issue of huge decrease of BUS records while running bustools correct on some libraries. For examples for library SRX12187867 I get that output from bustools correct and bustools sort.

Found 6794880 barcodes in the whitelist
Processed 139893071 BUS records
In whitelist = 197236
Corrected = 1746603
Uncorrected = 137949232
Sort bus file
Read in 1943839 BUS records

the output.bus file generated by kallisto bus for this library is 4.4G (for ~140M BUS records). The bustools corrected file generated by bustools correct is only 60M (for ~1.9M BUS records).

I also processed other libraries from the same experiment and did not have the same issue.

Do you have any ideas what could be the reason for such a huge decrease of BUS records? I double checked the whitelist I used and I properly use the 10X V3 whitelist provided by cellranger.

Following @dsull comment the run_info.json file generated by kallisto is

{
        "n_targets": 0,
        "n_bootstraps": 0,
        "n_processed": 619876553,
        "n_pseudoaligned": 110019873,
        "n_unique": 82007271,
        "p_pseudoaligned": 17.7,
        "p_unique": 13.2,
        "kallisto_version": "0.46.0",
        "index_version": 0,
        "start_time": "Mon Jan 15 09:26:58 2024",
        "call": "kallisto bus -i /data/index/Homo_sapiens.GRCh38.transcriptome.idx -o /data/kallisto_bus_results/SRX12187867 -x 10xV3 -t 4 /data/FASTQ/9606/SRX12187867/SRR15896987/FASTQ/SRR15896987_R1.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896987/FASTQ/SRR15896987_R2.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896989/FASTQ/SRR15896989_R1.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896989/FASTQ/SRR15896989_R2.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896991/FASTQ/SRR15896991_R1.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896991/FASTQ/SRR15896991_R2.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896993/FASTQ/SRR15896993_R1.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896993/FASTQ/SRR15896993_R2.fastq.gz"
}

And a subset of the R1 and R2 files for one sample :

[jwollbre]$ zcat SRR15896991/FASTQ/SRR15896991_R1.fastq.gz | head -4
@SRR15896991.1 A01010:52:H3MHTDSXY:3:1101:1994:1000 length=151
GCTGTTTTAGGGGATTGTAGGTACTATATGGAAATGAAGATTTTAAAGTAGGGTAAGATCAGACCCAGCTGCTCCACTCCTGAGTGTTTACTCACCTAGGAAATACACATCCATACAAAGACTTGTACATACGTGGTTTTAGCAGCTTATT
+SRR15896991.1 A01010:52:H3MHTDSXY:3:1101:1994:1000 length=151
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

[jwollbre]$ zcat SRR15896991/FASTQ/SRR15896991_R2.fastq.gz | head -4
@SRR15896991.1 A01010:52:H3MHTDSXY:3:1101:1994:1000 length=151
CNCATGACACGCGCATATATTCCTTGATTAATAAAAGCATCTACACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR15896991.1 A01010:52:H3MHTDSXY:3:1101:1994:1000 length=151
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
single-cell bustools 10X • 621 views
ADD COMMENT
1
Entering edit mode

Hello! What you're seeing means that the barcodes extracted from your FASTQ files don't appear in the supplied "whitelist".

This can happen if the "kallisto bus" command being run is incorrect (e.g. perhaps you switched the R1 and R2 files? perhaps you supplied the index file instead of the R1 file? etc.). Can you show me the command you are running and the run_info.json file produced by "kallisto bus"?

ADD REPLY
0
Entering edit mode

Thanks for your comment. I updated my question to add the run_info.json file and an example of the R1 and R2 files.

ADD REPLY
1
Entering edit mode

R2/R1 files are indeed switched. Giveaway is the poly-A tail that you see in your Read 2. That is represented in the library structure: https://kb.10xgenomics.com/hc/en-us/articles/360000939852-What-is-the-difference-between-Single-Cell-3-and-5-Gene-Expression-libraries

Read 1 is normally UMI + Cell barcodes and then you will see a stretch of poly-A's, if read 1 is sequenced longer than 26 or 28 bp. Read 2 immediately reads into RNA insert.

ADD REPLY
0
Entering edit mode

That is exactly what I was scared of. Do you have any advice on how to check for file names to detect those cases where R1 and R2 are switched and both have the same read length? I would like to add a check to my pipeline and be 100% sure the name of the file is correct.

ADD REPLY
1
Entering edit mode

Not sure where and how you got the data from but using

fastq-dump -F --split-files SRR15896991

I get three files. _1 = Illumina index _2 = cell barcode + UMI _3 = RNA read.

$ head -4 SRR15896991_*
==> SRR15896991_1.fastq <==
@A01010:52:H3MHTDSXY:3:1101:1994:1000
NACGCCTT
+A01010:52:H3MHTDSXY:3:1101:1994:1000
#FFFFFFF

==> SRR15896991_2.fastq <==
@A01010:52:H3MHTDSXY:3:1101:1994:1000
CNCATGACACGCGCATATATTCCTTGATTAATAAAAGCATCTACACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+A01010:52:H3MHTDSXY:3:1101:1994:1000
F#FFF,FFF:FFFFF,FFFFFFFFF:FFFF,FFF:,,:F,,,F,:,F,FF,,FFFFF,F,FFFFFFFF:FFFFF,FFFF,FFF,FF,FFFFF:FF:F,FFFFFFFFF:F,FF,FFFF,FF,:FF,FFFFFF:F,,FF,FFFFFFFFF,FFF

==> SRR15896991_3.fastq <==
@A01010:52:H3MHTDSXY:3:1101:1994:1000
GCTGTTTTAGGGGATTGTAGGTACTATATGGAAATGAAGATTTTAAAGTAGGGTAAGATCAGACCCAGCTGCTCCACTCCTGAGTGTTTACTCACCTAGGAAATACACATCCATACAAAGACTTGTACATACGTGGTTTTAGCAGCTTATT
+A01010:52:H3MHTDSXY:3:1101:1994:1000
,FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFF:FFFFF:FF:FFFFFF:FFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFF,:FF:FFFFFF:FFFF:FFFFF:F:FF
ADD REPLY
0
Entering edit mode

Thanks GenoMax I did the same and arrived exactly at the same conclusion than you.

I had issues with some SRA libraries download with fastq-dump. R1 and R2 files were not properly named _1 and _2. (or _2 and _3 if a I1 file exists). I used the wrong and naïve assumption that read length could help me solve this issue by renaming all downloaded files based on read length (I1 < R1 < R2).

I am now going to use fasterqParseR approach to check files downloaded from SRA.

ADD REPLY
0
Entering edit mode

I found the R package fasterqParseR. This package is not only using read length but also the "whitelist" to correct for file name. It is quite elegant even if it is probably not 100% bullet proof. If you have any other idea or know any other tools dealing with that issue please let me know and thanks for your answers.

ADD REPLY
0
Entering edit mode

how to check for file names to detect those cases where R1 and R2 are switched

I don't think you can check for the switch based on the file name. It is the reads inside that gives thing away.

ADD REPLY

Login before adding your answer.

Traffic: 1948 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6