low quality data or file name swapping -- cellranger arc errors when processing 10x scMultiomics data
8 hours ago

I downloaded scMultiomics data from here.

To be specific, I downloaded snATAC-seq from here.

I made an ATAC folder and downloaded all snATAC-seq files to this ATAC folder using the wget command below.

cd ATAC/
wget -O W71_LUNGrep2_S6_L001_R1_001.fastq.gz https://www.encodeproject.org/files/ENCFF872EFS/@@download/ENCFF872EFS.fastq.gz
wget -O W71_LUNGrep2_S6_L001_R2_001.fastq.gz https://www.encodeproject.org/files/ENCFF320GWZ/@@download/ENCFF320GWZ.fastq.gz
wget -O W71_LUNGrep2_S6_L001_R3_001.fastq.gz https://www.encodeproject.org/files/ENCFF260JLZ/@@download/ENCFF260JLZ.fastq.gz
wget -O W71_LUNGrep2_S6_L002_R1_001.fastq.gz https://www.encodeproject.org/files/ENCFF591VEX/@@download/ENCFF591VEX.fastq.gz
wget -O W71_LUNGrep2_S6_L002_R2_001.fastq.gz https://www.encodeproject.org/files/ENCFF979SWK/@@download/ENCFF979SWK.fastq.gz
wget -O W71_LUNGrep2_S6_L002_R3_001.fastq.gz https://www.encodeproject.org/files/ENCFF213UEY/@@download/ENCFF213UEY.fastq.gz

Then downloaded scRNA-seq from here.

I made an RNA folder and downloaded all scRNA-seq files to this RNA folder using the wget command below.

cd RNA/
wget -O W71_LUNGrep2_S6_L002_R1_001.fastq.gz https://www.encodeproject.org/files/ENCFF094PRI/@@download/ENCFF094PRI.fastq.gz
wget -O W71_LUNGrep2_S6_L002_R2_001.fastq.gz https://www.encodeproject.org/files/ENCFF639HEH/@@download/ENCFF639HEH.fastq.gz
wget -O W71_LUNGrep2_S6_L001_R1_001.fastq.gz https://www.encodeproject.org/files/ENCFF135JSP/@@download/ENCFF135JSP.fastq.gz
wget -O W71_LUNGrep2_S6_L001_R2_001.fastq.gz https://www.encodeproject.org/files/ENCFF318BVV/@@download/ENCFF318BVV.fastq.gz

Note that the filenames come from the “original filename” field in the attribution table. For example, for the snATAC-seq file ENCFF872EFS, I went to https://www.encodeproject.org/files/ENCFF872EFS/ and found W71_LUNGrep2_S6_L001_R1_001.fastq.gz listed as its filename (see the screenshot below).

[Screenshot: ENCODE file page for ENCFF872EFS showing the original filename]

One tricky point: on the experiment page, ENCFF320GWZ is listed as R2 and ENCFF260JLZ as the index file, but on their own file pages ENCFF320GWZ is R3 and ENCFF260JLZ is R2. I tried both orders (ENCFF320GWZ as R2 with ENCFF260JLZ as R3, and the reverse), and cellranger-arc returned the same error either way.

Then I built a libraries.csv file as follows:

fastqs,sample,library_type
${root_dir}$/ENCSR128ZLB/RNA,W71_LUNGrep2,Gene Expression
${root_dir}$/ENCSR128ZLB/ATAC,W71_LUNGrep2,Chromatin Accessibility
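
As far as I know, cellranger-arc reads this CSV literally and does not expand shell variables, so the paths in the file itself should be absolute. A minimal sketch of generating the file from the shell is below; `write_libraries_csv` is a hypothetical helper name, and `root_dir` is assumed to be the directory that contains ENCSR128ZLB/.

```shell
# Hypothetical helper: write libraries.csv with the root path already expanded.
# $1 = directory that contains ENCSR128ZLB/ (assumption), $2 = output CSV path.
write_libraries_csv() {
    local root_dir=$1 out=$2
    # Unquoted heredoc delimiter, so ${root_dir} is expanded by the shell here.
    cat > "$out" <<EOF
fastqs,sample,library_type
${root_dir}/ENCSR128ZLB/RNA,W71_LUNGrep2,Gene Expression
${root_dir}/ENCSR128ZLB/ATAC,W71_LUNGrep2,Chromatin Accessibility
EOF
}

# Example call: falls back to the current directory if root_dir is unset.
write_libraries_csv "${root_dir:-$PWD}" libraries.csv
cat libraries.csv
```

Printing the file afterwards makes it easy to confirm that the `fastqs` column holds real paths rather than an unexpanded placeholder.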

For both the scRNA-seq and snATAC-seq files, I took the part of the original filename before the S index (W71_LUNGrep2) as the sample name. It is the same for both libraries; could that trigger an error?
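
To make the sample-name question concrete, here is a small sketch of the prefix extraction. It assumes bcl2fastq-style names (`<sample>_S<n>_L<lane>_<read>_001.fastq.gz`); `sample_prefix` is a hypothetical helper, not part of cellranger-arc.

```shell
# Hypothetical helper: print everything before the _S<n>_L<lane>_<read>_001.fastq.gz
# suffix of a bcl2fastq-style FASTQ filename.
sample_prefix() {
    local name
    name=$(basename "$1")
    printf '%s\n' "$name" \
        | sed -E 's/_S[0-9]+_L[0-9]{3}_[RI][0-9]_[0-9]{3}\.fastq\.gz$//'
}

# Both calls print the same prefix, W71_LUNGrep2.
sample_prefix ATAC/W71_LUNGrep2_S6_L001_R1_001.fastq.gz
sample_prefix RNA/W71_LUNGrep2_S6_L002_R2_001.fastq.gz
```

Both libraries resolve to the same prefix, so the only thing that tells them apart in libraries.csv is the `fastqs` directory on each row.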

So, now my folder and file structure is

.
|-- ATAC
|   |-- W71_LUNGrep2_S6_L001_R1_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L001_R2_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L001_R3_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L002_R1_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L002_R2_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L002_R3_001.fastq.gz
|-- libraries.csv
|-- RNA
    |-- W71_LUNGrep2_S6_L001_R1_001.fastq.gz
    |-- W71_LUNGrep2_S6_L001_R2_001.fastq.gz
    |-- W71_LUNGrep2_S6_L002_R1_001.fastq.gz
    |-- W71_LUNGrep2_S6_L002_R2_001.fastq.gz

I used the following command to run cellranger arc on these data.

cellranger-arc count --id=2024-A \
                     --reference=${reference_dir}/refdata-cellranger-arc-GRCh38-2024-A \
                     --libraries=${work_root_dir}/libraries.csv \
                     --localcores=24 \
                     --localmem=180

After running for about 1 h 50 min, cellranger-arc returned the following error:

2025-11-25 01:59:31 [runtime] (update)          ID.2024-A.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._GEX_MATRIX_COMPUTER.ALIGN_AND_COUNT.fork0 chunks running (7/21 completed)
2025-11-25 02:04:46 [runtime] (update)          ID.2024-A.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._GEX_MATRIX_COMPUTER.ALIGN_AND_COUNT.fork0 chunks running (11/21 completed)
2025-11-25 02:10:06 [runtime] (update)          ID.2024-A.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._GEX_MATRIX_COMPUTER.ALIGN_AND_COUNT.fork0 chunks running (13/21 completed)
2025-11-25 02:11:22 [runtime] (failed)          ID.2024-A.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._ATAC_MATRIX_COMPUTER.ALIGN_ATAC_READS

[error] Pipestance failed. Error log at:
2024-A/SC_ATAC_GEX_COUNTER_CS/SC_ATAC_GEX_COUNTER/_ATAC_MATRIX_COMPUTER/ALIGN_ATAC_READS/fork0/join-u6dec253e8d/_errors

Log message:
0.5% (< 10%) of read pairs have a valid 10x barcode. This could be a result of poor sequencing quality, a sample mixup, or running the wrong pipeline, for example, running `cellranger-atac` on Multiome ATAC + GEX data, or vice versa.

Waiting 6 seconds for UI to do final refresh.
Pipestance failed. Use --noexit option to keep UI running after failure.

2025-11-25 02:11:28 Shutting down.
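
The barcode fraction in the error can be roughly cross-checked outside the pipeline. The sketch below is not how cellranger-arc computes it; it assumes the 16-base barcode sits at the end of the read, accepts either orientation, and expects a plain-text whitelist with one barcode per line (for example, one decompressed from the cellranger-arc installation). `barcode_hit_rate` is a hypothetical helper.

```shell
# Hypothetical helper: share of the first N reads whose last 16 bases
# (or their reverse complement) appear in a barcode whitelist.
# $1 = plain-text whitelist, $2 = gzipped FASTQ, $3 = number of reads (default 100000).
barcode_hit_rate() {
    local whitelist=$1 fastq=$2 n_reads=${3:-100000}
    gzip -cd "$fastq" | head -n "$((n_reads * 4))" \
        | awk -v wl="$whitelist" '
            function revcomp(s,   i, out) {
                out = ""
                for (i = length(s); i >= 1; i--) out = out comp[substr(s, i, 1)]
                return out
            }
            BEGIN {
                comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"
                comp["T"] = "A"; comp["N"] = "N"
                # Load the whitelist into a lookup table.
                while ((getline line < wl) > 0) ok[line] = 1
            }
            NR % 4 == 2 {
                # Sequence line: take the last 16 bases as the candidate barcode.
                total++
                bc = substr($0, length($0) - 15)
                if ((bc in ok) || (revcomp(bc) in ok)) hits++
            }
            END {
                if (total) printf "%d/%d reads (%.1f%%) match\n", hits, total, 100 * hits / total
            }'
}

# Example call (paths are placeholders):
# barcode_hit_rate whitelist.txt ATAC/W71_LUNGrep2_S6_L001_R2_001.fastq.gz
```

Running it on each ATAC file in turn shows which file, if any, actually carries the barcodes; a file that scores near zero in both orientations is probably not the barcode read.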

Do I need to upload the full output? Does this error mean the data themselves are low quality and therefore unusable, or did I do something wrong? Any suggestions would be appreciated. Thank you very much!

scMultiomics ENCODE scRNA-seq scATAC-seq cellranger-arc

You can check the read lengths in each file to work out which one is R1 and which is R2. Also check which 10x protocol version was used.

It seems that the i7 and i5 sequences are merged in the ATAC R3 files (ENCFF260JLZ, 24 nt).

https://www.10xgenomics.com/support/epi-atac/documentation/steps/sequencing/sequencing-requirements-for-single-cell-atac
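
A quick way to run the read-length check is sketched below. `read_lengths` is a hypothetical helper that tabulates sequence lengths over the first reads of a gzipped FASTQ; the ATAC/ directory is the one from the question.

```shell
# Hypothetical helper: print "length count" pairs for the first N reads
# of a gzipped FASTQ ($1 = file, $2 = number of reads, default 10000).
read_lengths() {
    local fastq=$1 n_reads=${2:-10000}
    gzip -cd "$fastq" | head -n "$((n_reads * 4))" \
        | awk 'NR % 4 == 2 { count[length($0)]++ }
               END { for (len in count) print len, count[len] }' \
        | sort -n
}

# Summarise every ATAC file; skip the loop body if the glob matches nothing.
for f in ATAC/*.fastq.gz; do
    [ -e "$f" ] || continue
    echo "== $f"
    read_lengths "$f" 10000
done
```

As I understand the cellranger-arc naming convention, the short barcode-carrying read (the 24 nt one mentioned above) is expected under the R2 name and the two longer genomic reads under R1 and R3, so the length table should show directly whether a file has been given the wrong read label.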
