low quality data or file name swapping -- cellranger arc errors when processing 10x scMultiomics data

I downloaded scMultiomics data from here.

To be specific, I downloaded snATAC-seq from here.

I made an ATAC folder and downloaded all snATAC-seq files to this ATAC folder using the wget command below.

cd ATAC/
wget -O W71_LUNGrep2_S6_L001_R1_001.fastq.gz https://www.encodeproject.org/files/ENCFF872EFS/@@download/ENCFF872EFS.fastq.gz
wget -O W71_LUNGrep2_S6_L001_R2_001.fastq.gz https://www.encodeproject.org/files/ENCFF320GWZ/@@download/ENCFF320GWZ.fastq.gz
wget -O W71_LUNGrep2_S6_L001_R3_001.fastq.gz https://www.encodeproject.org/files/ENCFF260JLZ/@@download/ENCFF260JLZ.fastq.gz
wget -O W71_LUNGrep2_S6_L002_R1_001.fastq.gz https://www.encodeproject.org/files/ENCFF591VEX/@@download/ENCFF591VEX.fastq.gz
wget -O W71_LUNGrep2_S6_L002_R2_001.fastq.gz https://www.encodeproject.org/files/ENCFF979SWK/@@download/ENCFF979SWK.fastq.gz
wget -O W71_LUNGrep2_S6_L002_R3_001.fastq.gz https://www.encodeproject.org/files/ENCFF213UEY/@@download/ENCFF213UEY.fastq.gz

Then I downloaded the scRNA-seq data from here.

I made an RNA folder and downloaded all scRNA-seq files into it using the wget commands below.

cd RNA/
wget -O W71_LUNGrep2_S6_L002_R1_001.fastq.gz https://www.encodeproject.org/files/ENCFF094PRI/@@download/ENCFF094PRI.fastq.gz
wget -O W71_LUNGrep2_S6_L002_R2_001.fastq.gz https://www.encodeproject.org/files/ENCFF639HEH/@@download/ENCFF639HEH.fastq.gz
wget -O W71_LUNGrep2_S6_L001_R1_001.fastq.gz https://www.encodeproject.org/files/ENCFF135JSP/@@download/ENCFF135JSP.fastq.gz
wget -O W71_LUNGrep2_S6_L001_R2_001.fastq.gz https://www.encodeproject.org/files/ENCFF318BVV/@@download/ENCFF318BVV.fastq.gz

Note that each filename comes from the "original filename" field in the attribution table. For example, for the snATAC-seq file ENCFF872EFS, I navigated to https://www.encodeproject.org/files/ENCFF872EFS/ and found W71_LUNGrep2_S6_L001_R1_001.fastq.gz listed as the original filename of this sequencing file. See the screenshot below.

enter image description here

A tricky point: on the experiment page, ENCFF320GWZ is listed as R2 and ENCFF260JLZ as the index file, but on their own file pages ENCFF320GWZ is R3 and ENCFF260JLZ is R2. I tried both orders (ENCFF320GWZ as R2 with ENCFF260JLZ as R3, and ENCFF320GWZ as R3 with ENCFF260JLZ as R2), and cellranger-arc returned the same error either way.

Then I built a libraries.csv file as follows:

fastqs,sample,library_type  
${root_dir}$/ENCSR128ZLB/RNA,W71_LUNGrep2,Gene Expression  
${root_dir}$/ENCSR128ZLB/ATAC,W71_LUNGrep2,Chromatin Accessibility
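
If `${root_dir}$` in the file above is a placeholder rather than literal text, one way to avoid typing the absolute paths by hand is to let the shell expand them while writing the file. The sketch below is only an illustration under that assumption: `write_libraries_csv` is a hypothetical helper name, and the folder layout, sample name, and library types are copied from the post.

```shell
# Write libraries.csv with the root directory already expanded to a real path.
# Arguments: 1 = root directory (ideally absolute), 2 = output CSV path.
# The body runs in a subshell, so the variables stay local to the function.
write_libraries_csv() (
    root_dir="$1"
    out="$2"
    {
        printf 'fastqs,sample,library_type\n'
        printf '%s/ENCSR128ZLB/RNA,W71_LUNGrep2,Gene Expression\n' "$root_dir"
        printf '%s/ENCSR128ZLB/ATAC,W71_LUNGrep2,Chromatin Accessibility\n' "$root_dir"
    } > "$out"
)
```

For example, `write_libraries_csv "$(pwd)" libraries.csv` would write the three lines with the current directory substituted for the placeholder.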

For both the scRNA-seq and snATAC-seq files, I took the sample name from the part of the original filename before the S index. This gives the same sample name (W71_LUNGrep2) for both libraries. Could that trigger an error?

So, now my folder and file structure is

.
|-- ATAC
|   |-- W71_LUNGrep2_S6_L001_R1_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L001_R2_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L001_R3_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L002_R1_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L002_R2_001.fastq.gz
|   |-- W71_LUNGrep2_S6_L002_R3_001.fastq.gz
|-- libraries.csv
|-- RNA
    |-- W71_LUNGrep2_S6_L001_R1_001.fastq.gz
    |-- W71_LUNGrep2_S6_L001_R2_001.fastq.gz
    |-- W71_LUNGrep2_S6_L002_R1_001.fastq.gz
    |-- W71_LUNGrep2_S6_L002_R2_001.fastq.g

I used the following command to run cellranger arc on these data.

cellranger-arc count --id=2024_A \
                     --reference=${reference_dir}/refdata-cellranger-arc-GRCh38-2024-A \
                     --libraries=${work_root_dir}/libraries.csv \
                     --localcores=24 \
                     --localmem=180

After running for about 1 h 50 min, cellranger-arc returned the following error:

025-11-25 01:59:31 [runtime] (update)          ID.2024-A.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._GEX_MATRIX_COMPUTER.ALIGN_AND_COUNT.fork0 chunks running (7/21 completed)
2025-11-25 02:04:46 [runtime] (update)          ID.2024-A.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._GEX_MATRIX_COMPUTER.ALIGN_AND_COUNT.fork0 chunks running (11/21 completed)
2025-11-25 02:10:06 [runtime] (update)          ID.2024-A.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._GEX_MATRIX_COMPUTER.ALIGN_AND_COUNT.fork0 chunks running (13/21 completed)
2025-11-25 02:11:22 [runtime] (failed)          ID.2024-A.SC_ATAC_GEX_COUNTER_CS.SC_ATAC_GEX_COUNTER._ATAC_MATRIX_COMPUTER.ALIGN_ATAC_READS

[error] Pipestance failed. Error log at:
2024-A/SC_ATAC_GEX_COUNTER_CS/SC_ATAC_GEX_COUNTER/_ATAC_MATRIX_COMPUTER/ALIGN_ATAC_READS/fork0/join-u6dec253e8d/_errors

Log message:
0.5% (< 10%) of read pairs have a valid 10x barcode. This could be a result of poor sequencing quality, a sample mixup, or running the wrong pipeline, for example, running `cellranger-atac` on Multiome ATAC + GEX data, or vice versa.

Waiting 6 seconds for UI to do final refresh.
Pipestance failed. Use --noexit option to keep UI running after failure.

2025-11-25 02:11:28 Shutting down.

Do I need to upload the full output file? Does this error mean the data themselves are of low quality and therefore unusable, or did I do something wrong? Any suggestions would be appreciated. Thank you very much!

scMultiomics ENCODE scRNA-seq scATAC-seq cellranger-arc • 231 views

Check the read lengths to make sure the files are correct.


Hi Arup Ghosh, thank you very much for your suggestion. Here are the results, and they seem correct to me. Do you have any further suggestions? Thank you very much!

-> ATAC zcat W71_LUNGrep2_S6_L001_R1_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
50
-> ATAC zcat W71_LUNGrep2_S6_L001_R2_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
50
-> ATAC zcat W71_LUNGrep2_S6_L001_R3_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
24
-> ATAC zcat W71_LUNGrep2_S6_L002_R1_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
50
-> ATAC zcat W71_LUNGrep2_S6_L002_R2_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
50
-> ATAC zcat W71_LUNGrep2_S6_L002_R3_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
24

-> RNA zcat W71_LUNGrep2_S6_L001_R1_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
28
-> RNA zcat W71_LUNGrep2_S6_L001_R2_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
90
-> RNA zcat W71_LUNGrep2_S6_L002_R1_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
28
-> RNA zcat W71_LUNGrep2_S6_L002_R2_001.fastq.gz | awk 'NR%4==2 {print length($0); exit}'
90
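
The per-file checks above can be condensed into one loop. This is a minimal sketch, assuming the FASTQ files end in `.fastq.gz` and that the first record is representative of the whole file; `first_read_lengths` is a hypothetical helper name, not part of cellranger-arc.

```shell
# Print "<file name><TAB><length of first read>" for every FASTQ in a directory.
# The body runs in a subshell, so the variables stay local to the function.
first_read_lengths() (
    dir="$1"
    for fq in "$dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip when the glob matches nothing
        # Line 2 of a FASTQ record is the sequence; stop after the first record.
        len=$(gzip -dc "$fq" | awk 'NR == 2 { print length($0); exit }')
        printf '%s\t%s\n' "$(basename "$fq")" "$len"
    done
)
```

Running `first_read_lengths ATAC` and `first_read_lengths RNA` lists each file next to its read length, which makes it easier to see which read number the 24 nt file currently carries.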

The barcode files are the ones with a 24 nt read length, W71_LUNGrep2_S6_L001_R3_001.fastq.gz and W71_LUNGrep2_S6_L002_R3_001.fastq.gz. They should be named R2, not R3.
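
A rename along those lines could look like the sketch below. It assumes the ATAC files are named as in the directory listing in the question, so the 24 nt barcode reads currently sit in the `_R3_` files and the 50 nt reads in the `_R2_` files; `swap_r2_r3` is a hypothetical helper name. It swaps the two names for every lane via a temporary file, so it is worth re-checking the read lengths afterwards.

```shell
# Swap the names of the *_R2_001.fastq.gz and *_R3_001.fastq.gz files in a directory.
# The body runs in a subshell, so the variables stay local to the function.
swap_r2_r3() (
    dir="$1"
    for r2 in "$dir"/*_R2_001.fastq.gz; do
        [ -e "$r2" ] || continue                 # glob matched nothing
        r3="${r2%_R2_001.fastq.gz}_R3_001.fastq.gz"
        if [ ! -e "$r3" ]; then
            echo "no R3 partner for $r2, skipping" >&2
            continue
        fi
        tmp="$r2.swap.tmp"
        mv "$r2" "$tmp"    # park the current R2 under a temporary name
        mv "$r3" "$r2"     # the barcode read takes the R2 name
        mv "$tmp" "$r3"    # the former R2 becomes R3
    done
)
```

Calling `swap_r2_r3 ATAC` once performs the swap; calling it a second time undoes it.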
