CellRanger runs into error when running fastq files from SRA
2.2 years ago
firestar ★ 1.6k

I am downloading reads from SRA to run in CellRanger.

prefetch -p -r yes --max-size 40G -O . SRR10419617
fasterq-dump -O . --threads 4 --mem "26G" --split-3 --skip-technical --print-read-nr --progress SRR10419617

This produces two fastq files:

zcat SRR10419617_1.fastq.gz | head

@SRR10419617.1/1 1 length=8
NGTGGAAC
+SRR10419617.1/1 1 length=8
#AAAFJFF
@SRR10419617.2/1 2 length=8
NGTGGAAC
+SRR10419617.2/1 2 length=8
#AAFFJJJ

zcat SRR10419617_2.fastq.gz | head

@SRR10419617.1/2 1 length=76
NNNGCCTAGTTAACGCATTTACTAAACGCAGACGAAAATGGAAAGATTAATTGGGAGTGGTAGGATGAAACAATTT
+SRR10419617.1/2 1 length=76
###-<<FJFFJJJJJJJ<JJJJJJJJJJJJJJJJFJFJJJJJFJJJJJ<JJJJFJ<JAAJAFFJJFJFJFJJFJFJ
@SRR10419617.2/2 2 length=76
NNNACAGCTATTTCATTATGTGCAATGTGTTACACCCTTTCAAATGTAATAAACTCACAACAAAATTGAAACATAA
+SRR10419617.2/2 2 length=76
###<<FJJJJJJJJJJJJJJFJJJJJJJJJJJJJFAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

I renamed them to fit with CellRanger.

SRR10419617_S1_L001_R1_001.fastq.gz
SRR10419617_S1_L001_R2_001.fastq.gz

Then I run CellRanger:

cellranger count \
  --nosecondary \
  --id "SRR10419617" \
  --transcriptome "${CELLRANGER_DATA}/refdata-gex-GRCh38-2020-A/" \
  --fastqs "../../raw/reads/PRJNA588461/SRR10419617" \
  --sample "SRR10419617" \
  --localcores 4 \
  --localmem 25

And it fails with this error:

[error] Pipestance failed. Error log at:
SRR10419617/SC_RNA_COUNTER_CS/SC_MULTI_CORE/MULTI_CHEMISTRY_DETECTOR/_GEM_WELL_CHEMISTRY_DETECTOR/DETECT_COUNT_CHEMISTRY/fork0/chnk0-u1e9cf4fd5a/_errors

Log message:
The read lengths are incompatible with all the chemistries for Sample SRR10419617 in "/raw/reads/PRJNA588461/SRR10419617".
 read1 median length = 8
 read2 median length = 76
 index1 median length = 0

The minimum read length for different chemistries are:
SC5P-R2  - read1: 26, read2: 25, index1: 0
SC5P-PE  - read1: 81, read2: 25, index1: 0
SC3Pv1   - read1: 25, read2: 10, index1: 14
SC3Pv2   - read1: 26, read2: 25, index1: 0
SC3Pv3   - read1: 26, read2: 15, index1: 0
SC3Pv3LT - read1: 26, read2: 25, index1: 0

We expect that at least 50% of the reads exceed the minimum length.

I have also tried changing the fastq file names (R1 and R2, R2 as L1, etc.), but I get the same error.

Does anyone know what could be the issue? Incorrect fastq names? Should it be R1, R2 and L1? Is a file missing? Are two fastq files with 76 and 8 nucleotide reads the expected output for 10X?
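
One way to check which read each file actually holds is to look at the read lengths, since the error above reports medians of 8 and 76. A minimal sketch, assuming gzipped files with standard 4-line FASTQ records; the function name `median_read_length` and the 10000-record sample size are choices made for this illustration only:

```shell
#!/usr/bin/env bash
# Print the (lower) median read length over the first N records of a
# gzipped FASTQ file. Assumes 4-line records; N defaults to 10000.
median_read_length() {
  local fastq="$1" n="${2:-10000}"
  gzip -cd "$fastq" \
    | head -n "$((n * 4))" \
    | awk 'NR % 4 == 2 { print length($0) }' \
    | sort -n \
    | awk '{ len[NR] = $1 } END { if (NR) print len[int((NR + 1) / 2)] }'
}

# Report the median for each downloaded file that is present.
for f in SRR10419617_1.fastq.gz SRR10419617_2.fastq.gz; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(median_read_length "$f")"
  fi
done
```

A file whose median is 8 is far below the read1 minimum of 25 or 26 listed in the error message, so it cannot be the barcode/UMI read, whatever it is named.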

sratools/2.10.9
EDirect/15.1
cellranger/6.0.2
cellranger 10x single-cell sra • 3.5k views
2.2 years ago
GenoMax 146k

It appears that the data for this run was not correctly processed/uploaded in SRA. That R1 file is useless since it is just the Illumina index.

If you look under the Data Access tab, the three files (R1, R2, I1) appear to have been correctly submitted by the original submitters: https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR10419617&display=data-access

Unfortunately you will need to pay to download those.


Thanks for the reply. This seems to be a systemic problem. I think I have looked at 4 different studies and all the 10X SRA files seem to be like this. Does anyone know an SRR ID with 10X data that actually works? Just to test my workflow/script.


firestar, there are plenty of good examples. Here is one: SRR17102621. fastq-dump will produce three files: 1 = I1, 2 = R1, 3 = R2.

fastq-dump -F --split-files SRR17102621

Additional samples: https://www.ncbi.nlm.nih.gov/sra/SRX13290059[accn]


Your sample (SRR17102621) creates 3 files with this code:

fastq-dump -F --split-files SRR17102621

while all these variations of fasterq-dump produce just 1 file:

fasterq-dump -O . --threads 6 --mem 24G --split-3 --skip-technical --print-read-nr --progress SRR17102621
fasterq-dump -O . --threads 6 --mem 24G --split-files SRR17102621
fasterq-dump --split-files SRR17102621

Now for my example (SRR10419617), both tools produce 2 fastq files while I should get 3 (I think). I wonder if there might be more to it than an incorrectly uploaded SRA file.


It seems clear that fasterq-dump should not be used with 10x data at all since others have reported similar issues.

You could try emailing the SRA help desk and asking them about your specific accession. Tell them that the "Data Access" tab shows the three correct files, so the submitters probably did the right upload. You can enumerate the problems with the *-dump programs and mention that you can't download the original files without paying.


I contacted sra-tools and I finally have a solution. This seems to work for the "good" 10x SRA files.

fasterq-dump --threads 20 --mem 128G --split-files --include-technical --print-read-nr --progress SRR17102621

Including --split-files and --include-technical seems to be critical. It doesn't work if --split-3 is used. Not exactly sure what that does anyway. For this sample, using prefetch followed by fasterq-dump with 18 cores produced 3 fastq files (59GB total) in 1.5 hours.
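
Once the three files exist, they still need Cell Ranger style names. A minimal sketch, assuming the mapping reported earlier in this thread for SRR17102621 (file 1 = I1, file 2 = R1, file 3 = the cDNA read, which Cell Ranger expects as R2). The function name `rename_for_cellranger` is made up for illustration, the `S1_L001` parts simply follow the names used in the question, and the mapping should be checked against read lengths for any other run:

```shell
#!/usr/bin/env bash
# Rename <acc>_1/_2/_3 FASTQ files to Cell Ranger style names.
# ASSUMPTION: _1 = I1 (sample index), _2 = R1 (barcode + UMI), _3 = R2 (cDNA).
# This is the order reported for SRR17102621; other runs may differ.
rename_for_cellranger() {
  local acc="$1" dir="${2:-.}" n rtype ext src
  for n in 1 2 3; do
    case "$n" in
      1) rtype="I1" ;;
      2) rtype="R1" ;;
      3) rtype="R2" ;;
    esac
    # Handle both uncompressed and gzipped files, keeping the extension.
    for ext in fastq fastq.gz; do
      src="${dir}/${acc}_${n}.${ext}"
      if [ -f "$src" ]; then
        mv -- "$src" "${dir}/${acc}_S1_L001_${rtype}_001.${ext}"
      fi
    done
  done
}

# Example call, assuming the three files are in the current directory:
# rename_for_cellranger SRR17102621 .
```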

I don't have a solution for the "bad" files yet. 12 of the 13 experiments that I have looked at seem to be "bad" 10x SRA files (PRJNA330719, PRJNA400576, PRJNA548726, PRJNA558893, PRJNA588461, PRJNA593249, PRJNA625951, PRJNA647809, PRJNA661274, PRJNA682432, PRJNA700854, PRJNA700856). I only checked the first SRR for each experiment.


You should be able to get good fastqs from 10xGenomics.


I mean SRA files.


It is variable. Some runs are fine. People at times will also submit BAM files from cellranger that can be used to reconstitute the fastqs properly.
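
For the BAM route, 10x Genomics distributes a `bamtofastq` tool for this purpose. A rough sketch of a wrapper, assuming `bamtofastq` is installed and on `PATH` and that the BAM was produced by Cell Ranger; the wrapper name `bam_to_fastq` and the file names in the example call are placeholders, not anything taken from a specific accession:

```shell
#!/usr/bin/env bash
# Convert a Cell Ranger BAM back into FASTQ files with 10x's bamtofastq.
# ASSUMPTIONS: bamtofastq is on PATH and the BAM came from Cell Ranger.
bam_to_fastq() {
  local bam="$1" outdir="$2" threads="${3:-4}"
  if [ ! -f "$bam" ]; then
    echo "BAM file not found: $bam" >&2
    return 1
  fi
  if ! command -v bamtofastq >/dev/null 2>&1; then
    echo "bamtofastq not found on PATH" >&2
    return 127
  fi
  bamtofastq --nthreads="$threads" "$bam" "$outdir"
}

# Hypothetical call; both names are placeholders:
# bam_to_fastq possorted_genome_bam.bam reconstituted_fastqs 8
```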


Can't they be obtained from the ENA instead (https://www.ebi.ac.uk/ena/browser/view/SRX7117651?show=reads)?

sra-explorer gives the following download script to fetch from ENA:

#!/usr/bin/env bash
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/017/SRR10419617/SRR10419617_1.fastq.gz -o SRR10419617_3295_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/017/SRR10419617/SRR10419617_2.fastq.gz -o SRR10419617_3295_2.fastq.gz

The problem is that file 1 from ENA is the same Illumina index (at least at the beginning):

@SRR10419617.1 1/1
NGTGGAAC
+
#AAAFJFF
@SRR10419617.2 2/1
NGTGGAAC
+
#AAFFJJJ

ahhh .... the level to which single-cell RNA-seq data is often rendered useless during upload to these archives is truly astounding.
