10x Cell Ranger 'count' function error:
3.8 years ago
miyagi • 0

Dear all,

I am trying to use CellRanger 'count' function on the 10x single-cell data deposited here (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8077/). I tried it two ways:

1) While in the directory where the fastq.gz files are:

cellranger count \
--id=Kalucka_endo \
--fastqs=<path> \
--transcriptome=<path>/refdata-cellranger-mm10-3.0.0

2) with the sample 'prefix':

cellranger count \
--id=Kalucka_endo \
--sample=<sample_ID_prefix> \
--fastqs=<path> \
--transcriptome=<path>/refdata-cellranger-mm10-3.0.0

They state that they used 10x for sequencing, but it isn't clear to me whether their samples were named using bcl2fastq or demux. Two examples of their file names are 180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz and 180908_I127_FCHWNVWBBXX_L5_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz.

I tried using 10x tutorial data here: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ct#explorecount

and was successfully able to get it running using the command line from 1) while in the directory with the fastq.gz files.

When I try either of these two ways of running it with this data, I get this error:

cellranger count (3.1.0)
Copyright (c) 2019 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Invalid path/prefix combination: , None
No input FASTQs were found for the requested parameters.

If your files came from bcl2fastq or mkfastq:
 - Make sure you are specifying the correct --sample(s), i.e. matching the sample sheet
 - Make sure your files follow the correct naming convention, e.g. SampleName_S1_L001_R1_001.fastq.gz (and the R2 version)
 - Make sure your --fastqs points to the correct location.

Refer to the "Specifying Input FASTQs" page at https://support.10xgenomics.com/ for more details.

I tried changing the names to be more consistent with the bcl2fastq naming convention, i.e.

<sample name>_<barcode sequence>_L<lane>_R<read number>_<set number>.fastq.gz

For example, I renamed 180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz to brain1_ATTCTAAG_L001_R2_001.fastq.gz, and 180908_I127_FCHWNVWBBXX_L5_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz to brain2_ATTCTAAG_L001_R2_001.fastq.gz.

With these changes, I am still getting the aforementioned error.

The only other forum question I was able to find was "cellranger count help", and I don't believe I am having the same issue, since even omitting the sample name and using '.' for --fastqs was unsuccessful.

Any thoughts or suggestions are greatly appreciated!

10x cellranger single cell • 3.8k views

Processed count data are provided as processed data files (LINK) on the page you linked above. You could just use those.

raw_count_matrix_brain.txt
raw_count_matrix_colon.txt
raw_count_matrix_heart.txt
raw_count_matrix_kidney.txt
raw_count_matrix_liver.txt
raw_count_matrix_lung.txt
raw_count_matrix_muscle_EDL.txt

Hi @genomax,

Yes, I actually started with those files, but they're problematic (or at least not clear). Long story short, it is not clear to me whether these are normalized reads. They are definitely post-processed, because the paper says that cells belonging to non-endothelial clusters were 'excluded', and the file is clearly missing a large number of genes, most of which are the ones used for exclusion (for example, Lyve1 was excluded and there is no Lyve1 in this data matrix). The fact that the file names say 'raw_count' confuses me as well.

I tried extensively to figure out if these are normalized, but I can't. We went so far as to email the authors but haven't heard back from them.

If you or anyone has any suggestion as to how I can figure out whether these are normalized read counts, I would really appreciate that too, since it seems unlikely that I'm going to figure out this weird fastq issue anytime soon.


These are unlikely to be normalized. The file name indicates that these are counts files, probably from the cellranger count analysis which you are trying to do. Since this is single-cell data, these matrix files are going to be sparse with a lot of zeros.
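
One quick way to probe this is to check whether every entry in a matrix file is a non-negative integer, since normalized or log-transformed values are usually fractional. The sketch below is only a heuristic: it assumes a tab-delimited file with a single header row and gene names in the first column, which has not been verified against the deposited files, and `count_non_integers` is a hypothetical helper name.

```shell
# Hypothetical helper: count matrix entries that are not non-negative
# integers. Assumes a tab-delimited file with one header row and gene
# names in the first column (an assumption, not checked against the data).
count_non_integers() {
    awk -F '\t' '
        NR > 1 {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^[0-9]+$/) bad++
        }
        END { print bad + 0 }
    ' "$1"
}

# Usage (file name taken from the list above):
#   count_non_integers raw_count_matrix_brain.txt
```

A result of 0 is consistent with raw counts but does not prove it, and values written as `3.0` would be flagged even though they are whole numbers.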


I tend to agree, but what concerned me was the fact that it looks like this would have had to be done post-clustering to identify non-EC clusters, in which case it should be normalized before identifying those.

So you think they basically took those cells out and then just provided the raw, non-normalized counts for the cells that were left?

The matrix has only about 9,300 genes in total, which left me confused about which stage of the analysis it comes from.

3.8 years ago

brain1_ATTCTAAG_L001_R2_001.fastq.gz

Look at the example really carefully. Can you see why it probably won't work?

Try brain1_ATTCTAAG_S1_L001_R2_001.fastq.gz

I'm guessing that what follows --sample has to exactly match everything before the _S
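
To make that guess concrete, here is a minimal sketch that extracts everything before `_S<n>_L<lane>_<read>_<chunk>.fastq.gz` from a file name, following the pattern quoted in the error message. The helper name `sample_prefix` is hypothetical, and treating the extracted prefix as the value for `--sample` is the assumption stated above, not something confirmed here.

```shell
# Hypothetical helper: print the part of a FASTQ file name that precedes
# _S<n>_L<lane>_<read>_<chunk>.fastq.gz, or fail if the name does not fit
# the SampleName_S1_L001_R1_001.fastq.gz pattern from the error message.
sample_prefix() {
    printf '%s\n' "$1" |
        sed -n -E 's/^(.+)_S[0-9]+_L[0-9]{3}_[RI][0-9]_[0-9]{3}\.fastq(\.gz)?$/\1/p' |
        grep .
}

for f in brain1_ATTCTAAG_L001_R2_001.fastq.gz \
         brain1_ATTCTAAG_S1_L001_R2_001.fastq.gz; do
    if prefix=$(sample_prefix "$f"); then
        echo "$f -> --sample=$prefix"
    else
        echo "$f -> does not match the expected pattern"
    fi
done
```

The first name is rejected because it has no `_S1` field, while the second yields `brain1_ATTCTAAG`.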


Hi @swbarnes2,

Thanks for the reply! Unfortunately, changing the name to brain1_ATTCTAAG_S1_L001_R2_001.fastq.gz didn't work. Very frustrating...

I doubt this is the cause, but I noticed that the original file names contain either L1 or L5, and that all of the files are R2:

180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz
180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-CCCGATTA_S1_L001_R2_001.fastq.gz
180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-GAATCCGC_S1_L001_R2_001.fastq.gz
180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-TGGAGGCT_S1_L001_R2_001.fastq.gz
180908_I127_FCHWNVWBBXX_L5_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz
180908_I127_FCHWNVWBBXX_L5_CDKPEI180829002-CCCGATTA_S1_L001_R2_001.fastq.gz
180908_I127_FCHWNVWBBXX_L5_CDKPEI180829002-GAATCCGC_S1_L001_R2_001.fastq.gz
180908_I127_FCHWNVWBBXX_L5_CDKPEI180829002-TGGAGGCT_S1_L001_R2_001.fastq.gz

The traceback mentions:

 - Make sure your files follow the correct naming convention, e.g. SampleName_S1_L001_R1_001.fastq.gz **(and the R2 version)**

These were the only two obvious differences from the pbmc_1k_v3 tutorial files that I could notice. Any thoughts?


My guess is that L1 likely refers to the original lane designation which would have been L001. I am not sure why they messed up the original file names this way when submitting the data.


It's looking for files with "R1" in the name, and not finding those. You can't proceed without them.
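
A quick way to confirm this before launching a run is to list every R2 file that has no R1 mate in the same directory. This is a minimal sketch, assuming mates differ only in the `_R1_`/`_R2_` field of the file name; `missing_r1` is a hypothetical helper name.

```shell
# Hypothetical helper: list every R2 file in a directory whose R1 mate
# (same name with _R2_ replaced by _R1_) is missing.
missing_r1() {
    for r2 in "$1"/*_R2_*.fastq.gz; do
        [ -e "$r2" ] || continue          # the glob matched nothing
        r1=$(printf '%s\n' "$r2" | sed 's/_R2_\([0-9]*\.fastq\.gz\)$/_R1_\1/')
        [ -e "$r1" ] || echo "missing: $r1"
    done
}

# Check the directory that would be passed to --fastqs
# (defaults to the current directory).
missing_r1 "${1:-.}"
```

If this prints nothing, every R2 file has a matching R1 file next to it.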


R1 files are provided on the raw data page. Guess OP did not download those?


Thanks,

The reason I did that is that, according to their file with explanations (https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-8077/E-MTAB-8077.sdrf.txt), only the 8 files above are from brain. So I guess the other R1 files are from another tissue.


"So I guess the other R1 files are from another tissue."

No. Those 8 files contain the cell barcodes/UMI for the reads found in R2 files.
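
In 10x 3' gene expression libraries, R1 typically carries the 16-base cell barcode followed by the UMI (10 bases in v2 chemistry, 12 in v3), while R2 carries the cDNA read. One rough sanity check is to look at the length of the first read in each file. The sketch below uses a hypothetical helper, `first_read_length`; note that some facilities sequence R1 longer than the barcode plus UMI, so a long R1 does not by itself mean the file is wrong.

```shell
# Hypothetical helper: print the length of the first read in a gzipped
# FASTQ file (the sequence is the second line of the first record).
first_read_length() {
    gzip -dc "$1" | awk 'NR == 2 { print length($0); exit }'
}

# Usage (paths are placeholders):
#   first_read_length <path>/<sample>_S1_L001_R1_001.fastq.gz
#   first_read_length <path>/<sample>_S1_L001_R2_001.fastq.gz
```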


Ok I see, I'm only now seeing that these are multiplexed reads. Unfortunately I haven't done de-multiplexing yet. I'll work on that...

But if, in the meantime, anyone is able to tell me whether there is any way I could determine if the processed file they deposited is normalized or not, I would really appreciate it.
