Dear all,
I am trying to use CellRanger 'count' function on the 10x single-cell data deposited here (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8077/). I tried it two ways:
1) While in the directory where the fastq.gz files are:
cellranger count \
--id=Kalucka_endo \
--fastqs=<path> \
--transcriptome=[path]/refdata-cellranger-mm10-3.0.0
2) with the sample 'prefix':
cellranger count \
--id=Kalucka_endo \
--sample=<sample_ID_prefix> \
--fastqs=<path> \
--transcriptome=<path>/refdata-cellranger-mm10-3.0.0
They state that they used 10x for sequencing, but it isn't clear to me whether their files were named using bcl2fastq or demux. Two examples of their file names are 180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz and 180908_I127_FCHWNVWBBXX_L5_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz.
I tried using 10x tutorial data here: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_ct#explorecount
and was successfully able to get it running using the command line from 1) while in the directory with the fastq.gz files.
When I try either of these two ways of running it with this data, I get this error message:
`cellranger count (3.1.0) Copyright (c) 2019 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
Invalid path/prefix combination: , None No input FASTQs were found for the requested parameters.
If your files came from bcl2fastq or mkfastq:
- Make sure you are specifying the correct --sample(s), i.e. matching the sample sheet
- Make sure your files follow the correct naming convention, e.g. SampleName_S1_L001_R1_001.fastq.gz (and the R2 version)
- Make sure your --fastqs points to the correct location.
Refer to the "Specifying Input FASTQs" page at https://support.10xgenomics.com/ for more details.`
I tried renaming the files to be more consistent with the bcl2fastq naming convention, i.e.
<sample name>_<barcode sequence>_L<lane>_R<read number>_<set number>.fastq.gz
For example, I renamed
180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz to brain1_ATTCTAAG_L001_R2_001.fastq.gz
and
180908_I127_FCHWNVWBBXX_L5_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz to brain2_ATTCTAAG_L001_R2_001.fastq.gz
With these changes, I am still getting the aforementioned error.
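One way to see how the original and renamed files compare against the convention quoted in the error message (SampleName_S1_L001_R1_001.fastq.gz) is a small Python check. This is only a sketch: the regular expression is an assumption derived from that single example name, not Cell Ranger's actual matching logic, and `parse_fastq_name` is a hypothetical helper name.

```python
import re

# Assumed pattern, based only on the example in the error message:
# SampleName_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_001\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the parts of the file name as a dict, or None if it does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


original = "180908_I127_FCHWNVWBBXX_L1_CDKPEI180829002-ATTCTAAG_S1_L001_R2_001.fastq.gz"
renamed = "brain1_ATTCTAAG_L001_R2_001.fastq.gz"

for name in (original, renamed):
    parts = parse_fastq_name(name)
    if parts is None:
        print(name, "-> does not match the assumed pattern")
    else:
        print(name, "-> sample name:", parts["sample"])
```

Under this assumed pattern, the original name already matches, with everything before `_S1` as the sample name, while the renamed file does not match because the `_S1` field was dropped. If Cell Ranger behaves the same way, the value for --sample would be the full prefix before `_S1`, but that should be confirmed against the "Specifying Input FASTQs" page.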
The only other forum question that I was able to find was this one: "cellranger count help". I don't believe I am having the same issue, since I get the error even when using no sample name and '.' for --fastqs.
Any thoughts or suggestions are greatly appreciated!
Processed count data are provided as processed data files (LINK) on the page you linked above. You could just use those.
Hi @genomax,
Yes, I actually started with those files, but they're problematic (or at least not clear). Long story short, it is not clear to me whether these are normalized reads. They are definitely post-processed, because the authors write in the paper that they 'excluded' cells that belong to non-endothelial clusters, and the file is clearly missing a large number of genes, most of which are the ones they used for exclusion (for example, Lyve1 was excluded and there is no Lyve1 in this data matrix). The fact that the files say 'raw_count' confuses me as well.
I tried extensively to figure out whether these are normalized, but I can't. We went so far as to email the authors but haven't heard back from them.
If you or anyone else has any suggestion as to how I can figure out whether these are normalized read counts, I would really appreciate that too, since it seems unlikely that I'm going to figure out this weird fastq issue anytime soon.
These are unlikely to be normalized. The file name indicates that these are counts files, probably from the `cellranger count` analysis which you are trying to do. Since this is single-cell data, these matrix files are going to be sparse with a lot of zeros.
I tend to agree, but what concerned me was the fact that it looks like this would have had to be done post-clustering to identify non-EC clusters, in which case it should be normalized before identifying those.
So you think they basically took those cells out and then just provided the raw (non-normalized) counts for the cells that were left?
The matrix only has some 9,300 or so genes in total, which left me somewhat confused as to what stage of the analysis it comes from.
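A quick heuristic for the raw-versus-normalized question is to check whether every value in the matrix is a non-negative whole number. The sketch below uses made-up toy values, not the deposited data, and `looks_like_raw_counts` is a hypothetical helper. Passing the check is only a hint: values that were normalized and then rounded would also pass, and removing cells or genes does not change the remaining counts.

```python
import math


def looks_like_raw_counts(rows):
    """Return True if every value is a finite, non-negative whole number."""
    for row in rows:
        for value in row:
            v = float(value)
            # Raw UMI/read counts cannot be negative, fractional, NaN or infinite.
            if not math.isfinite(v) or v < 0 or v != int(v):
                return False
    return True


# Toy genes-by-cells matrices (hypothetical values for illustration only).
raw_like = [[0, 3, 1], [12, 0, 0]]
normalized_like = [[0.0, 1.386, 0.693], [2.565, 0.0, 0.0]]

print(looks_like_raw_counts(raw_like))
print(looks_like_raw_counts(normalized_like))
```

Fractional values, as in the second toy matrix, would point to log-transformed or scaled data. An all-integer matrix is consistent with filtered raw counts but does not prove it.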