10x v2 scRNA-seq dataset: SRR files return only R1, how to run cellranger?
5 months ago
jo.bac • 0

I am trying to run an scRNA velocity analysis with velocyto. A preliminary step is to run cellranger on the FASTQ files, but I have an issue with my chosen dataset.

Here is what I did:

• Downloaded each SRR from GSE104323 using a loop with fastq-dump --split-files --origfmt --gzip SRR6084134
• This yielded only one FASTQ file per SRR instead of the 3 expected files (R1, R2, I1)
• I renamed my fastqs to:

SRR_S0_L001_R1_001.fastq.gz SRR_S1_L001_R1_001.fastq.gz ...

Then tried to run:

cellranger count --id=test --fastqs=fastqgz/ --transcriptome=refdata-gex-mm10-2020-A


which failed with error:

The read lengths are incompatible with all the chemistries for Sample SRR in "/mnt/c/Users/jobac/Downloads/SRA_split_files/GSE104323".
- read1 median length = 98
- read2 median length = 0
- index1 median length = 0


I suppose the problem is that I have only one file instead of separate R1, R2 and I1 files. How can I obtain them for this dataset, or work around this issue?

Thanks a lot for the help!

RNA-Seq sequencing geo • 321 views
5 months ago
GenoMax 99k

As best I can tell, the submitter deposited these samples in an unusual format in which the cell barcode and UMI have been moved into the header of the R1 fastq files. This data is going to be a pain to deal with, since it will likely require some custom coding to get it back into R1, I1, R2 format. Here is an example of one read:

@K00110:126:HJJTVBBXX:4:1101:20151:5464_CACGGATGGG_CAGTGCATGGATGG_
ACAAGGACGGGATAAAGTCCGAGAAATGTTCATGAAGAATGCCCATGTCACAGACCCCAGAGTGGTTGATCTGCTGGTCATTAAGGGAAAGATGGAGC
+K00110:126:HJJTVBBXX:4:1101:20151:5464_CACGGATGGG_CAGTGCATGGATGG_
AAFFFJJJFJJFJJJFJJJJJJJAFFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJ


On top of that, there are hundreds of such samples.

If you are just trying to use this as a test, then find some other dataset with proper R1, I1, R2 files.


Thanks a lot! I looked in detail at one example and found this comment by the authors:

"The Unique Molecular Identifier and cellular barcode corresponding to each read have been appended to the read id separated by an underscore."

Unfortunately, I need to analyze this data specifically. It is my first time dealing with fastq files, and I am a bit at a loss as to how to split them back. If someone has time to work through one example, showing which parts of the read go back into each of the three files, I could write a custom script.


v2 barcodes should be 16 bp and UMIs 10 bp. It appears that the barcodes in the example above may have been sequenced as only 14 bp.

So from the header

@K00110:126:HJJTVBBXX:4:1101:20151:5464_CACGGATGGG_CAGTGCATGGATGG_

the two appended fields are:

CAGTGCATGGATGG - cell barcode (14 bp)
CACGGATGGG - UMI (10 bp)

The Illumina index is likely gone for good. You could pick any one of the valid Illumina indexes and use it to create a fake I1 file.
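
For what it is worth, the split described above could be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions: that every header follows the `@<read id>_<UMI>_<barcode>_` layout of the example read, that R1 should be rebuilt as barcode followed by UMI (the usual 10x layout), and that the sequence in the file is the cDNA read that belongs in R2. The original base qualities of the barcode and UMI are not in the header, so a placeholder quality character is used. The function names and the `index_seq` argument are made up for illustration; `index_seq` has to be supplied by you if you want a fake I1.

```python
import gzip


def split_record(header, seq, qual, index_seq=None, fill_qual="F"):
    """Rebuild R1 (barcode + UMI), R2 (cDNA) and optionally I1 from one read.

    Assumes the header looks like '@<read id>_<UMI>_<barcode>_'.
    Each returned record is a (header, sequence, quality) tuple.
    """
    name = header.split()[0]  # drop anything after whitespace, if present
    read_id, umi, barcode = name.rstrip("_").rsplit("_", 2)
    r1_seq = barcode + umi
    # Placeholder qualities: the real ones were not kept in the header.
    r1 = (read_id, r1_seq, fill_qual * len(r1_seq))
    r2 = (read_id, seq, qual)
    i1 = None
    if index_seq:
        # Fake I1: the same user-supplied index for every read.
        i1 = (read_id, index_seq, fill_qual * len(index_seq))
    return r1, r2, i1


def format_record(record):
    """Format a (header, sequence, quality) tuple as four FASTQ lines."""
    read_id, seq, qual = record
    return f"{read_id}\n{seq}\n+\n{qual}\n"


def split_fastq(in_path, r1_path, r2_path):
    """Stream a gzipped FASTQ and write rebuilt R1 and R2 files."""
    with gzip.open(in_path, "rt") as fin, \
            gzip.open(r1_path, "wt") as f1, \
            gzip.open(r2_path, "wt") as f2:
        while True:
            header = fin.readline().strip()
            if not header:  # end of file
                break
            seq = fin.readline().strip()
            fin.readline()  # skip the '+' separator line
            qual = fin.readline().strip()
            r1, r2, _ = split_record(header, seq, qual)
            f1.write(format_record(r1))
            f2.write(format_record(r2))
```

With hypothetical file names, a call would look like `split_fastq("SRR6084134_1.fastq.gz", "SRR_S1_L001_R1_001.fastq.gz", "SRR_S1_L001_R2_001.fastq.gz")`. Note that this produces a 24 bp R1 (14 bp barcode + 10 bp UMI), which is shorter than the 26 bp a v2 library would give, so it is worth checking which chemistry the authors actually used before expecting cellranger to accept the result.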