SRA-tools fasterq-dump and cellranger issues
11 weeks ago
vishvak2000 ▴ 10

Hello, I am trying to download FASTQ files (SRR12273024) with fasterq-dump/fastq-dump from sra-tools. I have tried the --split-files and -s flags; however, I only get 1 FASTQ file.

@SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=109
GATGTANAGAACGCGACTTCCACAAACCTGGATTTTTTATGTACAACCCTGACCCNGACCGTTTGCTATATTCCTTTTTCTATGAAATAATGTGAATGATAATAAAACA
+SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=109
DDDDDI#<<EHIIIHIIIIIIIIIHEHIIIIFHHHIIIIIHHIIIIIIHHIIHHH#<<DGHHIHIHIEHEHHHFHHIIIIIIIH?EEHH@HIIHIIIIFEHDDHHHHHH
@SRR12273024.2 SN7001050R:482:HYKG3BCXX:1:1101:1096:2166 length=109
AAGGTACCTGGGTTCAACTAAAGCGCCAGCCTGCTCCACCCAGAGAAGCACACTTTGTGAGAACCAATGGGAAGGAGCCTGAGCTGCTGGAACCTATTCCCTATGAATT
+SRR12273024.2 SN7001050R:482:HYKG3BCXX:1:1101:1096:2166 length=109
DDDDDIHIIIHIFHIIIIIIIIIIGIHIIIIIIIIIHIHHHIGHIHIHI?GHHG?GFHHDH@FG<<CHGHIGHHIHHHEHH1FHIIIIIIIGHEHHIIHGHDGHHHHGI
@SRR12273024.3 SN7001050R:482:HYKG3BCXX:1:1101:1086:2183 length=109
CAGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

I have tried various ways to de-interleave the FASTQ file, including the methods outlined in gist.github.com/nathanhaigh/3521724 and biostars.org/p/19446/; however, none of these methods output FASTQ files that are compatible with the Cell Ranger pipeline.

When I try to run cellranger count on the file, I get this error:

Log message: The read lengths are incompatible with all the chemistries for Sample SRR12273024 in ./

• read1 median length = 109
• read2 median length = 0
• index1 median length = 0

We expect that at least 50% of the reads exceed the minimum length.

I've looked into this error and it seems like the dataset is paired-end, which is why I have been trying to split the files using sra-tools to no avail.

Any help is appreciated!

3
11 weeks ago
ATpoint 55k

It seems there is one read per spot, but you can split the spots to get four different reads: the index (1), the cDNA (2), and the cellular barcode + UMI (3+4). This matches the description of the old v1 chemistry under "(8) Final library structure" here:

https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3v1.html

$ fasterq-dump SRR12273024 --split-spot --include-technical --split-files
$ head -n 4 SRR12273024*
==> SRR12273024_1.fastq <==
@SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=8
GCCAACAA
+SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=8
DDDDDHII

==> SRR12273024_2.fastq <==
@SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=109
GATGTANAGAACGCGACTTCCACAAACCTGGATTTTTTATGTACAACCCTGACCCNGACCGTTTGCTATATTCCTTTTTCTATGAAATAATGTGAATGATAATAAAACA
+SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=109
DDDDDI#<<EHIIIHIIIIIIIIIHEHIIIIFHHHIIIIIHHIIIIIIHHIIHHH#<<DGHHIHIHIEHEHHHFHHIIIIIIIH?EEHH@HIIHIIIIFEHDDHHHHHH

==> SRR12273024_3.fastq <==
@SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=14
GCACGTCTTATTCC
+SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=14
@?D?DDE@GC?11<

==> SRR12273024_4.fastq <==
@SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=10
CGTGCCACAG
+SRR12273024.1 SN7001050R:482:HYKG3BCXX:1:1101:1163:2092 length=10
DDDDDHIIII


I'm not sure how to get that into CellRanger format, though; I have never used CellRanger or v1 chemistry myself. Others may know, or you can check the CellRanger manual for the expected file formats. Does it expect two, three, or even all four files?

0

Thanks! This seems to have worked out. Additionally, following solution (ii) outlined here seems to make the FASTQ files compatible with the cellranger pipeline. I am curious, though: the paper mentions v2 chemistry, and the SRA page indicates that there are 3 reads. Is there any way to configure fasterq-dump to get the 3 files: R1, R2 and I1?

2

If you look at the Data Access tab of the SRA record for this run, there are four files uploaded. It appears the submitters may have split the UMIs and barcodes into separate files. If you need R1, R2 and I1 files, you will need to rename SRR12273024_1.fastq to SRR12273024_I1.fastq, then merge SRR12273024_3.fastq and SRR12273024_4.fastq to recreate the SRR12273024_R1.fastq file.

0

Thanks! How would I go about merging the FASTQ files? Additionally, does the number of files, or the way the FASTQ files are uploaded to SRA, have anything to do with the type of Chromium chemistry used (v1, v2, etc.)?

In the case of SRR12273037, which is from the same experiment and thus v2 chemistry, there are only 3 files as opposed to 4. To download this run, would I also need to use the following flags: --split-spot --include-technical --split-files?

0

I am not sure why the authors uploaded the data this way. I have not worked with v1 10x chemistry, so I don't know if that has some bearing on this. It would be unusual to mix chemistries in one experiment, but ...

As it stands, you will need to write some code to match the fastq headers for each record and then merge files 3 and 4 to get the 24 bp read.
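
A minimal sketch of that merge step in Python, in case it helps. It is an illustration under assumptions, not a tested recipe for this dataset: it assumes both files list the spots in the same order (as the fasterq-dump output above suggests), that each record is exactly four lines, and that the 14 bp read (file 3) should come before the 10 bp read (file 4). The function names `read_fastq`, `read_id` and `merge_fastq` are made up for this example. Whether Cell Ranger accepts the resulting 24 bp read depends on the chemistry, so check that against the Cell Ranger documentation.

```python
import itertools
from typing import Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header line, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence, quality) from a 4-lines-per-record FASTQ stream.

    Records are read strictly in groups of four lines, so a quality string
    that happens to start with '@' is not mistaken for a header.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def read_id(header: str) -> str:
    """Return the spot ID, e.g. 'SRR12273024.1', from a FASTQ header line."""
    return header[1:].split()[0]


def merge_fastq(first: TextIO, second: TextIO, out: TextIO) -> int:
    """Concatenate matching records of two FASTQ streams; return the record count.

    Assumption: 'first' holds the part that should come first in the merged
    read (here the 14 bp file), 'second' the part appended to it (the 10 bp file).
    """
    count = 0
    pairs = itertools.zip_longest(read_fastq(first), read_fastq(second))
    for a, b in pairs:
        if a is None or b is None:
            raise ValueError("input files have different numbers of records")
        if read_id(a[0]) != read_id(b[0]):
            raise ValueError(f"header mismatch: {a[0]!r} vs {b[0]!r}")
        # Drop the stale 'length=' token, since the merged read is longer.
        header = " ".join(t for t in a[0].split() if not t.startswith("length="))
        out.write(f"{header}\n{a[1] + b[1]}\n+\n{a[2] + b[2]}\n")
        count += 1
    return count
```

To use it, open SRR12273024_3.fastq and SRR12273024_4.fastq for reading and SRR12273024_R1.fastq for writing, then call `merge_fastq(f3, f4, out)`. The header check makes the script stop with an error rather than silently pairing reads from different spots.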