Problems with fastq-dump generating a single fastq file
18 months ago

Hello everyone,

I am trying to get the fastq files for the dataset with GEO number GSE122960. Unfortunately, FTP links for these files are not available (I could only find links to an SRA-formatted fastq, which I cannot use). So I decided to use fastq-dump...

I did the following:

prefetch SRR8085151


And then:

fastq-dump --split-files --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --clip --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' /content/SRR8085151 -I


However, fastq-dump returns a single fastq, instead of two fastqs as specified by --split-files. What am I doing wrong? The resulting file name is SRR8085151_pass_1.fastq.gz

Thanks so much!!!


It seems that the file was originally uploaded as a BAM. Possibly the reads have incompatible names (see https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#bam-files ). Please check the original BAM from https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8085151 -> Data Access: https://sra-pub-src-1.s3.amazonaws.com/SRR8085151/D246ali2_possorted_genome.bam.1 (Sorry, I didn't check it myself because I only have limited internet access right now.) (Update) Oops, sorry, GenoMax posted nearly the same thing while I was checking the linked data.


Hello, thank you so much!! Here is the funny part though... I had tried that, but it turns out that the sample doesn't belong to GSE122960, but to GSE121600... I have been trying to understand what is going on here, but I can't. Specifically, the BAM file you linked corresponds to GSM3439913, which just isn't one of the samples in that GSE! Am I going crazy??


Where did you get SRR8085151?

On the page for GSE122960 ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122960 ), the comments are "Raw data not provided for this record Processed data provided as supplementary file".

The comments in the same section of https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121600 are "Raw data are available in SRA Processed data provided as supplementary file"

I assume GSE122960 does not have SRA entries.

18 months ago
GenoMax 115k

This is a single-end dataset so there will only be one file. Don't use --split-files.
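One quick way to check how many reads a run actually stores per spot, rather than relying on the listed layout, is to dump only the first spot and count the FASTQ lines. The sketch below assumes sra-tools is installed; `reads_per_spot` is a hypothetical helper name, and the fastq-dump call is left as a comment because it needs network access.

```shell
# Count reads per spot from FASTQ text on stdin (4 lines per read).
# reads_per_spot is an illustrative helper, not part of sra-tools.
reads_per_spot() {
    n=$(wc -l | tr -d ' ')
    echo $(( n / 4 ))
}

# Usage (assumes sra-tools on PATH and network access; not run here):
#   fastq-dump -X 1 -Z --split-spot SRR8085151 | reads_per_spot
# A result of 1 means only one read per spot is stored in the SRA object.
```

If this reports 1 for a run that is listed as paired, the second read is probably missing from the SRA object itself, and no combination of fastq-dump flags will recover it.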


They list it as paired end in GEO/SRA (and you need to do paired end with 10X). My guess as to why there is one file is that they didn't include the read that has the barcodes and UMI or there was a problem with the upload, which is what probably caused the confusion for OP.


Thanks a lot for your replies. Yes, it is paired end, that is why I was expecting two files. If I am missing that read... how could I possibly analyze this dataset? Is this some error from the authors who uploaded it?


Ah, I now see that this is a 10x dataset. Can you download the BAM file available from AWS? It seems to be freely available (does not need payment) under the Data Access tab in the link I provided above. Then use the bamtofastq (LINK) utility made available by 10x to see if you can extract the right pair of reads.
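The two steps above might look roughly like the sketch below. The URL is the one quoted earlier in this thread; the local file name, the output directory name, and the `run` dry-run wrapper are illustrative choices, and the sketch assumes the 10x binary is on PATH under the name `bamtofastq`.

```shell
# URL copied from the Data Access tab quoted earlier in this thread.
BAM_URL="https://sra-pub-src-1.s3.amazonaws.com/SRR8085151/D246ali2_possorted_genome.bam.1"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Download the submitter's BAM and convert it back to FASTQ.
# bam_name and out_dir are hypothetical names chosen for this example.
fetch_and_convert() {
    bam_name="SRR8085151.bam"
    out_dir="SRR8085151_fastq"
    run wget -O "$bam_name" "$BAM_URL"
    run bamtofastq "$bam_name" "$out_dir"
}

# Preview the commands without downloading anything:
#   DRY_RUN=1 fetch_and_convert
```

Running `fetch_and_convert` without `DRY_RUN=1` performs the download and conversion for real.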


Thank you so much!! As I replied to fishgolden above, when I try to download it that way, it links me to samples that don't belong to that GSE (the samples available for download are from GSE121600, not from GSE122960). I have no idea what is going on here...


How do you know that the data is from a different GSE? The links we provided are specifically for SRR8085151. Are you saying that NCBI does not have the right SRR accession tied to the right GSE? I am able to get R1, I1, and R2 reads (one example read from each below) from the BAM file we linked using the 10x bamtofastq utility.

@NB500938:92:HK33NAFXX:1:11303:20466:7681 1:N:0:0
TAGCCGGTCTGTCTATATCCACCCCA
+
AAAAAEEEEEEEEEEEEEEEAEAEEE

@NB500938:92:HK33NAFXX:1:11303:20466:7681 2:N:0:0
CCCACAGT
+
AAAAAAEA

@NB500938:92:HK33NAFXX:1:11303:20466:7681 3:N:0:0
TGAGGCCACACAGCTGGGGCGGGGACTTCTGTCTGCCTGTGCTCCATGGGGGGACGGCTCCACCCAGCCTGCGCCACTGTGTTCTTAAGAGGCTTCCA
+
AAAAAEEAEEEAEEEEEEEEEEEEEE6EEEEEEE/AEEE<EE<EAEE<EEEEE/EEEEEEE/EEE<AEEEEEEAEEEE<AE/EAEA/EEEA<EE<AA<
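The 26 bp first read above is the one that carries the cell barcode and UMI. As a rough illustration, assuming the 10x v2 layout (16 bp barcode followed by a 10 bp UMI), which the read length suggests but this thread does not confirm, the read can be split as follows. `split_r1` is a hypothetical helper name.

```shell
# Split a 26 bp R1 sequence into barcode and UMI.
# Assumption: 10x v2 layout, bases 1-16 = cell barcode, bases 17-26 = UMI.
split_r1() {
    seq="$1"
    barcode=$(printf '%s' "$seq" | cut -c1-16)
    umi=$(printf '%s' "$seq" | cut -c17-26)
    echo "barcode=$barcode umi=$umi"
}

# R1 sequence taken from the example read above.
split_r1 TAGCCGGTCTGTCTATATCCACCCCA
# prints: barcode=TAGCCGGTCTGTCTAT umi=ATCCACCCCA
```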


Hello, thanks for your reply. Yes, I am able to get the 3 files too. Here is the thought process behind my concern. If I search for SRR8085151 in SRA Explorer, I get that its accession is GSM3439913. If I google GSM3439913, I see that it belongs to GSE121600. If I go to GSE121600 and select that GSM number, it returns the same SRR. I don't see any obvious connection between the two GSEs, but the one I am interested in (GSE122960) links to this other one (GSE121600).

Am I doing something wrong?


GEO accession GSE121600 belongs to bioproject PRJNA507000. It looks like there are 17 biosamples in this project. If you click on one sample, e.g. DONOR_01, you can get to its listing. Click on GEO Sample GSM3489182 (LINK). Scroll to the bottom of the page to find the data matrices. There are 16 more such samples.