Problems with fastq-dump generating a single fastq file
3.5 years ago

Hello everyone,

I am trying to get the FASTQ files for the dataset with GEO accession GSE122960. Unfortunately, FTP links for these files are not available (I could only find links to an SRA-formatted file, which I cannot use directly), so I decided to use fastq-dump.

I did the following:

prefetch SRR8085151

And then:

fastq-dump --split-files --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --clip --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' /content/SRR8085151 -I

However, fastq-dump returns a single fastq, instead of two fastqs as specified by --split-files. What am I doing wrong? The resulting file name is SRR8085151_pass_1.fastq.gz

Thanks so much!!!

fastq-dump pair-ended fastq

It seems that the file was originally uploaded as a BAM. Possibly the reads have incompatible names (see https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#bam-files ). Please check the original BAM: go to https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8085151 -> Data Access, which lists https://sra-pub-src-1.s3.amazonaws.com/SRR8085151/D246ali2_possorted_genome.bam.1 (Sorry, I didn't check it myself because I have very limited internet access right now.) (Update) Oops, sorry, GenoMax posted nearly the same thing while I was checking the link.

Hello, thank you so much! Here is the funny part though: I had tried that, but it turns out that the sample doesn't belong to GSE122960 but to GSE121600. I have been trying to understand what is going on here, but I can't figure it out. Specifically, the BAM file you linked corresponds to GSM3439913, which just isn't one of the samples of that GSE! Am I going crazy?

Where did you get SRR8085151?

On the GSE122960 page ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122960 ), the comments are "Raw data not provided for this record Processed data provided as supplementary file".

The comments in the same section of https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121600 are "Raw data are available in SRA Processed data provided as supplementary file"

I assume GSE122960 does not have SRA entries.

3.5 years ago
GenoMax 141k

This is a single-end dataset so there will only be one file. Don't use --split-files.

They list it as paired end in GEO/SRA (and you need to do paired end with 10X). My guess as to why there is one file is that they didn't include the read that has the barcodes and UMI or there was a problem with the upload, which is what probably caused the confusion for OP.
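
One way to test that guess without downloading the whole run is to dump only the first spot and count the reads it contains. This is a rough sketch, assuming the SRA Toolkit is installed; the function name reads_in_first_spot is made up for illustration.

```shell
# Count how many reads fastq-dump emits for the first spot of a run.
# --split-spot writes each read of a spot as its own 4-line FASTQ record,
# -X 1 stops after the first spot, and -Z sends the output to stdout.
reads_in_first_spot() {
    fastq-dump -X 1 -Z --split-spot "$1" \
        | awk 'NR % 4 == 1 { n++ } END { print n + 0 }'
}

# Example (needs the SRA Toolkit plus network access or a prefetched run):
# reads_in_first_spot SRR8085151
```

A result of 1 would mean the archive holds a single read per spot, in which case --split-files has nothing to split. The count includes technical reads, because --skip-technical is not passed here.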

Thanks a lot for your replies. Yes, it is paired end, that is why I was expecting two files. If I am missing that read... how could I possibly analyze this dataset? Is this some error from the authors who uploaded it?

Ah, I now see that this is a 10x dataset. Can you download the BAM file available from AWS? It seems to be freely available (does not need payment) under the Data Access tab in the link I provided above. Then use the bamtofastq (LINK) utility made available by 10x to see if you can extract the right pair of reads.
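
A minimal sketch of that route, assuming curl and 10x's bamtofastq are both on the PATH. The BAM URL is the one from the Data Access tab quoted earlier in the thread; the function name and the local file and directory names are placeholders.

```shell
# Sketch: download the submitter's BAM for SRR8085151 and convert it to FASTQ.
bam_url="https://sra-pub-src-1.s3.amazonaws.com/SRR8085151/D246ali2_possorted_genome.bam.1"

bam_to_fastq() {
    url=$1
    bam=$2
    out_dir=$3
    # Stop early with a clear message if either tool is missing.
    for tool in curl bamtofastq; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 1
        }
    done
    # -L follows redirects, -o names the local copy.
    curl -L -o "$bam" "$url" || return 1
    # bamtofastq takes the BAM and an output directory as positional arguments.
    bamtofastq "$bam" "$out_dir"
}

# Example call (commented out because the BAM is a large download):
# bam_to_fastq "$bam_url" SRR8085151.bam SRR8085151_fastq
```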

Thank you so much!! As I replied to fishgolden above, when I try to download it doing that, it links me to samples that don't belong to that GSE (samples available for download are from GSE121600, not from GSE122960). I have no idea what is going on here...

How do you know that the data is from a different GSE? The links we provided are specifically for SRR8085151. Are you saying that NCBI does not have the right SRR accession tied to the right GSE? I am able to get R1, I1, R2 reads (one example read from each below) from the BAM file we linked using the 10x bamtofastq utility.

@NB500938:92:HK33NAFXX:1:11303:20466:7681 1:N:0:0
TAGCCGGTCTGTCTATATCCACCCCA
+
AAAAAEEEEEEEEEEEEEEEAEAEEE

@NB500938:92:HK33NAFXX:1:11303:20466:7681 2:N:0:0
CCCACAGT
+
AAAAAAEA

@NB500938:92:HK33NAFXX:1:11303:20466:7681 3:N:0:0
TGAGGCCACACAGCTGGGGCGGGGACTTCTGTCTGCCTGTGCTCCATGGGGGGACGGCTCCACCCAGCCTGCGCCACTGTGTTCTTAAGAGGCTTCCA
+
AAAAAEEAEEEAEEEEEEEEEEEEEE6EEEEEEE/AEEE<EE<EAEE<EEEEE/EEEEEEE/EEE<AEEEEEEAEEEE<AE/EAEA/EEEA<EE<AA<
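
A quick way to see which record is which is to tabulate the read number from each header next to the sequence length. The sketch below is plain awk wrapped in a shell function (the name fastq_read_lengths is made up here). For the three records above the lengths are 26, 8 and 98 bases, which would be consistent with a barcode+UMI read, a sample index read and a cDNA read from a 10x 3' v2 library; that interpretation is an assumption, not something confirmed in the thread.

```shell
# Summarise FASTQ records read from stdin: for each record, print the read
# number taken from the header comment (the "1" in "1:N:0:0") and the
# sequence length, separated by a tab.
fastq_read_lengths() {
    awk '
        NF == 0 { next }    # ignore blank lines between records
        { line++ }
        line % 4 == 1 { split($2, parts, ":"); read_no = parts[1] }   # header
        line % 4 == 2 { printf "%s\t%d\n", read_no, length($0) }      # sequence
    '
}

# Example, with the records above saved to example.fastq (hypothetical name):
# fastq_read_lengths < example.fastq
```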

Hello, thanks for your reply. Yes, I am able to get the 3 files too. Here is the thought process behind my concern. If I search for SRR8085151 in SRA Explorer, its sample accession is GSM3439913. If I google GSM3439913, I see that it belongs to GSE121600. If I go to GSE121600 and select that GSM number, it returns the same SRR. I don't see any obvious connection between the two GSEs, but the one I am interested in (GSE122960) links to this other one (GSE121600).

Am I doing something wrong?
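
The same lookup can be done from the command line by pulling a few columns out of the run's SRA "runinfo" CSV. The sketch below assumes Entrez Direct (esearch, efetch) is installed for the fetch step; the function name runinfo_columns is made up, and the naive comma split assumes no quoted commas in the fields. For GEO submissions the SampleName column usually carries the GSM accession, and BioProject can be compared with the BioProject listed on each GSE page.

```shell
# Print selected columns from an SRA "runinfo" CSV read on stdin.
# Columns are looked up by header name, so their order does not matter.
# Names that are not present in the header are reported as NA.
runinfo_columns() {
    awk -F',' -v wanted="$1" '
        NR == 1 {
            n = split(wanted, names, ",")
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        NF > 1 {
            line = ""
            for (j = 1; j <= n; j++) {
                value = (names[j] in col) ? $(col[names[j]]) : "NA"
                line = line (j > 1 ? "\t" : "") value
            }
            print line
        }
    '
}

# Example (needs Entrez Direct and network access):
# esearch -db sra -query SRR8085151 | efetch -format runinfo \
#     | runinfo_columns Run,SampleName,BioProject,LibraryLayout
```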

GEO accession GSE121600 belongs to BioProject PRJNA507000. It looks like there are 17 biosamples in this project. If you click on one sample, e.g. DONOR_01, you can get to its listing. Click on GEO Sample GSM3489182 (LINK). Scroll to the bottom of the page to find the data matrices. There are 16 more such samples.
