Question: Problems with fastq-dump generating a single fastq file
11 weeks ago by
pomodoro_sinensis0 wrote:

Hello everyone,

I am trying to get the FASTQ files for the dataset with GEO accession GSE122960. Unfortunately, FTP links for these files are not available (I could only find links to SRA-formatted files, which I cannot use). So I decided to use fastq-dump...

I did the following:

prefetch SRR8085151

And then:

fastq-dump --split-files --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --clip --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' /content/SRR8085151 -I

However, fastq-dump returns a single fastq, instead of two fastqs as specified by --split-files. What am I doing wrong? The resulting file name is SRR8085151_pass_1.fastq.gz

Thanks so much!!!

modified 11 weeks ago by GenoMax • written 11 weeks ago by pomodoro_sinensis

It seems that the file was originally uploaded as BAM. Possibly the reads have incompatible names (see https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#bam-files ). Please check the original BAM: go to https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8085151 -> Data Access, which links to https://sra-pub-src-1.s3.amazonaws.com/SRR8085151/D246ali2_possorted_genome.bam.1 (Sorry, I haven't checked it myself because I only have very limited internet access right now.) (Update) Oops, sorry, GenoMax posted nearly the same thing while I was checking the link data.

modified 11 weeks ago • written 11 weeks ago by fishgolden

Hello, thank you so so much!! Here is the funny part though... I had tried that, but it turns out that the sample doesn't belong to GSE122960, but to GSE121600... I have been trying to understand what is going on here, but I can't... Specifically, the BAM file you linked corresponds to GSM3439913, which just isn't one of the samples of that GSE! Am I going crazy??

written 11 weeks ago by pomodoro_sinensis

Where did you get SRR8085151?

On the GSE122960 page ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122960 ), the comments say "Raw data not provided for this record Processed data provided as supplementary file".

The comments in the same section of https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121600 are "Raw data are available in SRA Processed data provided as supplementary file"

I assume GSE122960 does not have SRA entries.

written 11 weeks ago by fishgolden
11 weeks ago by
GenoMax95k wrote:

This is a single-end dataset so there will only be one file. Don't use --split-files.

modified 11 weeks ago • written 11 weeks ago by GenoMax

They list it as paired end in GEO/SRA (and you need to do paired end with 10X). My guess as to why there is one file is that they didn't include the read that has the barcodes and UMI or there was a problem with the upload, which is what probably caused the confusion for OP.

modified 11 weeks ago • written 11 weeks ago by rpolicastro
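
A quick sanity check here is to count how many per-read files fastq-dump actually wrote for the run. This is only a sketch: the function name count_read_files is made up for illustration, and it assumes the <run>_pass_<N>.fastq.gz naming reported in the question (SRR8085151_pass_1.fastq.gz).

```shell
# Count the per-read FASTQ files written for one run.
# Assumption: files are named <run>_pass_<N>.fastq.gz, as in the question.
# count_read_files is a hypothetical helper, not part of the SRA Toolkit.
count_read_files() {
  local outdir="$1" run="$2" n=0 f
  for f in "$outdir/${run}"_pass_*.fastq.gz; do
    # If nothing matches, the glob stays literal, so check the file exists.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Example call, using the output directory from the question:
#   count_read_files fastq SRR8085151
```

A count of 1 only says that a single read per spot was written with the options used; it does not by itself say why the other read is absent.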

Thanks a lot for your replies. Yes, it is paired end, that is why I was expecting two files. If I am missing that read... how could I possibly analyze this dataset? Is this some error from the authors who uploaded it?

written 11 weeks ago by pomodoro_sinensis

Ah, I now see that this is a 10x dataset. Can you download the BAM file available from AWS? It seems to be freely available (does not need payment) under the Data Access tab in the link I provided above. Then use the bamtofastq (LINK) utility made available by 10x to see if you can extract the right pair of reads.

modified 11 weeks ago • written 11 weeks ago by GenoMax
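
The BAM route suggested above could look roughly like this. It is a sketch under assumptions: the URL is the one quoted earlier in this thread, while the local file name, the output directory, the fetch_and_convert wrapper and its DRY_RUN switch are all illustrative, and curl and bamtofastq need to be installed.

```shell
# Sketch: fetch the original 10x BAM and convert it back to FASTQ.
# BAM_URL is the link quoted earlier in this thread.
BAM_URL="https://sra-pub-src-1.s3.amazonaws.com/SRR8085151/D246ali2_possorted_genome.bam.1"

# fetch_and_convert is a hypothetical wrapper. With DRY_RUN=1 it prints
# the commands instead of running them.
fetch_and_convert() {
  local url="$1" bam="$2" outdir="$3" run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  # Download the BAM, following redirects.
  $run curl -L -o "$bam" "$url"
  # bamtofastq takes the input BAM followed by the output path.
  $run bamtofastq "$bam" "$outdir"
}

# Example (file and directory names are placeholders):
#   DRY_RUN=1 fetch_and_convert "$BAM_URL" SRR8085151.bam SRR8085151_fastq
```

Running it with DRY_RUN=1 first is a cheap way to check the paths before starting a large download.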

Thank you so much!! As I replied to fishgolden above, when I try to download it doing that, it links me to samples that don't belong to that GSE (samples available for download are from GSE121600, not from GSE122960). I have no idea what is going on here...

written 11 weeks ago by pomodoro_sinensis

How do you know that the data is from a different GSE? The links we provided are specifically for SRR8085151. Are you saying that NCBI does not have the right SRR accession tied to the right GSE? I am able to get R1, I1, and R2 reads (one example read from each below) from the BAM file we linked, using the 10x bamtofastq utility.

@NB500938:92:HK33NAFXX:1:11303:20466:7681 1:N:0:0
TAGCCGGTCTGTCTATATCCACCCCA
+
AAAAAEEEEEEEEEEEEEEEAEAEEE

@NB500938:92:HK33NAFXX:1:11303:20466:7681 2:N:0:0
CCCACAGT
+
AAAAAAEA

@NB500938:92:HK33NAFXX:1:11303:20466:7681 3:N:0:0
TGAGGCCACACAGCTGGGGCGGGGACTTCTGTCTGCCTGTGCTCCATGGGGGGACGGCTCCACCCAGCCTGCGCCACTGTGTTCTTAAGAGGCTTCCA
+
AAAAAEEAEEEAEEEEEEEEEEEEEE6EEEEEEE/AEEE<EE<EAEE<EEEEE/EEEEEEE/EEE<AEEEEEEAEEEE<AE/EAEA/EEEA<EE<AA<
modified 11 weeks ago • written 11 weeks ago by GenoMax
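
To tell the three reads apart in extracted FASTQ, printing the sequence length per read number is usually enough. A minimal sketch, assuming 4-line records with no blank lines between them and Illumina-style headers like the ones above; fastq_read_lengths is a made-up helper name.

```shell
# Print "<read number> <sequence length>" for each FASTQ record on stdin.
# Assumptions: strict 4-line records (no blank lines in between) and a
# header whose second whitespace-separated field starts with the read
# number, e.g. "@NB500938:...:7681 1:N:0:0".
fastq_read_lengths() {
  awk 'NR % 4 == 1 { split($2, tag, ":"); read = tag[1] }
       NR % 4 == 2 { print read, length($0) }'
}

# Example (the file name is a placeholder):
#   zcat some_reads.fastq.gz | head -n 4000 | fastq_read_lengths | sort | uniq -c
```

For the records quoted above, the first read is 26 bases and the second is 8 bases. A 26-base read would be consistent with a 16-base cell barcode plus a 10-base UMI (10x 3' v2 layout), and the 8-base read with a sample index, but that interpretation should be confirmed against the library chemistry.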

Hello, thanks for your reply. Yes, I am able to get the 3 files too. Here is the thought process behind my concern: If I search for SRR8085151 in SRA Explorer, I see that its sample accession is GSM3439913. If I google GSM3439913, I see that it belongs to GSE121600. If I go to GSE121600 and select that GSM number, it returns the same SRR. I don't see any obvious connection between the two GSEs, but the one I am interested in (GSE122960) links to this other one (GSE121600).

Am I doing something wrong?

written 11 weeks ago by pomodoro_sinensis
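
The same SRR-to-sample-to-project lookup can be done from the command line instead of clicking through pages. This is a sketch assuming NCBI Entrez Direct (esearch, efetch) is installed. The helper runinfo_fields is hypothetical, and the column names in the example (Run, SampleName, BioProject, LibraryLayout) are what I would expect in the runinfo CSV header, so check the header of your own output.

```shell
# Pick named columns out of a runinfo-style CSV read from stdin.
# runinfo_fields is a hypothetical helper. It splits on plain commas, so
# it will misparse quoted fields that contain commas.
runinfo_fields() {
  awk -F',' -v want="$*" '
    NR == 1 {
      # Remember the requested names and map header names to positions.
      n = split(want, names, " ")
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    NF > 1 {
      line = ""
      for (j = 1; j <= n; j++) {
        idx = col[names[j]]
        # Columns missing from the header are reported as NA.
        line = line (j > 1 ? "\t" : "") (idx ? $idx : "NA")
      }
      print line
    }'
}

# Example (needs network access and Entrez Direct):
#   esearch -db sra -query SRR8085151 | efetch -format runinfo \
#     | runinfo_fields Run SampleName BioProject LibraryLayout
```

If the sample name comes back as a GSM accession, it can then be looked up in GEO to see which series it is listed under.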

GEO accession GSE121600 belongs to BioProject PRJNA507000. It looks like there are 17 biosamples in this project. If you click on one sample, e.g. DONOR_01, you can get to its listing. Click on GEO Sample GSM3489182 (LINK). Scroll to the bottom of the page to find the data matrices. There are 16 more such samples.

modified 11 weeks ago • written 11 weeks ago by GenoMax