Question: sra to fastq conversion - paired-end file won't split
3.0 years ago by
rioualen390
France
rioualen390 wrote:

Hello,

I'm using the fastq-dump program from the SRA Toolkit suite to convert .sra files to FASTQ files, with the --split-files parameter to get separate FASTQ files for paired-end data. However, one of my files won't split...

I'm working on this dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41190

The first three samples split fine, but the last one doesn't:

@SRR400301.1 1204:1:1:1641:935 length=152
NATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR400301.1 1204:1:1:1641:935 length=152
########################################################################################################################################################
@SRR400301.2 1204:1:1:1708:951 length=152
NATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGGGCGGTACTGCGGCGCGGGGGGGNAGAGGGTAGATCTCGGGGGGGGGCGGGTGATTAAAAAAAAAAATCGGGGGG
+SRR400301.2 1204:1:1:1708:951 length=152
)333377777@@@@CCCC@@@CCC@C@@@C@@CC@C@@@C58998@@@@@C@@@@@@@@@CC@C@#######################################################################################
@SRR400301.3 1204:1:1:1765:941 length=152
NTGAAACATCTAAGTACCCCGAGGAAAAGAAATCAACCGAGATTCCCCCAGTAGCGGCGAGCGAACGGGGGGGAGCTTCGCCTTTCCCTCACGGTACTGGNTCACTATCGGTCAGTCAGGAGTATTTAGCCTTGGAGGATGCTCCCCCCATA
+SRR400301.3 1204:1:1:1765:941 length=152

All 4 samples are registered as single-end 36 bp reads in GEO, but they are clearly paired-end 2x76 bp. The FastQC report for the last file shows nothing unusual: https://github.com/rioualen/gene-regulation/blob/master/GSM1010247.png

Any clue what I'm missing here?

Thank you

sratoolkit fastq-dump fastq sra • 1.0k views
modified 3.0 years ago • written 3.0 years ago by rioualen390
3.0 years ago by
Satyajeet Khare1.5k
Pune, India
Satyajeet Khare1.5k wrote:

The dataset you are working on is a mix of single-end and paired-end samples. SRR400301 is probably not a paired-end sample: there is only one file corresponding to SRR400301 in ENA.

written 3.0 years ago by Satyajeet Khare1.5k

Hi, actually they're all paired-end samples, but registered as single-end. The FastQC report shows it clearly, but I don't get why this one is formatted so oddly...

FastQC image: https://github.com/rioualen/gene-regulation/blob/master/GSM1010247.png

modified 3.0 years ago • written 3.0 years ago by rioualen390

I am not sure how FastQC would identify a paired-end sample. The reads here don't look like paired-end reads, though.

written 3.0 years ago by Satyajeet Khare1.5k

The shape of the quality graph is typical of paired-end reads: you can see the usual quality drop around positions 76 and 152. It is the same with the other 3 samples. Plus, the samples are registered as paired-end in ENA.
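
For anyone who wants to check that quality pattern without FastQC, here is a minimal sketch that computes the mean Phred score at each read position. It assumes plain 4-line FASTQ records and a Phred+33 encoding; the function name `mean_quality_by_position` is made up for illustration and is not part of any toolkit.

```python
from collections import defaultdict


def mean_quality_by_position(fastq_lines, offset=33):
    """Return the mean Phred score at each read position.

    Assumes strict 4-line FASTQ records (header, sequence, '+', quality)
    and a Phred+33 quality encoding unless another offset is given.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for i, line in enumerate(fastq_lines):
        # The quality string is the 4th line of every record.
        if i % 4 == 3:
            for pos, ch in enumerate(line.rstrip("\n")):
                totals[pos] += ord(ch) - offset
                counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]


if __name__ == "__main__":
    import sys

    # Usage: python mean_quality.py reads.fastq
    with open(sys.argv[1]) as handle:
        for pos, q in enumerate(mean_quality_by_position(handle), start=1):
            print(pos, round(q, 2))
```

A drop at the end of each 76-base half would be consistent with two concatenated mates, though on its own it does not prove the reads are paired.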

written 3.0 years ago by rioualen390

That graph could indicate that they're actually paired, but there's no guarantee of it... that said, 152bp reads are pretty unusual, and it would make more sense to me if they were actually 2x76bp reads that got decompressed incorrectly.
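
If the 152 bp reads really are two 76 bp mates stored back to back, one possible workaround is to cut every record in half after dumping. The sketch below does that under several assumptions that are not confirmed by the dataset: each record is exactly 2 x 76 bases, mate 1 comes first, and mate 2 is stored in its original orientation (no reverse-complementing is done). The helper names and the `/1`, `/2` suffixes are illustrative choices only.

```python
def split_concatenated_record(header, seq, qual, mate_len=76):
    """Split one FASTQ record into two mates of mate_len bases each.

    Assumes the record is mate 1 followed directly by mate 2.
    """
    if len(seq) != 2 * mate_len or len(qual) != 2 * mate_len:
        raise ValueError(f"record {header!r} is not 2 x {mate_len} long")
    name = header[1:].split()[0]  # drop '@' and any description
    mate1 = (f"@{name}/1", seq[:mate_len], qual[:mate_len])
    mate2 = (f"@{name}/2", seq[mate_len:], qual[mate_len:])
    return mate1, mate2


def split_fastq(in_path, out1_path, out2_path, mate_len=76):
    """Write the two halves of every record to two separate FASTQ files."""
    with open(in_path) as fin, open(out1_path, "w") as f1, open(out2_path, "w") as f2:
        while True:
            header = fin.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            fin.readline()  # '+' separator line, not needed
            qual = fin.readline().rstrip("\n")
            mates = split_concatenated_record(header, seq, qual, mate_len)
            for out, (h, s, q) in zip((f1, f2), mates):
                out.write(f"{h}\n{s}\n+\n{q}\n")
```

Whether the resulting pairs are meaningful would still need checking, for example by aligning them and looking at the insert size distribution.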

I think the take-home message here is that .sra is a terrible way to store data if you want people to be able to use it in the future.

written 3.0 years ago by Brian Bushnell17k

Yeah, SRA is a real pain to deal with. That said, I was hoping there would be a way to bypass this problem, because I am really interested in this dataset. I contacted the author, and he said the person in charge of the files and their formatting no longer works there... Data published in these huge databases should be checked more thoroughly!

written 3.0 years ago by rioualen390