Paired-end 454 data - forward read consisting only of TCAG
3
1
7.8 years ago

Hi,

I have converted a paired-end 454 SRA file (SRR1171018.sra, Argopecten irradians) to FASTQ using fastq-dump.2.3.2:

</path/to/fastq-dump/> -F --split-files </path/to/SRR1171018.sra>


This yielded SRR1171018_1.fastq and SRR1171018_2.fastq. Despite _2 being absolutely normal, the entirety of the _1 file looks like this:

@IE4R6ZA01CKY6V
TCAG
+IE4R6ZA01CKY6V
IIII
@IE4R6ZA01EDSKW
TCAG
+IE4R6ZA01EDSKW
IIII
@IE4R6ZA01DTY42
TCAG


I initially thought that this might be single-end data incorrectly labelled as paired-end within NCBI, but converting to a single FASTQ file resulted in all reads beginning with TCAG.

I have converted at least 100 SRA files in this way in the last 2 months and have never seen this.

1. Is this just bad data?
2. Could I assemble _2 as if it were single-end to avoid losing the data?

Many thanks,

Lewis

sequence software-error RNA-Seq • 3.0k views
2
7.7 years ago
kmcarr00 ▴ 280

TCAG is the "key" sequence at the beginning of every 454 library molecule (they did change the sequence to distinguish FLX from FLX Titanium). The sequencer uses this key for two purposes: first, to identify library beads as opposed to control beads, which have a different key sequence; and second, to calibrate the signal intensity for single-base incorporation. When the original 454 SFF file was uploaded to SRA, the submitter properly identified the first 4 bases as a "technical" read and the remaining bases as the sequence read. fastq-dump with the --split-files option is correctly separating the key (technical) read from the sequence read. Discard the first (_1) file, which is meaningless, and proceed with just the sequence (_2) file.
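If you want to confirm this before discarding the _1 file, a quick check is to count how many of its reads consist of nothing but the key. Below is a minimal Python sketch, assuming plain 4-line FASTQ records as produced by fastq-dump; the function names `parse_fastq` and `key_only_fraction` are made up for illustration and are not part of any toolkit.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if qual is None or not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed or truncated record at {header!r}")
        yield header[1:], seq, qual


def key_only_fraction(lines: Iterable[str], key: str = "TCAG") -> float:
    """Return the fraction of reads whose sequence is exactly the key."""
    total = 0
    key_only = 0
    for _name, seq, _qual in parse_fastq(lines):
        total += 1
        if seq.upper() == key:
            key_only += 1
    return key_only / total if total else 0.0


# The first two records quoted in the question.
sample = """@IE4R6ZA01CKY6V
TCAG
+IE4R6ZA01CKY6V
IIII
@IE4R6ZA01EDSKW
TCAG
+IE4R6ZA01EDSKW
IIII
""".splitlines()

print(key_only_fraction(sample))
```

On a real file you would pass an open file handle instead of the sample list, e.g. `key_only_fraction(open("SRR1171018_1.fastq"))`. A result of 1.0 means every read in the file is just the technical key read.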

0

Many thanks for this, very well explained.

0
7.8 years ago

My best guess is that your file contains the barcodes for the run (although these seem shorter than usual).

Often these are included so that each read can be assigned to its multiplexed sample in a multisample run.

But if that is true, then the description of this as a paired-end run is incorrect (does 454 even offer paired-end sequencing? I have never heard of that before).

0

I have had a sequencing rep offer 'paired-end' 454 before. It's not true paired-end sequencing, where the two directions can be exactly connected; rather, both strands of a double-stranded sequence are sequenced simultaneously, so from 1000 reads you get 500 forward and 500 reverse reads from 500 DNA sequences.

0

Thanks for the reply, that would make a lot more sense. Yeah, this is admittedly the only 454 data I have used that is labelled as 'paired-end'.

0
7.7 years ago
lexnederbragt ★ 1.3k

You need to dump the SRA file to a single FASTQ file (without trying to split it) and then split each read into its pair at the 454 paired-end linker. See this thread for background and pointers.
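To illustrate the splitting step, here is a rough Python sketch. It assumes exact matching only, and the linker shown is a made-up placeholder: the real 454 linker sequence is not given in this thread and has to be supplied by you. The function name `split_on_linker` is also just for illustration. Real reads contain sequencing errors in the linker, so dedicated tools that allow mismatches will recover more pairs than this does.

```python
from typing import Optional, Tuple

Half = Tuple[str, str]  # (sequence, quality)


def split_on_linker(
    seq: str, qual: str, linker: str, min_len: int = 1
) -> Optional[Tuple[Half, Half]]:
    """Split one read at the first exact occurrence of the linker.

    Returns ((left_seq, left_qual), (right_seq, right_qual)), or None when
    the linker is absent or either side is shorter than min_len.
    """
    pos = seq.upper().find(linker.upper())
    if pos == -1:
        return None  # no linker: treat the read as unpaired
    end = pos + len(linker)
    left = (seq[:pos], qual[:pos])
    right = (seq[end:], qual[end:])
    if len(left[0]) < min_len or len(right[0]) < min_len:
        return None
    return left, right


# PLACEHOLDER linker, invented for this example; NOT the real 454 linker.
LINKER = "GGTTCCAACC"

read_seq = "ACGTACGT" + LINKER + "TTGGCCAA"
read_qual = "I" * len(read_seq)

print(split_on_linker(read_seq, read_qual, LINKER))
```

Note that reads dumped to a single FASTQ file may still start with the TCAG key, which would need trimming before or after the split, and this sketch does nothing about the relative orientation of the two halves.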

0

The fact that one needs to scour websites (and then gets conflicting information) when trying to figure out something as simple as how to get raw data from a supposedly public data repository is nothing short of mindboggling ...

0

It doesn't help that 454/Roche prefer to write as little documentation as possible and make no effort to fit into "standard" approaches.