Header conflicts in concatenated FASTQ file?
2.9 years ago
Dunois ★ 2.5k

Say I have two FASTQ files that happen to contain two different sequences with the same header, like so.

File 1:

@ABC1234
ATGCATGC
+
<<<<<<<<

File 2:

@ABC1234
TTAGTTTT
+
<<<<<<<<

If I were to concatenate these, I'd end up with duplicated headers associated with different sequences.

Tools like the de novo transcriptome assembler Trinity seem to suggest pooling reads (i.e., concatenating different FASTQ files) before assembly (e.g., for differential expression analysis). But isn't having duplicated headers attached to different sequences a problem in that case? Or do tools that accept FASTQ input re-index the sequences and discard the headers?

If this is an issue, what's the best way to deal with this?

FASTQ RNA-seq concatenate • 900 views

Are you monkeying with the FASTQ read names? Reads are usually named after the instrument ID, the run ID, and their coordinates on the flow cell, a combination that is always going to be unique.
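To make the naming scheme concrete, here is a minimal Python sketch that splits a read name into its parts. It assumes the common Casava 1.8+ header layout (`instrument:run:flowcell:lane:tile:x:y`, optionally followed by a space and a comment); headers rewritten by other tools or archives will not match. The function name `parse_illumina_name` and the example header are made up for illustration.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    run_number: str
    flowcell_id: str
    lane: str
    tile: str
    x: str
    y: str


def parse_illumina_name(header: str) -> IlluminaReadName:
    """Split a Casava 1.8+ style FASTQ header into its seven fields."""
    # Drop the leading '@' and the optional comment after the first
    # whitespace (e.g. '1:N:0:ATCACG').
    name = header.strip()
    if name.startswith("@"):
        name = name[1:]
    name = name.split()[0]
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(f"not a 7-field Illumina read name: {header!r}")
    return IlluminaReadName(*fields)


# Hypothetical header, invented for this example.
example = "@M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:ATCACG"
parsed = parse_illumina_name(example)
print(parsed.instrument, parsed.run_number, parsed.flowcell_id)
```

Two reads from different runs or flow cells differ in at least one of these fields, which is why untouched names do not collide when files are pooled.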


I'm not messing around with the sequence headers, no. I'm just asking to make sure I'm not making a colossal mistake by just cat-ing a couple of files together for assembly.


R1 and R2 reads might share the same name (though they often have a _1 or _2 appended to them). Otherwise, read IDs are naturally going to be unique if you don't mess with them as they come off the Illumina instrument.
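If you want to confirm this before cat-ing files together, a quick check along these lines may help. It is only a sketch: it assumes strict four-line FASTQ records (no wrapped sequences), keeps every ID in memory, and the helper names `read_ids` and `duplicated_ids` are made up for illustration. Header lines are found by position rather than by a leading `@`, because quality lines can also start with `@`.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def read_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield the read ID of each record, assuming 4 lines per record."""
    non_blank = (line for line in lines if line.strip())
    for i, line in enumerate(non_blank):
        if i % 4 == 0:  # first line of each record is the header
            if not line.startswith("@"):
                raise ValueError(f"expected a header line, got {line!r}")
            # Keep only the ID, dropping '@' and any trailing comment.
            yield line[1:].split()[0]


def duplicated_ids(*sources: Iterable[str]) -> Dict[str, int]:
    """Return the IDs seen more than once across all sources."""
    counts: Counter = Counter()
    for lines in sources:
        counts.update(read_ids(lines))
    return {rid: n for rid, n in counts.items() if n > 1}


# The two records from the question, as in-memory lines.
file1 = ["@ABC1234\n", "ATGCATGC\n", "+\n", "<<<<<<<<\n"]
file2 = ["@ABC1234\n", "TTAGTTTT\n", "+\n", "<<<<<<<<\n"]
print(duplicated_ids(file1, file2))  # {'ABC1234': 2}
```

Open file handles iterate line by line, so they can be passed in place of the lists; gzipped files would need to be opened with `gzip.open(path, "rt")`. An empty result means the pooled file has no clashing read IDs.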


I wasn't thinking of concatenating the pairs themselves. That puts me in the clear then.

Thank you!!
