FASTQ headers need additional info?
2.8 years ago
Anand Rao ▴ 630

I have PE Illumina reads in 2 separate sets of FASTQ files (*_1.fq.gz for forward reads & *_2.fq.gz for reverse reads). A snippet of forward reads from 1 example file (the first 24 lines) is shown at the bottom of this post.

Do the FASTQ headers shown below look OK to you (for example @E100006803L1C002R0020000005/1, one of the 6 reads shown), and will they be compatible with genome assembly tools, and then with variant discovery tools?

To my eye, these FASTQ headers are in a format that I do not recognize. I expect FASTQ headers to be in one of the following formats, per the examples on the FASTQ wiki page:

@HWUSI-EAS100R:6:73:941:1973#0/1

or

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Should I

  1. use my FASTQ files as they are, or
  2. modify these headers, and/or
  3. add info to the existing FASTQ headers that you think / know is missing?

Thanks in advance!

$ head -n 24 NB23DY28_1.fq
@E100006803L1C002R0020000005/1
AGTTGCTGCACTGTTCATAAATAGGCAATGTCAGCAAAATCACTCCAGAAAGGAGGATCTATTACAATCTGCCACTTCTCTTTACCTCAACTTATTGAAGAAAGGATTGATTGGTACTATAACAAATGGGCTTTTAACATATTGCAAGAA
+
FEEEEEEEEEEEEEEEEEEEEEEEEDEEEEEEEEEFEEEEEEEEEEE>EEEEEEDEEEEEEEEFCFEEEFEEEEEEEEEEEEFEEEEEEEEEEEEEEEEEEEEEEEEEEEDEEE>EDEBEEE;EFEEAEBEEBEE;EDEEFEEEDEEDEE
@E100006803L1C002R0020000011/1
TTTTGTTTTTAAAGATCAAATAACATAGTGTTAACTTCCCATTGAAAGCAACTCAATGCCATTTGGTTGACTTTGCCTTTTCCTTGTTCCCTACAGCAACAAAGAAATGATTAAAGAATCTTCCCTGCACATGAGCCATGCTCAATTGGT
+
9&CEE9FD=C7.9@87A8@(E4E89AFCDD)ED8CE@DDE2>EF2D/;CDB:A0E9EED3/3EAAF<E%@;E7DE2B>E3*EDEE6E>AEECEDA)E?DEE,EFB.@>E>?CEE:D*E6@EF4:EE%/5=EE-EEEE/*D06(EE:5EA0
@E100006803L1C002R0020000017/1
TTCTGTTTTTGTTTCTTTTTACCTTTTTTACTTTAAATCAGATTGAGTTCACCAAATAAATAACTAGGAAGACATTTTGAAGTATAATCTTGCTAAGATTACAAATCAACAAAGAAAAAACTTGTTTTTTAACTGATAGTATTCCAAAGG
+
EEEEE;DEEEEDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEDEEEAEECEEEEEEEEEEEEEEEEEEE?EEEEEEEDE:EEEEEEEEEEEEEEEEEEEEEEEDEEEEEEDEEEEDE=E?E?DEEEEBEDEDDDEBEEEDEDEEEAE
@E100006803L1C002R0020000019/1
CATGCCAGTCATTTCCTGCTCTATCACCAGGAAGCAATCATTCTGTGCTAATTTTTTCTACTGTAGAATAATTTTCCTCATTCTCGAGTGTCACATAAATACAAGCCTATATTATGTATTTTTGGAACTTTTTTCTTAATATGAATAATA
+
EEFEFFFFFFFFFFEFFFFFFFEEFFFFFFEFEFEFEEFEFFFDEEEEDFFEFDFFFFDEFFDEEFEFFEFFEEDEFFFEEFEDFEEEEDEFEFFEEFFEEFEEDEEEAEEFFDFEEEFFFFFDEEFEFEFFDDEDF=EECFEEFFDEBD
@E100006803L1C002R0020000021/1
CTGGTAATCAGCCCTCATCCTGAAGCTATGCAGGGGCCCCCAAGTCATGAGTCACCAGTCATCTCATTAGCATAAAAAAGATACTCATCACTCTGGCAATTCCAAGGGTTTTAAGGAGCAGTGTGCCAGAAACCAAGAAGACAAAATATA
+
EDEEEEEDDEEE?EEEEEEEEECDEFEEEEEEEECEFEEEFEEE?EE1EEEEEF;EEE,EEEEDEFEEEFAEEEEFEEEEEE;EEEEEEFEECEEEEE:DEEEEEFEFEEEDEEEECEEFEEEEFDEEEAEEEEEDDEE5B6EEEE2E9D
@E100006803L1C002R0020000028/1
TAGACCAGCAGTCGAGAATCTGGGTCCCATATCCTGCCTTGACCATGGACTTCTCCACCCATAAGGCAGTCAGGTCACATTTTCACGCTTCATTTTTCCTCTCCCAAGTGGGGCAAATATGTAAAACTAAGTAACATCTGAAGCAGTTTG
+
EE9?EEEE3DE??+DED@EBEEEE>EDEEE@@AEEEAECEE>CCEEE%AEEEEE;@BAE+D<E=E;EDD@C4B-E8EEEDEED2.E?EEEEEDCEEE7CE@E9CE?<C98ECE6EDED<EEEDEEA3DE3EE5EEEE8E8EDD1/%BEEE

Hey Anand, what is the source of these?

Personal genome project results for an individual, returned by a for-profit sequencing company

As long as the read headers in the *_2.fq files end in /2, this looks fine to me.
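One way to verify that condition is to walk both files in parallel and compare read names. The following is only a minimal sketch: it assumes plain 4-line FASTQ records (as in the snippet above) and that mates appear in the same order in both files. The function names `read_names` and `check_pairs` are made up for illustration.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield each record's header line, without the leading '@'.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0:  # header is the first line of each record
                yield line.rstrip("\n")[1:]


def check_pairs(fwd_names, rev_names):
    """Return (index, forward, reverse) for every pair that does not match.

    A pair matches when the forward name ends in /1, the reverse name
    ends in /2, and everything before the suffix is identical.
    """
    problems = []
    for i, (fwd, rev) in enumerate(zip_longest(fwd_names, rev_names)):
        if fwd is None or rev is None:  # one file has extra reads
            problems.append((i, fwd, rev))
            continue
        same_stem = fwd[:-2] == rev[:-2]
        if not (fwd.endswith("/1") and rev.endswith("/2") and same_stem):
            problems.append((i, fwd, rev))
    return problems
```

It could be called as `check_pairs(read_names("NB23DY28_1.fq"), read_names("NB23DY28_2.fq"))`, where the second file name is a guess based on the naming pattern in the question. An empty list means every forward read has a matching /2 mate in the same position.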

Thanks for clarifying.

About a month ago, I started the GATK variant analysis pipeline with step #1, converting FASTQ to uBAM format, and an early step failed (I can't remember which one off the top of my head, and those notes are hiding somewhere).

So I had started to wonder whether the FASTQ headers might be the reason, and consequently whether I should add any missing info back into the headers and/or modify their format, for example by introducing ':' delimiters between the parts of the header that look like expected components such as machine #, lane #, etc.

Anyway, thanks again, and I'll ping the forum again should any failed step look like it was caused by the FASTQ headers.
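If it ever becomes necessary to inspect the individual components of these read names, a regular expression can split them without rewriting the files. The sketch below rests on an assumption that nothing in this thread confirms: that a name such as E100006803L1C002R0020000005/1 is laid out as a flowcell ID, then `L` + lane, `C` + 3-digit column, `R` + 3-digit row, a read number, and an optional /1 or /2 mate suffix. The field names and the function `split_read_name` are hypothetical.

```python
import re

# ASSUMED layout (a guess, not confirmed by the thread):
#   <flowcell> L<lane> C<column:3 digits> R<row:3 digits> <read number> [/1 or /2]
READ_NAME = re.compile(
    r"^@?(?P<flowcell>[A-Za-z0-9]+?)"  # shortest prefix before the lane marker
    r"L(?P<lane>\d+)"
    r"C(?P<column>\d{3})"
    r"R(?P<row>\d{3})"
    r"(?P<read>\d+)"
    r"(?:/(?P<mate>[12]))?$"           # optional mate suffix
)


def split_read_name(header):
    """Return the assumed fields as a dict, or None if the name does not fit."""
    match = READ_NAME.match(header.strip())
    return match.groupdict() if match else None


fields = split_read_name("@E100006803L1C002R0020000005/1")
print(fields)
```

Headers in the two wiki formats quoted in the question do not fit this pattern, so the function returns `None` for them. Given the reply above that the headers look fine as they are, this is meant for inspection only, not as a recommendation to rewrite them.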
