I have paired-end (PE) Illumina reads in two separate sets of FASTQ files (*_1.fq.gz for forward reads and *_2.fq.gz for reverse reads). A snippet of the forward reads from one example file (the first 24 lines) is shown at the bottom of this post.
Do the FASTQ headers shown below look OK to you (one example from the six reads shown: @E100006803L1C002R0020000005/1), and will they be compatible with genome assembly tools and, later, with variant discovery tools?
To my eye, these FASTQ headers are in a format that I do not recognize. I expected FASTQ headers to be in one of the following formats, per the examples on the FASTQ wiki page:
@HWUSI-EAS100R:6:73:941:1973#0/1
or
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Should I
- use my FASTQ files as they are,
- modify these headers, and/or
- add any info that you think or know is missing from the existing FASTQ headers?
Thanks in advance!
$ head -n 24 NB23DY28_1.fq
@E100006803L1C002R0020000005/1
AGTTGCTGCACTGTTCATAAATAGGCAATGTCAGCAAAATCACTCCAGAAAGGAGGATCTATTACAATCTGCCACTTCTCTTTACCTCAACTTATTGAAGAAAGGATTGATTGGTACTATAACAAATGGGCTTTTAACATATTGCAAGAA
+
FEEEEEEEEEEEEEEEEEEEEEEEEDEEEEEEEEEFEEEEEEEEEEE>EEEEEEDEEEEEEEEFCFEEEFEEEEEEEEEEEEFEEEEEEEEEEEEEEEEEEEEEEEEEEEDEEE>EDEBEEE;EFEEAEBEEBEE;EDEEFEEEDEEDEE
@E100006803L1C002R0020000011/1
TTTTGTTTTTAAAGATCAAATAACATAGTGTTAACTTCCCATTGAAAGCAACTCAATGCCATTTGGTTGACTTTGCCTTTTCCTTGTTCCCTACAGCAACAAAGAAATGATTAAAGAATCTTCCCTGCACATGAGCCATGCTCAATTGGT
+
9&CEE9FD=C7.9@87A8@(E4E89AFCDD)ED8CE@DDE2>EF2D/;CDB:A0E9EED3/3EAAF<E%@;E7DE2B>E3*EDEE6E>AEECEDA)E?DEE,EFB.@>E>?CEE:D*E6@EF4:EE%/5=EE-EEEE/*D06(EE:5EA0
@E100006803L1C002R0020000017/1
TTCTGTTTTTGTTTCTTTTTACCTTTTTTACTTTAAATCAGATTGAGTTCACCAAATAAATAACTAGGAAGACATTTTGAAGTATAATCTTGCTAAGATTACAAATCAACAAAGAAAAAACTTGTTTTTTAACTGATAGTATTCCAAAGG
+
EEEEE;DEEEEDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEDEEEAEECEEEEEEEEEEEEEEEEEEE?EEEEEEEDE:EEEEEEEEEEEEEEEEEEEEEEEDEEEEEEDEEEEDE=E?E?DEEEEBEDEDDDEBEEEDEDEEEAE
@E100006803L1C002R0020000019/1
CATGCCAGTCATTTCCTGCTCTATCACCAGGAAGCAATCATTCTGTGCTAATTTTTTCTACTGTAGAATAATTTTCCTCATTCTCGAGTGTCACATAAATACAAGCCTATATTATGTATTTTTGGAACTTTTTTCTTAATATGAATAATA
+
EEFEFFFFFFFFFFEFFFFFFFEEFFFFFFEFEFEFEEFEFFFDEEEEDFFEFDFFFFDEFFDEEFEFFEFFEEDEFFFEEFEDFEEEEDEFEFFEEFFEEFEEDEEEAEEFFDFEEEFFFFFDEEFEFEFFDDEDF=EECFEEFFDEBD
@E100006803L1C002R0020000021/1
CTGGTAATCAGCCCTCATCCTGAAGCTATGCAGGGGCCCCCAAGTCATGAGTCACCAGTCATCTCATTAGCATAAAAAAGATACTCATCACTCTGGCAATTCCAAGGGTTTTAAGGAGCAGTGTGCCAGAAACCAAGAAGACAAAATATA
+
EDEEEEEDDEEE?EEEEEEEEECDEFEEEEEEEECEFEEEFEEE?EE1EEEEEF;EEE,EEEEDEFEEEFAEEEEFEEEEEE;EEEEEEFEECEEEEE:DEEEEEFEFEEEDEEEECEEFEEEEFDEEEAEEEEEDDEE5B6EEEE2E9D
@E100006803L1C002R0020000028/1
TAGACCAGCAGTCGAGAATCTGGGTCCCATATCCTGCCTTGACCATGGACTTCTCCACCCATAAGGCAGTCAGGTCACATTTTCACGCTTCATTTTTCCTCTCCCAAGTGGGGCAAATATGTAAAACTAAGTAACATCTGAAGCAGTTTG
+
EE9?EEEE3DE??+DED@EBEEEE>EDEEE@@AEEEAECEE>CCEEE%AEEEEE;@BAE+D<E=E;EDD@C4B-E8EEEDEED2.E?EEEEEDCEEE7CE@E9CE?<C98ECE6EDED<EEEDEEA3DE3EE5EEEE8E8EDD1/%BEEE
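To make the comparison with the two header styles quoted in the question concrete, here is a minimal Python sketch that classifies a header line. The regular expressions are my own approximations of the two FASTQ wiki examples, not a vendor specification, and `classify_header` is a hypothetical helper name.

```python
import re

# Approximate patterns for the two Illumina styles quoted in the question.
# These are assumptions based on the wiki examples, not an official grammar.
OLD_ILLUMINA = re.compile(r"^@[^:\s]+:\d+:\d+:\d+:\d+(#[^/\s]+)?(/[12])?$")
NEW_ILLUMINA = re.compile(
    r"^@[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+( [12]:[YN]:\d+:\S*)?$"
)
# Any other single-token read name that still carries a /1 or /2 mate suffix.
MATE_SUFFIX = re.compile(r"^@\S+/[12]$")


def classify_header(header: str) -> str:
    """Return a coarse label for one FASTQ header line (including the '@')."""
    if NEW_ILLUMINA.match(header):
        return "illumina_new"
    if OLD_ILLUMINA.match(header):
        return "illumina_old"
    if MATE_SUFFIX.match(header):
        return "name_with_mate_suffix"
    return "other"


if __name__ == "__main__":
    for h in (
        "@HWUSI-EAS100R:6:73:941:1973#0/1",
        "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG",
        "@E100006803L1C002R0020000005/1",
    ):
        print(h, "->", classify_header(h))
```

Under these patterns the headers in this post match neither Illumina style, but they do carry a unique read name followed by a /1 mate suffix.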
Hey Anand, what is the source of these?
Personal genome project results for an individual, returned by a for-profit sequencing company
As long as the read names in the *_2.fq files end in /2, this looks fine to me.
Thanks for clarifying.
About a month ago, I started the GATK variant analysis pipeline with step 1, converting FASTQ to uBAM format, and an early step failed (I can't remember which one off the top of my head, and those notes are hiding somewhere). So I started to wonder whether the FASTQ headers might be the reason and, consequently, whether I should add any missing info back into the headers and/or modify their format, for example by introducing ':' delimiters between the parts of the header that look like expected components such as machine number, lane number, etc.
Anyway, thanks again, and I'll ping the forum again should any failed step look like it was caused by the FASTQ headers.
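For anyone who wants to run the pairing check suggested above (read names in *_2.fq ending in /2 and otherwise matching *_1.fq), here is a rough Python sketch. It assumes plain 4-line FASTQ records in the same order in both files; the function names are hypothetical, and the reverse-read file name in the usage comment is assumed, since only the forward file is shown in this thread.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def read_names(lines):
    """Yield each record's read name (header without '@'), assuming 4-line records."""
    for i, line in enumerate(lines):
        if i % 4 == 0 and line.strip():
            yield line.rstrip("\n")[1:]


def first_mismatch(fwd_lines, rev_lines):
    """Return (record_index, fwd_name, rev_name) for the first bad pair, else None.

    A pair is good when the names are identical except for a trailing /1
    in the forward file and /2 in the reverse file.
    """
    pairs = zip_longest(read_names(fwd_lines), read_names(rev_lines))
    for idx, (fwd, rev) in enumerate(pairs):
        if fwd is None or rev is None:
            # One file has more records than the other.
            return idx, fwd, rev
        ok = fwd.endswith("/1") and rev.endswith("/2") and fwd[:-2] == rev[:-2]
        if not ok:
            return idx, fwd, rev
    return None


# Example usage (file names are assumptions):
# with open_fastq("NB23DY28_1.fq.gz") as f1, open_fastq("NB23DY28_2.fq.gz") as f2:
#     print(first_mismatch(f1, f2))
```

A result of `None` means every record paired up. Anything else points at the first record where the two files disagree, which would be worth ruling out before blaming the header format itself.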