Question: What could be wrong with my FASTQ files? Picard suggests that there is missing header information.
8 months ago by
kmurph55 wrote:

Hello, I have two FASTQ files, 3D_1.fastq and 3d_2.fastq. To the best of my knowledge, the first file contains the forward reads and the second contains the reverse reads. I can confirm that the files were generated as paired-end reads, 101 base pairs in length, with Illumina/Sanger 1.9+ encoding. The data are nucleotide sequences from a single sample, sequenced on a highseq machine. I used Bowtie2 to map the reads against a reference genome and then gave the sorted BAM file to Picard for validation. For some reason, Picard reports an error indicating that read group information is missing from the header. These are the first few lines from my first FASTQ file:

 @SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC
@SN996:194:H5V7HBCXY:1:1108:1995:2062 1:N:0:TCTCGCGC

These are the first few lines from my second FASTQ file:

@SN996:194:H5V7HBCXY:1:1108:1872:2028 2:N:0:TCTCGCGC
@SN996:194:H5V7HBCXY:1:1108:1995:2062 2:N:0:TCTCGCGC

I know that the FASTQ files were generated from a single sample, so it would make sense that they contain no read group identification, since all reads belong to that one sample. I would assume that sequencing a single sample is fairly common, and that if this information were strictly required in the header, the sequencing company would have formatted the data so that it would not block downstream analyses. Why would I be getting this error in Picard? Does anyone have a suggestion for getting past this issue?


Is the space before " @SN996" a copy-and-paste artifact from writing this post? If not, that is your problem.

written 8 months ago by Pierre Lindenbaum

Yes, this was just an error that I made in my post.

written 8 months ago by kmurph55

Illumina highseq for all your stoner sequencing!

written 8 months ago by WouterDeCoster
8 months ago by
Santosh Anand wrote:

Picard is a companion toolset to GATK, and GATK requires read group (RG) information both on each read and in the header, so Picard checks for it too. The RG information is added by the user, according to these guidelines.

First decide what your RG (read group) string should be according to those guidelines. Since you have already mapped the reads, the easiest way to add the RG information is with another Picard tool, AddOrReplaceReadGroups.
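As a rough sketch of that step, the Python snippet below derives candidate read group values from the first FASTQ header posted in the question and assembles an AddOrReplaceReadGroups command line. The BAM file names, the sample name "3D", the library name "lib1", and the choice of flowcell.lane as the ID are assumptions made for illustration, not values confirmed in this thread. The RGID, RGLB, RGPL, RGPU and RGSM arguments use Picard's KEY=value syntax, and picard.jar is assumed to be in the working directory.

```python
import shlex

def read_group_from_header(header, sample, library):
    """Derive read group fields from an Illumina-style FASTQ header line."""
    # Layout assumed: @instrument:run:flowcell:lane:tile:x:y read:filter:control:barcode
    name, _, description = header.strip().lstrip("@").partition(" ")
    instrument, run, flowcell, lane = name.split(":")[:4]
    barcode = description.split(":")[-1] if description else ""
    unit = f"{flowcell}.{lane}"
    return {
        "RGID": unit,                                       # one ID per flowcell lane (assumed convention)
        "RGPU": f"{unit}.{barcode}" if barcode else unit,   # platform unit
        "RGSM": sample,                                     # sample name, chosen by the user
        "RGLB": library,                                    # library name, chosen by the user
        "RGPL": "ILLUMINA",                                 # sequencing platform
    }

def picard_command(in_bam, out_bam, rg):
    """Build the argument list for Picard AddOrReplaceReadGroups."""
    args = ["java", "-jar", "picard.jar", "AddOrReplaceReadGroups",
            f"I={in_bam}", f"O={out_bam}"]
    args += [f"{key}={value}" for key, value in rg.items()]
    return args

rg = read_group_from_header(
    "@SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC",
    sample="3D", library="lib1",
)
# Print the command so it can be inspected before running it in a shell.
print(shlex.join(picard_command("3D.sorted.bam", "3D.rg.bam", rg)))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the resulting line can be pasted into a terminal once the file names match your own data.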

Next time, you can also supply the RG information at mapping time. Bowtie2 can do this with the following option:

--rg-id <text>
Set the read group ID to <text>. This causes the SAM @RG header line to be printed, with <text> as the value associated with the ID: tag. It also causes the RG:Z: extra field to be attached to each SAM output record, with value set to <text>.
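To illustrate how this might look in practice, here is a small Python sketch that assembles a paired-end Bowtie2 call with a read group attached. Besides --rg-id it uses Bowtie2's companion --rg option, which adds further TAG:VALUE fields to the @RG line. The index name "ref_index", the output name, and all read group values are placeholders assumed for the example.

```python
import shlex

def bowtie2_command(index, fastq1, fastq2, out_sam, rg_id, rg_fields):
    """Build a paired-end bowtie2 argument list that sets a read group."""
    args = ["bowtie2", "-x", index, "-1", fastq1, "-2", fastq2,
            "--rg-id", rg_id]              # sets ID: on @RG and RG:Z: on each record
    for tag, value in rg_fields.items():
        args += ["--rg", f"{tag}:{value}"]  # extra fields on the @RG header line
    args += ["-S", out_sam]
    return args

cmd = bowtie2_command(
    "ref_index", "3D_1.fastq", "3d_2.fastq", "3D.sam",
    rg_id="H5V7HBCXY.1",
    rg_fields={"SM": "3D", "LB": "lib1", "PL": "ILLUMINA"},
)
# Print the command for inspection instead of running it.
print(shlex.join(cmd))
```

With the read group set at mapping time, a later AddOrReplaceReadGroups pass should not be needed for this file.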

Remember that RG information is absolutely necessary for most GATK analyses.
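One way to check whether a file carries this information is to look at its SAM text directly. The sketch below is a minimal, assumption-laden check rather than a replacement for Picard's own validation: it collects the IDs from @RG header lines and counts alignment records whose RG:Z: tag is missing or does not match a header ID. The four example lines are made-up toy records, not data from this thread.

```python
def read_group_report(sam_lines):
    """Return (header RG IDs, number of records, records without a valid RG tag)."""
    header_ids = set()
    total = 0
    untagged = 0
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            # Header section: remember every ID declared on an @RG line.
            if fields[0] == "@RG":
                header_ids.update(f[3:] for f in fields[1:] if f.startswith("ID:"))
            continue
        # Alignment record: optional tags start at the 12th column.
        total += 1
        rg_values = [t[5:] for t in fields[11:] if t.startswith("RG:Z:")]
        if not rg_values or rg_values[0] not in header_ids:
            untagged += 1
    return header_ids, total, untagged

# Toy SAM content, invented for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:H5V7HBCXY.1\tSM:3D",
    "r1\t0\tchr1\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tRG:Z:H5V7HBCXY.1",
    "r2\t0\tchr1\t200\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
]
ids, total, untagged = read_group_report(example)
print(ids, total, untagged)
```

On real data the lines could come from the text output of `samtools view -h`; an empty ID set or a non-zero untagged count would be consistent with the kind of error described in the question.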


Thanks! I didn't realize that this information needed to be set by the user.

written 8 months ago by kmurph55