Question

The Meaning Of B In Illumina 1.5 Pipeline Data?

14.2 years ago

Martin A Hansen 3.0k

Hello all,

I have observed an anomaly in all Illumina 1.5 pipeline data. I use Biopieces (www.biopieces.org) for trimming my data, and trim_seq basically removes residues below a given quality threshold from the ends. When I plot the length distribution after trimming, I observe peaks every 5 residues.

read_fastq -n 1000000 -i in.fastq | trim_seq | plot_lendist -k SEQ_LEN -x

                                Length Distribution

  12000 ++----------------------------------------------------------------++
        |                                                                  |
        |                                                                  |
  10000 ++                                                               **+
        |                                                                **|
        |                                                                **|
   8000 ++                                                               **+
        |                                                                **|
   6000 ++                                                          **   **+
        |                                                           **   **|
        |                                                           **   **|
   4000 ++                                                    *    ***   **+
        |                                                     *    ***  ***|
        |                                              **   ***  *****  ***|
   2000 ++                                      **    ***  **** ***********+
        |                                **    ***  ***********************|
        |**        *****  ****   ******************************************|
      0 +******************************************************************+
         +      +     +      +      +     +      +     +      +     +      +
         0      5     10     15     20    25     30    35     40    45     50

Forensics indicate that the presence of B in the quality scores is the problem. If I remove all records containing any B then the anomaly disappears:

read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES -i | trim_seq | plot_lendist -k SEQ_LEN -x

                                Length Distribution

  4000 ++-----------------------------------------------------------------++
       |                                                                 **|
  3500 ++                                                                **+
       |                                                                 **|
  3000 ++                                                                **+
       |                                                                 **|
  2500 ++                                                                **+
       |                                                                 **|
  2000 ++                                                                **+
       |                                                                 **|
       |                                                                 **|
  1500 ++                                                                **+
       |                                                                 **|
  1000 ++                                                                **+
       |                                                                 **|
   500 ++                                                                **+
       |                                                                ***|
     0 ++------+------+-----+------+------+-----+------+------+-----+******+
        +      +      +     +      +      +     +      +      +     +      +
        0      5      10    15     20     25    30     35     40    45     50

And for all records with B's:

read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES | trim_seq | plot_lendist -k SEQ_LEN -x

                                Length Distribution

  7000 ++-----------------------------------------------------------------++
       |                                                            **     |
  6000 ++                                                           **   **+
       |                                                            **   **|
       |                                                            **   **|
  5000 ++                                                           **   **+
       |                                                     **     **   **|
  4000 ++                                                    **    ***   **+
       |                                                     **    ***  ***|
       |                                                     **    ***  ***|
  3000 ++                                              *    ***  *****  ***+
       |                                               *    ***  ***** ****|
  2000 ++                                       **   ***    *** ***********+
       |                                 **    ***   ***   ****************|
       |                                ***    ***  ***********************|
  1000 +**           **     **   ***   ************************************+
       |**         ********************************************************|
     0 +*******************************************************************+
        +      +      +     +      +      +     +      +      +     +      +
        0      5      10    15     20     25    30     35     40    45     50

Now, according to Wikipedia:

http://en.wikipedia.org/wiki/FASTQ_format#Encoding

and the docs I have been able to find (page 32):

http://www.scribd.com/doc/48889532/CASAVA1-7-User-Guide-15011196-A

B, or Q2, is used as an indicator that the quality of a sequence residue is substandard; it does not really represent a quality score. trim_seq will regard B as Q2 and discard the residue, and to the best of my understanding that is OK.
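
To make the encoding concrete, here is a minimal Python sketch of decoding an Illumina 1.5 quality string, assuming the Phred+64 offset described on the Wikipedia page linked above. The function name and the example string are made up for illustration, and this is not a description of how trim_seq is implemented.

```python
PHRED_OFFSET = 64  # assumption: Phred+64 encoding, as used by Illumina 1.5 pipelines


def decode_scores(scores, offset=PHRED_OFFSET):
    """Convert an ASCII quality string into a list of integer quality scores."""
    return [ord(ch) - offset for ch in scores]


example = "ffffdddB"  # hypothetical quality string, not taken from real data
decoded = decode_scores(example)
print(decoded)  # [38, 38, 38, 38, 36, 36, 36, 2]
```

With this offset, B (ASCII 66) decodes to 2, which is why a quality trimmer working on the numeric scores treats it as Q2.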

But I don't understand the cyclic behaviour I observe. 10-20% of all records contain a B, so I will lose a lot of data by filtering out those reads.

Anyone?

(and why do Illumina keep changing FASTQ encoding?)

Cheers,

Martin

score 1 · Answer 1 · 2011-05-11

14.2 years ago

Pablo ★ 1.9k

Yes, the read should be discarded due to low quality. No, I don't know why they keep changing the encoding :-)

I'd recommend using FastQC for quality control. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

To be safe, the correct thing would be to discard reads with B in the quality scores. But I would like to understand the problem; I am baffled by the cyclic anomaly. Fully understanding this may allow me to simply trim these reads instead of discarding them, saving up to 20% of my data!
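
One way to trim rather than discard can be sketched in Python. It assumes, as the Wikipedia page linked in the question describes for Illumina 1.5, that the B marker appears as a run at the 3' end of a read; the function name and the example record are hypothetical, and any B elsewhere in the read is left untouched.

```python
def trim_trailing_b(seq, scores):
    """Drop the run of 'B' quality characters at the 3' end, plus the matching bases."""
    if len(seq) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    keep = len(scores.rstrip("B"))  # number of positions before the trailing B run
    return seq[:keep], scores[:keep]


# Hypothetical record: the last four positions carry the B marker.
trimmed_seq, trimmed_scores = trim_trailing_b("ACGTACGTAC", "ffffddBBBB")
print(trimmed_seq, trimmed_scores)  # ACGTAC ffffdd
```

Whether the remaining bases are trustworthy is a separate question; this only removes the flagged segment.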

Reply by Martin A Hansen, 14.2 years ago

Have you tried FastQC? Maybe it will give you a little more information about your problem.

On the other hand, if it's bad data it will only spoil your analysis, so you'd better filter it out. You might just have a bad run from the sequencer (there is no way to recover from that). In an extreme case you might need to do the experiment again.

Reply by Pablo, 14.2 years ago

I contacted a friend. He says that B does not necessarily indicate a bad residue read, and that the insertion of B's can be skipped using some flag in GERALD or the bcl converter. However, that does not clarify the meaning of the B's, or especially the fact that they mostly occur every 5 nucleotides. I will test FastQC to see what it shows.

Reply by Martin A Hansen, 14.2 years ago

Tested FastQC - it doesn't tell me anything.

Reply by Martin A Hansen, 14.2 years ago

score 0 · Answer 2 · 2011-06-01

Here is the answer from Illumina tech support:

In looking at the graphs, I noticed the cyclic pattern you described is present in all three graphs. It is, however, more subtle in the non-multiplexed sample. The cyclic nature is a result of several factors:

The neighborhood analysis aspect of Illumina Q-scoring results in a 5-cycle cycle nature to the scores. This has been noted in the past. The other aspect is the quality of the sample. In certain cases this cyclic pattern may be more pronounced in some samples. The use of the Trim_Seq option may also result in a heightened presentation of the cycles, especially in cases of more aggressive trimming.
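
A quick way to check for the 5-cycle pattern on your own data is to tally trimmed read lengths by their remainder modulo 5. The Python sketch below does this on a made-up list of lengths; the function name and the numbers are illustrative only and are not taken from the data in this thread.

```python
from collections import Counter


def length_mod_counts(lengths, period=5):
    """Count how many read lengths fall into each remainder class modulo `period`."""
    counts = Counter(n % period for n in lengths)
    return [counts.get(r, 0) for r in range(period)]


# Hypothetical trimmed read lengths, chosen to cluster on multiples of 5.
lengths = [50, 50, 45, 45, 40, 47, 35, 50, 30, 42]
print(length_mod_counts(lengths))  # [8, 0, 2, 0, 0]
```

A strong excess in one remainder class, compared with a roughly even spread, would be consistent with the periodicity described above.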