Question: The Meaning Of B In Illumina 1.5 Pipeline Data?
7.4 years ago, Martin A Hansen (Denmark) wrote:

Hello all,

I have observed an anomaly in all Illumina 1.5 pipeline data. I use Biopieces (www.biopieces.org) for trimming my data, and trim_seq basically removes residues with quality scores below a given threshold from the ends. When I plot the length distribution after trimming, I observe peaks at every 5 residues.

read_fastq -n 1000000 -i in.fastq | trim_seq | plot_lendist -k SEQ_LEN -x

                                Length Distribution

  12000 ++----------------------------------------------------------------++
        |                                                                  |
        |                                                                  |
  10000 ++                                                               **+
        |                                                                **|
        |                                                                **|
   8000 ++                                                               **+
        |                                                                **|
   6000 ++                                                          **   **+
        |                                                           **   **|
        |                                                           **   **|
   4000 ++                                                    *    ***   **+
        |                                                     *    ***  ***|
        |                                              **   ***  *****  ***|
   2000 ++                                      **    ***  **** ***********+
        |                                **    ***  ***********************|
        |**        *****  ****   ******************************************|
      0 +******************************************************************+
         +      +     +      +      +     +      +     +      +     +      +
         0      5     10     15     20    25     30    35     40    45     50

Forensics indicate that the presence of B in the quality scores is the problem. If I remove all records containing any B then the anomaly disappears:

read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES -i | trim_seq | plot_lendist -k SEQ_LEN -x

                                Length Distribution

  4000 ++-----------------------------------------------------------------++
       |                                                                 **|
  3500 ++                                                                **+
       |                                                                 **|
  3000 ++                                                                **+
       |                                                                 **|
  2500 ++                                                                **+
       |                                                                 **|
  2000 ++                                                                **+
       |                                                                 **|
       |                                                                 **|
  1500 ++                                                                **+
       |                                                                 **|
  1000 ++                                                                **+
       |                                                                 **|
   500 ++                                                                **+
       |                                                                ***|
     0 ++------+------+-----+------+------+-----+------+------+-----+******+
        +      +      +     +      +      +     +      +      +     +      +
        0      5      10    15     20     25    30     35     40    45     50

And for all records with B's:

read_fastq -n 1000000 -i in.fastq | grab -p B -k SCORES | trim_seq | plot_lendist -k SEQ_LEN -x

                                Length Distribution

  7000 ++-----------------------------------------------------------------++
       |                                                            **     |
  6000 ++                                                           **   **+
       |                                                            **   **|
       |                                                            **   **|
  5000 ++                                                           **   **+
       |                                                     **     **   **|
  4000 ++                                                    **    ***   **+
       |                                                     **    ***  ***|
       |                                                     **    ***  ***|
  3000 ++                                              *    ***  *****  ***+
       |                                               *    ***  ***** ****|
  2000 ++                                       **   ***    *** ***********+
       |                                 **    ***   ***   ****************|
       |                                ***    ***  ***********************|
  1000 +**           **     **   ***   ************************************+
       |**         ********************************************************|
     0 +*******************************************************************+
        +      +      +     +      +      +     +      +      +     +      +
        0      5      10    15     20     25    30     35     40    45     50

Now, according to Wikipedia:

http://en.wikipedia.org/wiki/FASTQ_format#Encoding

and the docs I have been able to find (page 32):

http://www.scribd.com/doc/48889532/CASAVA1-7-User-Guide-15011196-A

B (Q2) is used as an indicator that the quality of a sequence residue is substandard, but it does not really represent a quality score. trim_seq will regard B as Q2 and discard the residue, and to the best of my understanding that is OK.
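
To make the encoding concrete: Illumina 1.5 quality strings use an ASCII offset of 64, so B (ASCII 66) decodes to Q2. The following is a minimal Python sketch of that decoding. It is not part of Biopieces, and the name decode_scores is made up for illustration.

```python
# Illumina 1.3 to 1.7 pipelines store Phred scores as ASCII characters
# with an offset of 64 (Sanger FASTQ uses 33 instead).
ILLUMINA_15_OFFSET = 64


def decode_scores(scores, offset=ILLUMINA_15_OFFSET):
    """Convert a FASTQ quality string into a list of integer scores.

    decode_scores is a hypothetical helper, not a Biopieces function.
    """
    return [ord(ch) - offset for ch in scores]


# 'h' is Q40, 'f' is Q38 and 'B' is Q2 under the +64 offset.
print(decode_scores("hhfB"))  # [40, 40, 38, 2]
```

Any trimmer that applies a threshold above 2 will therefore always cut residues marked with B, whether or not B was meant as a real score.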

But I don't understand the cyclic behaviour I observe. 10-20% of all records contain a B, so I will lose a lot of data by filtering out those reads.

Anyone?

(and why does Illumina keep changing the FASTQ encoding?)

Cheers,

Martin

7.4 years ago, Pablo (Canada) wrote:

Yes, the read should be discarded due to low quality. No, I don't know why they keep changing the encoding :-)

I'd recommend using FastQC for quality control. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

To be safe, the correct thing would be to discard reads with B in the quality scores. But I would like to understand the problem: I am baffled by the cyclic anomaly. Fully understanding this may allow me to simply trim these reads instead of discarding them, saving up to 20% of my data!
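
A minimal Python sketch of what "trim instead of discard" could look like, assuming the Bs of interest form a run at the 3' end of the read. This is not how trim_seq works (it trims both ends against a threshold), and trim_trailing_b is a hypothetical name.

```python
def trim_trailing_b(seq, scores, marker="B"):
    """Drop the 3' run of marker characters and the matching bases.

    Hypothetical helper: only a trailing run is removed; any B inside
    the read is left untouched.
    """
    keep = len(scores.rstrip(marker))
    return seq[:keep], scores[:keep]


seq, scores = trim_trailing_b("ACGTACGTAC", "hhhhgfeBBB")
print(seq, scores)  # ACGTACG hhhhgfe
```

A read consisting only of Bs comes back empty, so a minimum-length filter would still be needed afterwards.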

Reply written 7.4 years ago by Martin A Hansen

Have you tried FastQC? Maybe it will give you a little more information about your problem.

On the other hand, if it's bad data it will only spoil your analysis, so you'd better filter it out. You might just have a bad run from the sequencer (there is no way to recover from that). In an extreme case you might need to do the experiment again.

Reply written 7.4 years ago by Pablo

I contacted a friend. He says that B does not necessarily indicate a bad residue read, and that the insertion of Bs can be skipped using some flag in GERALD or the bcl converter. However, that does not clarify the meaning of the Bs, or especially the fact that they mostly occur every 5 nucleotides. I will test FastQC to see what it does.

Reply written 7.4 years ago by Martin A Hansen

Tested FastQC; it doesn't tell me anything.

Reply written 7.4 years ago by Martin A Hansen

7.4 years ago, Martin A Hansen (Denmark) wrote:

Here is the answer from Illumina tech support:

In looking at the graphs, I noticed the cyclic pattern you described is present in all three graphs. It is, however, more subtle in the non-multiplexed sample. The cyclic nature is a result of several factors:

The neighborhood analysis aspect of Illumina Q-scoring results in a 5-cycle cycle nature to the scores. This has been noted in the past. The other aspect is the quality of the sample. In certain cases this cyclic pattern may be more pronounced in some samples. The use of the Trim_Seq option may also result in a heightened presentation of the cycles, especially in cases of more aggressive trimming.
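
One way to check the 5-cycle pattern outside of Biopieces is to tabulate trimmed lengths and group them by their remainder modulo 5. The Python sketch below assumes trimming only strips the trailing run of Bs, which is a simplification of what trim_seq does; both function names are hypothetical.

```python
from collections import Counter


def trimmed_length_counts(score_strings, marker="B"):
    """Count reads by the length left after stripping trailing markers."""
    return Counter(len(s.rstrip(marker)) for s in score_strings)


def counts_by_phase(length_counts, period=5):
    """Group a length histogram by length modulo period."""
    phases = Counter()
    for length, n in length_counts.items():
        phases[length % period] += n
    return phases


# Toy quality strings, not real data: trimmed lengths are 5, 10, 3 and 7.
scores = ["hhhhhBBBBB", "hhhhhhhhhh", "hhhBB", "hhhhhhhBBB"]
lengths = trimmed_length_counts(scores)
phases = counts_by_phase(lengths)
print(sorted(phases.items()))  # [(0, 2), (2, 1), (3, 1)]
```

If the B runs really start preferentially every 5 cycles, one remainder class should dominate on real data.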

But I am still unsure what to do with the Bs ...

Reply written 7.4 years ago by Martin A Hansen