Question: BWA Soft-Clipping Pattern In Read Length Distribution Using "-q 20" Parameter
4.8 years ago, toni (Lyon) wrote:

Hi all,

I just had to analyze a really bad lane. It is an old Illumina GAIIx lane with reads of 152 cycles. The last 30-40 cycles are of very poor quality on average, so I decided to use the -q 20 switch in bwa aln to trim the 3' ends of the reads based on quality prior to mapping (something I usually would not do).

To see what this trimming parameter left in the BAM file, I plotted the distribution of the soft-clipped length in the reads. To do so, I took the CIGAR strings of 20 million alignments that BWA declared unique (XT:A:U tag). Here is what I got:

[Figure: Soft clipping pattern]

So we can see that there is a periodic pattern after the main peak (representing no soft-clipped bases). We can also notice the slightly higher bar at position 117: 152-117=35, and indeed by default BWA won't trim reads to less than 35 bp.
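For anyone wanting to reproduce this kind of plot, here is a minimal sketch of how the soft-clip length distribution could be tabulated from SAM records. It is not the exact script used for the figure: the function names are made up for illustration, it sums every S operation in the CIGAR, and it assumes the XT:A:U tag appears as its own tab-separated field. The two demo records are invented.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_length(cigar):
    """Total number of soft-clipped (S) bases in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")

def clip_histogram(sam_lines):
    """Count soft-clip lengths over alignments tagged XT:A:U."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[5] == "*":
            continue  # malformed or unmapped record
        if "XT:A:U" not in fields[11:]:
            continue  # keep only alignments BWA flagged as unique
        counts[soft_clipped_length(fields[5])] += 1
    return counts

# Invented records, for illustration only.
demo = [
    "r1\t0\tchr1\t100\t37\t117M35S\t*\t0\t0\tACGT\tIIII\tXT:A:U\tNM:i:0",
    "r2\t0\tchr1\t200\t37\t152M\t*\t0\t0\tACGT\tIIII\tXT:A:U\tNM:i:1",
]
for length, n in sorted(clip_histogram(demo).items()):
    print(f"{length}\t{n}")
```

On real data, the output of `samtools view` could be passed to `clip_histogram` line by line (for example via `sys.stdin`) instead of the demo list.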

Have you already noticed such a pattern using bwa aln -q, and what in the algorithm produces it?

Because looking at the -q definition in the doc, I cannot see any reason why the trimming would produce such a periodic pattern.
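For reference, the -q definition in the bwa manual says a read is trimmed down to argmax_x{sum_{i=x+1}^{l}(INT - q_i)} if q_l < INT, where l is the original read length. Below is a rough sketch of that rule as I read it, not BWA's actual code. The function name is hypothetical, the `min_len=35` floor is taken from the 35 bp observation above, and the real implementation may differ in details such as stopping early.

```python
def trim_length(quals, threshold=20, min_len=35):
    """Return the read length kept after 3' quality trimming.

    quals: per-base Phred scores (ints), ordered 5' to 3'.
    Keeps x bases, where x maximizes sum(threshold - q_i) over the
    trimmed bases, and only trims when the last base is below threshold.
    min_len is an assumed floor on the kept length.
    """
    n = len(quals)
    if n == 0 or quals[-1] >= threshold:
        return n  # nothing to trim
    best_len, best_sum, running = n, 0, 0
    # Walk from the 3' end inward, never going below min_len kept bases.
    for x in range(n - 1, min_len - 1, -1):
        running += threshold - quals[x]
        if running > best_sum:
            best_sum, best_len = running, x
    return best_len

# A 152-cycle read whose last 52 bases are Q2.
print(trim_length([30] * 100 + [2] * 52))
# A read that is Q2 throughout is cut back only as far as the floor.
print(trim_length([2] * 152))
```

Nothing in this rule is periodic by itself: the kept length depends only on where the quality values cross the threshold, so any periodicity in the trim points would have to come from the qualities.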

Thanks guys.

T.

quality alignment bwa • 2.7k views

I have noted this before. It comes from the Illumina sequencing; see: The Meaning Of B In Illumina 1.5 Pipeline Data?

written 4.8 years ago by Martin A Hansen

Thank you for this information.

written 4.8 years ago by toni

awesome text plots! ;-) could you please add this as an answer as well?

written 4.8 years ago by Istvan Albert

Is it possible that the qualities in the file have some sort of periodicity?

written 4.8 years ago by Istvan Albert

I do not think so, but now that you raised the point, I will probably do a few checks on this :)

written 4.8 years ago by toni

Looking at the link given in the comment below, it seems that you were right!

written 4.8 years ago by toni
4.8 years ago, Martin A Hansen (Denmark) wrote:

This has been noted before, and Illumina tech support gave the following answer:

The cyclic nature is a result of several factors:

The neighborhood analysis aspect of Illumina Q-scoring results in a 5-cycle cycle nature to the scores. This has been noted in the past. The other aspect is the quality of the sample. In certain cases this cyclic pattern may be more pronounced in some samples. The use of the Trim_Seq option may also result in a heightened presentation of the cycles, especially in cases of more aggressive trimming.
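One rough way to check the 5-cycle claim against a soft-clip histogram is to fold the counts by clip length modulo 5 and see whether one residue class dominates. This is only a sketch under assumptions: the function name is made up, the input is assumed to be a mapping from clip length to read count, and the demo histogram is invented rather than taken from the lane discussed here.

```python
from collections import Counter

def fold_by_period(histogram, period=5, skip_zero=True):
    """Sum histogram counts by (clip length mod period)."""
    folded = Counter()
    for length, count in histogram.items():
        if skip_zero and length == 0:
            continue  # ignore reads with no soft clipping
        folded[length % period] += count
    return folded

# Invented histogram: clip length -> number of reads.
demo_hist = {0: 1000, 5: 10, 10: 12, 7: 3, 12: 4}
for residue, total in sorted(fold_by_period(demo_hist).items()):
    print(f"{residue}\t{total}")
```

If the counts spread evenly over the residues, the pattern is probably not a clean 5-cycle one. A bar caused by the minimum read length (such as the one at 117) should be excluded first, since it would inflate its own residue class.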


I accept this as the right answer, because I don't think we will get a deeper explanation of this "problem". But that is enough to understand the putative origin of this pattern. Thanks again.

written 4.8 years ago by toni