Question

A Better Way To Clean Up Our Data If Not Mapq


10.6 years ago

diltsjeri ▴ 470

Hi,

Yesterday I posted a question, "Don't Good Base Quality Scores = Good Mapping Scores? (Maybe A Reference Error?)", about an issue we had with an amplicon where the majority of the reads were mapping with a mapq value of 10 or less even though the base quality was 97.5%. It was made clear that our reference has some repeating regions, which were potentially causing this phenomenon. We were using mapq to clean up our data, but if it's causing this one amplicon's reads to fall short of a mapq 15 threshold, what would be a better way to clean up the data? As supplemental information, we are using Ion Torrent technology.

Thanks.

Edit: or is next-gen data always so fickle that our pipeline should be more customized to accommodate that particular amplicon?

samtools mapping ion-torrent • 2.4k views

Updated 10.6 years ago by swbarnes2 14k • written 10.6 years ago by diltsjeri ▴ 470


What do you mean "clean up" the data? What are you trying to do? What is your desired analysis goal? It all depends on the downstream things you want to do.

Reply 10.6 years ago by matted 7.8k


Ultimately we would like to genotype samples. It seems this one amplicon needs to be re-evaluated.

Reply 10.6 years ago by diltsjeri ▴ 470

score 2 · Answer 1 · 2013-09-05


10.6 years ago

Devon Ryan 104k

Pretty much the only thing you can do is use longer reads or change your insert size so that you can more reliably map to the amplicon in question. Using short reads with an amplicon that is highly repetitive is simply not going to work that well.



Thanks. I will try using the longer reads first before we have to resort to the latter.

Reply 10.6 years ago by diltsjeri ▴ 470

score 2 · Answer 2 · 2013-09-05

Filtering by mapq isn't wrong. Your experiment, as you carried it out, will not be able to tell you much about repetitive regions. That's how the technique is. Filtering by mapq will strip those regions of data, so you will only have data about regions where your experiment is informative. As long as your analysis doesn't try to draw conclusions from missing data, you should be fine.
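To make the filtering step concrete, here is a minimal sketch in plain Python, assuming tab-separated SAM text with MAPQ in the fifth column (as the SAM format defines it). The function name `filter_by_mapq`, the default threshold of 15 (taken from the question), and the three example records are made up for illustration.

```python
def filter_by_mapq(sam_lines, min_mapq=15):
    """Yield SAM header lines plus alignments with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ; pass them through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) of a SAM alignment record is MAPQ.
        if int(fields[4]) >= min_mapq:
            yield line


# Hypothetical records: one confidently mapped read, one ambiguous read.
example = [
    "@SQ\tSN:amplicon1\tLN:200",
    "read1\t0\tamplicon1\t10\t40\t50M\t*\t0\t0\t*\t*",
    "read2\t0\tamplicon1\t60\t8\t50M\t*\t0\t0\t*\t*",
]
kept = list(filter_by_mapq(example))
```

In practice the same filter is usually applied with `samtools view -b -q 15 in.bam > out.bam`, where `-q` skips alignments with MAPQ below the given value. Either way, reads from the repetitive amplicon drop out, which is the expected outcome described above.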

score 1 · Answer 3 · 2013-09-05

NGS was not designed to describe or even detect repetitive regions. It's a methodological limitation, and this is why we try to study the region to be sequenced in depth before choosing a particular sequencing method. The main reason is the short length of the reads, which is usually not enough to allow consistent mapping. If part of a read falls outside the repetitive region, you may be lucky enough to detect the region's boundaries, but unless the read is long enough to touch both boundaries (usually it is not), you will end up with reads mapping to multiple locations (hence low mapq) due to the repetitiveness of the sequence. So longer reads would help, but they should be long enough to cover all or almost all of the region from both sides.
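As a reminder of what a low mapq means numerically, the SAM format defines MAPQ as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer. The sketch below simply inverts that definition; the function name is made up for illustration, and how well an aligner's reported MAPQ matches a true probability depends on the aligner.

```python
def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled MAPQ into the implied probability
    that the reported mapping position is wrong."""
    return 10 ** (-mapq / 10)


# Values mentioned in this thread: reads at mapq 10, threshold at mapq 15.
p_at_10 = mapq_to_error_prob(10)
p_at_15 = mapq_to_error_prob(15)
```

By that definition, a mapq of 10 corresponds to roughly a 1 in 10 chance that the read is placed incorrectly, and a mapq of 15 to roughly 3 in 100.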

In case you want to repeat your sequencing experiment through NGS, paired-end reads could be more helpful, since the distance between the two reads of each pair is known in advance and can be used by the mapping and pairing algorithm, although in our experience the mapq of those mapped reads tends to remain low when they end up in a repetitive region. The only thing you can do is study all the reads that map multiple times between the region's boundaries, and maybe infer the number of repeats from the increase in coverage relative to the region's surroundings, or try some other statistical adventure. This is similar to the current approaches for CNV detection in targeted resequencing, which work quite acceptably, but it will never be a direct measurement.
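The coverage comparison mentioned above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: `depths` is a hypothetical list of per-base read depths (for example parsed from `samtools depth` output), the region is given as 0-based half-open coordinates, and the flank size of 50 bases is arbitrary. Amplicon coverage is often uneven, so the ratio is a rough indicator and not a repeat count.

```python
from statistics import mean


def relative_coverage(depths, start, end, flank=50):
    """Ratio of mean depth inside [start, end) to mean depth in the flanks."""
    region = depths[start:end]
    # Take up to `flank` bases on each side of the region.
    flanks = depths[max(0, start - flank):start] + depths[end:end + flank]
    if not region or not flanks:
        raise ValueError("region or flanking sequence is empty")
    flank_mean = mean(flanks)
    if flank_mean == 0:
        raise ValueError("no coverage in the flanking sequence")
    return mean(region) / flank_mean


# Made-up depths: a 20-base region at 3x the depth of its surroundings.
depths = [100] * 50 + [300] * 20 + [100] * 50
ratio = relative_coverage(depths, 50, 70)
```

A ratio well above 1 would be consistent with reads from several repeat copies piling up on one reference copy, but as noted above it remains an indirect estimate.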