Exome Sequencing: Masking The Non-Genic Sequences?
10.3 years ago

Hi all,

I'm about to align a set of short reads on the genome after an exome capture.

I wonder whether it would be worthwhile to replace all the non-genic genomic sequences with 'N'.

Wouldn't it speed up the alignment with BWA? What other consequences might it have? Has it been done before?

Many thanks

Pierre

next-gen sequencing bwa exome fasta • 6.6k views

From SEQanswers: "The main consequence of removing information is increasing the risk of false alignments -- i.e. BWA may decide to make an imperfect & incorrect alignment which is the best it can make with your masked database, but with an unmasked one it would find a better (and correct) alignment. So, you may have false variant calls as a result."

10.3 years ago

In a word, no. It's not necessary for any of the BWT aligners. In fact it would cause you to miss any unannotated exons, or possibly entire novel genes. It would also hide other biological events, such as genomic sequence contamination in the preparation process.

If you want to speed up the alignment, split the reads into several chunks and map them in parallel. Afterwards, use a merging program such as samtools or Picard to create a single BAM file. We usually aim for ~1 Gbp per process for a vertebrate genome.
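
As a rough illustration of the chunking step, here is a minimal Python sketch that groups FASTQ records into chunks of about a given number of bases. The function names `read_fastq` and `chunk_by_bases` are made up for this example, and it assumes plain four-line FASTQ records. Each chunk would then be written to its own file, aligned in a separate process, and the resulting BAM files merged.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from FASTQ lines.

    Assumes every record occupies exactly four lines.
    """
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def chunk_by_bases(records, max_bases):
    """Group records into lists, closing a chunk once it holds >= max_bases bases."""
    chunk, bases = [], 0
    for rec in records:
        chunk.append(rec)
        bases += len(rec[1])  # rec[1] is the sequence line
        if bases >= max_bases:
            yield chunk
            chunk, bases = [], 0
    if chunk:  # leftover records that did not fill a whole chunk
        yield chunk
```

With `max_bases=1_000_000_000` this approximates the ~1 Gbp per process figure. For paired-end data, both mate files would need to be cut at the same record positions so that pairs stay together.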

"it would cause you to miss any unannotated exons": tell me if I'm wrong, but I'm using exome capture, so my reads should only map to the set of genes/probes defined by the manufacturer (NimbleGen exome).

Yes, they should, if the annotation used to design the probes is identical to that used for analysis. Maybe I'm just paranoid about these things!

In theory yes, but in our (limited) experience the targeted capture laboratory methods are not perfect, and you will get some percentage of your reads correctly mapping outside of the target region because that's what was in the lane. You will probably want to quantify this.
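
One way to quantify this, sketched in Python under some assumptions: the target regions and the aligned read coordinates have already been parsed (for instance from the capture BED file and the BAM file) into half-open `(start, end)` intervals. The helper names `merge_intervals` and `off_target_fraction` are hypothetical, not part of any tool.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def off_target_fraction(targets, reads):
    """Return the fraction of reads that overlap no target interval.

    targets: dict mapping chromosome -> list of (start, end)
    reads:   iterable of (chromosome, start, end)
    """
    index = {}
    for chrom, ivs in targets.items():
        merged = merge_intervals(ivs)
        index[chrom] = ([s for s, _ in merged], merged)

    total = off = 0
    for chrom, start, end in reads:
        total += 1
        starts, merged = index.get(chrom, ([], []))
        # Index of the last target that starts before the read ends.
        i = bisect_right(starts, end - 1) - 1
        # Targets are disjoint and sorted, so only that one can overlap.
        if i < 0 or merged[i][1] <= start:
            off += 1
    return off / total if total else 0.0
```

A read counts as on-target here if it overlaps a target by at least one base; a stricter or padded definition may suit some capture designs better.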

10.3 years ago
lh3 32k

I gave the answer to another question: "No, do not align to masked genome for any purpose." "Masking has never been perfect and probably will never be perfect. This will lead to wrongly mapped sequences, spurious SNPs/indels calls and all sorts of problems. I cannot think of a single use case when masking [before mapping] may lead to better outcomes."

In case of exomes, when you map the reads you will find a lot of them coming from unique non-targeted regions.

Picard's CalculateHsMetrics produces graphs and stats to help assess how well your capture worked (http://picard.sourceforge.net/picard-metric-definitions.shtml#HsMetrics)

I support this. I don't see an advantage either, and the minimal speed-up isn't worth the limitation, IMHO.

10.3 years ago
Jan Oosting ▴ 920

How would you identify the non-genic sequences before alignment?

You could use a reference consisting of the exonic regions or the sequences of the capture probes instead of the whole genome. That would speed up the alignment, but would make it harder to identify events like genomic rearrangements. Also, it is good to know whether the untargeted sequences are part of your target species or have another origin.

I would use UCSC knownGene +/- 10 kb.
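
For concreteness, a small Python sketch of how such "gene +/- 10 kb" regions could be derived. It assumes the knownGene table has already been parsed into `(chrom, txStart, txEnd)` tuples with 0-based half-open coordinates; the function `pad_and_merge` and the optional `chrom_sizes` argument are invented for this example.

```python
def pad_and_merge(genes, pad=10_000, chrom_sizes=None):
    """Pad each gene interval by `pad` bases on both sides and merge overlaps.

    genes:       iterable of (chrom, start, end), half-open coordinates
    chrom_sizes: optional dict chrom -> length, used to clip the right edge
    Returns a dict chrom -> sorted list of merged (start, end) tuples.
    """
    by_chrom = {}
    for chrom, start, end in genes:
        lo = max(0, start - pad)  # never extend past the chromosome start
        hi = end + pad
        if chrom_sizes and chrom in chrom_sizes:
            hi = min(hi, chrom_sizes[chrom])
        by_chrom.setdefault(chrom, []).append((lo, hi))

    result = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for s, e in sorted(intervals):
            if merged and s <= merged[-1][1]:
                # Overlaps or touches the previous region: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        result[chrom] = merged
    return result
```

Everything outside the returned regions would be what this definition treats as "non-genic".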

The point is: before alignment you don't know anything about the sequences except things like nucleotide distribution. By choosing your reference for alignment, you limit the sequences you can find to the ones you are already interested in.
