Question

Splitting reference genome for alignment

1


6.5 years ago

prasundutta87 ▴ 660

Hi,

I am only interested in aligning DNAseq reads to certain genes. If I split my reference genome based on the coordinates of my genes of interest (as present in the GTF/GFF file) and then use BWA for aligning my reads to the resulting 'smaller reference genomes', will it be a good idea?

If yes, how many bases upstream and downstream of the gene coordinates should be included? And what caveats of splitting the reference genome should I pay attention to?

My motivation for using this method is to reduce alignment time, as I am only interested in, say, 20-30 genes and not all genes.

alignment genome gene • 1.7k views

ADD COMMENT • link updated 6.5 years ago by Pierre Lindenbaum 161k • written 6.5 years ago by prasundutta87 ▴ 660

score 2 · Accepted Answer · 2017-10-24

2


6.5 years ago

Pierre Lindenbaum 161k

for aligning my reads to the resulting 'smaller reference genomes', will it be a good idea?

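No, you'll get false positives: reads that really come from elsewhere in the genome (paralogs, pseudogenes, repeats) have nowhere else to go and get forced onto your genes. It's the same problem as in "Exome Sequencing: Masking The Non-Genic Sequences?" (here you're 'masking' whole chromosomes). Quoting Heng Li: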

"This will lead to wrongly mapped sequences, spurious SNPs/indels calls and all sorts of problems. I cannot think of a single use case when masking [before mapping] may lead to better outcomes."

ADD COMMENT • link 6.5 years ago by Pierre Lindenbaum 161k

0


I am only interested in aligning DNAseq reads to certain genes

What you can do instead is align to the whole genome and discard the reads outside your regions after bwa and before sorting:

bwa (...) | samtools view -L my.bed (...) | samtools sort (...)
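To make the one-liner above more concrete, here is a minimal sketch, not a definitive recipe. It assumes an Ensembl/GENCODE-style GTF with a `gene_name` attribute, a plain-text list of gene names (one per line), a reference already indexed with `bwa index`, and paired-end FASTQ input. The function names `make_bed` and `align_to_regions`, the file names in the usage comments, and the 1000 bp flank are all hypothetical; pick a flank that suits your data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# make_bed GTF GENE_LIST FLANK
# Print a BED line (chrom, start, end, name) for every "gene" feature whose
# gene_name is in GENE_LIST, padded by FLANK bases on each side.
# Assumption: the GTF carries a gene_name "..." attribute in column 9.
make_bed() {
    local gtf=$1 genes=$2 flank=$3
    awk -v flank="$flank" 'BEGIN { FS = OFS = "\t" }
        # First file: one wanted gene name per line.
        NR == FNR { want[$1] = 1; next }
        # Second file: the GTF. Keep gene features with a wanted gene_name.
        $3 == "gene" && match($9, /gene_name "[^"]+"/) {
            name = substr($9, RSTART + 11, RLENGTH - 12)
            if (name in want) {
                start = $4 - 1 - flank   # GTF is 1-based, BED is 0-based
                if (start < 0) start = 0
                # The padded end is not clipped to the contig length here.
                print $1, start, $5 + flank, name
            }
        }' "$genes" "$gtf"
}

# align_to_regions REF R1 R2 BED OUT_BAM
# Align against the WHOLE reference, keep only alignments overlapping BED,
# then sort. Unmapped reads and mates mapped outside the regions are dropped.
align_to_regions() {
    local ref=$1 r1=$2 r2=$3 bed=$4 out=$5
    bwa mem "$ref" "$r1" "$r2" \
        | samtools view -u -L "$bed" - \
        | samtools sort -o "$out" -
}

# Usage (hypothetical file names):
#   make_bed annotation.gtf genes.txt 1000 > my.bed
#   align_to_regions genome.fa reads_1.fq reads_2.fq my.bed genes.sorted.bam
```

Note that this does not reduce alignment time, since bwa still maps every read against the full genome; it only shrinks what gets sorted and stored. That is the price of avoiding the false positives described above.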