Question

Aligning high depth data to short target sequence

1

8.2 years ago

c.v.oflynn ▴ 100

Hi All,

Just a general query regarding best practice:

What is the BEST way to align WGS data to an individual gene or fragment?

Example: I have several libraries of high-coverage shotgun sequence from my favourite organism. I have previously aligned these data to the latest high-quality draft genome. I can use the annotations and locations to easily extract sequences, variants or whatever from the genome. If a gene is not annotated, I can also easily locate it within the draft genome using homology searches and then extract whatever information I am interested in.

But what if the gene I am interested in is present in the organism and has previously been sequenced, but is missing from the draft genome? What is the best way to use my WGS sequences to inspect this gene?

I can think of a couple of strategies, but neither is fully satisfying:

Map my libraries directly to the single gene. However, I end up with crazy high coverage and some weird calls, and I do not entirely trust this method.

Or add the gene as a mock chromosome to the reference sequence and re-align to this new genome, which should remove the high-coverage issues and hopefully recruit only the correct reads to the gene of interest. I would still miss reads that overlap the ends of the mock chromosome.

Any thoughts?

Ciaran

whole-genome resequencing target alignment • 1.5k views

score 3 · Answer 1 · 2016-01-27

8.2 years ago

Devon Ryan 104k

Ideally you'd add the gene as a new contig to the reference multi-FASTA file and map against that. If you use local alignment, the ends of the segment should still get covered, since reads that overhang an end can be soft-clipped rather than discarded.
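For anyone wanting to script the "add the gene as a new contig" step, here is a minimal sketch in plain Python. The function names (`fasta_names`, `add_contig`) and any file names are hypothetical, chosen only for illustration. It simply concatenates the two FASTA files and refuses to proceed if the gene's name clashes with an existing contig name, which would otherwise confuse downstream tools.

```python
def fasta_names(path):
    """Return the sequence names (first word of each '>' header) in a FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].split()
                names.append(header[0] if header else "")
    return names


def add_contig(reference_fasta, gene_fasta, combined_fasta):
    """Write reference + gene into one multi-FASTA, refusing duplicate names.

    Hypothetical helper: returns the contig names of the combined file.
    """
    clash = set(fasta_names(reference_fasta)) & set(fasta_names(gene_fasta))
    if clash:
        raise ValueError(f"contig name(s) already in reference: {sorted(clash)}")
    with open(combined_fasta, "w") as out:
        for source in (reference_fasta, gene_fasta):
            with open(source) as handle:
                last = "\n"
                for line in handle:
                    out.write(line)
                    last = line
                # Make sure the next '>' header starts on its own line.
                if not last.endswith("\n"):
                    out.write("\n")
    return fasta_names(combined_fasta)
```

The combined file then needs to be indexed and mapped against as usual. As an assumption about the toolchain (the thread does not name an aligner): with Bowtie 2 that would be `bowtie2-build` on the combined FASTA followed by `bowtie2 --local`, while BWA-MEM performs local alignment by default.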

0

This is the right approach. Aligners try very hard to place every read, and if you map all of the reads to only your segment of interest, you're probably going to get lots of reads incorrectly mapped there that would otherwise map well to other portions of the genome.
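A toy illustration of that point, not a model of how any real aligner works: the sketch below places a read wherever it has the fewest mismatches (ungapped, exhaustive search). All sequences and the `best_hit` function are invented for the example. The read comes from a paralog on `chr1` that differs from the gene by one base; with only the gene as reference it is recruited to the gene, but with the genome present it goes to its true origin.

```python
def best_hit(read, contigs):
    """Return (mismatches, contig_name, offset) of the best ungapped placement."""
    best = None
    for name, seq in contigs.items():
        for offset in range(len(seq) - len(read) + 1):
            window = seq[offset:offset + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if best is None or mismatches < best[0]:
                best = (mismatches, name, offset)
    return best


# Made-up sequences: chr1 carries a paralog differing from the gene at one base.
gene = "ACGTTGCAAGGCTA"
genome = {"chr1": "TTTTACGTTGCTAGGCTATTTT"}
read = "ACGTTGCTAGGCTA"  # truly derived from the chr1 paralog

# Gene-only reference: the read has nowhere else to go.
gene_only = best_hit(read, {"my_gene": gene})

# Genome plus gene contig: the perfect match on chr1 wins.
combined = best_hit(read, {**genome, "my_gene": gene})

print(gene_only)  # (1, 'my_gene', 0)
print(combined)   # (0, 'chr1', 4)
```

With the gene-only reference, that read would add a spurious mismatch to the pileup, which is one plausible source of "weird calls".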

Reply • 8.2 years ago by Chris Miller 22k