Question

How to do genome guided denovo assembly? If not, how to align the denovo assembled bam to a particular reference genome?

0

Entering edit mode

6.9 years ago

kirannbishwa01 ★ 1.6k

I have two different question which may sound similar, but I think are different:

How to do genome guided denovo assembly? How to align the denovo assembled bam to a particular reference genome?

Explanation of problem in detail: I have aligned the fastq reads (100 bp - PE) for several samples from different individuals of different population to a reference genome using BWA mem.

The initial alignment percentage was good (about 80 - 90 %) read alignment. I then did a hard filtering on the aligned reads by removing the reads from the regions:

with high coverage (above 97.5th percentile coverage distribution)
reads where mates are mapped to different chromosome
reads that contain hets sites for all the samples
reads that have low mapQ
and several other stringent filtering

This filtered about 30-40 % of the aligned reads from each sample. Now, I am thinking if I can

take this filtered reads (as bam) and convert them to fastq
and run a denovo assembly of these fastq (genome guided/unguided ??)
or merge/align the denovo assembled bam to a reference

I think this should be able to take care of paralogous alignment, identifying big InDels, chromosomal changes to certain extent. Any one tried this? with which software might this be possible?

Thanks,

genome alignment Assembly allpathsLG paralogous • 3.1k views

ADD COMMENT • link updated 6.8 years ago by h.mon 35k • written 6.9 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

The initial alignment percentage was good (about 80 - 90 %) read alignment. I then did a hard filtering on the aligned reads by removing the reads:

with high coverage (97.5th percentile coverage) reads where mates are mapped to different chromosome reads that contain hets sites for all the samples and several other stringent filtering

Why did you do these operations?

ADD REPLY • link 6.9 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi @Brian:

If the issue about filtering isn't clear, this is what I did.

after doing the initial alignment using BWA mem, I plotted the coverage of the alignment through out the genome for all the chormosome. This is how I plotted the coverage: C: Plotting the coverage and depth (Y-axis) statistics of a bam file along the geno
then I created a bed coverage map for each chromosome for each sample and filtered the aligned reads where the coverage were more then 97.5% ile of the coverage distribution.
I also filtered low mapQ read (below mapQ 30).

These kind of alignment regions generally are contaminated with paralogous alignment, so I had to do it. But, I don't want to throw the data away. I rather want to run a de novo assembly (using all the filtered data), then create a long read out of it, after that I want to blast this read with several different close species to see what's going on.

ADD REPLY • link 6.9 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

These kind of alignment regions generally are contaminated with paralogous alignment, so I had to do it.

I'm not so sure about that. I'd suggest assembling with the raw data, after adapter-trimming, quality-trimming, and error-correction. Though I'd be interested in seeing any evidence you have regarding that point.

ADD REPLY • link 6.9 years ago by Brian Bushnell 20k

0

Entering edit mode

Doesn't the extremely high coverage in a particular region indicate misalignment of the paralogous reads?

ADD REPLY • link 6.9 years ago by kirannbishwa01 ★ 1.6k

1

Entering edit mode

Not necessarily... high coverage might indicate a misassembly, or a structural variation, or simply normal Illumina bias. If you filter out regions over the 97.5% percentile of coverage, you are guaranteed to lose 2.5% of your genome, whether it's correct or not, and whether the coverage is super-high or not. It does not seem to me to be a good strategy. Also, mapq 30 is an odd filter to apply. Variant-callers already take mapq into account, and 30 is pretty high. I would expect filtering on mapq 30 to yield extreme ref bias, which makes variant calling much less accurate.

ADD REPLY • link 6.9 years ago by Brian Bushnell 20k

0

Entering edit mode

So, what is you strategy for dealing with paralogous alignment. What is you choice for best filtering strategy after alignment step?

Also, I am not talking about human genome in here which are mostly homozygous. Since you have more experience with alignment: How should filtering strategy change when working with human like data (primates) vs. other animals vs. plants?

Any suggestions appreciated.

Thanks,

ADD REPLY • link 6.9 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

What organism are you working with?

ADD REPLY • link 6.9 years ago by Brian Bushnell 20k

0

Entering edit mode

Arabidopsis lyrata.

ADD REPLY • link 6.9 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

@ Brian: So, any suggestions ?

ADD REPLY • link 6.8 years ago by kirannbishwa01 ★ 1.6k

1

Entering edit mode

Generally, I would just map and call variants, and use the variant quality score as the primary filter critera. You may want to apply a mapq filter, but not that high. And I'm not convinced at all that a blanket high-coverage filter is a good idea. Or at least, not for 2.5% of the genome.

I find that it's helpful to look at the variants with IGV and Excel to see if there are regions with dense mismatches with low-quality-score variants, or other suspicious features (and yes, super-high coverage is suspicious, particularly when that region contains SNPs at strange allele ratios, like 1:5 var:ref in a diploid). When you spot them you can manually remove everything from that region, or look at the components of the variant quality (such as the strand ratio, average quality score, position of the variants in the read, etc) to see if there is some specific feature you can use to get rid of a lot of false positives automatically. This kind of depends on the data; the universal rules that work well are probably already baked into variant callers as defaults.

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks for the Info Brian. Actually, I have also looked the data with IGV, and saw that regions with high coverage have too much clustered SNPs which made me suspect those regions might be ill-aligned. I realize that using the bed file for coverage will remove all the read that touch that region (thus loosing more than just high coverage sites).

I might change my parameters to somewhat 1% and use allele-ratio as another guide.

Most of my sample have the similar patterns of coverage, as shown in this question/tutorial. There is no alignment at/around the centromere. So, I am thinking if the aligner forces the reads to align to another similar regions.

But, that for the suggestions !

ADD REPLY • link 6.8 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

I want to blast this read with several different close species to see what's going on.

This seems a XY problem. What do you really want to do? What is the question you are interest in?

How to align the denovo assembled bam to a particular reference genome?

Bam is most commonly the output of mapping, not assembly. And you usually (never?) align a bam to a fasta.

ADD REPLY • link 6.9 years ago by h.mon 35k

0

Entering edit mode

I am actually thinking about assembling the reads that were filtered. Then collapse that assembled reads to create a single reads (consensus sequence) which could be longer in length; and be able to blast that read across different species.

ADD REPLY • link 6.9 years ago by kirannbishwa01 ★ 1.6k

score 0 · Answer 1 · 2017-06-11

How to do genome guided denovo assembly?

I didn't notice this before, but "genome guided" is one thing, "de novo" is another. (scratch that, there are genome-guided de novo strategies.)

As Brian Bushnell told you - and as your results suggest - your filtering strategies are too agressive. You will most likely get shorter assemblies than using your full data. I think they will contain both fewer mis-assemblies, but also will miss some correct assemblies.

I never worked with the Arabdopsis lyrata genome, but from the paper it seems to be of good quality. Unless you have a good mix of libraries (good coverage per sample, paired ends and some large mate-pair), a de novo assembly will get less contiguous assemblies than the reference genome. If you have only paired and data, your assemblies will be very fragmented, there is no way around it.

So I suggest you map the samples against the A.lyrata reference genome, and use this a starting point to search for variations small (SNPs and indels) and large (larger deletions / duplications, structural variations). There are several software for that.