6.6 years ago
Tania
Hi all
Suppose we have an interesting variant and have checked the alignments, the coverage, the strand bias, etc., and everything looks fine. The variant is in a duplicated gene (I checked the duplicated-gene databases), but it is not in a duplicated segment. Should we ignore it?
Thanks
What I usually did was to consider only variants called from uniquely mapped reads (which probably correspond to a "unique segment"), irrespective of the duplication state of the gene. Even paralogs may have slightly more divergent regions to which you can map confidently, and in my opinion you shouldn't discard this data.
You'll find that the majority of the human genome exhibits some level of sequence similarity to other parts of the genome. The lack of adequate genomic maintenance that allowed this to occur has, ironically, helped us to evolve: genes (or parts of them) were copied and then left to mutate over millions of years, which conferred new functionality. The human genome is very messy, though.
Aside from general sequence similarity, duplicated genes are problematic in NGS, and a majority of protein-coding genes have a related pseudogene. Like Fabio, I do not recommend throwing out data for a gene just because it is duplicated. The 'unique alignment' idea that Fabio mentioned is one good way to improve the situation, but some aligners implement 'uniqueness' merely by looking at the MAPQ. Bowtie (v1), though, can genuinely be restricted to reads that align uniquely: pass it the --best -m 1 parameters.
Aside from unique alignment with Bowtie (v1), I would recommend setting a MAPQ threshold of >40 or 50 (Phred-scaled); samtools can be used to throw out or mark reads that fall below a particular MAPQ. It is then also worth looking at the values of other metrics.
Take a look at the vcfutils.pl script that comes bundled with BCFtools, as it can be used to apply extra filtering to your variants.
Finally, we have to remember that certain regions of the genome, including coding exons, cannot be faithfully sequenced using 'short'-read NGS. This is problematic for clinical testing companies that are interested in genes falling into this category, and it requires a supplementary method, such as long-read NGS, Sanger sequencing, or something like MLPA (multiplex ligation-dependent probe amplification).
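To make the MAPQ threshold idea from earlier in this answer concrete, here is a minimal Python sketch that keeps only alignments at or above a chosen MAPQ from SAM-formatted text. It relies only on the fact that MAPQ is the fifth tab-separated column of a SAM record. The function name filter_sam_by_mapq, the default threshold of 40, and the three toy reads are made up for illustration. In practice you would more likely let samtools do this (samtools view with its -q option skips alignments below a given MAPQ), so treat this as an illustration of the logic rather than a replacement for it.

```python
def filter_sam_by_mapq(lines, min_mapq=40):
    """Yield SAM header lines plus alignments whose MAPQ is >= min_mapq.

    Hypothetical helper for illustration; the threshold is a judgment call
    and the usable range depends on the aligner's MAPQ scale.
    """
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            # A SAM alignment record has at least 11 mandatory columns.
            continue
        mapq = int(fields[4])  # column 5 (0-based index 4) is MAPQ
        if mapq >= min_mapq:
            yield line


# Toy input: two header lines and three invented reads with different MAPQs.
toy_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read3\t16\tchr1\t300\t40\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

kept = list(filter_sam_by_mapq(toy_sam, min_mapq=40))
kept_names = [rec.split("\t")[0] for rec in kept if not rec.startswith("@")]
print(kept_names)
```

With the toy input, read2 (MAPQ 3) is dropped while read1 and read3 are kept. Note that this sketch uses "greater than or equal to" the threshold; raise min_mapq by one if you want a strict "greater than" cut-off.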
Excellent. Thanks Fabio and Kevin so much.