frequently found false positives from exome seq
9.7 years ago
poisonAlien ★ 3.2k

Hi all,

I think this is a very common issue in exome sequencing. Whenever we do exome sequencing and variant calling, certain genes pop up more often than any others:

MUC (mucins), USP (ubiquitin-specific peptidases), CYP genes, HLAs, TTN, and more.

Mucins mostly show up because of paralogous alignment, and some genes show up simply because of their enormous length, but how does the community deal with the rest? Do we simply exclude them from further analysis? Is there a list of such messy genes? How does one decide that a call in such a gene is a false positive?

I also found some blogs and a Biostar question discussing this issue.

exome-seq targetted-capture • 5.8k views

I've been wondering about this. Aren't the mucins distinct enough that it's not due to paralogy? If I find a variant in MUC2 and BLAT 100 bases around it back to the genome, I find only 1 hit or 1 obvious best hit.

Is it because of incorrect assembly?

I guess it's unsurprising that TTN is the most common one, given that it's the biggest (in terms of exons and CDS length).

Have you found the results table? I'd like to merge a VCF with this FLAGS result, but the article doesn't clearly report it. The supplemental table S4 named "The entire ranked list of FLAGS" is actually linked to some kind of histogram figure.

Yes, the supplementary tables are a mess. Try the other *.txt files.

9.7 years ago
Naga ▴ 450

It is dangerous to filter out these genes entirely (as stated in the blog). Instead, you can create a blacklist of variants from your in-house exome data, sequenced for other projects, phenotypes, or families, and remove the variants that appear in both datasets. This will reduce false positives but will not remove all of them.
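As a rough illustration of the blacklist idea, here is a minimal Python sketch. It assumes plain, uncompressed, tab-delimited VCF lines and matches variants on exact (CHROM, POS, REF, ALT). The function names and the `min_samples` knob are hypothetical, not part of any existing tool, and a real pipeline would normalize and left-align variants before comparing them.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def variant_keys(vcf_lines: Iterable[str]) -> Iterator[VariantKey]:
    """Yield one key per ALT allele of every record, skipping headers and blank lines."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            yield (chrom, pos, ref, alt)


def build_blacklist(in_house_vcfs: Iterable[Iterable[str]],
                    min_samples: int = 1) -> Set[VariantKey]:
    """Collect variants seen in at least `min_samples` in-house files.

    min_samples is an assumed tuning knob; 1 means "seen anywhere in-house".
    """
    counts: Dict[VariantKey, int] = {}
    for vcf in in_house_vcfs:
        # Count each variant once per file, even if the file repeats it.
        for key in set(variant_keys(vcf)):
            counts[key] = counts.get(key, 0) + 1
    return {key for key, n in counts.items() if n >= min_samples}


def filter_vcf(case_lines: Iterable[str], blacklist: Set[VariantKey]) -> List[str]:
    """Keep header lines and any record with at least one non-blacklisted ALT allele."""
    kept: List[str] = []
    for line in case_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        keys = list(variant_keys([line]))
        if keys and all(key in blacklist for key in keys):
            continue  # every allele of this record is a recurrent in-house call
        kept.append(line)
    return kept
```

Raising `min_samples` makes the filter more conservative: a variant is dropped only if it recurs across several unrelated in-house exomes, which lowers the risk of discarding a real variant that one other sample happens to share.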

Then you can use the intolerance score to rank the genes. Most of the genes above carry many non-synonymous variants in the population compared to other genes, so they will move down the prioritization list. The paper is "Genic Intolerance to Functional Variation and the Interpretation of Personal Genomes".
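The ranking step could look something like the sketch below. It assumes you have already loaded a per-gene score table into a dictionary and that a lower score means a more intolerant gene (check the sign convention of whichever score table you use and flip the sort if needed). The gene names and scores in the usage example are placeholders, not real values.

```python
from typing import Dict, List, Optional, Tuple


def rank_genes(candidates: List[str],
               scores: Dict[str, float]) -> List[Tuple[str, Optional[float]]]:
    """Order candidate genes from most intolerant (lowest score) to most tolerant.

    Genes without a score are not dropped; they are appended at the end
    so they can still be reviewed by hand.
    """
    scored = [(gene, scores[gene]) for gene in candidates if gene in scores]
    unscored = [(gene, None) for gene in candidates if gene not in scores]
    return sorted(scored, key=lambda item: item[1]) + unscored


# Placeholder gene names and scores, for illustration only.
example_scores = {"GENE_A": 1.5, "GENE_B": -0.8, "GENE_C": 0.1}
ranked = rank_genes(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], example_scores)
for gene, score in ranked:
    print(gene, score)
```

This only reorders the list, so highly variable genes sink toward the bottom rather than being excluded outright.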
