Hi,
I'm using the example data shown at https://asia.ensembl.org/info/docs/tools/vep/vep_formats.html#default to test VEP. I don't know if it matters, but I also added some fields and ClinVar as a custom source. I get multiple CSQ entries for each variant. They are mostly identical, but sometimes even the gene symbol differs. Why is this happening? How can I decide which entry is the "actual" one?
Thanks!
Input Data
1 881907 881906 -/C +
5 140532 140532 T/C +
12 1017956 1017956 T/A +
18 Lines of INFO CSQ Data For 12:1017956-1017956
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|*/K|Tag/Aag|HIGH|1|stop_lost|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|*/K|Tag/Aag|HIGH|1|stop_lost|||||||||||,
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|*/K|Tag/Aag|HIGH|1|stop_lost|||||||||||,
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|*/K|Tag/Aag|HIGH|1|stop_lost|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|||MODIFIER|1|downstream_gene_variant|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|*/K|Tag/Aag|HIGH|1|stop_lost|||||||||||,
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|||MODIFIER|1|non_coding_transcript_exon_variant|||||||||||,
12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|||MODIFIER|1|downstream_gene_variant|||||||||||,
12_1017956_T/A|WNK1|12:1017956|A|T|||MODIFIER|1|downstream_gene_variant|||||||||||
Thank you. I think it's because of transcripts as well. I didn't include the transcript in the fields, which is why it isn't shown in CSQ. Is there any way to disable transcripts? I only want the rsID, gnomAD, ClinVar, and amino acid change. And it would be great to have only one CSQ entry for each position.
Hi Magnolia,
Pierre is correct in saying that the multiple rows in your output correspond to multiple transcripts. A single variant can have multiple predicted consequences (on multiple transcripts of a single gene, or even on transcripts of two or more genes). That is why both WNK1 and RAD52 appear for 12:1017956.
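To make this concrete, here is a small sketch that tallies the 18 CSQ entries posted above by gene symbol and consequence. The field positions (symbol at index 1, consequence at index 9) are an assumption inferred from the values in the posted output, not from a CSQ header, and the trailing empty fields are trimmed for brevity. The entries are grouped rather than listed in their original order.

```python
from collections import Counter

# Assumed field positions, inferred from the entries posted in this thread.
SYMBOL, CONSEQUENCE = 1, 9

prefix = "12_1017956_T/A|{gene}|12:1017956|A|T|{aa}|{codons}|{impact}|{strand}|{csq}"

rad52_downstream = prefix.format(gene="RAD52", aa="", codons="", impact="MODIFIER",
                                 strand="-1", csq="downstream_gene_variant")
wnk1_stop_lost = prefix.format(gene="WNK1", aa="*/K", codons="Tag/Aag", impact="HIGH",
                               strand="1", csq="stop_lost")
wnk1_downstream = prefix.format(gene="WNK1", aa="", codons="", impact="MODIFIER",
                                strand="1", csq="downstream_gene_variant")
wnk1_noncoding = prefix.format(gene="WNK1", aa="", codons="", impact="MODIFIER",
                               strand="1", csq="non_coding_transcript_exon_variant")

# Same 18 entries as in the question, grouped by content instead of original order.
csq_entries = ([rad52_downstream] * 9 + [wnk1_stop_lost] * 5
               + [wnk1_downstream] * 3 + [wnk1_noncoding])

counts = Counter()
for entry in csq_entries:
    fields = entry.split("|")
    counts[(fields[SYMBOL], fields[CONSEQUENCE])] += 1

for (gene, consequence), n in counts.most_common():
    print(f"{n:2d}  {gene:6s} {consequence}")
```

The 18 rows collapse to four distinct gene/consequence combinations. The rows that look identical most likely differ only in the transcript ID, which is not among the fields selected for output.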
You can use the different filtering options when running VEP, such as --pick and --per_gene, to restrict your results: http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#filt
You could also use the filter_vep script to filter output that already contains multiple rows per variant: http://www.ensembl.org/info/docs/tools/vep/script/vep_filter.html
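As a rough illustration of the difference between the two options, the sketch below mimics them on the distinct entries from the question. It is a deliberately simplified stand-in: it ranks only by the IMPACT field, whereas VEP's own pick ordering considers several further criteria, and the field positions are again an assumption inferred from the posted output. The function names pick_one and pick_per_gene are made up for this example.

```python
# IMPACT categories ordered from most to least severe.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}

# Assumed field positions, inferred from the CSQ entries posted in this thread.
SYMBOL, IMPACT, CONSEQUENCE = 1, 7, 9

# The four distinct entries for 12:1017956 (trailing empty fields trimmed).
entries = [
    "12_1017956_T/A|RAD52|12:1017956|A|T|||MODIFIER|-1|downstream_gene_variant",
    "12_1017956_T/A|WNK1|12:1017956|A|T|*/K|Tag/Aag|HIGH|1|stop_lost",
    "12_1017956_T/A|WNK1|12:1017956|A|T|||MODIFIER|1|downstream_gene_variant",
    "12_1017956_T/A|WNK1|12:1017956|A|T|||MODIFIER|1|non_coding_transcript_exon_variant",
]


def severity(entry):
    """Lower number means more severe impact."""
    return IMPACT_RANK[entry.split("|")[IMPACT]]


def pick_one(entries):
    """Simplified analogue of --pick: one entry for the whole variant."""
    return min(entries, key=severity)


def pick_per_gene(entries):
    """Simplified analogue of --per_gene: the most severe entry for each gene."""
    best = {}
    for entry in entries:
        gene = entry.split("|")[SYMBOL]
        if gene not in best or severity(entry) < severity(best[gene]):
            best[gene] = entry
    return best


print(pick_one(entries).split("|")[CONSEQUENCE])  # stop_lost
for gene, entry in sorted(pick_per_gene(entries).items()):
    print(gene, entry.split("|")[CONSEQUENCE])
# RAD52 downstream_gene_variant
# WNK1 stop_lost
```

In this sketch the single pick keeps only the WNK1 stop_lost entry, while the per-gene version keeps one entry for RAD52 and one for WNK1 at the same position.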
Using the pick options really worked. Thank you!
Warning? If I understand it correctly, it seems like --per_gene would throw out all variants except for the one at the position with the highest consequence, whereas --pick would keep one per position.