VEP outputs more rs ids than input, why?
My input to the tool is 109,153 rs ids, but I seem to be getting an output of 142,325 rs ids. Why is this so?

Which are the additional rs ids in the output file?

./vep --cache --force_overwrite -i p0_01.csv --vcf --fork 4  -o case_vep_output.vcf

Any thoughts anybody?

You are not getting responses because your question is unspecific. How do you define "more rs ids"? More entries, as in wc -l, or more lines containing "rs", as in grep -c 'rs'? Please elaborate.

Output file contains more entries than the input file.

Can you show some of the new entries?

uday@uday-desktop:~/ensembl-vep$ wc -l temp1.csv
109153 temp1.csv
uday@uday-desktop:~/ensembl-vep$ wc -l temp2.csv
142325 temp2.csv
uday@uday-desktop:~/ensembl-vep$ grep -c 'rs' temp1.csv
108774
uday@uday-desktop:~/ensembl-vep$ grep -c 'rs' temp2.csv
142325


temp1.csv is the list of input rs ids and temp2.csv is the list of rs ids taken from the VEP output.

Are you sure that every single line in your files has a unique rsID? That the same rsID isn't being listed multiple times for having different possible effects in different transcripts?
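One quick way to check this is to compare the number of lines with the number of distinct IDs. Below is a minimal Python sketch, assuming a file with one ID per line (like temp2.csv above); the function name `summarize_ids` is made up for illustration.

```python
from collections import Counter

def summarize_ids(lines):
    """Return (non-empty line count, distinct ID count, IDs seen more than once)."""
    ids = [line.strip() for line in lines if line.strip()]
    counts = Counter(ids)
    repeated = {rsid: n for rsid, n in counts.items() if n > 1}
    return len(ids), len(counts), repeated

if __name__ == "__main__":
    # Toy input for illustration; in practice pass an open file handle,
    # e.g. summarize_ids(open("temp2.csv")).
    demo = ["rs55746161\n", "rs11580120\n", "rs55746161\n", "\n"]
    total, distinct, repeated = summarize_ids(demo)
    print(total, distinct, repeated)
```

If the distinct count is close to the size of the input list while the line count is much larger, the extra lines are repeats of the same IDs rather than new variants.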

Thanks for your response. I don't see a column with that name in the output file. However, I gave rs ids as input to VEP and got more rs ids back than expected. My questions are:
1. Is the tool designed to do this, as you mentioned?
2. Does it give out information on all merged SNPs at a particular locus?
3. How do I deal with SNPs given in the form chr:location? The output says "No variant found" for such formats.

Where are you seeing the rsIDs? In what column of the output? Can you show us a few lines of your output please? Could you tell us some rsIDs that appear in your output that are not in your input?

The aim of the VEP is to tell you the effects of variants on genes. You can input data in a variety of formats including VCF, lists of variant IDs and HGVS. You do not need to know the rsID of the variants you input and the variants can be novel. For every variant, it will tell you which genes it hits and the effects on those genes. If the variant is already known in the database, it will also tell you the identifier (including rsID, COSMIC ID and many more) and give you relevant information about that variant, such as frequency and clinical significance.

You can input variants without an rsID using only the location, if you use one of the accepted formats. You cannot use a mixed format input file. If you have some variants with just an rsID and others with just a location, you will need to do two queries. If most of your data is a list of rsIDs, the VEP is looking for all the inputs to be variant identifiers and will give a "no variant found" message for anything that is not.
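To run the two queries, the mixed file first has to be split. Here is a rough Python sketch, assuming one entry per line and treating anything of the form `rs` followed by digits as an rsID; `split_inputs` is a hypothetical helper, not part of VEP.

```python
import re

# Assumption: rsIDs look like "rs" followed only by digits.
RSID = re.compile(r"^rs\d+$")

def split_inputs(lines):
    """Separate rsID entries from everything else (e.g. chr:location entries)."""
    rsids, others = [], []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if RSID.match(entry):
            rsids.append(entry)
        else:
            others.append(entry)
    return rsids, others

if __name__ == "__main__":
    # Toy input for illustration only.
    rsids, others = split_inputs(["rs55746161\n", "1:1068801\n", "rs11580120\n"])
    print(rsids)
    print(others)
```

The first list can go to VEP as it is. The second list would still need converting to one of the accepted location-based formats, which as far as I know also require the alleles, so a bare chr:location is not enough on its own.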

##fileformat=VCFv4.1
##VEP="v94" time="2018-11-19 03:50:42" cache="/home/uday/.vep/homo_sapiens/94_GRCh38" db="homo_sapiens_core_94_38@ensembldb.ensembl.org" ensembl=94.5c08d90 ensembl-io=94.8d53275 ensembl-variation=94.066b102 ensembl-funcgen=94.08b0c13 1000genomes="phase3" COSMIC="86" ClinVar="201807" ESP="V2-SSA137" HGMD-PUBLIC="20174" assembly="GRCh38.p12" dbSNP="151" gencode="GENCODE 29" genebuild="2014-07" gnomAD="170228" polyphen="2.2.2" regbuild="16" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   1068801 rs55746161  C   A,G .   .   CSQ=A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000394517|processed_transcript|||||||||||2796|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000394517|processed_transcript|||||||||||2796|1||Clone_based_ensembl_gene|,A|intron_variant&non_coding_transcript_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000412397|transcribed_unprocessed_pseudogene||9/9||||||||||1||Clone_based_ensembl_gene|,G|intron_variant&non_coding_transcript_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000412397|transcribed_unprocessed_pseudogene||9/9||||||||||1||Clone_based_ensembl_gene|,A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000427998|processed_transcript|||||||||||2345|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000427998|processed_transcript|||||||||||2345|1||Clone_based_ensembl_gene|,A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000433695|processed_transcript|||||||||||2527|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000433695|processed_transcript|||||||||||2527|1||Clone_based_ensembl_gene|,A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000451054|processed_transcript|||||||||||2360|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000451054|processed_transcript|||||||||||2360|1||Clone_based_ensembl_gene|,A|downstream_gene_variant|MODIFIER|RNF223|ENSG00000237330|Transcript|ENST00000453464|protein_coding|||||||||||2165|-1||HGNC|HGNC:40020,G|downstream_gene_variant|MODIFIER|RNF223|ENSG00000237330|Transcript|ENST00000453464|protein_coding|||||||||||2165|-1||HGNC|HGNC:40020,A|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000456409|processed_transcript|||||||||||2345|1||Clone_based_ensembl_gene|,G|downstream_gene_variant|MODIFIER|AL390719.1|ENSG00000217801|Transcript|ENST00000456409|processed_transcript|||||||||||2345|1||Clone_based_ensembl_gene|
1   1130420 rs11580120  C   T   .   .   CSQ=T|intergenic_variant|MODIFIER||||||||||||||||||||
1   1130717 rs61766345  G   A   .   .   CSQ=A|intergenic_variant|MODIFIER||||||||||||||||||||
1   1132196 rs11589263  G   A   .   .   CSQ=A|upstream_gene_variant|MODIFIER|LINC01342|ENSG00000223823|Transcript|ENST00000416774|lincRNA|||||||||||4821|1||HGNC|HGNC:50551
1   1132482 rs9442374   T   C,G .   .   CSQ=C|upstream_gene_variant|MODIFIER|LINC01342|ENSG00000223823|Transcript|ENST00000416774|lincRNA|||||||||||4535|1||HGNC|HGNC:50551,G|upstream_gene_variant|MODIFIER|LINC01342|ENSG00000223823|Transcript|ENST00000416774|lincRNA|||||||||||4535|1||HGNC|HGNC:50551
1   1133503 rs61766346  G   A   .   .   CSQ=A|upstream_gene_variant|MODIFIER|LINC01342|ENSG00000223823|Transcript|ENST00000416774|lincRNA|||||||||||3514|1||HGNC|HGNC:50551
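
In this VCF output each variant sits on one line, and the per-transcript, per-allele consequences are packed into the CSQ field, separated by commas. A small Python sketch of unpacking it follows, assuming the field names are read from the `Format:` part of the `##INFO=<ID=CSQ...>` header; the function names are made up for illustration and the example header is trimmed to four fields.

```python
def csq_fields(header_line):
    """Read the CSQ field names from the ##INFO=<ID=CSQ,...> header line."""
    fmt = header_line.split("Format: ", 1)[1]
    # Drop the closing quote and angle bracket at the end of the header.
    return fmt.rstrip().rstrip('">').split("|")

def parse_csq(info, fields):
    """Return one dict per consequence block found in a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            blocks = item[len("CSQ="):].split(",")
            return [dict(zip(fields, block.split("|"))) for block in blocks]
    return []

if __name__ == "__main__":
    # Header and INFO trimmed to the first four fields for illustration.
    header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
              'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">')
    info = ("CSQ=C|upstream_gene_variant|MODIFIER|LINC01342,"
            "G|upstream_gene_variant|MODIFIER|LINC01342")
    for row in parse_csq(info, csq_fields(header)):
        print(row["Allele"], row["Consequence"], row["SYMBOL"])
```

Counting the blocks per line shows how a single rsID can account for many consequence entries.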

Regarding rs ids that appear in the output but not in the input, that is not the case at all; I was wrong. A few rs ids occurred multiple times, which is why the output was larger than the input. Like you said, it must have reported multiple gene hits for them. Regardless of the gene hits, the position of an rs id with multiple gene hits is going to be the same, isn't it? I'm only interested in obtaining the positions of the variants in the GRCh38 assembly. Do you concur?

Yes, the position is the position.
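
If the positions are all you need, they can be pulled straight from the first three columns of the VCF output. A minimal Python sketch, assuming whitespace-separated columns in the order CHROM, POS, ID as in the output shown in this thread; `rsid_positions` is a hypothetical name.

```python
def rsid_positions(vcf_lines):
    """Map each variant ID to (chromosome, position), keeping the first occurrence."""
    positions = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, var_id = line.split()[:3]
        positions.setdefault(var_id, (chrom, int(pos)))
    return positions

if __name__ == "__main__":
    # Two records abbreviated from the output in this thread.
    demo = [
        "##fileformat=VCFv4.1\n",
        "#CHROM  POS ID  REF ALT QUAL    FILTER  INFO\n",
        "1   1130420 rs11580120  C   T   .   .   CSQ=T|intergenic_variant|MODIFIER\n",
        "1   1130717 rs61766345  G   A   .   .   CSQ=A|intergenic_variant|MODIFIER\n",
    ]
    for rsid, (chrom, pos) in rsid_positions(demo).items():
        print(rsid, chrom, pos)
```

With a real file you would pass the open handle, e.g. `rsid_positions(open("case_vep_output.vcf"))`. Repeated IDs collapse to a single entry, so the result has one position per rsID.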

Ya sorry about that. I didn't notice it, was kinda in a hurry.

Is this in the colocated variants column? Some loci have more than one rsID assigned to them, usually when multiple rsIDs have been merged. The colocated variants column will show you every variant known at that locus, not just the one you used as input.