Hi Brian,
Thanks for your response. I think I've solved my problem using a few lines of bash. Previously I made a vcf from the dbSNP file, then used this to GATK-annotate the target vcf. With the current method, I just select a few columns. Lines from the input files, and my output are below the code. There is no automated validation, such as checking alleles, length of the output columns, or omitted variants. If you can see any easy improvements, that would be useful.
## From dbSNP information file:
## 1. Remove header and blank rows.
## 2. Select columns containing only old and new variant IDs (2+5).
## 3. Sort by original variant ID (column 1).
grep -v '^#' ${dbsnp_dat} | grep -v -e '^$' | awk '{OFS="\t";print $2, $5}' | sort -k1 > body_${dbsnp_dat}
## Join the input vcf with the dbsnp information, which matches the variant IDs:
## New variant IDs end up being the last column (232), which isn't terrible but it's not ideal.
join -t $'\t' -1 3 -2 1 body_${vcf}.txt body_${dbsnp_dat} > joined.txt
## Create first three columns of new vcf file (chromosome, position, variant ID):
## Note here I've made a variant ID column which is a conjugation of the old and new IDs, separated by a semi-colon.
awk '{OFS="\t";print $2,$3,$1";"$232}' joined.txt > first3cols.txt
## Remove unecessary columns from body of temp vcf (Takes a few minutes with 1.7m variants):
cut -f4-231 < joined.txt > other_cols.txt
## Combine the first columns with the remainder, and sort by chromosome and BP. Specification of tab seperator with -t 't' dependent on operating system.
paste first3cols.txt other_cols.txt | sort -k2 -k1 -t 't' > final_body.txt
## Now put header rows on the vcf.
Top rows of my output:
chr3R 15205653 hc_abcdi;rs881319980 G GT 64059.90 PASS AC=135;AF=0.311;AN=434;BaseQRankSum=-2.600e-01;ClippingRankSum=-2.800e-02;DP=5447;EFF=UPSTREAM(MODIFIER||3550||632|CG4203|protein_coding|CODING|FBtr0083048||GT),DOWNSTREAM(MODIFIER||2677||503|CG42542|protein_coding|CODING|FBtr03
chrX 19680803 hc_abcdk;rs206315230 G T 9516.41 PASS AC=17;AF=0.038;AN=444;BaseQRankSum=0.115;ClippingRankSum=0.181;DP=5421;EFF=UPSTREAM(MODIFIER||4912||645|et|protein_coding|CODING|FBtr0308571||T),UPSTREAM(MODIFIER||3005||547|Ubqn|protein_coding|CODING|FBtr0074750||T),INTRON(MODIFIER||||
Lines from my in-house vcf:
chr2L 21539 hc_nwpsm T TATA 10409.49 PASS AC=49;AF=0.113;AN=434;BaseQRankSum=0.039;ClippingRankSum=0.00;DP=3241;EFF=UPSTREAM(MODIFIER||2970||1161|l_2_gl|protein_coding|CODING|FBtr0078170||TATA),DOWNSTREAM(MODIFIER||283||842|Ir21a|protein_coding|CODING|FBtr0113008||TATA),DOWNSTREAM(MODI
chr2L 22077 hc_nodcb C CCAT 123.81 PASS AC=1;AF=2.252e-03;AN=444;BaseQRankSum=-5.360e-01;ClippingRankSum=-7.840e-01;DP=6341;EFF=CODON_INSERTION(MODERATE||atc/atcATG|I790IM|842|Ir21a|protein_coding|CODING|FBtr0113008|5|CCAT),UPSTREAM(MODIFIER||3508||1161|l_2_gl|protein_coding|CODING|FBtr00781
chr2L 22081 hc_munaj G GAAATAACA 65.81 PASS AC=1;AF=2.252e-03;AN=444;BaseQRankSum=1.58;ClippingRankSum=-2.840e-01;DP=6430;EFF=FRAME_SHIFT(HIGH||gac/gaTGTTATTTc|D789DVI?|842|Ir21a|protein_coding|CODING|FBtr0113008|5|GAAATAACA),UPSTREAM(MODIFIER||3512||1161|l_2_gl|protein_coding|CODING|FBt
chr2L 25378 hc_fueqn C T 41734.96 PASS AC=121;AF=0.275;AN=440;BaseQRankSum=0.157;ClippingRankSum=-2.440e-01;DP=4427;EFF=UPSTREAM(MODIFIER||223||842|Ir21a|protein_coding|CODING|FBtr0113008||T),DOWNSTREAM(MODIFIER||24||1998|Cda5|protein_coding|CODING|FBtr0078164||T),INTERGENIC(MODIFIE
chr2L 25727 hc_jamie T G 1342.13 PASS AC=10;AF=0.023;AN=442;BaseQRankSum=-1.560e-01;ClippingRankSum=0.383;DP=4819;EFF=UTR_3_PRIME(MODIFIER||794||1998|Cda5|protein_coding|CODING|FBtr0078164|14|G),UPSTREAM(MODIFIER||572||842|Ir21a|protein_coding|CODING|FBtr0113008||G);FS=4.030;InbreedingCoef
The same lines from the file provided by dbSNP:
ss1985661068 hc_nwpsm -/ATA 444 rs881281674 0 2L 21539 NT_033779.5 21539 0 Release_6_plus_ISO1_MT 1
ss1985661069 hc_nodcb -/CAT 444 rs881341942 0 2L 22077 NT_033779.5 22077 0 Release_6_plus_ISO1_MT 1
ss1985661070 hc_munaj -/AAATAACA 444 rs881205429 0 2L 22081 NT_033779.5 22081 0 Release_6_plus_ISO1_MT 1
ss1985661071 hc_fueqn C/T 444 rs881216460 0 2L 25378 NT_033779.5 25378 0 Release_6_plus_ISO1_MT 1
ss1985661072 hc_jamie G/T 444 rs881333198 0 2L 25727 NT_033779.5 25727 0 Release_6_plus_ISO1_MT 1
Hi Ashutosh and Alons, and apologies for the late response,
I used GATK AnnotateVariants the and it seemed to work. I'm using a different vcf now, and it fails, really for these reasons:
I thought about trying to import the two files into R and match the IDs there, but they're too big. My current alternative is to match the IDs using bash. I've used SNPeff for other things, and shall try with this task.
The two more frustrating things are that:
Anyway, thanks again for getting back to me, and I hope to post a solution soon.
I'm not sure whether I can help you, but would you mind posting examples of the problematic data? Specifically, a couple lines from your custom-made VCF, the dbSNP indel lines, etc. Basically, a couple lines from all of the relevant files, corresponding to the same mutations.
VCF has the incredibly irritating requirement of padding indels with an unrelated nearby base, which is why it differs from dbSNP, but it's not overly difficult to programatically convert from VCF->dbSNP (the other way is harder).