Question: replace vcf file ID column
3.2 years ago, tanya.copley wrote:

Hi, I am working with two very large VCF files (over 10 GB each, so copying and pasting is not practical, and I want to include the step in a script for future studies). I need to replace the values in the "ID" column of one of them so that the IDs match for merging. First, I removed all rows containing ## to have a simple matrix with no information lines. I then tried replacing the column with awk, after first converting the VCF to a .txt file:

awk 'FNR==NR{a[NR]=$3;next}{$3=a[FNR]}1' file2.txt file1.txt > output.txt

After converting the result back to a VCF, it does not work. I also tried removing the first 3 columns of the VCF, converting it to a .txt file, and using a simple

paste file2.txt file1.txt > output.txt

(where file2.txt holds the CHROM, POS and new ID columns). After converting back to a VCF, the contents are not put in the same row, but rather one row after the other. So I then tried the following command to merge every other row together, but it is not working either:

awk '{getline b;printf("%s %s\n",$0,b)}' output.txt > final.txt

Any help would be appreciated.

modified 3.2 years ago by Pierre Lindenbaum • written 3.2 years ago by tanya.copley
3.2 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

save the header

grep "^#"  input.vcf > output.vcf

(you wrote that file2.txt holds the CHROM, POS and new ID columns)

create a CHROM_POS / ID table from it and sort

awk '{printf("%s_%s\t%s\n",$1,$2,$3);}'  file2.txt | LC_ALL=C sort -k1,1 > sorted1.txt

create a pseudo key for the VCF and sort

grep -v '^#' input.vcf | awk '{printf("%s_%s\t%s\n",$1,$2,$0);}' | LC_ALL=C sort -k1,1 > sorted2.txt

join and concatenate (I'm lazy: here you have to play with the join parameters/output to select/remove some columns, keep the orphans, etc.; check the 'join' manual)

LC_ALL=C join -t $'\t' -1 1 -2 1 sorted1.txt sorted2.txt | awk something >> output.vcf
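
The final awk step is left open in the answer, so here is one possible way to fill it in, as a sketch rather than the author's actual command. It runs the whole pipeline on toy data. The demo_* file names and the toy records are made up for illustration (they stand in for input.vcf, file2.txt and output.vcf), and the sketch assumes tab-separated input with at most one record per CHROM/POS pair.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy inputs (hypothetical): a 2-record VCF and a CHROM/POS/new-ID table.
printf '##fileformat=VCFv4.2\n' > demo_input.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo_input.vcf
printf 'chr1\t100\told_a\tA\tG\t.\t.\t.\n' >> demo_input.vcf
printf 'chr1\t200\told_b\tC\tT\t.\t.\t.\n' >> demo_input.vcf
printf 'chr1\t100\tnew_a\nchr1\t200\tnew_b\n' > demo_ids.txt

# Save the header.
grep '^#' demo_input.vcf > demo_output.vcf

# Build the CHROM_POS / new-ID table and the keyed VCF body, both sorted on the key.
awk '{printf("%s_%s\t%s\n",$1,$2,$3)}' demo_ids.txt |
  LC_ALL=C sort -t $'\t' -k1,1 > demo_sorted1.txt
grep -v '^#' demo_input.vcf | awk '{printf("%s_%s\t%s\n",$1,$2,$0)}' |
  LC_ALL=C sort -t $'\t' -k1,1 > demo_sorted2.txt

# join prints: key, new ID, CHROM, POS, old ID, remaining VCF columns.
# Overwrite the old ID (field 5) with the new ID (field 2), then print
# everything from CHROM (field 3) onward.
LC_ALL=C join -t $'\t' -1 1 -2 1 demo_sorted1.txt demo_sorted2.txt |
  awk -F '\t' -v OFS='\t' '{
    $5 = $2
    out = $3
    for (i = 4; i <= NF; i++) out = out OFS $i
    print out
  }' >> demo_output.vcf

cat demo_output.vcf
```

Two things to keep in mind with this sketch: join drops records that have no match in the ID table unless you ask it to keep them, and the records come out in key order rather than genomic order, so the result may need re-sorting before merging.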

Unfortunately, this still gives me the same problem: the contents of the two files end up on different lines rather than together on the same line. Thanks, though.

written 3.2 years ago by tanya.copley

uhh ??? .....

written 3.2 years ago by Pierre Lindenbaum

Ya, I can't figure out why it's doing that. I ended up doing it in R; it took forever, but it worked.

written 3.2 years ago by tanya.copley
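
The thread never establishes why the pasted or joined fields landed on separate lines. One common cause of that symptom is Windows-style line endings, where each line carries a trailing carriage return; this is only a guess, since nothing above confirms it was the case here. A minimal check and cleanup, using made-up demo_* file names, might look like this:

```shell
#!/usr/bin/env bash

# Toy file with Windows (CRLF) line endings; the name is hypothetical.
printf 'chr1\t100\tnew_a\r\nchr1\t200\tnew_b\r\n' > demo_crlf.txt

# Count lines that end in a carriage return; a non-zero count suggests CRLF endings.
# grep -c exits non-zero when the count is 0, hence the "|| true".
cr_lines=$(grep -c $'\r$' demo_crlf.txt || true)
echo "lines ending in CR: $cr_lines"

# Strip the carriage returns into a new file before running paste or join.
tr -d '\r' < demo_crlf.txt > demo_lf.txt
clean_lines=$(grep -c $'\r$' demo_lf.txt || true)
echo "after cleanup: $clean_lines"
```

If the count is zero for both input files, the cause lies elsewhere and this cleanup will not help.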