How to edit a VCF file (specifically the #CHROM and ID columns)
4.6 years ago
angelaparody ▴ 80

Hi,

I have a .vcf file and I need to edit the #CHROM and ID columns because they cause problems when I try to convert the file to other formats (specifically, the #CHROM column contains scaffold names and the ID column contains only "." for all the SNPs).

I have a tab-delimited .txt file containing both columns, which I want to use to edit those two columns of my .vcf file. I have tried several solutions posted on Biostars, but nothing has worked for me so far.

So, in short, how can I edit the #CHROM and ID columns of a .vcf file using a tab-delimited .txt file? Please answer in some detail, since I am pretty new to all this.

Cheers,

Ángela


Please show us a few lines of your input, a few lines of your desired output, and the solutions you have tried.


"nothing worked for me so far"

Please be as specific as possible, e.g. include error message or explain how the output you got differs from what you aim to obtain.

4.6 years ago
angelaparody ▴ 80

Hi,

I was writing up all the details when I found out how to make it work. Well, actually, I have been able to set IDs on the fly with

bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' myfilename.vcf > new.vcf


myfilename.vcf was looking like this (first rows and first three columns):

#CHROM  POS ID
scaffold00002   764729  .
scaffold00002   764955  .
scaffold00002   765132  .
scaffold00002   766694  .
scaffold00002   766775  .
scaffold00002   766966  .
scaffold00002   771319  .
scaffold00002   773905  .
scaffold00002   775644  .
scaffold00002   776411  .
scaffold00007   1178023 .
scaffold00007   1178440 .
scaffold00007   1180956 .


and after using the command, the new.vcf file looks like this:

#CHROM  POS ID
scaffold00002   764729  scaffold00002_764729_T_C
scaffold00002   764955  scaffold00002_764955_C_G
scaffold00002   765132  scaffold00002_765132_G_A
scaffold00002   766694  scaffold00002_766694_C_G
scaffold00002   766775  scaffold00002_766775_G_A
scaffold00002   766966  scaffold00002_766966_G_A
scaffold00002   771319  scaffold00002_771319_C_G
scaffold00002   773905  scaffold00002_773905_A_G
scaffold00002   775644  scaffold00002_775644_T_C
scaffold00002   776411  scaffold00002_776411_A_T
scaffold00007   1178023 scaffold00007_1178023_C_T
scaffold00007   1178440 scaffold00007_1178440_A_G
scaffold00007   1180956 scaffold00007_1180956_C_T


So my problem is half solved. I think I won't need to edit the CHROM information, although I am not totally sure, since I read somewhere that it might be problematic for this column to contain letters. Anyway, if I need to edit the CHROM information I will ask the question again.

Thanks again,

Ángela
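
If bcftools is not at hand, the same ID scheme can be approximated with awk. This is only a sketch, not what bcftools does internally: the file names demo_ids.vcf and demo_ids.out.vcf and the three sample records are made up for illustration, and it assumes a tab-separated VCF with REF in column 4 and ALT in column 5. Like the leading "+" in --set-id, it only fills IDs that are missing (".").

```shell
# Hypothetical miniature VCF, written here only so the example is self-contained.
printf '#CHROM\tPOS\tID\tREF\tALT\n'            > demo_ids.vcf
printf 'scaffold00002\t764729\t.\tT\tC\n'      >> demo_ids.vcf
printf 'scaffold00002\t764955\trs1\tC\tG,T\n'  >> demo_ids.vcf
printf 'scaffold00007\t1178023\t.\tC\tT,A\n'   >> demo_ids.vcf

awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { print; next }                 # header lines pass through unchanged
     $3 == "." {                          # only records without an ID
         split($5, alt, ",")              # alt[1] is the first ALT allele
         $3 = $1 "_" $2 "_" $4 "_" alt[1] # CHROM_POS_REF_FIRSTALT
     }
     { print }' demo_ids.vcf > demo_ids.out.vcf

cat demo_ids.out.vcf
```

The record that already carries an ID (rs1) is left alone, and for a multi-allelic site only the first ALT allele goes into the ID.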


Use the code button (the one with 101010 on it in the formatting bar) to format your code so it is more readable.


Sorry about that, I am new here. From now on I will use it. Thanks

Ángela

4.6 years ago
angelaparody ▴ 80

Hi again,

It would actually be very helpful to be able to edit the #CHROM column as well. What I am after is to change from this:

#CHROM  POS ID
scaffold00002   764729  scaffold00002_764729_T_C
scaffold00002   764955  scaffold00002_764955_C_G
scaffold00002   765132  scaffold00002_765132_G_A
scaffold00002   766694  scaffold00002_766694_C_G
scaffold00002   766775  scaffold00002_766775_G_A
scaffold00002   766966  scaffold00002_766966_G_A
scaffold00002   771319  scaffold00002_771319_C_G
scaffold00002   773905  scaffold00002_773905_A_G
scaffold00002   775644  scaffold00002_775644_T_C
scaffold00002   776411  scaffold00002_776411_A_T
scaffold00007   1178023 scaffold00007_1178023_C_T
scaffold00007   1178440 scaffold00007_1178440_A_G
scaffold00007   1180956 scaffold00007_1180956_C_T


to this:

#CHROM  POS ID
1   764729  scaffold00002_764729_T_C
1   764955  scaffold00002_764955_C_G
1   765132  scaffold00002_765132_G_A
1   766694  scaffold00002_766694_C_G
1   766775  scaffold00002_766775_G_A
1   766966  scaffold00002_766966_G_A
1   771319  scaffold00002_771319_C_G
1   773905  scaffold00002_773905_A_G
1   775644  scaffold00002_775644_T_C
1   776411  scaffold00002_776411_A_T
2   1178023 scaffold00007_1178023_C_T
2   1178440 scaffold00007_1178440_A_G
2   1180956 scaffold00007_1180956_C_T


Specifically, each scaffold would be changed to a number. I have 126 scaffolds in total, so the numbers in the #CHROM column would run from 1 to 126. I don't think simply replacing the column would work, because the .vcf file has a header, but I am not sure (I am pretty new to all this).
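
One way to get this 1-to-N numbering without touching the header is sketched below with awk: each scaffold receives the next integer the first time it appears, and the old-to-new mapping is saved alongside. The file names (demo_chrom.vcf, demo_chrom.renamed.vcf, chrom_map.txt) and the sample records are hypothetical, and the sketch assumes a tab-separated file whose records are grouped by scaffold. It does not rewrite any ##contig header lines; if bcftools is available, its documented `annotate --rename-chrs` option takes an "old_name new_name" file like chrom_map.txt and may be the safer route for a full VCF.

```shell
# Hypothetical miniature VCF, written here only so the example is self-contained.
printf '##fileformat=VCFv4.2\n'                                > demo_chrom.vcf
printf '#CHROM\tPOS\tID\n'                                    >> demo_chrom.vcf
printf 'scaffold00002\t764729\tscaffold00002_764729_T_C\n'    >> demo_chrom.vcf
printf 'scaffold00002\t764955\tscaffold00002_764955_C_G\n'    >> demo_chrom.vcf
printf 'scaffold00007\t1178023\tscaffold00007_1178023_C_T\n'  >> demo_chrom.vcf

awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { print; next }                  # keep every header line as it is
     !($1 in num) {                        # first time this scaffold is seen
         num[$1] = ++n                     # give it the next number
         print $1, n > "chrom_map.txt"     # record old name and new number
     }
     { $1 = num[$1]; print }' demo_chrom.vcf > demo_chrom.renamed.vcf

cat chrom_map.txt
cat demo_chrom.renamed.vcf
```

Keeping chrom_map.txt makes the renaming reversible and documents which number stands for which scaffold.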

I tried this:

## Remove header from txt file
tail -n+2 sub.txt > newpos.txt

## Get header from vcf
grep -P '^#' test.vcf > new.vcf

grep -v -P '^#' test.vcf \
| cut -f3- \
| paste newpos.txt - >> new.vcf


This was posted in the thread "Replace fields CHROM and POS in a vcf file", but it didn't work for me: the new.vcf file produced by the step "## Get header from vcf" is empty. I suspect there is something wrong in the code, but as I don't fully understand this language I don't know what it could be. This is what my terminal shows:

MacBook-Pro-de-Angela:EditingCHROMfieldVCF angelaparodymerino$ grep -P '^#' mac3_minDP3_maxmeanDP289_maf005_minQ40_minGQ30_hwe005_265ind_IDs2.vcf > new.vcf
usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
    [-e pattern] [-f file] [--binary-files=value] [--color=when]
    [--context[=num]] [--directories=action] [--label] [--line-buffered]
    [--null] [pattern] [file ...]

In short, does anyone know how I can edit the #CHROM column of a .vcf file from a .txt file?

Thanks in advance,

Regards,

Ángela Parody-Merino

output:

$ sed -n 's/^[a-z]*0*//p' test

#CHROM  POS ID
2   764729  scaffold00002_764729_T_C
2   764955  scaffold00002_764955_C_G
2   765132  scaffold00002_765132_G_A
2   766694  scaffold00002_766694_C_G
2   766775  scaffold00002_766775_G_A
2   766966  scaffold00002_766966_G_A
2   771319  scaffold00002_771319_C_G
2   773905  scaffold00002_773905_A_G
2   775644  scaffold00002_775644_T_C
2   776411  scaffold00002_776411_A_T
7   1178023 scaffold00007_1178023_C_T
7   1178440 scaffold00007_1178440_A_G
7   1180956 scaffold00007_1180956_C_T


Input:

$ cat test
#CHROM  POS ID
scaffold00002   764729  scaffold00002_764729_T_C
scaffold00002   764955  scaffold00002_764955_C_G
scaffold00002   765132  scaffold00002_765132_G_A
scaffold00002   766694  scaffold00002_766694_C_G
scaffold00002   766775  scaffold00002_766775_G_A
scaffold00002   766966  scaffold00002_766966_G_A
scaffold00002   771319  scaffold00002_771319_C_G
scaffold00002   773905  scaffold00002_773905_A_G
scaffold00002   775644  scaffold00002_775644_T_C
scaffold00002   776411  scaffold00002_776411_A_T
scaffold00007   1178023 scaffold00007_1178023_C_T
scaffold00007   1178440 scaffold00007_1178440_A_G
scaffold00007   1180956 scaffold00007_1180956_C_T
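
The sed substitution keeps each scaffold's own number (2, 7). If the new names should instead come from a tab-delimited file, as in the original question, a lookup with awk is one option. The sketch below assumes a two-column mapping file (old name, then new name); the real layout of the .txt file was not shown in the thread, so scaffold_names.txt and the demo_lookup file names are hypothetical.

```shell
# Hypothetical mapping file: old scaffold name <TAB> new name, one per line.
printf 'scaffold00002\t1\nscaffold00007\t2\n' > scaffold_names.txt

# Hypothetical miniature VCF, written here only so the example is self-contained.
printf '#CHROM\tPOS\tID\n'                                     > demo_lookup.vcf
printf 'scaffold00002\t764729\tscaffold00002_764729_T_C\n'    >> demo_lookup.vcf
printf 'scaffold00007\t1178023\tscaffold00007_1178023_C_T\n'  >> demo_lookup.vcf
printf 'scaffold00009\t55\tscaffold00009_55_A_G\n'            >> demo_lookup.vcf

awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR   { new[$1] = $2; next }   # first file: load the lookup table
     /^#/        { print; next }          # second file: keep header lines
     ($1 in new) { $1 = new[$1] }         # rename only scaffolds in the table
     { print }' scaffold_names.txt demo_lookup.vcf > demo_lookup.renamed.vcf

cat demo_lookup.renamed.vcf
```

Because awk separates header and data lines itself, this also avoids the `grep -P` call that produced the usage message earlier in the thread; for a pattern as simple as '^#', plain `grep '^#'` would work too. Scaffolds missing from the mapping file (scaffold00009 here) are left unchanged rather than dropped.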


Thanks so much! It worked! :)

Regards,

Ángela Parody-Merino