Question: VCF files: Change Chromosome Notation
gravatar for Quak
5.0 years ago by
United States
Quak300 wrote:

There are two VCF files that I like to merge them, using GATK or VCFtools. The problem is, they have different chromosomal notation, one has Chr, the other does not. This question could be similar to this one

Is there any quick awk/sed commands that you suggest ?! Also I appreciate if you make comment, which of these two (GATK/VCFtools) is more reliable for this task.

next-gen sequence • 16k views
ADD COMMENTlink modified 8 months ago by jerviedog20 • written 5.0 years ago by Quak300

I am very new at this and ran into a similar but slightly more complicated problem today with the Cryptococcus genome. I think I solved it thanks to help and links posted here (and didn't find a solution elsewhere) so thought I should post it here in case someone comes along with a similar problem. The reference genome I use does not use either numerical (1, 2, 3) or chr (chr1, chr2, chr3) notation, it has wacky chromosome names (CP003827, CP003822 etc.). So to replace my chromosome names in a vcf file to make them numerical I used a series of grep commands in awk:

awk '{gsub(/CP003827/, "8"); gsub(/CP003822/, "3"); gsub(/CP003824/, "5");gsub(/CP003834/, "Mt"); gsub(/CP003833/, "14"); gsub(/CP003829/, "10"); gsub(/CP003826/, "7"); gsub(/CP003820/, "1"); gsub(/CP003828/, "9"); gsub(/CP003825/, "6"); gsub(/CP003821/, "2"); gsub(/CP003830/, "11"); gsub(/CP003832/, "13"); gsub(/CP003831/, "12"); gsub(/CP003823/, "4"); print;}' original.vcf > original_newchr.vcf

I had no knowledge of awk before stumbling onto this post so there might be a more elegant way to do this, but this seems to work, which is good enough for me!

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by acgerstein0
gravatar for Vivek
5.0 years ago by
Vivek2.2k wrote:

awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf

Should get rid of the chr

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

Will add the chr to the VCF without chr. 

You can use either command depending on how the chromosomes are named in your reference. 

Regarding preference of tools, if you plan to do downstream processing with GATK, I'd suggest sticking with GATK combinevariants for consistency.

ADD COMMENTlink written 5.0 years ago by Vivek2.2k

Thanks for this Vivek :) Works great!

It does still leave the ID lines in the VCF unmodified though, so I made a little tweak to catch those too:

awk '{ if($0 !~ /^#/) print "chr"$0; else if(match($0,/(##contig=<ID=)(.*)/,m)) print m[1]"chr"m[2]; else print $0 }' no_chr.vcf > with_chr.vcf

I'm also curious if this is ever the right thing to do, since VCFs from genomes with "chr" (notably the Mouse Genomes Project VCFs) may not be the same as the genome without the chr (notably the UCSC genomes and BAMs mapped to them) even if they both say they are "mm10" or something similar. All i'm saying to future reader is to *be careful* :)

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by John12k

Yeah, that's bit more sophisticated but I don't think most tools including GATK mind what's in the VCF header as long as the notation in the VCF entries conforms with the reference provided.

ADD REPLYlink written 3.2 years ago by Vivek2.2k

You're probably right - I was having a different issue with GATK earlier and I thought this might be the reason, but it turned out to be unrelated.

Ho hum. Maybe one day the gods of bioinformatics will decided to drop the "chr"s in all datasets and we can live in peace :P

ADD REPLYlink written 3.2 years ago by John12k

Or the goddesses will decide to keep "chr"s in all datasets and we can live in peace too (or this could be the cause of another war again?)

ADD REPLYlink written 8 months ago by bounlu170

Could you explain a little bit about this part of the command? {if($0 !~ /^#/) Because I'm reading it as: if $0-- if the line contains !~ -- something NOT EQUAL to /^#/ -- a number at the beginning of the line

For some reason, I thought it should read the same except without the ! As in awk '{if($0 ~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

So am I misunderstanding such that ! does not mean "not equal to" and # does not refer to "any number"?

ADD REPLYlink written 2.5 years ago by ms2380

The !~ operator is not the same as !=. The first specifies a "not-match" operation on a regular expression pattern (in this case, lines that start with a # character). The second operator is a "not-equals" operation on a pair of scalar values, like a pair of numbers or strings. Refer to the documentation for more details:

ADD REPLYlink written 2.5 years ago by Alex Reynolds28k

Ok, thanks, Alex. I see what it is doing now. It also looks like the formatting in vcf files are not necessarily obvious. As you can see from my subset below, it looks like many lines do not start with a # character, yet, they are not changed to chr1, which is good. Therefore, though they appear to be the start of a line, they are really a continuation of the preceding line. (for example, lines 3 and 4 below.)

1 3006834 . GA G 77 PASS INDEL;DP=419;DP4=5,9,200,205;CSQ=-||||intergenic_variant|||||||| GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI 1/1:49:38:0:81,49,0:62,43,0:2:41:31:0,7,18,13:21:-0.69311:.:1 1/1:101:30:0:133,101,0:104,90,0:2:60:30:0,0,16,14:0:-0.693097:.:1 1/1:24:11:0:66,24,0:44,16,0:2:45:9:0,2,9,0:17:-0.662043:.:1 1/1:57:19:0:110,57,0:78,44,0:2:60:18:1,0,10,8:0:-0.691153:.:1 ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. 1/1:69:20:0:125,69,0:99,60,0:2:60:20:0,0,9,11:0:-0.692067:.:1 ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:.:. 1/1:123:40:0:124,133,0:92,120,0:2:58:40:0,0,12,28:0:-0.693145:.:1 ./.:.:.:.:.:.:.:.:.:.:.:.:.:. 1/1:44:15:0:85,44,0:56,33,0:2:59

ADD REPLYlink written 2.5 years ago by ms2380
gravatar for vlaufer
5.0 years ago by
United States
vlaufer260 wrote:

Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime... check out this page for find and replace using awk:

ADD COMMENTlink written 5.0 years ago by vlaufer260

+1. Such a powerful tool. Basic knowledge of awk should be in every bioinformaticist's back pocket.

ADD REPLYlink written 5.0 years ago by Alex Reynolds28k
gravatar for jerviedog
8 months ago by
United States
jerviedog20 wrote:

Another way of doing this by using bcftools v1.9:

echo "1 chr1" >> chr_name_conv.txt echo "2 chr2" >> chr_name_conv.txt bcftools annotate --rename-chrs chr_name_conv.txt original.vcf.gz | bgzip > rename.vcf.gz

ADD COMMENTlink written 8 months ago by jerviedog20
gravatar for Alex
12 months ago by
Alex0 wrote:

Next to the above answers, this resource might be helpful: mappings between common notations (UCSC, Ensembl, Gencode)

You can use this together with

awk 'NR==FNR{map[$1]=$2;next} { for (i=1;i<=NF;i++) $i=($i in map ? map[$i] : $i) } 1' mapping_file.txt input.vcf > output.vcf

to batch rename between common chromosomal notations.

ADD COMMENTlink written 12 months ago by Alex0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1903 users visited in the last hour