Question: VCF files: Change Chromosome Notation
10
gravatar for Quak
5.4 years ago by
Quak300
United States
Quak300 wrote:

There are two VCF files that I like to merge them, using GATK or VCFtools. The problem is, they have different chromosomal notation, one has Chr, the other does not. This question could be similar to this one

Is there any quick awk/sed commands that you suggest ?! Also I appreciate if you make comment, which of these two (GATK/VCFtools) is more reliable for this task.

next-gen sequence • 19k views
ADD COMMENTlink modified 16 days ago by carmen.bugarin.diz0 • written 5.4 years ago by Quak300

I am very new at this and ran into a similar but slightly more complicated problem today with the Cryptococcus genome. I think I solved it thanks to help and links posted here (and didn't find a solution elsewhere) so thought I should post it here in case someone comes along with a similar problem. The reference genome I use does not use either numerical (1, 2, 3) or chr (chr1, chr2, chr3) notation, it has wacky chromosome names (CP003827, CP003822 etc.). So to replace my chromosome names in a vcf file to make them numerical I used a series of grep commands in awk:

awk '{gsub(/CP003827/, "8"); gsub(/CP003822/, "3"); gsub(/CP003824/, "5");
      gsub(/CP003834/, "Mt"); gsub(/CP003833/, "14"); gsub(/CP003829/, "10"); 
      gsub(/CP003826/, "7"); gsub(/CP003820/, "1"); gsub(/CP003828/, "9");
      gsub(/CP003825/, "6"); gsub(/CP003821/, "2"); gsub(/CP003830/, "11");
      gsub(/CP003832/, "13"); gsub(/CP003831/, "12"); gsub(/CP003823/, "4");
      print;}' original.vcf > original_newchr.vcf

I had no knowledge of awk before stumbling onto this post so there might be a more elegant way to do this, but this seems to work, which is good enough for me!

ADD REPLYlink modified 12 weeks ago by RamRS24k • written 3.5 years ago by acgerstein0
38
gravatar for Vivek
5.4 years ago by
Vivek2.3k
Denmark
Vivek2.3k wrote:
awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf

Should get rid of the chr

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

Will add the chr to the VCF without chr.

You can use either command depending on how the chromosomes are named in your reference.

Regarding preference of tools, if you plan to do downstream processing with GATK, I'd suggest sticking with GATK CombineVariants for consistency.

ADD COMMENTlink modified 12 weeks ago by RamRS24k • written 5.4 years ago by Vivek2.3k
4

Thanks for this Vivek :) Works great!

It does still leave the ID lines in the VCF unmodified though, so I made a little tweak to catch those too:

awk '{ 
        if($0 !~ /^#/) 
            print "chr"$0;
        else if(match($0,/(##contig=<ID=)(.*)/,m))
            print m[1]"chr"m[2];
        else print $0 
      }' no_chr.vcf > with_chr.vcf

I'm also curious if this is ever the right thing to do, since VCFs from genomes with chr (notably the Mouse Genomes Project VCFs) may not be the same as the genome without the chr (notably the UCSC genomes and BAMs mapped to them) even if they both say they are mm10 or something similar. All I'm saying to future reader is to be careful :)

ADD REPLYlink modified 12 weeks ago by RamRS24k • written 3.6 years ago by John12k
1

Yeah, that's bit more sophisticated but I don't think most tools including GATK mind what's in the VCF header as long as the notation in the VCF entries conforms with the reference provided.

ADD REPLYlink written 3.6 years ago by Vivek2.3k
1

You're probably right - I was having a different issue with GATK earlier and I thought this might be the reason, but it turned out to be unrelated.

Ho hum. Maybe one day the gods of bioinformatics will decided to drop the "chr"s in all datasets and we can live in peace :P

ADD REPLYlink written 3.6 years ago by John12k
1

Or the goddesses will decide to keep "chr"s in all datasets and we can live in peace too (or this could be the cause of another war again?)

ADD REPLYlink written 13 months ago by bounlu190

Could you explain a little bit about this part of the command?

{if($0 !~ /^#/)

Because I'm reading it as:

if $0-- if the line contains
!~  -- something NOT EQUAL to
/^#/ -- a number at the beginning of the line

For some reason, I thought it should read the same except without the !

As in

awk '{if($0 ~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

So am I misunderstanding such that ! does not mean "not equal to" and # does not refer to "any number"?

ADD REPLYlink modified 12 weeks ago by RamRS24k • written 2.9 years ago by ms2380

The !~ operator is not the same as !=. The first specifies a "not-match" operation on a regular expression pattern (in this case, lines that start with a # character). The second operator is a "not-equals" operation on a pair of scalar values, like a pair of numbers or strings. Refer to the documentation for more details: https://www.gnu.org/software/gawk/manual/html_node/Comparison-Operators.html#Comparison-Operators

ADD REPLYlink written 2.9 years ago by Alex Reynolds28k

Ok, thanks, Alex. I see what it is doing now. It also looks like the formatting in vcf files are not necessarily obvious. As you can see from my subset below, it looks like many lines do not start with a # character, yet, they are not changed to chr1, which is good. Therefore, though they appear to be the start of a line, they are really a continuation of the preceding line. (for example, lines 3 and 4 below.)

1       3006834 .       GA      G       77      PASS    INDEL;DP=419;DP4=5,9,200,205;CSQ=-||||intergenic_variant||||||||        GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI  1/1:49:38:0:81,49,0:62,43,0:2:41:31:0,7,18,13:21:-0.69311:.:1   1/1:101:30:0:133,101,0:104,90,0:2:60:30:0,0,16,14:0:-0.693097:.:1       1/1:24:11:0:66,24,0:44,16,0:2:45:9:0,2,9,0:17:-0.662043:.:1     1/1:57:19:0:110,57,0:78,44,0:2:60:18:1,0,10,8:0:-0.691153:.:1   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:69:20:0:125,69,0:99,60,0:2:60:20:0,0,9,11:0:-0.692067:.:1   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:123:40:0:124,133,0:92,120,0:2:58:40:0,0,12,28:0:-0.693145:.:1       ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:44:15:0:85,44,0:56,33,0:2:59
ADD REPLYlink modified 12 weeks ago by RamRS24k • written 2.9 years ago by ms2380
9
gravatar for vlaufer
5.4 years ago by
vlaufer260
United States
vlaufer260 wrote:

Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime... check out this page for find and replace using awk:

http://awk.info/?awk1line

ADD COMMENTlink written 5.4 years ago by vlaufer260
1

+1. Such a powerful tool. Basic knowledge of awk should be in every bioinformaticist's back pocket.

ADD REPLYlink written 5.4 years ago by Alex Reynolds28k
4
gravatar for jerviedog
13 months ago by
jerviedog40
United States
jerviedog40 wrote:

Another way of doing this by using bcftools v1.9:

echo "1 chr1" >> chr_name_conv.txt
echo "2 chr2" >> chr_name_conv.txt
bcftools annotate --rename-chrs chr_name_conv.txt original.vcf.gz | bgzip > rename.vcf.gz
ADD COMMENTlink modified 12 weeks ago by RamRS24k • written 13 months ago by jerviedog40
1
gravatar for Shicheng Guo
12 weeks ago by
Shicheng Guo7.7k
Shicheng Guo7.7k wrote:

When you change the "chr1" to 1, don't forget to change the 'annotation lines'. the following solution will be okay.

awk '{gsub(/^chr/,""); print}' test.vcf > test.temp.vcf
awk '{gsub(/=chr/,"="); print}' test.temp.vcf > test.update.vcf
ADD COMMENTlink written 12 weeks ago by Shicheng Guo7.7k

This has already been addressed by the comment to the most popular answer (C: VCF files: Change Chromosome Notation) - I don't see why this is fit to be a new answer by itself.

ADD REPLYlink written 12 weeks ago by RamRS24k
0
gravatar for Alex
17 months ago by
Alex0
Alex0 wrote:

Next to the above answers, this resource might be helpful: mappings between common notations (UCSC, Ensembl, Gencode) https://github.com/dpryan79/ChromosomeMappings

You can use this together with

awk 'NR==FNR{map[$1]=$2;next} { for (i=1;i<=NF;i++) $i=($i in map ? map[$i] : $i) } 1' mapping_file.txt input.vcf > output.vcf

to batch rename between common chromosomal notations.

ADD COMMENTlink written 17 months ago by Alex0
0
gravatar for carmen.bugarin.diz
16 days ago by
King's College London
carmen.bugarin.diz0 wrote:

I had this problem before and I was able to fix it while running some filters to the file. If you are performing any sort of filtering on those VCF files you could use:

sed 's:chr::g'

Example:

vcftools --gzvcf sample.vcf.gz --keep sample_list.txt --mac 1 --recode-to-stream | sed 's:chr::g' | bgzip -c > new_sample.vcf.gz
ADD COMMENTlink modified 16 days ago by RamRS24k • written 16 days ago by carmen.bugarin.diz0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1773 users visited in the last hour