VCF files: Change Chromosome Notation
6
13
Entering edit mode
7.6 years ago
Quak ▴ 380

There are two VCF files that I like to merge them, using GATK or VCFtools. The problem is, they have different chromosomal notation, one has Chr, the other does not. This question could be similar to this one

Is there any quick awk/sed commands that you suggest ?! Also I appreciate if you make comment, which of these two (GATK/VCFtools) is more reliable for this task.

next-gen sequence • 37k views
ADD COMMENT
0
Entering edit mode

I am very new at this and ran into a similar but slightly more complicated problem today with the Cryptococcus genome. I think I solved it thanks to help and links posted here (and didn't find a solution elsewhere) so thought I should post it here in case someone comes along with a similar problem. The reference genome I use does not use either numerical (1, 2, 3) or chr (chr1, chr2, chr3) notation, it has wacky chromosome names (CP003827, CP003822 etc.). So to replace my chromosome names in a vcf file to make them numerical I used a series of grep commands in awk:

awk '{gsub(/CP003827/, "8"); gsub(/CP003822/, "3"); gsub(/CP003824/, "5");
      gsub(/CP003834/, "Mt"); gsub(/CP003833/, "14"); gsub(/CP003829/, "10"); 
      gsub(/CP003826/, "7"); gsub(/CP003820/, "1"); gsub(/CP003828/, "9");
      gsub(/CP003825/, "6"); gsub(/CP003821/, "2"); gsub(/CP003830/, "11");
      gsub(/CP003832/, "13"); gsub(/CP003831/, "12"); gsub(/CP003823/, "4");
      print;}' original.vcf > original_newchr.vcf

I had no knowledge of awk before stumbling onto this post so there might be a more elegant way to do this, but this seems to work, which is good enough for me!

ADD REPLY
0
Entering edit mode

the awk-based answers below are confusing. Just use bcftools annotate --rename-chrs as highlighted by @jerviedog . This will also work with appropriate subsets of NCBI's assembly_report.txt files

ADD REPLY
49
Entering edit mode
7.6 years ago
Vivek ★ 2.5k
awk '{gsub(/^chr/,""); print}' your.vcf > no_chr.vcf

Should get rid of the chr

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

Will add the chr to the VCF without chr.

You can use either command depending on how the chromosomes are named in your reference.

Regarding preference of tools, if you plan to do downstream processing with GATK, I'd suggest sticking with GATK CombineVariants for consistency.

ADD COMMENT
7
Entering edit mode

Thanks for this Vivek :) Works great!

It does still leave the ID lines in the VCF unmodified though, so I made a little tweak to catch those too:

awk '{ 
        if($0 !~ /^#/) 
            print "chr"$0;
        else if(match($0,/(##contig=<ID=)(.*)/,m))
            print m[1]"chr"m[2];
        else print $0 
      }' no_chr.vcf > with_chr.vcf

I'm also curious if this is ever the right thing to do, since VCFs from genomes with chr (notably the Mouse Genomes Project VCFs) may not be the same as the genome without the chr (notably the UCSC genomes and BAMs mapped to them) even if they both say they are mm10 or something similar. All I'm saying to future reader is to be careful :)

ADD REPLY
1
Entering edit mode

Yeah, that's bit more sophisticated but I don't think most tools including GATK mind what's in the VCF header as long as the notation in the VCF entries conforms with the reference provided.

ADD REPLY
1
Entering edit mode

You're probably right - I was having a different issue with GATK earlier and I thought this might be the reason, but it turned out to be unrelated.

Ho hum. Maybe one day the gods of bioinformatics will decided to drop the "chr"s in all datasets and we can live in peace :P

ADD REPLY
1
Entering edit mode

Or the goddesses will decide to keep "chr"s in all datasets and we can live in peace too (or this could be the cause of another war again?)

ADD REPLY
0
Entering edit mode

Could you explain a little bit about this part of the command?

{if($0 !~ /^#/)

Because I'm reading it as:

if $0-- if the line contains
!~  -- something NOT EQUAL to
/^#/ -- a number at the beginning of the line

For some reason, I thought it should read the same except without the !

As in

awk '{if($0 ~ /^#/) print "chr"$0; else print $0}' no_chr.vcf > with_chr.vcf

So am I misunderstanding such that ! does not mean "not equal to" and # does not refer to "any number"?

ADD REPLY
0
Entering edit mode

The !~ operator is not the same as !=. The first specifies a "not-match" operation on a regular expression pattern (in this case, lines that start with a # character). The second operator is a "not-equals" operation on a pair of scalar values, like a pair of numbers or strings. Refer to the documentation for more details: https://www.gnu.org/software/gawk/manual/html_node/Comparison-Operators.html#Comparison-Operators

ADD REPLY
0
Entering edit mode

Ok, thanks, Alex. I see what it is doing now. It also looks like the formatting in vcf files are not necessarily obvious. As you can see from my subset below, it looks like many lines do not start with a # character, yet, they are not changed to chr1, which is good. Therefore, though they appear to be the start of a line, they are really a continuation of the preceding line. (for example, lines 3 and 4 below.)

1       3006834 .       GA      G       77      PASS    INDEL;DP=419;DP4=5,9,200,205;CSQ=-||||intergenic_variant||||||||        GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI  1/1:49:38:0:81,49,0:62,43,0:2:41:31:0,7,18,13:21:-0.69311:.:1   1/1:101:30:0:133,101,0:104,90,0:2:60:30:0,0,16,14:0:-0.693097:.:1       1/1:24:11:0:66,24,0:44,16,0:2:45:9:0,2,9,0:17:-0.662043:.:1     1/1:57:19:0:110,57,0:78,44,0:2:60:18:1,0,10,8:0:-0.691153:.:1   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:69:20:0:125,69,0:99,60,0:2:60:20:0,0,9,11:0:-0.692067:.:1   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:123:40:0:124,133,0:92,120,0:2:58:40:0,0,12,28:0:-0.693145:.:1       ./.:.:.:.:.:.:.:.:.:.:.:.:.:.   1/1:44:15:0:85,44,0:56,33,0:2:59
ADD REPLY
17
Entering edit mode
3.3 years ago
jerviedog ▴ 170

Another way of doing this by using bcftools v1.9:

echo "1 chr1" >> chr_name_conv.txt
echo "2 chr2" >> chr_name_conv.txt
bcftools annotate --rename-chrs chr_name_conv.txt original.vcf.gz | bgzip > rename.vcf.gz
ADD COMMENT
11
Entering edit mode
7.6 years ago
vlaufer ▴ 280

Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime... check out this page for find and replace using awk:

http://awk.info/?awk1line

ADD COMMENT
4
Entering edit mode
2.4 years ago
Shicheng Guo ★ 8.8k
for i in {1..22} X Y MT
do
echo "chr$i $i" > chr_name_conv.txt
bcftools annotate --rename-chrs chr_name_conv.txt Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.vcf.gz -Oz -o Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.Minimac4.vcf.gz
tabix -p vcf Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.Minimac4.vcf.gz
done
ADD COMMENT
0
Entering edit mode

This has already been addressed by the comment to the most popular answer (C: VCF files: Change Chromosome Notation) - I don't see why this is fit to be a new answer by itself.

ADD REPLY
0
Entering edit mode

It fits because the original answer didn't have the for loop. Which actually saves time.

ADD REPLY
0
Entering edit mode

The loop is unnecessary for a single VCF file. Plus, this is an inefficient loop that renames chromosome by chromosome, so it wastes resources. Unnecessary and wasteful loops are shell abuse.

ADD REPLY
0
Entering edit mode

It is more time efficient than echoing the 25 chromosome names:)

ADD REPLY
0
Entering edit mode

No it's not. Querying a large VCF file 25 times is a LOT more inefficient than typing 25 brief lines or using xargs/parallel instead of the for loop.

That loop is a disaster.

ADD REPLY
1
Entering edit mode
2.2 years ago

I had this problem before and I was able to fix it while running some filters to the file. If you are performing any sort of filtering on those VCF files you could use:

sed 's:chr::g'

Example:

vcftools --gzvcf sample.vcf.gz --keep sample_list.txt --mac 1 --recode-to-stream | sed 's:chr::g' | bgzip -c > new_sample.vcf.gz
ADD COMMENT
0
Entering edit mode
3.6 years ago
Alex • 0

Next to the above answers, this resource might be helpful: mappings between common notations (UCSC, Ensembl, Gencode) https://github.com/dpryan79/ChromosomeMappings

You can use this together with

awk 'NR==FNR{map[$1]=$2;next} { for (i=1;i<=NF;i++) $i=($i in map ? map[$i] : $i) } 1' mapping_file.txt input.vcf > output.vcf

to batch rename between common chromosomal notations.

ADD COMMENT

Login before adding your answer.

Traffic: 1635 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6