I have two VCF files that I would like to merge using GATK or VCFtools. The problem is that they use different chromosome notation: one has the "Chr" prefix and the other does not. This question may be similar to this one.
Are there any quick awk/sed commands you would suggest? I would also appreciate comments on which of the two (GATK or VCFtools) is more reliable for this task.
updated 23 months ago by Ram 44k • written 10.5 years ago by Quak ▴ 520
The awk-based answers below are confusing. Just use bcftools annotate --rename-chrs, as highlighted by @jerviedog. This will also work with appropriate subsets of NCBI's assembly_report.txt files.
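For anyone landing here, a minimal sketch of that approach, assuming one file uses 1..22, X, Y, MT and the other uses chr1..chr22, chrX, chrY, chrM. The file names (chr_map.txt, input.vcf.gz, renamed.vcf.gz) are placeholders of mine, not from the thread, and the bcftools step only runs if bcftools is installed and the input file exists.

```shell
# Build a two-column "old_name new_name" map for bcftools annotate --rename-chrs.
# File names here are placeholders, not taken from the thread.
map=chr_map.txt
: > "$map"
for c in $(seq 1 22) X Y; do
    printf '%s chr%s\n' "$c" "$c" >> "$map"
done
# Assumption: the target reference calls the mitochondrion chrM.
printf 'MT chrM\n' >> "$map"

# Rename every contig in one pass; skipped when bcftools or the input is missing.
in_vcf=input.vcf.gz
out_vcf=renamed.vcf.gz
if command -v bcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    bcftools annotate --rename-chrs "$map" -Oz -o "$out_vcf" "$in_vcf"
    bcftools index -t "$out_vcf"
fi
```

To go the other way (strip the prefix), swap the two columns of the map, for example with awk '{print $2, $1}' chr_map.txt.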
I am very new at this and ran into a similar but slightly more complicated problem today with the Cryptococcus genome. I think I solved it thanks to the help and links posted here (I didn't find a solution elsewhere), so I thought I should post it in case someone comes along with a similar problem. The reference genome I use has neither numerical (1, 2, 3) nor chr (chr1, chr2, chr3) notation; it has wacky chromosome names (CP003827, CP003822, etc.). So to replace the chromosome names in a VCF file and make them numerical, I used a series of grep commands in awk:
I had no knowledge of awk before stumbling onto this post so there might be a more elegant way to do this, but this seems to work, which is good enough for me!
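The original commands did not survive in this copy of the thread, so here is a rough table-driven sketch of the same idea rather than the poster's actual code. The accession-to-number pairs are placeholders (the thread does not say which accession is which chromosome), and the tiny VCF is made up for the demonstration.

```shell
# Table-driven rename of column 1 (CHROM) with awk.
# The accession -> number pairs are placeholders; fill in the real assignment.
workdir=$(mktemp -d)
printf 'CP003827\t1\nCP003822\t2\n' > "$workdir/acc_map.tsv"

# Tiny made-up VCF, for demonstration only.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nCP003827\t100\t.\tA\tG\nCP003822\t200\t.\tC\tT\n' > "$workdir/in.vcf"

# The first file loads the map; the second file is the VCF.
# Header lines pass through untouched, and names missing from the map are kept as they are.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR   { map[$1] = $2; next }
     /^#/        { print; next }
     ($1 in map) { $1 = map[$1] }
     { print }' "$workdir/acc_map.tsv" "$workdir/in.vcf" > "$workdir/out.vcf"

cat "$workdir/out.vcf"
```

Note that this sketch does not touch any ##contig lines in the header, so the header and the records can end up disagreeing.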
I'm also curious whether this is ever the right thing to do. VCFs and genomes that differ in chr notation (for example, the Mouse Genomes Project VCFs versus the UCSC genomes and BAMs mapped to them) may not be based on identical sequences, even if both say they are mm10 or something similar. All I'm saying to future readers is to be careful :)
updated 5.3 years ago by Ram 44k • written 8.7 years ago by John 13k
Yeah, that's a bit more sophisticated, but I don't think most tools, including GATK, mind what's in the VCF header as long as the notation in the VCF entries conforms with the reference provided.
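One quick way to check that the records conform with the reference is to compare the contig names in column 1 against the reference index. A small sketch, assuming a samtools-style .fai index whose first column is the contig name; both input files below are made up for illustration.

```shell
workdir=$(mktemp -d)
# Made-up reference index (.fai: name, length, ...) and VCF, for illustration only.
printf 'chr1\t1000\nchr2\t900\n' > "$workdir/ref.fa.fai"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t10\t.\tA\tG\n2\t20\t.\tC\tT\n' > "$workdir/calls.vcf"

# Print each contig name used in VCF records that the reference does not contain.
missing=$(awk -F '\t' 'NR == FNR { ref[$1]; next }
                       /^#/ { next }
                       !($1 in ref) && !seen[$1]++ { print $1 }' \
          "$workdir/ref.fa.fai" "$workdir/calls.vcf")
echo "contigs missing from reference: ${missing:-none}"
```

Here the record on contig 2 is reported, because the reference only knows chr2.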
Why add a new answer? I'll be moving this to a comment because the BaseRecalibrator point is one new piece of information that saves your post from being deleted as a duplicate.
Hi Ram, there is a small difference. It adds "chr" only to chromosomes 1 to 9, X, Y and MT (which should be changed to chrM later...). That avoids adding chr to non-canonical contigs (if "by luck" they have the same names in both files...)
Good call. I'll move your post as a reply to John's comment. Please edit it and add the detail you mentioned above (adding "chr" to only 1-22|X|Y|MT and not to other contigs)
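A rough sketch of that restriction in awk, in case it helps: only 1-22, X and Y get the prefix, MT becomes chrM directly, and every other contig is left alone. This is not the code from the moved post; the input is a made-up VCF, and GL000192.1 just stands in for a non-canonical contig.

```shell
workdir=$(mktemp -d)
# Made-up VCF with canonical and non-canonical contigs.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n1\t10\t.\tA\tG\n22\t20\t.\tC\tT\nX\t30\t.\tG\tA\nMT\t40\t.\tT\tC\nGL000192.1\t50\t.\tA\tC\n' > "$workdir/in.vcf"

# Header lines pass through; only column 1 of canonical contigs is rewritten.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ { print; next }
     $1 ~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/ { $1 = "chr" $1 }
     $1 == "MT" { $1 = "chrM" }
     { print }' "$workdir/in.vcf" > "$workdir/out.vcf"

# Show the resulting contig names.
grep -v '^#' "$workdir/out.vcf" | cut -f1
```

Anchoring the pattern with ^ and $ on the first field (not the whole line) is what keeps contigs such as GL000192.1 from being renamed.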
So am I misunderstanding such that ! does not mean "not equal to" and # does not refer to "any number"?
updated 5.3 years ago by Ram 44k • written 8.0 years ago by ms238 • 0
The !~ operator is not the same as !=. The first specifies a "not-match" operation on a regular expression pattern (in this case, lines that start with a # character). The second operator is a "not-equals" operation on a pair of scalar values, like a pair of numbers or strings. Refer to the documentation for more details: https://www.gnu.org/software/gawk/manual/html_node/Comparison-Operators.html#Comparison-Operators
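A small demonstration of the difference, using a made-up three-line input (a meta line, the column header and one record). The variable names are just for this example.

```shell
# Three lines: a meta line, the column header, and one record.
sample=$(printf '##fileformat=VCFv4.2\n#CHROM\tPOS\n1\t100\n')

# !~ tests the line against a regular expression:
# keep lines that do NOT start with "#".
not_match=$(printf '%s\n' "$sample" | awk '$0 !~ /^#/')

# != compares two values:
# keep lines whose whole text is not exactly the string "#".
not_equal=$(printf '%s\n' "$sample" | awk '$0 != "#"')

printf 'with !~ : %s line(s)\n' "$(printf '%s\n' "$not_match" | wc -l | tr -d ' ')"
printf 'with != : %s line(s)\n' "$(printf '%s\n' "$not_equal" | wc -l | tr -d ' ')"
```

With !~ only the record survives; with != all three lines survive, because none of them is exactly "#".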
Ok, thanks, Alex. I see what it is doing now. It also looks like the formatting in VCF files is not necessarily obvious. As you can see from my subset below, many lines do not appear to start with a # character, yet they are not changed to chr1, which is good. So although they appear to be the start of a line, they are really a continuation of the preceding line (for example, lines 3 and 4 below).
Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime... check out this page for find and replace using awk:
# Per-chromosome files: strip the "chr" prefix from each one.
for i in {1..22} X Y MT
do
    # Map file format: old_name new_name
    echo "chr$i $i" > chr_name_conv.txt
    bcftools annotate --rename-chrs chr_name_conv.txt "Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.vcf.gz" -Oz -o "Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.Minimac4.vcf.gz"
    tabix -p vcf "Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_variants.VA.chr$i.Minimac4.vcf.gz"
done
This has already been addressed by the comment to the most popular answer (C: VCF files: Change Chromosome Notation) - I don't see why this is fit to be a new answer by itself.
The loop is unnecessary for a single VCF file. Plus, this is an inefficient loop that renames chromosome by chromosome, so it wastes resources. Unnecessary and wasteful loops are shell abuse.
I had this problem before and was able to fix it while running some filters on the file. If you are performing any sort of filtering on those VCF files you could use:
To summarize the answers people wrote before: if you have a bunch of VCF files, you neither want nor need to know which ones have 'chr', nor to use additional software. Just run this in a shell script to add 'chr':
cat ${input_vcf} | grep '##' | grep '#' >>${ouput_vcf} - Unlike what you expect, this won't add just the #CHROM line but every header line except the #CHROM line. Anything that matches ## will automatically match #, and #CHROM won't match ##.
This would not happen if you used bcftools. Everyone (including you and me) that writes scripts makes mistakes. These tools just have been well tested and are widely used.
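To make the grep point concrete, here is a small sketch on a made-up header. The first pipeline mirrors the pattern quoted above; the other two are what I would assume was intended, depending on whether you want every header line or only the #CHROM line.

```shell
# Made-up VCF snippet: two meta lines, the column header, one record.
header=$(printf '##fileformat=VCFv4.2\n##contig=<ID=1>\n#CHROM\tPOS\tID\n1\t100\t.\n')

# As written in the post: the second grep is redundant, and #CHROM never matches "##".
as_written=$(printf '%s\n' "$header" | grep '##' | grep '#')

# All header lines (meta lines plus #CHROM): anchor the pattern at the line start.
all_header=$(printf '%s\n' "$header" | grep '^#')

# Only the #CHROM column line.
chrom_line=$(printf '%s\n' "$header" | grep '^#CHROM')

printf 'as written: %s line(s)\n' "$(printf '%s\n' "$as_written" | wc -l | tr -d ' ')"
printf 'all header: %s line(s)\n' "$(printf '%s\n' "$all_header" | wc -l | tr -d ' ')"
```

An unanchored '##' would also match any record that happens to contain ## somewhere in the line, which is another reason to anchor with ^.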