2.2 years ago
Hriday
"I have a single VCF file named 'ALL.wgs.shapeit2_integrated_snvindels_v2a.GRCh38.27022019.sites.vcf.gz'. The problem is that its chromosome names are bare (1, 2, ..., X) and lack the 'chr' prefix.
The header and first records look like this:
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=11032019_15h52m43s
##source=IGSRpipeline
##reference=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=VT,Number=.,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10416 . CCCTAA C . PASS AC=240;AN=5096;DP=365460;AF=0.05;EAS_AF=0.06;EUR_AF=0.07;AFR_AF=0.01;AMR_AF=0.06;SAS_AF=0.05;VT=INDEL;NS=2548
1 16103 . T G . PASS AC=118;AN=5096;DP=29994;AF=0.02;EAS_AF=0;EUR_AF=0.04;AFR_AF=0.03;AMR_AF=0.03;SAS_AF=0.01;VT=SNP;NS=2548
1 17496 . AC A . PASS AC=25;AN=5096;DP=189765;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0.02;AMR_AF=0;SAS_AF=0;VT=INDEL;NS=2548
1 51479 . T A . PASS AC=531;AN=5096;DP=17461;AF=0.1;EAS_AF=0;EUR_AF=0.19;AFR_AF=0.02;AMR_AF=0.11;SAS_AF=0.23;VT=SNP;NS=2548
1 51898 . C A . PASS AC=426;AN=5096;DP=15331;AF=0.08;EAS_AF=0.05;EUR_AF=0.14;AFR_AF=0.06;AMR_AF=0.06;SAS_AF=0.11;VT=SNP;NS=2548
Could you please provide some quick awk/sed commands that could address this issue? Additionally, I would appreciate it if you could offer your insight on which of the two tools, GATK or VCFtools, is more dependable for accomplishing this task. Thank you."
VCF files: Change Chromosome Notation
Which link?
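For the awk part of the question, here is a minimal sketch rather than a vetted pipeline. It assumes tab-separated records (as the VCF spec requires) and that every contig should simply gain a 'chr' prefix, which fits the 1-22 and X contigs shown in the header above; a mitochondrial contig named MT would need separate handling. The file names sample.vcf and sample.chr.vcf are made up for the demo, and the sample holds only a few lines copied from the post with shortened INFO fields.

```shell
# Build a tiny uncompressed test file from lines shown in the question
# (INFO shortened; sample.vcf is a hypothetical name used only for this demo).
printf '##fileformat=VCFv4.3\n'                                  >  sample.vcf
printf '##contig=<ID=1>\n'                                       >> sample.vcf
printf '##contig=<ID=X>\n'                                       >> sample.vcf
printf '##INFO=<ID=AC,Number=A,Type=Integer,Description="x">\n'  >> sample.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'         >> sample.vcf
printf '1\t10416\t.\tCCCTAA\tC\t.\tPASS\tAC=240\n'               >> sample.vcf
printf '1\t16103\t.\tT\tG\t.\tPASS\tAC=118\n'                    >> sample.vcf

# Rule 1: rewrite ##contig header lines that do not already carry "chr".
# Rule 2: prepend "chr" to every record line (anything not starting with "#"),
#         unless it already starts with "chr".
# Rule 3: print every line, changed or not.
awk '
/^##contig=<ID=/ && !/^##contig=<ID=chr/ { sub(/<ID=/, "<ID=chr") }
!/^#/ && !/^chr/                         { $0 = "chr" $0 }
{ print }
' sample.vcf > sample.chr.vcf

cat sample.chr.vcf

# For the real bgzip-compressed file the same awk program could sit between
# a decompress and a recompress step (assumes bgzip from htslib is installed):
#   zcat in.vcf.gz | awk '...same program...' | bgzip > out.vcf.gz
```

On the tool question: a pure rename does not seem to need either GATK or VCFtools. A commonly used alternative is bcftools, where `bcftools annotate --rename-chrs map.txt in.vcf.gz -Oz -o out.vcf.gz` takes a two-column "old new" mapping file (map.txt is a placeholder name here) and updates header and records together. Whichever route you take, re-index the output before using it downstream.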