How to clean a VCF genetic data file:
Check whether REF and ALT are correct; if they are not, fix them.
bcftools norm -t "^24,25,26" -m-any --check-ref s -f hg19.fa Exome_QC.vcf.gz -Ov
remove chr0 records
vcftools --vcf All_samples_Exome_QC.vcf --not-chr 0 --recode --out Exome_QC.clean   # --out is a prefix; this writes Exome_QC.clean.recode.vcf
remove duplicated location variants (Duplicate marker)
bcftools norm -d both --threads=32 All_samples_Exome.vcf -Ov -o Exome.norm.vcf
remove all the variants whose ALT="-" or REF="-"
bcftools view -e 'ALT ="-" | REF ="-"' All_samples_Exome.vcf.gz -Ov -o Exome_clean.vcf
How to remove duplicate markers according to chr, start, end, ref and alt: check this script
sh remove_VCF_duplicates.sh All_samples_Exome.vcf.gz > All_samples.undup.vcf
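The contents of remove_VCF_duplicates.sh are not shown in this post. As a rough sketch of what such a de-duplication step might look like (not the actual script), the function below keeps the first record for each CHROM/POS/REF/ALT combination and passes header lines through unchanged. The name `dedup_vcf` is hypothetical, and the sketch assumes that POS together with REF stands in for start and end, since a VCF record has no separate end column.

```shell
# Hypothetical helper, not the script referenced above.
# Keeps the first record per CHROM, POS, REF, ALT (columns 1, 2, 4, 5).
dedup_vcf() {
    awk -F '\t' '
        /^#/ { print; next }          # header lines pass through untouched
        !seen[$1, $2, $4, $5]++       # print a record only the first time its key appears
    ' "$@"
}
```

For a compressed file this could be used as `gzip -dc All_samples_Exome.vcf.gz | dedup_vcf > All_samples.undup.vcf`. Records that differ only in ID, QUAL or genotype columns are treated as duplicates here, which may or may not be what you want.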
How to change "chr1" to "1". check this script
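The script itself is not reproduced here. A minimal sketch of the same idea for an uncompressed VCF, assuming the only thing to change is a leading "chr" in the CHROM column and in the `##contig` header IDs, could look like this. The function name `strip_chr_prefix` is made up for illustration.

```shell
# Hypothetical helper: turn "chr1" into "1" in a plain-text VCF.
strip_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^##contig=<ID=chr/ { sub(/ID=chr/, "ID="); print; next }  # keep header contig names in sync
         /^#/                { print; next }                        # other header lines unchanged
                             { sub(/^chr/, "", $1); print }         # strip the prefix from CHROM only
    ' "$@"
}
```

Usage would be along the lines of `strip_chr_prefix input.vcf > output.vcf`. For bgzipped or indexed files, a tool that understands the format is probably the safer choice.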
Check that REF/ALT match the reference genome or the phasing reference panel (Beagle)
Install vt and try using it to normalize the VCF, as recommended by RS
Apply MuSiCa to check mutation profile
Apply R package maftools to convert VCF to MAF
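Remove low-quality genotypes (note that --minGQ excludes individual genotypes whose GQ is below the threshold, treating them as missing, rather than dropping whole sites):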
vcftools --vcf a.vcf --minGQ 90 --out b --recode
Install the most frequently used genetic analysis tools
list, include and remove samples from VCF
bcftools query -l input.vcf
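The command above only covers listing. For including and removing samples, one possible sketch uses `bcftools view -s`, where a leading `^` negates the sample list. The wrapper names `keep_samples` and `drop_samples` and the file names in the usage line are placeholders, not part of the original workflow.

```shell
# Hypothetical wrappers around bcftools view; arguments: input, output, comma-separated samples.

# Keep only the listed samples and write a bgzipped VCF.
keep_samples() {
    bcftools view -s "$3" -Oz -o "$2" "$1"
}

# Drop the listed samples; the leading ^ tells bcftools to exclude them.
drop_samples() {
    bcftools view -s "^$3" -Oz -o "$2" "$1"
}
```

For example, `drop_samples input.vcf.gz output.vcf.gz S1,S2` would write every sample except S1 and S2. With long sample lists, `-S` with a file of names (one per line) is the equivalent option.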
sciclone for inferring the subclonal architecture of tumors [validated in Ubuntu 18.04]
change chromosome name:
rm -f chr_name_conv.txt
for i in {1..22} X Y M; do echo "chr$i $i" >> chr_name_conv.txt; done
bcftools annotate --rename-chrs chr_name_conv.txt Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_PASS_variants.VA.vcf.gz -Oz -o Schrodi_IL23_IL17_combined_RECAL_SNP_INDEL_PASS_NUM.variants.VA.vcf.gz
Out of interest, where would chr0 records come from?
In many genome projects, chr0 is used to 'group' contigs that could not be assigned (yet) to a specific chromosome. So it's a pseudo-chromosome to collect all the left-over contigs and scaffolds. (which thus has no biological meaning of course)
It should be noted that this is for the standard biallelic sites used in most genetic analyses of diploid organisms. In many other cases, especially in the context of gene editing, mosaicism often results in multi-allelic variants, which can also be handled by "bcftools norm".
the remaining task includes:
This "vcf cleaning procedure" seems to be specific to your use case. Do you know of anyone else that does this exact procedure that you do?
Thanks for sharing, excellent. The VCF file represents each individual as a column and each position as a row. This format is fine, but I prefer to have my data in the long-and-skinny format, rather than the short-and-fat format. Group-by operations are more flexible with long-and-skinny data, and everyone loves group-bys.
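To make the wide-to-long idea concrete, here is a small sketch that emits one row per variant and sample from an uncompressed VCF. The function name `vcf_to_long` and the choice of output columns (CHROM, POS, REF, ALT, sample, genotype) are assumptions for illustration. It relies on GT being the first FORMAT subfield, which the VCF specification requires whenever GT is present.

```shell
# Hypothetical helper: reshape a VCF into long format, one line per (variant, sample).
vcf_to_long() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^##/     { next }                                            # skip meta-information lines
         /^#CHROM/ { for (i = 10; i <= NF; i++) sample[i] = $i; next } # remember sample names
         {
             for (i = 10; i <= NF; i++) {
                 split($i, field, ":")     # field[1] is GT when the FORMAT column starts with GT
                 print $1, $2, $4, $5, sample[i], field[1]
             }
         }' "$@"
}
```

The output can then be grouped by sample or by position with ordinary text or dataframe tools. If bcftools is available, `bcftools query -f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GT\n]' input.vcf` should give a similar table without hand-written parsing.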