Tutorial: How to clean a VCF genetic file
Shicheng Guo wrote (4 weeks ago):

How to clean a VCF genetic file:

  1. Check whether REF and ALT are correct; if not, fix them.

    bcftools norm -t "^24,25,26" -m-any --check-ref s -f hg19.fa Exome_QC.vcf.gz -Ov

  2. Remove chr0 records

    vcftools --vcf All_samples_Exome_QC.vcf --not-chr 0 --recode --out Exome_QC.clean.vcf

  3. Remove variants at duplicated positions (duplicate markers)

    bcftools norm -d both --threads=32 All_samples_Exome.vcf -Ov -o Exome.norm.vcf

  4. Remove all variants whose ALT or REF is "-"

    bcftools view -e 'ALT ="-" | REF ="-"' All_samples_Exome.vcf.gz -Ov -o Exome_clean.vcf

  5. How to remove duplicate markers according to chr, start, end, ref and alt: check this script

    sh remove_VCF_duplicates.sh All_samples_Exome.vcf.gz > All_samples.undup.vcf

  6. How to change "chr1" to "1": check this script

  7. Check that REF/ALT match the reference genome or the phasing reference panel (Beagle)

  8. Install vt and use it to normalize the VCF, as recommended by RamRS

  9. Apply MuSiCa to check mutation profile

  10. Apply R package maftools to convert VCF to MAF

  11. Remove low-quality genotypes (--minGQ sets genotypes below the threshold to missing; it does not drop sites): vcftools --vcf a.vcf --minGQ 90 --out b --recode

  12. Install the most frequently used genetic analysis tools

  13. List, include and remove samples from a VCF: bcftools query -l input.vcf

  14. SciClone for inferring the subclonal architecture of tumors [validated on Ubuntu 18.04]
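
Step 13 above names three operations but shows only the listing command. A minimal sketch of all three follows, run on a made-up two-sample VCF. The file names and the sample IDs S1 and S2 are placeholders, and the bcftools calls are skipped if bcftools is not installed.

```shell
# Toy two-sample VCF so the commands have something to run on.
# All file names and sample IDs (S1, S2) are made up for illustration.
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=1>\n'
    printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
    printf '1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\n'
} > toy.vcf

if command -v bcftools >/dev/null 2>&1; then
    # List sample IDs, one per line
    bcftools query -l toy.vcf > samples.txt
    # Include: keep only sample S1
    bcftools view -s S1 toy.vcf -Ov -o keep_S1.vcf
    # Remove: a leading ^ negates the sample list
    bcftools view -s '^S1' toy.vcf -Ov -o drop_S1.vcf
fi
```

For long sample lists, bcftools view also accepts -S with a file of sample IDs, one per line.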

modified 13 days ago • written 4 weeks ago by Shicheng Guo

Out of interest, where would chr0 records come from?

written 4 weeks ago by Michael Dondrup

In many genome projects, chr0 is used to 'group' contigs that could not (yet) be assigned to a specific chromosome. It is a pseudo-chromosome that collects all the left-over contigs and scaffolds, so it has no biological meaning.

modified 4 weeks ago • written 4 weeks ago by lieven.sterck

It should be noted that this is for standard biallelic sites, as used in most genetic analyses of diploid organisms. In many other cases, especially in the context of gene editing, mosaicism often results in multi-allelic variants, which "bcftools norm" can also handle.

written 4 weeks ago by Vitis
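
As a rough illustration of the point about multi-allelic sites, the sketch below splits a toy record with two ALT alleles into biallelic records and then joins them again with bcftools norm. The toy VCF and all file names are invented for the example, and the bcftools calls are skipped if bcftools is not installed.

```shell
# Toy sites-only VCF with one multi-allelic record (ALT = G,T).
# File names are illustrative only.
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=1>\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\t.\tA\tG,T\t.\tPASS\t.\n'
} > multi.vcf

if command -v bcftools >/dev/null 2>&1; then
    # -m -any: split multi-allelic records into one record per ALT allele
    bcftools norm -m -any multi.vcf -Ov -o split.vcf
    # -m +any: join records at the same position back into a multi-allelic record
    bcftools norm -m +any split.vcf -Ov -o joined.vcf
fi
```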

The remaining tasks include:

Statistics:
  Alternative allele frequency > 0.5 sites: 387,454
  Reference Overlap: 93.55%
  Match: 2,010,797
  Allele switch: 0
  Strand flip: 0
  Strand flip and allele switch: 0
  A/T, C/G genotypes: 0
Filtered sites:
  Filter flag set: 0
  Invalid alleles: 377,897
  Duplicated sites: 358
  NonSNP sites: 0
  Monomorphic sites: 11,421
  Allele mismatch: 6,377
  SNPs call rate < 90%: 108,308
modified 26 days ago • written 26 days ago by Shicheng Guo

This "vcf cleaning procedure" seems to be specific to your use case. Do you know of anyone else that does this exact procedure that you do?

written 25 days ago by RamRS
RamRS wrote (27 days ago):
With respect to #3, do not use awk to manipulate VCF files. It can behave unexpectedly on one of the header lines, whereas bcftools has been extensively tested and is far more VCF-format-specific than awk.

Also, your #5 is the bcftools version of #3, so #3 need not exist.

The bash script is not needed for duplicate removal when bcftools norm exists. It uses a loop hard-coded with contig names, so it has multiple pitfalls.

Do not use sh to run a shell script unless you're 100% sure the script is POSIX compliant.

Your script to change chrX to X runs through the file twice and produces an unnecessary intermediate file; bcftools annotate --rename-chrs can do the same job much more reliably and reproducibly.

Your script to change X to chrX won't work for any contig that does not map exactly as str -> "chr"+str. That means it will fail on any VCF with additional contigs.

written 27 days ago by RamRS
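
RamRS's last two points can be sketched roughly as follows: build an explicit old-name/new-name map and hand it to bcftools annotate --rename-chrs. The contig list (chr1 to chr22, chrX, chrY) and the output file name are assumptions, and the map would need extra lines for any additional contigs in the header. The rename itself only runs if bcftools is installed and the input file exists.

```shell
# Two-column map: old contig name, new contig name.
# The contig list is an assumption; extend it to match your VCF header.
for c in $(seq 1 22) X Y; do
    printf 'chr%s\t%s\n' "$c" "$c"
done > chr_map.txt

in=All_samples_Exome.vcf.gz            # input name taken from the post
out=All_samples_Exome.nochr.vcf.gz     # hypothetical output name

if command -v bcftools >/dev/null 2>&1 && [ -f "$in" ]; then
    # Single pass over the file, no intermediate file
    bcftools annotate --rename-chrs chr_map.txt "$in" -Oz -o "$out"
fi
```

Swapping the two columns of the map gives the reverse direction (1 to chr1) with the same command.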

Awesome! Thanks RamRS!

written 27 days ago by Shicheng Guo

Hi RamRS, do you have more comprehensive vcf file cleaning procedures? Thanks.

written 27 days ago by Shicheng Guo

Step-1: bcftools norm <split-multiallelic> <normalize-and-left_align-with-ref>

or

Better Step-1: vt decompose | vt norm

Step-2: Annotate with VEP

Everything else is based on what you need downstream

written 27 days ago by RamRS
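
Spelled out with concrete commands, Step-1 might look like the sketch below. The input and reference names are reused from earlier in the thread, the output names are made up, and the vt subcommand is written out in full as normalize. Nothing runs unless both input files exist.

```shell
in=Exome_QC.vcf.gz   # file names reused from earlier in the thread
ref=hg19.fa

if [ -f "$in" ] && [ -f "$ref" ]; then
    # Option A: bcftools splits multi-allelic sites (-m -any) and
    # left-aligns/normalizes against the reference (-f)
    bcftools norm -m -any -f "$ref" "$in" -Oz -o step1.bcftools.vcf.gz

    # Option B: vt does the same in two stages (output names are hypothetical)
    vt decompose -s "$in" -o decomposed.vcf
    vt normalize decomposed.vcf -r "$ref" -o step1.vt.vcf
fi
```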

Is this the right way to do Step 1?

    bcftools norm -m-any --check-ref s -f hg19.fa Exome_QC.vcf.gz -Ov

For Step 2, VEP is okay, but I prefer the latest version of ANNOVAR, which I have found gives better annotation than VEP.

    table_annovar.pl -vcfinput input.vcf ~/humandb/ --thread 32 -buildver hg19 -out output -remove -protocol refGene,dbnsfp33a -operation gx,f -nastring . -otherinfo -polish -xref ~/humandb/gene_fullxref.txt

written 27 days ago by Shicheng Guo

Yes, step-1 is accurate. vt norm is better because it retains an INFO attribute on modified variants.

I used to prefer ANNOVAR too, but VEP is more accurate, IMO. Plus, annovar's flags confuse me.

written 27 days ago by RamRS

What's vt norm ?

written 27 days ago by Shicheng Guo

Did you try googling it? https://genome.sph.umich.edu/wiki/Vt#Normalization

written 27 days ago by RamRS

I see. They should give it a better name; names of the form xxtools are more popular.

written 27 days ago by Shicheng Guo