Getting a set of reliable SNPs using known parent-offspring relationships
9.2 years ago
lilepisorus ▴ 40

I have SNP data in VCF format, along with information on the parent-offspring relationships among the samples. I want to use these relationships to subset the SNPs to a smaller set that is reliable. Could someone tell me whether there are existing tools for this?

SNP Identity-by-descent • 1.8k views

Hi, Pierre

Could you follow up on this question? How can I subtract the variants in violations.vcf from the full set of variants? Are there any existing tools? I checked SelectVariants in GATK, and it seems it can select the common or shared variants, but not the unique ones.

Thanks
Li
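
The subtraction asked about here can be sketched in a few lines of Python. This is a minimal sketch, not a feature of GATK or jvarkit: it assumes both files are uncompressed, tab-delimited VCFs and that a variant is identified by its CHROM, POS, REF and ALT columns. The names `variant_key` and `subtract_variants` are hypothetical helpers introduced for illustration. The GATK SelectVariants documentation also describes a `--discordance` argument that may cover the same need; it is worth checking for the GATK version in use.

```python
from typing import Iterable, Iterator, Set, Tuple

# A variant is identified here by (CHROM, POS, REF, ALT); this is an assumption.
VariantKey = Tuple[str, str, str, str]


def variant_key(line: str) -> VariantKey:
    """Build a lookup key from the fixed columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0], fields[1], fields[3], fields[4])


def subtract_variants(
    all_lines: Iterable[str], violation_lines: Iterable[str]
) -> Iterator[str]:
    """Yield header lines plus every record of all_lines absent from violation_lines."""
    excluded: Set[VariantKey] = {
        variant_key(line)
        for line in violation_lines
        if line.strip() and not line.startswith("#")
    }
    for line in all_lines:
        if line.startswith("#"):
            yield line  # keep the header unchanged
        elif line.strip() and variant_key(line) not in excluded:
            yield line  # keep records that are not listed as violations


if __name__ == "__main__":
    all_vcf = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "Chr01\t100\t.\tA\tG\t50\tPASS\t.\n",
        "Chr01\t200\t.\tC\tT\t50\tPASS\t.\n",
    ]
    violations = ["Chr01\t200\t.\tC\tT\t50\tPASS\t.\n"]
    for kept in subtract_variants(all_vcf, violations):
        print(kept, end="")
```

With real files, the two arguments could be open file handles for the input VCF and violations.vcf, since both are iterated line by line.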

9.2 years ago

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php gives this example for "Generating a VCF of all the variants that are mendelian violations":

java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -ped family.ped \
   -mvq 50 \
   -o violations.vcf

I also wrote a tool, VCFTrio (https://github.com/lindenb/jvarkit/wiki/VCFTrio), to annotate a VCF with Mendelian incompatibilities in trios.
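
For readers unfamiliar with what a Mendelian violation means at the genotype level, here is a minimal Python sketch of the check for one trio at one site. It is not how GATK or VCFTrio implement it: it assumes diploid, autosomal genotypes written as VCF GT strings, and it ignores genotype quality (which the `-mvq` threshold above takes into account). The function names are hypothetical.

```python
from itertools import product
from typing import Optional, Tuple

Genotype = Optional[Tuple[str, str]]


def parse_gt(gt: str) -> Genotype:
    """Parse a diploid GT string such as '0/1' or '1|0'; None if missing or not diploid."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return (alleles[0], alleles[1])


def is_mendelian_violation(father: str, mother: str, child: str) -> Optional[bool]:
    """True if the child cannot have received one allele from each parent.

    Returns None when any of the three genotypes is missing or not diploid.
    """
    f, m, c = parse_gt(father), parse_gt(mother), parse_gt(child)
    if f is None or m is None or c is None:
        return None
    # Every genotype the child could inherit: one paternal plus one maternal allele.
    possible = {tuple(sorted(pair)) for pair in product(f, m)}
    return tuple(sorted(c)) not in possible


if __name__ == "__main__":
    print(is_mendelian_violation("0/0", "0/0", "0/1"))  # child carries an allele neither parent has
    print(is_mendelian_violation("0/1", "0/1", "1/1"))  # consistent with both parents
    print(is_mendelian_violation("./.", "0/1", "0/1"))  # cannot be decided
```

Sites where the function returns None are neither confirmed nor refuted, so whether to keep them in a "reliable" set is a separate decision.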


Hi, Pierre

I tried both ways you suggested.

SelectVariants in GATK returned the full set of SNPs from my input, which would mean that every SNP violates Mendelian inheritance.

I also tried VCFTrio. In the result file, the header declares a "MENDEL" tag, but none of the SNP records has "MENDEL" in its INFO field, so I cannot filter out the sites that violate Mendelian rules based on the INFO field.

I am wondering whether all of the sites really are violations or whether I did something wrong in my commands. My SNP set comes from whole-genome SNP calling in GATK, so I doubt that none of the SNPs obey Mendelian rules given the provided pedigree. Here are my commands for both analyses:

java -Xmx240g -jar GenomeAnalysisTK.jar \
  --pedigreeValidationType SILENT \
  -R Glycine_max.V2.fasta \
  -T SelectVariants \
  --variant:VCF combined.REDUCED.1.realign.123013.uniq.sorted.SNPsOnly.vcf \
  -ped soybean.ped \
  -mvq 50 \
  -o violations.vcf

java -jar ./jvarkit/dist-1.128/vcftrio.jar \
  -p soybean.ped.vcftrio.txt \
  ../vcfFiles/combined.REDUCED.1.realign.123013.uniq.sorted.SNPsOnly.vcf > vcftrio.result
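
Rather than inspecting records by eye, the number of records that carry a given INFO key can be counted directly. The following is a minimal Python sketch under the assumption that the output is an uncompressed, tab-delimited VCF; `count_info_key` is a hypothetical helper, and "MENDEL" is the tag name reported in the header above.

```python
from typing import Iterable, Tuple


def count_info_key(vcf_lines: Iterable[str], key: str) -> Tuple[int, int]:
    """Return (number of records whose INFO column contains `key`, total records)."""
    tagged = 0
    total = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        total += 1
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th column
        # INFO entries are ';'-separated, each either KEY or KEY=value.
        keys = {entry.split("=", 1)[0] for entry in info.split(";")}
        if key in keys:
            tagged += 1
    return tagged, total


if __name__ == "__main__":
    example = [
        '##INFO=<ID=MENDEL,Number=.,Type=String,Description="example header line">\n',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "Chr01\t100\t.\tA\tG\t50\tPASS\tDP=10;MENDEL=child1\n",
        "Chr01\t200\t.\tC\tT\t50\tPASS\tDP=12\n",
    ]
    tagged, total = count_info_key(example, "MENDEL")
    print(f"{tagged} of {total} records carry MENDEL")
```

A count of zero on vcftrio.result would confirm that no record was annotated, while a count equal to the total on violations.vcf would confirm that every record was selected.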

It's possible that the pedigree tools assume the reference is a human genome in order to work out the inheritance patterns. (E.g. in the RTG solution, the reference.sdf should contain a configuration file that specifies the autosomes and the sex chromosome inheritance patterns for the reference genome of interest, so that these constraints can be applied appropriately to the individual samples during variant calling and mendelianness checking.)

I'm not familiar with the soybean genome -- is it diploid (which may be an assumption of the mendelianness tools) or polyploid?
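
Whatever the biology of the organism, the ploidy that the variant caller actually wrote into the VCF can be checked from the GT fields. The following Python sketch tallies the number of alleles per genotype call; it assumes an uncompressed, tab-delimited VCF with a FORMAT column, and `ploidy_counts` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def ploidy_counts(vcf_lines: Iterable[str]) -> Counter:
    """Count genotype calls by number of alleles in GT (2 = diploid call)."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no genotype columns on this record
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            if gt_index >= len(parts):
                continue  # truncated sample field
            # Phased ('|') and unphased ('/') separators are treated alike.
            alleles = parts[gt_index].replace("|", "/").split("/")
            counts[len(alleles)] += 1
    return counts


if __name__ == "__main__":
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP1\tP2\tC1\n",
        "Chr01\t100\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t0/1:35\t0/0:40\t0|1:30\n",
    ]
    print(ploidy_counts(example))
```

If every call has two alleles, the file is at least formally compatible with tools that assume diploid genotypes.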

9.2 years ago
Len Trigg ★ 1.6k

Here's another way, which outputs just the consistent variants, using Real Time Genomics:

rtg mendelian -t reference.sdf --pedigree family.ped -i input.vcf \
    --output-consistent goodvariants.vcf.gz

There is also --output-inconsistent if you want a file with just the violations, or regular --output if you want the mendelian consistency status just added as annotations. It may help to use --lenient if the variants were not jointly called.
