Question: Imposing a human reference genome onto a vcf file?
0
gravatar for Joel Wallenius
22 months ago by
Sweden
Joel Wallenius70 wrote:

Hello!

I have a vcf file that details the alleles of a few hundred individuals' exomes.

I would like to merge this with a human reference genome (exome) in order to see how my individuals' alleles differ from the reference, rather than differ amongst themselves.

Is this possible? I am not sure where to start... I have vcftools ready for analysis.

sequencing snp allele genome vcf • 686 views
ADD COMMENTlink modified 22 months ago by Pierre Lindenbaum122k • written 22 months ago by Joel Wallenius70

in order to see how my individuals' alleles differ from the reference

the REF column should answer the question. Or give us an example of input/output.

ADD REPLYlink written 22 months ago by Pierre Lindenbaum122k

I don't think there is a reference... having a reference is optional I think?

(There is mention of a reference 'Homo_sapiens_assembly19' in the header lines though)

Example:

CHROM   POS N_ALLELES   N_CHR   {ALLELE:COUNT}
1   865625  2   696 G:695   A:1
1   865665  2   698 G:691   A:7
1   865694  2   698 C:693   T:5
1   865734  2   698 G:697   A:1
1   865738  2   698 A:694   G:4
1   865746  2   698 G:697   A:1
1   866422  2   698 C:696   T:2
1   871219  2   698 C:696   T:2
1   874415  2   698 C:688   T:10
ADD REPLYlink modified 22 months ago • written 22 months ago by Joel Wallenius70

I have a vcf file that details the alleles of a few hundred individuals' exomes.

so you're wrong; this is NOT a vcf file.

ADD REPLYlink written 22 months ago by Pierre Lindenbaum122k

The example I pasted is from the output of the '--counts' option in vcftools, acting on the vcf file.

The actual vcf file is huge... what part would you like to see?

ADD REPLYlink written 22 months ago by Joel Wallenius70
1

So you have a vcf, which does contain a REF field?

to see how my individuals' alleles differ from the reference, rather than differ amongst themselves.

In addition, variant calling is usually performed by comparing to the reference genome. Sounds like you already got what you want.

ADD REPLYlink written 22 months ago by WouterDeCoster40k

Yes it appears I already had what I wanted... the source of my vcf data claimed otherwise, and I have little experience with these things, so was easily confused!

ADD REPLYlink written 22 months ago by Joel Wallenius70
1
gravatar for Pierre Lindenbaum
22 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

assuming your file is tab delimited and you have a reference genome indexed with samtools faidx with the correct chromosome names.

grep -v CHROM input.txt | while IFS='' read  L; do echo -ne "${L}\t" && echo "${L}" |awk '{printf("%s:%s-%s\n",$1,$2,$2);}' | xargs samtools faidx /path/to/human_hg19.fasta | grep -v '^>' | tr -d '\n'; echo ; done

UPDATE: from a real VCF and using bioalcidaejdk:

java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->{println(V.getContig()+"\t"+V.getStart()+"\t"+V.getReference().getDisplayString()+"\t"+V.getAlleles().size()+"\t"+V.getGenotypes().stream().flatMap(G->G.getAlleles().stream()).filter(A->A.isCalled()).count()+"\t"+V.getGenotypes().stream().flatMap(G->G.getAlleles().stream()).filter(A->A.isCalled()).map(A->A.getDisplayString()).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())));});' input.vcf




rotavirus   51  A   2   8   {A=6, G=2}
rotavirus   91  A   2   8   {A=7, T=1}
rotavirus   130 T   2   8   {C=1, T=7}
rotavirus   232 T   2   8   {A=1, T=7}
rotavirus   267 C   2   8   {C=7, G=1}
rotavirus   424 A   2   8   {A=7, G=1}
rotavirus   520 T   2   8   {A=1, T=7}
rotavirus   536 A   2   8   {A=6, T=2}
rotavirus   562 A   2   8   {A=7, G=1}
rotavirus   583 G   2   8   {C=1, G=7}
rotavirus   661 T   2   8   {A=1, T=7}
rotavirus   693 T   2   8   {T=6, G=2}
rotavirus   738 T   2   8   {A=1, T=7}
rotavirus   799 A   2   8   {A=6, C=2}
rotavirus   812 G   2   8   {T=2, G=6}
rotavirus   833 G   2   8   {A=2, G=6}
rotavirus   916 A   2   8   {A=6, T=2}
rotavirus   946 C   2   8   {A=1, C=7}
rotavirus   961 T   2   8   {A=1, T=7}
rotavirus   1044    A   2   8   {A=6, T=2}
rotavirus   1045    C   2   8   {C=6, G=2}
rotavirus   1054    C   2   8   {C=6, G=2}
rotavirus   1064    G   2   8   {A=2, G=6}
ADD COMMENTlink modified 22 months ago • written 22 months ago by Pierre Lindenbaum122k

Thank you very much for the effort, Pierre!

It turns out I was just confused. The source of my vcf data claimed that the first column in the --freq or --counts output was NOT the reference column, but it actually was. So there was no problem to begin with...

ADD REPLYlink written 22 months ago by Joel Wallenius70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1117 users visited in the last hour