Converting VCF to CSV
5.7 years ago
krp0001 ▴ 30

 

Dear all,

I have a multi-sample .vcf file generated by freebayes. I am looking for a tool or script that can convert it to a comma-separated file containing the SNP position followed by the genotype of each sample, similar to the example below, so that I can use it with H-scan to look for selective sweeps. I converted the VCF to tab-delimited format using Picard, but H-scan doesn't support that format.

2003,G,G,N,G,G,G,N,G,G,G,G,G,G,G,G,G,G,G,G,A,G,G,G,G,G,G,G,G,G,G,G,G,G,G

2052,G,G,G,G,G,N,N,G,G,N,G,G,G,G,A,G,G,G,G,N,G,G,G,G,G,G,G,G,G,G,G,G,G,A

2465,A,A,A,A,A,A,N,A,A,N,A,A,A,A,A,A,A,A,A,A,G,A,A,A,A,A,A,A,A,A,A,A,A,N

Thank you very much.

 

SNP genome gene Assembly R • 4.5k views

How do you represent heterozygous SNPs?


My data is homozygous, and the H-scan documentation explains how heterozygous SNPs are treated: https://dl.dropboxusercontent.com/u/77898333/H-scan.pdf

2003,G/G,G/N,N/N,G/C,G/A,G/G,N/G,G/N,A/A,G/A,G/N,N/N,N/A,G/G,G/A,G/N,N/N
2052,N/G,G/N,G/G,G/T,G/T,N/N,N/T,N/G,G/G,N/T,G/T,T/N,G/G,T/N,G/T,G/G,T/N
2465,A/A,A/A,A/N,A/A,A/G,A/A,G/A,A/N,N/A,G/A,A/A,N/A,A/A,A/G,N/N,G/G,A/N
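For data that is entirely homozygous, a two-allele call such as G/G can be reduced to the single-letter form used in the question. The following is a minimal sketch of that reduction, not H-scan's own logic: the function name `collapse_genotype` and the choice to leave heterozygous or partly missing calls unchanged are assumptions made for illustration. How H-scan itself treats such calls is described in the linked document and is not reproduced here.

```python
def collapse_genotype(call, missing="N"):
    """Return a single allele for homozygous calls, else the call unchanged.

    `call` is a string such as 'G/G', 'G|A', './' or 'G'.
    """
    alleles = call.replace("|", "/").split("/")
    # Treat '.' and empty fields as missing alleles.
    alleles = [missing if a in ("", ".") else a for a in alleles]
    if len(set(alleles)) == 1:
        return alleles[0]  # homozygous or fully missing: 'G/G' -> 'G'
    return "/".join(alleles)  # heterozygous or partly missing: kept as-is


row = "2003,G/G,G/N,N/N,G/C"
position, *calls = row.split(",")
print(",".join([position] + [collapse_genotype(c) for c in calls]))
```

Whether a partly missing call such as G/N should become G or N depends on the downstream tool, so the sketch deliberately leaves it untouched.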

Can you post the first few SNPs from your VCF file?

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  F11-12  F1-1    F12b-6  F13b-1  F4-3    F7b-3   F8-2    GR12-11 GR1
scaffold1       382     .       T       A       298.864 PASS    .       GT:GQ:DP:RO:QR:AO:QA:GL 
scaffold1       385     .       T       A,C     63.3529 PASS    .       GT:GQ:DP:RO:QR:AO:QA:GL 
scaffold1       386     .       G       C       930.561 PASS    .       GT:GQ:DP:RO:QR:AO:QA:GL 
scaffold1       446     .       C       T       3688.58 PASS    .       GT:GQ:DP:RO:QR:AO:QA:GL 
scaffold1       460     .       C       T       2781.98 PASS    .       GT:GQ:DP:RO:QR:AO:QA:GL 
scaffold1       534     .       C       T       3079.93 PASS    .       GT:GQ:DP:RO:QR:AO:QA:GL 
scaffold1       779     .       G       A       3049.96 PASS    .       GT:GQ:DP:RO:QR:AO:QA:GL 
scaffold1       783     .       G       A       2976.31 PASS    .       GT:GQ:DP:RO:QR:AO:QA:GL
5.7 years ago

First, create a tab-delimited file with vcf-to-tab (part of VCFtools):

cat test.vcf | vcf-to-tab | grep -v "^#" > tab.out

Then use this small Python snippet to get the desired format:

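with open("tab.out") as f:
    for line in f:
        snps = line.strip().split()
        # Combine chromosome and position into a unique SNP identifier.
        id = "_".join(snps[:2])
        # Sample genotypes start at index 3; index 2 is the reference allele, which is skipped.
        # Missing calls ("./") are written as "N".
        print(id + "," + ",".join(snp.replace("./", "N") for snp in snps[3:]))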

tab.out file:

1    3062915    GTTT    GTTT/G    GTTT/GT    GTTT/G
1    3106154    CAAAA    CAAAA/CA    CAAAA/C    CA/CA
1    3157410    GA    G/G    GA/G    GA/G
1    3162006    GAA    GAA/G    GAA/G    GAA/G
1    3177144    GT    GT/G    GT/G    GT/G
1    3184885    TAAAA    TA/T    TAAAA/TA    ./
1    3199812    G    ./    ./    GT/GT
2    3188209    GA    ./    GA/G    ./
2    3199812    G    GTT/GT    ./    ./
2    3212016    CTT    ./    ./    CTT/C
3    3199812    G    ./    GTT/GT    ./
3    3199815    G    ./    G/A    G/T
3    3212016    CTT    C/CT    ./    ./
3    3242491    TT    ./    ./    T/T
4    3212016    CTT    ./    CTT/C    ./
4    3258448    TACACACAC    TACACACAC/T    ./    ./
4    3291771    T    ./    ./    TAA/TAAA

Python snippet output:

1_3062915,GTTT/G,GTTT/GT,GTTT/G
1_3106154,CAAAA/CA,CAAAA/C,CA/CA
1_3157410,G/G,GA/G,GA/G
1_3162006,GAA/G,GAA/G,GAA/G
1_3177144,GT/G,GT/G,GT/G
1_3184885,TA/T,TAAAA/TA,N
1_3199812,N,N,GT/GT
2_3188209,N,GA/G,N
2_3199812,GTT/GT,N,N
2_3212016,N,N,CTT/C
3_3199812,N,GTT/GT,N
3_3199815,N,G/A,G/T
3_3212016,C/CT,N,N
3_3242491,N,N,T/T
4_3212016,N,CTT/C,N
4_3258448,TACACACAC/T,N,N
4_3291771,N,N,TAA/TAAA

This could probably also be done with standard Linux command-line tools.
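The two steps above can also be collapsed into a single pass over the VCF in Python, without the intermediate tab file. The sketch below is one possible way to do it, not the method used in this answer: the function name `vcf_to_csv_lines` is hypothetical, it assumes a tab-delimited VCF with GT as the first FORMAT key (as in the freebayes header posted above), and the three-sample input in the usage example is invented purely for illustration.

```python
import io


def vcf_to_csv_lines(vcf_handle, missing="N"):
    """Yield 'chrom_pos,call,call,...' lines from an open VCF file."""
    for line in vcf_handle:
        if line.startswith("#"):
            continue  # skip meta-information and the #CHROM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample columns on this line
        chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
        alleles = [ref] + alt.split(",")  # index 0 = REF, 1.. = ALT alleles
        calls = []
        for sample in fields[9:]:
            gt = sample.split(":")[0]  # GT is assumed to be the first key
            indices = gt.replace("|", "/").split("/")
            bases = [missing if i == "." else alleles[int(i)] for i in indices]
            if all(b == missing for b in bases):
                calls.append(missing)  # fully missing call -> single 'N'
            else:
                calls.append("/".join(bases))
        yield chrom + "_" + pos + "," + ",".join(calls)


# Invented example input, for illustration only.
example = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
    "scaffold1\t382\t.\tT\tA\t298.864\tPASS\t.\tGT:GQ\t0/0:99\t1/1:99\t./.:0\n"
)
for row in vcf_to_csv_lines(example):
    print(row)
```

For a real file, pass an open handle instead, for example `with open("test.vcf") as vcf:` followed by the same loop.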


Thank you very much for your reply. This is what I was looking for; I will post here if there are any issues.

Cheers


Hi Goutham,

Below is the output from your Python snippet. When I use it as input for H-scan, the result looks quite odd. Can you please tell me how to remove the "scaffold10322_" prefix in Python, so that each line starts with the position, like this:

77,N,N,G,G,N,N,G,G,N,G,G,G,N,G,G,G,N,G,G,N,N
79,G,N,G,G,G,N,G,G,G,G,G,G,G,G,G,G,G,G,G,N,G

Current output:

scaffold10322_77,N,N,G,G,N,N,G,G,N,G,G,G,N,G,G,G,N,G,G,N,N
scaffold10322_79,G,N,G,G,G,N,G,G,G,G,G,G,G,G,G,G,G,G,G,N,G
scaffold10322_81,C,C,C,C,C,N,C,C,C,C,C,C,C,C,C,C,C,C,C,N,C
scaffold10322_82,A,A,G,G,A,G,G,A,A,A,A,A,A,G,A,G,A,G,A,G,A
scaffold10322_83,T,T,T,T,T,T,T,A,T,A,T,A,A,T,T,T,A,T,T,T,A
scaffold10322_84,G,G,G,G,G,G,G,T,G,T,G,T,T,G,G,G,T,G,G,T,T
scaffold10322_85,G,G,G,T,G,T,G,G,G,G,G,G,G,G,G,G,G,T,G,G,G
scaffold10322_168,T,T,T,T,A,T,T,A,T,T,T,T,T,T,T,T,A,T,T,T,T
scaffold10322_731,C,C,C,C,C,C,T,T,C,C,C,C,C,C,T,C,C,C,T,T,C
scaffold10322_4513,C,C,C,C,G,C,C,G,G,C,C,G,G,C,C,C,G,C,C,C,G

If you have only one scaffold, just change id="_".join(snps[:2]) to id=snps[1].

I used chr_pos as the identifier to make sure every SNP has a unique ID, since the same position can occur on more than one scaffold.
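If the tab file contains several scaffolds, dropping the prefix makes positions ambiguous across them. One way around that, sketched below, is to group the position-only rows by scaffold so each group can be written to its own file. The function name `split_by_scaffold` is hypothetical, and whether H-scan expects one scaffold per input file is an assumption, not something stated in this thread.

```python
from collections import defaultdict


def split_by_scaffold(tab_lines, missing="N"):
    """Group position-only CSV rows by scaffold (first column of the tab file)."""
    rows = defaultdict(list)
    for line in tab_lines:
        snps = line.split()
        if len(snps) < 4:
            continue  # need chrom, pos, ref and at least one sample
        chrom, pos = snps[0], snps[1]
        # Samples start at index 3; "./" marks a missing call.
        calls = [snp.replace("./", missing) for snp in snps[3:]]
        rows[chrom].append(pos + "," + ",".join(calls))
    return dict(rows)


# Two lines taken from the tab.out example in the answer above.
demo = [
    "1    3199812    G    ./    ./    GT/GT",
    "2    3199812    G    GTT/GT    ./    ./",
]
for chrom, csv_rows in split_by_scaffold(demo).items():
    print(chrom, csv_rows)
```

With a real file, `split_by_scaffold(open("tab.out"))` returns a dictionary whose values can each be written to a separate output file.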
