Question: vcf files - extract SNPs
0
gravatar for paraskevopou
14 months ago by
paraskevopou20
paraskevopou20 wrote:

Hi there all! I have 2 vcf files with filtered SNPs and now I want to extract a. The genes that are shared between the 2 vcf files b. The SNPs that are different on the shared genes between these two vcf files. for example 1.vcf

TRINITY_DN17985_c2_g1    539            0      G     A
TRINITY_DN17985_c2_g1   576         0   G   A 
TRINITY_DN17985_c2_g1   710         0   A   T
TRINITY_DN17985_c2_g1   722         0   T   C
TRINITY_DN17985_c2_g1   822         0   A   G
TRINITY_DN17985_c2_g1   1608    0   T   G
TRINITY_DN17985_c2_g1   2353    0   T   A
TRINITY_DN17985_c2_g1   4371    0   T   A
TRINITY_DN17985_c2_g1   6200    0   C   T
TRINITY_DN17985_c2_g1   7444    0   G   T
TRINITY_DN17985_c2_g1   7456    0   A   T
TRINITY_DN17985_c2_g1   2434    0   G   T
TRINITY_DN17985_c2_g1   4190    0   G   A
TRINITY_DN17985_c2_g1   4275    0   A   G
TRINITY_DN15509_c1_g1   626         0   A   C
TRINITY_DN15509_c1_g1   1647    0   G   A
TRINITY_DN15425_c1_g4   741         0   G   A
TRINITY_DN12988_c0_g1   1553    0   T   A
TRINITY_DN14521_c2_g1   826         0   T   A

2.vcf

TRINITY_DN17985_c2_g1   539         0      G     T
TRINITY_DN17985_c2_g1   576         0   G   T 
TRINITY_DN17985_c2_g1   710         0   A   G
TRINITY_DN17985_c2_g1   722         0   T   C
TRINITY_DN17985_c2_g1   822         0   A   G
TRINITY_DN17985_c2_g1   1608    0   T   G
TRINITY_DN17985_c2_g1   2353    0   T   A
TRINITY_DN17985_c2_g1   4371    0   T   A
TRINITY_DN17985_c2_g1   6200    0   C   T
TRINITY_DN17985_c2_g1   7444    0   G   T
TRINITY_DN17985_c2_g1   7456    0   A   T
TRINITY_DN16743_c0_g1   5739    0   G   A
TRINITY_DN16743_c0_g1   5764    0   T   A
TRINITY_DN16743_c0_g2   199         0   A   T
TRINITY_DN16295_c0_g1   982         0   A   G

and I want to extract only the following for which the ALT allele is different between 1.vcf and 2.vcf files.

TRINITY_DN17985_c2_g1   539         0      G     A
TRINITY_DN17985_c2_g1   576         0   G   A 
TRINITY_DN17985_c2_g1   710         0   A   T

TRINITY_DN17985_c2_g1   539         0      G     T
TRINITY_DN17985_c2_g1   576         0   G   T 
TRINITY_DN17985_c2_g1   710         0   A   G

I know that I can use bcftools isec for getting the SNPs that are shared between the two or the SNPs that are unique in the first and second file respectively. But how I can get first the genes that are present in both files?

Thanks a lot in advance for any help! Sofia

snp rna-seq • 388 views
ADD COMMENTlink modified 14 months ago by Michael Kosicki80 • written 14 months ago by paraskevopou20

I suppose this is not the format that the post should have. The VCF format doesn't look like a VCF format. Can you please wrap it around the code flag? (little button with 101010)

ADD REPLYlink modified 14 months ago • written 14 months ago by Macspider2.8k

I have just done that (wrapped it in code format)

ADD REPLYlink written 14 months ago by Kevin Blighe41k
3
gravatar for Michael Kosicki
14 months ago by
Wellcome Trust Sanger Institute
Michael Kosicki80 wrote:

If the first column is the "genes", then:

bedtools intersect -wa -wb -a 1.vcf -b 2.vcf > shared.vcf
grep -v '^#' shared.vcf | cut -f1 -d' ' shared.vcf | sort | uniq  # list of unique genes

Then, assuming it's the same reference (ie ref is the same on the same position) and that both vcf files actually have 10 columns:

grep -v '^#' shared.vcf | awk '$5!=$15' > diff.vcf   # differing positions, no header

Of course, depending on what you want to do with that file afterwards, you may want to restore the header and remove columns 11-20 (or add them as new rows etc).

ADD COMMENTlink written 14 months ago by Michael Kosicki80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1756 users visited in the last hour