Question

compare two vcf files

2.6 years ago

순연 • 0

Hi. I want to compare the rs numbers in two VCF files and then check which of those rs numbers are in the top 10 percent. I don't know how to do this. Could you point me to any tools or blog posts that would help?

vcf • 2.4k views

ADD COMMENT • link 2.6 years ago by 순연 • 0

If you still need help, could you elaborate on what you mean by "which of the Rs numbers are in the top 10 percent"? Top 10% of what exactly? Do you simply want to know which rs IDs are shared between two VCF files?

ADD REPLY • link 2.6 years ago by sbstevenlee ▴ 480

Yes, I still need help.

I have VCF files downloaded from gnomAD and a VCF file that has finished annotation.
I want to compare the two files and, if they share rs numbers, collect the rs numbers that appear in both.
I then want to rank them into the top 10% or the top 5% and, beyond that, Google the rs numbers in the top 10%
to check whether or not they are disease-causing.

ADD REPLY • link 2.6 years ago by 순연 • 0

After finding which rs IDs are shared (which is the easy part), how do you want to "rank" them? Do you mean to rank them by their allele frequency across all the samples in the two VCF files (i.e. prevalence)? Also, how big are your VCF files? BTW, I can read Korean, so if you prefer, you are more than welcome to comment in Korean :)
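To make the ranking question concrete: if allele frequency turns out to be the criterion (that is only my assumption here, and the frequencies below are made-up toy values, not real data), picking the "top 10%" could look roughly like this sketch. The function name `top_percent` is hypothetical.

```python
import math


def top_percent(value_by_id, percent=10):
    """Return the IDs with the highest values, keeping the top `percent` of them.

    At least one ID is returned for non-empty input. Ties keep input order.
    """
    if not value_by_id:
        return []
    ranked = sorted(value_by_id, key=value_by_id.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


# Made-up allele frequencies for ten shared rs IDs (illustration only).
toy_af = {f"rs{i}": i / 100 for i in range(1, 11)}

print(top_percent(toy_af, 10))
print(top_percent(toy_af, 20))
```

Any other per-variant number (for example, how many samples carry the variant) could be passed in the same way, which is why the ranking criterion needs to be settled first.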

ADD REPLY • link 2.6 years ago by sbstevenlee ▴ 480

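I have VCF files received from gnomAD and a VCF file whose annotation is finished.
I want to compare the two files and, if there are rs numbers present in both, collect the duplicated rs numbers among those common rs numbers,
grade them as top 10% or top 5%, and, going further, Google the rs numbers that fall within the top 10%
to check whether or not they are rs numbers that can cause disease.

I have been Googling hard, but even comparing two files is not easy, so I posted this question. I would appreciate any help on which tools or blogs I should look at to compare two files and find the common rs numbers. Thank you.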

ADD REPLY • link 2.6 years ago by 순연 • 0

You still haven't answered my two questions: 1) how you want to rank those common rs IDs and 2) how big the two files are. So, for now, I will just show how you can find the rs IDs that are common between any two VCF files using Python and the pyvcf submodule I wrote:

# find_common_rs_numbers.py

from fuc import pyvcf
import pandas as pd

data1 = {
    'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'POS': [100, 101, 102, 103, 104],
    'ID': ['rs1', 'rs2', 'rs3', 'rs4', 'rs5'],
    'REF': ['G', 'T', 'A', 'G', 'T'],
    'ALT': ['A', 'C', 'C', 'T', 'A'],
    'QUAL': ['.', '.', '.', '.', '.'],
    'FILTER': ['.', '.', '.', '.', '.'],
    'INFO': ['.', '.', '.', '.', '.'],
    'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
    'A': ['0/1', '0/1', '0/1', '0/1', '0/1']
}

vf1 = pyvcf.VcfFrame.from_dict([], data1)
# vf1 = pyvcf.VcfFrame.from_file('first.vcf') <--- CHANGE ME

# >>> vf1.df
#   CHROM  POS   ID REF ALT QUAL FILTER INFO FORMAT    A
# 0  chr1  100  rs1   G   A    .      .    .     GT  0/1
# 1  chr1  101  rs2   T   C    .      .    .     GT  0/1
# 2  chr1  102  rs3   A   C    .      .    .     GT  0/1
# 3  chr1  103  rs4   G   T    .      .    .     GT  0/1
# 4  chr1  104  rs5   T   A    .      .    .     GT  0/1

data2 = {
    'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    'POS': [100, 103, 201, 202, 203],
    'ID': ['rs1', 'rs4', 'rs6', 'rs7', 'rs8'],
    'REF': ['G', 'G', 'A', 'G', 'T'],
    'ALT': ['A', 'T', 'C', 'T', 'A'],
    'QUAL': ['.', '.', '.', '.', '.'],
    'FILTER': ['.', '.', '.', '.', '.'],
    'INFO': ['.', '.', '.', '.', '.'],
    'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
    'B': ['0/1', '0/1', '0/1', '0/1', '0/1']
}

vf2 = pyvcf.VcfFrame.from_dict([], data2)
# vf2 = pyvcf.VcfFrame.from_file('second.vcf') <--- CHANGE ME

# >>> vf2.df
#   CHROM  POS   ID REF ALT QUAL FILTER INFO FORMAT    B
# 0  chr1  100  rs1   G   A    .      .    .     GT  0/1
# 1  chr1  103  rs4   G   T    .      .    .     GT  0/1
# 2  chr1  201  rs6   A   C    .      .    .     GT  0/1
# 3  chr1  202  rs7   G   T    .      .    .     GT  0/1
# 4  chr1  203  rs8   T   A    .      .    .     GT  0/1

# rs1 and rs4 are shared between the two VCF files.

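# Drop repeats within each file first, so an ID that occurs twice in one
# file is not mistaken for a shared ID. Missing IDs ('.') are skipped too.
ids1 = vf1.df.ID.drop_duplicates()
ids2 = vf2.df.ID.drop_duplicates()
s = pd.concat([ids1, ids2], ignore_index=True)

for x in s[s.duplicated() & (s != '.')]:
    print(x)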

The script above lists the rs IDs common to your VCF files:

$ python3 find_common_rs_numbers.py
rs1
rs4

Note that you will need to install the fuc package to use the script above (run $ conda install -c bioconda fuc).
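If the files turn out to be too large to load comfortably into memory (gnomAD downloads often are), a standard-library sketch that streams the files line by line may be an alternative. This does not use fuc; the function names are hypothetical, and it assumes a tab-delimited VCF whose third column is ID, possibly holding several IDs separated by semicolons as the VCF specification allows.

```python
import gzip


def iter_rs_ids(path):
    """Yield every rs ID found in the ID column of a plain or gzipped VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):           # skip meta and header lines
                continue
            fields = line.rstrip("\n").split("\t", 3)
            if len(fields) < 3:                # not a valid record line
                continue
            for rs_id in fields[2].split(";"):
                if rs_id.startswith("rs"):     # ignores missing IDs such as '.'
                    yield rs_id


def common_rs_ids(small_path, large_path):
    """Return the sorted rs IDs present in both files.

    Only the IDs of `small_path` are held in memory; `large_path` is streamed.
    """
    small_ids = set(iter_rs_ids(small_path))
    shared = {rs_id for rs_id in iter_rs_ids(large_path) if rs_id in small_ids}
    return sorted(shared)
```

Calling common_rs_ids('first.vcf', 'second.vcf.gz') would then return a list such as ['rs1', 'rs4'] for the toy data above; pass the smaller file first so that less has to be kept in memory.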

ADD REPLY • link 2.6 years ago by sbstevenlee ▴ 480

Thank you, sir, and happy Korean Thanksgiving Day. 1) How I want to rank those common rs IDs: using the code you showed me, I want to find which rs IDs are duplicated in the two files, and then see which rs numbers are in the top 10%.

In fact, I can't think of any way to do that other than Excel. Is there a good alternative?

2) How big the two files are: 1. The sample data totals about 2.45 GB and is split by chromosome (chr1, chr2, and so on), so the sample files are divided the same way: chr1 is 200 MB, chr2 is 202 MB, and chr3 is 199 MB. 2. The file I want to compare against is a gnomAD file.

ADD REPLY • link 2.6 years ago by 순연 • 0

score 0 · Answer 1 · 2021-09-16

2.6 years ago

Pierre Lindenbaum 161k

comm -12 <(bcftools query -f '%ID\n' in1.vcf | sort |uniq)  <(bcftools query -f '%ID\n' in2.vcf | sort |uniq)

ADD COMMENT • link 2.6 years ago by Pierre Lindenbaum 161k

Thank you. I'll try running your command! But what does comm -12 mean?

ADD REPLY • link 2.6 years ago by 순연 • 0

https://linux.die.net/man/1/comm
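In short, comm compares two sorted files and prints three columns: lines only in the first file, lines only in the second, and lines in both. The options -1 and -2 suppress the first two columns, so comm -12 prints only the lines common to both files. A rough Python sketch of the same merge logic (the function name `comm_12` is just for illustration, and the IDs are the toy ones used earlier in this thread):

```python
def comm_12(lines1, lines2):
    """Mimic `comm -12` on two sorted, de-duplicated lists of lines."""
    i = j = 0
    common = []
    while i < len(lines1) and j < len(lines2):
        if lines1[i] == lines2[j]:
            common.append(lines1[i])   # column 3: present in both inputs
            i += 1
            j += 1
        elif lines1[i] < lines2[j]:
            i += 1                     # column 1 (only in file 1), hidden by -1
        else:
            j += 1                     # column 2 (only in file 2), hidden by -2
    return common


# Same role as `sort | uniq` in the shell command.
ids1 = sorted({"rs1", "rs2", "rs3", "rs4", "rs5"})
ids2 = sorted({"rs1", "rs4", "rs6", "rs7", "rs8"})

print(comm_12(ids1, ids2))  # ['rs1', 'rs4']
```

This is why both inputs are piped through sort and uniq first: comm expects sorted input.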

ADD REPLY • link 2.6 years ago by Pierre Lindenbaum 161k