How To Transform Adjacent Complex Variants In Vcf Format To Individuals' Sequences From 1000Genomes Project?
1
4
Entering edit mode
12.2 years ago
Shinvkuo ▴ 130

Hello, everyone. Recently, I have downloaded the latest variant data of 1000genomes project from ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20110521/ and the mapped data from ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/data/HG00096/. I want to transform these variants(INDEL and SV) to sequences for each individual. I haven't found any program available to do the job. Therefore, I writed a script according to these rules: 1) For an insert, next_pos=current_pos+1; 2) For a delete or SV, next_pos=current_pos+length(ref_allele). But I encountered some problem for adjacent variants as follows:

CHROM    POS    ID    REF    ALT    HG00096
20    131503    .    ATC    A    1|0:0.940:0.00,-0.70,-31.60
20    131505    rs3079754    CTCT    C    1|0:1.000:-3.20,0.00,-40.50

I got the reference sequence as ATCTCTTG ##chr20:131503-131010. But when I resolved the first position 131503 which resulted the next position as 131506(131503+3) skipping the variant rs3079754, thus I got the sequence A(131503-131505) + TCTTG(131506-131510) as ATCTTG containing the variants. And I observed the mapped result as ATCTG using IGV. Two are different, Which I should choose? Besides, other instances:

20    179735    rs76051089    TTAAAA    T
20    179741    .    TAAAAG    T

The refernce sequence is TTAAAATAAAAGTGAA ##chr20:179735-179750, I got the variant sequence T(179735-179740) + T(179741-179746) + TGAA(179747-179750) as TTTGAA. And the mapped result is TTAAAAAA observed from IGV. Two are different.

20    144831    .    T    TC
20    144832    rs35137860    C    CT

The refernce sequence is TCTTG ##chr20:144831-144835, I got the variant sequence TC(144831) + CT(144832) + TTG(144833-144835) as TCCTTTG. And the mapped result is TCTTTG observed from IGV. Two are different.

20    356672    rs60652239    A    AAAC
20    356674    .    A    AC

The refernce sequence is AAAAA ##chr20:356672-356676, I got the variant sequence AAAC(356672) + A(356673) + AC(356674) + AA(356675-356676)as AAACAACAA. And the mapped result is AAACAACAA observed from IGV. Two are the same.

20    386575    .    GTTCT    G
20    386582    .    CTTTCT    C
20    386585    rs59090432    TC    T

The refernce sequence is GTTCTTTCTTTCTT ##chr20:386575-386588, I got the variant sequence G(386575-386579) + TT(386580-386581) + C(386582-386587) + T(386588) as GTTCT (skipped rs59090432 variant). And the mapped result is GTTCTTTTTT observed from IGV. Two are different.

How do these differences happen, do they originate from MaCH/Thunder imputation? Did I do wrongly? Now I handled variants one by one, should I consider adjacent variants as a whole which is definitely more complex? Or is there any existing program I missed to do the transformation job?

Any suggestions and discussions are welcome. Thanks!

vcf genome indel sv variant • 3.1k views
ADD COMMENT
1
Entering edit mode

Many steps are assuming site independencies. The confliction is a result of that. Just do whatever you like. The VCF is probably wrong. You cannot fix that without going back to the alignment.

ADD REPLY
1
Entering edit mode
12.2 years ago
Gustavo ▴ 530

"But when I resolved the first position 131503 which resulted the next position as 131506(131503+3) skipping the variant rs3079754 [...]"

The coordinates of the variants are relative to the reference. Resolving one variant should not cause a skip of other variants. You might want to process them in reverse numerical order to make it easier.

Looking at your examples:

20 131503 ATC->A & 131505 CTCT->C: I read this as "lose the TC after the A, and lose the TCT after the C", which yields ATCTCTTG -> A(tc)(tct)TG -> ATG... which is different from both results you mentioned...

20 179735 TTAAAA->T & 179741 TAAAAG->T: TTAAAATAAAAGTGAA -> T(taaaa)T(aaaag)TGAA -> TTTGAA = your answer.

20 144831 T->TC & 144832 C->CT: TCTTG -> T(+c)C(+t)TTG -> TCCTTTG = your answer.

20 356672 A->AAAC & 356674 A->AC: AAAAA -> A(+aac)AA(+c)AA -> AAACAACAA = we all agree!

20 386575 GTTCT->G & 386582 CTTTCT->C & 386585 TC->T: GTTCTTTCTTTCTT -> G(ttct)TTC(ttt(c)t)T -> GTTCT = your answer, but UGH, these last two variants DO appear to conflict. My guess is that the former is the "correct" variant and the latter is there simply because it's a previously named variant. Do note these variants are located in the poly-A tail of an AluSx repeat... so there may be mapping issues confounding this spot.

ADD COMMENT
0
Entering edit mode

Gustavo, thanks for your reply! I have emailed this problem to 1000genomes group. These two adjacent events could be actually just one. "20 131503 . ATC A" could not exist. If we just discard it (131503), the resolved alt sequence AT(131503-131504, no variant at 131503) + C(131505-131508) + TG(131509-131510) as ATCTG will be the same as the mapped result using IGV.

ADD REPLY
0
Entering edit mode

Certainly, there still be some events different from the mapped results. I think this problem could be due to sequence self-similarity (like: TCTC[131504-131507], TAAAATAAAA[179736-179745]) around certain regions. And the 1000genomes analysis group may solve it later.

ADD REPLY

Login before adding your answer.

Traffic: 1742 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6