Question

Converting custom VCF to standard VCF

0

Entering edit mode

8 months ago

avelarbio46 ▴ 30

Hello!

I have received some custom vcfs (yes, I know, and I hate it) containing information of each sample per row, independent of variant. This means that some variants are repeated.

I have the following columns:

columns 1-17

#CHROM  POS GENE    REF ALT QUAL    FILTER  INFO    PTID    AF  AQ  GT  DP  AD  AB  GQ  PL

GT is genotype, AF is allele frequency, DP is depth and AB is the fraction of each allele. I don`t care for extra information except those that do define the frequency or genotype of alleles.

So for some variants:

             PTID
VARIANT A    SAMPLE1
VARIANT A    SAMPLE2
etc

Where Filter, Qual and Info are empty. PTID is the ID for each sample

What I need is a multisample vcf like

#CHROM  POS GENE    REF ALT QUAL    FILTER INFO SAMPLE1_genotype SAMPLE2_genotype etc

Is there any way of converting this to a multi sample vcf based on the PTID? I have no idea where to start as each variant is defined by pos gene ref and alt (I assume), and I`m worried of messing with the file and getting something wrong without knowing

It would be easier to just have the normal VCF, but the institution that gave this custom VCF is not easily accessible

VCF • 1.6k views

ADD COMMENT • link 7 months ago by avelarbio46 ▴ 30

1

Entering edit mode

You have many questions without comments or validation:

ADD REPLY • link updated 8 months ago by Ram 43k • written 8 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

I'm not sure what this means, but most questions also did not have many answers. I always give a thumbs up when the answers works so I'm not sure what is the expectation here! Didn't find any very informative description on the forum rules also

ADD REPLY • link 8 months ago by avelarbio46 ▴ 30

0

Entering edit mode

You should accept the answers that solve your question and/or follow-up others comments and advices with your experience for anyone come across your posts in the future. These posts come up in the search engines so someone else with the same problem would have easier time figuring out what worked and didn't work for you.

ADD REPLY • link 8 months ago by barslmn ★ 2.1k

score 1 · Answer 1 · 2023-08-23

1

Entering edit mode

8 months ago

Pierre Lindenbaum 161k

use awk . Something like this ( I'm too lazy to understand and retrieve all the fields..)

 awk 'BEGIN{printf("##INSERT VCF HEADER HERE\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n");} {printf("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\tPTID=AF=%s;AQ=s;GENE=\n",$1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$2);}' input.txt

ADD COMMENT • link 8 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

I think this might be getting closer, but still I'm not sure how to make the genotype for each sample into a new column like a normal multisample vcf. I've updated the question to make this more clear

ADD REPLY • link 8 months ago by avelarbio46 ▴ 30

score 1 · Answer 2 · 2023-08-24

1

Entering edit mode

8 months ago

barslmn ★ 2.1k

I had something similar while back: Converting TSV file to VCF

ADD COMMENT • link 8 months ago by barslmn ★ 2.1k

score 1 · Answer 3 · 2023-09-13

It would be easier to just have the normal VCF, but the institution that gave this custom VCF is not easily accessible

VCF files must adhere to the VCF File Format Specifications. The files you have been given are not VCF files. A "Custom VCF" would be a VCF file that includes INFO and/or FORMAT fields that are not defined in the specifications (e.g. "GT" is specifications-defined but "RANDOMinfoFIELDname" is not).

It would be easier to just have the normal VCF

I would ask your data provider for the actual VCF files for this data. What is likely to have happened is that they've generated real VCF files, done some data transformation on them, and you've been given the custom TSV file output from those transformations. Those TSVs may be useful for the some of the analysis they were doing but they're clearly not useful to you as, not being VCF files, you can't use any of the many many tools that input/output VCF.

score 0 · Answer 4 · 2023-09-13

0

Entering edit mode

7 months ago

avelarbio46 ▴ 30

I ended up using R for this question. The R community is very active and helped reach to an answer (which I customized).

#df.list is your list of dataframes/custom vcfs. I imported them as a named list.

for (i in seq_along(df.list)) {
olddata_long = df.list[[i]]
#get the correct columns of your vcf file. Mine were 1-17

olddata_long = olddata_long[1:17]

#Create FORMAT COLLUMN according to VCF format. GT should be the first
olddata_long["FORMAT"] = "GT:AQ:DP:AD:AB:GQ:PL"

#Correct column orders (FORMAT must be before sample names. GT must be the first after. Columns should be in the order you wrote down on format

olddata_long = olddata_long[c(1:8, 18, 9,12,11,13:17)]

 #transform to data wide, separated by : (vcf format standard)

  data_wide = olddata_long |>
    mutate(across(-c(1:10), ~ paste0(.x))) |> 
    unite("value", -c(1:10), sep = ":")  |> 
    pivot_wider(names_from = PTID, values_from = value)

#change NA to ./.:.:0:.:0,0:.:.           . In my case, I had to customize the NA to match the NA values of each specific column. For example, GT NA value is ./.       NA values are separated by :

data_wide[is.na(data_wide)] = "./.:.:0:.:0,0:.:."

#return dataframe to list
  df.list[[i]] = data_wide
}

I`m not 100% sure that this is in the correct complete VCF format now, but I can annotate with VEP, use BCFtools and tabix. Annotation is working fine on it, so I think its OK.

ADD COMMENT • link 7 months ago by avelarbio46 ▴ 30

0

Entering edit mode

This is not a great approach - R loads the entire file content into memory so you may run into problems. Stick to established tools/shell utilities as much as you can.

ADD REPLY • link 7 months ago by Ram 43k

0

Entering edit mode

I can split my files before hand if they are too big! I`m sure there might be ways that uses less memory, but this works.

ADD REPLY • link 7 months ago by avelarbio46 ▴ 30

1

Entering edit mode

Yeah I guess the presence of samples as rows makes a pivot-wider operation necessary, which makes awk sub-optimal. Good call.

It does seem like the previous answers address a now-altered question, so please don't edit your question in response to an answer when it makes the answer irrelevant. Instead of removing existing content, add corrections/changes as notes so people can follow the evolution of a post. In conjunction with the lack of follow up on your other posts (as listed in a comment above), this starts to look like you are abusing the forum, so please be mindful of your actions.

ADD REPLY • link 7 months ago by Ram 43k