Forum:Principles of Tidy Data (Hadley Wickham) and the VCF format
4
14
Entering edit mode
7.4 years ago

Hadley Wickham, the author of ggplot and many other popular R packages, has recently published a very good paper regarding the principles of tidy data. This article introduces a new library called tidyr, and also proposes a standard for formatting and organizing data before data analysis.

I personally think that the principles proposed in the article are very good, and that they help a lot in data analysis. Some of these are already adopted by many ggplot2/plyr users, as you need a data frame in a long format in order to produce most of the plots.

My question is whether it would make sense to apply these principles to bioinformatics. In particular, if we look at the VCF format, it fails at least two of the rules mentioned in the paper:

• "3.1. Column headers are values, not variable names" (because individuals are encoded on distinct columns)
• "3.2. Multiple variables stored in one column" (because each genotype column contains the status of one or more alleles, plus its coverage etc...

For example, let's take the example from the 4.0 specs of VCF:

> vcf = read.table('vcf_example.vcf', header=T, comment.char='')
> vcf

X.CHROM     POS        ID  REF     ALT QUAL FILTER                              INFO      FORMAT        NA00001        NA00002      NA00003
1      20   14370 rs6054257    G       A   29   PASS           NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
2      20   17330         .    T       A    3    q10               NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50   0|1:3:5:65,3     0/0:41:3
3      20 1110696 rs6040355    A     G,T   67   PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27   2|1:2:0:18,2     2/2:35:4
4      20 1230237         .    T       .   47   PASS                   NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51     0/0:61:2
5      20 1234567 microsat1 GTCT G,GTACT   50   PASS                    NS=3;DP=9;AA=G    GT:GQ:DP       0/1:35:4       0/2:17:2     1/1:40:3


If we were to convert this data to a "tidyr" format, we could use the following:

> vcf %>%
gather(individual, value, -c(X.CHROM:FORMAT)) %>%
separate(value, into=strsplit('GT:GQ:DP:HQ', ':')[[1]], ':', extra='drop') %>%
separate('GT', into=c('allele1', 'allele2'), '[|/]') %>%
gather(allele, genotype, -c(X.CHROM:individual, GQ:HQ)) %>%
arrange(X.CHROM, POS, ID, individual) %>%
select(-INFO, -FORMAT, -FILTER) %>%  # let's omit this for better visualization
subset(ID!='microsat1')              # let's omit this for better visualization

X.CHROM  POS          ID     REF ALT QUAL individual GQ DP    HQ  allele genotype
20       14370   rs6054257   G   A   29   NA00001    48  1 51,51 allele1        0
20       14370   rs6054257   G   A   29   NA00001    48  1 51,51 allele2        0
20       14370   rs6054257   G   A   29   NA00002    48  8 51,51 allele1        1
20       14370   rs6054257   G   A   29   NA00002    48  8 51,51 allele2        0
20       14370   rs6054257   G   A   29   NA00003    43  5   .,. allele1        1
20       14370   rs6054257   G   A   29   NA00003    43  5   .,. allele2        1
20       17330           .   T   A    3   NA00001    49  3 58,50 allele1        0
20       17330           .   T   A    3   NA00001    49  3 58,50 allele2        0
20       17330           .   T   A    3   NA00002     3  5  65,3 allele1        0
20       17330           .   T   A    3   NA00002     3  5  65,3 allele2        1
20       17330           .   T   A    3   NA00003    41  3  <NA> allele1        0
20       17330           .   T   A    3   NA00003    41  3  <NA> allele2        0
20       1110696 rs6040355   A G,T   67   NA00001    21  6 23,27 allele1        1
20       1110696 rs6040355   A G,T   67   NA00001    21  6 23,27 allele2        2
20       1110696 rs6040355   A G,T   67   NA00002     2  0  18,2 allele1        2
20       1110696 rs6040355   A G,T   67   NA00002     2  0  18,2 allele2        1
20       1110696 rs6040355   A G,T   67   NA00003    35  4  <NA> allele1        2
20       1110696 rs6040355   A G,T   67   NA00003    35  4  <NA> allele2        2
20       1230237         .   T   .   47   NA00001    54  7 56,60 allele1        0
20       1230237         .   T   .   47   NA00001    54  7 56,60 allele2        0
20       1230237         .   T   .   47   NA00002    48  4 51,51 allele1        0
20       1230237         .   T   .   47   NA00002    48  4 51,51 allele2        0
20       1230237         .   T   .   47   NA00003    61  2  <NA> allele1        0
20       1230237         .   T   .   47   NA00003    61  2  <NA> allele2        0


(note that I omitted a few columns for visualization purpose and that I should also separate the HQ and INFO columns, but I think the concept is clear)

On one hand, this tidy format seems a lot more redundant than what is needed. Lot of information is repeated multiple times, and this file would require a lot of disk space to be stored (although maybe an efficient compression or a data base would reduce the problem).

On the other hand, this tidy format may be easier to use. To me, it seems easier to get the status of a given allele in a given individual, as I just need to find the corresponding row. Another advantage is that it would be easier to add new information to the file, like the population group of each individual, just by adding a new column.

> vcf.tidy %>% mutate(genotype.nt = ifelse(genotype==0, as.character(REF), as.character(ALT)), pop=ifelse(individual %in% c('NA00001', 'NA00002'), 'CHB', 'YRI'))

X.CHROM POS     ID        REF ALT QUAL individual GQ DP HQ    allele  gen gen.nt pop
20        14370 rs6054257 G   A   29   NA00001    48 1  51,51 allele1 0   G      CHB
20        14370 rs6054257 G   A   29   NA00001    48 1  51,51 allele2 0   G      CHB
20        14370 rs6054257 G   A   29   NA00002    48 8  51,51 allele1 1   A      CHB
20        14370 rs6054257 G   A   29   NA00002    48 8  51,51 allele2 0   G      CHB
20        14370 rs6054257 G   A   29   NA00003    43 5  .,.   allele1 1   A      YRI
20        14370 rs6054257 G   A   29   NA00003    43 5  .,.   allele2 1   A      YRI
20        17330 .         T   A    3   NA00001    49 3  58,50 allele1 0   T      CHB
20        17330 .         T   A    3   NA00001    49 3  58,50 allele2 0   T      CHB
20        17330 .         T   A    3   NA00002    3  5  65,3  allele1 0   T      CHB
20        17330 .         T   A    3   NA00002    3  5  65,3  allele2 1   A      CHB
20        17330 .         T   A    3   NA00003    41 3  <NA>  allele1 0   T      YRI
20        17330 .         T   A    3   NA00003    41 3  <NA>  allele2 0   T      YRI
20      1110696 rs6040355 A   G,T 67   NA00001    21 6  23,27 allele1 1   G,T    CHB
20      1110696 rs6040355 A   G,T 67   NA00001    21 6  23,27 allele2 2   G,T    CHB
20      1110696 rs6040355 A   G,T 67   NA00002    2  0  18,2  allele1 2   G,T    CHB
20      1110696 rs6040355 A   G,T 67   NA00002    2  0  18,2  allele2 1   G,T    CHB
20      1110696 rs6040355 A   G,T 67   NA00003    35 4  <NA>  allele1 2   G,T    YRI
20      1110696 rs6040355 A   G,T 67   NA00003    35 4  <NA>  allele2 2   G,T    YRI
20      1230237 .         T   .   47   NA00001    54 7  56,60 allele1 0   T      CHB
20      1230237 .         T   .   47   NA00001    54 7  56,60 allele2 0   T      CHB
20      1230237 .         T   .   47   NA00002    48 4  51,51 allele1 0   T      CHB
20      1230237 .         T   .   47   NA00002    48 4  51,51 allele2 0   T      CHB
20      1230237 .         T   .   47   NA00003    61 2  <NA>  allele1 0   T      YRI
20      1230237 .         T   .   47   NA00003    61 2  <NA>  allele2 0   T      YRI


What do you think? Will we all convert to tidy VCF in the far future?

vcf tidy-data • 8.3k views
1
Entering edit mode

Yes, a format more geared towards tidy data would be highly beneficial for the field.

5
Entering edit mode
7.4 years ago
lh3 33k

VCF encodes two-dimensional information: position*sample. Both position (row) and sample (column) are variable names. This is the key reason that makes VCF complex and inconsistent to some extent. I don't see a clear solution to this problem. You basically use (position,sample) pair as row. This leads to redundant information and redundancy leads to inconsistencies and format parsing problems. For example, it is possible for the first two NA00001 lines to have different IDs and ALT alleles. The parser has to group rows and check consistencies in the group. For another example, we have have the same AF on different lines. When we want to retrieve AF, which line we should use? These are non-trivial. IMO, a good format should have minimal redundancies unless doing this greatly hurts expressiveness. In addition, your transformed format contains less information in the original VCF. You don't have AA/AF/etc. When you make a lossless transformation, the pair-as-row format will also become messier. "Tidy" is good for intermediate data processing, but it is not a good principle for format design.

For VCF, JSON would be a cleaner representation, but the problems caused by JSON may outweigh its benefit.

Separating genotypes is interesting. Some are advocating a single-allele model to describe variants. It is somewhat related.

0
Entering edit mode

"This leads to redundant information and redundancy leads to inconsistencies and format parsing problems" - No, that's what tidy data is all about.

You are correct that a validator script would have to read several lines to validate the file. This would not pose a problem using for example the tidyr package.

2
Entering edit mode

It is a problem if the data are not self-consistent, which may happen frequently in practice. Adding redundancies just adds more ways to shoot ourselves in the foot. Tidy is a good principle for intermediate analyses, but a bad one for format design. A good format should enforce self-consistencies via its structure.

0
Entering edit mode

I can see the point. If a user accidentally modifies one of the redundant values, e.g. by changing the SNV id in one line, then it would be very difficult to reconstruct the data correctly. However, this possibility may be reduced if the file is compressed or the data is kept in a database.

2
Entering edit mode

Variant callers, annotators and every VCF writer are also more likely to generate inconsistent information if they directly operate on tidyVCF. A tidyVCF generated from a normal VCF is always consistent, within the limit of VCF.

2
Entering edit mode
7.4 years ago

I do like this representation quite a bit. IMHO what it does need however is an approach to select and display all attributes at a given position. For example a simple command line utility that finds and produces all ALT columns at say position 1110696 but also transposes them (and does it really fast).

cutvcf -p 14370,1119696 -f ALT, individual


otherwise one has to keep writing programs that combine on the POS column.

0
Entering edit mode

It may be something like:

vcf.tidy %>% subset(POS > 14370 & POS < 1119696) %>% group_by(individual) %>% summarise(ALT=paste0( unique(gen[allele=='allele2']), collapse=','))

individual ALT
1    NA00001 0,2
2    NA00002   1
3    NA00003 0,2

2
Entering edit mode
7.4 years ago

I think this is a great point, and a worthy contribution. The "let's cram in a bunch of serialized stuff into one column" problem has been an issue with VCF and other formats like GFF.

I think we should face the fact that if we include annotation, each transcript and variant effect will result in a "cartesian explosion" that will multiply the rows ad infinitum.

Because everyone cooks up their own goofy annotation scheme, VCF is probably not going to be a long-term solution for warehousing annotated variants much longer, so maybe tidy-VCF would be a good idea for communicating unannotated variants.

In that case I would dispense with all superfluous columns (even REF), and just stick with the basics

CHR POS   ALT SAMPLE   READS QUAL
20  14370 A   NA00001  51    29
20  14370 G   NA00001  50    29
20  15000 +G  NA00002  50    29


Not sure what to do about gVCF reference ranges.

1
Entering edit mode
7.4 years ago

Very redundant call format, perhaps. If a new format is being developed and memory is not a consideration, you might make every feature an instance of a proper S4 R object.

A class member is a list if it needs to store 1..∞ of that feature element, or a numeric or character, otherwise. Standard object design principles apply. Accessors and methods provide access to an object or apply filters to generate subsets.

Where there is redundancy, e.g., chromosome and position keys, you don't need to make a brand new object - just push another set of fields on the redundant object.

Genomics formats should, in principle, render well to object representation. Rules for fields are usually fairly clearly defined, and sensible conventions can be chosen where there are gaps in definition. Methods provide a contract for accessing data.