Question: Best/most common file format for CNV data
5.2 years ago by
thondeboer40
thondeboer40 wrote:

Hi,

 

We are developing some software with our Archer NGS kits that produces some CNV data and I was wondering what the best (or, more appropriately, most common) standard data file format is for CNV data? I hate to invent YAF (Yet Another Format) since we are already drowning in data file formats.

I found a list, but it is mostly based on SNP arrays. What we want to produce is NGS coverage-based data, so I am not sure that most of these formats are relevant or appropriate.

  • Affymetrix ChAS: Copy Number Segment Data (tsv)
  • Affymetrix CNAG: Copy Number Data File (txt), Copy Number Segment Data (txt), LOH Segment Data (txt)
  • Affymetrix CNAT: Affymetrix Copy number CNT file (cn.cnt)
  • Affymetrix GTC 3: Copy Number or LOH Data File (cnchp|lohchp), Copy Number Segment Data (cn_segments|tsv)
  • Agilent: Aberration and LOH Interval Report (tsv|xls), Agilent Interval-based aberration report (tsv|xls), Agilent Probe-based aberration report (tsv|xls)
  • ArrayCGHBase: ArrayCGHBase aberration report (txt)
  • BlueGnome: BlueFuse CGH Summary (xls)
  • Illumina GenomeStudio: Copy Number Data File (txt), QuantiSNP GenomeStudio Plugin bookmark file (txt)
  • Illumina KaryoStudio: KaryoStudio regions file (txt)
  • Nexus Copy Number: Nexus regions file (txt)
  • NimbleGen: NimbleGen data summary file (txt), NimbleGen Segtable file (txt)
  • OGT CytoSure: OGT aberration report (txt)
  • QuantiSNP: QuantiSNP result file (txt), QuantiSNP GenomeStudio Plugin bookmark file (txt)

Thanks

Thon

Enzymatics, Inc.

cnv data format • 4.6k views
ADD COMMENTlink modified 5.2 years ago by Chris Miller21k • written 5.2 years ago by thondeboer40
5.2 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

The most common format I'm aware of is similar to the default output of the DNAcopy package:

    Chr   Start    Stop    Num_Probes    Segment_Mean

Variations include a leading column that has the sample name. For WGS data, num_probes is generally 'number of genomic windows'.
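
As a rough illustration of the layout above, here is a minimal Python sketch that writes and reads such a tab-separated segment table, including the optional leading sample column. The function names and the example values are hypothetical and only meant to show the column order; this is not output from DNAcopy or any specific tool.

```python
import csv
import io

# Column order follows the layout quoted above; "Sample" is the optional leading column.
COLUMNS = ["Sample", "Chr", "Start", "Stop", "Num_Probes", "Segment_Mean"]


def write_segments(segments, handle):
    """Write segment dicts as tab-separated rows preceded by a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for seg in segments:
        writer.writerow(seg)


def read_segments(handle):
    """Read the rows back, converting the numeric columns to int/float."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["Start"] = int(row["Start"])
        row["Stop"] = int(row["Stop"])
        row["Num_Probes"] = int(row["Num_Probes"])
        row["Segment_Mean"] = float(row["Segment_Mean"])
        rows.append(row)
    return rows


# Hypothetical values, for illustration only.
example = [
    {"Sample": "sample1", "Chr": "1", "Start": 1000, "Stop": 5000,
     "Num_Probes": 8, "Segment_Mean": -0.52},
    {"Sample": "sample1", "Chr": "1", "Start": 5001, "Stop": 9000,
     "Num_Probes": 6, "Segment_Mean": 0.03},
]

buffer = io.StringIO()
write_segments(example, buffer)
buffer.seek(0)
parsed = read_segments(buffer)
```

For coverage-based NGS data, the Num_Probes column would hold the number of genomic windows (or target regions) in the segment, as noted above.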

ADD COMMENTlink written 5.2 years ago by Chris Miller21k

This format seems the most obvious and sensible. It's essentially BED format (chromosome, start and end in the first three columns), so it can be manipulated by a lot of tools or simple scripting, and it conveys all the needed info.

ADD REPLYlink written 5.2 years ago by brentp23k

I have also used this simple, reduced format.

If needed, you can add fields such as "Probe_Values" and "Probe_p.values", with the per-probe values concatenated into a single field.
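
One way to read this suggestion is sketched below in Python: the per-probe values and p-values are packed into single comma-separated fields appended to the segment row. The comma separator, the helper names and the numbers are assumptions made for illustration; the thread does not specify them.

```python
def pack(values, sep=","):
    """Concatenate per-probe numbers into one delimited field."""
    return sep.join(str(v) for v in values)


def unpack(field, sep=","):
    """Split a concatenated field back into a list of floats."""
    return [float(x) for x in field.split(sep)]


# Hypothetical segment with three probes; coordinates and values are made up.
row = {
    "Chr": "7",
    "Start": 55086000,
    "Stop": 55275000,
    "Num_Probes": 3,
    "Segment_Mean": 1.2,
    "Probe_Values": pack([1.1, 1.3, 1.2]),
    "Probe_p.values": pack([0.001, 0.0004, 0.002]),
}

# One tab-separated line, columns in the order of the dict keys above.
line = "\t".join(str(v) for v in row.values())

# Sanity check a consumer could run: one value per probe.
probe_values = unpack(row["Probe_Values"])
probe_pvalues = unpack(row["Probe_p.values"])
```

Keeping the separator inside the field different from the column separator (comma versus tab here) means the file still loads as a plain table in tools that ignore the extra columns.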

ADD REPLYlink written 5.2 years ago by Christophe Poulet0
5.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum126k wrote:

The 1000 Genomes Project stores CNVs in its VCF files, using symbolic <CNn> alleles in the ALT column:

 

$ curl -s "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz" | gunzip -c | grep CN2 | cut -f 1-5
##ALT=<ID=CN2,Description="Copy number allele: 2 copies">
##ALT=<ID=CN20,Description="Copy number allele: 20 copies">
##ALT=<ID=CN21,Description="Copy number allele: 21 copies">
##ALT=<ID=CN22,Description="Copy number allele: 22 copies">
##ALT=<ID=CN23,Description="Copy number allele: 23 copies">
##ALT=<ID=CN24,Description="Copy number allele: 24 copies">
##ALT=<ID=CN25,Description="Copy number allele: 25 copies">
##ALT=<ID=CN26,Description="Copy number allele: 26 copies">
##ALT=<ID=CN27,Description="Copy number allele: 27 copies">
##ALT=<ID=CN28,Description="Copy number allele: 28 copies">
##ALT=<ID=CN29,Description="Copy number allele: 29 copies">
1    668630    DUP_delly_DUP20532    G    <CN2>
1    713044    DUP_gs_CNV_1_713044_755966    C    <CN0>,<CN2>
1    773090    DUP_gs_CNV_1_773090_852664    T    <CN0>,<CN2>
1    963826    DUP_gs_CNV_1_963826_974172    C    <CN2>
1    1171539    DUP_gs_CNV_1_1171539_1179729    C    <CN2>
1    1249799    DUP_gs_CNV_1_1249799_1265722    G    <CN2>
1    1304952    DUP_gs_CNV_1_1304952_1323528    T    <CN2>
1    1393861    DUP_gs_CNV_1_1393861_1427383    T    <CN0>,<CN2>
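
To make the ALT column above easier to consume, here is a small Python sketch that extracts the integer from each symbolic <CNn> allele of a record. It only handles the first five columns shown in the output and assumes every ALT allele has the <CNn> form; the function names are hypothetical, and a real pipeline would more likely use a dedicated VCF library.

```python
import re

# Matches a symbolic copy-number allele such as <CN0> or <CN2>.
CN_ALLELE = re.compile(r"^<CN(\d+)>$")


def alt_copy_numbers(alt_field):
    """Return the copy number encoded by each <CNn> allele in an ALT field."""
    copy_numbers = []
    for allele in alt_field.split(","):
        match = CN_ALLELE.match(allele)
        if match is None:
            raise ValueError(f"not a <CNn> allele: {allele!r}")
        copy_numbers.append(int(match.group(1)))
    return copy_numbers


def parse_record(line):
    """Parse the first five whitespace-separated columns of a VCF data line."""
    chrom, pos, var_id, ref, alt = line.split()[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "copy_numbers": alt_copy_numbers(alt),
    }


record = parse_record("1    713044    DUP_gs_CNV_1_713044_755966    C    <CN0>,<CN2>")
```

Note that these columns give only the start position; the end of the event is not among the five columns printed above, so it would have to be read from the rest of the record.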

 

ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by Pierre Lindenbaum126k

CNV representation in VCF always feels awkward to me, like they're being shoehorned into a format really not designed for them (which they are). I also understand the desire for a single unified format for all genome variation, though. The same goes for SVs.

ADD REPLYlink written 5.2 years ago by Chris Miller21k
5.2 years ago by
deanna.church1.1k
Bethesda, MD
deanna.church1.1k wrote:

dbVar (http://www.ncbi.nlm.nih.gov/dbvar) and DGVa (http://www.ebi.ac.uk/dgva/) emit data as GVF as well:

http://www.sequenceontology.org/resources/gvf.html

##gff-version 3
##gvf-version 1.07
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
##genome-build NCBI GRCh37.p13
##file-date 2014-11-14
# assembly-name GRCh37.p13 
# assembly-accession GCF_000001405.25
# Study_accession: nstd51 Dbxref=URL:www.ncbi.nlm.nih.gov/dbvar/studies/nstd51
# Display_name: User_submitted_curated_variants
# Study_description: User submitted curated variants from OMIM, GeneReviews, or ClinVar
browser position chrX:1150927-1151051
track name=User_submitted_curated_variants variants description=User_submitted_curated_variants sv variants visibility=3
chrX    dbVar    complex_structural_alteration    1150927    1151051    .    .    .    ID=1;Name=nsv1067847;Alias=59488;Dbxref=URL:www.ncbi.nlm.nih.gov/dbvar/variants/nsv1067847;remapScore=1;Variant_seq=~,.;Reference_seq=~
chrX    dbVar    copy_number_variation    2967446    3009217    .    .    .    ID=2;Name=nsv1067908;Alias=131947;Dbxref=URL:www.ncbi.nlm.nih.gov/dbvar/variants/nsv1067908;remapScore=1;Variant_seq=~,.;Reference_seq=~
chrX    dbVar    copy_number_variation    2959963    2982405    .    .    .    ID=3;Name=nsv1067909;Alias=131948;Dbxref=URL:www.ncbi.nlm.nih.gov/dbvar/variants/nsv1067909;remapScore=1;Variant_seq=~,.;Reference_seq=~
ADD COMMENTlink written 5.2 years ago by deanna.church1.1k
5.2 years ago by
thondeboer40
thondeboer40 wrote:

Thanks for the help, but it is clear that there is not much of a standard out there, other than shoehorning the data into VCF or GVF with its own set of attributes. But I guess that is what we'll have to do...

BTW...It's not clear to me what the CNV is in the GVF format...it only seems to indicate that there IS a CNV for the two variants listed there, but not what the actual CNV value is...

I am looking for a file format that can represent the CNV values we have deduced from a set of sequence probes, based both on coverage and perhaps also using the SNP allele fractions...Here's an example of what data we have

There are 6-10 regions (equivalent of probes in aCGH I guess), a P value for each probe and a CN for the gene it represents

It looks more and more to me like I should use one of the aCGH data formats that are out there, since our data most closely resembles aCGH data...


Thanks


Thon

ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by thondeboer40

I only showed a subset of the GVF file. I recommend looking at this- it works really well with array data (there is a lot of that in dbVar) and can cleanly handle the breakpoint ambiguity associated with arrays. 

ADD REPLYlink written 5.2 years ago by deanna.church1.1k

Well, it may be a subset, but the few entries you showed contain no information about the actual copy number value.

I downloaded the GVF for the somatic mutations, and none of the entries in the file seem to have any information about the actual copy number of the region. Or am I missing something? Is the remapScore the CN? The readme file defines it as "remapScore: indicates score of remapped placements", which does not mean much to me but does not seem to be the CN for that region/gene.

ADD REPLYlink written 5.2 years ago by thondeboer40

I believe only records of type copy_number_loss or copy_number_gain carry a copy_number attribute, like the following:

ID=4;Name=essv4367667;Alias=3;parent=esv1791726;Dbxref=URL:www.ncbi.nlm.nih.gov/dbvar/variants/esv1791726;var_origin=de novo;Zygosity=heterozygous; Start_range=6848878,6854451;End_range=6903599,6912597;clinical_int=Likely pathogenic;copy_number=1;remapScore=1;validated=Pass;sample_name=3-1; phenotype=Cerebellar ataxia;phenotype_id=MeSH:D002524,MedGen:C0007758;Variant_seq=-,.;Reference_seq=~'

The remap score is related to how well the record has been moved from one build to another. (unfortunately, I'm still trying to work out what this means as the values go from 0.5 to 3.28)
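
For anyone wanting to pull the copy number out of such a record programmatically, here is a minimal Python sketch that splits the attribute column into key/value pairs. The attribute string is a shortened copy of the record above; the function names are hypothetical, and the sketch assumes attributes are separated by ";" with "=" between key and value.

```python
def parse_attributes(column):
    """Split a GVF/GFF3 attribute column into a dict of key -> value strings."""
    attributes = {}
    for pair in column.split(";"):
        pair = pair.strip()  # tolerate stray spaces after the semicolon
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = value
    return attributes


def get_copy_number(attributes):
    """Return copy_number as an int, or None when the record does not carry one."""
    value = attributes.get("copy_number")
    return int(value) if value is not None else None


# Shortened from the example record above.
attrs = parse_attributes(
    "ID=4;Name=essv4367667;Start_range=6848878,6854451;"
    "End_range=6903599,6912597;copy_number=1;remapScore=1"
)
copy_number = get_copy_number(attrs)
start_range = [int(x) for x in attrs["Start_range"].split(",")]

# A record without the attribute, like the copy_number_variation rows shown earlier.
no_cn = get_copy_number(parse_attributes("ID=2;Name=nsv1067908;remapScore=1"))
```

Returning None for records that lack the attribute makes it explicit which rows only mark a variant region and which ones report an actual copy number.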

ADD REPLYlink written 3.1 years ago by gabriel.aldam0

Looking into this some more, I guess it is only a database of regions of variation. Each study/sample will have a different CN associated with each region, of course...

So, what I am looking for is the actual study data I think...Looking some more

ADD REPLYlink written 5.2 years ago by thondeboer40