convert CSQ and ANN fields to a tab delimitet document with its headers
1
1
Entering edit mode
6.3 years ago
paumarc ▴ 20

Hello everybody,

I am trying to parse CSQ and AND fields from a VCF using PERL. The line describing the fields looks like

INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format:Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info">

and the line with the information looks like

CSQ=A|synonymous_variant|LOW|IL17RE|ENSG00000163701|Transcript|ENST00000295980|protein_coding|15/17||ENST00000295980.3:c.1344G>A|ENST00000295980.3:c.1344G>A(p.%3D)|1461|1344|448|P|ccG/ccA|rs455863|1||1||SNV|1|HGNC|18439|YES|||CCDS2589.1|ENSP00000295980|Q8NFR9||UPI000003E87E||||hmmpanther:PTHR15583&hmmpanther:PTHR15583:SF5&Pfam_domain:PF15037||A:0.3620|A:0.3511|A:0.4054|A:0.4625|A:0.0933|A:0.5447|A:0.3211|A:0.4151|A:0.5324|A:0.4368|A:0.461|A:0.4018|A:0.4609|A:0.08017|A:0.5853|A:0.538|A:0.4945||||||||||||

when i use a split command to build a @info_csq ad @values_csq files i get a different number of element in both arrays

Can anyone tell me how to correlate the field header and its value ? i just tried to do it with the Vcf.pm library, but it sems not possible

thanks

VCF CSQ ANN PARSING PERL • 4.6k views
ADD COMMENT
4
Entering edit mode
6.3 years ago

my tools vcf2table breaks the ANN field: http://lindenb.github.io/jvarkit/VcfToTable.html

 ANN
 +-----------------------+--------+----------+----------+-----------------+-------------+-----------------+-----------------+-------------------------+----------+
 | SO                    | Allele | Impact   | GeneName | GeneId          | FeatureType | FeatureId       | BioType         | HGVsc                   | Distance |
 +-----------------------+--------+----------+----------+-----------------+-------------+-----------------+-----------------+-------------------------+----------+
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2326     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2282     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2282     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2282     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | protein_coding  | c.-18651_-18647delCAAAA | 2281     |
 | upstream_gene_variant | T      | MODIFIER | G....    | EN............. | transcript  | EN............. | retained_intron | n.-2285_-2281delCAAAA   | 2281     |
 +-----------------------+--------+----------+----------+-----------------+-------------+-----------------+-----------------+-------------------------+----------+

Can anyone tell me how to correlate the field header and its value ? i

you need to split by ', ' to get a prediction for for each transcript and then split by '|' to get the fields: http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf

ADD COMMENT
0
Entering edit mode

that is perfect, thanks, anyway your script do not parse the csq field, where i have the problem between the different number of fields and headaers

ADD REPLY
1
Entering edit mode

your script do not parse the csq field

it does

VEP
 +--------------------------+------+----------------+------------+-----------------+--------+------------------+-----------------------------------------------+-------------+---------+-----------------+----------------------+
 | PolyPhen                 | EXON | SIFT           | ALLELE_NUM | Gene            | SYMBOL | Protein_position | Consequence                                   | Amino_acids | Codons  | Feature         | BIOTYPE              |
 +--------------------------+------+----------------+------------+-----------------+--------+------------------+-----------------------------------------------+-------------+---------+-----------------+----------------------+
 | probably_damaging(0.956) | 8/9  | deleterious(0) | 1          | ENSG00000102967 | DHODH  | 346/395          | missense_variant                              | R/W         | Cgg/Tgg | ENST00000219240 | protein_coding       |
 |                          | 3/4  |                | 1          | ENSG00000102967 | DHODH  |                  | non_coding_exon_variant&nc_transcript_variant |             |         | ENST00000571392 | retained_intron      |
 |                          |      |                | 1          | ENSG00000102967 | DHODH  |                  | downstream_gene_variant                       |             |         | ENST00000572003 | retained_intron      |
 |                          |      |                | 1          | ENSG00000102967 | DHODH  |                  | downstream_gene_variant                       |             |         | ENST00000573843 | retained_intron      |
 |                          |      |                | 1          | ENSG00000102967 | DHODH  |                  | downstream_gene_variant                       |             |         | ENST00000573922 | processed_transcript |
 |                          |      |                | 1          | ENSG00000102967 | DHODH  | -/193            | intron_variant                                |             |         | ENST00000574309 | protein_coding       |
 | probably_damaging(0.946) | 8/9  | deleterious(0) | 1          | ENSG00000102967 | DHODH  | 344/393          | missense_variant                              | R/W         | Cgg/Tgg | ENST00000572887 | protein_coding       |
 +--------------------------+------+----------------+------------+-----------------+--------+------------------+-----------------------------------------------+-------------+---------+-----------------+----------------------+
ADD REPLY
0
Entering edit mode

ups

i did not saw it, thanks

ADD REPLY
0
Entering edit mode

i finally have managed with the csq field, thanks

ADD REPLY
0
Entering edit mode

Hi, I am running into the same problem of parsing the CSQ field, I was wondering how you solved it? It'd be a huge help, thanks!

ADD REPLY
0
Entering edit mode

it can be done by two steeps

first, i parsed the info csq field to keep the "headers" in an array, let's call it @CSQheaders

the second steps is parsing the data field ($datafield), this field must be processed in a two nested loops (in Perl a foreach loop and a for loop, I don't know your favorite language).

the final structure will look like:

     foreach $spliẗ_by_comma(split(/,/,$datafield)) <-$datafield contain the data parsed from the csq field

            @split_by_bar =split(/\|/,$split_by_comma) #put the data into an array

           #here comes the triki part

                   for $i(0..$#CSQheaders) (for those not proraming in perl $# returns the numers of elements in a array

                                print "$CSQheaders[$i]: $split_by_bar[$i]

i hope i could explain clearly enough. If you have any question just ask

ADD REPLY

Login before adding your answer.

Traffic: 2543 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6