Parsing VCF Data
4.8 years ago
gcooper1245 ▴ 10

I have about 150 of these VCF files, and I forgot to clean up the sequence names in the reference before running all 150. For downstream analysis with snpEff, the chromosome ID column needs to contain only JTAI01000001 through JTAI01000053, not all that other junk. Does anyone know a way to strip out everything except the middle JTAI01000001 part in these GVCFs so I can proceed with my analysis?

##contig=<ID=ENA|JTAI01000001|JTAI01000001.1,length=360176>
##contig=<ID=ENA|JTAI01000002|JTAI01000002.1,length=959544>
##contig=<ID=ENA|JTAI01000003|JTAI01000003.1,length=208220>
##contig=<ID=ENA|JTAI01000004|JTAI01000004.1,length=470636>
##contig=<ID=ENA|JTAI01000005|JTAI01000005.1,length=225370>
##contig=<ID=ENA|JTAI01000006|JTAI01000006.1,length=364413>
##contig=<ID=ENA|JTAI01000007|JTAI01000007.1,length=1279890>
##contig=<ID=ENA|JTAI01000008|JTAI01000008.1,length=18993>
##contig=<ID=ENA|JTAI01000009|JTAI01000009.1,length=291696>
##contig=<ID=ENA|JTAI01000010|JTAI01000010.1,length=821>
##contig=<ID=ENA|JTAI01000011|JTAI01000011.1,length=128648>
##contig=<ID=ENA|JTAI01000012|JTAI01000012.1,length=66483>
##contig=<ID=ENA|JTAI01000013|JTAI01000013.1,length=592675>
##contig=<ID=ENA|JTAI01000014|JTAI01000014.1,length=1554>
##contig=<ID=ENA|JTAI01000015|JTAI01000015.1,length=3499>
##contig=<ID=ENA|JTAI01000016|JTAI01000016.1,length=5436>
##contig=<ID=ENA|JTAI01000017|JTAI01000017.1,length=1198>
##contig=<ID=ENA|JTAI01000018|JTAI01000018.1,length=6108>
##contig=<ID=ENA|JTAI01000019|JTAI01000019.1,length=9709>
##contig=<ID=ENA|JTAI01000020|JTAI01000020.1,length=523589>
##contig=<ID=ENA|JTAI01000021|JTAI01000021.1,length=97817>
##contig=<ID=ENA|JTAI01000022|JTAI01000022.1,length=268453>
##contig=<ID=ENA|JTAI01000023|JTAI01000023.1,length=215216>
##contig=<ID=ENA|JTAI01000024|JTAI01000024.1,length=79716>
##contig=<ID=ENA|JTAI01000025|JTAI01000025.1,length=121647>
##contig=<ID=ENA|JTAI01000026|JTAI01000026.1,length=31279>
##contig=<ID=ENA|JTAI01000027|JTAI01000027.1,length=3130>
##contig=<ID=ENA|JTAI01000028|JTAI01000028.1,length=340737>
##contig=<ID=ENA|JTAI01000029|JTAI01000029.1,length=5801>
##contig=<ID=ENA|JTAI01000030|JTAI01000030.1,length=4981>
##contig=<ID=ENA|JTAI01000031|JTAI01000031.1,length=318753>
##contig=<ID=ENA|JTAI01000032|JTAI01000032.1,length=45350>
##contig=<ID=ENA|JTAI01000033|JTAI01000033.1,length=114418>
##contig=<ID=ENA|JTAI01000034|JTAI01000034.1,length=1682>
##contig=<ID=ENA|JTAI01000035|JTAI01000035.1,length=28211>
##contig=<ID=ENA|JTAI01000036|JTAI01000036.1,length=117188>
##contig=<ID=ENA|JTAI01000037|JTAI01000037.1,length=188157>
##contig=<ID=ENA|JTAI01000038|JTAI01000038.1,length=3440>
##contig=<ID=ENA|JTAI01000039|JTAI01000039.1,length=373676>
##contig=<ID=ENA|JTAI01000040|JTAI01000040.1,length=996>
##contig=<ID=ENA|JTAI01000041|JTAI01000041.1,length=618>
##contig=<ID=ENA|JTAI01000042|JTAI01000042.1,length=211284>
##contig=<ID=ENA|JTAI01000043|JTAI01000043.1,length=87165>
##contig=<ID=ENA|JTAI01000044|JTAI01000044.1,length=873289>
##contig=<ID=ENA|JTAI01000045|JTAI01000045.1,length=795>
##contig=<ID=ENA|JTAI01000046|JTAI01000046.1,length=590>
##contig=<ID=ENA|JTAI01000047|JTAI01000047.1,length=705>
##contig=<ID=ENA|JTAI01000048|JTAI01000048.1,length=1262>
##contig=<ID=ENA|JTAI01000049|JTAI01000049.1,length=1307>
##contig=<ID=ENA|JTAI01000050|JTAI01000050.1,length=766>
##contig=<ID=ENA|JTAI01000051|JTAI01000051.1,length=795>
##contig=<ID=ENA|JTAI01000052|JTAI01000052.1,length=724>
##contig=<ID=ENA|JTAI01000053|JTAI01000053.1,length=619>
##reference=file:///scratch/gwc32007/crypto_genomes/30976_hominis_genome.fasta
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ERR1305010
ENA|JTAI01000001|JTAI01000001.1 1   .   A   <NON_REF>   .   .   END=2   GT:DP:GQ:MIN_DP:PL   0:7:0:7:0,0
ENA|JTAI01000001|JTAI01000001.1 3   .   A   <NON_REF>   .   .   END=3   GT:DP:GQ:MIN_DP:PL   0:7:99:7:0,298
ENA|JTAI01000001|JTAI01000001.1 4   .   C   <NON_REF>   .   .   END=8   GT:DP:GQ:MIN_DP:PL   0:7:0:7:0,0
ENA|JTAI01000001|JTAI01000001.1 9   .   A   <NON_REF>   .   .   END=9   GT:DP:GQ:MIN_DP:PL   0:7:99:7:0,300
ENA|JTAI01000001|JTAI01000001.1 10  .   C   <NON_REF>   .   .   END=11  GT:DP:GQ:MIN_DP:PL   0:7:0:7:0,0
ENA|JTAI01000001|JTAI01000001.1 12  .   C   <NON_REF>   .   .   END=12  GT:DP:GQ:MIN_DP:PL   0:7:99:7:0,284
ENA|JTAI01000001|JTAI01000001.1 13  .   T   <NON_REF>   .   .   END=14  GT:DP:GQ:MIN_DP:PL   0:7:0:7:0,0
ENA|JTAI01000001|JTAI01000001.1 15  .   A   <NON_REF>   .   .   END=18  GT:DP:GQ:MIN_DP:PL   0:7:99:7:0,276
4.8 years ago

bcftools annotate

Usage:   bcftools annotate [options] <in.vcf.gz>
(...)
       --rename-chrs <file>       rename sequences according to map file: from\tto
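
With roughly 150 files, it may help to script the call rather than type it out each time. Below is a minimal Python sketch, not a tested pipeline: the directory names, the *.vcf.gz pattern, and the helper names rename_command and rename_all are all hypothetical, and it assumes bcftools is on the PATH and that a map file already exists. The -O z and -o options select bgzip-compressed output and the output path.

```python
import subprocess
from pathlib import Path


def rename_command(vcf_in, vcf_out, map_file):
    """Build the bcftools call that renames contigs in one VCF."""
    return [
        "bcftools", "annotate",
        "--rename-chrs", str(map_file),  # map file: old name, new name per line
        "-O", "z",                       # write bgzip-compressed VCF
        "-o", str(vcf_out),
        str(vcf_in),
    ]


def rename_all(vcf_dir, out_dir, map_file, dry_run=True):
    """Apply the renaming to every *.vcf.gz file in vcf_dir.

    Assumption: inputs end in .vcf.gz and out_dir already exists.
    With dry_run=True the commands are only collected, not executed.
    """
    commands = []
    for vcf_in in sorted(Path(vcf_dir).glob("*.vcf.gz")):
        cmd = rename_command(vcf_in, Path(out_dir) / vcf_in.name, map_file)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling rename_all("gvcfs", "renamed", "rename_chrs.txt") first with dry_run=True lets you inspect the commands before anything is written; writing to a separate output directory keeps the original files untouched.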

Pierre,

In which version do you see the from\tto description? The manual (both of them, actually) says whitespace-separated.

Version: 1.9 (using htslib 1.9)

That's odd. The manual on htslib.org still says whitespace-separated, but when I run bcftools annotate on my local machine, I see from\tto. Could it be that the online manual is not being kept up to date?

4.8 years ago
Ram 43k

This has been addressed multiple times on the forum. The best way to do this is to create a whitespace-separated file with the old and new contig names, one pair per line, like so:

ENA|JTAI01000001|JTAI01000001.1 JTAI01000001
ENA|JTAI01000002|JTAI01000002.1 JTAI01000002
..
..
ENA|JTAI01000053|JTAI01000053.1 JTAI01000053

and use that file with bcftools annotate --rename-chrs. See the bcftools manual to understand the exact syntax.
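
With 53 contigs, the map file can be generated from the ##contig header lines rather than typed by hand. Here is a rough Python sketch under stated assumptions: every contig ID has the three-part ENA|accession|accession.version shape shown in the question, and the helper names build_rename_map and format_rename_map are made up for illustration. It separates the two columns with a tab, which is also whitespace, so it should fit either description of the map format discussed in the comments above.

```python
import re

# Matches the ID field of a VCF "##contig" header line.
CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")


def build_rename_map(header_lines):
    """Return (old_name, new_name) pairs for ENA-style contig IDs."""
    pairs = []
    for line in header_lines:
        match = CONTIG_ID.match(line)
        if not match:
            continue  # not a contig header line
        old = match.group(1)
        parts = old.split("|")
        # Assumption: names look like ENA|JTAI01000001|JTAI01000001.1,
        # so the middle field is the accession to keep.
        if len(parts) == 3:
            pairs.append((old, parts[1]))
    return pairs


def format_rename_map(pairs):
    """Render the pairs as map-file text: old name, tab, new name."""
    return "".join(f"{old}\t{new}\n" for old, new in pairs)


# Two header lines copied from the question, plus one non-contig line.
example_header = [
    "##contig=<ID=ENA|JTAI01000001|JTAI01000001.1,length=360176>",
    "##contig=<ID=ENA|JTAI01000053|JTAI01000053.1,length=619>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tERR1305010",
]
example_map = format_rename_map(build_rename_map(example_header))
```

Writing example_map (built from the full header of one of your files) to a text file gives something you can pass to bcftools annotate --rename-chrs. Since all 150 files were called against the same reference, one map file should serve for all of them.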
