How to find Chromosome length from Annotated VCF file?
20 months ago
rjsgur789 • 0
##contig=<ID=Cla97Chr01,length=36935898,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr02,length=37915939,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr03,length=31872261,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr04,length=27110815,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr05,length=35887987,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr06,length=29507460,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr07,length=31939013,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr08,length=28201227,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr09,length=37727573,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr10,length=35099344,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Chr11,length=30886124,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf001,length=233319,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf002,length=184572,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf003,length=114230,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf004,length=101662,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf005,length=94675,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf006,length=84208,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf007,length=81764,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf008,length=72605,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf009,length=71887,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf010,length=71582,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf011,length=65584,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf012,length=60969,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf013,length=58660,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf014,length=54089,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf015,length=52299,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf016,length=46704,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf017,length=44379,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf018,length=42155,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf019,length=33490,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf020,length=31069,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf021,length=30825,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf022,length=28329,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf023,length=27981,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf024,length=25595,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf025,length=25293,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf026,length=24210,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf027,length=23408,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf028,length=21950,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf029,length=20859,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf030,length=20840,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf031,length=19595,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf032,length=19375,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf033,length=18359,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf034,length=17995,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf035,length=17722,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf036,length=17502,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf037,length=16522,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf038,length=14416,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf039,length=13492,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf040,length=12677,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf041,length=12562,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf042,length=12490,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf043,length=12451,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf044,length=12340,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf045,length=12169,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf046,length=11870,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf047,length=11623,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf048,length=11207,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf049,length=10862,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf050,length=10717,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf051,length=10458,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf052,length=9933,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf053,length=9867,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf054,length=9408,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf055,length=9274,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf056,length=8131,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf057,length=7996,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf058,length=7778,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf059,length=7696,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf060,length=7606,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf061,length=7597,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf062,length=7489,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf063,length=7365,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf064,length=7286,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf065,length=7025,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf066,length=6976,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf067,length=6921,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf068,length=6573,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf069,length=6212,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf070,length=5790,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf071,length=5405,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf072,length=5173,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf073,length=5121,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf074,length=4784,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf075,length=4023,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf076,length=3821,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf077,length=3610,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf078,length=3275,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf079,length=2787,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf080,length=2584,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf081,length=2456,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf082,length=2421,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf083,length=2406,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf084,length=2401,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf085,length=2127,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf086,length=2108,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf087,length=1913,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf088,length=1838,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf089,length=1798,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf090,length=1621,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf091,length=1603,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf092,length=1527,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf093,length=1513,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf094,length=1512,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf095,length=1462,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf096,length=890,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf097,length=807,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf098,length=802,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf099,length=787,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf100,length=623,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf101,length=600,assembly=97103_genome_v2.fa>
##contig=<ID=Cla97Scf102,length=504,assembly=97103_genome_v2.fa>

I work on plants and use snpEff to annotate crops such as peppers and radishes. In plants, chromosome names are often arbitrary, and in many cases the chromosome lengths have to be obtained by removing the unwanted contig entries manually.

What I want is to obtain the chromosome lengths from the meta-information (the ##contig header lines) of the annotated VCF.

I also tried filtering the records, sorting column 1, running uniq, and then using grep -w to look up the lengths, but this fails frequently because contigs remain after the filter.

I have been trying vcftools and bcftools, but so far I can't find an approach that does what I want.

In the header above, the Cla97Chr entries are the chromosomes I want, and the Cla97Scf entries are contigs.

I want to extract only the entries that contain Chr, without removing the others one by one.

Please advise on how I can select only what I want.

Thank you in advance


Not sure if this is what you are looking for. The following one-liner extracts the contig names and their lengths from the header lines.

$ awk -F "=|," '{OFS="\t"}{print $3,$5}' file_w_header_example_above
Cla97Chr08      28201227
Cla97Chr09      37727573
Cla97Chr10      35099344
Cla97Chr11      30886124
Cla97Scf001     233319
Cla97Scf002     184572
Cla97Scf003     114230
Cla97Scf004     101662

If it is then I can move this comment to an answer.
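
The one-liner above relies on ID and length being the first two fields of each ##contig line. As a minimal sketch for headers where that order may differ, the lookup can be done by key instead of by position, stopping at the #CHROM line so that variant records are never scanned. The function name `contig_lengths` and the small demo file are illustrative assumptions, not part of any tool; the contig values are copied from the header in the question.

```shell
#!/bin/sh
# contig_lengths FILE: print "ID<TAB>length" for every ##contig header line.
# Hypothetical helper. Assumes a plain-text (uncompressed) VCF and that no
# quoted value inside a ##contig line contains a comma.
contig_lengths() {
    awk '
        /^#CHROM/ { exit }                  # header is over, stop reading
        /^##contig=</ {
            line = $0
            sub(/^##contig=</, "", line)    # drop the leading "##contig=<"
            sub(/>[ \t\r]*$/, "", line)     # drop the closing ">"
            n = split(line, kv, ",")
            id = ""; len = ""
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^ID=/)     id  = substr(kv[i], 4)
                if (kv[i] ~ /^length=/) len = substr(kv[i], 8)
            }
            if (id != "" && len != "") print id "\t" len
        }
    ' "$1"
}

# Small demo VCF; the second contig line deliberately puts length before ID.
demo_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=Cla97Chr01,length=36935898,assembly=97103_genome_v2.fa>\n'
    printf '##contig=<length=233319,ID=Cla97Scf001>\n'
    printf '##INFO=<ID=ANN,Number=.,Type=String,Description="annotation">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'Cla97Chr01\t100\t.\tA\tG\t50\tPASS\tANN=G|missense_variant\n'
} > "$demo_vcf"

contig_lengths "$demo_vcf"
```

For a compressed file, the same awk program could read from a decompressing pipe (for example `gzip -dc file.vcf.gz`) instead of a file argument.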


Thank you very much for the reply. However, I want to extract only the chromosomes, and I need this to work on VCFs other than the example file. Again, thanks for the reply.


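If you only need the entries that have Chr in their name, you can use the following. It should work with any plain-text VCF file whose ##contig lines list ID first and length second, as in your example; the /^##contig=/ test keeps other header lines and variant records from being matched.

$ awk -F "=|," 'BEGIN{OFS="\t"} /^##contig=/ && $3~/Chr/ {print $3,$5}' file_w_header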
Cla97Chr08      28201227
Cla97Chr09      37727573
Cla97Chr10      35099344
Cla97Chr11      30886124
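
Since chromosome naming differs between assemblies, a fixed Chr match may not carry over to other VCFs. One possible sketch is to make the name pattern a parameter and, as a fallback, to filter on a minimum length. The function name `keep_chromosomes` and the 1,000,000 bp cutoff are assumptions chosen for illustration only; the input is the two-column ID/length table produced by the commands above, with values taken from the header in the question.

```shell
#!/bin/sh
# keep_chromosomes PATTERN MIN_LENGTH
# Hypothetical helper: reads "ID<TAB>length" rows on stdin and keeps rows
# whose ID matches the regular expression PATTERN and whose length is at
# least MIN_LENGTH.
keep_chromosomes() {
    awk -F '\t' -v pat="$1" -v min="$2" '$1 ~ pat && ($2 + 0) >= (min + 0)'
}

# Demo table: two chromosomes and two scaffolds from the question's header.
demo_table() {
    printf 'Cla97Chr01\t36935898\n'
    printf 'Cla97Chr11\t30886124\n'
    printf 'Cla97Scf001\t233319\n'
    printf 'Cla97Scf102\t504\n'
}

# Select by name: IDs ending in "Chr" followed by digits, any length.
demo_table | keep_chromosomes 'Chr[0-9]+$' 0

# Select by size: any name, length of at least 1,000,000 bp (arbitrary cutoff).
demo_table | keep_chromosomes '.' 1000000
```

On this demo table both calls keep the two Cla97Chr rows and drop the two scaffolds. A length cutoff is only a heuristic and would need to be chosen per assembly.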

Your code is not a perfect answer, but I think it will be of great help to the scripts I'm making. Thanks again for your advice.
