Sort VCF File by Position?
6.2 years ago
Niell ▴ 20

Previously, I split out a vcf file by chromosome, and for my project, I have combined the X and XY vcf files into a single one. After changing the "XY" chromosome designation to "X" via:

awk 'BEGIN {FS = OFS = "\t"} $1 == "XY" {$1 = "X"} {print}' Genome_newX.vcf > Genome_newX2.vcf

I'm running into the issue of sorting this new "Genome_newX2.vcf" by position. The idea is that I'll subsequently run the vcf through GenotypeHarmonizer.

Are there any suggestions on how to do this easily? I'm brand new to this style of work, and I'd love some direction on where to read up on it as well. Thank you!

6.2 years ago
ATpoint 81k

Edit: 02/23

Just use bcftools sort https://samtools.github.io/bcftools/bcftools.html#sort
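
For anyone landing here, a minimal sketch of the bcftools route. The file names and toy records are made up for illustration, and the snippet assumes bcftools is on your PATH (it skips the sort otherwise). -O and -o are the usual output-type and output-file options.

```shell
# Toy input (hypothetical). The ##contig line declares the chromosome in the header.
workdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=X>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'X\t300\t.\tA\tG\t.\t.\t.\n'
  printf 'X\t20\t.\tC\tT\t.\t.\t.\n'
} > "$workdir/in.vcf"

if command -v bcftools >/dev/null 2>&1; then
  # -O v writes uncompressed VCF; -O z would write bgzip-compressed VCF instead.
  bcftools sort -O v -o "$workdir/out_sorted.vcf" "$workdir/in.vcf"
else
  echo "bcftools not found on PATH; skipping" >&2
fi
```

With -O z the output is bgzip-compressed, which is the form you would normally index afterwards.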


Original answer with awk-fu:

cat in.vcf | awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' > out_sorted.vcf

It takes a VCF and prints the sorted file including the header.
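
A small worked example of the awk approach, using a toy single-chromosome VCF. The file names and records are invented for illustration; only the awk command itself comes from the answer.

```shell
# Toy input (hypothetical). VCF columns must be separated by tabs.
workdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'X\t300\t.\tA\tG\t.\t.\t.\n'
  printf 'X\t20\t.\tC\tT\t.\t.\t.\n'
  printf 'X\t1000\t.\tG\tA\t.\t.\t.\n'
} > "$workdir/in.vcf"

# Header lines (starting with #) are printed as-is; records are piped to sort,
# ordered by chromosome (column 1) and then numerically by position (column 2).
awk '$1 ~ /^#/ {print $0; next} {print $0 | "sort -k1,1 -k2,2n"}' \
  "$workdir/in.vcf" > "$workdir/out_sorted.vcf"

cat "$workdir/out_sorted.vcf"
```

The records should come out at positions 20, 300, 1000, below the two untouched header lines. This relies on all header lines preceding the records, as they do in a valid VCF.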

Excellent, this solved the problem. I really appreciate it!

sort -k1,1 -k2,2n

This works well in your case, as you seem to have just one chromosome. For sorting a VCF file in general, I prefer this:

sort -k1,1V -k2,2n my.vcf

This makes sure that your chromosomes are sorted correctly. Without the 'V', "2" comes after "19", for example.

fin swimmer
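
To see the difference between the two orders, here is a quick sketch on a handful of chromosome names. The names are toy data, and the V key modifier is assumed to be available (it is a GNU sort extension, so other sort implementations may lack it).

```shell
# Toy chromosome names, one per line (hypothetical data).
chroms=$(printf '%s\n' 19 2 X 10 1)

# Plain lexicographic order compares character by character.
lexical=$(printf '%s\n' "$chroms" | LC_ALL=C sort -k1,1 | paste -sd' ' -)

# Version ("natural") order compares runs of digits as numbers.
natural=$(printf '%s\n' "$chroms" | LC_ALL=C sort -k1,1V | paste -sd' ' -)

echo "lexical: $lexical"
echo "natural: $natural"
```

This prints "lexical: 1 10 19 2 X" and "natural: 1 2 10 19 X".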

I do not recommend using natural sorting on genomic data. Most other tools, e.g. samtools (for sorting BAM files), do not support it by default. If you ever run operations such as intersections with bedtools on two or more files that must be sorted, differing sort orders could cause conflicts, e.g. with bedtools intersect and the -sorted option.
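
One way to catch a mismatch before it bites is to test each file against the order you expect. This is only a sketch: the two toy files and the check_lexical helper are hypothetical, and it checks plain lexicographic chromosome order followed by numeric position.

```shell
workdir=$(mktemp -d)
# Same three records, written in natural and in lexicographic chromosome order.
printf '1\t100\n2\t50\n10\t75\n' > "$workdir/natural.txt"
printf '1\t100\n10\t75\n2\t50\n' > "$workdir/lexical.txt"

# sort -c only checks: it exits 0 if the input is already sorted under the given keys.
check_lexical() {
  LC_ALL=C sort -c -k1,1 -k2,2n "$1" 2>/dev/null && echo sorted || echo unsorted
}

natural_status=$(check_lexical "$workdir/natural.txt")
lexical_status=$(check_lexical "$workdir/lexical.txt")

echo "natural.txt: $natural_status"
echo "lexical.txt: $lexical_status"
```

The naturally ordered file is reported as unsorted here, because "10" sorts before "2" character by character.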

Too bad vcf-sort is garbage and the -c flag doesn't work, even with the newest version.

Hello ATpoint,

Funny, this is exactly the same reason why I use natural sorting. :) The (human) data I've worked with was always sorted this way, and I ran into problems whenever some part of the analysis pipeline wasn't.

fin swimmer

4.2 years ago
beausoleilmo ▴ 580

Would there be an equivalent for a BCF, e.g. bcftools view | [...]? Or why not use bcftools sort -Oz output.bcf -o output_sort.vcf.gz?

bcftools sort is absolutely the right way, and it is what I would use today :)

I would also recommend SortVcf.

OK, just to double-check here because I may be a chop, but doesn't SortVcf from GATK also just use the header formatting?

Thank you! I have struck out my comment about not needing the seq-dict dependency (not sure what I was doing to make me think that). Appreciate it!

Good Solution!
