Question: How to sort a VCF file lexicographically by chromosome number?
2
gravatar for Parth Patel
4.5 years ago by
Parth Patel40
United States
Parth Patel40 wrote:

 Hello, 

     I have a large 155mb vcf file that I am trying to sort lexicographically by chromosome number (in the order 1,10, 11, 12 .. etc ) and then by position. I am trying to use vcf-sort from the vcf-tools perl package. 

 

I have been running the command :

perl vcf-sort myfile.vcf > output.vcf 

however, the output file loses all chromosome, position, and ref information for some reason and sorts variants by ID. According to the help page, this is suppose to sort the variants in lexicographic order. 

 

I tried the command:

perl -c vcf-sort myfile.vcf > output.vcf 

and the output file is sorted in chromosomally chronological order (1,2,3,4..) and all of the necessary information is retained. 

Any idea on how to obtain a lexicographical sort by chromosome number (and then position)?

- Parth

 

 

ADD COMMENTlink modified 22 months ago by ATpoint21k • written 4.5 years ago by Parth Patel40

Could you please post the command again...I am having some issues with the following command. Thanks.

grep -v '^#' in.vcf | LC_ALL=C sort -t '\t' -k1,1 -k2,2n >> out.vcf

error as follows:

sort: multi-character tab `\\t'
ADD REPLYlink modified 7 months ago by RamRS23k • written 4.3 years ago by lavibharath0

Use $'\t'

grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf | LC_ALL=C sort -t $'\t' -k1,1 -k2,2n >> out.vcf
ADD REPLYlink modified 7 months ago by RamRS23k • written 2.3 years ago by Shaun Jackman420

Oops... I got it. Thanks.

ADD REPLYlink modified 7 months ago by RamRS23k • written 4.3 years ago by lavibharath0

But the random chromosomes are really giving some problems...

ADD REPLYlink modified 7 months ago by RamRS23k • written 4.3 years ago by lavibharath0
8
gravatar for Pierre Lindenbaum
4.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:
grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf | LC_ALL=C sort -t $'\t' -k1,1 -k2,2n >> out.vcf
ADD COMMENTlink modified 7 months ago by RamRS23k • written 4.5 years ago by Pierre Lindenbaum122k
1

Running on Debian 8 is slight different

grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf |  sort -k1,1 -k2,2n >> out.vcf

Great solution Pierre

ADD REPLYlink modified 7 months ago by RamRS23k • written 18 months ago by dsmarcoantonio40

Hi Pierre,

Thanks for the quick response. I tried your grep method and it yielded the same results as vcf-sort. The variants are sorted lexicographically by id, but the chromosome, position, and ref information is gone:

##fileformat=VCFv4.1
#CHROM POS   ID                REF   ALT   QUAL  FILTER INFO
             rs10110847_dbSNP        C    100   PASS
             rs10130831_dbSNP        G    100   PASS
             rs10132024_dbSNP        T    100   PASS
             rs10133030_dbSNP        A    100   PASS
             rs10135051_dbSNP        T    100   PASS
             rs10136467_dbSNP        C    100   PASS
             rs10147247_dbSNP        T    100   PASS
             rs10149947_dbSNP        T    100   PASS
             rs10152678_dbSNP        A    100   PASS
             rs10188587_dbSNP        T    100   PASS
             rs10277675_dbSNP        A    100   PASS

Is vcf-sort supposed to be doing this? I am using grep off of cygwin, so I am beginning to wonder if there is something wrong with my grep installation..

Parth

ADD REPLYlink modified 7 months ago by RamRS23k • written 4.5 years ago by Parth Patel40
1

If

awk '/^\t/ {print}' in.vcf

produces any ouput then your VCF is corrupted.

ADD REPLYlink modified 7 months ago by RamRS23k • written 4.5 years ago by Pierre Lindenbaum122k

That was really helpful -- a formatting issue was in fact giving me a corrupt VCF file. Thanks!

ADD REPLYlink written 4.5 years ago by Parth Patel40

Hi Pierre,

Thanks. Sorry, I am bit lost. Where to give tab (control+v+tab) as I still get the following error if cut and paste the aforesaid command. Thanks.

sort: multi-character tab `\\t'
ADD REPLYlink modified 7 months ago by RamRS23k • written 4.3 years ago by lavibharath0

You don't really to write the word '\t' but under linux: single-quote CTRL-V TAB single-quote

ADD REPLYlink modified 7 months ago by RamRS23k • written 4.3 years ago by Pierre Lindenbaum122k

Pierre, in bash, you can use $'\t' which is the same thing as "Ctrl-V tab" but can be copy/pasted from stackoverflow.

ADD REPLYlink written 2.9 years ago by Maximilian Haeussler1.3k
2
gravatar for ATpoint
22 months ago by
ATpoint21k
Germany
ATpoint21k wrote:

A solution without any grep:

cat in.vcf | awk '$1 ~ /^#/ {print $0;next} {print $0 | "LC_ALL=C sort -k1,1 -k2,2n"}' > sorted.vcf
ADD COMMENTlink modified 22 months ago • written 22 months ago by ATpoint21k

Is there any reason for using cat first and then awk instead of using directly awk and send the file as input?

ADD REPLYlink written 17 months ago by ibelcarri10

no, but people often do this when they're exploring the data, inserting some tools in the pipeline. Imagine it could be gunzip -c in.vcf.gz instead of cat in.vcf

ADD REPLYlink written 17 months ago by Pierre Lindenbaum122k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1498 users visited in the last hour