vcftools: the length of each row is not the same
8.6 years ago
zwang10 ▴ 30

Hello! I have a vcf.gz file and I want to convert it into a 012 matrix. I use the following command:

vcftools --gzvcf chr1.vcf.gz --out chr1 --012

It outputs chr1.012, chr1.012.pos, and chr1.012.indv. But I found that the rows of chr1.012 do not all have the same length. Also, the row length of chr12.012 is not the same as that of chr1.012.


Just to cover all bases, what are the commands you used to find

  1. the length of each row in a file
  2. the number of rows in a file

This is just so we are sure there was no error in the counting logic.


Hello! I use bash scripting. To print the length of each row, I use

cat chr1.012 | awk '{print NF}'

To print the number of rows of chr1.012.pos, I use

wc -l chr1.012.pos
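
Both checks can be condensed into a summary instead of one number per row. The following is only a sketch: it assumes the .012 file is tab-separated and uses the file names from the question, and the helper names `field_count_summary` and `line_count` are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; file names follow the question (chr1.012, chr1.012.pos).

# Print each distinct field count in a tab-separated file,
# followed by the number of rows that have that count.
field_count_summary() {
    awk -F'\t' '{ count[NF]++ } END { for (n in count) print n, count[n] }' "$1" | sort -n
}

# Print the number of lines in a file, without the file name
# that `wc -l file` would append.
line_count() {
    wc -l < "$1" | tr -d ' '
}

# Usage, assuming the vcftools output is in the current directory:
#   field_count_summary chr1.012
#   line_count chr1.012.pos
```

If every row has the same length, `field_count_summary` prints a single line.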

Did you account for anomalies in separation? Maybe a case of multiple separators at places where it's not supposed to happen? Given the 012 file has an empty value indicator (-1), maybe squeeze the separator using a tr -s before the awk?
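
A rough sketch of that idea, assuming tab is the separator and using a made-up helper name `squeezed_field_counts`. Note that awk's default field splitting already treats runs of blanks and tabs as one separator, so squeezing only changes the result when an explicit `-F'\t'` is used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distinct per-row field counts of a
# tab-separated file after squeezing runs of consecutive tabs into one.
squeezed_field_counts() {
    tr -s '\t' < "$1" | awk -F'\t' '{ print NF }' | sort -n | uniq
}

# Usage, assuming the file name from the question:
#   squeezed_field_counts chr1.012
```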


I found that the number of lines in chr1.012.pos is always much larger than the row length of chr1.012, so the case you mentioned would not explain it.


Maybe they use different separators?

EDIT: Scratch that - doesn't look like it; they're all tab separated. It has something to do with the actual variants then.


The real name of my vcf.gz file is GAZ00001016581_1.ALSPAC.beagle.anno.csq.shapeit.20131101.vcf.gz (this is for chromosome 1). There is also a file called _EGAZ00001016604_ALSPAC.beagle.anno.csq.shapeit.20131101.sites.vcf.gz. I do not know whether that file is useful.


Do you know of alternative tools to convert a vcf.gz file into a 012 matrix?


Not really. Sorry, I cannot help you with this now - I would have asked for a bit of the file to examine, but I am busy with my day job this week.


Sure, but my files are very large. The smallest per-chromosome file is chr22 (about 4.1G).


Hello zwang10!

It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/36256693

This is typically not recommended as it runs the risk of annoying people in both communities.


Thanks for your suggestion. I have deleted the cross-posted question on Stack Overflow.

7.9 years ago
c.v.oflynn ▴ 100

Hi zwang10,

Did you solve this?

The first column of chr1.012 is a numeric sample ID, not a genotype. Remove it and you have the same lengths.
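
To check this, something like the following could be used. It is only a sketch, assuming tab-separated files and one line per site in the .pos file; the helper name `check_012_against_pos` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) only if every row of the
# .012 file, after dropping its first column, has exactly as many fields
# as the .pos file has lines.
check_012_against_pos() {
    local matrix=$1 positions=$2 n_sites
    # Number of sites = number of lines in the .pos file.
    n_sites=$(wc -l < "$positions" | tr -d ' ')
    # cut -f2- removes the leading sample ID column.
    cut -f2- "$matrix" |
        awk -F'\t' -v n="$n_sites" 'NF != n { bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Usage, assuming the file names from the question:
#   check_012_against_pos chr1.012 chr1.012.pos && echo "lengths match"
```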

Ciaran
