vcf to SNPmatrix through vcftools --012 produces extra column
18 months ago
liyong ▴ 80

Hello all,

I am using vcftools to produce a SNP matrix from a VCF file for eQTL analysis. The VCF file contains 73,592 SNPs (grep -c -v "#" vcf.file).

After running vcftools --vcf vcf.file --012 --out matrix, I got three separate files: matrix.012.indv, matrix.012.pos, and matrix.012. When I check the number of columns in matrix.012 with head -n 1 matrix.012 | awk '{print NF}', it gives me 73,593, which is one more than the actual number of SNPs. The weird thing is that the row count of matrix.012.pos (wc -l matrix.012.pos) is 73,592. Any suggestions on how I could fix this?

Thanks a lot.

SNPmatrix vcf vcftools • 1.4k views
18 months ago
Ram 43k

Ensure you're only excluding header lines, not variant records that contain a #: grep -vc "^#" vcf.file. Also, try specifying the separator explicitly for awk in your NF calculation.

Did you look at the first column in matrix.012 and make sure it's a sample GT column? I think it may be a metadata or some such column.
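
To illustrate why the anchored pattern matters, here is a minimal sketch on a made-up toy file (the file name toy.vcf and its contents are assumptions for illustration, not the poster's data). One record carries a "#" inside its INFO field, so the unanchored pattern undercounts records:

```shell
# Toy VCF-like file: 2 header lines and 3 records (hypothetical data).
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.2\n' > "$tmp/toy.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$tmp/toy.vcf"
printf '1\t100\t.\tA\tG\t.\tPASS\tNOTE=a#b\n' >> "$tmp/toy.vcf"
printf '1\t200\t.\tC\tT\t.\tPASS\t.\n' >> "$tmp/toy.vcf"
printf '1\t300\t.\tG\tA\t.\tPASS\t.\n' >> "$tmp/toy.vcf"

# Unanchored: drops every line containing '#' anywhere, including the first record.
loose=$(grep -c -v "#" "$tmp/toy.vcf")
# Anchored: drops only lines that start with '#', i.e. the header lines.
strict=$(grep -vc "^#" "$tmp/toy.vcf")

echo "unanchored count: $loose"
echo "anchored count:   $strict"
```

If both counts agree on the real file, as they did in this thread, the header filter is not the source of the discrepancy.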


Thanks, Ram. I re-checked the vcf.file with grep -c -v "^#" and it gives me the same number (73,592). Yes, the first column of matrix.012 is a sample GT column. I also ran head -n 5 matrix.012 | awk -F "\t" '{print NF}', and it gives me the same number (73,593) for all 5 lines. Confusing...


Try

head matrix.012 | cut -f1-10 | column -ts $'\t' | less -S
head matrix.012 | cut -f 73585-73593 | column -ts $'\t' | less -S

Maybe the last column is blank. Also try

head -n 1 matrix.012 | tr "\t" "\n" | head -n 100 | cat -te | less -SN
head -n 1 matrix.012 | tr "\t" "\n" | tail -n 100 | cat -te | less -SN

Thanks for the follow-up, Ram. I'm not sure I fully understand the commands. The result from head matrix.012 | cut -f 1-10 | column -ts $'\t' | less -S is: [screenshot not preserved]

The result from head matrix.012 | cut -f 73585-73593 | column -ts $'\t' | less -S is: [screenshot not preserved]


Hello Ram,

I think the first column of matrix.012 is just the sample index (0 to 48). The real GT columns start from the 2nd column.

Sorry for misunderstanding your comment above.

Thanks a lot!
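
For anyone landing here with the same mismatch, a minimal sketch of dropping the leading index column and re-checking the counts. The toy files below are made-up stand-ins for the vcftools output (2 samples, 3 sites), and the output name toy.genotypes is hypothetical:

```shell
# Toy stand-ins for matrix.012 and matrix.012.pos (hypothetical data).
tmp=$(mktemp -d)
printf '0\t0\t1\t2\n1\t2\t0\t-1\n' > "$tmp/toy.012"      # column 1 = sample index
printf '1\t100\n1\t200\n1\t300\n' > "$tmp/toy.012.pos"   # one line per site

# Keep everything from the second tab-separated field onward.
cut -f2- "$tmp/toy.012" > "$tmp/toy.genotypes"

# Genotype columns should now equal the number of sites.
ncol=$(head -n 1 "$tmp/toy.genotypes" | awk -F '\t' '{print NF}')
nsites=$(awk 'END {print NR}' "$tmp/toy.012.pos")

echo "genotype columns: $ncol, sites: $nsites"
```

On the real files, the same cut -f2- step applied to matrix.012 should leave 73,592 columns, matching matrix.012.pos.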


No problem, the metadata nature was difficult to spot as they're all numbers. My commands did the following:

  1. The first set picked the first and last few columns of the first 10 rows and pretty-printed them so columns would line up regardless of "cell" content length
  2. The second set picked the first and last 100 columns in the first row and transposed the vector to vertical (tabs -> new lines) and then printed all characters including invisibles. This would show us if there were a trailing empty column owing to lines ending in a delimiter character.
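
As a rough illustration of the trailing-delimiter case described in the second point, here is a sketch on a made-up single line that ends in a tab (this is a hypothetical example; it was not what happened in this thread):

```shell
# One line with three values followed by a trailing tab (hypothetical data).
tmp=$(mktemp -d)
printf '0\t1\t2\t\n' > "$tmp/toy.line"

# awk counts the empty string after the last tab as a field.
nf=$(awk -F '\t' '{print NF}' "$tmp/toy.line")

# Count fields that are empty strings.
empty=$(awk -F '\t' '{n = 0; for (i = 1; i <= NF; i++) if ($i == "") n++; print n}' "$tmp/toy.line")

echo "fields: $nf, empty fields: $empty"

# One field per line with line ends made visible; a line holding only "$" is an empty field.
tr '\t' '\n' < "$tmp/toy.line" | cat -te
```

A lone "$" at the end of the transposed output is the sign of an empty trailing column.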

Thank you, Ram, for the detailed explanation! It's great to know how these commands work. I am a rookie in bioinformatics. So much to learn...

Thank you again for all your help Ram. Really appreciate it!


You're very welcome. These data debugging strategies come from experience: run into parsing problems every other day and you'll soon know exactly what's causing that odd thing you see.

I've moved my comment to an answer - please accept it to mark the post as solved.


Haha, thanks for the tips. For sure, I've already accepted it.
