Hello all,
I am using vcftools to produce a SNP matrix from a VCF file for eQTL analysis. The VCF file contains 73,592 SNPs (grep -c -v "#" vcf.file).
However, after running (vcftools --vcf vcf.file --012 --out matrix), I got three separate files: matrix.012.indv, matrix.012.pos, and matrix.012. When I check the number of columns in matrix.012 with (head -n 1 matrix.012 | awk '{print NF}'), it gives me 73,593, which is 1 more than the actual number of SNPs. The weird thing is that the number of rows in matrix.012.pos (wc -l matrix.012.pos) is 73,592. Any suggestions about how I could fix this?
Thanks a lot.
Thanks Ram. I re-checked vcf.file with (grep -c -v "^#") and it gives me the same number (73,592). Yes, the first line of matrix.012 is a sample GT column. I also ran (head -n 5 matrix.012 | awk -F "\t" '{print NF}'), which gives the same number (73,593) for all 5 lines. Confusing...
Try
Maybe the last column is blank. Also try
Thanks for the follow-up, Ram. I'm not sure I fully understand the commands. The result from
head matrix.012 | cut -f 1-10 | column -ts $'\t' | less -S
is
The result from
head matrix.012 | cut -f 73585-73593 | column -ts $'\t' | less -S
is
Hello Ram,
I think the first column of matrix.012 is just the sample index (0 to 48). The real GT columns start from the 2nd column.
Sorry for misunderstanding your comment above.
Thanks a lot!
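To make the off-by-one concrete, here is a minimal sketch on a made-up toy matrix (3 samples, 4 SNPs). The temp directory, the toy values and the output name genotypes.tsv are assumptions for illustration, not files from this thread. It shows that the column count of matrix.012 is the SNP count plus one leading sample-index column, and that cut -f 2- leaves only the genotype columns.

```shell
# Toy stand-in for vcftools --012 output: 3 samples x 4 SNPs (made-up values).
# Column 1 is the sample index; -1 marks a missing genotype.
tmp=$(mktemp -d)
printf '0\t0\t1\t2\t0\n1\t1\t1\t0\t2\n2\t0\t-1\t1\t1\n' > "$tmp/matrix.012"
printf 'chr1\t100\nchr1\t200\nchr1\t300\nchr1\t400\n' > "$tmp/matrix.012.pos"

# Number of tab-separated columns in the first row of the matrix.
n_cols=$(head -n 1 "$tmp/matrix.012" | awk -F '\t' '{print NF}')
# Number of SNPs, one per line in the .pos file.
n_snps=$(wc -l < "$tmp/matrix.012.pos" | tr -d ' ')
echo "columns: $n_cols, SNPs: $n_snps"

# Drop the leading sample-index column so only genotype columns remain.
# genotypes.tsv is a hypothetical output name.
cut -f 2- "$tmp/matrix.012" > "$tmp/genotypes.tsv"
n_gt=$(head -n 1 "$tmp/genotypes.tsv" | awk -F '\t' '{print NF}')
echo "genotype columns after cut: $n_gt"
```

On the real data the same comparison would be 73,593 columns against 73,592 SNPs, so nothing needs fixing in the VCF itself. Whether the index column should be removed depends on what the downstream eQTL tool expects.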
No problem, the metadata column was difficult to spot because its values are all numbers. My commands did the following:
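A minimal sketch of what each stage of the pipelines quoted above does, run on a made-up toy matrix. The file contents, the temp path and the narrower field ranges are assumptions for illustration. The interactive less -S step is only described in a comment, and the tab is passed to column through a variable instead of $'\t' so the sketch also runs in plain sh.

```shell
# Toy stand-in for matrix.012: 3 samples x 4 SNPs, tab-separated (made-up values).
tmp=$(mktemp -d)
printf '0\t0\t1\t2\t0\n1\t1\t1\t0\t2\n2\t0\t-1\t1\t1\n' > "$tmp/matrix.012"

# head: keep only the first 10 lines (here, all 3 rows).
# cut -f 1-2: keep tab-separated fields 1 and 2 of each row.
first_cols=$(head "$tmp/matrix.012" | cut -f 1-2)
echo "$first_cols"

# cut -f 4-5: look at the last two fields to see whether the final column holds data.
last_cols=$(head "$tmp/matrix.012" | cut -f 4-5)
echo "$last_cols"

# Count rows whose last tab-separated field is empty (a trailing tab would cause this).
blank_last=$(awk -F '\t' '$NF == "" {n++} END {print n+0}' "$tmp/matrix.012")
echo "rows with a blank last column: $blank_last"

# column -t -s <tab> aligns the fields into a table; piping that into less -S
# (interactive) would chop long lines instead of wrapping them.
# The column step is skipped if the tool is not installed.
tab=$(printf '\t')
if command -v column >/dev/null 2>&1; then
  head "$tmp/matrix.012" | cut -f 1-5 | column -t -s "$tab"
fi
```

Looking at the first few fields exposes a leading non-genotype column, and looking at the last few fields (or counting blank last fields) rules out a trailing empty column as the source of the extra count.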
Thank you, Ram, for the detailed explanation! It's great to know how these commands work. I am a rookie in bioinformatics. So much to learn...
Thank you again for all your help Ram. Really appreciate it!
You're very welcome. These data debugging strategies are from experience - run into parsing problems every other day and you will know exactly what's causing that odd thing you see.
I've moved my comment to an answer - please accept it to mark the post as solved.
Haha, thanks for the tips. For sure, I've already accepted it.