Entering edit mode
5.2 years ago
oars ▴ 190
I have a .vcf file where I already have the dbSNP identifiers listed in the ID column. I am simply trying to count the number of variants in the dataset with a dbSNP (rs...) present (nulls are denoted as ".").
I've tried the below linux command but it seems incorrect as I've manually counted a different number:
grep -v rs SRR1611183.gatk.vcf | wc -l 1812
I'm looking for something that really targets the ID column, either counting dbSNP's or the nulls.
thanks for the reply. This returns 0.
I've also tried to count the nulls with the following linux command but also returned 0:
Please post the first lines of your vcf file, including the header and some variants.
The actual header starts on line 52...
...Then the actual data:
This continues to line 1813.
I don't know whether this is a problem with the forum software. But when I copy and paste your example in a new file the columns aren't delimited by tabs but by different number of spaces.
Let's go through the command above step-by-step. What's the output of
cut -f3 SRR1611183.gatk.vcf?
A bunch of header information, then a series of "."
Ok, so I guess you have no rs Id's in your vcf file. The command you use (
grep -v ".") tries to get all lines which does not contain a ".". As all your line have a "." it could not find any which meets this criteria.
Why do you think you must have rs IDs in your vcf file? Which exact command did you use to annotate your vcf?
Thanks - I actually do have rs numbers in the ID field so I'm not sure whats going on. For example, I pasted the data into excel and found out of 1761 variants (POS) and 1467 (83%) had dbSNPs (rs...) identified. I just can't figure out a simple way to cut and count the rs#'s in the file as is. I guess I could remove all the information before line 52 (which is where actual header begins) and try again?
Is it possible that you upload the vcf file somewhere? A shortend version with some variants having a rs id and some not should be enough. And please let all the header Informations in.
I came up with a solution:
Congrats if you find a solution for you.
But I don't understand why my solution did'nt work for you if this awk way works.
Ok, the modified version with
grep -v "."could not work correct, as "." have the special meaning "any character". So if you want to grep the dot you need to escape it like this
grep -v "\."But as your not interested in all rows of the third column containing no dot (you would grep the header of the column as well) this is not the right solution. You want to check the first characters if the line, so you need
grep -e "^rs".
Please write it as an answer and close it so that others can easily find it and get help from. Thanks for providing better solution.