I wondered if anyone could help me please? (apologies I can't give screen shots - as I work on a secure server).
I want to extract all the sample IDs that relate to a specific variant. I have a multisample VCF file for each chromosome.
I used to be able to extract this information using the code below and get all the IDs seperated by columns with the relevant VCF info (qual, filter, info, format, format_ouput). My variant text file (tab delimited) contains: Chr Pos Ref Alt
bcftools view -R "variants.txt" "mymultisample.vcf" > "newoutput.vcf"
However, all I get now is a few columns with lots of mixed mish mash of information that doesn't make sense. I also tried bcftools view -t, bcftools view -r, bcftools view -T, different formatting of the variant.txt file but no luck.
Thank you so so much for your help in advance.
It would help if you showed a concrete example. Try finding a position in the
variants.txt
file for which you manually found a match in the VCF but the operation does now show expected output. This process will help you figure out why you're not seeing what you think you should be seeing.Thank you Ram, just saw your response. I will try that. When you say "manually found", is that inspecting the file with the grep command or viewing it for example in excel?
Yes, I mean finding a match manually with a piece of software that can do plain text searching.
grep
would be the way to go. Excel brings its own complications with it.Ok great, thank you Ram. I will try that. Here is a screen shot of what i used to get. This is a slightly made up example, as the real output is on the secure server.
I'd recommend splitting multi-allelics and left-aligning the VCF before doing the operation.
Your data looks transposed to me - is that something you did on purpose? Can you describe why you call this a "mish-mash" I'm assuming you abstracted the values in the
ID
fields.Hello Ram,
i have taken some screen shots from the ouput as I get it to explain it properly (sorry the other file was somehwat transposed).
This is the mish mash example I now get using bcftools view -R genotype.txt myfile.vcf.gz > output.vcf
This is the good example that I used to get with the same command (bcftools view -R genotype.txt myfile.vcf.gz > output.vcf)
EGAN is the IDs that correspond to the patients where I need to retrieve the phenotype. That is the the bit I don't get with the tabix file. But the format of the columns is correct, just missing the EGAN IDs.
Thank you
I mean, could I assume that the order is the exact same one as the list I get when i do:
That order seems to be mantained in the original files where I was still getting the EGAN IDs. I just don't want to assume.
This how the current ouput looks like.
i have just tried:
This give me the nice column output but again no EGAN IDs.
Your "mishmash" file seems to be have the content just with different delimiters. Please stop using Excel and use
grep
andhead
instead to show content in plain text.How does the same command give different output? Are you using different input files? Maybe a different version of bcftools?
Thank you Ram. The chromosomes files are all split into seperate chromosome files. That might the problem why the same command gives different output.
Noted, will aim to use grep and head. The bcftool version is the same.
Please see if you can find a linux pro near you. This looks like a problem that needs to be eyeballed and probed on your machine (as in there are too many variables to be accounted for and an online discussion might go on forever). I can't help you with this right now, sorry.
You may need to examine EOF characters, specify options explicitly, and try a bunch of things.
Ok, thank you very much Ram. Really appreciate your help. I will try and find someon locally. Thank you, Julia
randomly, this code made it work:
Glad you found something that works. Now for the fun part of figuring out why something else didn't work as it should have.