HELP! extract variants for individual sample IDs in a multisample VCF
2.9 years ago

Hello - please could you help me? I can only do very basic bioinformatics!

I have been given a multi sample VCF file on a linux server.

Samples

There are 5236 samples / patients.

Question

I am interested in the variants of 38 patients. I have the patients' ID numbers. How do I extract them? Do I use bcftools? (That's installed.) As it is on a secure server, I don't have permission to install software.

Thank you!

2.9 years ago
4galaxy77 2.8k

There are lots of answers for this question already - try searching for them first. The answer is here: "Extract subset of samples from multigenome vcf file". Also, people might be more willing to help if you don't shout with capitals in the title :)

Thank you 4galaxy77 - that is really kind. I tried searching a lot yesterday but didn't get far. I suppose that's the novice not knowing what to look for. The shouting is more my desperation in trying to understand bioinformatics, not aimed at anyone. Apologies.

If the VCF file has the patient IDs as its sample IDs, you should be able to use:

bcftools view -Ov -S <ids_file_with_one_id_per_line> input.vcf > output.vcf

If your patient IDs don't match the sample IDs in the VCF, you'll need to find the sample IDs that correspond to your patient IDs and then do the above.

To view all the sample IDs in your VCF file, use:

bcftools query -l input.vcf > list_of_sample_ids.txt
## less is a page-by-page viewer, press SPACE to go to next page, b to go to previous page and q to quit
less -S list_of_sample_ids.txt
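
Before running the subset command, it may be worth checking that every patient ID on your list really appears among the VCF sample IDs, since bcftools view -S stops with an error on an unknown sample unless --force-samples is given. A minimal sketch using only standard tools; my_ids.txt is a hypothetical name for your ID list, and both files are filled with made-up IDs here so the example runs on its own:

```shell
# Made-up stand-ins: in practice list_of_sample_ids.txt comes from
# `bcftools query -l input.vcf` and my_ids.txt is your own list of IDs.
printf 'P001\nP002\nP003\nP004\n' > list_of_sample_ids.txt
printf 'P002\nP004\nP999\n' > my_ids.txt

# Print the IDs in my_ids.txt that do NOT exactly match any sample ID.
# -F literal strings, -x whole-line match, -v invert, -f read patterns from a file.
# `|| true` because grep exits non-zero when nothing is missing.
grep -vxFf list_of_sample_ids.txt my_ids.txt > missing_ids.txt || true

cat missing_ids.txt
```

An empty missing_ids.txt means every ID was found. If every ID is reported as missing, the list may have been saved with Windows line endings.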

I can't thank you enough!!!!!! thank you so so so much. This worked like a dream for me. Such a relief!!! Julia :)

Could I ask for help again? I now need to do it the other way around. I have variants that I am interested in, and I need to extract from the multi-sample VCF file the patient IDs that these variants match to. Should I just create a list_of_variants.txt? What would the format of that txt file be (position ref alt, one per line)? Thank you again, Julia

You should look at the -R and -T options. Start small - use a file with 3-5 loci. Once you get that working, expand to your full set of loci.
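
As a rough illustration of the input those options expect: both -R (--regions-file) and -T (--targets-file) accept a tab-delimited file with the chromosome and a 1-based position, one locus per line. -R needs a bgzip-compressed, indexed VCF, while -T reads through the whole file. Both select by position only, so REF and ALT still have to be checked afterwards. The sketch below converts a hypothetical space-separated variants.txt (chrom pos ref alt, made-up values) into such a file; loci.tsv is an assumed name too.

```shell
# Made-up example list: chrom pos ref alt, one variant per line.
printf '1 12345 A G\n2 67890 C T\n1 555 G GA\n' > variants.txt

# Keep only chrom and pos, separated by a tab.
awk 'BEGIN { OFS = "\t" } { print $1, $2 }' variants.txt > loci.tsv

cat loci.tsv

# Then, not run here (needs bcftools and your real VCF):
#   bcftools view -T loci.tsv input.vcf > subset.vcf
```

The chromosome names have to be written the same way as in the VCF (for example "1" versus "chr1"), otherwise nothing will match.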

Thank you very much for that. I will give that a go. Julia

Thank you, Ram, for your help; that has worked for me. Could I please ask for some further advice? The file has the results of all >5000 patients for the specific "position ref alt" I am looking at, and lists each patient in a tab-separated column with their GT at that position. I would prefer not to filter the het/hom manually. I have tried various commands suggested on here, but with no success. I have tried the following (no change to the file when I look at the output; all GTs, whether 0/0, 0/1 or 1/1, are still included):

grep '(^#|1/1)' file.vcf > homo.vcf
egrep '(^#|1/1)' file.vcf > homo.vcf
more file.vcf | grep "1/1" >> homo.vcf 
bcftools view -c1 input.vcf > new_file.vcf
bcftools filter --exclude 'TYPE="ref"' input.vcf

I also tried this (but this removes all my GT information, including my info on position, reference etc.):

cat raw_variants.vcf |
awk '($0~/^#/)($0!~/^#/){split($10,x,":"); if(x[1]!="0/0") print }' > hard_filtered.vcf

Any other command I could use?

many thanks, Julia

I would keep exploring bcftools - unfortunately, I cannot help you more than this as we'd need quite a bit of back and forth. You're on the right track though. Just keep in mind that at some point, you may want to get data in a tabular format using bcftools query and then move to R to make calculations easier. Counting, grouping etc become a lot easier when you're working with statistical/data management software.
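
One possible shape for that tabular step, as a sketch rather than a definitive recipe: a VCF has one line per variant and one column per sample, so line-based tools such as grep (and site-level filters such as bcftools view -c1) can only keep or drop whole variants, never individual patients. Turning each (variant, sample) pair into its own row sidesteps that. The awk version below runs on a tiny made-up VCF (file names and sample IDs are hypothetical), accepts phased and unphased genotypes, and keeps only samples carrying at least one non-reference allele. The bcftools query line in the final comment is the equivalent I would expect to work on a real file with a recent bcftools, but it is not run here.

```shell
# Tiny made-up VCF: 2 variants, 3 samples (columns are tab-separated).
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP001\tP002\tP003\n'
  printf '1\t12345\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/0:20\t0/1:18\t1/1:25\n'
  printf '2\t67890\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0|1:30\t./.:0\t0/0:22\n'
} > toy.vcf

# One output row per carrier: CHROM POS REF ALT SAMPLE GT ZYGOSITY
awk 'BEGIN { FS = OFS = "\t" }
  /^##/     { next }                                            # skip meta lines
  /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }   # remember sample IDs
  {
    for (i = 10; i <= NF; i++) {
      split($i, f, ":"); gt = f[1]          # GT is the first FORMAT field
      if (gt !~ /[1-9]/) continue           # no ALT allele (0/0 or ./.)
      n = split(gt, a, "[/|]")              # alleles, phased or unphased
      zyg = (n == 1) ? "hap" : (a[1] == a[2] ? "hom" : "het")
      print $1, $2, $4, $5, name[i], gt, zyg
    }
  }' toy.vcf > carriers.tsv

cat carriers.tsv

# bcftools equivalent without the zygosity column (not run here):
#   bcftools query -i 'GT="alt"' -f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GT\n]' input.vcf
```

The resulting carriers.tsv has one row per patient and variant, which is the kind of table that is easy to count or group in R.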

Hello Ram, that is really helpful. Thank you so much. OK, I need to familiarise myself with R as well then; that makes sense. I'll persist with bcftools. I need to grow my confidence in using this software. Julia
