Question: Plot number of mutations in each cancer type in VCF files?
13 months ago by
DanielC90
Canada
DanielC90 wrote:

Dear Friends,

I am trying to plot the number of variants in each cancer type from VCF files. Could you please let me know how to do this using R, Python or bash? I have a text file listing the samples and the cancer type associated with each, like this:

Samples                         Cancer Type
TCGA-XXX.barcode                ACC
.
.

I am new to this and learning. Thanks much!

DK

plot snp vcf • 677 views
modified 13 months ago by WouterDeCoster40k • written 13 months ago by DanielC90

It is unclear which data you have - be specific, e.g. number of samples - and which type of plot you aim to obtain. Please elaborate and show an example.

written 13 months ago by WouterDeCoster40k

Thanks! It is TCGA cancer data in VCF files. These are merged VCF files of about 10000 samples for each chromosome. I do not know the name of the plot, but what I am looking for is a plot of "number of variants in each cancer type in the vcf files". Please let me know if I am clear and what you think could be done to obtain this. Thanks much!

written 13 months ago by DanielC90

Can you perhaps draw the plot you have in mind on a piece of paper, take a picture and show us? How should the 10k samples be summarized?

written 13 months ago by WouterDeCoster40k

Thanks for your reply! This is the type of plot I am looking to generate from VCF files: https://www.dropbox.com/s/rfvyw8b8v62lhuz/example.jpg?dl=0

Please let me know how to generate this plot from vcf files. Thanks

written 13 months ago by DanielC90

See also How to add images to a Biostars post

How do you link the sample identifiers in the vcf to the cancer types?

Please update your initial question when adding information. We are losing valuable time here because I have to ask for clarification every time. I assume people who can help you don't want to go through all these comments.

written 13 months ago by WouterDeCoster40k

Thanks! I will remember that next time.

To link the identifiers to the cancer type I have a text file like this:

Sample                           Cancer Type
TCGA-XXX-barcode      ACC

Now I am trying to figure out how to use this information to plot "number of variants for each cancer type" using the file above and the merged VCF file for 10000 samples. Please let me know if I am missing any info here and I will provide it. Thanks.

written 13 months ago by DanielC90

Also that file is important information which should have been part of your first post. We have now wasted 11 hours until we found all required information.

Please update your initial question when adding information.

I can solve this in Python, but not in R.

written 13 months ago by WouterDeCoster40k

Thanks, I have updated the question. Could you please share how to make such a plot with the available information? I would really appreciate it. Thanks.

written 13 months ago by DanielC90
13 months ago by
Belgium
WouterDeCoster40k wrote:

I wrote a script which should be able to handle this.

The script below takes two arguments:

  • --samples: your list of samples with their cancer type. No header, just a space-separated file with two columns.
  • --vcf: your vcf file, for which all sample names are found in the file specified by --samples

This script requires cyvcf2 and matplotlib which you can install from pip:

pip install -U cyvcf2
pip install -U matplotlib

Save the code, e.g. as samples_to_hist.py and execute as (fill in the proper files)

python samples_to_hist.py --samples samples.txt --vcf variants.vcf

Please let me know how it goes.
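
For readers who want to see the idea spelled out, here is a minimal sketch. It is not the script from this answer: it parses the VCF with the Python standard library instead of cyvcf2, and it assumes that "number of variants" means the number of non-reference genotype calls summed over the samples of each cancer type. All function names, including the optional plot_counts helper, are made up for illustration.

```python
import gzip
from collections import Counter


def read_sample_sheet(path):
    """Map sample name -> cancer type from a header-less, whitespace-separated file."""
    sample_to_type = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                sample_to_type[fields[0]] = fields[1]
    return sample_to_type


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def is_variant_call(genotype):
    """True if the GT string (e.g. '0/1', '1|1') has a non-reference allele."""
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


def count_variants_per_type(vcf_path, sample_to_type):
    """Count non-reference genotype calls per cancer type (assumed definition)."""
    counts = Counter()
    types = []
    with open_vcf(vcf_path) as handle:
        for line in handle:
            if line.startswith("##"):
                continue
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                # One cancer type per sample column; None if the sample is not in the sheet.
                types = [sample_to_type.get(name) for name in fields[9:]]
                continue
            # Assumes GT is the first FORMAT key, as the VCF spec requires when GT is present.
            for cancer_type, call in zip(types, fields[9:]):
                if cancer_type is not None and is_variant_call(call.split(":")[0]):
                    counts[cancer_type] += 1
    return counts


def plot_counts(counts, out_path):
    """Bar plot of the counts; matplotlib is imported here so counting works without it."""
    import matplotlib.pyplot as plt

    labels = sorted(counts)
    fig, ax = plt.subplots()
    ax.bar(labels, [counts[label] for label in labels])
    ax.set_xlabel("Cancer type")
    ax.set_ylabel("Non-reference genotype calls")
    ax.tick_params(axis="x", labelrotation=90)
    fig.tight_layout()
    fig.savefig(out_path)
```

Calling count_variants_per_type("variants.vcf.gz", read_sample_sheet("samples.txt")) returns a Counter keyed by cancer type, which plot_counts can then write to an image file. Unlike a strict check, this sketch silently skips VCF samples that are missing from the sample sheet. A pure-Python loop like this is likely to be slow on a merged VCF with about 10000 samples.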

written 13 months ago by WouterDeCoster40k

Thanks much! Please clarify these queries:

a) My samples.txt lists 8000 samples and the VCF file has 10389 samples. Will the program run in this scenario?

b) Can the program be made to run on vcf.gz files?

Thanks much!

written 13 months ago by DanielC90

a) No, it will explicitly fail because it encounters samples in the VCF which are not in samples.txt (function test_vcf, line 29).

b) It already supports vcf.gz files ;-)

written 13 months ago by WouterDeCoster40k

Thanks! Is it possible to modify the program for scenario a), where the number of samples in samples.txt is less than the number of samples in the VCF file? :-)

written 13 months ago by DanielC90

Yeah, that's possible, but I don't know when I'll have time to adapt it. Or you could adapt your input file.

written 13 months ago by WouterDeCoster40k

Thanks for the reply! Since the VCF file has 10389 samples, I would have to find the ones that are not matched in samples.txt and then delete those columns from the VCF file, which is huge, so it will take a lot of time. If possible, I would really appreciate it if the script could be modified. :-) Thanks.

written 13 months ago by DanielC90

bcftools can do what you are looking for
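
One way to prepare the input for such a bcftools run is to compare the sample names in the VCF header with samples.txt. The sketch below does this with the Python standard library only. The helper names and the output file names in the usage note are hypothetical, and it assumes the first column of samples.txt holds the sample name exactly as it appears in the VCF header.

```python
import gzip


def vcf_sample_names(vcf_path):
    """Return the sample names from the #CHROM header line of a VCF or VCF.gz."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed VCF fields; sample names start at column 10.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found in " + vcf_path)


def split_samples(vcf_path, sheet_path):
    """Split the VCF samples into those present in the sample sheet and the rest."""
    with open(sheet_path) as handle:
        known = {line.split()[0] for line in handle if line.strip()}
    names = vcf_sample_names(vcf_path)
    keep = [name for name in names if name in known]
    drop = [name for name in names if name not in known]
    return keep, drop


def write_list(names, out_path):
    """Write one sample name per line."""
    with open(out_path, "w") as out:
        out.writelines(name + "\n" for name in names)
```

For example, keep, drop = split_samples("XX.vcf.gz", "samples.txt") followed by write_list(keep, "keep.txt") gives a file with one sample per line. That is the format the bcftools manual describes for the -S/--samples-file option of bcftools view; check the manual for the exact usage in your version. Only the header is read, so this step is fast even for a very large VCF.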

written 13 months ago by WouterDeCoster40k

Great! I am learning useful stuff here. Could you please point me to a source on how to do this using bcftools? Thanks much!

written 13 months ago by DanielC90

Have you gone through the manual?

written 13 months ago by WouterDeCoster40k

Yes, please let me know if this is right:

bcftools -S samples-to-remove.txt XX.vcf.gz > filtered-vcf.vcf

samples-to-remove.txt:
^TCGA...1
^TCGA...2

. .

written 13 months ago by DanielC90

Whether that command works right should be really easy to confirm on your own. I think I did enough here - time for you to show some effort too.

written 13 months ago by WouterDeCoster40k