Question: Filtering a multi-person structural variant VCF file by type of SV
1
gravatar for tacrolimus
11 months ago by
tacrolimus50
tacrolimus50 wrote:

Hello all,

First post so apologies for any etiquette breaches.

I have a large cohort of over 1000 patients who have had WGS done and then Canvas/Manta structural variant caller applied to their data. I have then merged these into a single VCF file.

In the ID (3rd column) I have the types of call made by one of the two alogrithms e.g. MantaDEL, MantaINV, CanvasGAIN etc etc. After each piece of text there is a string of number representing the graph plot associated with how that call was generated e.g. MantaDel0:1:3:00 which is unique to that call.

I am trying to create individual VCFs of each unique type of call. However, if I run the following code to see what the unique SV types are:

bcftools query -f '%ID/n' SV.vcf.gz

I unsurprisingly get a massive list of each "ID" with the unique number string after it when all I want is the first part, as in to know what are the types of SV call so I can then use bcftools to create individual file types for each.

I hope this makes sense and apologies if this has been covered elsewhere.

Many thanks

bcftools sv vcf • 302 views
ADD COMMENTlink modified 11 months ago by Pierre Lindenbaum130k • written 11 months ago by tacrolimus50

just curious: how did you merge those SV calls ?

ADD REPLYlink written 11 months ago by Pierre Lindenbaum130k

I used bcftools to merge the files.

ADD REPLYlink written 11 months ago by tacrolimus50

it won't work. The very "same" CNVs can be localized at the same place but with a few difference of number of bases.

eg:

chr1 100 . DEL . . END=1000
chr1 101 . DEL . . END=999
ADD REPLYlink written 11 months ago by Pierre Lindenbaum130k

I was hoping once I had created the filed to merge calls based on similar distances. I was going to use Plink for the CNV analysis and potentially a few other tools. What would your approach be?

All the best

Omid

ADD REPLYlink written 11 months ago by tacrolimus50

when all I want is the first part, as in to know what are the types of SV call

an example is needed (input / output )

ADD REPLYlink written 11 months ago by Pierre Lindenbaum130k

Input would be a standard vcf file with the ID column being made up of calls similar to this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9

2 321682 . T MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12

2 14477084 . C MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12

3 9425916 . C MantaINS:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15

And the output in this case would be three VCFs for "MantaINS" "MantaBND" and "MantaDEL".

I was planning on generating a list of all the IDs and then using Grep e.g. grep "MantaINS*" but was wondering if it could be done in one line from BCFtools.

Many thanks

ADD REPLYlink modified 11 months ago • written 11 months ago by tacrolimus50
4
gravatar for Pierre Lindenbaum
11 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:
for T in  MantaDEL MantaINS MantaBND
do
   bcftools view input.vcf  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $1,$2,$3;}'  > ${T}.txt
done
ADD COMMENTlink written 11 months ago by Pierre Lindenbaum130k
1

Dear Pierre,

Many thanks for this, it worked delightfully well (I changed to "print $0" as I wanted the whole line). I have upvoted and accepted your answer (is that the right thing to do?, let me know if I should celebrate your helpfulness any further)

All the best

Omid

ADD REPLYlink written 11 months ago by tacrolimus50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1912 users visited in the last hour