Question: Filtering multiple single person structural variant VCF files by type of SV
0
gravatar for omid.alavijeh
8 months ago by
omid.alavijeh40 wrote:

Dear Biostars community,

Pretty similar question to my last one (C: Filtering a multi-person structural variant VCF file by type of SV) but with a twist!

I have 100 VCF files each a "structural" VCF file which is a per-person VCF file with made up of calls made with Manta and Canvas. The files look like this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

1 2827693 MantaDEL:0:2:5000 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9

2 321682 . T MantaBND:5:7:1:0 6 PASS IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12

2 14477084 . C MantaINS:88:22:00:3 12 PASS IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12

3 9425916 . C MantaINS:5:333:000 23 PASS IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15

I want subdivide the files by type of structural variant which are found in the ID column and are made up of: - MantaINS - MantaINV - MantaDEL - MantaBND - MantaDUP - Canvas:REF - Canvas: GAIN - Canvas:LOSS

I.e. for each input vcf come out with 8 subdivided files and in total have 800 output files.

I have run:

for T in  MantaDEL MantaINS MantaBND
do
   bcftools view -S paths_to_files.txt  | awk -v T=${T} '{split($3,a,/\:/);if(a[1]==T) print $1,$2,$3;}'  > ${T}.txt
done

Which I am sure is wrong as I need to run things in parallel I think or do some scripting.

Your help would be greatly appreciated.

Many thanks

ADD COMMENTlink modified 8 months ago • written 8 months ago by omid.alavijeh40
1

Make a list file with Manta codes:

echo "MantaINS" >> SV.list
echo "MantaINV" >> SV.list
echo "MantaDEL" >> SV.list
echo "MantaBND" >> SV.list
echo "MantaDUP" >> SV.list
echo "Canvas:REF" >> SV.list
...

Then do,

 cat SV.list | while read i do;
         grep $i your VCF.file > ${i}.sub.vcf
done

There are thousands more alternative for doing this!

ADD REPLYlink written 8 months ago by reza.jabal370

Thanks Reza! when using grep my "your VCF.file" is in fact a text file with the file paths to each file: looking like this (just s sample):

genomes/by_date/2015-09-03/batch1/patient30/patient30.SV.vcf.gz
genomes/by_date/2016-03-05/batch1/patient4/patient4.SV.vcf.gz
genomes/by_date/2018-10-14/batch1/patient16/patient16.SV.vcf.gz
genomes/by_date/2018-012-28/batch1/patient100/patient100.SV.vcf.gz

genomes/by_date/2018-03-14/batch1/patient1/patient1.SV.vcf.gz

Will grep work in this case?

Many thanks

Omid

ADD REPLYlink modified 8 months ago • written 8 months ago by omid.alavijeh40
1

I see! I was under impression that you have a multi-call vcf file containing all variants across your samples. In that case you need to generate a multi-call vcf. You can use either vcftools or GATK but I recommend GATK: https://gatkforums.broadinstitute.org/gatk/discussion/53/combining-variants-from-different-files-into-one

The syntax for your purpose would be along these lines:

 java -jar GenomeAnalysisTK.jar \
   -T CombineVariants \
   -R reference.fasta \
   --variant genomes/by_date/2015-09-03/batch1/patient30/patient30.SV.vcf.gz \
   --variant genomes/by_date/2016-03-05/batch1/patient4/patient4.SV.vcf.gz \
   --variant genomes/by_date/2018-10-14/batch1/patient16/patient16.SV.vcf.gz \
   --variant genomes/by_date/2018-012-28/batch1/patient100/patient100.SV.vcf.gz \
   --variant genomes/by_date/2018-03-14/batch1/patient1/patient1.SV.vcf.gz \
   .
   .
   -o output.vcf \
   -genotypeMergeOptions UNIQUIFY
ADD REPLYlink modified 8 months ago • written 8 months ago by reza.jabal370

Thanks again Reza for taking the time to reply. I actually have a multi-callVCF file already but was more interested in looking at SV variation between my cohort so thought generating individual patient VCF files by type of SV would be informative.

Many thanks

Omid

ADD REPLYlink written 8 months ago by omid.alavijeh40

So grep should work absolutely fine with your vcf. Lets assume your multi-call VCF file is: output.vcf

cat SV.list | while read i do;
         grep $i output.vcf > ${i}.sub.vcf
done

should do the trick!

Also have a look at SnpEff "Primary variant annotation and quality control": http://snpeff.sourceforge.net/protocol.html

ADD REPLYlink written 8 months ago by reza.jabal370
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1090 users visited in the last hour