Analyze Promoter Regions In A Vcf File
Entering edit mode
9.4 years ago
farah ▴ 30

I have a VCF file. My aim is to check promoter variants within this file. I have got a list of all the promoter regions in a fasta format (supposedly, this file has a chromosome number, end and start position of the prmoter region which i defined as 2000bp upstream). how do I proceed? any help?

promoter vcf • 2.7k views
Entering edit mode
9.4 years ago
fo3c ▴ 450

Assuming the calls in the vcf file have been filtered and are high-confidence calls, extract the intervals from the fasta file (headers) and intersect them with the vcf file to see what mutations, if any, have been called in the promoter regions.

Entering edit mode
9.4 years ago
Rm 8.2k

Use Bedtools intersect or intersectbed or windowBed depending on your input

bedtools intersect -a promotor.vcf -b input.VCF -f 1 -wao

(you can use -wa or -wb etc accordingly)

Info from Bedtools:

Note: When intersecting SNPs, make sure the coordinate conform to the UCSC format. That is, the start position for each SNP should be SNP position - 1 and the end position should be SNP position. E.g. chr7 10000001 10000002 rs123464

Report the base-pair overlap between sequence alignments and genes.

intersectBed -a reads.bed -b genes.bed

If you have Gene cordinates as bed file: to Report all SNPs that are within 5000 bp upstream (-l) or 1000 bp downstream (-r) of genes. Define upstream and downstream based on strand (-sw).

windowBed -a genes.bed –b snps.bed –l 5000 –r 1000 -sw
Entering edit mode
9.4 years ago

Convert the FASTA-formatted promoter regions to a UCSC-formatted BED file and use a tool like BEDOPS bedmap to map promoter regions that overlap your VCF calls. A discussion follows which explains how you might approach this.

To start, let's say your promoter regions are formatted as follows:

>promoter1 chrN:A-B
>promoter2 chrN:C-D

Convert this to a five-column BED file, where the chromosome, start, stop and ID columns are taken from the FASTA header, and a dummy score value is added for convenience. For example:

#!/usr/bin/env perl

use strict;
use warnings;

while (<>) {
  if ($_ =~ /^>/) {
    my $ln = $_;
    $ln =~ s/^>//s;
    my ($id, $chr, $start, $stop) = split(/[\s+|:|-]/, $ln);
    my $score = 0;
    print STDOUT join("\t", ($chr, $start, $stop, $id, $score))."\n";

Convert the FASTA data to a sorted BED file with a command like:

$ ./ < myPromoters.fa | sort-bed - > myPromoters.bed

This is a "map" file that contains promoter regions, IDs and (placeholder) scores that we can use as an input to bedmap:

$ more myPromoters.bed
chrN    A    B    promoter1    0
chrN    C    D    promoter2    0

We're now ready to answer the question. We'll first convert your VCF calls to 0-indexed BED using the vcf2bed conversion script. We pipe the conversion results into sort-bed to sort them, and then pipe that into bedmap (along with myPromoters.bed) to do the actual mapping:

$ vcf2bed < myCalls.vcf \
    | sort-bed - \
    | bedmap --echo --echo-map-id --delim '\t' - myPromoters.bed \
    > myAnswer.bed

The result (myAnswer.bed) will be a BED-formatted file. The first four columns will be the (0-indexed) positional data and variant ID of the VCF call. The last column (either the eleventh or twelfth column) will be the ID (or IDs) of promoter regions which overlap that call. (If you want the full promoter element — region and ID — use --echo-map in place of --echo-map-id.)

If you need to adjust the boundaries of your definition of a promoter region, you can use BEDOPS bedops to pre-process your myPromoters.bed file. As an example, here is how to use the --range operator to widen the boundaries of each promoter upstream by 10000 bases and downstream by 500 bases, along with the --everything operator to process all elements:

$ bedops --range 10000:500 --everything myPromoters.bed > myWidenedPromoters.bed

Then use the myWidenedPromoters.bed file with mapping operations, etc.

Entering edit mode

thank you very much for your help. well I did convert promoter reference file to a BED format but then I used Vcftools to intersect the VCF file of the patient's genome and the promoter reference file. I found vcftools very easy for beginners.


Login before adding your answer.

Traffic: 1376 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6