Question: output feature/gene name from a known base position
gravatar for Dan
3.5 years ago by
Dan0 wrote:

Just a little background so that I can put my question into context. I have a list of SNPs and I want to find out which gene/CDS feature these SNPs occur in. For each reference bacterial genome I have gff and gbk files containing my sequence + annotations. Currently, I am using my gff files in Artemis and manually using the navigator to 'goto' my base of interest and manually recording the feature ID (more specifically the product and gene name that this base falls within). With over 200 SNPs to look at per bacterial genome this is understandably taking its time.

Therefore, my question is; is there an easier or more efficient way to do this?

Any advice on this would be very much appreciated and I would like to give my thanks to anyone who might be able to help in advance.

snp sequence gene • 1.0k views
ADD COMMENTlink modified 3.5 years ago by Emily_Ensembl21k • written 3.5 years ago by Dan0
gravatar for Alex Reynolds
3.5 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

With BEDOPS, you can convert your VCF and GFF files to sorted BED files, and then perform set operations on the BED files to find overlaps between both sets. This toolkit will help you automate doing these operations over hundreds of files.

For example, here's how to convert your SNPs and gene annotations to sorted BED files.

$ vcf2bed < snps.vcf > snps.bed
$ gff2bed < genes.gff > genes.bed

Now you can use bedmap to map genes to SNPs:

$ bedmap --echo --echo-map-id-uniq --delim '\t' snps.bed genes.bed > answer.bed

The file answer.bed contains, on each line, a SNP and the ID field of any gene annotations that overlap the position of that SNP.

There are other --echo-map-* operators, in case you want more or other details about the gene annotations. See the documentation for more detail and examples.

If you want to make this even faster and you don't want to keep around intermediate files, you can use process substitution to convert directly:

$ bedmap --echo --echo-map-id-uniq --delim '\t' <(vcf2bed < snps.vcf) <(gff2bed < genes.gff) > answer.bed

With this revision, there's need to first convert to separate files — all the intermediate data are written to Unix streams and processed directly within bedmap.

Once you do this for one pair of SNPs and gene annotations, you could think about writing a shell script to loop over all your pairs of files and do these operations. This will help deal with the N pairs of files you may need to deal with.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Alex Reynolds30k

IntersectBed (from bedtools) can work with bed, vcf and gff; no previous conversion needed.

ADD REPLYlink written 3.5 years ago by abascalfederico1.1k
gravatar for Emily_Ensembl
3.5 years ago by
Emily_Ensembl21k wrote:

You can use the Ensembl VEP to query against our known bacterial species, or if you want to use your own annotation, you can download the script and build a cache using a FASTA and GFF file.

ADD COMMENTlink written 3.5 years ago by Emily_Ensembl21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1074 users visited in the last hour