Question: Which SNPs fall within genes and what's their description?
0
gravatar for paulotyama
10 days ago by
paulotyama0
paulotyama0 wrote:

I have a bed file with 13k snps and want to intersect this with a gff file with all the genes annotated for this species. My aim is to find which of my snps fall within genes or other annotated features. Does any one have any suggestions on how to go about this? I'd really appreciate some suggestions and pointers.

Here's my gff file;

head Aradu.Araip_v02.gff | column -t

Aradu.A01      gene  17735860  17739171  ID=Aradu.B2QWP;Name=Aradu.B2QWP;Note=uncharacterized    protein  LOC100797259  isoform  X4  [Glycine  max]%3B  IPR004332  (Transposase%2C  MuDR%2C  plant)
Aradu.A01      mRNA  17735860  17739171  ID=Aradu.B2QWP.1;Parent=Aradu.B2QWP;Name=Aradu.B2QWP.1
Aradu.A01      exon  17735860  17736516  ID=Aradu.B2QWP:exon:0;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17737010  17737038  ID=Aradu.B2QWP:exon:1;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17737236  17737802  ID=Aradu.B2QWP:exon:2;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17738326  17738425  ID=Aradu.B2QWP:exon:3;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17738558  17738633  ID=Aradu.B2QWP:exon:4;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17738752  17738962  ID=Aradu.B2QWP:exon:5;Parent=Aradu.B2QWP.1
Aradu.A01      exon  17739071  17739171  ID=Aradu.B2QWP:exon:6;Parent=Aradu.B2QWP.1

Here's my bed file with snps;

head Allsnps.bed | column -t

chrom      pos     pos     Marker        conversion.type
Aradu.A01  230496  230496  AX-147207636  Polyhigh
Aradu.A01  230616  230616  AX-147207637  Other
Aradu.A01  231463  231463  AX-147207638  Polyhigh
Aradu.A01  253683  253683  AX-147207640  Other
Aradu.A01  390660  390660  AX-147207661  Nominor
Aradu.A01  405426  405426  AX-147207663  Polyhigh
Aradu.A01  413869  413869  AX-147207665  Polyhigh
Aradu.A01  461328  461328  AX-147207676  Nominor
Aradu.A01  785293  785293  AX-147207749  Nominor

Thanks, Paul

gff file bed file snp gene • 104 views
ADD COMMENTlink modified 10 days ago by Petr Ponomarenko730 • written 10 days ago by paulotyama0
2
gravatar for Floris Brenk
10 days ago by
Floris Brenk750
USA
Floris Brenk750 wrote:

http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

intersectBed -a file1 -b file2 -wb

ADD COMMENTlink modified 10 days ago • written 10 days ago by Floris Brenk750
2
gravatar for Petr Ponomarenko
10 days ago by
United States/Los Angeles/ALAPY
Petr Ponomarenko730 wrote:

BEDTools are great! Use it myself a lot. At the same time, it is interesting to learn new ways of doing things if you plan to have a career in bioinformatics. Here is a way to same data intersection with awk:

awk 'NR==FNR{print $1, $3,"e_start",$0;print $1, $4,"e_end",$0;}NR!=FNR{print $1, $3,"SNP", $0}' Aradu.Araip_v02.gff Allsnps.bed | \
sort -k1,1 -k2,2n -k7,7nr | \
awk '$3=="e_start"{in_level=1;tmp=$0;sub(/([^ ]+ +){3}/,"",tmp);line[in_level]=tmp; \
  while(in_level!=0){getline; \
    if($3=="e_start"){in_level+=1;tmp=$0;sub(/([^ ]+ +){3}/,"",tmp);line[in_level]=tmp}; \
    if($3=="e_end"){in_level-=1}; \
    if($3=="SNP"){tmp=$0;sub(/([^ ]+ +){3}/,"",tmp); \
      for(i=1;i<=in_level;i++){print tmp,line[i]} \
    } \
  } \
}'

awk reads line by line. NR==FNR is true only for the first file. We add extra rows for each feature in GFF so one line is for start and another line is for the end of the feature. That way we can sort using Unix sort. -k2,2n sorts numerically by second column only. -k7,7nr sorts numerically in reverse order only by the seventh column. As a result, you have a structure that has information about starts and ends of GFF features as well as all SNP from the second file and everything is already sorted by chromosome and position. Now we start another awk on that structure to print only information within features' boundaries together with information about features themselves. We hold information about how many levels we are within the structure because we have features inside features inside features and want to print out all the information. We have technical information that we added in the first two columns and these can be removed with tmp=$0;sub(/([^ ]+ +){3}/,"",tmp); This is a good way to print out all columns after a given one (third column in this case).

While these can be much more complicated than installing and running bedtools, it gives you some experience with Unix, sort and awk that might be useful in the future. Also this gives you way more opportunities to play with data.

ADD COMMENTlink written 10 days ago by Petr Ponomarenko730

Thank you @Petr! I used Bedtools and it worked just fine. I agree with you that it is nice to learn new ways of doing something and definitely writing your own scripts for a job is good practice for much needed skills.

I will give you more direct feedback a little later about how well the script worked. Thanks again!

ADD REPLYlink modified 9 days ago • written 9 days ago by paulotyama0
1
gravatar for Dan Gaston
10 days ago by
Dan Gaston6.5k
Canada
Dan Gaston6.5k wrote:

You can do this with BEDTools on the command line, or you can also write a small python script with PyBEDTools, which interfaces with BEDTools to accomplish this very easily.

For instance given a BED file of exons and a BED file of SNPS, identifying the SNPs in exons is quite easy:

import pybedtools
exons = pybedtools.example_bedtool('a.bed')
snps = pybedtools.example_bedtool('b.bed')
exons_with_snps = exons.intersect(snps, u=True)

They have plenty of examples in the documentation of working with GFF files, and filtering those files using specific criteria.

ADD COMMENTlink written 10 days ago by Dan Gaston6.5k
1

@Dan, is there any advantage to running PyBEDTools vs. using BEDTools directly? (not being snarky, merely showing my ignorance)

ADD REPLYlink written 10 days ago by harold.smith.tarheel3.5k

The only advantage for me is repeatability. If I'm doing something more than just a straight intersection or one line bedtools command its nice to have a script on hand so I'm always doing it the exact same way everytime I do it. There are also some advantages where the sheer amount of lines of code you need to type can be lower, or more clear and straightforward.

ADD REPLYlink written 10 days ago by Dan Gaston6.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1731 users visited in the last hour