Question: use bedtools to find overlaps of bed and gff files
gravatar for biolab
4.3 years ago by
biolab1.1k wrote:

Dear all
I am trying to use bedtools to compare a BED file with a GFF file.

The BED file is like below.
Chr1    3641    5640    .    .    +
The GFF file is as follows.
Chr1    TAIR10  gene    3631    5899    .       +       .       ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .       +       .       ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .       +       .       Parent=AT1G01010.1

I used the command intersectBed –a A.bed –b B.gff –wa -wb to find overlaps.  
However, my problem is: how to define intergenic, intron and intron-exon-overlapping or intergenic-gene-overlapping regions?

Thank you for any of your suggestions!

gff bedtools • 2.9k views
ADD COMMENTlink modified 4.3 years ago by Vishaka Datta80 • written 4.3 years ago by biolab1.1k

I don't understand exactly your problem, what are you trying? Do you want to extract the features of the gff file that are included within your bed region?

ADD REPLYlink written 4.3 years ago by iraun3.5k

Yes, exactly as you mentioned, I want to extract features of GFF file according to BED regions. My problem is: the GFF file does not have intron or intergenic features. More complex thing is: BED regions may have exon-intron overlap. THANKS!

ADD REPLYlink written 4.3 years ago by biolab1.1k

You should really be able to figure this out yourself. For example, do you expect purely intergenic regions to have any overlaps (hint: no, you don't).

ADD REPLYlink written 4.3 years ago by Devon Ryan90k

Hi Devon

Sorry for my unclear description.  As airan said, I want to define every BED region.  The difficulty is: the GFF file does not have intergenic and intron information. THANKS!

ADD REPLYlink written 4.3 years ago by biolab1.1k
gravatar for Vishaka Datta
4.3 years ago by
Vishaka Datta80 wrote:

You can find intergenic regions of your GFF file by using complementBed. You require a fasta index file as an input to this method though (usually a file with a .fai extension). The fasta index file can be created by using the faidx tool from SAMtools. Here's an example pipeline for finding overlaps of your BED file with intergenic regions :

Step 1 : Create the .fai file from SAMTools faidx. This requires the genome sequence fasta file as input. 

samtools faidx <your genome.fa>

This will create a file name <your genome.fa.fai>

Step 2 : Run complementBed with this .fai file :

complementBed -i <your GFF gene annotation file> -g <your genome.fa.fai> > intergenic.bed

intergenic.bed now contains the chromosomal coordinates (only) of all intergenic regions.
You can now run the intersectBed command of A.bed with intergenic.bed .

To generate a intronic regions bed file, you should either download the exon annotations for your genome and run complementBed on it, or download the intronic region file directly if available.

To run higher order intersections (like say, between three files), you can pass multiple files to the -b argument through wildcard characters. See the docs for help on this - I haven't tried it before.

Hope I understood your question right :)



ADD COMMENTlink written 4.3 years ago by Vishaka Datta80

Hi, Vishaka,

Thank you very much for your detailed method.  My question is not properly described, but you well guessed what I was asking. I am trying your method. Cheers!

ADD REPLYlink written 4.3 years ago by biolab1.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1106 users visited in the last hour