Question: How To Count Genes In Genomic Regions Using A Gtf/Gff3 And A Bed File Of Regions
2
gravatar for Nick Crawford
14 months ago by
Cambridge MA
Nick Crawford80 wrote:

I'd like to count the number of unique genes in a gff file falling within a list of genomic regions. With bedtools I can count the number of regions within the gff which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified ensemble style gtf file (not a gff) that has unique transcript IDs. This means that simply selecting unique fields in the 9th column of the gtf file actually counts transcript IDs.

To circumvent this problem I first truncated the gtf file:

cat my.gff | sed -e 's/;.*//' > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a oneliner, without the intermediate file, but I have a dissertation to write!

ADD COMMENTlink modified 14 months ago • written 14 months ago by Nick Crawford80
5
gravatar for Aaronquinlan
14 months ago by
Aaronquinlan7.4k
United States
Aaronquinlan7.4k wrote:

Assuming your regions.bed file is 3 columns, and your gff is 9 columns, you could use the -wb option in bedtools to report each gene that overlaps a region.

$ cat regions.bed
chr1    100    200
chr1    500    600

$ cat my.gff
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

$ bedtools intersect -a regions.bed -b my.gff -wa -wb 
chr1    100    200    chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";
chr1    500    600    chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

Now, as you can see, this is a redundant list. You could just extract the last 9 columns and use UNIX sort and uniq to get a unique (non-redundant) count.

# extract just the genes
####################
$ bedtools intersect -a regions.bed -b my.gff -wa -wb | cut -f 4-12
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

# count the distinct genes
####################
$ bedtools intersect -a regions.bed -b my.gff -wa -wb | cut -f 4-12 | sort | uniq -c
1

If you want the distinct count per region, it is better to use the groupby tool in conjunction with intersect. In this case, we are grouping on the region (columns 1-3) and counting the number of distinct gene_id strings (column 12).

$ bedtools intersect -a regions.bed -b my.gff -wa -wb | bedtools groupby -i -  -g 1-3  -c 12 -o count_distinct

Alternatively (preferably?) you can use the map utility - it works much the same way as what Alex nicely describes for bedops. In this case, the gene_id is the 9th column in the gene file, so the -c argument is 9.

$ bedtools map -a regions.bed -b my.gff -c 9 -o count_distinct
ADD COMMENTlink modified 14 months ago • written 14 months ago by Aaronquinlan7.4k

I don't think that quite works. I'd like to count the number of unique gene_ids that fall with each region. I'll update my question to better explain that.

ADD REPLYlink written 14 months ago by Nick Crawford80

I see - updated.

ADD REPLYlink written 14 months ago by Aaronquinlan7.4k
1
gravatar for Alex Reynolds
14 months ago by
Alex Reynolds6.2k
Seattle, WA USA
Alex Reynolds6.2k wrote:

Another approach is to use BEDOPS gff2bed and the bedmap application. The bedmap tool includes a new --echo-map-id-uniq operator, which computes a non-redundant list of mapped IDs, separated with semi-colons. This result can be piped into an awk statement that counts the number of elements split by the semi-colon delimiter.

Here's a command that does everything on one line:

$ gff2bed < my.gff \
    | sort-bed - \
    | bedmap --echo-map-id-uniq regions.bed - \
    | awk -F ';' '{print NF}'

The result is a list of counts of unique ID field values of overlapping GFF elements, per region.

There is also the gtf2bed conversion script, which does the same thing as gff2bed but works with GTF files and uses the gene_name or gene_id field value for populating the ID column in the BED result, in that order.

ADD COMMENTlink modified 14 months ago • written 14 months ago by Alex Reynolds6.2k
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 376 users visited in the last hour