How To Count Genes In Genomic Regions Using A Gtf/Gff3 And A Bed File Of Regions
2
8
Entering edit mode
11.2 years ago
Nick Crawford ▴ 210

I'd like to count the number of unique genes in a gff file falling within a list of genomic regions. With bedtools I can count the number of regions within the gff which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c

UPDATE:

I should have made my question a bit more specific. I have a modified ensemble style gtf file (not a gff) that has unique transcript IDs. This means that simply selecting unique fields in the 9th column of the gtf file actually counts transcript IDs.

To circumvent this problem I first truncated the gtf file:

cat my.gff | sed -e 's/;.*//' > delete.me.gtf

Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed

I almost forgot to delete the intermediate file:

rm delete.me.gtf

There is probably a way to make this a oneliner, without the intermediate file, but I have a dissertation to write!

bedtools gtf gff • 12k views
ADD COMMENT
11
Entering edit mode
11.2 years ago

Assuming your regions.bed file is 3 columns, and your gff is 9 columns, you could use the -wb option in bedtools to report each gene that overlaps a region.

$ cat regions.bed
chr1    100    200
chr1    500    600

$ cat my.gff
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

$ bedtools intersect -a regions.bed -b my.gff -wa -wb 
chr1    100    200    chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";
chr1    500    600    chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

Now, as you can see, this is a redundant list. You could just extract the last 9 columns and use UNIX sort and uniq to get a unique (non-redundant) count.

# extract just the genes
####################
$ bedtools intersect -a regions.bed -b my.gff -wa -wb | cut -f 4-12
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

# count the distinct genes
####################
$ bedtools intersect -a regions.bed -b my.gff -wa -wb | cut -f 4-12 | sort | uniq -c
1

If you want the distinct count per region, it is better to use the groupby tool in conjunction with intersect. In this case, we are grouping on the region (columns 1-3) and counting the number of distinct gene_id strings (column 12).

$ bedtools intersect -a regions.bed -b my.gff -wa -wb | bedtools groupby -i -  -g 1-3  -c 12 -o count_distinct

Alternatively (preferably?) you can use the map utility - it works much the same way as what Alex nicely describes for bedops. In this case, the gene_id is the 9th column in the gene file, so the -c argument is 9.

$ bedtools map -a regions.bed -b my.gff -c 9 -o count_distinct
ADD COMMENT
0
Entering edit mode

I don't think that quite works. I'd like to count the number of unique gene_ids that fall with each region. I'll update my question to better explain that.

ADD REPLY
1
Entering edit mode

I see - updated.

ADD REPLY
3
Entering edit mode
11.2 years ago

Another approach is to use BEDOPS gff2bed and the bedmap application.

The bedmap tool includes a new --echo-map-id-uniq operator, which computes a non-redundant list of mapped IDs, separated with semi-colons. This result can be piped into an awk statement that counts the number of elements split by the semi-colon delimiter.

Here's a command that does everything on one line:

$ bedmap --echo-map-id-uniq regions.bed <(gff2bed < my.gff) | awk -F ';' '{print NF}'

The result is a list of counts of unique ID field values of overlapping GFF elements, per region.

There is also the gtf2bed conversion script, which does the same thing as gff2bed but works with GTF files and uses the gene_name or gene_id field value for populating the ID column in the BED result, in that order.

ADD COMMENT

Login before adding your answer.

Traffic: 2519 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6