How to count genes in genomic regions using a GTF/GFF3 and a BED file of regions
9.5 years ago
Nick Crawford

I'd like to count the number of unique genes from a GFF file that fall within each region in a list of genomic regions. With bedtools I can count the number of GFF features overlapping each region, which is almost what I want, but not quite.

bedtools intersect -a regions.bed -b my.gff -c


UPDATE:

I should have made my question a bit more specific. I have a modified Ensembl-style GTF file (not a GFF) in which every transcript has a unique transcript ID. This means that simply counting unique values of the 9th (attributes) column of the GTF actually counts transcript IDs, not gene IDs.
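One way around this without truncating the file is to extract just the gene_id value from the attributes column and count distinct values of that. A minimal sketch (hypothetical gene/transcript IDs; assumes a tab-separated GTF):

```shell
# Two features from the same gene but different transcripts (hypothetical IDs),
# fed through an awk filter that extracts only the gene_id value from the
# 9th (attributes) column, then counts distinct gene IDs.
printf 'chr1\tsrc\ttranscript\t150\t550\t1\t-\t.\tgene_id "G1"; transcript_id "T1";\nchr1\tsrc\ttranscript\t150\t550\t1\t-\t.\tgene_id "G1"; transcript_id "T2";\n' \
  | awk -F'\t' '{ if (match($9, /gene_id "[^"]*"/)) print substr($9, RSTART+9, RLENGTH-10) }' \
  | sort -u | wc -l
# counting distinct gene_ids gives 1, even though there are 2 transcript IDs
```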

To circumvent this problem I first truncated the gtf file:

sed -e 's/;.*//' my.gff > delete.me.gtf


Then I ran the bedtools map command:

bedtools map -a regions.bed -b delete.me.gtf -c 9 -o count_distinct > counts.genes_in_windows.bed


I almost forgot to delete the intermediate file:

rm delete.me.gtf


There is probably a way to make this a one-liner, without the intermediate file, but I have a dissertation to write!
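For the record, the intermediate file can be avoided with bash process substitution. A sketch, assuming bedtools is installed and the same filenames as above:

```shell
# One-liner sketch (requires bedtools; regions.bed / my.gff as above):
#   bedtools map -a regions.bed -b <(sed 's/;.*//' my.gff) -c 9 -o count_distinct
# The sed step alone truncates the attribute column at the first semicolon,
# leaving only the gene_id field (hypothetical IDs shown here):
printf 'chr1\tsrc\ttranscript\t150\t550\t1\t-\t.\tgene_id "G1"; transcript_id "T1";\n' \
  | sed 's/;.*//'
# the line now ends at: gene_id "G1"
```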

bedtools gtf gff
9.5 years ago

Assuming your regions.bed file has 3 columns and your GFF has 9, you could use the -wa -wb options of bedtools intersect to report each gene that overlaps each region.

$ cat regions.bed
chr1    100    200
chr1    500    600
$ cat my.gff
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

$ bedtools intersect -a regions.bed -b my.gff -wa -wb
chr1    100    200    chr1    my_genes    transcript    150    550    1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";
chr1    500    600    chr1    my_genes    transcript    150    550    1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

Now, as you can see, this is a redundant list. You could just extract the last 9 columns and use UNIX sort and uniq to get a unique (non-redundant) count.

# extract just the genes
####################
$ bedtools intersect -a regions.bed -b my.gff -wa -wb | cut -f 4-12
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";
chr1    my_genes    transcript    150    550 1    -    .    gene_id "ENST00000370132.3"; transcript_id "ENST00000370132.3";

# count the distinct genes
####################
$ bedtools intersect -a regions.bed -b my.gff -wa -wb | cut -f 4-12 | sort | uniq | wc -l
1

If you want the distinct count per region, it is better to use the groupby tool in conjunction with intersect. In this case, we are grouping on the region (columns 1-3) and counting the number of distinct gene_id strings (column 12).

$ bedtools intersect -a regions.bed -b my.gff -wa -wb | bedtools groupby -i - -g 1-3 -c 12 -o count_distinct


Alternatively (preferably?) you can use the map utility - it works much the same way as what Alex nicely describes for BEDOPS. In this case, the gene_id is in the 9th column of the GFF, so the -c argument is 9.

$ bedtools map -a regions.bed -b my.gff -c 9 -o count_distinct

Comment: I don't think that quite works. I'd like to count the number of unique gene_ids that fall within each region. I'll update my question to better explain that.

Reply: I see - updated.

9.5 years ago

Another approach is to use BEDOPS gff2bed and the bedmap application. The bedmap tool includes a new --echo-map-id-uniq operator, which computes a non-redundant list of mapped IDs, separated with semicolons. This result can be piped into an awk statement that counts the number of elements split by the semicolon delimiter. Here's a command that does everything on one line:

$ bedmap --echo-map-id-uniq regions.bed <(gff2bed < my.gff) | awk -F ';' '{print NF}'


The result is a list of counts of unique ID field values of overlapping GFF elements, per region.
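The awk step simply counts the semicolon-delimited fields on each line. For example, with hypothetical ID lists standing in for bedmap's per-region output:

```shell
# Each input line stands in for one region's bedmap output:
# a semicolon-separated list of unique mapped IDs (hypothetical names).
printf 'gene1;gene2;gene3\ngene1\n' | awk -F ';' '{print NF}'
# prints 3, then 1
```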

There is also the gtf2bed conversion script, which does the same thing as gff2bed but works with GTF files, and uses the gene_name or gene_id field value (in that order of preference) to populate the ID column of the BED result.