Question

getting reads counts at RefSeq genes

0

Entering edit mode

9.1 years ago

sumithrasank75 ▴ 140

I have a sorted bam file. I was wondering how to get the read counts which are near refseq genes. I used bedtools to get a bed file : bedtools bamtobed -i sorted.bam > sorted.bed I am wondering how to merge this with Refseq gene coordinates.

alignment gene • 2.2k views

ADD COMMENT • link updated 9.1 years ago by Amitm ★ 2.3k • written 9.1 years ago by sumithrasank75 ▴ 140

score 1 · Answer 1 · 2016-06-21

1

Entering edit mode

9.1 years ago

Amitm ★ 2.3k

For generating read count data, you need to use tools like HTSeq-count or featureCounts. Also the gene annotation (of your organism of interest) in GTF format. Converting aligned BAM to bed is not required.

ADD COMMENT • link 9.1 years ago by Amitm ★ 2.3k

0

Entering edit mode

While not totally clear there is this requirement in original question

how to get the read counts which are near refseq genes

ADD REPLY • link 9.1 years ago by GenoMax 152k

1

Entering edit mode

hi genomax2,

I noted "near refseq genes" but discounted it assuming a matter of semantics. Nonetheless if the user actually wants to get read counts "near refseq genes" and not "within refseq genes", then a threshold is needed, lets say a flank of 1kb. Then a custom GTF file can be made which would have instead of gene coordinates, flank regions. Like this, for 1kb upstream region -

sample of the file (Ensembl GTF)

$ grep -v '^#' Homo_sapiens.GRCh38.84.gtf |awk -F'\t' 'BEGIN{OFS="\t";} $3=="gene"' |head -n 3
1   havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";
1   havana  gene    14404   29570   .   -   .   gene_id "ENSG00000227232"; gene_version "5"; gene_name "WASH7P"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; havana_gene "OTTHUMG00000000958"; havana_gene_version "1";
1   ensembl gene    17369   17436   .   -   .   gene_id "ENSG00000278267"; gene_version "1"; gene_name "MIR6859-1"; gene_source "ensembl"; gene_biotype "miRNA";

Replacing gene coordinates with upstream flank (1kb) -

$ grep -v '^#' Homo_sapiens.GRCh38.84.gtf |awk -F'\t' 'BEGIN{OFS="\t";} $3=="gene" {print $1,$2,$3,$4-1000,$4,$6,$7,$8,$9}' |head -n 3

1   havana  gene    10869   11869   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";
1   havana  gene    13404   14404   .   -   .   gene_id "ENSG00000227232"; gene_version "5"; gene_name "WASH7P"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; havana_gene "OTTHUMG00000000958"; havana_gene_version "1";
1   ensembl gene    16369   17369   .   -   .   gene_id "ENSG00000278267"; gene_version "1"; gene_name "MIR6859-1"; gene_source "ensembl"; gene_biotype "miRNA";

Working only with gene definition lines of the GTF. The resulting custom GTF can then be fed to HTSeq-count/ featureCounts.

htseq-count \
-f bam \
-r pos \
-s no \
-t gene \
-i gene_id \
Aligned.sortedByCoord.out.bam \
Homo_sapiens.GRCh38.84_Upst1kbFlank_Clean.gtf \
>Upst1kbFlank_Counts

Sinji has already suggested use of bedtools slop. That would be okay I think when gene + Flank count is needed. Nonetheless there must be easier way of getting counts from gene flanks only using BED file gene anno. & bedtools or bedops. The only advantage of sticking with HTSeq-count/ featureCounts, I find, is the handling of overlapping reads as noted in the image here.

(Have tried editing the GTF snippets for clarity)

ADD REPLY • link 9.1 years ago by Amitm ★ 2.3k

0

Entering edit mode

If we're interested in only the Flank count then you can go ahead and use bedtools flank which will only return flank information and not gene information. Might need to do a bit of awk here, I haven't used flank in a while, but it's fairly simple. Of course your way is probably much easier assuming we are using the same flank region on both sides. If not strand information must be taken into account.

ADD REPLY • link 9.1 years ago by Sinji ★ 3.2k

0

Entering edit mode

In that case he could use bedtools slop to extend RefSeq genes X amount of bp and then use the above answer i'd assume.

ADD REPLY • link 9.1 years ago by Sinji ★ 3.2k