Question: getting reads counts at RefSeq genes
0
gravatar for sumithrasank75
4.7 years ago by
United States
sumithrasank75140 wrote:

I have a sorted bam file. I was wondering how to get the read counts which are near refseq genes. I used bedtools to get a bed file : bedtools bamtobed -i sorted.bam > sorted.bed I am wondering how to merge this with Refseq gene coordinates.

alignment gene • 1.2k views
ADD COMMENTlink modified 4.7 years ago by Amitm2.1k • written 4.7 years ago by sumithrasank75140
1
gravatar for Amitm
4.7 years ago by
Amitm2.1k
UK
Amitm2.1k wrote:

For generating read count data, you need to use tools like HTSeq-count or featureCounts. Also the gene annotation (of your organism of interest) in GTF format. Converting aligned BAM to bed is not required.

ADD COMMENTlink written 4.7 years ago by Amitm2.1k

While not totally clear there is this requirement in original question

how to get the read counts which are near refseq genes

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by GenoMax96k
1

hi genomax2,

I noted "near refseq genes" but discounted it assuming a matter of semantics. Nonetheless if the user actually wants to get read counts "near refseq genes" and not "within refseq genes", then a threshold is needed, lets say a flank of 1kb. Then a custom GTF file can be made which would have instead of gene coordinates, flank regions. Like this, for 1kb upstream region -

sample of the file (Ensembl GTF)

$ grep -v '^#' Homo_sapiens.GRCh38.84.gtf |awk -F'\t' 'BEGIN{OFS="\t";} $3=="gene"' |head -n 3
1   havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";
1   havana  gene    14404   29570   .   -   .   gene_id "ENSG00000227232"; gene_version "5"; gene_name "WASH7P"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; havana_gene "OTTHUMG00000000958"; havana_gene_version "1";
1   ensembl gene    17369   17436   .   -   .   gene_id "ENSG00000278267"; gene_version "1"; gene_name "MIR6859-1"; gene_source "ensembl"; gene_biotype "miRNA";

Replacing gene coordinates with upstream flank (1kb) -

$ grep -v '^#' Homo_sapiens.GRCh38.84.gtf |awk -F'\t' 'BEGIN{OFS="\t";} $3=="gene" {print $1,$2,$3,$4-1000,$4,$6,$7,$8,$9}' |head -n 3

1   havana  gene    10869   11869   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";
1   havana  gene    13404   14404   .   -   .   gene_id "ENSG00000227232"; gene_version "5"; gene_name "WASH7P"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; havana_gene "OTTHUMG00000000958"; havana_gene_version "1";
1   ensembl gene    16369   17369   .   -   .   gene_id "ENSG00000278267"; gene_version "1"; gene_name "MIR6859-1"; gene_source "ensembl"; gene_biotype "miRNA";

Working only with gene definition lines of the GTF. The resulting custom GTF can then be fed to HTSeq-count/ featureCounts.

htseq-count \
-f bam \
-r pos \
-s no \
-t gene \
-i gene_id \
Aligned.sortedByCoord.out.bam \
Homo_sapiens.GRCh38.84_Upst1kbFlank_Clean.gtf \
>Upst1kbFlank_Counts

Sinji has already suggested use of bedtools slop. That would be okay I think when gene + Flank count is needed. Nonetheless there must be easier way of getting counts from gene flanks only using BED file gene anno. & bedtools or bedops. The only advantage of sticking with HTSeq-count/ featureCounts, I find, is the handling of overlapping reads as noted in the image here.

(Have tried editing the GTF snippets for clarity)

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Amitm2.1k

If we're interested in only the Flank count then you can go ahead and use bedtools flank which will only return flank information and not gene information. Might need to do a bit of awk here, I haven't used flank in a while, but it's fairly simple. Of course your way is probably much easier assuming we are using the same flank region on both sides. If not strand information must be taken into account.

ADD REPLYlink written 4.7 years ago by Sinji3.1k

In that case he could use bedtools slop to extend RefSeq genes X amount of bp and then use the above answer i'd assume.

ADD REPLYlink written 4.7 years ago by Sinji3.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1497 users visited in the last hour
_