Question

Hg19 regions for Intergenic, Promoters, Enhancer, Exon, Intron, 5-UTR, 3-UTR and so on

3

Entering edit mode

10.0 years ago

Shicheng Guo ★ 9.6k

Hi guys,

I want to count the number for each feature: enhancer, promoter, exon, intron and so on for my bed file. Is there any prepared bed tables which can be used to make the intersect and do the counts statistic.

Perl or R script and implemented file of Java or python would be appreciated.
Is there more professional pipelines to classify batch of genome intervals into different categories? such as enhancer, promoter, exon, intron, miRNA, lncRNA, intergenic and so on?
now we can kg2bed, gff2bed, is there any tool called refgene2bed?

BTW:

The categories of the features in Pierre Lindenbaum is too limited. More features such as downstream, intergenic are needed. Martombo mentioned we can use bed2bam to transfer bed to bam and then use read_distribution.py in RSeQC to do this job

Best regards

genome • 14k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.0 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

you may try RSeQC read_distribution.py as a start: http://dldcc-web.brc.bcm.edu/lilab/liguow/CGI/rseqc/_build/html/#read-distribution-py I'm not sure where you can find information about enhancers though

ADD REPLY • link 10.0 years ago by Martombo ★ 3.2k

0

Entering edit mode

JVARKIT mentioned by Pierre Lindenbaum is pretty cool. However the installation process of JVARKIT is complicated. Anyway, Now we can find that RSeQC and JVARKIT are suitable for such bioinformatic analysis. In addition, we can find that Python and Java have become the main source for bioinformatics research. It seems that Perl has been out of date.

obviously, gff2bed is the best choice to do it. this tools is C/C++ based. and can be download from the following website: https://github.com/bedops/bedops/releases

Thanks

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 10.0 years ago by Shicheng Guo ★ 9.6k

Ram · Answer 1 · 2015-07-21

10

Entering edit mode

10.0 years ago

Pierre Lindenbaum 166k

I wrote a tool to convert UCSC knownGene to bed: https://github.com/lindenb/jvarkit/wiki/KnownGenesToBed

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" |\
  gunzip -c |\
  java -jar dist/kg2bed.jar
chr1    11873   14409   +   uc001aaa.3  TRANSCRIPT  uc001aaa.3
chr1    11873   12227   +   uc001aaa.3  EXON    Exon 1
chr1    12227   12612   +   uc001aaa.3  INTRON  Intron 1
chr1    11873   12227   +   uc001aaa.3  UTR UTR3
chr1    12612   12721   +   uc001aaa.3  EXON    Exon 2
chr1    12721   13220   +   uc001aaa.3  INTRON  Intron 2
chr1    12612   12721   +   uc001aaa.3  UTR UTR3
chr1    13220   14409   +   uc001aaa.3  EXON    Exon 3
chr1    13220   14409   +   uc001aaa.3  UTR UTR3
chr1    11873   14409   +   uc010nxr.1  TRANSCRIPT  uc010nxr.1
chr1    11873   12227   +   uc010nxr.1  EXON    Exon 1
chr1    12227   12645   +   uc010nxr.1  INTRON  Intron 1

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 10.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

The following is right example:

chr1    11873   14409   +       uc001aaa.3      TRANSCRIPT      uc001aaa.3
chr1    11873   12227   +       uc001aaa.3      EXON    Exon 1
chr1    12227   12612   +       uc001aaa.3      INTRON  Intron 1
chr1    11873   12227   +       uc001aaa.3      UTR     UTR3
chr1    12612   12721   +       uc001aaa.3      EXON    Exon 2
chr1    12721   13220   +       uc001aaa.3      INTRON  Intron 2
chr1    12612   12721   +       uc001aaa.3      UTR     UTR3
chr1    13220   14409   +       uc001aaa.3      EXON    Exon 3
chr1    13220   14409   +       uc001aaa.3      UTR     UTR3
chr1    11873   14409   +       uc010nxr.1      TRANSCRIPT      uc010nxr.1
chr1    11873   12227   +       uc010nxr.1      EXON    Exon 1
chr1    12227   12645   +       uc010nxr.1      INTRON  Intron 1
chr1    11873   12227   +       uc010nxr.1      UTR     UTR3
chr1    12645   12697   +       uc010nxr.1      EXON    Exon 2
chr1    12697   13220   +       uc010nxr.1      INTRON  Intron 2
chr1    12645   12697   +       uc010nxr.1      UTR     UTR3
chr1    13220   14409   +       uc010nxr.1      EXON    Exon 3
chr1    13220   14409   +       uc010nxr.1      UTR     UTR3

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.0 years ago by Shicheng Guo ★ 9.6k

0

Entering edit mode

cut -f 6 genome_feature.bed | sort | uniq
CDS
EXON
INTRON
TRANSCRIPT
UTR

Only have 5 features, I am curious about how can I get gene and intergenic regions? How to differ 5' UTR and 3' UTR?

The tools is really awesome. Thanks.

ADD REPLY • link 7.8 years ago by wangshx ▴ 10

Ram · Answer 2 · 2015-07-23

How to download the most similar annotation file as the author required from UCSC browser directly.

Go to UCSC brower and Tool, Table caterogy. and pick you reference genome and select right version under clade/genome/assembly
Make sure the group is "Genes and Gene Predictions"
Choose your preferred track (RefSeq/RefGene or UCSC gene/KnownGene)
Choose the table that gives gene information (RefSeq or KnownGene)
Select your region or the entire genome to get coordinates for
Select BED format as your output format
Name your output file
Click "get output"

Be careful, the ouput files don't have exon, intron, integenic, 5-UTR, 3-UTR informatics if you save it as a single file. You can save them as separated files so that you know the information for each subset.

Hopefully, I have complete the transformation process, you can download the file which I created: hg19_refGene.segment.bed

The format of the file is as the following which is more readable than the previous ones.

chr1    49090   65090   NM_001005484    OR4F5   +       Enhancer        69090
chr1    65090   70090   NM_001005484    OR4F5   +       Promoter        69090
chr1    69090   69090   NM_001005484    OR4F5   +       UTR5    69090
chr1    69090   70008   NM_001005484    OR4F5   +       Exon1   69090
chr1    70008   70008   NM_001005484    OR4F5   +       UTR3    69090
chr1    70008   72008   NM_001005484    OR4F5   +       Downstream      69090
chr1    347658  363658  NM_001005221    OR4F29  +       Enhancer        367658
chr1    363658  368658  NM_001005221    OR4F29  +       Promoter        367658
chr1    367658  367658  NM_001005221    OR4F29  +       UTR5    367658
chr1    367658  368595  NM_001005221    OR4F29  +       Exon1   367658
chr1    368595  368595  NM_001005221    OR4F29  +       UTR3    367658
chr1    368595  370595  NM_001005221    OR4F29  +       Downstream      367658
chr1    347658  363658  NM_001005224    OR4F3   +       Enhancer        367658
chr1    363658  368658  NM_001005224    OR4F3   +       Promoter        367658
chr1    367658  367658  NM_001005224    OR4F3   +       UTR5    367658
chr1    367658  368595  NM_001005224    OR4F3   +       Exon1   367658
chr1    368595  368595  NM_001005224    OR4F3   +       UTR3    367658
chr1    368595  370595  NM_001005224    OR4F3   +       Downstream      367658
chr1    347658  363658  NM_001005277    OR4F16  +       Enhancer        367658
chr1    363658  368658  NM_001005277    OR4F16  +       Promoter        367658
chr1    367658  367658  NM_001005277    OR4F16  +       UTR5    367658
chr1    367658  368595  NM_001005277    OR4F16  +       Exon1   367658
chr1    368595  368595  NM_001005277    OR4F16  +       UTR3    367658
chr1    368595  370595  NM_001005277    OR4F16  +       Downstream      367658
chr1    619097  621097  NM_001005221    OR4F29  -       Downstream      622034
chr1    621097  621097  NM_001005221    OR4F29  -       UTR3    622034
chr1    621097  622034  NM_001005221    OR4F29  -       Exon1   622034
chr1    621034  626034  NM_001005221    OR4F29  -       Promoter        622034
chr1    626034  642034  NM_001005221    OR4F29  -       Enhancer        622034

Ram · Answer 3 · 2015-07-21

read_distribution.py in RSeQC can do such job, however, the input file must be bam. this would decrease the application.

read_distribution.py  -i Pairend_Human_hg19.bam -r hg19.refseq.bed

Group            Total_bases    Tag_count    Tags/Kb
CDS_Exons        33302033       20002271     600.63
5’UTR_Exons      21717577       4408991      203.01
3’UTR_Exons      15347845       3643326      237.38
Introns          1132597354     6325392      5.58
TSS_up_1kb       17957047       215331       11.99
TSS_up_5kb       81621382       392296       4.81
TSS_up_10kb      149730983      769231       5.14
TES_down_1kb     18298543       266161       14.55
TES_down_5kb     78900674       729997       9.25
TES_down_10kb    140361190      896882       6.39

Ram · Answer 4 · 2015-07-21

Start with downloading annotations and converting them to BED with gff2bed:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | gff2bed - \
    > annotations.bed

Filter them for annotation subcategories, e.g.:

$ grep -F "exon" annotations.bed > exons.bed

Use BEDOPS bedmap --count to count exons over your BED-formatted regions of interest:

$ bedmap --echo --count regions_of_interest.bed exons.bed > number_of_exons_over_roi.bed

Repeat for other annotation subcategories.

If you need to do these operations inside a Perl or Python script, you could use system() or subprocess calls.