Question: Getting these files from different parts of genome
F wrote (8 weeks ago):

Hi,

To run the ActiveDriverWGS software I need the coding or non-coding parts of the genome in BED12 format. I have found the coding part of the genome (in txt format, though), but I don't know how to find the non-coding part in BED12 format, or transcription factor binding sites in BED4 format. I have contacted the developer but got no response. Any suggestions, please?

Tags: R, snp, wgs, bed, genome

"I will need coding or non coding parts of genome in BED12 format."

Are you sure that's what you want? The documentation says: "Regions of interest can be coding or noncoding should be in a BED12 format", so you basically need a BED file of the regions for which you want to do the analysis.

I also did not get the impression that TF binding sites are required. They might be nice to have, but for that you would have to identify the TF of interest first (and search e.g. ENCODE for the respective binding sites).

written 8 weeks ago by Friederike

I don't know a lot about this software, but it appears to take "regions of interest" rather than whole-genome information: https://github.com/reimandlab/ActiveDriverWGS

I would also recommend using the UCSC Table Browser to get BED output for these regions.

written 8 weeks ago by cmdcolin

If you mean GitHub issue #10, it looks like the developer responded?

written 8 weeks ago by zx8754

Thank you. How could I find the non-coding part of the genome?

For example, when downloading this software we get the coding part of the genome (although in txt format):

wget https://bitbucket.org/bbglab/oncodrivefml/downloads/oncodrivefml-examples_v2.0.tar.gz

But I don't know where I could find the non-coding part of the genome.

modified 8 weeks ago • written 8 weeks ago by F

The non-coding part of the genome is everything which is not... coding. So it would essentially be the complement of the BED file of the coding sequences. But that is unlikely to be what you need for your tool. See also Friederike's comment: you just need regions of interest.

written 8 weeks ago by WouterDeCoster

Thank you, but I have already calculated driver genes for the coding part of the genome with another tool. Now I need to do the same for the non-coding part, for which I need a file containing the non-coding regions of the human genome, and I don't know how to get that.

written 8 weeks ago by F

No, it is unlikely that your tool just expects a bed file of all non-coding regions in the human genome. But anyway, if you insist; the answer is bedtools complement.
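
For what it's worth, bedtools complement expects a sorted interval file (-i) and a genome file of chromosome sizes (-g), and returns the stretches that no interval covers. Below is a minimal Python sketch of that same idea, not of bedtools itself. The function name, the toy coordinates, and the chromosome sizes are all made up for illustration. Note that the result is plain 3-column BED (chrom, start, end), not BED12.

```python
from collections import defaultdict


def complement(intervals, chrom_sizes):
    """Return the parts of each chromosome not covered by any interval.

    intervals   -- iterable of (chrom, start, end), 0-based half-open as in BED
    chrom_sizes -- dict mapping chromosome name to its length
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    gaps = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # first position not yet known to be covered
        for start, end in sorted(by_chrom.get(chrom, [])):
            if start > cursor:
                gaps.append((chrom, cursor, start))
            cursor = max(cursor, end)  # overlapping intervals merge here
        if cursor < size:
            gaps.append((chrom, cursor, size))
    return gaps


# Toy data for illustration only, not real annotation.
coding = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 500, 600)]
sizes = {"chr1": 1000, "chr2": 400}

noncoding = complement(coding, sizes)
for chrom, start, end in noncoding:
    print(chrom, start, end, sep="\t")
```

A chromosome with no intervals at all (chr2 in the toy data) comes back as one gap spanning its whole length, which is why the chromosome sizes are a required input.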

written 8 weeks ago by WouterDeCoster

Sorry, what is the input here if the expected output is the non-coding regions in BED12?

written 8 weeks ago by F

Spend some time reading our comments here and the documentation of bedtools complement. I'm not coming to sit next to you and do your work.

written 8 weeks ago by WouterDeCoster

:(

The same story. You only once sat next to me and did my work, when I was in Germany for an interview, you and Genomax.

Thank you

written 8 weeks ago by F

Well, I'm sure you can figure this out :-)

written 8 weeks ago by WouterDeCoster

Sorry,

The coding and non-coding regions of the human genome are probably here:

https://www.gencodegenes.org/human/release_19.html

I converted the GTF to BED with bedops, so I now have this:

chr1    29553   30039   ENSG00000243485.2   .   +   HAVANA  exon    .   gene_id "ENSG00000243485.2"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-11-001"; exon_number 1;  exon_id "ENSE00001947070.1";  level 2; tag "not_best_in_genome_evidence"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

How could I extract the information below from this BED file for every line, as shown here for the first line?

chr1    29553   30039   ENSG00000243485.2   +  gene_name "MIR1302-11"

I asked my question in another forum, but they closed my post :(

I tried reading my BED file as text to extract what I want, but I got an error:

> paste(strsplit(regions.txt, "\\s+|\t|\\\"")[[1]][c(1,2,3,4,5,6,26,28)],collapse="\t")
Error in strsplit(a, "\\s+|\t|\\\"") : non-character argument
>
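
On the error itself: strsplit only accepts character vectors, so "non-character argument" suggests that what was passed in was not the text of the file (for example a data frame rather than lines of text). As a rough sketch of the extraction, here is the same idea in Python. It assumes each line starts with six whitespace-separated fields (chrom, start, end, id, score, strand) followed by the GTF attributes, as in the line shown above; the function name is made up for illustration, and the example line is shortened from the one in this post.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def simplify(line):
    """Reduce one gtf2bed-style line to chrom, start, end, id, strand, gene name."""
    # Split off the first six fields; everything after them is kept as one string.
    chrom, start, end, gene_id, _score, strand, rest = line.split(None, 6)
    # Look the attribute up by name instead of by position.
    match = GENE_NAME.search(rest)
    gene_name = match.group(1) if match else "."
    return "\t".join([chrom, start, end, gene_id, strand, gene_name])


# Shortened copy of the line from this post, joined with tabs.
example = "\t".join([
    "chr1", "29553", "30039", "ENSG00000243485.2", ".", "+",
    "HAVANA", "exon", ".",
    'gene_id "ENSG00000243485.2"; transcript_id "ENST00000473358.1"; '
    'gene_type "lincRNA"; gene_name "MIR1302-11"; exon_number 1;',
])

print(simplify(example))
```

Searching for gene_name by name avoids counting columns, which breaks as soon as a record has an extra or missing attribute. To process a whole file, call simplify on each line of the opened file.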
modified 8 weeks ago • written 8 weeks ago by F

You are mixing up terminology.

'Coding' and 'non-coding' are confusing terms, because they can mean multiple things. In transcriptomics people would subgroup transcripts into coding and non-coding transcripts, meaning "do these RNA molecules get translated to a protein?". Here non-coding transcript means every transcript that does not lead to a protein (as far as we know!).

In genomics, however, regions of the DNA are subgrouped into coding and non-coding, roughly meaning "does this sequence get transcribed to an RNA molecule?". Here non-coding fragment means every piece of DNA that does not lead to a transcript (as far as we know!).

I'd suggest being complete with regards to what you are looking for. I don't like the term "non-coding transcript". For me it is a "non-protein-coding transcript". The transcript is coding (=has a functional product) but it just doesn't create a protein.

It seems to me you are looking for non-coding DNA regions, while what you found on Gencode are non-protein-coding transcripts.

(Note that my comment here ignores biological noise: random transcription without function. The extent of this phenomenon is an open debate.)

written 8 weeks ago by WouterDeCoster

I second everything Wouter wrote. I think we need to clarify what types of regions you actually want to look at using the tool (not what the tool says it needs, tell us what the goal of your analysis is).

written 8 weeks ago by Friederike

Thank you

Does the long non-coding RNA gene annotation correspond to non-coding DNA regions?

modified 8 weeks ago • written 8 weeks ago by F

Wouter has addressed precisely that question.

DNA:

  • coding DNA = genes = the basis for RNA transcripts (in mammals, this is a small fraction of the genome!)
    • non-coding genes (a misnomer!) encode RNAs that do not give rise to proteins, e.g. snoRNA, miRNA, rRNA, tRNA...
    • protein-coding RNA genes
  • non-coding = not genic = intergenic = no RNA transcripts (except for the transcriptional noise mentioned by Wouter)
modified 8 weeks ago • written 8 weeks ago by Friederike

F wrote (8 weeks ago):

Sorry, I finally worked out what I need.

I actually need enhancers, promoters, or other regulatory elements of the human genome, to find cancer driver genes located in these regions.

People say that

 awk '$3=="transcript" && 
     $20!="\"protein_coding\";" &&
     $20!="\"translated_processed_pseudogene\";"' gencode.gtf

will return the non-coding regions, for example:

awk '$3=="transcript" && $20!="\"protein_coding\";"{print $20}' gencode.gtf  | sort | uniq -c | sort -nk1
      1 "translated_processed_pseudogene";
      2 "Mt_rRNA";
      3 "IG_J_pseudogene";
      3 "TR_D_gene";
      4 "TR_J_pseudogene";
      5 "TR_C_gene";
     10 "IG_C_pseudogene";
     18 "IG_C_gene";
     18 "IG_J_gene";
     22 "Mt_tRNA";
     25 "3prime_overlapping_ncrna";
     27 "TR_V_pseudogene";
     37 "IG_D_gene";
     58 "non_stop_decay";
     59 "polymorphic_pseudogene";
     74 "TR_J_gene";
     97 "TR_V_gene";
    144 "IG_V_gene";
    182 "unitary_pseudogene";
    196 "IG_V_pseudogene";
    330 "sense_overlapping";
    387 "pseudogene";
    442 "transcribed_processed_pseudogene";
    531 "rRNA";
    802 "sense_intronic";
    860 "transcribed_unprocessed_pseudogene";
   1529 "snoRNA";
   1923 "snRNA";
   2050 "misc_RNA";
   2549 "unprocessed_pseudogene";
   3116 "miRNA";
   9710 "antisense";
  10623 "processed_pseudogene";
  11780 "lincRNA";
  13052 "nonsense_mediated_decay";
  25955 "retained_intron";
  28082 "processed_transcript";
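
The awk filter above relies on transcript_type being the 20th whitespace-separated field, which appears to hold for this GENCODE release but breaks if the attribute order changes. A small Python sketch that looks the attribute up by name instead is shown here; the function name and the toy records are made up for illustration and are not real GENCODE lines.

```python
import re
from collections import Counter

TRANSCRIPT_TYPE = re.compile(r'transcript_type "([^"]+)"')


def count_transcript_types(gtf_lines, exclude=("protein_coding",)):
    """Count transcript records per transcript_type, skipping excluded types."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):  # GTF header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_TYPE.search(fields[8])
        if match and match.group(1) not in exclude:
            counts[match.group(1)] += 1
    return counts


# Toy records for illustration only.
toy = [
    "##description: toy lines, not real GENCODE records",
    'chr1\tSRC\ttranscript\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_type "lincRNA";',
    'chr1\tSRC\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_type "lincRNA";',
    'chr1\tSRC\ttranscript\t300\t400\t.\t-\t.\tgene_id "G2"; transcript_type "protein_coding";',
    'chr2\tSRC\ttranscript\t10\t90\t.\t+\t.\tgene_id "G3"; transcript_type "miRNA";',
    'chr2\tSRC\ttranscript\t500\t900\t.\t+\t.\tgene_id "G4"; transcript_type "lincRNA";',
]

counts = count_transcript_types(toy)
for transcript_type, n in sorted(counts.items(), key=lambda item: item[1]):
    print(n, transcript_type)
```

To run it on a real annotation, pass an open file handle for the GTF in place of the toy list.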

But I am not sure which of these regions are related to enhancers, promoters, or other regulatory elements.

written 8 weeks ago by F
  • promoters are somewhat arbitrarily defined, for mouse usually something like 2kb upstream of the TSS plus maybe 1kb downstream or the like -- you will have to define these by yourself, e.g. using awk
  • enhancers are highly tissue-specific! they're typically defined by regions of open (and possibly transcribed) chromatin -- you will not find them within the GTF file of transcripts AFAIK, ideally, you could find a paper of DNase or ATAC-seq or PRO-seq using cell types that may be similar to your sample at hand
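
As a rough illustration of the promoter rule of thumb above, here is a Python sketch that turns a transcription start site into a strand-aware window. The 2 kb upstream and 1 kb downstream defaults are just the numbers mentioned in this comment, not a standard, and the function name and coordinates are made up for illustration. It assumes the TSS is given as a 0-based position; in a GTF, which is 1-based, that would be start - 1 for a transcript on the + strand and end - 1 for one on the - strand.

```python
def promoter_window(chrom, tss, strand, chrom_size, upstream=2000, downstream=1000):
    """Return a BED-style (chrom, start, end, strand) window around a TSS.

    tss is a 0-based position. "Upstream" means lower coordinates on the
    + strand and higher coordinates on the - strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # Mirror image of the + strand case, still half-open.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    # Clip to the chromosome so the window stays valid near the ends.
    return chrom, max(0, start), min(chrom_size, end), strand


# Toy coordinates for illustration only.
plus = promoter_window("chr1", 5000, "+", 100000)
minus = promoter_window("chr1", 5000, "-", 100000)
clipped = promoter_window("chr1", 500, "+", 100000)

for window in (plus, minus, clipped):
    print(*window, sep="\t")
```

Both strands give a window of the same length (upstream + downstream bases) that includes the TSS itself, unless clipping at a chromosome end shortens it.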
written 8 weeks ago by Friederike

This is not in BED12 format, but I find it very helpful for human regulatory regions, which you appear to be looking for (I think):

See the files at ftp://ftp.ensembl.org/pub/release-95/regulation/homo_sapiens/ (use Firefox or an FTP client, not Chrome).

written 7 weeks ago by colindaven