Getting these files from different parts of genome
2
1
2.7 years ago
A ★ 4.0k

Hi,

To run ActiveDriverWGS I need the coding or non-coding parts of the genome in BED12 format. I have found the coding part of the genome (although in txt format), but I don't know how to find the non-coding part in BED12 format, or transcription factor binding sites in BED4 format. I have contacted the developer but got no response. Any suggestions, please?

BED WGS genome SNP R • 1.1k views
3

"I will need coding or non coding parts of genome in BED12 format."

Are you sure that's what you want? The documentation says: "Regions of interest can be coding or noncoding should be in a BED12 format", so you basically need a BED file of the regions for which you want to do the analysis.

I also did not get the impression that TF binding sites are required. They might be nice to have, but for that you would have to identify the TF of interest first (and search e.g. ENCODE for the respective binding sites).

1

I would also recommend using the UCSC Table Browser to get BED output for this information.

0

If you mean this GitHub issue #10, it looks like the developer responded?

0

Thank you. How could I find the non-coding part of the genome?

For example, when downloading this software we can get the coding part of the genome (although in txt format):

wget https://bitbucket.org/bbglab/oncodrivefml/downloads/oncodrivefml-examples_v2.0.tar.gz


But I don't know where I could find the non-coding part of the genome.

0

The non-coding part of the genome is everything which is not... coding. So it would essentially be the complement of the BED file of the coding sequences. But that is unlikely to be what you need for your tool. See also Friederike's comment: you just need regions of interest.

0

Thank you, but I have already calculated driver genes for the coding part of the genome with another software. Now I need to do the same for the non-coding part, for which I need a file containing the non-coding regions of the human genome, and I don't know how to get that.

1

No, it is unlikely that your tool just expects a BED file of all non-coding regions in the human genome. But anyway, if you insist: the answer is bedtools complement.
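
For reference, bedtools complement expects a sorted interval file (-i) plus a genome file of chromosome sizes (-g), and reports every stretch of each chromosome that the input does not cover. The idea can be sketched in a few lines of Python. This is only an illustration: the function name `complement` and the coordinates are made up, and the result is plain three-column intervals, not BED12.

```python
from collections import defaultdict


def complement(regions, chrom_sizes):
    """Return the intervals on each chromosome that `regions` does not cover.

    regions: iterable of (chrom, start, end), 0-based half-open as in BED.
    chrom_sizes: dict mapping chromosome name to its length.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    gaps = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # first position not yet covered by any region
        for start, end in sorted(by_chrom[chrom]):
            if start > cursor:
                gaps.append((chrom, cursor, start))
            cursor = max(cursor, end)  # handles overlapping regions
        if cursor < size:
            gaps.append((chrom, cursor, size))
    return gaps


# Toy input with made-up coordinates (two of the regions overlap).
coding = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 500, 600)]
sizes = {"chr1": 1000}
noncoding = complement(coding, sizes)
```

Here `noncoding` holds the three gaps before, between and after the covered regions. Whether such a genome-wide complement is a sensible input for the tool is a separate question.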

0

Sorry, what is the input here when the expected output is non-coding regions in BED12?

0

Spend some time reading our comments here and the documentation of bedtools complement. I'm not coming to sit next to you and do your work.

0

:(

The same story

You only once sat next to me and did my work, when I was in Germany for an interview

you and Genomax

Thank you

1

Well, I'm sure you can figure this out :-)

0

Sorry,

The coding and non-coding regions of the human genome are likely here

I converted the GTF to BED with BEDOPS,

so I have this

chr1    29553   30039   ENSG00000243485.2   .   +   HAVANA  exon    .   gene_id "ENSG00000243485.2"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-11-001"; exon_number 1;  exon_id "ENSE00001947070.1";  level 2; tag "not_best_in_genome_evidence"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";


How could I extract the information below from this BED file? For example, the first line should become the following, and likewise for the whole file:

chr1    29553   30039   ENSG00000243485.2   +  gene_name "MIR1302-11"


I asked my question in another forum, but they closed my post :(

I tried treating my BED file as txt to extract what I want, but I got an error:

> paste(strsplit(regions.txt, "\\s+|\t|\\\"")[[1]][c(1,2,3,4,5,6,26,28)],collapse="\t")
Error in strsplit(a, "\\s+|\t|\\\"") : non-character argument
>
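
The R error above suggests that the object passed to strsplit was not a character vector. One alternative to counting split positions is to take the first six columns as they are and pull gene_name out of the attribute column by its key. A minimal Python sketch, assuming the converted file is tab-delimited with the GTF attributes in the tenth column, as in the line shown; the function name `simplify` is hypothetical and the attribute string in the example is shortened:

```python
import re

# Matches the gene_name key and captures its quoted value.
GENE_NAME = re.compile(r'gene_name "([^"]*)"')


def simplify(line):
    """Reduce one converted record to chrom, start, end, id, strand, gene name."""
    # Split on tabs only, so spaces inside the attribute column are preserved.
    fields = line.rstrip("\n").split("\t", 9)
    chrom, start, end, name, _score, strand = fields[:6]
    match = GENE_NAME.search(fields[9])
    gene_name = match.group(1) if match else "."  # "." when the key is absent
    return "\t".join([chrom, start, end, name, strand, gene_name])


# Shortened version of the record from the post, rebuilt with tab separators.
example = "\t".join([
    "chr1", "29553", "30039", "ENSG00000243485.2", ".", "+",
    "HAVANA", "exon", ".",
    'gene_id "ENSG00000243485.2"; gene_type "lincRNA"; '
    'gene_name "MIR1302-11"; exon_number 1;',
])
simplified = simplify(example)
```

Applied to every line of the file, this yields a six-column table. Looking the value up by key avoids depending on where gene_name happens to fall after splitting.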

1

You are mixing up terminology.

'Coding' and 'non-coding' are confusing terms, because they can mean multiple things. In transcriptomics people would subgroup transcripts into coding and non-coding transcripts, meaning "do these RNA molecules get translated to a protein?". Here non-coding transcript means every transcript that does not lead to a protein (as far as we know!).

In genomics, however, regions of the DNA are subgrouped into coding and non-coding, roughly meaning "does this sequence get transcribed to an RNA molecule?". Here non-coding fragment means every piece of DNA that does not lead to a transcript (as far as we know!).

I'd suggest being complete with regards to what you are looking for. I don't like the term "non-coding transcript". For me it is a "non-protein-coding transcript". The transcript is coding (=has a functional product) but it just doesn't create a protein.

It seems to me you are looking for non-coding DNA regions, while what you found on Gencode are non-protein-coding transcripts.

(Note that my comment here ignores biological noise: random transcription without function. The extent of this phenomenon is an open debate.)

0

I second everything Wouter wrote. I think we need to clarify what types of regions you actually want to look at using the tool (not what the tool says it needs, tell us what the goal of your analysis is).

0

Thank you

Does the long non-coding RNA gene annotation count as non-coding DNA regions?

0

Wouter has addressed precisely that question.

DNA:

• coding DNA = genes = the basis for RNA transcripts (in mammals, this is a small fraction of the genome!)
• non-coding genes (a misnomer!) encode RNAs that do not give rise to proteins, e.g. snoRNA, miRNA, rRNA, tRNA....
• protein-coding RNA genes
• non-coding = not genic = intergenic = no RNA transcripts (except for the transcriptional noise mentioned by Wouter)
2
2.7 years ago
A ★ 4.0k

Sorry, I finally figured out what I want.

I actually need enhancers, promoters, or other regulatory elements of the human genome, to find cancer driver genes located in these regions.

People say that

awk '$3=="transcript" && $20!="\"protein_coding\";" && $20!="\"translated_processed_pseudogene\";"' gencode.gtf

will return the non-coding transcripts. Counting the non-protein-coding transcript types gives the following:

awk '$3=="transcript" && $20!="\"protein_coding\";" {print $20}' gencode.gtf | sort | uniq -c | sort -nk1
1 "translated_processed_pseudogene";
2 "Mt_rRNA";
3 "IG_J_pseudogene";
3 "TR_D_gene";
4 "TR_J_pseudogene";
5 "TR_C_gene";
10 "IG_C_pseudogene";
18 "IG_C_gene";
18 "IG_J_gene";
22 "Mt_tRNA";
25 "3prime_overlapping_ncrna";
27 "TR_V_pseudogene";
37 "IG_D_gene";
58 "non_stop_decay";
59 "polymorphic_pseudogene";
74 "TR_J_gene";
97 "TR_V_gene";
144 "IG_V_gene";
182 "unitary_pseudogene";
196 "IG_V_pseudogene";
330 "sense_overlapping";
387 "pseudogene";
442 "transcribed_processed_pseudogene";
531 "rRNA";
802 "sense_intronic";
860 "transcribed_unprocessed_pseudogene";
1529 "snoRNA";
1923 "snRNA";
2050 "misc_RNA";
2549 "unprocessed_pseudogene";
3116 "miRNA";
9710 "antisense";
10623 "processed_pseudogene";
11780 "lincRNA";
13052 "nonsense_mediated_decay";
25955 "retained_intron";
28082 "processed_transcript";
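
The awk commands above rely on transcript_type being the 20th whitespace-separated field, which holds only while the attributes appear in exactly that order. A sketch of the same tally that looks the attribute up by key instead, assuming a standard tab-delimited GTF with quoted attribute values; the function names are hypothetical:

```python
import re
from collections import Counter

# Captures key/value pairs such as: transcript_type "lincRNA";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def transcript_type(gtf_line):
    """Return transcript_type for a GTF 'transcript' record, otherwise None."""
    if gtf_line.startswith("#"):
        return None  # header/comment line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "transcript":
        return None
    attributes = dict(ATTRIBUTE.findall(fields[8]))
    return attributes.get("transcript_type")


def count_non_protein_coding(lines):
    """Tally transcript types, skipping protein_coding records."""
    counts = Counter()
    for line in lines:
        ttype = transcript_type(line)
        if ttype is not None and ttype != "protein_coding":
            counts[ttype] += 1
    return counts


def make_record(feature, ttype):
    """Build a toy GTF line with made-up identifiers, for illustration only."""
    attrs = f'gene_id "G1"; transcript_id "T1"; transcript_type "{ttype}"; level 2;'
    return "\t".join(["chr1", "HAVANA", feature, "1", "100", ".", "+", ".", attrs])


toy = [
    "##description: toy example",
    make_record("transcript", "lincRNA"),
    make_record("transcript", "protein_coding"),
    make_record("exon", "lincRNA"),
    make_record("transcript", "miRNA"),
    make_record("transcript", "lincRNA"),
]
counts = count_non_protein_coding(toy)
```

On a real file the lines would come from `open("gencode.gtf")` rather than the toy list.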


But I am not sure which of these regions correspond to enhancers, promoters, or other regulatory elements.

1
• promoters are somewhat arbitrarily defined, for mouse usually something like 2kb upstream of the TSS plus maybe 1kb downstream or the like -- you will have to define these yourself, e.g. using awk
• enhancers are highly tissue-specific! They're typically defined by regions of open (and possibly transcribed) chromatin -- you will not find them within the GTF file of transcripts AFAIK. Ideally, you could find a paper with DNase-seq, ATAC-seq or PRO-seq data from cell types that may be similar to your sample at hand
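
The promoter definition in the first bullet can be sketched as a strand-aware window around the TSS. This is only an illustration: the 2 kb / 1 kb defaults are the rough figures mentioned above, the function name `promoter` is hypothetical, and coordinates are assumed to be BED-style (0-based, half-open).

```python
def promoter(chrom, start, end, strand, upstream=2000, downstream=1000):
    """Return a BED-style promoter window around a transcript's TSS.

    start/end are the transcript's BED coordinates (0-based, half-open).
    The window covers `upstream` bases before the TSS and `downstream`
    bases starting at the TSS, measured in the direction of transcription.
    """
    if strand == "+":
        tss = start  # first transcribed base
        p_start, p_end = tss - upstream, tss + downstream
    elif strand == "-":
        tss = end  # half-open end; the first transcribed base is end - 1
        p_start, p_end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    # Clamp at the chromosome start; the chromosome end is not checked here.
    return chrom, max(0, p_start), p_end


plus_example = promoter("chr1", 29553, 30039, "+")
minus_example = promoter("chr1", 1500, 5000, "-")
near_start = promoter("chr1", 500, 900, "+")
```

For a transcript on the minus strand the TSS sits at the high-coordinate end, so the window is mirrored. Windows that run past the end of a chromosome would still need trimming against the chromosome sizes.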
1

This is not in BED12 format, but I find it very helpful for human regulatory regions, which you appear to be looking for (I think):

See the files at ftp://ftp.ensembl.org/pub/release-95/regulation/homo_sapiens/ (use Firefox or an FTP client, not Chrome).

1
2.7 years ago
A ★ 4.0k