Question: TSS metaprofile using Deeptools
mickey_95 wrote (4 months ago):

Hi,

I am trying to use Deeptools' computeMatrix and plotHeatmap functions to create TSS-centered metaprofiles over all genes. Ultimately, I would like to apply some additional filters, e.g. protein-coding, non-overlapping TSSs, above a certain size. But for an initial trial, I decided to just filter Ensembl's GTF annotation file for genes:

awk 'BEGIN{FS=OFS="\t"} $3 == "gene"' GRCm38.99.gtf > test1.gtf
head -n 5 GRCm38.99.gtf | cat - test1.gtf > test.gtf # for re-adding the GTF "header"
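
A possible single-pass variant of the filter above, as a sketch: it keeps comment lines (those starting with `#`) together with the gene records, so the header does not have to be re-added with `head` and nothing depends on the header being exactly five lines long. The second command hints at the protein-coding filter mentioned earlier; it assumes Ensembl-style `gene_biotype` attributes in column 9. The file names (`demo.gtf` etc.) and the records are made up for illustration.

```shell
# Build a tiny made-up GTF (hypothetical records, for illustration only).
printf '#!genome-build GRCm38.p6\n' > demo.gtf
printf 'chr1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n' >> demo.gtf
printf 'chr1\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> demo.gtf
printf 'chr1\thavana\tgene\t2000\t2500\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";\n' >> demo.gtf

# Keep header/comment lines and gene records in one pass.
awk 'BEGIN{FS=OFS="\t"} /^#/ || $3 == "gene"' demo.gtf > demo.genes.gtf

# Same, but additionally restricted to protein-coding genes.
awk 'BEGIN{FS=OFS="\t"} /^#/ || ($3 == "gene" && $9 ~ /gene_biotype "protein_coding"/)' demo.gtf > demo.pc.gtf

# Show how many lines each output contains.
wc -l demo.genes.gtf demo.pc.gtf
```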

I then use computeMatrix:

computeMatrix reference-point \
  --referencePoint TSS \
  --scoreFileName input.bw \
  --regionsFileName test.gtf \
  -out TSSmeta.gz \
  --beforeRegionStartLength 2000 --afterRegionStartLength 2000 \
  --binSize 20 \
  --missingDataAsZero \
  --sortRegions no

This resulted in the following error:

RuntimeError: None of the input BED/GTF files had valid regions

From what I understand, the problem stems from the absence of transcript features in the 3rd column of the GTF file. Hence, I tried running the above computeMatrix command, but including --metagene, which resulted in the same error. I also tried setting --transcriptID gene --exonID gene with no success.

I would really appreciate help on this!

2nelly wrote (4 months ago):

If you can post the first lines of your test.gtf, we can help you. Alternatively, you can try converting the GTF to BED format.

For me the bed format below works like a charm:

chr1    2985742 3355185 PRDM16  369443  +
chr1    6845384 7829766 CAMTA1  984382  +
chr1    8412464 8877699 RERE    465235  -
chr1    10696661    10856733    CASZ1   160072  -

Columns: chromosome, start, end, gene_symbol, length, strand


Here are the first lines of the gtf file:

chr1    ensembl_havana  gene    5588466 5606131 .   +   .   gene_id "ENSMUSG00000025905"; gene_version "14"; gene_name "Oprk1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
chr1    ensembl_havana  gene    6206197 6276648 .   +   .   gene_id "ENSMUSG00000025907"; gene_version "14"; gene_name "Rb1cc1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
chr1    ensembl_havana  gene    6359218 6394731 .   +   .   gene_id "ENSMUSG00000087247"; gene_version "3"; gene_name "Alkal1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";

I was hoping to directly use the gtf file without having to convert to a bed file to avoid having to switch between 1- and 0-based coordinates.
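
For reference, the switch from 1-based GTF to 0-based BED coordinates only affects the start column: BED start = GTF start - 1, while the end stays the same because BED ends are exclusive. A minimal sketch of such a conversion with awk, assuming Ensembl-style `gene_name` attributes in column 9; the file names and records are made up for illustration:

```shell
# Two made-up gene records in GTF (1-based, inclusive coordinates).
printf 'chr1\thavana\tgene\t101\t200\t.\t+\t.\tgene_id "G1"; gene_name "GeneA";\n' > demo2.gtf
printf 'chr1\thavana\tgene\t301\t500\t.\t-\t.\tgene_id "G2"; gene_name "GeneB";\n' >> demo2.gtf

# GTF -> BED6: subtract 1 from the start, keep the end unchanged.
# Column 4 is the gene name ("." if absent), column 5 the region length.
awk 'BEGIN{FS=OFS="\t"}
     !/^#/ && $3 == "gene" {
       name = "."
       if (match($9, /gene_name "[^"]+"/))
         name = substr($9, RSTART + 11, RLENGTH - 12)
       print $1, $4 - 1, $5, name, $5 - ($4 - 1), $7
     }' demo2.gtf > demo2.bed

cat demo2.bed
```

The column layout mirrors the BED example given earlier in the thread (name in column 4, length in column 5, strand in column 6).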

I have now tried:

computeMatrix reference-point \
  --referencePoint TSS \
  --scoreFileName input.bw \
  --regionsFileName test.gtf \
  -out TSSmeta.gz \
  --beforeRegionStartLength 2000 --afterRegionStartLength 2000 \
  --binSize 20 \
  --missingDataAsZero \
  --sortRegions no \
  --transcriptID gene \
  --transcript_id_designator gene_id

It ran through without errors and the result from plotHeatmap looks reasonable. But now I am doubting whether setting --transcriptID gene and --transcript_id_designator gene_id is correct in this specific case (I am trying to get the signal over entire genes instead of transcripts).

(reply written 4 months ago by mickey_95)

The error most likely occurred because you didn't include any entry with "transcript" in the 3rd column in your test.gtf.
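
One way to check this, as a sketch: tally the feature types found in column 3 of the filtered file. By default computeMatrix only treats records whose feature is "transcript" as regions (the default of --transcriptID), so a file whose tally lists only "gene" yields no valid regions unless --transcriptID is changed. The file name and records below are made up for illustration.

```shell
# A made-up filtered GTF that contains only gene records.
printf '#!genome-build GRCm38.p6\n' > demo3.gtf
printf 'chr1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n' >> demo3.gtf
printf 'chr1\thavana\tgene\t2000\t2500\t.\t-\t.\tgene_id "G2";\n' >> demo3.gtf

# Count records per feature type (column 3), skipping comment lines.
awk -F'\t' '!/^#/ {n[$3]++} END {for (f in n) print f, n[f]}' demo3.gtf | sort > demo3.counts

cat demo3.counts
```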

"But now I am doubting whether setting --transcriptID gene and --transcript_id_designator gene_id is correct in this specific case (trying to get the signal over entire genes instead of transcripts)"

Why are you in doubt? Seems like you want to focus on genes rather than transcripts.

(reply written 4 months ago by Friederike)

If you want genes then that will work fine. Just remember that genes are just groups of transcripts when looking at the results.
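
To make the gene-level reference point concrete, here is a sketch of the usual strand-aware TSS convention: the record's start on the + strand, its end on the - strand. Since a gene record spans all of its transcripts, this is the most upstream TSS among them. This only illustrates the convention and is not deepTools' internal code; the file name and records are made up.

```shell
# Two made-up gene records, one per strand.
printf 'chr1\thavana\tgene\t101\t200\t.\t+\t.\tgene_id "G1";\n' > demo4.gtf
printf 'chr1\thavana\tgene\t301\t500\t.\t-\t.\tgene_id "G2";\n' >> demo4.gtf

# Report the 1-based TSS position per gene record:
# start (column 4) on "+", end (column 5) on "-".
awk 'BEGIN{FS=OFS="\t"}
     $3 == "gene" {
       tss = ($7 == "-") ? $5 : $4
       print $1, tss, $7
     }' demo4.gtf > demo4.tss

cat demo4.tss
```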

(reply written 4 months ago by Devon Ryan)

zhuobaowen wrote (3 months ago):

You can just download a GTF file from Ensembl or UCSC and use that. computeMatrix will then figure out where the TSS and TES of each transcript are. (Originally written 3.3 years ago by Devon Ryan.)
