Question: How do I filter out transcripts with less than two exons
0
gravatar for alex.v.nesta
17 months ago by
alex.v.nesta10
alex.v.nesta10 wrote:

Hi,

I have a GFF file containing MCF-7 cell transcript data. There is also a fastq file if that can be helpful.

chr1 PacBio transcript 27567 29338 . - . gene_id "PB2015.1"; transcript_id "PB2015.1.1"; chr1 PacBio exon 27567 29338 . - . gene_id "PB2015.1"; transcript_id "PB2015.1.1";

These are the first two lines.

You can see that it identifies both transcripts and exons. I want to filter out any transcripts containing less than two exons for each transcript.

Additional Info:

I have attempted doing this with UCSC Table Browser and Galaxy. They both end up throwing errors.

Galaxy Error: An error occurred with this dataset: Traceback (most recent call last): File "/cvmfs/main.galaxyproject.org/galaxy/tools/filters/gff/gff_filter_by_feature_count.py", line 182, in <module> __main__() File "/cvmfs/main.galaxyproject.org/galaxy/tools/filters/gff/gff_filter_by_feature_c

File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/util/gff_util.py", line 191, in __next__ self.seed_interval = GenomicIntervalReader.next(self) RuntimeError: maximum recursion depth exceeded while calling a Python object

Filter 18: MCF7 hg19.gff
Using feature name exon
With following condition >1

Table browser doesn't list exon as a possible filter option when I upload this dataset as a custom track.

I am very new to this. Does anyone have any suggestions for me here? I can use R pretty fluently and I have a little bit of python ability. I also have bedtools set up but I don't know how to use it very well.

Please point me in the right direction!

Thanks, Alex.

galaxy gff bed bedtools • 787 views
ADD COMMENTlink modified 17 months ago by cpad011212k • written 17 months ago by alex.v.nesta10
1
awk '{ if ($3 == "exon") print $0; }'your_gtf_file | grep -o "transcript_id [^;]\+;" | awk -v FS=" " '{ arr[substr($2, 2, length($2)-3)] += 1; } END { for (t in arr) { if (arr[t] > 1) { print t;} } }'

will yield a list of the ids of all multi-exon transcripts. Is that what you're looking for? (I noticed the line format you provided above is gtf not gff, btw.)

ADD REPLYlink written 17 months ago by cschu1811.9k

I just typed out a nice comment and accidentally pressed cancel instead of add comment. Lost everything. Super pissed.

Summary: Unfortunately your script identified all transcripts as multi exon. Thanks for your efforts. I think I can figure this out in R now though.

Thanks for pointing out this is GFFv2/GTF format. Now I am able to import and export win R without corruption.

here is some toy data. 2 transcripts to keep and 2 to eliminate. In case you can figure out what went wrong in your script. I will update if I can get this sorted in R. Galaxy still errors out even though I said it was GTF format.

chr1    PacBio  transcript  27567   29338   .   -   .   gene_id "PB2015.1"; transcript_id "PB2015.1.1";
chr1    PacBio  exon    27567   29338   .   -   .   gene_id "PB2015.1"; transcript_id "PB2015.1.1";
chr1    PacBio  transcript  335264  336700  .   +   .   gene_id "PB2015.2"; transcript_id "PB2015.2.1";
chr1    PacBio  exon    335264  336700  .   +   .   gene_id "PB2015.2"; transcript_id "PB2015.2.1";
chr1    PacBio  transcript  661285  663852  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";
chr1    PacBio  exon    661285  662523  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";
chr1    PacBio  exon    662866  663852  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";
chr1    PacBio  transcript  661268  663688  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.1";
chr1    PacBio  exon    661268  662168  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.1";
chr1    PacBio  exon    663125  663688  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.1";
chr1    PacBio  transcript  661290  663352  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.2";
chr1    PacBio  exon    661290  662469  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.2";
chr1    PacBio  exon    662928  663352  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.2";
ADD REPLYlink written 17 months ago by alex.v.nesta10

Hmm, when I run this on your toy data, it returns only the three real multi-exon transcripts, i.e. PB2015.3.1, PB2015.4.1, PB2015.4.2.

ADD REPLYlink written 17 months ago by cschu1811.9k

Sorry I wasn't able to make your script work. I can take another look at it in a bit. But I still have good news! I managed to sort out the issue in R! verified elimination of single exon transcripts comparing the two tracks using IGV.

This honestly wouldn't have been possible without your noticing the format.

library(rtracklayer)
library(tidyverse)
MCF7 <- as.tibble(import("MCF7hg19.gtf"))

MultiExons <- MCF7 %>% group_by(transcript_id) %>% filter(n() > 2)

nrow(MultiExons)
nrow(MCF7)

export(MultiExons, "MCFMultiTranscripts.gtf", format="GTF")
ADD REPLYlink written 17 months ago by alex.v.nesta10

Well, that is good that you got a solution and that's what counts. I really need to get my R sorted these days...

ADD REPLYlink written 17 months ago by cschu1811.9k
0
gravatar for cpad0112
17 months ago by
cpad011212k
India
cpad011212k wrote:

output:

$ sed 's/\s/\t/g' ./gencode_v22_grch38_annotation/gencode.v28.annotation.gtf | awk '$3 == "exon"'| datamash -s  -g 12 count 3 | awk '$2 < 2' | head
"ENST00000194152.3";    1
"ENST00000194155.6";    1
"ENST00000209540.2";    1
"ENST00000216407.4";    1
"ENST00000216743.6";    1
"ENST00000218249.7";    1
"ENST00000222033.5";    1
"ENST00000223076.2";    1
"ENST00000225587.1";    1
"ENST00000229030.4";    1

input:

$ head ./gencode_v22_grch38_annotation/gencode.v28.annotation.gtf

##description: evidence-based annotation of the human genome (GRCh38), version 28 (Ensembl 92)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2018-03-23
chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
ADD COMMENTlink modified 17 months ago • written 17 months ago by cpad011212k

What is datamash ?

ADD REPLYlink written 17 months ago by cschu1811.9k
1

datamash is a command line tool provided by GNU and if you are using debian based distros, type 'sudo apt install datamash -y'. You can read about it more online here: https://www.gnu.org/software/datamash/

ADD REPLYlink modified 17 months ago • written 17 months ago by cpad011212k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1658 users visited in the last hour