Question

Filtering of transcripts with more than two exons

0

Entering edit mode

3.2 years ago

kuntalasb ▴ 10

Hello, I have an annotation.gtf of a particular plant species and a fasta file of filtered transcripts of that plant. I need to filter out certain transcripts from that fasta file having less than two exons. My gtf file looks like this :

    scaffold_1  AT_DB_2015  exon    29512   30439   .   -   .   transcript_id "AT000010.1"; gene_id "AT000010"; gene_name "AT000010";
scaffold_1  AT_DB_2015  exon    30816   31164   .   -   .   transcript_id "AT000010.1"; gene_id "AT000010"; gene_name "AT000010";

My fasta file of transcripts:

    >TCONS_00000187
GCAGACAAAAATCAGGTAAATGGTAAAGTTTACTTGCAGAAACAACCacaaaaagaaaca
agcaagaaaGCCGACAGAAAACATTtggaaatttattaatgttTCAATGTTTACAAGGCT
GATGAATATACAATTAGACCTCAGGCTCAGGAGGGTCAACGGTCTCAGCAGGCGAAGGAT
CAGCTACCTCGGGGAGTGGAAGATCAGCTGTATTaggaggaggaggcatggacgA
>TCONS_00000188
GCTCTCCCCTACTTTTCGACCATAGATTTACTTCTTCAGCTGAACCgaaaatttaaaaga
ataagAGTGTAAGCAAAGTAATACTATTTagattaggggtgggcaaaatcaaATTCGGAT
TGAGTTCGGATCAATCTGATCAGATTTCAAATTGTTCGAGTTGGGTAAAAATTAATCTGA
TGCAACCCGAACTTACAACCCGACTCAATCCGATTAGATATATATTCAAATCGGATCAGA
TCAGGTAAAAAAATCAGATTGAATTCGGATATGATTGGGTTCGAATTACAACAATCAAAT
ACTAAACATTAATTTGCTATCTATTAAATTAGTAACTATATATACCATCAAAGCTTCAAA

Is there any awk command (or any command) to filter out transcripts having < 2 exons and to keep those having >= 2 exons? Any help will be highly appreciated. Thanks.

sequence RNA-Seq • 1.5k views

ADD COMMENT • link updated 3.2 years ago by jordi.planells ▴ 480 • written 3.2 years ago by kuntalasb ▴ 10

1

Entering edit mode

what's the link between fasta and gtf? You can group by gene/transcript and filter by exon number. Please post more lines and example output.

ADD REPLY • link 3.2 years ago by cpad0112 21k

0

Entering edit mode

I am so sorry for the previous gtf file. The correct gtf file is this :

    MKYO02000001.1  StringTie   transcript  43814   47192   .   +   .   transcript_id "TCONS_00000860"; gene_id "XLOC_000001"; oId "STRG.3.2"; class_code "u"; tss_id "TSS1"; num_samples "4";
MKYO02000001.1  StringTie   exon    43814   44940   .   +   .   transcript_id "TCONS_00000860"; gene_id "XLOC_000001"; exon_number "1";
MKYO02000001.1  StringTie   exon    45202   45427   .   +   .   transcript_id "TCONS_00000860"; gene_id "XLOC_000001"; exon_number "2";
MKYO02000001.1  StringTie   exon    45532   45659   .   +   .   transcript_id "TCONS_00000860"; gene_id "XLOC_000001"; exon_number "3";
MKYO02000001.1  StringTie   exon    45763   47192   .   +   .   transcript_id "TCONS_00000860"; gene_id "XLOC_000001"; exon_number "4";
MKYO02000001.1  StringTie   transcript  43814   47192   .   +   .   transcript_id "TCONS_00000333"; gene_id "XLOC_000001"; oId "STRG.3.3"; class_code "u"; tss_id "TSS1"; num_samples "18";

From here, I want to extract TCONS ids with exon numbers >= 2. The previous fasta file is the sequence information of all tcons of this gtf. Any command based solution will be highly helpful. Thank you.

ADD REPLY • link 3.2 years ago by kuntalasb ▴ 10

1

Entering edit mode

Since you are running, Tuxedo suite, I assume that gffread is installed on your system. Run the following code and this would print transcript in the first column and number of exons in corresponding transcript would be in second column.

$ gffread -E transcripts.gtf  --bed -o- | awk -F "\t" '$10 >=2 {print $4,$10}'

If you still have issues in running the code, please upload example gtf file some where and share the location. Parsing copy/pasted gtf is time consuming.

This is more robust than previous code, but you need to know where exonCount and other attributes are coming in gffread output:

$ gffread -E transcripts.gtf  --tlf  -o- | awk -F "\t|=|;" '$12 >=3 {print$10,$16,$12}'

Transcripts.gtf is taken from example files provided by gffread on github.

in R with dplyr:

library(dplyr)
library(rtracklayer)

gtf = import('transcripts.gtf')
gtf_df=as.data.frame(gtf)

gtf_df %>% 
count(gene_id,transcript_id,type, name="Exon_count") %>% 
filter(type=='exon' & Exon_count >=2) %>% 
select(-type)

base R:

library(dplyr)
library(rtracklayer)    
gtf = import('transcripts.gtf')
gtf_df=as.data.frame(gtf)
gtf_df_xtab=as.data.frame(xtabs(~transcript_id+type, data=gtf_df))
subset(gtf_df_xtab, type %in% 'exon' & Freq >= 2)

This would print gene, transcript, and number of exons

ADD REPLY • link 3.2 years ago by cpad0112 21k

0

Entering edit mode

Thank you so much, it worked for me!

ADD REPLY • link 3.2 years ago by kuntalasb ▴ 10

score 1 · Answer 1 · 2021-02-12

You could use dplyr and group_by to count the number of transcripts/exons per gene.

library(tidyverse)

First, separate the attributes column (the last one).

gtf = separate(gtf, col = V9, sep = ";", into = c("transcript_id", "gene_id"))

Group and get the number of features by gene.

df = gtf %>% group_by(gene_id) %>% count(V3)

Filter only those that have 2 or more exons.

df = df[df$V3=="exon" & df$n >= 2,]

With this, you will have a data frame with all the genes that have 2 or more transcripts. You only need to slice the gtf annotation.

two_or_more = gtf[gtf$gene_id %in% df$gene_id,]