Question

Restrict GTF or GFF3 to final 3' exons

0

Entering edit mode

4.8 years ago

colindaven 6.4k

Hello everybody,

I need to compare old 3' tag RNA-seq data where only the last couple of exons are sequenced to modern full gene RNA-seq data.

To make things slightly more comparable I want to only count reads mapping to the last 4 or so exons per transcript.

Does anyone know a tool which can filter GTFs or GFF3 to only include the last exons of a transcript, or will I have to write something myself ? I know quite a few tools, but none have this (admittedly weird) functionality.

Thanks

Edit - we wrote a Python script here to cover this, thanks Fabian. None of the simple approaches suggested here worked.

https://github.com/Colorstorm/gtf_last_exon_filter

rna-seq annotation • 1.1k views

ADD COMMENT • link 4.4 years ago by colindaven 6.4k

1

Entering edit mode

Either filter GTF or count at exon level with featureCounts and then filter the counts to keep exons you want. You would need some custom filtering in any case.

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

Entering edit mode

I think better to write a small code either in R or using awk in linux. I know there are various ways to do this.

ADD REPLY • link 4.8 years ago by archana.bioinfo87 ▴ 210

0

Entering edit mode

Try this command

grep -e "gene_id" -w -m 3 "$next" Input_file.gtf

It will extract the whole line with respect to three exons.

ADD REPLY • link 4.8 years ago by archana.bioinfo87 ▴ 210

0

Entering edit mode

Thanks, looks like this command is off though. It's extracting three different biotypes from the Gencode v28 GTF and also throwing a no such file error.

grep -e "gene_id" -w -m 3 "$next" gencode.v28.annotation.gtf

grep: : No such file or directory
gencode.v28.annotation.gtf:chr1 HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
gencode.v28.annotation.gtf:chr1 HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
gencode.v28.annotation.gtf:chr1 HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

ADD REPLY • link 4.8 years ago by colindaven 6.4k

0

Entering edit mode

you can do one more thing sort the data with respect to gene_id and

for next in $(cut –d" “ -f1 sorted_data.csv | sort -u); do grep -w -m 3 "$next" sorted_data.csv ; done

You can also modify the command according to your file columns.

ADD REPLY • link 4.8 years ago by archana.bioinfo87 ▴ 210