Question: Restrict GTF or GFF3 to final 3' exons
0
gravatar for colindaven
14 months ago by
colindaven2.3k
Hannover Medical School
colindaven2.3k wrote:

Hello everybody,

I need to compare old 3' tag RNA-seq data where only the last couple of exons are sequenced to modern full gene RNA-seq data.

To make things slightly more comparable I want to only count reads mapping to the last 4 or so exons per transcript.

Does anyone know a tool which can filter GTFs or GFF3 to only include the last exons of a transcript, or will I have to write something myself ? I know quite a few tools, but none have this (admittedly weird) functionality.

Thanks

Edit - we wrote a Python script here to cover this, thanks Fabian. None of the simple approaches suggested here worked.

https://github.com/Colorstorm/gtf_last_exon_filter

rna-seq annotation • 362 views
ADD COMMENTlink modified 10 months ago • written 14 months ago by colindaven2.3k
1

Either filter GTF or count at exon level with featureCounts and then filter the counts to keep exons you want. You would need some custom filtering in any case.

ADD REPLYlink written 14 months ago by genomax90k

I think better to write a small code either in R or using awk in linux. I know there are various ways to do this.

ADD REPLYlink written 14 months ago by archana.bioinfo87180

Try this command

grep -e "gene_id" -w -m 3 "$next" Input_file.gtf

It will extract the whole line with respect to three exons.

ADD REPLYlink modified 14 months ago • written 14 months ago by archana.bioinfo87180

Thanks, looks like this command is off though. It's extracting three different biotypes from the Gencode v28 GTF and also throwing a no such file error.

grep -e "gene_id" -w -m 3 "$next" gencode.v28.annotation.gtf

grep: : No such file or directory
gencode.v28.annotation.gtf:chr1 HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
gencode.v28.annotation.gtf:chr1 HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
gencode.v28.annotation.gtf:chr1 HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
ADD REPLYlink written 14 months ago by colindaven2.3k

you can do one more thing sort the data with respect to gene_id and

for next in $(cut –d" “ -f1 sorted_data.csv | sort -u); do grep -w -m 3 "$next" sorted_data.csv ; done

You can also modify the command according to your file columns.

ADD REPLYlink written 14 months ago by archana.bioinfo87180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1002 users visited in the last hour