Question: Restrict GTF or GFF3 to final 3' exons
0
gravatar for colindaven
4 months ago by
colindaven1.8k
Hannover Medical School
colindaven1.8k wrote:

Hello everybody,

I need to compare old 3' tag RNA-seq data where only the last couple of exons are sequenced to modern full gene RNA-seq data.

To make things slightly more comparable I want to only count reads mapping to the last 4 or so exons per transcript.

Does anyone know a tool which can filter GTFs or GFF3 to only include the last exons of a transcript, or will I have to write something myself ? I know quite a few tools, but none have this (admittedly weird) functionality.

Thanks

rna-seq annotation • 215 views
ADD COMMENTlink modified 3 months ago by Biostar ♦♦ 20 • written 4 months ago by colindaven1.8k
1

Either filter GTF or count at exon level with featureCounts and then filter the counts to keep exons you want. You would need some custom filtering in any case.

ADD REPLYlink written 4 months ago by genomax74k

I think better to write a small code either in R or using awk in linux. I know there are various ways to do this.

ADD REPLYlink written 4 months ago by archana.bioinfo87170

Try this command

grep -e "gene_id" -w -m 3 "$next" Input_file.gtf

It will extract the whole line with respect to three exons.

ADD REPLYlink modified 4 months ago • written 4 months ago by archana.bioinfo87170

Thanks, looks like this command is off though. It's extracting three different biotypes from the Gencode v28 GTF and also throwing a no such file error.

grep -e "gene_id" -w -m 3 "$next" gencode.v28.annotation.gtf

grep: : No such file or directory
gencode.v28.annotation.gtf:chr1 HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
gencode.v28.annotation.gtf:chr1 HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
gencode.v28.annotation.gtf:chr1 HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
ADD REPLYlink written 4 months ago by colindaven1.8k

you can do one more thing sort the data with respect to gene_id and

for next in $(cut –d" “ -f1 sorted_data.csv | sort -u); do grep -w -m 3 "$next" sorted_data.csv ; done

You can also modify the command according to your file columns.

ADD REPLYlink written 4 months ago by archana.bioinfo87170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1885 users visited in the last hour