Filtering Augustus GTF based on protein sequence
19 months ago
jamie.pike ▴ 80

I recently ran ab initio gene prediction with Augustus and now want to filter the output (example below) by the length of the predicted protein: anything shorter than 30 aa should be excluded from further work. I had intended to filter with awk, but none of the GTF fields gives the size of the predicted protein sequence. Does anyone have suggestions for filtering this GTF by protein length?

The columns (fields) contain:

seqname   source     feature    start   end   score   strand   frame    attributes (transcript and gene ID)

# ----- prediction on sequence number 1 (length = 5210, name = AGND01000115.1:654099-659309(+)) -----
#
# Predicted genes for sequence number 1 on both strands
# start gene g1
AGND01000115.1:654099-659309(+) AUGUSTUS        gene    979     2277    0.76    -       .       g1
AGND01000115.1:654099-659309(+) AUGUSTUS        transcript      979     2277    0.76    -       .       g1.t1
AGND01000115.1:654099-659309(+) AUGUSTUS        stop_codon      979     981     .       -       0       transcript_id "g1.t1"; gene_id "g1";
AGND01000115.1:654099-659309(+) AUGUSTUS        CDS     979     1071    0.99    -       0       transcript_id "g1.t1"; gene_id "g1";
AGND01000115.1:654099-659309(+) AUGUSTUS        CDS     1120    1859    0.78    -       2       transcript_id "g1.t1"; gene_id "g1";
AGND01000115.1:654099-659309(+) AUGUSTUS        CDS     1905    2277    0.98    -       0       transcript_id "g1.t1"; gene_id "g1";
AGND01000115.1:654099-659309(+) AUGUSTUS        start_codon     2275    2277    .       -       0       transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MPRAHDHFHGRHYHAERATGPVKSLNPTKRYLIADRKPLHAESDAGKESRPSAESPGVAYVWRSRDNRKGRHALVISV
# DPRKHDATKAPRPSNSYHQTLRGILKMFVRYPVWDVSYDVAIVFTIGSIIWVINGFFSWLPVLNPSTKFSDWAGGLTAFIGATVFEFGSILLMLEAVN
# ENRADCFGWAVEESIDGMLHLTHADNCKHAHAHKGTFVKQSSKTLDNNTTESAGNDRMWSWWPTWYELRSHYFFDIGFLACSSQTFGATVFWISGFTA
# LPPILNNLSTPAENGVYWLPQVIGGTGFIVSSTLFMVEVQPRWYIPAPGVLGWHIGLWNLIGAIGFTLCGALGFGITHPGVEYALTLSTFIGSWAFLI
# GSVIQWYESLNKYPIWVDQKIERLGKRKS]
# end gene g1
###

Why not extract all the protein sequences into a separate FASTA file and filter that file by length (using seqkit or similar)?

(It is not straightforward to get the length directly from the Augustus output.)
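
As a rough illustration of the extraction step, here is a minimal Python sketch that collects the `# protein sequence = [...]` comment blocks from Augustus output and keeps only proteins of at least 30 aa. The function names (`parse_augustus_proteins`, `filter_by_length`) are made up for this example, and it assumes one transcript per gene, as in the output shown in the question; with alternative transcripts the sequences would need to be keyed differently.

```python
import re


def parse_augustus_proteins(lines):
    """Return {gene_id: protein_sequence} from Augustus output lines.

    Assumes each gene block starts with '# start gene <id>' and holds a
    single '# protein sequence = [...]' comment, possibly wrapped over
    several '# '-prefixed lines (as in the example output).
    """
    proteins = {}
    gene_id = None
    seq_parts = None  # None while we are outside a protein block
    for raw in lines:
        line = raw.strip()
        m = re.match(r"#\s*start gene\s+(\S+)", line)
        if m:
            gene_id = m.group(1)
            continue
        if seq_parts is None:
            m = re.match(r"#\s*protein sequence\s*=\s*\[(.*)", line)
            if not m:
                continue
            seq_parts = []
            chunk = m.group(1)
        elif line.startswith("#"):
            # Continuation line of a wrapped protein sequence
            chunk = line.lstrip("#").strip()
        else:
            continue
        if chunk.endswith("]"):
            seq_parts.append(chunk[:-1])
            proteins[gene_id] = "".join(seq_parts)
            seq_parts = None
        else:
            seq_parts.append(chunk)
    return proteins


def filter_by_length(proteins, min_len=30):
    """Keep only proteins with at least min_len residues."""
    return {gid: seq for gid, seq in proteins.items() if len(seq) >= min_len}
```

The surviving gene IDs can then be used to select the matching lines from the GTF, or the sequences can be written out as FASTA for a tool such as seqkit.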


You can filter the GFF by ORF length using agat_sp_filter_by_ORF_size.pl from AGAT.
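
For reference, the ORF length can also be worked out from the CDS features themselves. The sketch below is not how AGAT does it, just a minimal Python illustration with hypothetical function names (`cds_lengths`, `protein_length`). It assumes a tab-delimited file and that the CDS features include the stop codon. In the example from the question they appear to: the three CDS features sum to 1206 nt (402 codons), the stop_codon lies inside the first CDS, and the printed protein is 401 aa.

```python
import re
from collections import defaultdict


def cds_lengths(gtf_lines):
    """Sum CDS feature lengths (in nt) per transcript_id.

    Assumes tab-delimited GTF lines with a 'transcript_id "..."'
    attribute on every CDS feature.
    """
    totals = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        m = re.search(r'transcript_id "([^"]+)"', fields[8])
        if m:
            # GTF coordinates are 1-based and inclusive
            totals[m.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return dict(totals)


def protein_length(cds_nt, includes_stop=True):
    """Convert a CDS length in nt to a protein length in aa.

    includes_stop=True is an assumption: it subtracts one codon for the
    stop codon, which would be wrong for partial genes without one.
    """
    codons = cds_nt // 3
    return codons - 1 if includes_stop else codons
```

Transcripts whose `protein_length` is below 30 could then be dropped from the GTF by their transcript ID.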
