Extract longest mRNA from .GFF when I have multiple rows of same seqID
4.0 years ago
todd.ugine • 0

Hello,

I have a GFF file that has multiple rows for each seqID.

Using Python, I want to extract the longest mRNA for each seqID and write the results to a new tab-delimited file with these columns:

<seqID> <start> <end>

The original GFF is formatted as follows

<seqid> <SOURCE> <FEATURE> <START> <END> <SCORE> <STRAND> <FRAME> <ATTRIBUTES> <COMMENTS>

I'm very new to this and could use a hand with what is likely easy for someone with more experience.

Thanks for any help you can provide. I'm unlikely to be able to adapt someone else's script successfully on my own.

sequence genome • 935 views
  1. This post is a Question type post, not a Forum type post. I've made the necessary change but please be more careful in the future. Reading posts under the how-to tag will help.
  2. Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
  3. Please avoid the use of emojis and such in a professional/scientific setting.

4.0 years ago

There are some GTF parsers for Python, for example pyGTF, which you can use to get the length of each feature per seqID. Alternatively, convert the features to a BED file and use groupBy to get the first and last coordinates per seqID, then compute the length from those. If you post the head of your file, we can help a bit further.
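
The grouping idea can also be sketched in plain Python without extra packages. This is only a minimal sketch, not a tested pipeline. It assumes a tab-delimited GFF3 file, that the feature of interest is labelled `mRNA` in column 3 (change `feature_type` if your file uses another label), and that "longest" means the genomic span `end - start + 1`. The function names `longest_feature_per_seqid` and `write_tsv` are hypothetical.

```python
import csv


def longest_feature_per_seqid(gff_lines, feature_type="mRNA"):
    """Return {seqid: (start, end)} for the longest feature of the given type."""
    best = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments, and directives such as ### or ##gff-version
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Assumption: column 3 holds the feature type we want to keep
        if len(cols) < 5 or cols[2] != feature_type:
            continue
        seqid, start, end = cols[0], int(cols[3]), int(cols[4])
        length = end - start + 1  # GFF coordinates are 1-based and inclusive
        # On ties, the first feature seen for a seqID is kept
        if seqid not in best or length > best[seqid][1] - best[seqid][0] + 1:
            best[seqid] = (start, end)
    return best


def write_tsv(best, out_path):
    """Write one 'seqID<TAB>start<TAB>end' row per seqID."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for seqid, (start, end) in best.items():
            writer.writerow([seqid, start, end])
```

For example, opening the GFF with `open("input.gff")`, passing the file handle to `longest_feature_per_seqid`, and handing the result to `write_tsv(best, "longest.tsv")` would produce the three-column file asked for; both file names are placeholders.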

4.0 years ago
todd.ugine • 0

Thanks for your help. This was my first-ever post. I'll be sure to put things in the appropriate place moving forward. Should the head portion of this post be considered code?

##gff-version 3
BHEC01006359.1  .   contig  1   807 .   .   .   ID=BHEC01006359.1;Name=BHEC01006359.1
###
BHEC01012377.1  .   contig  1   805 .   .   .   ID=BHEC01012377.1;Name=BHEC01012377.1
###
BHEC01004863.1  .   contig  1   8509    .   .   .   ID=BHEC01004863.1;Name=BHEC01004863.1
BHEC01004863.1  snap_masked match   4053    8297    29.048  +   .   ID=BHEC01004863.1:hit:1841:4.5.0.0;Name=snap_masked-BHEC01004863.1-abinit-gene-0.1-mRNA-1
BHEC01004863.1  snap_masked match_part  4053    4314    12.637  +   .   ID=BHEC01004863.1:hsp:6435:4.5.0.0;Parent=BHEC01004863.1:hit:1841:4.5.0.0;Target=snap_masked-BHEC01004863.1-abinit-gene-0.1-mRNA-1 1 262 +;Gap=M262
BHEC01004863.1  snap_masked match_part  8263    8297    16.411  +   .   ID=BHEC01004863.1:hsp:6436:4.5.0.0;Parent=BHEC01004863.1:hit:1841:4.5.0.0;Target=snap_masked-BHEC01004863.1-abinit-gene-0.1-mRNA-1 263 297 +;Gap=M35
BHEC01004863.1  augustus_masked match   4055    4327    0.83    +   .   ID=BHEC01004863.1:hit:1842:4.5.0.0;Name=augustus_masked-BHEC01004863.1-abinit-gene-0.0-mRNA-1
BHEC01004863.1  augustus_masked match_part  4055    4327    0.83    +   .   ID=BHEC01004863.1:hsp:6437:4.5.0.0;Parent=BHEC01004863.1:hit:1842:4.5.0.0;Target=augustus_masked-BHEC01004863.1-abinit-gene-0.0-mRNA-1 1 273 +;Gap=M273
###
BHEC01004863.1  est_gff:est2genome  expressed_sequence_match    8071    8351    805 -   .   ID=BHEC01004863.1:hit:1840:3.12.0.0;Name=Csept_BB_C55352;score=805
BHEC01004863.1  est_gff:est2genome  match_part  8071    8351    805 -   .   ID=BHEC01004863.1:hsp:6434:3.12.0.0;Parent=BHEC01004863.1:hit:1840:3.12.0.0;Target=Csept_BB_C55352 7 284 +;Gap=M281
BHEC01053345.1  .   contig  1   2142    .   .   .   ID=BHEC01053345.1;Name=BHEC01053345.1
###
BHEC01052641.1  .   contig  1   803 .   .   .   ID=BHEC01052641.1;Name=BHEC01052641.1
###
BHEC01000922.1  .   contig  1   1466    .   .   .   ID=BHEC01000922.1;Name=BHEC01000922.1
###
BHEC01008444.1  .   contig  1   9527    .   .   .   ID=BHEC01008444.1;Name=BHEC01008444.1
BHEC01008444.1  snap_masked match   3239    9054    54.086  +   .   ID=BHEC01008444.1:hit:1739:4.5.0.0;Name=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1
BHEC01008444.1  snap_masked match_part  3239    3312    11.140  +   .   ID=BHEC01008444.1:hsp:6367:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 1 74 +;Gap=M74
BHEC01008444.1  snap_masked match_part  5275    5320    11.772  +   .   ID=BHEC01008444.1:hsp:6368:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 75 120 +;Gap=M46
BHEC01008444.1  snap_masked match_part  5469    5541    6.534   +   .   ID=BHEC01008444.1:hsp:6369:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 121 193 +;Gap=M73
BHEC01008444.1  snap_masked match_part  8183    8261    10.882  +   .   ID=BHEC01008444.1:hsp:6370:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 194 272 +;Gap=M79
BHEC01008444.1  snap_masked match_part  9024    9054    13.758  +   .   ID=BHEC01008444.1:hsp:6371:4.5.0.0;Parent=BHEC01008444.1:hit:1739:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.1-mRNA-1 273 303 +;Gap=M31
BHEC01008444.1  snap_masked match   7114    7425    -11.831 -   .   ID=BHEC01008444.1:hit:1740:4.5.0.0;Name=snap_masked-BHEC01008444.1-abinit-gene-0.2-mRNA-1
BHEC01008444.1  snap_masked match_part  7114    7425    -11.831 -   .   ID=BHEC01008444.1:hsp:6372:4.5.0.0;Parent=BHEC01008444.1:hit:1740:4.5.0.0;Target=snap_masked-BHEC01008444.1-abinit-gene-0.2-mRNA-1 1 312 +;Gap=M312
BHEC01008444.1  augustus_masked match   7114    7425    0.52    -   .   ID=BHEC01008444.1:hit:1741:4.5.0.0;Name=augustus_masked-BHEC01008444.1-abinit-gene-0.0-mRNA-1
BHEC01008444.1  augustus_masked match_part  7114    7425    0.52    -   .   ID=BHEC01008444.1:hsp:6373:4.5.0.0;Parent=BHEC01008444.1:hit:1741:4.5.0.0;Target=augustus_masked-BHEC01008444.1-abinit-gene-0.0-mRNA-1 1 312 +;Gap=M312
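
In the excerpt above, no row has `mRNA` in the feature column. The gene predictions appear as `match` rows whose `match_part` children point back to them through `Parent=`. One possible reading of "longest mRNA" for a file like this is the spliced length, that is, the sum of the `match_part` lengths under each `match`. The following is a minimal sketch under that assumption. The function names are hypothetical, it expects tab-delimited input with a single `Parent` per part, and by default it considers every source in column 2 (including the est2genome alignments) unless `sources` is given.

```python
def parse_attributes(field):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for item in field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def longest_match_by_parts(gff_lines, part_type="match_part", sources=None):
    """Return {seqid: (parent_id, start, end, spliced_length)}."""
    # parent ID -> [seqid, smallest start, largest end, summed part length]
    parents = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != part_type:
            continue
        # Optional filter on column 2, e.g. {"snap_masked", "augustus_masked"}
        if sources is not None and cols[1] not in sources:
            continue
        parent = parse_attributes(cols[8]).get("Parent")
        if parent is None:
            continue
        start, end = int(cols[3]), int(cols[4])
        rec = parents.setdefault(parent, [cols[0], start, end, 0])
        rec[1] = min(rec[1], start)
        rec[2] = max(rec[2], end)
        rec[3] += end - start + 1  # 1-based, inclusive coordinates
    best = {}
    for parent, (seqid, start, end, length) in parents.items():
        # On ties, the parent encountered first is kept
        if seqid not in best or length > best[seqid][3]:
            best[seqid] = (parent, start, end, length)
    return best
```

As a sanity check against the excerpt: the two snap_masked parts on BHEC01004863.1 span 4053-4314 and 8263-8297, which is 262 + 35 = 297 bases, consistent with the `Target` range ending at 297 on the second part.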