Question

GFF to fasta

6.2 years ago

mmats010 ▴ 80

Hello, I am referring to an older thread about the gff2fasta.pl script, discussed here and located here. I figured asking a new question would be easier than reviving a 5.5-year-old thread.

Anyway, I was hoping for some advice on how to modify the script so that it correctly parses a somewhat unusual .gff file I was given. Here are two entries from the file:

7000000037415267        .       gene    21339   22504   .       +       .       ID=7000003035155523;Function=polygalacturonase%2C%20putative;Name=PITG_19619
7000000037415267        .       mRNA    21339   22504   .       +       .       ID=7000003035155526;Parent=7000003035155523;Function=polygalacturonase%2C%20putative;Name=PITG_19619
7000000037415267        .       exon    21339   21714   .       +       .       ID=7000003035155526.exon2;Parent=7000003035155526
7000000037415267        .       CDS     21339   21714   .       +       0       ID=cds.7000003035155526;Parent=7000003035155526
7000000037415267        .       exon    21749   22170   .       +       .       ID=7000003035155526.exon3;Parent=7000003035155526
7000000037415267        .       CDS     21749   22170   .       +       2       ID=cds.7000003035155526;Parent=7000003035155526
7000000037415267        .       exon    22307   22504   .       +       .       ID=7000003035155526.exon4;Parent=7000003035155526
7000000037415267        .       CDS     22307   22504   .       +       0       ID=cds.7000003035155526;Parent=7000003035155526

7000000037414998        .       gene    679960  682584  .       +       .       ID=7000003035181604;Function=conserved%20hypothetical%20protein;Name=PITG_09139
7000000037414998        .       mRNA    679960  682584  .       +       .       ID=7000003035181607;Parent=7000003035181604;Function=conserved%20hypothetical%20protein;Name=PITG_09139
7000000037414998        .       five_prime_UTR  679960  680620  .       +       .       ID=7000003035181607.utr5p1;Parent=7000003035181607
7000000037414998        .       five_prime_UTR  680710  680802  .       +       .       ID=7000003035181607.utr5p2;Parent=7000003035181607
7000000037414998        .       five_prime_UTR  680907  680909  .       +       .       ID=7000003035181607.utr5p3;Parent=7000003035181607
7000000037414998        .       exon    679960  680620  .       +       .       ID=7000003035181607.exon1;Parent=7000003035181607
7000000037414998        .       exon    680710  680802  .       +       .       ID=7000003035181607.exon2;Parent=7000003035181607
7000000037414998        .       exon    680907  681227  .       +       .       ID=7000003035181607.exon3;Parent=7000003035181607
7000000037414998        .       CDS     680910  681227  .       +       0       ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998        .       exon    681298  681489  .       +       .       ID=7000003035181607.exon4;Parent=7000003035181607
7000000037414998        .       CDS     681298  681489  .       +       0       ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998        .       exon    681563  682584  .       +       .       ID=7000003035181607.exon5;Parent=7000003035181607
7000000037414998        .       CDS     681563  682174  .       +       0       ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998        .       three_prime_UTR 682175  682584  .       +       .       ID=7000003035181607.utr3p1;Parent=7000003035181607

The gff2fasta.pl script uses whatever string follows "ID=" as the entry/gene name. For example, the gene output will be:

>7000003035155523
ATGCCTTTAGCGACGATCACTCTCCTCTTCTTCGCTAGCTTACCTCCCCAATCCACTCTT...
>7000003035181604
GGTGAACATGTTGTCTGTATTGTCTGTACTTGCCGACCATGAGCTCCTCGGTAGTGCACA...

And the output for mRNA, peptides, cds will be:

>7000003035155526
MPLATITLLFFASLPPQSTLHSAICFLPTQRPLKVQPAMKLVSSAFGVFALLAAFVSGST...
>7000003035181607
MSFSKSNLPPTLPVAIKKEREDPSSLSGSMSIPGSSSSIPRKDSIGWGADDFLGMISHTP...

Is there a way to name each record of the resulting FASTA file after the string following "Name=" instead? In these cases, that would be:

>PITG_19619
ATGCCTTTAGCGACGATCACTCTCCTCTTCTTCGCTAGCTTACCTCCCCAATCCACTCTT...
>PITG_09139
GGTGAACATGTTGTCTGTATTGTCTGTACTTGCCGACCATGAGCTCCTCGGTAGTGCACA...

and

>PITG_19619
MPLATITLLFFASLPPQSTLHSAICFLPTQRPLKVQPAMKLVSSAFGVFALLAAFVSGST...
>PITG_09139
MSFSKSNLPPTLPVAIKKEREDPSSLSGSMSIPGSSSSIPRKDSIGWGADDFLGMISHTP...

Alternatively, is there a way to modify the .gff file itself by swapping the string after "ID=" with the string after "Name="? I have only limited Perl knowledge.

Much appreciated, Mike

sequencing annotation perl gff genome • 3.1k views

score 1 · Accepted Answer · 2018-02-07

Found out the answer:

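Lines 52 and 53 of gff2fasta.pl, which set the gene name, need to be changed to read the third attribute (where "Name=" sits on the gene lines of this file) instead of the first. Change them from: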

$attrs[0] =~ s/ID=//;
my $gene_name = $attrs[0];

to

$attrs[2] =~ s/Name=//;
my $gene_name = $attrs[2];

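And lines 188 and 189, which set the mRNA name, need to be changed to read the fourth attribute (on mRNA lines, "Parent=" pushes "Name=" one position further along). Change them from: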

$attrs[0] =~ s/ID=//;
$mRNA_name = $attrs[0];

to

$attrs[3] =~ s/Name=//;
$mRNA_name = $attrs[3];
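
One caveat with the change above: the indices 2 and 3 only work as long as "Name=" stays in exactly that position on every gene and mRNA line. Looking the attribute up by key avoids that dependency. The following is a minimal Python sketch of the idea, not a patch to gff2fasta.pl; the function names parse_attributes and record_name are made up for illustration, and the sample strings are the ninth column of the gene, mRNA and exon lines from the question.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Turn a GFF3 attribute string (column 9) into a {key: value} dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters, e.g. %2C is ',' and %20 is ' '
        attrs[key.strip()] = unquote(value.strip())
    return attrs


def record_name(column9):
    """Prefer Name=; fall back to ID= for features that carry no Name."""
    attrs = parse_attributes(column9)
    return attrs.get("Name", attrs.get("ID"))


# Column 9 of three lines from the question (hypothetical variable names).
gene = "ID=7000003035155523;Function=polygalacturonase%2C%20putative;Name=PITG_19619"
mrna = ("ID=7000003035155526;Parent=7000003035155523;"
        "Function=polygalacturonase%2C%20putative;Name=PITG_19619")
exon = "ID=7000003035155526.exon2;Parent=7000003035155526"

print(record_name(gene))  # Name is the 3rd attribute here
print(record_name(mrna))  # Name is the 4th attribute here
print(record_name(exon))  # no Name, so the ID is used
```

The same lookup works for gene and mRNA lines without a separate index for each, and features without a "Name=" attribute (exons, CDS, UTRs in this file) keep their ID rather than ending up with an empty name.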