GFF to fasta
6.2 years ago
mmats010 ▴ 80

Hello, I am following up on an old entry that discusses the gff2fasta.pl script. I figured asking a new question would be easier than reviving a 5.5-year-old thread.

I was hoping for some advice on how to modify the script so that it correctly parses a somewhat unusual .gff file I was given. Here are two entries from the file:

7000000037415267        .       gene    21339   22504   .       +       .       ID=7000003035155523;Function=polygalacturonase%2C%20putative;Name=PITG_19619
7000000037415267        .       mRNA    21339   22504   .       +       .       ID=7000003035155526;Parent=7000003035155523;Function=polygalacturonase%2C%20putative;Name=PITG_19619
7000000037415267        .       exon    21339   21714   .       +       .       ID=7000003035155526.exon2;Parent=7000003035155526
7000000037415267        .       CDS     21339   21714   .       +       0       ID=cds.7000003035155526;Parent=7000003035155526
7000000037415267        .       exon    21749   22170   .       +       .       ID=7000003035155526.exon3;Parent=7000003035155526
7000000037415267        .       CDS     21749   22170   .       +       2       ID=cds.7000003035155526;Parent=7000003035155526
7000000037415267        .       exon    22307   22504   .       +       .       ID=7000003035155526.exon4;Parent=7000003035155526
7000000037415267        .       CDS     22307   22504   .       +       0       ID=cds.7000003035155526;Parent=7000003035155526

7000000037414998        .       gene    679960  682584  .       +       .       ID=7000003035181604;Function=conserved%20hypothetical%20protein;Name=PITG_09139
7000000037414998        .       mRNA    679960  682584  .       +       .       ID=7000003035181607;Parent=7000003035181604;Function=conserved%20hypothetical%20protein;Name=PITG_09139
7000000037414998        .       five_prime_UTR  679960  680620  .       +       .       ID=7000003035181607.utr5p1;Parent=7000003035181607
7000000037414998        .       five_prime_UTR  680710  680802  .       +       .       ID=7000003035181607.utr5p2;Parent=7000003035181607
7000000037414998        .       five_prime_UTR  680907  680909  .       +       .       ID=7000003035181607.utr5p3;Parent=7000003035181607
7000000037414998        .       exon    679960  680620  .       +       .       ID=7000003035181607.exon1;Parent=7000003035181607
7000000037414998        .       exon    680710  680802  .       +       .       ID=7000003035181607.exon2;Parent=7000003035181607
7000000037414998        .       exon    680907  681227  .       +       .       ID=7000003035181607.exon3;Parent=7000003035181607
7000000037414998        .       CDS     680910  681227  .       +       0       ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998        .       exon    681298  681489  .       +       .       ID=7000003035181607.exon4;Parent=7000003035181607
7000000037414998        .       CDS     681298  681489  .       +       0       ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998        .       exon    681563  682584  .       +       .       ID=7000003035181607.exon5;Parent=7000003035181607
7000000037414998        .       CDS     681563  682174  .       +       0       ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998        .       three_prime_UTR 682175  682584  .       +       .       ID=7000003035181607.utr3p1;Parent=7000003035181607

The gff2fasta.pl script names each record with whatever string follows "ID=". So the gene output will be:

>7000003035155523
ATGCCTTTAGCGACGATCACTCTCCTCTTCTTCGCTAGCTTACCTCCCCAATCCACTCTT...
>7000003035181604
GGTGAACATGTTGTCTGTATTGTCTGTACTTGCCGACCATGAGCTCCTCGGTAGTGCACA...

And the output for mRNA, peptides, and CDS will be:

>7000003035155526
MPLATITLLFFASLPPQSTLHSAICFLPTQRPLKVQPAMKLVSSAFGVFALLAAFVSGST...
>7000003035181607
MSFSKSNLPPTLPVAIKKEREDPSSLSGSMSIPGSSSSIPRKDSIGWGADDFLGMISHTP...

Is there a way to name each record in the resulting FASTA file after the string following "Name=" instead? In these cases, that would be:

>PITG_19619
ATGCCTTTAGCGACGATCACTCTCCTCTTCTTCGCTAGCTTACCTCCCCAATCCACTCTT...
>PITG_09139
GGTGAACATGTTGTCTGTATTGTCTGTACTTGCCGACCATGAGCTCCTCGGTAGTGCACA...

and

>PITG_19619
MPLATITLLFFASLPPQSTLHSAICFLPTQRPLKVQPAMKLVSSAFGVFALLAAFVSGST...
>PITG_09139
MSFSKSNLPPTLPVAIKKEREDPSSLSGSMSIPGSSSSIPRKDSIGWGADDFLGMISHTP...

Alternatively, maybe there is a way to modify the .gff file itself by swapping the string after "ID=" with the string after "Name="? I have only limited Perl knowledge.

Much appreciated, Mike

sequencing annotation perl gff genome • 3.1k views
6.2 years ago
mmats010 ▴ 80

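I found the answer. In this file, "Name=" is the third attribute on gene lines (index 2) and the fourth on mRNA lines (index 3), so the script has to read those positions instead of position 0.

Lines 52 and 53 of gff2fasta.pl need to be changed from: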

$attrs[0] =~ s/ID=//;
my $gene_name = $attrs[0];

to

$attrs[2] =~ s/Name=//;
my $gene_name = $attrs[2];

And lines 188 and 189 need to be changed from:

$attrs[0] =~ s/ID=//;
$mRNA_name = $attrs[0];

to

$attrs[3] =~ s/Name=//;
$mRNA_name = $attrs[3];
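
One caveat: the fixed indices above only work if "Name=" sits in the same position on every gene and mRNA line. For files where the attribute order varies, looking the value up by key might be safer. Below is a minimal sketch of that idea. It assumes the ninth GFF column is a plain semicolon-separated list of key=value pairs, and get_attr is a hypothetical helper of my own, not something that exists in gff2fasta.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of gff2fasta.pl): return the value stored
# under one key in a GFF3 attribute column, or undef if the key is absent.
sub get_attr {
    my ($attr_column, $key) = @_;
    for my $pair (split /;/, $attr_column) {
        # Split on the first "=" only, so values containing "=" stay intact.
        my ($k, $v) = split /=/, $pair, 2;
        return $v if defined $k && defined $v && $k eq $key;
    }
    return;
}

# Attribute column of the first mRNA line from the question.
my $attr_column = 'ID=7000003035155526;Parent=7000003035155523;'
                . 'Function=polygalacturonase%2C%20putative;Name=PITG_19619';

# Prefer Name=, and fall back to ID= when a feature has no Name.
my $mRNA_name = get_attr($attr_column, 'Name') // get_attr($attr_column, 'ID');
print "$mRNA_name\n";
```

The same lookup could presumably replace both the gene and the mRNA edits, since it does not care where "Name=" appears in the column. The `//` operator needs Perl 5.10 or newer.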