Question: Split a GFF3 file containing many genomes into one GFF3 file per genome
Chris wrote, 28 days ago:

Hi all, I have a big GFF3 file that I downloaded from Batch Entrez and it contains 54 genomes. I want to split it into individual genomes. Any help would be appreciated.

Thanks

This is an example of the file:

##sequence-region Z18946.1 1 52297
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=31757
Z18946.1    EMBL    region  1   52297   .   +   .   ID=id-1;Dbxref=taxon:31757;gbkey=Src;mol_type=genomic DNA
Z18946.1    EMBL    gene    411 1100    .   +   .   ID=gene-PBI_L5_1;Name=1;gbkey=Gene;gene=1;gene_biotype=protein_coding;locus_tag=PBI_L5_1
Z18946.1    EMBL    CDS 411 1100    .   +   0   ID=cds-CAA79380.1;Parent=gene-PBI_L5_1;Dbxref=InterPro:IPR025530,UniProtKB/Swiss-Prot:Q05218,NCBI_GP:CAA79380.1;Name=CAA79380.1;gbkey=CDS;gene=1;product=Hypothetical Protein;protein_id=CAA79380.1;transl_table=11
Z18946.1    EMBL    gene    1305    2084    .   +   .   ID=gene-PBI_L5_2;Name=2;gbkey=Gene;gene=2;gene_biotype=protein_coding;locus_tag=PBI_L5_2
Z18946.1    EMBL    CDS 1305    2084    .   +   0   ID=cds-CAA79381.1;Parent=gene-PBI_L5_2;Dbxref=UniProtKB/Swiss-Prot:Q05230,NCBI_GP:CAA79381.1;Name=CAA79381.1;gbkey=CDS;gene=2;product=Hypothetical Protein;protein_id=CAA79381.1;transl_table=11
Z18946.1    EMBL    gene    2084    2335    .   +   .   ID=gene-PBI_L5_3;Name=3;gbkey=Gene;gene=3;gene_biotype=protein_coding;locus_tag=PBI_L5_3
Z18946.1    EMBL    CDS 2084    2335    .   +   0   ID=cds-CAA79382.1;Parent=gene-PBI_L5_3;Dbxref=UniProtKB/Swiss-Prot:Q05242,NCBI_GP:CAA79382.1;Name=CAA79382.1;gbkey=CDS;gene=3;product=Hypothetical Protein;protein_id=CAA79382.1;transl_table=11

##sequence-region AF022214.1 1 49136
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=28369
AF022214.1  Genbank region  1   49136   .   +   .   ID=id-1;Dbxref=taxon:28369;gbkey=Src;mol_type=genomic DNA;old-name=Mycobacteriophage D29
AF022214.1  Genbank gene    401 1213    .   +   .   ID=gene-PBI_D29_1;Name=1;gbkey=Gene;gene=1;gene_biotype=protein_coding;locus_tag=PBI_D29_1
AF022214.1  Genbank CDS 401 1213    .   +   0   ID=cds-AAC18444.1;Parent=gene-PBI_D29_1;Dbxref=NCBI_GP:AAC18444.1;Name=AAC18444.1;Note=gp1%3B putative 30.3 kD protein;gbkey=CDS;gene=1;product=hypothetical protein;protein_id=AAC18444.1;transl_table=11
AF022214.1  Genbank gene    1327    2106    .   +   .   ID=gene-PBI_D29_2;Name=2;gbkey=Gene;gene=2;gene_biotype=protein_coding;locus_tag=PBI_D29_2
AF022214.1  Genbank CDS 1327    2106    .   +   0   ID=cds-AAC18445.1;Parent=gene-PBI_D29_2;Dbxref=NCBI_GP:AAC18445.1;Name=AAC18445.1;Note=gp2%3B putative 28.8 kD protein;gbkey=CDS;gene=2;product=hypothetical protein;protein_id=AAC18445.1;transl_table=11
AF022214.1  Genbank gene    2106    2357    .   +   .   ID=gene-PBI_D29_3;Name=3;gbkey=Gene;gene=3;gene_biotype=protein_coding;locus_tag=PBI_D29_3
AF022214.1  Genbank CDS 2106    2357    .   +   0   ID=cds-AAC18446.1;Parent=gene-PBI_D29_3;Dbxref=NCBI_GP:AAC18446.1;Name=AAC18446.1;Note=gp3%3B putative 9.0 kD protein;gbkey=CDS;gene=3;product=hypothetical protein;protein_id=AAC18446.1;transl_table=11

##sequence-region AF068845.1 1 52797
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=88870
AF068845.1  Genbank region  1   52797   .   +   .   ID=id-1;Dbxref=taxon:88870;gbkey=Src;mol_type=genomic DNA;old-name=Mycobacteriophage TM4
AF068845.1  Genbank gene    100 234 .   +   .   ID=gene-TM4_1;Name=1;gbkey=Gene;gene=1;gene_biotype=protein_coding;locus_tag=TM4_1
AF068845.1  Genbank CDS 100 234 .   +   0   ID=cds-AAD17569.1;Parent=gene-TM4_1;Dbxref=NCBI_GP:AAD17569.1;Name=AAD17569.1;Note=gp1;gbkey=CDS;gene=1;product=hypothetical protein;protein_id=AAD17569.1;transl_table=11
AF068845.1  Genbank gene    236 448 .   +   .   ID=gene-TM4_2;Name=2;gbkey=Gene;gene=2;gene_biotype=protein_coding;locus_tag=TM4_2
AF068845.1  Genbank CDS 236 448 .   +   0   ID=cds-AAD17570.1;Parent=gene-TM4_2;Dbxref=NCBI_GP:AAD17570.1;Name=AAD17570.1;Note=gp2;gbkey=CDS;gene=2;product=hypothetical protein;protein_id=AAD17570.1;transl_table=11

##sequence-region AF271693.1 1 50550
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=148603
AF271693.1  Genbank region  1   50550   .   +   .   ID=id-1;Dbxref=taxon:148603;gbkey=Src;mol_type=genomic DNA;old-name=Mycobacteriophage Bxb1
AF271693.1  Genbank gene    599 895 .   +   .   ID=gene-PBI_BXB1_1;Name=1;gbkey=Gene;gene=1;gene_biotype=protein_coding;locus_tag=PBI_BXB1_1
AF271693.1  Genbank CDS 599 895 .   +   0   ID=cds-AAG59706.1;Parent=gene-PBI_BXB1_1;Dbxref=NCBI_GP:AAG59706.1;Name=AAG59706.1;Note=related to L5 gp4%3B 11.4 kD hypothetical;gbkey=CDS;gene=1;product=hypothetical protein;protein_id=AAG59706.1;transl_table=11
AF271693.1  Genbank gene    930 1370    .   +   .   ID=gene-PBI_BXB1_2;Name=2;gbkey=Gene;gene=2;gene_biotype=protein_coding;locus_tag=PBI_BXB1_2
AF271693.1  Genbank CDS 930 1370    .   +   0   ID=cds-AAG59707.1;Parent=gene-PBI_BXB1_2;Dbxref=NCBI_GP:AAG59707.1;Name=AAG59707.1;Note=related to L5 gp5%3B 16.3 kD hypothetical;gbkey=CDS;gene=2;product=hypothetical protein;protein_id=AAG59707.1;transl_table=11
AF271693.1  Genbank gene    1716    2027    .   +   .   ID=gene-PBI_BXB1_3;Name=3;gbkey=Gene;gene=3;gene_biotype=protein_coding;locus_tag=PBI_BXB1_3
AF271693.1  Genbank CDS 1716    2027    .   +   0   ID=cds-AAG59708.1;Parent=gene-PBI_BXB1_3;Dbxref=NCBI_GP:AAG59708.1;Name=AAG59708.1;Note=11.4 kD hypothetical protein;gbkey=CDS;gene=3;product=hypothetical protein;protein_id=AAG59708.1;transl_table=11

##sequence-region AY129330.1 1 59471
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=205868
AY129330.1  Genbank region  1   59471   .   +   .   ID=id-1;Dbxref=taxon:205868;gbkey=Src;mol_type=genomic DNA;old-name=Mycobacteriophage Che8
AY129330.1  Genbank gene    108 563 .   +   .   ID=gene-PBI_CHE8_1;Name=1;gbkey=Gene;gene=1;gene_biotype=protein_coding;locus_tag=PBI_CHE8_1
AY129330.1  Genbank CDS 108 563 .   +   0   ID=cds-AAN12399.1;Parent=gene-PBI_CHE8_1;Dbxref=NCBI_GP:AAN12399.1;Name=AAN12399.1;gbkey=CDS;gene=1;product=hypothetical protein;protein_id=AAN12399.1;transl_table=11
AY129330.1  Genbank gene    571 2208    .   +   .   ID=gene-PBI_CHE8_2;Name=2;gbkey=Gene;gene=2;gene_biotype=protein_coding;locus_tag=PBI_CHE8_2
AY129330.1  Genbank CDS 571 2208    .   +   0   ID=cds-AAN12400.1;Parent=gene-PBI_CHE8_2;Dbxref=NCBI_GP:AAN12400.1;Name=AAN12400.1;gbkey=CDS;gene=2;product=hypothetical protein;protein_id=AAN12400.1;transl_table=11
AY129330.1  Genbank gene    2239    3609    .   +   .   ID=gene-PBI_CHE8_3;Name=3;gbkey=Gene;gene=3;gene_biotype=protein_coding;locus_tag=PBI_CHE8_3
AY129330.1  Genbank CDS 2239    3609    .   +   0   ID=cds-AAN12401.1;Parent=gene-PBI_CHE8_3;Dbxref=NCBI_GP:AAN12401.1;Name=AAN12401.1;gbkey=CDS;gene=3;product=hypothetical protein;protein_id=AAN12401.1;transl_table=11

##sequence-region AY129331.1 1 75931
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=205869
AY129331.1  Genbank region  1   75931   .   +   .   ID=id-1;Dbxref=taxon:205869;gbkey=Src;mol_type=genomic DNA;old-name=Mycobacteriophage CJW1
AY129331.1  Genbank gene    276 569 .   +   .   ID=gene-PBI_CJW1_1;Name=1;gbkey=Gene;gene=1;gene_biotype=protein_coding;locus_tag=PBI_CJW1_1
AY129331.1  Genbank CDS 276 569 .   +   0   ID=cds-AAN01616.1;Parent=gene-PBI_CJW1_1;Dbxref=NCBI_GP:AAN01616.1;Name=AAN01616.1;gbkey=CDS;gene=1;product=hypothetical protein;protein_id=AAN01616.1;transl_table=11
AY129331.1  Genbank gene    566 751 .   +   .   ID=gene-PBI_CJW1_2;Name=2;gbkey=Gene;gene=2;gene_biotype=protein_coding;locus_tag=PBI_CJW1_2
AY129331.1  Genbank CDS 566 751 .   +   0   ID=cds-AAN01617.1;Parent=gene-PBI_CJW1_2;Dbxref=NCBI_GP:AAN01617.1;Name=AAN01617.1;gbkey=CDS;gene=2;product=hypothetical protein;protein_id=AAN01617.1;transl_table=11
AY129331.1  Genbank gene    748 1038    .   +   .   ID=gene-PBI_CJW1_3;Name=3;gbkey=Gene;gene=3;gene_biotype=protein_coding;locus_tag=PBI_CJW1_3
AY129331.1  Genbank CDS 748 1038    .   +   0   ID=cds-AAN01618.1;Parent=gene-PBI_CJW1_3;Dbxref=NCBI_GP:AAN01618.1;Name=AAN01618.1;gbkey=CDS;gene=3;product=hypothetical protein;protein_id=AAN01618.1;transl_table=11
genomax wrote, 28 days ago:

A "not so fancy but workable" solution.

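# Collect one sequence ID per genome from the ##sequence-region lines.
$ grep "^##sequence-region" your.gff3 | awk '{print $2}' > identifiers
# For each ID, keep its lines; -A1 also keeps the ##species line that follows ##sequence-region.
$ while read -r id; do grep -F -A1 -e "$id" your.gff3 > "$id.gff3"; done < identifiers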
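
The grep loop above rereads the whole file once per genome, and `-A1` also pulls in whatever line follows a genome's last feature. As an alternative, a single awk pass can send each line to the file of the most recent `##sequence-region` line. This is only a minimal sketch under stated assumptions: the function name `split_gff3` and its optional output-directory argument are made up for illustration, and it assumes each genome's lines form one contiguous block that begins with its `##sequence-region` line, as in the example file. It also writes a `##gff-version 3` line at the top of every output, since GFF3 expects that header first.

```shell
# split_gff3 INPUT [OUTDIR]  (hypothetical helper, not part of any tool)
# Writes OUTDIR/<sequence-id>.gff3 for every "##sequence-region" block in INPUT.
split_gff3() {
    gff_input=$1
    gff_outdir=${2:-.}
    mkdir -p "$gff_outdir"
    awk -v outdir="$gff_outdir" '
        # "##sequence-region <id> <start> <end>" opens a new genome.
        /^##sequence-region/ {
            if (out != "") close(out)        # do not keep 54 files open at once
            out = outdir "/" $2 ".gff3"
            print "##gff-version 3" > out    # give each output its own header
        }
        # Drop blank separator lines and the global version header.
        /^[ \t]*$/ || /^##gff-version/ { next }
        # Every other line belongs to the genome opened most recently.
        out != "" { print > out }
    ' "$gff_input"
}
```

For example, `split_gff3 your.gff3 split` would write `split/Z18946.1.gff3`, `split/AF022214.1.gff3`, and so on. If an identifier could reappear later in the file, the second block would overwrite the first, so it is worth comparing the number of output files against `grep -c "^##sequence-region" your.gff3`.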