Question: Transcript fasta created using bedtools is incorrect
0
gravatar for mosquitoes
2.7 years ago by
mosquitoes0
United States
mosquitoes0 wrote:

I am using bedops to convert a .gff file to .bed file and then bedtools getfasta to create a transcript fasta file, but I am not getting the expected result in the transcript fasta file. The mRNA sequence in the fasta also contains introns despite using the -split option.

My .gff file looks like this:

Pv_Sal1_chr08   PlasmoDB        gene    1540281 1541168 .       +       .       ID=PVX_119350;Name=PVX_119350;description=1-cys-glutaredoxin-like+protein-1%2C+putative;size=888;web_id=PVX_119350;locus_tag=PVX_119350;size=888;Alias=156094059,14578309,A5KB62,5472321,Pv119350,PVX_119350,148801941
Pv_Sal1_chr08   PlasmoDB        mRNA    1540281 1541168 .       +       .       ID=rna_PVX_119350-1;Name=PVX_119350-1;description=PVX_119350-1;size=888;Parent=PVX_119350;Ontology_term=GO:0045454,GO:0009055,GO:0015035,GO:0015038,GO:0006118;Dbxref=ApiDB_PlasmoDB:PVX_119350,EC:1.11.1.15,NCBI_gi:14578309,NCBI_gi:148801941,NCBI_gi:156094059,taxon:126793
Pv_Sal1_chr08   PlasmoDB        CDS     1540965 1541168 .       +       0       ID=cds_PVX_119350-3;Name=cds;description=.;size=204;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        CDS     1540670 1540768 .       +       0       ID=cds_PVX_119350-2;Name=cds;description=.;size=99;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        CDS     1540284 1540487 .       +       0       ID=cds_PVX_119350-1;Name=cds;description=.;size=207;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        exon    1540281 1540487 .       +       .       ID=exon_PVX_119350-1;Name=exon;description=exon;size=207;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        exon    1540670 1540768 .       +       .       ID=exon_PVX_119350-2;Name=exon;description=exon;size=99;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   PlasmoDB        exon    1540965 1541168 .       +       .       ID=exon_PVX_119350-3;Name=exon;description=exon;size=204;Parent=rna_PVX_119350-1

And the resulting .bed file looks like this:

    Pv_Sal1_chr08   1540280 1540487 exon_PVX_119350-1       .       +       PlasmoDB        exon    .       ID=exon_PVX_119350-1;Name=exon;description=exon;size=207;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540280 1541168 PVX_119350      .       +       PlasmoDB        gene    .       ID=PVX_119350;Name=PVX_119350;description=1-cys-glutaredoxin-like+protein-1%2C+putative;size=888;web_id=PVX_119350;locus_tag=PVX_119350;size=888;Alias=156094059,14578309,A5KB62,5472321,Pv119350,PVX_119350,148801941
Pv_Sal1_chr08   1540280 1541168 rna_PVX_119350-1        .       +       PlasmoDB        mRNA    .       ID=rna_PVX_119350-1;Name=PVX_119350-1;description=PVX_119350-1;size=888;Parent=PVX_119350;Ontology_term=GO:0045454,GO:0009055,GO:0015035,GO:0015038,GO:0006118;Dbxref=ApiDB_PlasmoDB:PVX_119350,EC:1.11.1.15,NCBI_gi:14578309,NCBI_gi:148801941,NCBI_gi:1560940
Pv_Sal1_chr08   1540283 1540487 cds_PVX_119350-1        .       +       PlasmoDB        CDS     0       ID=cds_PVX_119350-1;Name=cds;description=.;size=207;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540669 1540768 cds_PVX_119350-2        .       +       PlasmoDB        CDS     0       ID=cds_PVX_119350-2;Name=cds;description=.;size=99;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540669 1540768 exon_PVX_119350-2       .       +       PlasmoDB        exon    .       ID=exon_PVX_119350-2;Name=exon;description=exon;size=99;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540964 1541168 cds_PVX_119350-3        .       +       PlasmoDB        CDS     0       ID=cds_PVX_119350-3;Name=cds;description=.;size=204;Parent=rna_PVX_119350-1
Pv_Sal1_chr08   1540964 1541168 exon_PVX_119350-3       .       +       PlasmoDB        exon    .       ID=exon_PVX_119350-3;Name=exon;description=exon;size=204;Parent=rna_PVX_119350-1

If I run the code:

~/bin/bedtools2/bin/bedtools getfasta -split -name -s -fi PlasmoDB-25_PvivaxSal1_Genome.fasta -bed PlasmoDB-25_PvivaxSal1.bed

My output file is unexpected. For instance both the gene and mRNA sequence are the same despite this gene containing two introns.

>PVX_119350::Pv_Sal1_chr08:1540280-1541168(+)
ACAATGATAGTAAGAAACAGTTGTAAGATCCTCCTTCGCATGAATCGAAGCATAGTGAGG
AGAAACCTAACGTTCTTTTCTTGCAATCAGGTTTCAAAAGCGAACTTCAGCAGTTCCCTT
CAAGATAAGCAAAATGGAGGACCCGAGAAGAAGATAAGCAACGGCTCGGACGACTTTAAA
GACTTTGAAAAGTCGGACGTGTACCAGGTTAACCCAGGGAGGGAAATGAGAGGGTGCGAA
TGAGATGGCACCTATGAGATGGTGCAATTTTCATTTTTAAAAATGATGCATCGTTTCCTA
TGTAAGCGCATTTTATTGGTGCCCAGCAGTTTGCCTCTATGCTAAATGACTCATGTCTGC
TTTTAACCCTTTTTGCTCCCCCTCCGTAGACTCTGAAGGGAAAAATTAAGGAAATCCTTG
AGAAGGAGAAAATTGTGCTGTTTATGAAGGGCACCCCGGAAAGTCCCCTGTGCGGGTTTA
GCGCAAAGGTGAGTTTCATGTATCTGCCGTGGGGTGGTTTTGTCCAACTGACCCCAGTTT
TTCCCACGTTTGAGAAACTTACATTATTGTTTTTTGGACTAATATGCACCCAGGAGAGGC
ACATTTTAGTATGCCCATTTTGTGCAGTGGCTAGCTTTATTATTTTTTTTTTATTTTTTT
TTTTTTCCCCTTTCCCCTGCGCAGGTGGTCCACATACTGAATAGTATGAACGTGAAGGAC
TACGTATACATCGACGTGATGAAAAACAACAACTTGCGCGAGGCGATAAAGATTTACAGC
AACTGGCCCTACATCCCCCACTTATACGTAAATAACAACTTCATCGGTGGCTACGACATC
ATCTCAGATTTGTACACCAGTGGAGAGCTACAGGGGCTGGTCAAGTAA
>rna_PVX_119350-1::Pv_Sal1_chr08:1540280-1541168(+)
ACAATGATAGTAAGAAACAGTTGTAAGATCCTCCTTCGCATGAATCGAAGCATAGTGAGG
AGAAACCTAACGTTCTTTTCTTGCAATCAGGTTTCAAAAGCGAACTTCAGCAGTTCCCTT
CAAGATAAGCAAAATGGAGGACCCGAGAAGAAGATAAGCAACGGCTCGGACGACTTTAAA
GACTTTGAAAAGTCGGACGTGTACCAGGTTAACCCAGGGAGGGAAATGAGAGGGTGCGAA
TGAGATGGCACCTATGAGATGGTGCAATTTTCATTTTTAAAAATGATGCATCGTTTCCTA
TGTAAGCGCATTTTATTGGTGCCCAGCAGTTTGCCTCTATGCTAAATGACTCATGTCTGC
TTTTAACCCTTTTTGCTCCCCCTCCGTAGACTCTGAAGGGAAAAATTAAGGAAATCCTTG
AGAAGGAGAAAATTGTGCTGTTTATGAAGGGCACCCCGGAAAGTCCCCTGTGCGGGTTTA
GCGCAAAGGTGAGTTTCATGTATCTGCCGTGGGGTGGTTTTGTCCAACTGACCCCAGTTT
TTCCCACGTTTGAGAAACTTACATTATTGTTTTTTGGACTAATATGCACCCAGGAGAGGC
ACATTTTAGTATGCCCATTTTGTGCAGTGGCTAGCTTTATTATTTTTTTTTTATTTTTTT
TTTTTTCCCCTTTCCCCTGCGCAGGTGGTCCACATACTGAATAGTATGAACGTGAAGGAC
TACGTATACATCGACGTGATGAAAAACAACAACTTGCGCGAGGCGATAAAGATTTACAGC
AACTGGCCCTACATCCCCCACTTATACGTAAATAACAACTTCATCGGTGGCTACGACATC
ATCTCAGATTTGTACACCAGTGGAGAGCTACAGGGGCTGGTCAAGTAA
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by mosquitoes0

I've tried gffutils to create a bed12 file from the gff.

I'm trying to convert a .gff file to a .bed12 file format using gffutils. I'm missing something about the syntax for the command though:

import gffutils
from sys import argv

script, database = argv

db = gffutils.FeatureDB(database)

db.bed12(block_featuretype=['exon'], thick_featuretype=['CDS'], thin_featuretype=None)

where database is correctly created gffutils.db. I get the error " TypeError: bed12() takes at least 2 arguments (2 given)"

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by mosquitoes0
0
gravatar for Devon Ryan
2.7 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

The BED file that bedops is producing is only sort of a BED file (it's apparently BED6 with extra information). For example, it doesn't contain exon information in mRNA entries, so it's impossible for bedtools to do what you want. Your best bet is to use R with rtracklayer and GenomicFeatures. Those can be combined to do what you want.

ADD COMMENTlink written 2.7 years ago by Devon Ryan89k

Thanks Devon, but I would like to solve this problem in python or bash.

ADD REPLYlink written 2.7 years ago by mosquitoes0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 723 users visited in the last hour