Question: Recommended tool for converting GenBank to GTF?
14 days ago, caggtaagtat wrote:

Hi everybody,

I have to convert the GenBank file of my virus genome to a GTF file for the alignment of my reads with STAR. I tried different tools, which I found by Google search, but unfortunately they don't work.

I tried a Perl script I found, called genbank2gtf_mRNA.pl from genebank2gtf, and a Python script, called gb2gtf.py from lpryszcz. When I use the Perl script with a custom chromosome file, I get an empty GTF file, and when I try to use the Python script, I get the error: "command didn't found".

Can someone maybe recommend a converter tool or platform which I could use?

Also, the GenBank file contains splice-site annotations from Geneious. Does the conversion carry these coordinates over?

Any help is gratefully appreciated :)

Edited: added example data and the tools I struggled with

My GenBank data looks something like this:

> FEATURES             Location/Qualifiers
>      repeat_region   552..634
>                      /vntifkey="34"
>                      /label=U5
>      CDS             5830..8414
>                      /vntifkey="4"
>                      /label=TAT
>                      /note="HIV-1 tat protein"
>      splicing_signal 5934..5935
>                      /vntifkey="38"
>                      /label=SA4c
>      splicing_signal 5952..5953
>                      /vntifkey="38"
>                      /label=SA4a
>      splicing_signal 5958..5959
>                      /vntifkey="38"
>                      /label=SA4b
>      CDS             8787..9407
>                      /vntifkey="4"
>                      /label=NEF
>                      /note="HIV-1 nef protein"
>      splicing_signal 5775..5776
>                      /vntifkey="38"
>                      /label=SA3
>      splicing_signal 6045..6046
>                      /vntifkey="38"
>                      /label=SD4
>      CDS             5969..8643
>                      /vntifkey="4"
>                      /label=REV
>                      /note="HIV-1 rev protein"
>      splicing_signal 5974..5975
>                      /vntifkey="38"
>                      /label=SA5
>      CDS             2085..5096
>                      /vntifkey="4"
>                      /label=POL
>                      /note="HIV-1 pol polyprotein;  (NH2-terminus uncertain)"
>      CDS             5041..5619
>                      /vntifkey="4"
>                      /label=VIF
>                      /note="HIV-1 vif protein"
>      CDS             5559..5849
>                      /vntifkey="4"
>                      /label=VPR
>                      /note="HIV-1 vpr protein"
>      repeat_region   9529..9626
>                      /vntifkey="34"
>                      /label=R
>                      /note="HIV-1 R repeat 3' copy"
>      CDS             6061..6306
>                      /vntifkey="4"
>                      /label=VPU
>                      /note="HIV-1 vpu protein"
>      CDS             6221..8785
>                      /vntifkey="4"
>                      /label=ENV
>                      /note="HIV-1 envelope polyprotein"
>      splicing_signal 6602..6603
>                      /vntifkey="38"
>                      /label=(SA6)
>      splicing_signal 6720..6721
>                      /vntifkey="38"
>                      /label=(SD5)
>                      /note="Mutation von GT in anderen Isolaten zu  AT"
>      splicing_signal 8367..8368
>                      /vntifkey="38"
>                      /label=SA7
>      LTR             1..634
>                      /vntifkey="19"
>                      /label=5'_LTR
>                      /note="HIV-1 5' LTR"
>      repeat_region   454..551
>                      /vntifkey="34"
>                      /label=R
>                      /note="HIV-1 R repeat 5' copy"
>      intron          744..5776
>                      /vntifkey="15"
>                      /label=TAT/REV/NEF_I
>                      /note="HIV-1 tat, rev, nef mRNA intron 1"
>      misc_feature    5743..5744
>                      /vntifkey="21"
>                      /label=JNCTN_NY5/LAV
>                      /note="HIV-1 isolate NY5 DNA end/HIV-1 isolate LAV DNA start"
>      intron          6045..8368
>                      /vntifkey="15"
>                      /label=TAT_II
>                      /note="HIV-1 tat cds intron 2"
>      CDS             790..2292
>                      /vntifkey="4"
>                      /label=GAG
>                      /note="HIV-1 gag polyprotein"
>      insertion_seq   1186..1186
>                      /vntifkey="14"
>                      /label=p17/p24
>      splicing_signal 4963..4964
>                      /vntifkey="38"
>                      /label=SD2
>      splicing_signal 744..745
>                      /vntifkey="38"
>                      /label=SD1
>      splicing_signal 4911..4912
>                      /vntifkey="38"
>                      /label=SA1
>      splicing_signal 5464..5465
>                      /vntifkey="38"
>                      /label=SD3
>      polyA_signal    9602..9607
>                      /vntifkey="25"
>                      /label=POLY_A
>                      /note="HIV-1 mRNA polyadenlyation signal"
>      splicing_signal 5388..5389
>                      /vntifkey="38"
>                      /label=SA2
>      intron          6045..8368
>                      /vntifkey="15"
>                      /label=REV_II
>                      /note="HIV-1 rev cds intron 2"
>      intron          6045..8368
>                      /vntifkey="15"
>                      /label=TAT/REV/NEF_II
>                      /note="HIV-1 tat, rev, nef mRNA intron 2"
>      LTR             9076..9709
>                      /vntifkey="19"
>                      /label=3'_LTR
>                      /note="HIV-1 3' LTR"
>      misc_feature    5059..5060
>                      /vntifkey="21"
>                      /label=SD2b
>      splicing_signal 8335..8336
>                      /vntifkey="38"
>                      /label=SA7a
>      splicing_signal 8440..8441
>                      /vntifkey="38"
>                      /label=SA7b
>      splicing_signal 4541..4542
>                      /vntifkey="38"
>                      /label=SA1a
>      splicing_signal 4722..4723
>                      /vntifkey="38"
>                      /label=SD1a
> BASE COUNT     3423 a      1756 c      2364 g      2166 t
> ORIGIN
>         1 tggaa....
Tags: rna-seq, annotation, gtf
modified 14 days ago • written 14 days ago by caggtaagtat
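
On the question of whether the splice-site coordinates survive a conversion: the feature table above stores them as ordinary `start..end` features, so any converter that reads the table can carry them over. What follows is only a minimal sketch of reading such a table, assuming plain forward-strand `start..end` locations as in the excerpt. The names `parse_features`, `FEATURE_RE` and `LABEL_RE` are hypothetical and do not come from any of the tools mentioned in this thread.

```python
import re

# A feature line has the key in column 6, e.g. "     CDS             5830..8414".
# Only plain "start..end" locations are handled; join(...) and complement(...)
# locations are out of scope for this sketch.
FEATURE_RE = re.compile(r"^ {5}(\S+)\s+(\d+)\.\.(\d+)\s*$")
# A qualifier line such as "                     /label=TAT".
LABEL_RE = re.compile(r"^\s+/label=(.+?)\s*$")


def parse_features(lines):
    """Return a list of dicts with the keys type, start, end and label."""
    features = []
    for line in lines:
        match = FEATURE_RE.match(line)
        if match:
            features.append({
                "type": match.group(1),
                "start": int(match.group(2)),
                "end": int(match.group(3)),
                "label": None,
            })
            continue
        match = LABEL_RE.match(line)
        # Attach the first /label qualifier to the most recent feature.
        if match and features and features[-1]["label"] is None:
            features[-1]["label"] = match.group(1)
    return features


if __name__ == "__main__":
    excerpt = [
        "     CDS             5830..8414",
        '                     /vntifkey="4"',
        "                     /label=TAT",
        "     splicing_signal 5934..5935",
        '                     /vntifkey="38"',
        "                     /label=SA4c",
    ]
    for feature in parse_features(excerpt):
        print(feature)
```

The sketch expects the lines as they appear in the GenBank file itself, without the "> " quoting prefix used in this post.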

Hello,

> I tried different tools, which I found by Google search, but unfortunately they don't work.

It would be useful if you told us which tools you've already tested and what the problem with them was. Otherwise you may get the same recommendations here.

fin swimmer

written 14 days ago by finswimmer

Oh yes, thank you, that makes sense, I will add it in the question.

written 14 days ago by caggtaagtat

I did a Google search and found genbank2gtf. I have never tried it, but you could explore it.

modified 14 days ago • written 14 days ago by pbpanigrahi

Thank you. I forgot to mention in the question which tools I had already tried without success. I tried using that tool, but somehow I cannot execute it.

written 14 days ago by caggtaagtat

It is possible to download viral genomes in GFF format from GenBank. Have you tried that?

PS: an example of the genome of interest would be useful.

written 14 days ago by Sej Modha

Thank you! I know that; however, I need the file to be in GTF format for the mapping. I could download it as GFF3, but I also struggled with the available GFF3-to-GTF converters.

written 14 days ago by caggtaagtat
14 days ago, Sej Modha (Glasgow, UK) wrote:

I think you can use a GFF file with the STAR aligner.

https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

2.2.3 Annotations in GFF format.

In addition to the aforementioned options, for GFF3 formatted annotations you need to use --sjdbGTFtagExonParentTranscript Parent. In general, for --sjdbGTFfile files STAR only processes lines which have --sjdbGTFfeatureExon (=exon by default) in the 3rd field (column). The exons are assigned to the transcripts using parent-child relationship defined by the --sjdbGTFtagExonParentTranscript (=transcript_id by default) GTF/GFF attribute.
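
To make the rule quoted above concrete, here is a rough Python sketch of the filter it describes: keep only lines whose third column is `exon`, and group them by the `Parent` attribute. This is an illustration of the manual's wording, not STAR's actual code. The function names are hypothetical, and silently skipping exon lines that lack the parent tag is an assumption made for the sketch.

```python
def parse_attributes(column9):
    """Parse GFF3 'key=value;key=value' attributes into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def exons_star_would_use(gff3_lines, feature_exon="exon", parent_tag="Parent"):
    """Return (seqname, start, end, strand, parent) for lines passing the filter.

    feature_exon mirrors --sjdbGTFfeatureExon and parent_tag mirrors
    --sjdbGTFtagExonParentTranscript as described in the quoted paragraph.
    """
    kept = []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_exon:
            continue  # only lines with the exon feature type are considered
        parent = parse_attributes(cols[8]).get(parent_tag)
        if parent is None:
            continue  # assumption: exon lines without the parent tag are skipped
        kept.append((cols[0], int(cols[3]), int(cols[4]), cols[6], parent))
    return kept


if __name__ == "__main__":
    lines = [
        "##gff-version 3",
        "pNL4-3\tGeneious\tCDS\t2085\t5096\t.\t+\t.\tName=POL",
        # Hypothetical exon record, only here to show a line that passes.
        "chrV\tdemo\texon\t100\t200\t.\t+\t.\tID=e1;Parent=tx1",
    ]
    print(exons_star_would_use(lines))
```

Under this reading, a GFF3 file that contains only CDS, intron and similar features with `Name=` attributes would contribute no exons at all, so it would need exon records with a parent attribute before STAR could use it for splice junctions.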

If you would like to convert GFF3 to GTF, you can use gffread. Please look at this post for further details: "Convertion Of Gff3 To Gtf"

modified 14 days ago • written 14 days ago by Sej Modha

OK. Since I want to create a fused genome with the viral DNA as an additional chromosome, I will have to convert my human GTF file to GFF3 format.

written 14 days ago by caggtaagtat

> OK. Since I want to create a fused genome with the viral DNA as an additional chromosome, I will have to convert my human GTF file to GFF3 format.

Do you mean GFF3 to GTF? You can use gffread for either conversion.

written 14 days ago by Sej Modha

Sorry, I meant that I have to convert either my big GTF file to GFF3 or my small GFF3 file (or GenBank file) to GTF, in order to fuse them for the alignment.

written 14 days ago by caggtaagtat

Unfortunately, gffread doesn't work. I get the error:

    Warning: invalid GTF record, transcript_id not found:
pNL4-3  Geneious        CDS     2085    5096    .       +       .       Name=POL
Warning: invalid GTF record, transcript_id not found:
pNL4-3  Geneious        CDS     5041    5619    .       +       .       Name=VIF
Warning: invalid GTF record, transcript_id not found:
pNL4-3  Geneious        CDS     5559    5849    .       +       .       Name=VPR
Warning: invalid GTF record, transcript_id not found:
pNL4-3  Geneious        CDS     6061    6306    .       +       .       Name=VPU

My GFF3 file looks like this:

##gff-version 3
##source-version geneious 10.1.3
##Type DNA pNL4-3
##DNA pNL4-3
##TGGAA...
##end-DNA
pNL4-3  Geneious    region  1   9709    .   +   0   Is_circular=true
pNL4-3  Geneious    insertion   1186    1186    .   +   .   Name=p17/p24
pNL4-3  Geneious    polyA_signal    9602    9607    .   +   .   Name=POLY_A
pNL4-3  Geneious    LTR 9076    9709    .   +   .   Name=3'_LTR
pNL4-3  Geneious    LTR 1   634 .   +   .   Name=5'_LTR
pNL4-3  Geneious    invisible_Parent    8888    15012   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346934513538.20
pNL4-3  Geneious    invisible_Parent    5304    8887    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346934513382.19
pNL4-3  Geneious    misc_feature    5005    5034    .   .   .   Name=Fragment3
pNL4-3  Geneious    misc_feature    5743    5744    .   +   .   Name=JNCTN_NY5/LAV
pNL4-3  Geneious    repeat_region   454 551 .   +   .   Name=R
pNL4-3  Geneious    repeat_region   9529    9626    .   +   .   Name=R
pNL4-3  Geneious    repeat_region   552 634 .   +   .   Name=U5
pNL4-3  Geneious    intron  744 5776    .   +   .   Name=TAT/REV/NEF_I
pNL4-3  Geneious    intron  6045    8368    .   +   .   Name=TAT_II
pNL4-3  Geneious    intron  6045    8368    .   +   .   Name=TAT/REV/NEF_II
pNL4-3  Geneious    intron  6045    8368    .   +   .   Name=REV_II
pNL4-3  Geneious    CDS 2085    5096    .   +   .   Name=POL
pNL4-3  Geneious    CDS 5969    8643    .   +   .   Name=REV
pNL4-3  Geneious    CDS 5830    8414    .   +   .   Name=TAT
pNL4-3  Geneious    CDS 6221    8785    .   +   .   Name=ENV
pNL4-3  Geneious    CDS 790 2292    .   +   .   Name=GAG
pNL4-3  Geneious    CDS 8787    9407    .   +   .   Name=NEF
pNL4-3  Geneious    CDS 5041    5619    .   +   .   Name=VIF
pNL4-3  Geneious    CDS 5559    5849    .   +   .   Name=VPR
pNL4-3  Geneious    CDS 6061    6306    .   +   .   Name=VPU
pNL4-3  Geneious    splicing signal 5059    5060    .   +   .   Name=SD2b
pNL4-3  Geneious    splicing signal 4963    4964    .   +   .   Name=SD2
pNL4-3  Geneious    splicing signal 5974    5975    .   +   .   Name=SA5
pNL4-3  Geneious    splicing signal 6720    6721    .   +   .   Name=(SD5)
pNL4-3  Geneious    splicing signal 744 745 .   +   .   Name=SD1
pNL4-3  Geneious    splicing signal 6045    6046    .   +   .   Name=SD4
pNL4-3  Geneious    splicing signal 5388    5389    .   +   .   Name=SA2
pNL4-3  Geneious    splicing signal 8367    8368    .   +   .   Name=SA7
pNL4-3  Geneious    splicing signal 5464    5465    .   +   .   Name=SD3
pNL4-3  Geneious    splicing signal 5775    5776    .   +   .   Name=SA3
pNL4-3  Geneious    splicing signal 5952    5953    .   +   .   Name=SA4a
pNL4-3  Geneious    splicing signal 5934    5935    .   +   .   Name=SA4c
written 14 days ago by caggtaagtat

Can you post the command and version of the gffread that produces this warning?

written 13 days ago by Sej Modha

I downloaded the precompiled version of gffread, gffread-0.9.12.Linux_x86_64, and executed this command:

    /Apps/gffread-0.9.12.Linux_x86_64/gffread pNL4-3.gff -T -o pNL4-3.gtf

modified 13 days ago • written 13 days ago by caggtaagtat

OK, the error was in the last column, where the transcript ID is given as Name=transcript_name; it only worked with another file in which the transcript ID was given as ID=transcript_name. However, the gffread conversion does not, of course, produce a GTF file with all exons, so I'm still searching the internet.

written 5 days ago by caggtaagtat
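
As a stopgap for the `Name=` versus `ID=` issue described above, the GFF3 records can also be rewritten as GTF lines directly. The following is a minimal sketch, not a replacement for gffread: the function name `gff3_to_gtf` is hypothetical, and it assumes that each selected feature can be treated as a single-exon transcript whose `gene_id` and `transcript_id` are both taken from `ID=` or, failing that, `Name=`.

```python
def gff3_to_gtf(gff3_lines, feature_types=("CDS",)):
    """Rewrite selected GFF3 features as GTF 'exon' lines.

    Assumption: every selected feature is one single-exon transcript,
    named after its ID= (or Name=) attribute.
    """
    gtf = []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        attrs = {}
        for item in cols[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attrs[key.strip()] = value.strip()
        name = attrs.get("ID") or attrs.get("Name")
        if not name:
            continue  # nothing usable as a transcript identifier
        cols[2] = "exon"
        cols[8] = 'gene_id "{0}"; transcript_id "{0}";'.format(name)
        gtf.append("\t".join(cols))
    return gtf


if __name__ == "__main__":
    record = "pNL4-3\tGeneious\tCDS\t2085\t5096\t.\t+\t.\tName=POL"
    for gtf_line in gff3_to_gtf([record]):
        print(gtf_line)
```

This does not recover the splice structure. For example, the TAT CDS (5830..8414) spans the annotated intron 6045..8368, so writing it as one exon misrepresents the spliced transcript. Real exon coordinates would still have to be derived from the intron and splice-site annotations.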