GTF file for HIV strain pNL4-3
7.3 years ago
caggtaagtat ★ 1.9k

Hi,

I have RNA-seq data from HIV-infected cells, which I now want to map to a combined human-HIV genome. To build that genome, I need a GTF file for my HIV strain. I didn't find strain-specific annotation files for HIV. Do you know where one could find something like that, or a better way to estimate HIV transcript abundance in RNA-seq data?

OK, I thought I could edit my Geneious annotations by hand in the GFF text file to convert it to a GTF file, but I am very uncertain whether my annotations are sufficient for that.

My GFF file looks like this:

pNL4-3  Geneious    region  1   9709    .   +   0   Is_circular=true
pNL4-3  Geneious    insertion   1186    1186    .   +   .   Name=p17/p24
pNL4-3  Geneious    polyA_signal    9602    9607    .   +   .   Name=POLY_A
pNL4-3  Geneious    LTR 9076    9709    .   +   .   Name=3'_LTR
pNL4-3  Geneious    LTR 1   634 .   +   .   Name=5'_LTR
pNL4-3  Geneious    invisible_Parent    8888    15012   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346934513538.20
pNL4-3  Geneious    invisible_Parent    5304    8887    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346934513382.19
pNL4-3  Geneious    misc_feature    5005    5034    .   .   .   Name=Fragment3
pNL4-3  Geneious    misc_feature    5743    5744    .   +   .   Name=JNCTN_NY5/LAV
pNL4-3  Geneious    repeat_region   454 551 .   +   .   Name=R
pNL4-3  Geneious    repeat_region   9529    9626    .   +   .   Name=R
pNL4-3  Geneious    repeat_region   552 634 .   +   .   Name=U5
pNL4-3  Geneious    intron  744 5776    .   +   .   Name=TAT/REV/NEF_I
pNL4-3  Geneious    intron  6045    8368    .   +   .   Name=TAT_II
pNL4-3  Geneious    intron  6045    8368    .   +   .   Name=TAT/REV/NEF_II
pNL4-3  Geneious    intron  6045    8368    .   +   .   Name=REV_II
pNL4-3  Geneious    CDS 2085    5096    .   +   .   Name=POL
pNL4-3  Geneious    CDS 5969    8643    .   +   .   Name=REV
pNL4-3  Geneious    CDS 5830    8414    .   +   .   Name=TAT
pNL4-3  Geneious    CDS 6221    8785    .   +   .   Name=ENV
pNL4-3  Geneious    CDS 790 2292    .   +   .   Name=GAG
pNL4-3  Geneious    CDS 8787    9407    .   +   .   Name=NEF
pNL4-3  Geneious    CDS 5041    5619    .   +   .   Name=VIF
pNL4-3  Geneious    CDS 5559    5849    .   +   .   Name=VPR
pNL4-3  Geneious    CDS 6061    6306    .   +   .   Name=VPU
pNL4-3  Geneious    splicing signal 5059    5060    .   +   .   Name=SD2b
pNL4-3  Geneious    splicing signal 4963    4964    .   +   .   Name=SD2
pNL4-3  Geneious    splicing signal 5974    5975    .   +   .   Name=SA5
pNL4-3  Geneious    splicing signal 6720    6721    .   +   .   Name=(SD5)
pNL4-3  Geneious    splicing signal 744 745 .   +   .   Name=SD1
pNL4-3  Geneious    splicing signal 6045    6046    .   +   .   Name=SD4
pNL4-3  Geneious    splicing signal 5388    5389    .   +   .   Name=SA2
pNL4-3  Geneious    splicing signal 8367    8368    .   +   .   Name=SA7
pNL4-3  Geneious    splicing signal 5464    5465    .   +   .   Name=SD3
pNL4-3  Geneious    splicing signal 5775    5776    .   +   .   Name=SA3
pNL4-3  Geneious    splicing signal 5952    5953    .   +   .   Name=SA4a
pNL4-3  Geneious    splicing signal 5934    5935    .   +   .   Name=SA4c
pNL4-3  Geneious    splicing signal 5958    5959    .   +   .   Name=SA4b
pNL4-3  Geneious    splicing signal 4911    4912    .   +   .   Name=SA1
pNL4-3  Geneious    splicing signal 6602    6603    .   +   .   Name=(SA6)
pNL4-3  Geneious    invisible_Parent    5786    7812    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938163915.21
pNL4-3  Geneious    invisible_Parent    7813    15494   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938164320.22
pNL4-3  Geneious    invisible_Parent    5304    7812    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938256243.23
pNL4-3  Geneious    invisible_Parent    7813    15012   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938256306.24
pNL4-3  Geneious    invisible_Parent    639 5785    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938340538.25
pNL4-3  Geneious    invisible_Parent    5786    10347   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938340632.26
pNL4-3  Geneious    invisible_Parent    639 5303    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938465400.27
pNL4-3  Geneious    invisible_Parent    5304    10347   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938465494.28
pNL4-3  Geneious    invisible_Parent    712 5785    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938540694.29
pNL4-3  Geneious    invisible_Parent    5786    10420   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938540787.30
pNL4-3  Geneious    invisible_Parent    712 5303    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938613678.31
pNL4-3  Geneious    invisible_Parent    5304    10420   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346938613756.32
pNL4-3  Geneious    invisible_Parent    5786    8465    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346941525286.33
pNL4-3  Geneious    invisible_Parent    8466    15494   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346941525411.34
pNL4-3  Geneious    invisible_Parent    5304    8465    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346941600376.35
pNL4-3  Geneious    invisible_Parent    8466    15012   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1346941600470.36
pNL4-3  Geneious    invisible_Parent    712 5303    .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1347027336363.0
pNL4-3  Geneious    invisible_Parent    5304    10420   .   +   .   Name=GvHzdFvgSWWztDH65o8llFeG9ws.1347027336628.1

Do I have to look up the exon borders and insert them manually? Should I delete the first line, and do I have to delete the splice signal entries?

Do you know another way to get, e.g., an example HIV GTF file for comparison? Or even the one I need?
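
As a rough illustration of the by-hand conversion asked about above, here is a minimal Python sketch that keeps only selected feature types from a GFF and rewrites them as GTF rows. Everything in it is an assumption rather than an established recipe: the function name `gff_to_gtf_exons` is made up, the input is assumed to be tab-separated with nine columns and a `Name=` attribute, and each kept row is written as an `exon` feature because many counting tools read exon rows by default. Rows such as the region line or the splicing signals are simply skipped. Note that the TAT and REV intervals in the GFF above span an intron, so they would still need to be split into separate exons by hand.

```python
def gff_to_gtf_exons(gff_lines, keep_types=("CDS",)):
    """Convert selected tab-separated GFF rows into GTF 'exon' rows.

    Hypothetical helper: uses the GFF Name attribute as both gene_id
    and transcript_id, which is an assumption, not a standard.
    """
    gtf_rows = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip rows that are not well-formed 9-column records
        seqname, source, ftype, start, end, score, strand, _frame, attrs = cols
        if ftype not in keep_types:
            continue  # drops region, LTR, splicing signal, etc.
        pairs = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        name = pairs.get("Name")
        if name is None:
            continue  # no usable identifier for this feature
        gtf_attrs = f'gene_id "{name}"; transcript_id "{name}";'
        gtf_rows.append("\t".join(
            [seqname, source, "exon", start, end, score, strand, ".", gtf_attrs]
        ))
    return gtf_rows


example = [
    "pNL4-3\tGeneious\tregion\t1\t9709\t.\t+\t0\tIs_circular=true",
    "pNL4-3\tGeneious\tCDS\t790\t2292\t.\t+\t.\tName=GAG",
    "pNL4-3\tGeneious\tsplicing signal\t744\t745\t.\t+\t.\tName=SD1",
]
for row in gff_to_gtf_exons(example):
    print(row)
```

Only the GAG row survives the filter in this example. Whether one exon per CDS is adequate depends on the downstream tool and on how the spliced genes are handled.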

HIV annotation mapping • 3.2k views
ADD COMMENT

If you have the GenBank file, you could try using a genbank2gtf-type program to generate one. Here is one repo.

ADD REPLY

Thank you! I have the annotation in Geneious and can download the GFF file from there. I then just have to convert it, which I guess can be done by hand, since the file is not that large.

ADD REPLY

Hi caggtaagtat,

Did your HIV NL4-3 GFF/GTF file work? I have the same question, and I could not find a GFF/GTF for NL4-3 despite an intensive Google search.

Best,

Xiao

ADD REPLY

The sequence for HIV NL4-3 is available here. You could download the GenBank-format file and then try to make the GTF file from it.

ADD REPLY

The GTF file should contain all the transcripts of NL4-3, not just the DNA sequences. There are no such annotations of NL4-3 transcripts on the Internet.
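
For transcript-level records, one possible manual route is to derive exon intervals from intron coordinates such as those in the GFF posted in the question. The sketch below is illustrative only: `exons_from_introns` is a hypothetical helper, coordinates are treated as 1-based and inclusive, and the transcript bounds 454 and 9626 are an assumption borrowed from the R repeat rows of that GFF, not a verified transcription start or polyadenylation site.

```python
def exons_from_introns(tx_start, tx_end, introns):
    """Return exon (start, end) pairs for a transcript, given its introns.

    All coordinates are 1-based and inclusive. Introns must lie strictly
    inside the transcript and must not overlap or touch each other.
    """
    exons = []
    cursor = tx_start  # first base of the exon currently being built
    for i_start, i_end in sorted(introns):
        if i_start <= cursor or i_end >= tx_end:
            raise ValueError(f"intron {i_start}-{i_end} does not fit inside the transcript")
        exons.append((cursor, i_start - 1))  # exon ends just before the intron
        cursor = i_end + 1                   # next exon starts just after it
    exons.append((cursor, tx_end))
    return exons


# Introns taken from the TAT/REV/NEF_I and TAT/REV/NEF_II rows in the question;
# the transcript bounds are an illustrative assumption.
print(exons_from_introns(454, 9626, [(744, 5776), (6045, 8368)]))
```

Each resulting interval could then be written as an `exon` row sharing one `transcript_id`. Which donor and acceptor combinations correspond to real HIV transcripts is a biological question that this sketch does not answer.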

ADD REPLY

Just wanted to follow up: to address this specific issue, we have released HIV transcriptome annotations of alternative splicing featuring all major donors and acceptors (full description below): https://ccb.jhu.edu/HIV_Atlas/. Hope this helps!

ADD REPLY
3 hours ago
Ales ▴ 50

For anyone stumbling on this: we have recently completed reference-grade annotations for several thousand HIV-1 genomes, including the HXB2 (K03455.1), 89.6 and NL4-3 reference genomes: https://ccb.jhu.edu/HIV_Atlas/. Each annotation features the full set of US, PS and FS messages, along with protein assignments and major donor and acceptor sites. The annotations are provided in GTF and GFF formats. You can browse them directly on the web interface via the integrated JBrowse2 and use them with any transcriptomic utilities you would use for human or other genomes (assembly, quantification, gene/tx expression, etc.).

This project started from my personal need to improve spliced alignment with HISAT2/STAR and minimap around splice sites and to be able to compute transcript and junction expression effectively. Eventually, this took me down the rabbit hole of creating several methods for annotation transfer in HIV and annotating thousands of LANL complete genome assemblies. You can also read more about the resource in our preprint.

ADD COMMENT

This project started from my personal need to improve spliced alignment

Are there plans to submit these annotations to NCBI/EBI? Are you using reference genomes available from those databases, or are these genomes something your lab has assembled and that therefore also need to be submitted to public repositories?

While your lab may be the leading experts on these genomes, users generally look to NCBI/EBI for authoritative reference sequences and annotations, as those can be traced back to official NCBI/EBI accessions instead of a link to a website that may or may not stay live in the long term.

ADD REPLY

All of the genome assemblies are from the LANL compendium, with accessions matching GenBank (>2,000 complete genome assemblies from the 2021 LANL sequence compendium), and include most isolates people commonly work with (HXB2, 89.6, NL4-3, etc.). You can download the actual genome sequence (e.g. K03455.1) from NCBI or LANL, then search for it on our interface and obtain the respective annotation (if that accession is included in our database).

I would like to have them made available on GenBank/NCBI once the review is over, to automatically replace the current limited CDS-only annotations. However, that process might be tricky and take a bit of time (GenBank submissions are not the easiest, and I'll need to find a good way to have new annotations added to the respective accessions). For now, I think the important part was to make the resource publicly available as is and handle the bureaucracy next.

ADD REPLY
