Question

convert genome file to transcriptome file

1

6.3 years ago

harry ▴ 40

I want to know which package I should download to convert my genome file into a transcript file that I can use with the kallisto pseudomapper. Please tell me which package to use and how to download it. Thanks.

RNA-Seq • 8.2k views

updated 1 hour ago by Ales ▴ 50 • written 6.3 years ago by harry ▴ 40

2

I changed your title, as "package download" doesn't tell us anything about your question. In addition, it is usually helpful to be as precise as possible. You write "genome file", but it would probably be more accurate to say that you have a genome fasta and want a transcriptome fasta. We also have no idea which organism you are working on, so specifying that would be good as well.

6.3 years ago by WouterDeCoster 48k

1

If you assembled and annotated the genome yourself, you have to tell us a bit more about how it was assembled and annotated, in particular what kind of annotation you have.

If the genome was assembled by a third party and is available at NCBI or Ensembl, a suitable transcripts fasta is probably already available; you just have to find it.

6.3 years ago by h.mon 35k

0

I have the HIV genome in fasta format, but I don't have its transcripts, and kallisto works on a transcript file. So please tell me how to convert the HIV genome fasta into a transcript file.

6.3 years ago by harry ▴ 40

score 8 · Answer 1 · 2019-06-27

8

6.3 years ago

husensofteng ▴ 410

Basically, assuming you work on a Linux/Mac machine, you could:

1. Download and extract the genome fasta file for HIV1*:

    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/864/765/GCF_000864765.1_ViralProj15476/GCF_000864765.1_ViralProj15476_genomic.fna.gz
    gunzip GCF_000864765.1_ViralProj15476_genomic.fna.gz

2. Download and extract gene annotations for HIV1:

    wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/864/765/GCF_000864765.1_ViralProj15476/GCF_000864765.1_ViralProj15476_genomic.gff.gz
    gunzip GCF_000864765.1_ViralProj15476_genomic.gff.gz

3. Generate the transcriptome fasta using gffread:

    gffread -F -w transcriptome.fa -g GCF_000864765.1_ViralProj15476_genomic.fna GCF_000864765.1_ViralProj15476_genomic.gff

*Of course, step 1 is not needed if you already have the genome fasta file. In that case, make sure you download an annotation file (GFF or GTF) that corresponds to your genome fasta. Otherwise, make sure the files listed in steps 1 and 2 belong to your HIV subtype.
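
Whether an annotation file corresponds to a genome fasta can be checked mechanically, at least at the level of sequence names. Below is a minimal sketch, assuming a tab-separated GFF/GTF whose first column holds the sequence ID and a fasta whose ID is the header text before the first whitespace. The name missing_seqids is made up for illustration and is not part of gffread.

```shell
# Sketch: print every sequence ID that appears in column 1 of a GFF/GTF
# but has no matching header in the genome fasta.
# Usage: missing_seqids genome.fa annotation.gff
missing_seqids() {
    awk -F'\t' '
        # First file (the fasta): remember the ID of each header line.
        FNR == NR {
            if (/^>/) {
                split(substr($0, 2), h, /[ \t]/)
                seen[h[1]] = 1
            }
            next
        }
        # Second file (the annotation): skip comments and malformed rows.
        /^#/ || NF < 9 { next }
        # Report each unknown sequence ID once.
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

Empty output means every annotated sequence name exists in the fasta. It does not prove that the coordinates belong to the same assembly version, so this only catches the most obvious mismatches.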

6.3 years ago by husensofteng ▴ 410

2

Your answer is correct, but you are slightly over-complicating things: the FTP repository you linked already contains the transcripts in fasta format, in the file GCF_000864765.1_ViralProj15476_cds_from_genomic.fna.gz.

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/864/765/GCF_000864765.1_ViralProj15476/GCF_000864765.1_ViralProj15476_cds_from_genomic.fna.gz
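
Once such a fasta is downloaded and unpacked, it can be passed to kallisto directly. The sketch below sanity-checks the file and then builds an index. The function names and the index file name are illustrative, and it assumes kallisto is already installed and on the PATH.

```shell
# Sketch only: helper names and file names are placeholders.

# Count fasta records by counting header lines (those starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Build a kallisto index from a transcript/CDS fasta.
# Usage: build_index transcripts.fa out.idx
build_index() {
    if ! command -v kallisto >/dev/null 2>&1; then
        echo "kallisto not found on PATH" >&2
        return 1
    fi
    if [ "$(count_fasta_records "$1")" -eq 0 ]; then
        echo "no fasta records found in $1" >&2
        return 1
    fi
    # kallisto index -i <index file> <fasta>
    kallisto index -i "$2" "$1"
}
```

For example, after gunzip you could run build_index GCF_000864765.1_ViralProj15476_cds_from_genomic.fna hiv_cds.idx, and then quantify paired-end reads with kallisto quant -i hiv_cds.idx -o quant_out reads_1.fastq.gz reads_2.fastq.gz, where the read file names stand in for your own data.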

6.3 years ago by h.mon 35k

0

You are right. I thought the question was about generating a transcripts fasta from the user's own genome fasta, and I gave the links as an example for completeness.

6.3 years ago by husensofteng ▴ 410

0

It should be noted that the gtf/gff and genomic.fna files for HIV only contain CDS annotations of the 9 ORFs, not the full transcripts. Only one splice junction (D4-A7) is actually described in the CDS annotations, since the rest of the major donor and acceptor sites are in the UTRs.
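
A quick way to see what an annotation file actually contains is to tally its feature types (column 3). This is a minimal sketch that has not been run against the RefSeq file here, and feature_type_counts is a made-up helper name. If the file is CDS-only as described, you would expect to see CDS rows but few or no exon or mRNA rows.

```shell
# Sketch: count how often each feature type (column 3) occurs in a GFF/GTF.
# Usage: feature_type_counts annotation.gff
feature_type_counts() {
    awk -F'\t' '
        # Skip comment lines and rows without the 9 standard columns.
        /^#/ || NF < 9 { next }
        { n[$3]++ }
        END { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}
```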

There are dozens (if not hundreds, according to some reports) of alternatively spliced transcripts for HIV/SIV. Those annotations, however, were only available as figures and Excel sheets scattered across very detailed manuscripts, and so were incompatible with computational workflows without a lot of manual work.

To address this specific issue, we have recently completed reference-grade annotations for several thousand HIV-1 genomes, including the HXB2 (K03455.1), 89.6 and NL4-3 reference genomes: https://ccb.jhu.edu/HIV_Atlas/. Each annotation features the full set of US, PS and FS messages, along with protein assignments and major donor and acceptor sites. The annotations are provided in GTF and GFF formats; you can browse them directly in the web interface via the integrated JBrowse2 and use them with any transcriptomic utilities you would use for human or other genomes (assembly, quantification, gene/tx expression, etc.).

This project started from my personal need to improve spliced alignment with HISAT2/STAR and minimap around splice sites and to be able to compute transcript and junction expression effectively. Eventually, this took me down the rabbit hole of creating several methods for annotation transfer in HIV and annotation of thousands of LANL complete genome assemblies. You can also read more about the resource in our preprint: https://www.biorxiv.org/content/10.1101/2025.09.24.675449v1 Hope this helps!
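
For gene-level summaries after transcript quantification, tools downstream of kallisto usually want a transcript-to-gene table. The sketch below derives one from a GTF, assuming the usual attribute convention (gene_id "X"; transcript_id "Y";). It has not been verified against the HIV Atlas files, and tx2gene_from_gtf is a hypothetical helper name.

```shell
# Sketch: print a two-column table (transcript_id <TAB> gene_id) from a GTF.
# Usage: tx2gene_from_gtf annotation.gtf > tx2gene.tsv
tx2gene_from_gtf() {
    awk -F'\t' '
        # Skip comment lines and rows without the 9 standard columns.
        /^#/ || NF < 9 { next }
        {
            tx = ""; gene = ""
            # The prefix transcript_id " is 15 characters long, gene_id " is 9.
            if (match($9, /transcript_id "[^"]+"/))
                tx = substr($9, RSTART + 15, RLENGTH - 16)
            if (match($9, /gene_id "[^"]+"/))
                gene = substr($9, RSTART + 9, RLENGTH - 10)
            # Emit each transcript once, in order of first appearance.
            if (tx != "" && gene != "" && !seen[tx]++)
                print tx "\t" gene
        }
    ' "$1"
}
```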

1 hour ago by Ales ▴ 50

score 0 · Answer 2 · 2019-06-27

You will also need to obtain a transcript annotation file, typically a GTF or GFF file. From that annotation file and your genome fasta file you can extract the transcript nucleotide sequences as a fasta file using a tool such as gffread. Incidentally, gffread is also included in both the Cufflinks and StringTie binary releases, so it is typically easier to download those binaries and access the program through them.
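
As a rough sketch of that extraction step, the wrapper below checks that gffread is reachable before calling it. The wrapper name extract_transcripts and the file names in the usage line are placeholders. How gffread ends up on your PATH (a Cufflinks download, a package manager such as bioconda, or a build from source) is up to you.

```shell
# Sketch: run gffread if it is installed, otherwise fail with a hint.
# Usage: extract_transcripts genome.fa annotation.gtf transcriptome.fa
extract_transcripts() {
    if ! command -v gffread >/dev/null 2>&1; then
        echo "gffread not found on PATH; install it first" >&2
        return 127
    fi
    # -g: genome fasta; -w: output fasta with the spliced exon
    # sequence of each transcript in the annotation.
    gffread -w "$3" -g "$1" "$2"
}
```

The resulting transcriptome.fa is the file you would then hand to kallisto index.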