Download GTF/GFF annotation data for NT database, not by organism (for STAR alignment)
0
3
3.7 years ago
Malka ▴ 70

Hello,

I would like to run some alignments against the full NT database, using STAR. I am having difficulty finding the necessary annotation files for the NT database. I have found GTF/GFF files for specific organisms or species, but not for the full NT database. Where can I find that?

RNA-Seq rna-seq alignment next-gen • 2.4k views
2

As far as I know, that does not exist (and frankly I don't see the point of it either).

Why would you run STAR against the whole nt DB in the first place?

0

I want to find out how much of my sequence aligns to each known organism and species. For example, in the Run Browser/Analysis tab in this link, there's a full Taxonomy Analysis for the particular read. I would like to do basically the same thing, and I thought running STAR alignment against the full NT database would achieve that. Am I wrong? How would you suggest going about this?

1

For starters they use a very different software than STAR.

Did you have a look at the How taxonomy analysis is done? section they provided? It seems that tool is available from github. Also pay attention to the disclaimer at the top of the page and especially the "take these results with a boatload of salt" ;)

0

Yes, thank you.

Specifically because of the disclaimer at the top of the page, I want to use a different tool, which is not experimental...

Shouldn't the results be fairly similar regardless of the tool being used?

1

I fear we're ending up in some circular reasoning here.

To me it seems NCBI came up with this 'experimental' approach because none of the non-experimental ones is likely to be capable of doing it (see also genomax's comments).

I still wonder if it's really absolutely necessary to try to align to the complete nt? You should have some idea where your read data comes from, no?

0

I see.

I do know where my data comes from, but the taxonomy analysis at the above link seems to suggest that that would only be part of the story for any SRA run, with some reads likely aligning with other organisms, whose identity is unknown to me. I want to avoid running a separate alignment on each known organism, so I figured running it against the full NT at once would be a fast and accurate way to find out. I now understand that that is not really feasible...

I wonder what you think about using the data at the above link for SRA runs which I download from the site. They do say to use the results 'with a boatload of salt', but I assume the NCBI wouldn't publish results that they are not fairly sure are accurate. Would you rely on that available data for your research?

Thanks.

0

You need to specify what kind of data you are working with. Generally a normal sample would have only one genome (or a defined number, if that is true by design, e.g. you created a mix of N bacteria). If you had contamination with more than 1-2 other organismal genomes (besides the one of your interest) then you need to think seriously about whether to go further with that data.

The only case where you would legitimately be working with a mixture of genomes would be a metagenomic sample. In that case, the tools I already mentioned below would be useful to assign reads to the closest taxonomy.

0

I am working with existing SRA runs from the database.

I am interested in horizontal gene transfer (HGT), for example. Viruses transfer genes into host DNA, and they stay there. This has even been shown to potentially cause cancer, as in the case of HPV causing cervical cancer and EBV causing gastric carcinoma. In fact, the WHO apparently claims that more than 15% of cancers are caused by pathogens.

So sequences from an individual who had been infected with one of these viruses are likely to align also to that virus, in addition to the human genome - as the link I shared above implies (as does research in the field). That would disagree with the notion that only contaminated samples would align to several organisms.

So now if I want to study the connection, say of HPV and cancer, would you rely on the information on their site despite their warning to take it 'with a boatload of salt', or not?

Thanks.

1

Nice experimental setup. Do you have access to enough data for this? You would most likely need genomic data from sufficiently specific tissue (cells?).

As for the technical side of it: it sounds like you have no need to map against the whole NT database; a subset of human + all known viral sequences will do the trick, I think. I see no need for you to map your data against, say, plants, algae, other mammals, ..., which are of course also all part of the NT database.
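
To illustrate the idea of a combined human + viral reference, here is a minimal Python sketch that concatenates FASTA files and prefixes every header with a source label, so that hits can later be traced back to "human" or "viral". The file names in the usage note and the `label|` header convention are assumptions made for this example, not something any aligner requires.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed FASTA file for reading as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def combine_fastas(sources, out_path):
    """Concatenate FASTA files, prefixing each header with a source label.

    `sources` maps a label (e.g. "human", "viral") to a FASTA path.
    Returns the number of sequences written per label.
    """
    counts = {}
    with open(out_path, "w") as out:
        for label, path in sources.items():
            counts[label] = 0
            with open_text(path) as handle:
                for line in handle:
                    # Guard against a final line without a newline, which
                    # would otherwise glue the next header onto it.
                    if not line.endswith("\n"):
                        line += "\n"
                    if line.startswith(">"):
                        out.write(f">{label}|{line[1:]}")
                        counts[label] += 1
                    else:
                        out.write(line)
    return counts
```

It could be called as `combine_fastas({"human": "human.fa.gz", "viral": "viral.fa.gz"}, "combined.fa")`, where both input names are hypothetical placeholders for whatever files you actually downloaded.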

Concerning the 'boatload of salt' issue: personally, I would not trust it blindly, but as in all analyses it might point you in the right direction, which you can then analyse in more detail.

0

Thanks!

(I wrote an elaborate reply to this comment and to your comment below from several weeks ago, but couldn't post it because I don't yet have enough votes on the site...)

Regarding the data, I haven't found a way to filter SRA accession codes by tissue type on NCBI's SRA website (nor at DDBJ's or EBI's), so I retrieved them using H5 files on the ARCHS4 website, which do enable filtering by tissue type and return GEO accession codes. I then used methods described in this post to get SRA accession codes based on GEO codes. Out of 1401 codes retrieved, 1294 contained relevant taxonomy analysis data on the site, and out of those, only 462 contained Homo sapiens data in the Strong Signals table. So it looks like I am down to those...

Good point about the NT. If I remember correctly, though, there were direct links to download specific superkingdoms, except bacteria, which I need. (Not sure about that, I would need to recheck.) So it was either download them one by one or download the entire NT, which is simpler. I wonder if I can find a list of all bacteria by host tissue type. In that case, that would be the simplest solution.

I appreciate your take on the 'trust' issue... ;-)

I am fairly new to this, so it is important for me to hear expert opinions. Thanks!

0

If you just need viral genomes then you can download this file.

0

Thanks. I need bacterial genomes too... Is there a similar file for that? And how about archaea?

Thank you.

0

You can download bacterial genomes using this tool. There is no single file for bacterial genomes.

What are you planning to use the bacterial genomes for?

0

That's great! Thanks for the useful link. I wonder if there's a way to get annotation files too... And I would need to find a solution for the issue of bacteria which don't have a reference for the species, so that not all strains should be aligned to...

Perhaps I will suggest it there.

I would like to align to bacterial genomes too...

Although as it currently stands, I would like to first work with the data from the SRA website to get a general direction. I have written code to scrape the data from the site and I will start with that.

0

Out of curiosity: what would be your next (technical) steps when you got the mapping worked out and you have your mapping results? Looking for reads that multi-map on both human and something else?

0

My plans for initial analysis are different.

I'd be happy to elaborate further in private if you wish. I am not keen on publicising research ideas before they are done... I hope you understand.

0

sure, no problem.

good luck with the quest !

0

Thanks!

As I said, I'd be happy to elaborate privately. I looked for a way to send you a private message on this site, but didn't find such an option.

And quite honestly, I am not yet completely certain myself of exactly where this would lead...

1

While I agree with everything you say above, you need to be careful about drawing conclusions from viral sequences present in the human genome. The human genome is known to contain many ERVs and HERVs (ref1, ref2 etc.). I am not immediately aware whether these represent full viral genomes or just parts that we are able to identify in extant sequences.

If you find sequences that show perfect homology to known viral sequence then you can certainly use the taxonomic assignment with confidence. As to whether that sequence actually carries a functional significance, that is a critical question you will need to answer independently of the taxonomic assignment.

0

I will certainly look into it.

Currently, I am thinking of focusing more on bacteria in samples, although they don't affect the actual genome.

Thanks!

1

Did you click on the link that says How taxonomy analysis is done at the link you posted above? You should probably use that same method to do this analysis.

There are packages like kraken2 and centrifuge that can also be used for rapid read assignments to genus/species.
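
As a rough illustration of what you get out of such a classifier, below is a small Python sketch that pulls the species-level rows out of a Kraken 2 style report. It assumes the standard six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, NCBI taxid, indented name); the function name and the `min_reads` threshold are made up for this example.

```python
def species_hits(report_text, min_reads=1):
    """Return (name, taxid, clade_reads) for species-level rows, largest first.

    Assumes a six-column, tab-separated Kraken 2 style report:
    percentage, clade reads, direct reads, rank code, taxid, name.
    """
    hits = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads, rank, taxid, name = fields[1], fields[3], fields[4], fields[5]
        # Rank code "S" marks species-level rows in this report layout.
        if rank.strip() == "S" and int(clade_reads) >= min_reads:
            hits.append((name.strip(), int(taxid), int(clade_reads)))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Feeding it the text of a report file gives a ranked list of species with their read counts, which is roughly the kind of table shown in the SRA taxonomy analysis.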

0

I did, thank you.

Specifically because they mention that the software is experimental and the results should be taken 'with a boatload of salt', I am interested in using a different method to arrive at similar results for specific SRA reads.

I will look into the packages you mention.

Thanks.

1

"I have found GTF/GFF files for specific organisms or species"

As you note, annotation files are generally for specific genomes/organisms, so there is no way to get them for the NT database.

For the record, trying to do STAR alignment against NT is a futile task. Even if you are able to find adequate hardware to do it you would likely not get anything useful out of it.

What are you trying to achieve?

0

Please see my comment above (in response to lieven.sterck's answer to my question) as to what I am trying to achieve. Would I not be able to do that by running STAR alignment against NT?

Do I understand correctly from your answer that there is no way to run STAR alignment against a full database?

I am fairly new to this, so I'd appreciate if you could please explain to me why it would be futile.

Thanks!

1

"Do I understand correctly from your answer that there is no way to run STAR alignment against a full database?"

Theoretically speaking, I would not say there is no way to do this. If you have access to the right hardware and patience, it may be possible to do so. For reference, STAR needs about 30 GB of RAM for a human-genome-sized reference (3 GB), so you can extrapolate from there to get an idea of what NT may need. You may need to build indexes for pieces of NT and so on. It would become a significant undertaking at some point. See my comment above for some additional software options beyond the tool used by NCBI.
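
To make that extrapolation concrete, here is a back-of-the-envelope Python sketch. It assumes RAM scales linearly with reference size (about 30 GB per 3 GB of sequence, as quoted above), which is only a rough guess. The NT size used is the roughly 52 GB compressed FASTA mentioned in this thread, and the 4x compression ratio is a placeholder assumption, not a measured value.

```python
# Figures quoted in the thread: ~30 GB RAM for a ~3 GB reference.
RAM_GB_PER_REFERENCE_GB = 30 / 3


def estimate_star_ram_gb(reference_gb):
    """Linearly extrapolate STAR RAM needs from reference size (rough guess)."""
    return reference_gb * RAM_GB_PER_REFERENCE_GB


def uncompressed_size_gb(compressed_gb, compression_ratio):
    """Estimate uncompressed FASTA size from its compressed size."""
    return compressed_gb * compression_ratio


if __name__ == "__main__":
    # ASSUMPTION: the 4x compression ratio is a placeholder for illustration.
    nt_gb = uncompressed_size_gb(52, 4)
    print(f"NT uncompressed (assumed): ~{nt_gb:.0f} GB")
    print(f"Estimated STAR RAM: ~{estimate_star_ram_gb(nt_gb):.0f} GB")
```

Under those assumptions the estimate lands in the terabyte range, which is why splitting NT into pieces or using a dedicated classifier looks more realistic than a single STAR index.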

0

Thanks.

I am working on a server which should have enough RAM and disk space, and I am considering working in the cloud too. In that case, would it be possible to run STAR alignment against NT? If so, how would I find the necessary annotation files?

1

Are you sure? NT fasta file seems to be 52 GB compressed. Compressed human genome sequence is < 1GB.

With STAR you don't absolutely need to have GTF files. Having annotation improves alignment but you should be able to align without it as well. You will need to figure out your own method to assign taxonomy to the "hits" you get, so keep that in mind.
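
A minimal Python sketch of both points, under stated assumptions: the first function only assembles a STAR `genomeGenerate` command that leaves out `--sjdbGTFfile` (it does not run STAR), and the other two count primary alignments per reference sequence from SAM-format lines and roll them up to taxa. The `ref_to_taxon` lookup table is hypothetical; you would have to build it yourself, for example from the accession-to-taxid mappings NCBI publishes.

```python
from collections import Counter

# SAM flag bits for records that should not be counted as primary hits.
UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800


def star_index_command(genome_dir, fasta_files, threads=8):
    """Build a STAR genomeGenerate command with no annotation (no --sjdbGTFfile)."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", str(genome_dir),
        "--genomeFastaFiles", *map(str, fasta_files),
    ]


def count_primary_hits(sam_lines):
    """Count primary mapped alignments per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        flag = int(fields[1])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        counts[fields[2]] += 1
    return counts


def roll_up(counts, ref_to_taxon):
    """Sum per-reference counts into per-taxon counts via a lookup table."""
    taxa = Counter()
    for ref, n in counts.items():
        taxa[ref_to_taxon.get(ref, "unassigned")] += n
    return taxa
```

Note that with paired-end data each mate is counted separately here, and multi-mapping reads are counted only at their primary location, so the numbers are a first approximation rather than a proper taxonomic assignment.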

0

Thanks, I might try your suggestion of running STAR with just FASTA files. I didn't realise that was possible. I will look into it now.

Thanks.

1

Do keep us updated, I'm seriously interested to see how this turns out.

(perhaps also keep track of memory and storage usage etc., thx)

0

I sure will!

I have yet to decide exactly how to proceed. The comments on this post have added valuable perspectives to my thoughts and plans.

For now, I will not be carrying out mass-scale STAR alignments, partly because I do not yet have access to a server with enough RAM and disk storage (I thought I would have by now) but also because as you mentioned yesterday, it may be worthwhile to first use the taxonomy analysis data available on the SRA website to get a general direction of where things are going before carrying out resource-intense alignments myself.

(Finally I am managing to post this...)