Question: Download GTF/GFF annotation data for NT database, not by organism (for STAR alignment)
19 days ago, Malka wrote:

Hello,

I would like to run some alignments against the full NT database, using STAR. I am having difficulty finding the necessary annotation files for the NT database. I have found GTF/GFF files for specific organisms or species, but not for the full NT database. Where can I find that?

Thanks in advance.

rna-seq next-gen alignment • 148 views
Modified 17 days ago • written 19 days ago by Malka

As far as I know, that does not exist (and frankly I don't see the point of it either).

Why would you run STAR against the whole nt DB in the first place?

Written 19 days ago by lieven.sterck

Thanks for your fast reply!

I want to find out how much of my sequence aligns to each known organism and species. For example, in the Run Browser/Analysis tab in this link, there's a full Taxonomy Analysis for the particular read. I would like to do basically the same thing, and I thought running STAR alignment against the full NT database would achieve that. Am I wrong? How would you suggest going about this?

Modified 19 days ago • written 19 days ago by Malka

For starters, they use very different software from STAR.

Did you have a look at the "How taxonomy analysis is done?" section they provided? It seems that the tool is available from GitHub. Also pay attention to the disclaimer at the top of the page, especially the "take these results with a boatload of salt" ;)

Written 19 days ago by lieven.sterck

Yes, thank you.

Specifically because of the disclaimer at the top of the page, I want to use a different tool, which is not experimental...

Shouldn't the results be fairly similar regardless of the tool being used?

Written 19 days ago by Malka

I fear we're ending up in some circular reasoning here.

To me it seems NCBI came up with this 'experimental' approach because the non-experimental ones are likely not capable of doing it (see also genomax's comments).

I still wonder whether it's really necessary to try to align to the complete nt. You should have some idea where your read data comes from, no?

Written 19 days ago by lieven.sterck

I see.

I do know where my data comes from, but the taxonomy analysis at the above link seems to suggest that that would only be part of the story for any SRA run, with some reads likely aligning with other organisms, whose identity is unknown to me. I want to avoid running a separate alignment on each known organism, so I figured running it against the full NT at once would be a fast and accurate way to find out. I now understand that that is not really feasible...

I wonder what you think about using the data at the above link for SRA runs which I download from the site. They do say to use the results 'with a boatload of salt', but I assume the NCBI wouldn't publish results that they are not fairly sure are accurate. Would you rely on that available data for your research?

Thanks.

Modified 17 days ago • written 17 days ago by Malka

You need to specify what kind of data you are working with. Generally, a normal sample would contain only one genome (or a defined number, if that is true by design, e.g. you created a mix of N bacteria). If you have contamination from more than one or two other genomes (besides the one you are interested in), then you need to think seriously about whether to go any further with that data.

The only case where you would legitimately be working with a mixture of genomes is a metagenomic sample. In that case, the tools I already mentioned below would be useful for assigning reads to the closest taxonomy.

Modified 17 days ago • written 17 days ago by genomax

Did you click on "How taxonomy analysis is done" on the page you linked above? You should probably use that same method to do this analysis.

There are packages like kraken2 and centrifuge that can also be used for rapid read assignments to genus/species.

Modified 19 days ago • written 19 days ago by genomax

I did, thank you.

Specifically because they mention that the software is experimental and the results should be taken 'with a boatload of salt', I am interested in using a different method to arrive at similar results for specific SRA reads.

I will look into the packages you mention.

Thanks.

Written 17 days ago by Malka

I have found GTF/GFF files for specific organisms or species

As you note, annotation files are generally for specific genomes/organisms. So there is no way to get them for the NT database.

For the record, trying to do a STAR alignment against NT is a futile task. Even if you are able to find adequate hardware to do it, you would likely not get anything useful out of it.

What are you trying to achieve?

Written 19 days ago by genomax

Thanks for your fast reply!

Please see my comment above (in response to lieven.sterck's answer to my question) as to what I am trying to achieve. Would I not be able to do that by running STAR alignment against NT?

Do I understand correctly from your answer that there is no way to run STAR alignment against a full database?

I am fairly new to this, so I'd appreciate if you could please explain to me why it would be futile.

Thanks!

Modified 19 days ago • written 19 days ago by Malka

Do I understand correctly from your answer that there is no way to run STAR alignment against a full database?

Theoretically speaking, I would not say there is no way to do this. If you have access to the right hardware, and patience, it may be possible to do so. For reference, STAR needs about 30 GB of RAM for a human-genome-sized reference (3 GB), so you can extrapolate from there to get an idea of what NT may need. You may need to build indexes for pieces of NT and so on. It would become a significant undertaking at some point. See my comment above for some additional software options beyond the tool used by NCBI.
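To make that extrapolation concrete, here is a minimal Python sketch. It assumes RAM grows linearly with reference size at the human-genome ratio quoted above (about 30 GB of RAM per 3 GB of reference), which is only a rough guess and not something STAR guarantees. The function name and the example sizes are hypothetical placeholders; substitute the real uncompressed size of whatever FASTA you plan to index.

```python
# Rough linear extrapolation of STAR index RAM from reference size.
# Assumption: RAM scales linearly with reference size at the ratio
# quoted in the thread (about 30 GB RAM for a 3 GB reference).

HUMAN_REFERENCE_GB = 3.0   # figure quoted in the thread
HUMAN_STAR_RAM_GB = 30.0   # figure quoted in the thread


def estimate_star_ram_gb(reference_gb: float) -> float:
    """Return a ballpark RAM estimate (GB) for a reference of the given size."""
    if reference_gb < 0:
        raise ValueError("reference size must be non-negative")
    ram_per_reference_gb = HUMAN_STAR_RAM_GB / HUMAN_REFERENCE_GB
    return reference_gb * ram_per_reference_gb


if __name__ == "__main__":
    # Hypothetical reference sizes in GB, for illustration only.
    for size_gb in (3.0, 50.0, 200.0):
        print(f"{size_gb:>6.0f} GB reference -> ~{estimate_star_ram_gb(size_gb):.0f} GB RAM")
```

Even under this optimistic linear assumption, a reference tens of times larger than human lands well beyond what a typical server offers, which is why splitting the reference into pieces comes up.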

Modified 19 days ago • written 19 days ago by genomax

Thanks.

I am working on a server which should have enough RAM and memory, and considering working in the cloud too. That way, would it be possible to run STAR alignment against NT? How would I find the necessary annotation files, if yes?

Thanks for your advice regarding the other packages.

Written 19 days ago by Malka

Are you sure? The NT FASTA file seems to be 52 GB compressed. The compressed human genome sequence is < 1 GB.

With STAR you don't absolutely need to have GTF files. Having annotation improves alignment, but you should be able to align without it as well. You will need to figure out your own method to assign taxonomy to the "hits" you get, so keep that in mind.
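As an illustration of running STAR without annotation, the Python sketch below only builds the two command lines (index generation and alignment) and prints them; it does not run STAR. The point is that `--sjdbGTFfile` is simply left out. The helper names, paths, and thread count are hypothetical placeholders, and memory-related options that a very large reference would likely need are not covered here.

```python
import shlex


def star_index_cmd(genome_dir, fasta_files, threads=8):
    """Build a STAR genomeGenerate command with no annotation (no --sjdbGTFfile)."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", *fasta_files,
        "--runThreadN", str(threads),
    ]


def star_align_cmd(genome_dir, read_files, out_prefix, threads=8):
    """Build a STAR alignment command against an existing index."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", *read_files,   # one file (single-end) or two (paired-end)
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(shlex.join(star_index_cmd("star_index", ["reference_part1.fa"])))
    print(shlex.join(star_align_cmd("star_index", ["reads_1.fq", "reads_2.fq"], "out/sample_")))
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.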

Written 19 days ago by genomax

Thanks, I might try your suggestion of running STAR with just FASTA files. I didn't realise that was possible. I will look into it now.

Thanks.

Modified 17 days ago • written 17 days ago by Malka

Do keep us updated, I'm seriously interested to see how this turns out.

(Perhaps also keep track of memory and storage usage etc., thanks.)

Written 17 days ago by lieven.sterck