Question: (EDITED) Aligning complete viral RNAs
Asked 5 weeks ago by oddjobs:

It seems some sequences deposited in the NCBI Virus database may have mislabeled taxids. For example, when I search for sequences with taxid:694009 (SARS-CoV), the results also include SARS-CoV-2 (e.g., NC_045512), even though SARS-CoV-2 has its own taxid, 2697049. I wonder whether this is more widespread and not restricted to SARS-related viruses.

If it is widespread, is there a way to get around this problem? My project depends on downloading sequences for many viral species and aligning them to their respective reference sequences. If I cannot trust taxid-based downloads, I will need additional filtering of the data. One option I can think of is to apply a cut-off after alignment to remove noisy downloads. What would be a systematic way to determine this cut-off?

EDIT: Based on comments below, I realize the above is a rookie error, and the taxid I used is not for a single virus, but rather a collection of viruses related to SARS. The question below is still of interest to me. I have received the following suggestions: minimap2, LASTZ, and Nextstrain. Thanks!

Also, what is a good way to align long sequences? I am currently using the striped Smith-Waterman aligner provided in skbio (which is adapted from ...). However, the skbio version seems to have a sequence length cut-off of 16384. Is there another tool that can help me align longer sequences?


That's not a mislabeling; SARS-CoV-2 is under that taxid.

NCBI taxids are hierarchical, e.g. humans are under many different taxids: eukaryotes, animals, chordata, mammals, etc.
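To make the hierarchy concrete, here is a minimal sketch in Python using a toy child-to-parent map. Only the two taxids mentioned in this thread (694009 and 2697049) are real; the other IDs and the helper names `lineage`, `is_under`, and `leaf_taxids` are hypothetical placeholders, and a real workflow would load the parent map from the NCBI taxonomy data instead.

```python
# Toy child -> parent map. 694009 and 2697049 come from the thread;
# 900000 and 900001 are made-up placeholder IDs for illustration only.
PARENT = {
    2697049: 694009,  # SARS-CoV-2 sits under the SARS-related group
    900001: 694009,   # hypothetical sibling lineage (placeholder ID)
    694009: 900000,   # hypothetical higher-level node (placeholder ID)
}


def lineage(taxid, parent=PARENT):
    """Return [taxid, parent, grandparent, ...] up to the root of the toy tree."""
    path = [taxid]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def is_under(taxid, ancestor, parent=PARENT):
    """True if `ancestor` is `taxid` itself or appears anywhere above it."""
    return ancestor in lineage(taxid, parent)


def leaf_taxids(ancestor, parent=PARENT):
    """Taxids below `ancestor` that are not the parent of any other node."""
    internal = set(parent.values())
    return sorted(
        t for t in parent
        if t != ancestor and t not in internal and is_under(t, ancestor, parent)
    )
```

In this toy tree, a search by 694009 matches anything for which `is_under(taxid, 694009)` is true, which is why SARS-CoV-2 records show up; restricting a download to one virus means filtering on the specific lower-level taxid instead.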

modified 5 weeks ago • written 5 weeks ago by 5heikki

I see. Thanks for this information! So I will essentially need to get the taxid of the leaf nodes in this tree to ensure it is purely from one species.

written 5 weeks ago by oddjobs

I don't think this is a mistake. 694009 refers to the broad class of Severe acute respiratory syndrome-related coronaviruses, of which SARS-CoV-2 is a member. You can go to the NCBI taxonomy browser and search with this taxID to see that.

SARS-CoV-2 was recognized as a separate species sometime in February 2020 (if I recall right) so it was given an independent taxID, probably after that point in time.

minimap2 is well suited to aligning long reads. There are others, like LASTZ, which can do chromosome-scale alignments. You could also use the tools used by Nextstrain projects to do these alignments.

modified 5 weeks ago • written 5 weeks ago by genomax

Seems I made a rookie error! Thanks for the correction.

I will look into both LASTZ and Nextstrain.

I am not sure minimap2 is suitable, since it is designed for sequencing reads with high error rates (especially indel error rates) rather than finished assemblies. I believe that for it to work well here, I would need to change the parameters and increase the mismatch/deletion penalties. Please correct me if I am wrong.

written 5 weeks ago by oddjobs

Depends on what you are aligning to these genomes. You have not told us that. If you have long reads, then minimap2 may be a valid option. If you tell us what your source data is and what kind of sequences it contains, you can get more specific recommendations.

written 5 weeks ago by genomax

Right. I am trying to align a finished sequence for a virus to the virus' reference sequence. So most mismatches/indels are just from viral mutations. I am not looking to do a multiple sequence alignment at this point.
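For this finished-sequence-versus-reference case, a rough sketch of one possible workflow is shown below: build a minimap2 command that uses one of its assembly-to-reference presets rather than a read preset, then summarise each alignment as a BLAST-like identity computed from the `=`/`X` CIGAR string. The helper names `minimap2_command` and `identity_from_cigar` are hypothetical, the choice of the `asm20` preset is an assumption to check against the minimap2 manual for the divergence you expect, and any identity cut-off for discarding sequences would still have to be chosen from your own data.

```python
import re

# One CIGAR operation: a length followed by an operator character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def minimap2_command(reference_fasta, query_fasta, preset="asm20"):
    """Build (not run) a minimap2 call for assembly-to-reference alignment.

    -c      write the CIGAR string in the PAF output
    --eqx   use =/X instead of M so matches and mismatches are distinguishable
    -x      preset; asm5/asm10/asm20 target assemblies rather than noisy reads
    """
    return ["minimap2", "-c", "--eqx", "-x", preset, reference_fasta, query_fasta]


def identity_from_cigar(cigar):
    """BLAST-like identity: matches / (matches + mismatches + inserted + deleted bases)."""
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)
    if "M" in counts:
        # M hides the match/mismatch split, so identity cannot be computed from it.
        raise ValueError("CIGAR uses M; rerun with =/X operators (e.g. minimap2 --eqx)")
    matches = counts.get("=", 0)
    columns = matches + counts.get("X", 0) + counts.get("I", 0) + counts.get("D", 0)
    if columns == 0:
        raise ValueError("no aligned columns in CIGAR")
    return matches / columns
```

The command list can be passed to `subprocess.run` if minimap2 is installed, and the CIGAR string for each alignment is then read from the `cg:Z:` tag of the PAF output. Plotting the resulting identities per species is one way to see whether a natural cut-off exists.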

written 5 weeks ago by oddjobs