Some sequences deposited in the NCBI Virus database appear to be mislabeled with the wrong taxid. For example, when I search with taxid:694009 (SARS-CoV), the results also include SARS-CoV-2 sequences (e.g., NC_045512), even though SARS-CoV-2 has its own taxid, 2697049. I wonder whether this is more widespread and not restricted to SARS-related viruses.
If it is widespread, is there a way to get around this problem? My project depends on downloading sequences for many viral species and aligning them to their respective reference sequences. If I cannot trust taxid-based downloads, I will need to filter the data further. One idea is to apply a cut-off after alignment to remove noisy downloads. What would be a systematic way to determine this cut-off?
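To make the cut-off idea concrete, here is a rough sketch of one option I can think of: compute each downloaded sequence's identity to its reference, then drop anything that falls well below the bulk of the distribution, using the median and the median absolute deviation (MAD). The function names, the `identity_by_id` input, and the default `k=3.0` are my own assumptions for illustration, not an established standard.

```python
from statistics import median


def mad_cutoff(identities, k=3.0):
    """Return a lower identity cut-off: median - k * scaled MAD."""
    values = list(identities)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    # 1.4826 rescales the MAD so it is comparable to a standard
    # deviation when the values are roughly normally distributed.
    return med - k * 1.4826 * mad


def filter_by_identity(identity_by_id, k=3.0):
    """Keep only sequences whose identity is at or above the cut-off.

    identity_by_id: dict mapping a sequence ID to its fractional
    identity (0..1) against the reference (hypothetical input).
    """
    cutoff = mad_cutoff(identity_by_id.values(), k)
    kept = {sid: v for sid, v in identity_by_id.items() if v >= cutoff}
    return cutoff, kept


if __name__ == "__main__":
    # Made-up identities: four close relatives and one distant sequence.
    example = {"a": 0.99, "b": 0.98, "c": 0.985, "d": 0.99, "e": 0.79}
    cutoff, kept = filter_by_identity(example)
    print(f"cut-off = {cutoff:.4f}, kept = {sorted(kept)}")
```

One caveat with this approach: if most identities are exactly equal, the MAD is zero and the cut-off collapses to the median, so a minimum tolerance would probably be needed in practice.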
EDIT: Based on comments below, I realize the above is a rookie error, and the taxid I used is not for a single virus, but rather a collection of viruses related to SARS. The question below is still of interest to me. I have received the following suggestions: minimap2, LASTZ, and Nextstrain. Thanks!
Also, what is a good way to align long sequences? I am currently using the striped Smith-Waterman aligner in skbio (which is adapted from https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library). However, the skbio version seems to have a sequence length cut-off of 16384. Is there another tool that can align longer sequences?
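For the minimap2 suggestion, this is a minimal sketch of how I might turn its PAF output into per-sequence identity and coverage values. It assumes minimap2 was run separately (something like `minimap2 -c -x asm20 ref.fa queries.fa > out.paf`; the file names are placeholders) and relies only on the first 12 standard PAF columns. The `PafHit` type and `parse_paf_line` helper are hypothetical names of my own.

```python
from typing import NamedTuple


class PafHit(NamedTuple):
    query: str
    target: str
    identity: float        # matching bases / alignment block length
    query_coverage: float  # aligned fraction of the query


def parse_paf_line(line):
    """Parse one PAF record using its 12 mandatory columns."""
    f = line.rstrip("\n").split("\t")
    qlen, qstart, qend = int(f[1]), int(f[2]), int(f[3])
    matches, block_len = int(f[9]), int(f[10])
    return PafHit(
        query=f[0],
        target=f[5],
        identity=matches / block_len,
        query_coverage=(qend - qstart) / qlen,
    )


if __name__ == "__main__":
    # Made-up record: a 30,000-base query aligned end to end.
    record = "query1\t30000\t0\t30000\t+\tref1\t30000\t0\t30000\t29700\t30000\t60"
    hit = parse_paf_line(record)
    print(hit.query, hit.target, round(hit.identity, 4), hit.query_coverage)
```

As far as I understand, the match count in column 10 is only an estimate unless base-level alignment is requested (`-c` or `-a`), so identities derived this way should be treated with care.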