I'm trying to learn bioinformatics. As an exercise, I'd like to compare the RefSeq for SARS-CoV-2 to that of Influenza A (H1N1) using Biopython, to get a score for how similar or dissimilar the two viruses are. So something like:
from Bio import pairwise2
alignments = pairwise2.align.globalxx(sars_cov_2, influenza_a)
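For context on what that score would mean: globalxx counts 1 for each identical character and applies no mismatch or gap penalties, so its score equals the length of the longest common subsequence of the two inputs. Below is a minimal pure-Python sketch of that score. The name globalxx_score is made up for illustration and is not part of Biopython.

```python
def globalxx_score(a: str, b: str) -> int:
    """Score of a global alignment with match=1, mismatch=0, no gap penalty.

    This equals the longest-common-subsequence length. Hypothetical helper,
    not a Biopython function.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best alignment of the two prefixes by one match.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a character in one sequence or the other (free gap).
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(globalxx_score("ACCGT", "ACG"))  # 3
```

The table has len(a) × len(b) cells, so the cost grows quickly for genome-length inputs.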
If I go to the NCBI Virus database, I can find the RefSeq for SARS-CoV-2 (GCF_009858895.2): https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202%20(SARS-CoV-2),%20taxid:2697049
If I click on GCF_009858895.2 for details, a drawer slides open from the right of the screen and documents one Nucleotide Accession Segment: NC_045512.2
If I download the file and view the contents, I see one long segment.
I can also find multiple RefSeqs for Influenza A. I pick one, GCF_001343785.1: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&VirusLineage_ss=Influenza%20A%20virus,%20taxid:711320
If I click on GCF_001343785.1 for details, a drawer slides open from the right of the screen and documents eight Nucleotide Accession Segments: NC_026438.1 NC_026435.1 NC_026437.1 NC_026433.1 NC_026436.1 NC_026434.1 NC_026431.1 NC_026432.1
If I download the file and view the contents, I see eight short segments (not in numerical order, segment 4 followed by 7, etc).
The data structures for the two viruses are very different. I can pass the contents of the SARS-CoV-2 file (GCF_009858895.2) to pairwise2.align.globalxx without a problem. For Influenza A (GCF_001343785.1), there is no single sequence that I can pass to the function.
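The difference described above comes down to how many ">" header lines each FASTA file contains. Here is a minimal sketch, using only the standard library and made-up toy data (the headers and bases are placeholders, not the real records), that splits FASTA text into (header, sequence) pairs. In Biopython, Bio.SeqIO.parse(handle, "fasta") plays the same role and yields one record per header.

```python
from typing import Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for each record in FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy data only: placeholder headers and bases, not real genome content.
single = ">virus_x complete genome\nACGTACGT\nACGT\n"
multi = ">virus_y segment 4\nAAAA\n>virus_y segment 7\nCC\nGG\n"

print(len(list(read_fasta(single))))  # 1
print([(h, len(s)) for h, s in read_fasta(multi)])
```

A single-record file gives one string that can be passed straight to an alignment function. A multi-record file gives a list of strings, and something has to decide how they are combined or compared.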
I've read the Wikipedia page on the FASTA file format, the documentation for the FASTA software, and various posts on this forum. I still don't understand how I can compare these files.
This leaves me with many questions, such as: If I read the NCBI documentation correctly, both RefSeqs are "complete". What does "complete" mean when the data is broken up into eight segments? Can I expect that the noncoding RNA is included? What transformations can I apply to Influenza A? Can I simply append the eight Influenza A segments together? If so, in what order: by segment number, in the order in which they appear in the file, or in some other order? Is there documentation somewhere that explains why SARS-CoV-2 is stored as one segment and Influenza A is broken up into eight? Is a globalxx comparison between these two files possible and, if so, how?
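To make one of those options concrete, appending the segments in segment-number order could be sketched as below. This assumes each header contains the text "segment N", which would need to be checked against the downloaded file. The helper names and toy records are made up for illustration. Whether a concatenated sequence is a meaningful input for a global alignment is a separate question that this sketch does not settle.

```python
import re
from typing import Iterable, Tuple

# Assumed header convention: "... segment N ..." somewhere in the header.
SEGMENT_RE = re.compile(r"segment\s+(\d+)", re.IGNORECASE)


def segment_number(header: str) -> int:
    """Extract N from a header containing 'segment N' (assumed format)."""
    match = SEGMENT_RE.search(header)
    if match is None:
        raise ValueError(f"no segment number in header: {header!r}")
    return int(match.group(1))


def concatenate_by_segment(records: Iterable[Tuple[str, str]]) -> str:
    """Join sequences in ascending segment-number order."""
    ordered = sorted(records, key=lambda rec: segment_number(rec[0]))
    return "".join(seq for _, seq in ordered)


# Toy records in file order (4, 7, 1); placeholders, not real sequences.
records = [
    ("toy_c segment 4", "GG"),
    ("toy_d segment 7", "TT"),
    ("toy_a segment 1", "AA"),
]
print(concatenate_by_segment(records))  # AAGGTT
```

Keeping the records separate and scoring each segment on its own is the other obvious option, and the same (header, sequence) pairs would support either.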