I have a large number of bacterial FASTA files from NCBI (GCF assemblies, named _genomic.fna.gz), and I am planning to extract the largest sequence from each file. I have noticed that some organisms have a genome split across several chromosomes, so in those cases extracting only the largest sequence is not enough, since I need all the chromosomes. Different files have different headers; the first header line of several different FASTA files is shown below:
>NZ_LS483491.1 Staphylococcus auricularis strain NCTC12101 genome assembly, chromosome: 1
>NZ_CP012214.1 Campylobacter jejuni strain CJ088CC52, complete genome
>NZ_CP016324.1 Vibrio cholerae 2740-80 chromosome 1, complete sequence
>NC_013791.2 Bacillus pseudofirmus OF4, complete genome
This last file has a complete genome; the others are complete sequences of some strains.
I am completely new to sequencing. Can anyone suggest a way to handle a large number of files with such differing content: keep all the chromosomes when the headers mention a chromosome, and otherwise extract only the largest sequence in the file?
If you wanted only the complete genomes, you should have used the solution here: How to download COMPLETE bacterial genomes from NCBI based on list of names?
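For files that are already downloaded, the selection rule described in the question could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes that a chromosome record can be recognised by the word "chromosome" in its header (as in the example headers above), and the function names parse_fasta, select_records and main are hypothetical. Headers that follow a different convention would need a different test.

```python
import gzip
import sys


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_records(records):
    """Keep every record whose header mentions 'chromosome' (assumption);
    if no header does, keep only the single longest record."""
    records = list(records)
    chromosomes = [r for r in records if "chromosome" in r[0].lower()]
    if chromosomes:
        return chromosomes
    if not records:
        return []
    return [max(records, key=lambda r: len(r[1]))]


def main(paths):
    """Read each gzipped FASTA file and write the kept records to stdout."""
    for path in paths:
        with gzip.open(path, "rt") as handle:
            kept = select_records(parse_fasta(handle))
        for header, seq in kept:
            # Each sequence is written on a single line (no wrapping).
            sys.stdout.write(f">{header}\n{seq}\n")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a hypothetical name such as select_largest.py, it could be run as `python select_largest.py *_genomic.fna.gz > selected.fna`. Note that in files whose headers mention a chromosome, records such as plasmids are dropped even if they are long, because only the header text is tested.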