I have a bacterial FASTA file that contains multiple strains, like the one below. I would like to go through the FASTA file and extract/separate the largest strain from it. Can anyone help me with how to do that?
>NZ_PKKG01000001.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.1, whole genome shotgun sequence
TCTAATTTCACGGCTGATAGTTGATTTATGACAGCCTACTTCTTGAGCGATAGCAGTGCAGGAGGTTATGCCAGAATCAA
GCAGAGCTTGAATTTGTCCACGTTGTTCGCTGTTCAATTGATGATAATGCTTGGAAATGCTAGAATTTGAGTTGGTCATG
AAGATCTTCCTTTCTTGATTTTTGGTCACTTCAAGTTTAGGTCTTCATGGCCTTTTTGTTTAACAATTAGTGTTGCACTT
>NZ_PKKG01000101.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.101, whole genome shotgun sequence
AAGGGCAAGAGCATGAAAGACAAGTCAAAAGCTTATGTTCAACAGGCAACTGATGCCATTAACCACAAATATCGCCGAAT
CCTCCAATATCACACAGCAGAGGAACTCTTCAAGCAATATATCTCTTCATAACCTAACTGTTGCACTTAATTTGACAATT
CAGGCAACTTGTAAATTGAACAAAAAATAGGGAAGATGAAACTAATCATTTCCCTATTTGTTTTTAAGAACAGTATTCAA
...
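You can use seqkit:

seqkit sort --quiet -2lr sequence.fa | seqkit head -n 1

The first command outputs the sequences sorted by length with the largest one on top (-l sorts by length, -r reverses the order, -2 uses two-pass mode to lower memory usage). The second command keeps only the first record, which is the largest sequence, provided no two sequences tie for the largest length. You will need to download seqkit first. If you are a programmer, you can also use the index of your FASTA file (the .fai file produced by samtools faidx) to get the largest sequence.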
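For the index-based route, here is a minimal Python sketch. It assumes the index is a standard tab-separated .fai file, where the first column is the sequence name and the second is its length. The function name longest_from_fai is made up for illustration and is not part of samtools or seqkit. If several sequences tie for the largest length, it returns the first one listed.

```python
def longest_from_fai(fai_lines):
    """Return (name, length) of the longest sequence listed in a .fai index.

    fai_lines is any iterable of lines, e.g. an open file handle.
    Assumes the usual .fai layout: name<TAB>length<TAB>offset<TAB>...
    """
    best_name, best_len = None, -1
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        # Strict '>' keeps the first record when lengths tie.
        if length > best_len:
            best_name, best_len = name, length
    return best_name, best_len


# Example use (file name is an assumption):
# with open("sequence.fa.fai") as handle:
#     name, length = longest_from_fai(handle)
#     print(name, length)
```

Once you have the name, samtools faidx sequence.fa NAME (with NAME replaced by the reported name) should print just that record.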