Question

Separate the Longest Strain in Bacteria

0

Entering edit mode

5.8 years ago

Shelle ▴ 30

I have a FASTA file of bacteria which has multiple strains like below. I would like to go through the FASTA file and extract/separate the largest strain from the FASTA file. Does anyone help me how i can do that?

>NZ_PKKG01000001.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.1, whole genome shotgun sequence
TCTAATTTCACGGCTGATAGTTGATTTATGACAGCCTACTTCTTGAGCGATAGCAGTGCAGGAGGTTATGCCAGAATCAA
GCAGAGCTTGAATTTGTCCACGTTGTTCGCTGTTCAATTGATGATAATGCTTGGAAATGCTAGAATTTGAGTTGGTCATG
AAGATCTTCCTTTCTTGATTTTTGGTCACTTCAAGTTTAGGTCTTCATGGCCTTTTTGTTTAACAATTAGTGTTGCACTT

>NZ_PKKG01000101.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.101, whole genome shotgun sequence
AAGGGCAAGAGCATGAAAGACAAGTCAAAAGCTTATGTTCAACAGGCAACTGATGCCATTAACCACAAATATCGCCGAAT
CCTCCAATATCACACAGCAGAGGAACTCTTCAAGCAATATATCTCTTCATAACCTAACTGTTGCACTTAATTTGACAATT
CAGGCAACTTGTAAATTGAACAAAAAATAGGGAAGATGAAACTAATCATTTCCCTATTTGTTTTTAAGAACAGTATTCAA
...

genome sequence • 1.4k views

ADD COMMENT • link updated 5.8 years ago by Dattatray Mongad ▴ 370 • written 5.8 years ago by Shelle ▴ 30

0

Entering edit mode

seqkit sort --quiet -2lr input.fa

Outputs the sequences sorted by length and largest one on the top. You can use seqkit sort --quiet -2lr sequence.fa |seqkit head -n 1 for the largest sequence (provided there are no two equal length largest sequences) . Download seqkit from here. If you are a programmer, you can also use index of your fasta file (.fai from samtools faidx) to getting largest sequence.

ADD REPLY • link 5.8 years ago by cpad0112 21k

score 1 · Answer 1 · 2018-07-25

1

Entering edit mode

5.8 years ago

Dattatray Mongad ▴ 370

Get the length of each sequence using the following command -

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' reference

ADD COMMENT • link 5.8 years ago by Dattatray Mongad ▴ 370

0

Entering edit mode

@doctor.dee005 thanks for your response. This command only lists the length of sequences. How can i actually extract the longest strain and put it in a separate fasta file?

ADD REPLY • link 5.8 years ago by Shelle ▴ 30

0

Entering edit mode

Try this

longest=$(seqkit sort --quiet -lr your.fasta.file.fa | seqkit fx2tab -nl | head -1  | cut -f1); seqkit grep -n -r -p "$longest" your.fasta.file.fa > longest.fa

Explanation

A. Fetching the header of the longest sequence

longest=$(seqkit sort --quiet -lr your.fasta.file.fa | seqkit fx2tab -nl | head -1  | cut -f1);

seqkit sort --quiet -lr your.fasta.file.f : sort by length

seqkit fx2tab -nl: take the lengths of sorted fasta file

head -1 | cut -f1 : first line would have the header and length of the longest sequence. head -1 is for taking first line and cut -f1 is to take just the header. This is saved in the variable called longest.

B. Fetching the sequence

 seqkit grep -n -r -p "$longest" your.fasta.file.fa > longest.fa

grep : fetch header stored in longest variable

-n: by full name

-r: using regex (regular expression)

-p: pattern

ADD REPLY • link 5.8 years ago by lakhujanivijay 5.8k

0

Entering edit mode

@Vijay Please update your variables("$largest" is not defined). One can further shorten :

> longest=$(seqkit sort --quiet -lr input.fa| seqkit seq -n |head -1); seqkit grep -np "$longest" input.fa

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Entering edit mode

@cpad : did that, longest is more appropriate. I liked the shortened version ;)

ADD REPLY • link 5.8 years ago by lakhujanivijay 5.8k

score 0 · Answer 2 · 2018-07-25

0

Entering edit mode

5.8 years ago

lakhujanivijay 5.8k

Hello shelle,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

This question has been asked many times. A solution using seqkit

seqkit fx2tab -nl your.fasta.file.fa | sort -nr -k2 | head -1

Thank you!

ADD COMMENT • link 5.8 years ago by lakhujanivijay 5.8k