Question: Separate the Longest Strain in Bacteria
0
gravatar for Shelle
9 months ago by
Shelle0
Shelle0 wrote:

I have a FASTA file of bacteria which has multiple strains like below. I would like to go through the FASTA file and extract/separate the largest strain from the FASTA file. Does anyone help me how i can do that?

>NZ_PKKG01000001.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.1, whole genome shotgun sequence
TCTAATTTCACGGCTGATAGTTGATTTATGACAGCCTACTTCTTGAGCGATAGCAGTGCAGGAGGTTATGCCAGAATCAA
GCAGAGCTTGAATTTGTCCACGTTGTTCGCTGTTCAATTGATGATAATGCTTGGAAATGCTAGAATTTGAGTTGGTCATG
AAGATCTTCCTTTCTTGATTTTTGGTCACTTCAAGTTTAGGTCTTCATGGCCTTTTTGTTTAACAATTAGTGTTGCACTT

>NZ_PKKG01000101.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.101, whole genome shotgun sequence
AAGGGCAAGAGCATGAAAGACAAGTCAAAAGCTTATGTTCAACAGGCAACTGATGCCATTAACCACAAATATCGCCGAAT
CCTCCAATATCACACAGCAGAGGAACTCTTCAAGCAATATATCTCTTCATAACCTAACTGTTGCACTTAATTTGACAATT
CAGGCAACTTGTAAATTGAACAAAAAATAGGGAAGATGAAACTAATCATTTCCCTATTTGTTTTTAAGAACAGTATTCAA
...
sequence genome • 357 views
ADD COMMENTlink modified 9 months ago by Dattatray Mongad280 • written 9 months ago by Shelle0
seqkit sort --quiet -2lr input.fa

Outputs the sequences sorted by length and largest one on the top. You can use seqkit sort --quiet -2lr sequence.fa |seqkit head -n 1 for the largest sequence (provided there are no two equal length largest sequences) . Download seqkit from here. If you are a programmer, you can also use index of your fasta file (.fai from samtools faidx) to getting largest sequence.

ADD REPLYlink modified 9 months ago • written 9 months ago by cpad011211k
1
gravatar for Dattatray Mongad
9 months ago by
National Centre for Cell Science, Pune
Dattatray Mongad280 wrote:

Get the length of each sequence using the following command -


cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' reference

ADD COMMENTlink written 9 months ago by Dattatray Mongad280

@doctor.dee005 thanks for your response. This command only lists the length of sequences. How can i actually extract the longest strain and put it in a separate fasta file?

ADD REPLYlink modified 9 months ago • written 9 months ago by Shelle0

Try this

longest=$(seqkit sort --quiet -lr your.fasta.file.fa | seqkit fx2tab -nl | head -1  | cut -f1); seqkit grep -n -r -p "$longest" your.fasta.file.fa > longest.fa

Explanation

A. Fetching the header of the longest sequence

longest=$(seqkit sort --quiet -lr your.fasta.file.fa | seqkit fx2tab -nl | head -1  | cut -f1);

seqkit sort --quiet -lr your.fasta.file.f : sort by length

seqkit fx2tab -nl: take the lengths of sorted fasta file

head -1 | cut -f1 : first line would have the header and length of the longest sequence. head -1 is for taking first line and cut -f1 is to take just the header. This is saved in the variable called longest.

B. Fetching the sequence

 seqkit grep -n -r -p "$longest" your.fasta.file.fa > longest.fa

grep : fetch header stored in longest variable

-n: by full name

-r: using regex (regular expression)

-p: pattern

ADD REPLYlink modified 9 months ago • written 9 months ago by Vijay Lakhujani4.1k

@Vijay Please update your variables("$largest" is not defined). One can further shorten :

> longest=$(seqkit sort --quiet -lr input.fa| seqkit seq -n |head -1); seqkit grep -np "$longest" input.fa
ADD REPLYlink modified 9 months ago • written 9 months ago by cpad011211k

@cpad : did that, longest is more appropriate. I liked the shortened version ;)

ADD REPLYlink written 9 months ago by Vijay Lakhujani4.1k
0
gravatar for Vijay Lakhujani
9 months ago by
Vijay Lakhujani4.1k
India
Vijay Lakhujani4.1k wrote:

Hello shelle,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

This question has been asked many times. A solution using seqkit

seqkit fx2tab -nl your.fasta.file.fa | sort -nr -k2 | head -1

Thank you!

ADD COMMENTlink modified 9 months ago • written 9 months ago by Vijay Lakhujani4.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1591 users visited in the last hour