Question: Separate the Longest Strain in Bacteria
0
gravatar for Shelle
20 months ago by
Shelle0
Shelle0 wrote:

I have a FASTA file of bacteria which has multiple strains like below. I would like to go through the FASTA file and extract/separate the largest strain from the FASTA file. Does anyone help me how i can do that?

>NZ_PKKG01000001.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.1, whole genome shotgun sequence
TCTAATTTCACGGCTGATAGTTGATTTATGACAGCCTACTTCTTGAGCGATAGCAGTGCAGGAGGTTATGCCAGAATCAA
GCAGAGCTTGAATTTGTCCACGTTGTTCGCTGTTCAATTGATGATAATGCTTGGAAATGCTAGAATTTGAGTTGGTCATG
AAGATCTTCCTTTCTTGATTTTTGGTCACTTCAAGTTTAGGTCTTCATGGCCTTTTTGTTTAACAATTAGTGTTGCACTT

>NZ_PKKG01000101.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.101, whole genome shotgun sequence
AAGGGCAAGAGCATGAAAGACAAGTCAAAAGCTTATGTTCAACAGGCAACTGATGCCATTAACCACAAATATCGCCGAAT
CCTCCAATATCACACAGCAGAGGAACTCTTCAAGCAATATATCTCTTCATAACCTAACTGTTGCACTTAATTTGACAATT
CAGGCAACTTGTAAATTGAACAAAAAATAGGGAAGATGAAACTAATCATTTCCCTATTTGTTTTTAAGAACAGTATTCAA
...
sequence genome • 532 views
ADD COMMENTlink modified 20 months ago by Dattatray Mongad350 • written 20 months ago by Shelle0
seqkit sort --quiet -2lr input.fa

Outputs the sequences sorted by length and largest one on the top. You can use seqkit sort --quiet -2lr sequence.fa |seqkit head -n 1 for the largest sequence (provided there are no two equal length largest sequences) . Download seqkit from here. If you are a programmer, you can also use index of your fasta file (.fai from samtools faidx) to getting largest sequence.

ADD REPLYlink modified 20 months ago • written 20 months ago by cpad011212k
1
gravatar for Dattatray Mongad
20 months ago by
National Centre for Cell Science, Pune
Dattatray Mongad350 wrote:

Get the length of each sequence using the following command -


cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' reference

ADD COMMENTlink written 20 months ago by Dattatray Mongad350

@doctor.dee005 thanks for your response. This command only lists the length of sequences. How can i actually extract the longest strain and put it in a separate fasta file?

ADD REPLYlink modified 20 months ago • written 20 months ago by Shelle0

Try this

longest=$(seqkit sort --quiet -lr your.fasta.file.fa | seqkit fx2tab -nl | head -1  | cut -f1); seqkit grep -n -r -p "$longest" your.fasta.file.fa > longest.fa

Explanation

A. Fetching the header of the longest sequence

longest=$(seqkit sort --quiet -lr your.fasta.file.fa | seqkit fx2tab -nl | head -1  | cut -f1);

seqkit sort --quiet -lr your.fasta.file.f : sort by length

seqkit fx2tab -nl: take the lengths of sorted fasta file

head -1 | cut -f1 : first line would have the header and length of the longest sequence. head -1 is for taking first line and cut -f1 is to take just the header. This is saved in the variable called longest.

B. Fetching the sequence

 seqkit grep -n -r -p "$longest" your.fasta.file.fa > longest.fa

grep : fetch header stored in longest variable

-n: by full name

-r: using regex (regular expression)

-p: pattern

ADD REPLYlink modified 20 months ago • written 20 months ago by lakhujanivijay4.8k

@Vijay Please update your variables("$largest" is not defined). One can further shorten :

> longest=$(seqkit sort --quiet -lr input.fa| seqkit seq -n |head -1); seqkit grep -np "$longest" input.fa
ADD REPLYlink modified 20 months ago • written 20 months ago by cpad011212k

@cpad : did that, longest is more appropriate. I liked the shortened version ;)

ADD REPLYlink written 20 months ago by lakhujanivijay4.8k
0
gravatar for lakhujanivijay
20 months ago by
lakhujanivijay4.8k
India
lakhujanivijay4.8k wrote:

Hello shelle,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

This question has been asked many times. A solution using seqkit

seqkit fx2tab -nl your.fasta.file.fa | sort -nr -k2 | head -1

Thank you!

ADD COMMENTlink modified 20 months ago • written 20 months ago by lakhujanivijay4.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1954 users visited in the last hour