Question: Add a string to protein sequences of different lengths in a FASTA file
13 months ago, Jason wrote:

How can I do this using Python or a shell command?

Suppose I have 200 protein sequences of different lengths in a FASTA file named 'proto.fasta'.

I want to find the maximum sequence length, and then append a padding character (Z) to the end of every shorter sequence so that all sequences end up with that maximum length. Some sequences need only one extra character to reach the maximum length, while others need more.

After that, I want to save the result to a text or FASTA file.

This is only an example, but I want to do the same thing on my own sequence file:

>sp|P08112|MASY_ECOLI Malate synthase B OS=Escherichia coli (strain K12) GN=aceB PE=1 SV=1
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRNKLLAARIQQQQDID
NGTLPDFISETASIRDADWKIRGIPADLEDRRVEITGPVERKMVINA
LNANVKVFMADFED
>gi|84383531|gb|AER967412.1| C OS=Escherichia coli (strain K14)
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRN
KLLAARIQQQQDIDNGTLPD
>np|M04142|MASY_ECOLI tra synthase D OS=Escherichia coli (strain K16) GN=aceB 
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTP
>np|S08112|MASY_ECOLI kw synthase S OS=Escherichia coli (strain K16) GN=aceB 
MTEQA

The result should look like this: since the first sequence is the longest, Z is appended to all the other sequences until they are as long as the first one:

>sp|P08112|MASY_ECOLI Malate synthase B OS=Escherichia coli (strain K12) GN=aceB PE=1 SV=1
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRNKLLAARIQQQQDIDNGTLPDFISETASIRDADWKIRGIPADLEDRRVEITGPVERKMVINALNANVKVFMADFED
>gi|84383531|gb|AER967412.1| C OS=Escherichia coli (strain K14)
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTP
QRNKLLAARIQQQQDIDNGTLPDZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

>np|M04142|MASY_ECOLI tra synthase D OS=Escherichia coli (strain K16) GN=aceB 
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

>np|S08112|MASY_ECOLI kw synthase S OS=Escherichia coli (strain K16) GN=aceB 
MTEQAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

Can I ask why? And why the Z character specifically?

Comment written 13 months ago by Joe

And a "protein" of length 5 looks very dubious to me.

Comment written 13 months ago by lieven.sterck
13 months ago, sacha (France) wrote:

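from pyfaidx import Fasta, Faidx

# Create the faidx index of the FASTA file
# (pyfaidx expects all lines of a record, except the last, to have the same length)
faidx = Faidx('test.fa')

# Get the maximum sequence length from the index
# (rlen is the sequence length; lenc is only the number of characters per line)
max_seq_name = max(faidx.index.keys(), key=(lambda k: faidx.index[k].rlen))
max_len      = faidx.index[max_seq_name].rlen


# Loop over the records and pad each sequence with "Z" * (max_len - len(seq))
fa = Fasta('test.fa')

for f in fa.keys():
    # long_name is the full header line without the leading '>'
    print('>' + fa[f].long_name)
    seq = list(str(fa[f][:]).rstrip())
    seq.extend('Z' * (max_len - len(seq)))
    print("".join(seq))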
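
For files that pyfaidx cannot index (for example when the sequence lines of a record have uneven lengths, as in the example in the question), the same padding can be sketched in plain Python without any third-party package. This is only a minimal sketch: the helper names read_fasta, pad_records, write_fasta and pad_fasta_file are made up for illustration, and the 60-character output line width is an assumption.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pad_records(records, pad_char="Z"):
    """Right-pad every sequence with pad_char up to the longest sequence."""
    max_len = max((len(seq) for _, seq in records), default=0)
    return [(header, seq.ljust(max_len, pad_char)) for header, seq in records]


def write_fasta(records, handle, width=60):
    """Write records to an open file handle, wrapping sequences at `width`."""
    for header, seq in records:
        handle.write(header + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def pad_fasta_file(in_path, out_path, pad_char="Z"):
    """Read in_path, pad all sequences to equal length, write to out_path."""
    with open(in_path) as src:
        records = read_fasta(src)
    with open(out_path, "w") as dst:
        write_fasta(pad_records(records, pad_char), dst)
```

Calling pad_fasta_file('proto.fasta', 'proto_padded.fasta') would write the padded records to a new file; the output file name is just a placeholder.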
13 months ago, benformatics (ETH Zurich) wrote:

An R implementation for you, good sir:

library(Biostrings)

## import your amino acid fasta file
fa <- readAAStringSet('your_fasta.fa')

## find and save the "width" of the entry with the longest length
max.aa.width.fa <- max(width(fa))

## find the difference between the max and each individual entry
width.diff <- max.aa.width.fa-width(fa)

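## paste/cat N 'Z's onto each entry, where N is the difference between that entry's length and the max
fa.extended <- xscat(fa,strrep('Z',width.diff))

## xscat() may not carry over the entry names, so copy them across explicitly
names(fa.extended) <- names(fa)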

## export your new subset as fasta (btw default for this function is fasta)
writeXStringSet(fa.extended,'your_fasta_extended.fa')
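
Whichever approach is used, it may be worth checking that every record in the output really has the same length. A small Python sketch of such a check follows; the function names sequence_lengths and all_same_length are hypothetical, and the sketch assumes that header lines are unique (duplicate headers would overwrite each other in the dictionary).

```python
def sequence_lengths(lines):
    """Return {header: sequence length} for an iterable of FASTA text lines."""
    lengths = {}
    header = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif line and header is not None:
            lengths[header] += len(line)  # add up multi-line sequences
    return lengths


def all_same_length(lines):
    """True if every record in the FASTA text has the same sequence length."""
    return len(set(sequence_lengths(lines).values())) <= 1
```

For example, opening 'your_fasta_extended.fa' and passing the file handle to all_same_length should return True if the padding worked.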