Question: Add string to list of protein sequences in FASTA file with different lengths
Jason wrote (2.2 years ago):

How can I do this using Python or a shell command?

Suppose I have 200 protein sequences of different lengths in a FASTA file named 'proto.fasta'.

I want to find the maximum sequence length, then append the character Z to the end of every shorter sequence so that all sequences end up with that maximum length. Some sequences need only one extra character, while others need more.

After that, I want to save the result to a text file or FASTA file.

This is only an example, but I want to do the same thing on my own sequence file:

>sp|P08112|MASY_ECOLI Malate synthase B OS=Escherichia coli (strain K12) GN=aceB PE=1 SV=1
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRNKLLAARIQQQQDID
NGTLPDFISETASIRDADWKIRGIPADLEDRRVEITGPVERKMVINA
LNANVKVFMADFED
>gi|84383531|gb|AER967412.1| C OS=Escherichia coli (strain K14)
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRN
KLLAARIQQQQDIDNGTLPD
>np|M04142|MASY_ECOLI tra synthase D OS=Escherichia coli (strain K16) GN=aceB 
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTP
>np|S08112|MASY_ECOLI kw synthase S OS=Escherichia coli (strain K16) GN=aceB 
MTEQA

The result should look like this. Since the first sequence is the longest, Z is appended to all the other sequences until they match its length:

>sp|P08112|MASY_ECOLI Malate synthase B OS=Escherichia coli (strain K12) GN=aceB PE=1 SV=1
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRNKLLAARIQQQQDIDNGTLPDFISETASIRDADWKIRGIPADLEDRRVEITGPVERKMVINALNANVKVFMADFED
>gi|84383531|gb|AER967412.1| C OS=Escherichia coli (strain K14)
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTP
QRNKLLAARIQQQQDIDNGTLPDZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

>np|M04142|MASY_ECOLI tra synthase D OS=Escherichia coli (strain K16) GN=aceB 
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

>np|S08112|MASY_ECOLI kw synthase S OS=Escherichia coli (strain K16) GN=aceB 
MTEQAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

Can I ask why? And why the Z character specifically?
(written 2.2 years ago by Joe18k)

And a "protein" of length 5 looks very dubious to me.
(written 2.2 years ago by lieven.sterck8.9k)

sacha (France) wrote (2.2 years ago):

from pyfaidx import Fasta

# Open the FASTA file; pyfaidx builds a .fai index next to it if none exists
fa = Fasta('test.fa')

# Get the longest sequence length from the index.
# 'rlen' is the sequence length; 'lenc' is only the number of characters per line.
max_len = max(record.rlen for record in fa.faidx.index.values())

# Loop over the records and right-pad each sequence with 'Z' up to max_len
for name in fa.keys():
    # long_name is the full header line without the leading '>'
    print('>' + fa[name].long_name)
    seq = str(fa[name][:])
    print(seq.ljust(max_len, 'Z'))
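
For comparison, here is a minimal sketch of the same idea in plain Python with no third-party packages, which also saves the result to a file as the question asks. It does not use an index, so it accepts records whose lines are wrapped at uneven widths, as in the example in the question. The function names (read_fasta, pad_records, write_fasta, pad_fasta) and the default wrap width of 60 are illustrative assumptions, not part of any library.

```python
def read_fasta(path):
    """Return a list of (header, sequence) pairs; wrapped lines are joined."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pad_records(records, pad_char="Z"):
    """Right-pad every sequence with pad_char (one character) to the longest length."""
    if not records:
        return []
    max_len = max(len(seq) for _, seq in records)
    return [(header, seq.ljust(max_len, pad_char)) for header, seq in records]


def write_fasta(records, path, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")


def pad_fasta(in_path, out_path, pad_char="Z", width=60):
    """Read in_path, pad all sequences to equal length, and write out_path."""
    write_fasta(pad_records(read_fasta(in_path), pad_char), out_path, width)
```

With the file from the question, a call such as `pad_fasta("proto.fasta", "proto_padded.fasta")` would write the padded sequences to a new file (the output file name here is just a placeholder).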

benformatics (ETH Zurich) wrote (2.2 years ago):

An R implementation for you, good sir:

library(Biostrings)

## import your amino acid fasta file
fa <- readAAStringSet('your_fasta.fa')

## find and save the "width" of the entry with the longest length
max.aa.width.fa <- max(width(fa))

## find the difference between the max and each individual entry
width.diff <- max.aa.width.fa-width(fa)

## append N 'Z's to each entry, where N is the difference between that entry's length and the max
fa.extended <- xscat(fa,strrep('Z',width.diff))

## copy the entry names over explicitly, in case xscat does not carry them to the result
names(fa.extended) <- names(fa)

## export your new set as fasta (btw default for this function is fasta)
writeXStringSet(fa.extended,'your_fasta_extended.fa')