Which tool can remove SignalP-predicted signal peptides from a FASTA file?
2.5 years ago
elabb@fau • 0

L.S.,

I have a list of proteins from either the UniProtKB or PlasmoDB databases that have a SignalP annotation. These proteins are thus predicted to have a signal peptide, of varying length, for secretion. I can manually remove the sequence corresponding to the predicted signal peptide, but that takes a lot of time :(

I was wondering if it's possible to perform these kinds of operations automatically, perhaps using some kind of online tool. Or do I need to write a script of some sort to perform the operation?

Kind regards, Arman

sequence • 1.1k views

If you can get the ranges for each protein (without the signal peptide) in the form of a BED file then you can use bedtools getfasta (https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) to do this.

For the initial table of ranges, you can download the UniProt data in GFF format and parse that table. Can you provide some examples?
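To illustrate the parsing step, here is a minimal sketch on a made-up GFF fragment. The accessions P00001 to P00003 and the file names toy.gff and toy_sp.bed are hypothetical, and the layout is assumed to match UniProt's GFF export (a `##sequence-region` line per protein, feature type in column 3, end position in column 5). Because BED start coordinates are 0-based and end coordinates are exclusive, the 1-based end of the signal peptide can be used directly as the BED start of the mature protein.

```shell
# Toy GFF fragment (hypothetical data); P00002 has no signal peptide annotation.
{
  printf '##gff-version 3\n'
  printf '##sequence-region P00001 1 50\n'
  printf 'P00001\tUniProtKB\tSignal peptide\t1\t20\t.\t.\t.\n'
  printf 'P00001\tUniProtKB\tChain\t21\t50\t.\t.\t.\n'
  printf '##sequence-region P00002 1 40\n'
  printf 'P00002\tUniProtKB\tChain\t1\t40\t.\t.\t.\n'
  printf '##sequence-region P00003 1 30\n'
  printf 'P00003\tUniProtKB\tSignal peptide\t1\t15\t.\t.\t.\n'
} > toy.gff

# Remember each protein's length from its ##sequence-region line, then emit
# one BED row (accession, 0-based start, end) per signal peptide feature.
awk 'BEGIN { FS = "\t"; OFS = "\t" }
     /^##sequence-region/ { split($0, f, " "); seqlen[f[2]] = f[4]; next }
     $3 == "Signal peptide" { print $1, $5, seqlen[$1] }' toy.gff > toy_sp.bed

cat toy_sp.bed
```

Proteins without a signal peptide feature (P00002 here) simply produce no row, so they would need to be handled separately if they should appear in the final FASTA file.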


If you're able to put together a script, that would be the most convenient option, I assume.

2.5 years ago
vkkodali ★ 2.6k

1. Download all 359 proteins in GFF format (uniprot.gff file)
2. Download all 359 proteins in FASTA format (uniprot.fasta file)

Then, I processed the two files as follows:

1. Process uniprot.gff file to create a BED-like file that has three columns: uniprot accession, protein start position, protein end position.
2. Process uniprot.fasta file so that each header contains only the UniProt accession
3. Use bedtools getfasta to fetch the mature protein sequences

You can use the following code:

## step 1 - processing uniprot GFF file
## note: BED start coordinates are 0-based, so the 1-based end of the signal
## peptide ($5) is used directly as the start of the mature protein
cat uniprot.gff \
| grep -E '^##sequence-region|Signal peptide' \
| perl -pe 's/##sequence-region ([^ ]*) (\d+) (\d+)/\1\t\2\t\3/g' \
| awk 'BEGIN{FS="\t";OFS="\t"}{if (NF==3) {p=$1; e=$3} else {s=$5; print p,s,e}}' \
> uniprot.bed

## step 2 - processing the uniprot.fasta file. Note, this overwrites the existing file
sed -ri 's/>[a-z]*\|([^\|]*).*$/>\1/g' uniprot.fasta

## step 3 - generate new fasta file with just the mature peptide sequences
bedtools getfasta -fi uniprot.fasta -bed uniprot.bed | fold -w 60 > uniprot.mat_pep.fasta
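If bedtools is not available, the extraction step can be approximated with awk alone. The following is only a sketch under a few assumptions: one BED row per accession, FASTA headers that already contain just the accession, and made-up toy data (P00001, P00002 and the toy file names are hypothetical). It mimics the BED convention of a 0-based start and an exclusive end.

```shell
# Toy inputs (hypothetical): P00001 has a 16-residue signal peptide, P00002 has no BED row.
printf '>P00001\nMKKLLVLALLAGSAQAADEK\nVLTR\n>P00002\nMSEQ\n' > toy.fasta
printf 'P00001\t16\t24\n' > toy.bed

awk 'BEGIN { FS = "\t" }
     # First file (BED): remember the interval for each accession.
     NR == FNR { start[$1] = $2; stop[$1] = $3; next }
     # Second file (FASTA): a header closes the previous record and opens a new one.
     /^>/ { emit(); id = substr($0, 2); seq = ""; next }
     # Sequence lines may be wrapped, so join them.
     { seq = seq $0 }
     END { emit() }
     # Print the slice [start, stop) of the current record if it has a BED row.
     function emit() {
         if (id in start)
             printf ">%s:%d-%d\n%s\n", id, start[id], stop[id],
                    substr(seq, start[id] + 1, stop[id] - start[id])
     }' toy.bed toy.fasta > toy.mat_pep.fasta

cat toy.mat_pep.fasta
```

Records without a BED row are skipped, which is the same behaviour as extracting only the listed intervals.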


Out of the 359 proteins, one (Q7KQM4) does not have a signal peptide annotation, so it is not included in the final output file uniprot.mat_pep.fasta.
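To find such cases yourself, you can compare the accessions in the FASTA headers with those in the BED file. This is a minimal sketch on made-up data (the demo file names and the accessions P00001 and P00002 are hypothetical); it assumes the FASTA headers contain only the accession and that the BED file is not empty.

```shell
# Toy inputs (hypothetical): only P00001 has a row in the BED file.
printf '>P00001\nMKKLLVLALLAGSAQAADEKVLTR\n>P00002\nMSEQ\n' > demo.fasta
printf 'P00001\t16\t24\n' > demo.bed

# Read the BED accessions first, then report every FASTA header not among them.
awk 'BEGIN { FS = "\t" }
     NR == FNR { has_sp[$1] = 1; next }
     /^>/ { id = substr($0, 2); if (!(id in has_sp)) print id }' demo.bed demo.fasta > demo.no_sp.txt

cat demo.no_sp.txt
```

With the real files, the same awk command would be run on uniprot.bed and uniprot.fasta after the headers have been shortened to the accession.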


Alright! I have to process this information in order for me to fully understand what you've done ;) Can you send me the file with the mature protein sequences?


If you run the commands shown above as-is, you should end up with the uniprot.mat_pep.fasta file. Are you having trouble running them? Here's the file: https://drive.google.com/open?id=1coo2uipv-zTK1F98xi09zfh6-Ahmmykt