Question

Which tool can remove SignalP predicted signal peptides from FASTA file?

5.4 years ago

elabb@fau • 0

L.S.,

I have a list of proteins from either the UniProtKB or PlasmoDB databases that have a SignalP annotation. These proteins are thus predicted to have a signal peptide, of varying length, for secretion. I can manually remove the sequence corresponding to the predicted signal peptide, but that takes a lot of time :(

I was wondering if it's possible to do these kinds of operations automatically, perhaps using some kind of online tool. Or do I need to write a script of some sort to perform the operation?

Kind regards, Arman

sequence • 2.7k views

If you can get the ranges for each protein (without the signal peptide) in the form of a BED file then you can use bedtools getfasta (https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) to do this.

For the initial table of ranges, you can download the UniProt data in GFF format and parse that table. Can you provide some examples?
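As a rough illustration of the table of ranges, here is a minimal sketch that turns UniProt-style GFF rows into BED-like intervals. The function name `signal_gff_to_bed` and the accession `P00001` are made up for the example, and the input layout (a `##sequence-region ACC 1 LEN` pragma plus tab-separated `Signal peptide` feature rows) is an assumption about the download. BED starts are 0-based and ends are exclusive, so the 1-based end of the signal peptide doubles as the BED start of the mature protein.

```shell
# Hypothetical helper: reads UniProt-style GFF on stdin, writes
# "accession <TAB> start <TAB> end" rows for the mature protein.
signal_gff_to_bed() {
    awk -F'\t' -v OFS='\t' '
        # Pragma line "##sequence-region ACC 1 LEN" gives the protein length.
        /^##sequence-region/ { split($0, f, " "); len[f[2]] = f[4]; next }
        # Feature row: $1 accession, $3 feature type, $5 last residue (1-based).
        # That 1-based end is also the 0-based BED start of the mature protein.
        $3 == "Signal peptide" && ($1 in len) { print $1, $5, len[$1] }
    '
}

# Made-up record: a 100-residue protein with a signal peptide at 1..20.
printf '##sequence-region P00001 1 100\nP00001\tUniProtKB\tSignal peptide\t1\t20\t.\t.\t.\tNote=demo\n' \
    | signal_gff_to_bed
# prints: P00001  20  100  (tab-separated)
```

Proteins without a `Signal peptide` row produce no interval, so they would drop out of any downstream `bedtools getfasta` call.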

Reply • 5.4 years ago by vkkodali_ncbi ★ 3.7k

If you're able to put together a script, that will probably be the most convenient option.

Reply • 5.4 years ago by lieven.sterck 15k

Here for example: https://www.uniprot.org/uniprot/?query=organism%3A%22Plasmodium%20falciparum%20(isolate%203D7)%20%5B36329%5D%22%20annotation%3A(type%3Asignal)&columns=id%2Centry%20name%2Creviewed%2Cprotein%20names%2Cgenes%2Corganism%2Clength%2Cfeature(SIGNAL)%2Cdatabase(EnsemblProtists)%2Cdatabase(EuPathDB)&sort=score

Reply • 5.4 years ago by elabb@fau • 0

Alright! I have to process this information in order for me to fully understand what you've done ;) Can you send me the file with the mature protein sequences?

Thank you for your time!

Reply • 5.4 years ago by elabb@fau • 0

score 4 · Accepted Answer · 2018-12-13

I followed your UniProt link and clicked on the 'Download' button to download two files:

Download all 359 proteins in GFF format (uniprot.gff file)
Download all 359 proteins in FASTA format (uniprot.fasta file)

Then, I processed the two files as follows:

1. Process the uniprot.gff file to create a BED-like file with three columns: UniProt accession, start of the mature protein, end of the protein.
2. Process the uniprot.fasta file so that each header contains only the UniProt accession.
3. Use bedtools getfasta to fetch the mature protein sequences.

You can use the following code:

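## step 1 - process the UniProt GFF file into a BED-like table (accession, start, end)
## BED starts are 0-based, so the 1-based end of the signal peptide ($5) is already
## the BED start of the mature protein; adding 1 would drop its first residue
grep -E '^##sequence-region|Signal peptide' uniprot.gff \
    | perl -pe 's/##sequence-region ([^ ]*) (\d+) (\d+)/$1\t$2\t$3/g' \
    | awk 'BEGIN{FS="\t";OFS="\t"}{if (NF==3) {p=$1; e=$3} else {s=$5; print p,s,e}}' \
    > uniprot.bed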

## step 2 - processing the uniprot.fasta file. Note, this overwrites the existing file
sed -ri 's/>[a-z]*\|([^\|]*).*$/>\1/g' uniprot.fasta

## step 3 - generate new fasta file with just the mature peptide sequences
bedtools getfasta -fi uniprot.fasta -bed uniprot.bed | fold -w 60 > uniprot.mat_pep.fasta

Out of the 359 proteins, one (Q7KQM4) did not have a signal peptide, so it is not included in the final output file uniprot.mat_pep.fasta.
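
If bedtools is not available, the trimming step can also be approximated in plain awk. This is only a sketch and not part of the pipeline above: the function name `trim_signal` and the demo files are hypothetical. It assumes the FASTA headers already contain just the accession (as after step 2) and that you have a two-column, tab-separated table of accession and the 1-based position of the last signal peptide residue.

```shell
# Hypothetical helper: trim_signal CUTS_TSV FASTA
#   CUTS_TSV: accession <TAB> last residue of the signal peptide (1-based)
#   FASTA:    headers must be ">ACCESSION" only
trim_signal() {
    awk -F'\t' '
        # Print the buffered record, minus its signal peptide, if it has one.
        function emit() {
            if (id != "" && (id in cut))
                print ">" id "\n" substr(seq, cut[id] + 1)
        }
        # First file: remember the signal peptide end for each accession.
        NR == FNR { cut[$1] = $2; next }
        # FASTA header: flush the previous record and start a new one.
        /^>/ { emit(); id = substr($0, 2); seq = ""; next }
        # Sequence line: append, so wrapped FASTA records are handled.
        { seq = seq $0 }
        END { emit() }
    ' "$1" "$2"
}

# Made-up demo data: P00001 has a 3-residue signal peptide, P00002 has none.
tmp=$(mktemp -d)
printf 'P00001\t3\n' > "$tmp/cuts.tsv"
printf '>P00001\nMKLAA\nGG\n>P00002\nMTTT\n' > "$tmp/demo.fasta"
trim_signal "$tmp/cuts.tsv" "$tmp/demo.fasta"
# prints ">P00001" followed by "AAGG"; P00002 is skipped
rm -r "$tmp"
```

The output is written one sequence per line; pipe it through fold -w 60 if you want wrapped lines as in step 3.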