Question: Which tool can remove SignalP predicted signal peptides from FASTA file?
5 weeks ago, elabb@fau wrote:

L.S.,

I have a list of proteins from either the UniProtKB or PlasmoDB databases that have a SignalP annotation. These proteins are thus predicted to have a signal peptide, of varying length, for secretion. I can manually remove the sequence corresponding to the predicted signal peptide, but that takes a lot of time :(

I was wondering if it's possible to do these kinds of operations automatically, perhaps using some kind of online tool. Or do I need to write a script of some sort to perform the operation?

Kind regards, Arman

sequence • 151 views
modified 4 weeks ago • written 5 weeks ago by elabb@fau

If you can get the ranges for each protein (without the signal peptide) in the form of a BED file then you can use bedtools getfasta (https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) to do this.

For the initial table of ranges, you can download the UniProt data in GFF format and parse that table. Can you provide some examples?
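
As a rough sketch of what parsing that table could look like, the snippet below builds a made-up GFF excerpt and turns its signal peptide line into a BED range. The accession P00001 and all coordinates are hypothetical; the layout (a '##sequence-region' header followed by tab-separated feature lines with the feature type in column 3 and 1-based start/end in columns 4 and 5) is an assumption about the UniProt GFF download.

```shell
# Toy example only: P00001 and its coordinates are made up for illustration.
workdir=$(mktemp -d)

# A '##sequence-region' header (accession, start, end) followed by one
# tab-separated 'Signal peptide' feature line spanning residues 1-22.
printf '##sequence-region P00001 1 350\n' > "$workdir/toy.gff"
printf 'P00001\tUniProtKB\tSignal peptide\t1\t22\t.\t.\t.\tNote=toy\n' >> "$workdir/toy.gff"

# Remember the protein length from the header, then emit one BED line per
# signal peptide. BED starts are 0-based and ends are exclusive, so the
# 1-based signal peptide end (22) doubles as the 0-based mature start.
awk 'BEGIN { FS = OFS = "\t" }
     /^##sequence-region/ { split($0, h, " "); len[h[2]] = h[4]; next }
     $3 == "Signal peptide" { print $1, $5, len[$1] }' \
    "$workdir/toy.gff" > "$workdir/toy.bed"

cat "$workdir/toy.bed"
```

The resulting three-column file (accession, start, end) is the kind of input that bedtools getfasta takes through its -bed option.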

written 5 weeks ago by vkkodali

If you're able to put together a script, that will be the most convenient option, I assume.
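
To give a rough idea of what such a script could look like, here is a minimal awk sketch that trims the first N residues from each record, where N is the last residue of the signal peptide taken from a two-column table. The file names, accessions, sequences and cleavage positions are all made up, and it assumes the FASTA headers contain only the accession.

```shell
# Toy inputs, made up for illustration.
workdir=$(mktemp -d)
printf '>P00001\nMKTAYIAK\nQRQISFVK\n>P00002\nMSDNQLLE\n' > "$workdir/toy.fasta"
# accession <TAB> last residue of the signal peptide (1-based)
printf 'P00001\t3\n' > "$workdir/sp_end.tsv"

awk -F'\t' '
    # First file: store the signal peptide end for each accession.
    NR == FNR { sp[$1] = $2; next }
    # Second file: print the previous record (if any) and start a new one.
    /^>/ { emit(); id = substr($0, 2); seq = ""; next }
    # Sequence lines: join wrapped lines into one string per record.
    { seq = seq $0 }
    END { emit() }
    # Print the mature sequence, skipping records without a signal peptide.
    function emit() {
        if (id != "" && (id in sp)) print ">" id "\n" substr(seq, sp[id] + 1)
    }
' "$workdir/sp_end.tsv" "$workdir/toy.fasta" > "$workdir/toy.mat_pep.fasta"

cat "$workdir/toy.mat_pep.fasta"
```

Records without an entry in the table (P00002 here) are left out of the output rather than passed through unchanged; whether that is the right behaviour depends on what you need downstream.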

written 5 weeks ago by lieven.sterck

5 weeks ago, vkkodali (United States) wrote:

I followed your UniProt link and clicked the 'Download' button to download two files:

  1. Download all 359 proteins in GFF format (uniprot.gff file)
  2. Download all 359 proteins in FASTA format (uniprot.fasta file)

Then, I processed the two files as follows:

  1. Process the uniprot.gff file to create a BED-like file with three columns: UniProt accession, mature protein start position, protein end position
  2. Process the uniprot.fasta file so that each header contains only the UniProt accession
  3. Use bedtools getfasta to fetch the mature protein sequences

You can use the following code:

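## step 1 - processing uniprot GFF file
## BED starts are 0-based and ends are exclusive, so the 1-based end of the
## signal peptide ($5) is already the 0-based start of the mature protein
grep -E '^##sequence-region|Signal peptide' uniprot.gff \
    | perl -pe 's/##sequence-region ([^ ]*) (\d+) (\d+)/$1\t$2\t$3/g' \
    | awk 'BEGIN{FS="\t";OFS="\t"}{if (NF==3) {p=$1; e=$3} else {s=$5; print p,s,e}}' \
    > uniprot.bed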

## step 2 - processing the uniprot.fasta file. Note, this overwrites the existing file
sed -ri 's/>[a-z]*\|([^\|]*).*$/>\1/g' uniprot.fasta

## step 3 - generate new fasta file with just the mature peptide sequences
bedtools getfasta -fi uniprot.fasta -bed uniprot.bed | fold -w 60 > uniprot.mat_pep.fasta

Of the 359 proteins, one (Q7KQM4) does not have a signal peptide, so it is not included in the final output file uniprot.mat_pep.fasta.
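
One way to double-check which entries were dropped is to compare the accessions in the FASTA headers with those in the BED file. Here is a minimal sketch on made-up toy files (the file names and the accessions P00001/P00002 are hypothetical), assuming the headers have already been reduced to the bare accession as in step 2:

```shell
# Toy inputs, made up for illustration.
workdir=$(mktemp -d)
printf '>P00001\nMKTAYIAK\n>P00002\nMSDNQLLE\n' > "$workdir/toy.fasta"
printf 'P00001\t3\t8\n' > "$workdir/toy.bed"

# Accessions present in the FASTA headers, sorted for comm.
grep '^>' "$workdir/toy.fasta" | sed 's/^>//' | sort -u > "$workdir/fasta.ids"
# Accessions that have a signal peptide range in the BED file.
cut -f1 "$workdir/toy.bed" | sort -u > "$workdir/bed.ids"

# Lines unique to the first list: proteins with no signal peptide range.
comm -23 "$workdir/fasta.ids" "$workdir/bed.ids" > "$workdir/no_signal.ids"
cat "$workdir/no_signal.ids"
```

On the real files, the same comparison between uniprot.fasta and uniprot.bed should list only the entries that lack a signal peptide annotation.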

modified 5 weeks ago • written 5 weeks ago by vkkodali

Alright! I have to process this information in order for me to fully understand what you've done ;) Can you send me the file with the mature protein sequences?

Thank you for your time!

written 4 weeks ago by elabb@fau

If you run the commands shown above as-is, you should end up with the uniprot.mat_pep.fasta file. Are you having trouble running them? Here's the file: https://drive.google.com/open?id=1coo2uipv-zTK1F98xi09zfh6-Ahmmykt

written 4 weeks ago by vkkodali