Question: Which tool can remove SignalP-predicted signal peptides from a FASTA file?
elabb@fau wrote:

L.S.,

I have a list of proteins from either the UniProtKB or PlasmoDB databases that have a SignalP annotation. These proteins are thus predicted to have a signal peptide, of varying length, for secretion. I can manually remove the sequence corresponding to the predicted signal peptide, but that takes a lot of time :(

I was wondering if it's possible to perform these kinds of operations automatically, perhaps using some kind of online tool. Or do I need to write a script of some sort to perform the operation?

Kind regards, Arman

modified 4 months ago • written 4 months ago by elabb@fau

If you can get the ranges for each protein (without the signal peptide) in the form of a BED file then you can use bedtools getfasta (https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) to do this.

For the initial table of ranges, you can download the UniProt data in GFF format and parse that table. Can you provide some examples?

written 4 months ago by vkkodali

If you're able to put together a script, that will be the most convenient option, I assume.

written 4 months ago by lieven.sterck

vkkodali wrote:

I followed your UniProt link and clicked on the 'Download' button to download two files:

  1. Download all 359 proteins in GFF format (uniprot.gff file)
  2. Download all 359 proteins in FASTA format (uniprot.fasta file)

Then, I processed the two files as follows:

  1. Process the uniprot.gff file to create a BED-like file with three columns: UniProt accession, start position of the mature protein, protein end position.
  2. Process the uniprot.fasta file so that each header contains only the UniProt accession
  3. Use bedtools getfasta to fetch the mature protein sequences

You can use the following code:

## step 1 - process the UniProt GFF file into a three-column BED file:
## accession, 0-based start of the mature protein, protein end
grep -E '^##sequence-region|Signal peptide' uniprot.gff \
    | perl -pe 's/##sequence-region ([^ ]*) (\d+) (\d+)/$1\t$2\t$3/g' \
    | awk 'BEGIN{FS="\t";OFS="\t"}{if (NF==3) {p=$1; e=$3} else {s=$5; print p,s,e}}' \
    > uniprot.bed
## BED starts are 0-based: the 1-based end of the signal peptide ($5)
## equals the 0-based start of the mature protein

## step 2 - reduce the headers in uniprot.fasta to the accession only.
## Note: this overwrites the existing file (GNU sed syntax)
sed -ri 's/>[a-z]*\|([^\|]*).*$/>\1/g' uniprot.fasta

## step 3 - generate a new FASTA file with just the mature peptide sequences
bedtools getfasta -fi uniprot.fasta -bed uniprot.bed | fold -w 60 > uniprot.mat_pep.fasta
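
If bedtools is not available, the same trimming can be sketched with awk alone. This is only a minimal sketch on toy data, not the approach used above: the file names demo.gff and demo.fasta, the accessions P00001 and P00002, and the sequences are all made up for illustration. It assumes the FASTA headers already contain just the accession (as after step 2) and that every signal peptide row has a numeric end position in column 5.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Toy GFF (hypothetical): one tab-separated signal peptide row, end in column 5
printf 'P00001\tUniProtKB\tSignal peptide\t1\t5\t.\t.\t.\tNote=demo\n' > demo.gff

# Toy FASTA with accession-only headers; P00002 has no signal peptide row
printf '>P00001\nMKLLVAAAGG\nTTSS\n>P00002\nMSTNPKPQRK\n' > demo.fasta

awk -F'\t' '
    # Print the current record without its signal peptide, if one is known
    function emit() {
        if (acc in sp_end) print ">" acc "\n" substr(seq, sp_end[acc] + 1)
    }
    # First file (GFF): remember the last residue of each signal peptide
    NR == FNR { if ($3 == "Signal peptide" && $5 ~ /^[0-9]+$/) sp_end[$1] = $5; next }
    # Second file (FASTA): a header closes the previous record and starts a new one
    /^>/ { emit(); acc = substr($0, 2); seq = ""; next }
    # Sequence lines are joined so that wrapped records are handled
    { seq = seq $0 }
    END { emit() }
' demo.gff demo.fasta > demo.mat_pep.fasta

cat demo.mat_pep.fasta
```

On the toy data this prints only the P00001 record, with its first five residues removed; P00002 is skipped because it has no signal peptide row. The output is written on a single line per sequence, so pipe it through fold -w 60 if wrapped lines are preferred.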

Out of the 359 proteins, one (Q7KQM4) did not have a signal peptide annotation, so it is not included in the final output file uniprot.mat_pep.fasta.

modified 4 months ago • written 4 months ago by vkkodali
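
To see which accessions drop out for lack of a signal peptide row, one option is to compare the accessions in the FASTA headers with the first column of the BED file. The following is a minimal sketch on toy stand-ins for the two files: the demo file names, the accessions A1, A2 and A3, and the coordinates are made up, and it assumes accession-only headers as produced by step 2.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Toy stand-ins (hypothetical) for uniprot.fasta after step 2 and uniprot.bed
printf '>A1\nMKT\n>A2\nMSS\n>A3\nMLL\n' > demo.fasta
printf 'A1\t20\t100\nA3\t18\t250\n' > demo.bed

# Accessions present in the FASTA headers, sorted for comm
grep '^>' demo.fasta | sed 's/^>//' | sort -u > fasta_ids.txt

# Accessions that have an interval in the BED file
cut -f1 demo.bed | sort -u > bed_ids.txt

# Keep lines unique to the first list: proteins without a signal peptide interval
comm -23 fasta_ids.txt bed_ids.txt > no_signal_peptide.txt
cat no_signal_peptide.txt
```

On the toy data this lists A2 only. With the real files, replace demo.fasta and demo.bed with uniprot.fasta and uniprot.bed.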

Alright! I have to process this information in order for me to fully understand what you've done ;) Can you send me the file with the mature protein sequences?

Thank you for your time!

written 4 months ago by elabb@fau

If you run the commands shown above as-is, you should end up with the uniprot.mat_pep.fasta file. Are you having trouble running them? Here's the file: https://drive.google.com/open?id=1coo2uipv-zTK1F98xi09zfh6-Ahmmykt

written 4 months ago by vkkodali