Question

Finding unique regions in a sequence

20 months ago

ramiro.barrantes • 0

I have a human sequence and would like to find unique regions within it that are at least 11 amino acids long. Any leads on how to address this?

My current thinking is to BLAST it against the human protein database and find the region(s) that have no similarity to any other sequence in the database, but I am wondering whether there are other, more principled ways.

Ultimately, the goal is to find a unique region so one can design an antibody that would be specific to the protein.

seq peptide sequence blast • 724 views


score 1 · Answer 1 · 2022-09-10

20 months ago

Istvan Albert 100k

Don't use alignments here. Those are not optimized for finding exact matches at the rate you need.

The simplest method, in my opinion, is to break the sequences into 11-residue pieces, one piece per line, then sort them and count the duplicates with uniq, like so:

cat pieces.txt | sort | uniq -c | sort -n | head
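
For small inputs, the same tally can be sketched in Python without any sorting. This is only a minimal illustration, assuming all pieces fit in memory; the function name `count_kmers` and the toy sequences are made up for the example, and it uses k=3 instead of 11 to keep the output short.

```python
from collections import Counter

def count_kmers(sequences, k=11):
    """Count every overlapping k-residue window across the sequences."""
    counts = Counter()
    for seq in sequences:
        # len(seq) - k + 1 windows, so the last window ends at the last residue
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy sequences (hypothetical), with k=3 to keep the output small.
toy = ["MKTAYIAK", "AYIAKQR"]
counts = count_kmers(toy, k=3)

# Rarest pieces first, analogous to `sort | uniq -c | sort -n | head`.
for kmer, n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:5]:
    print(n, kmer)
```

For a whole proteome the dictionary may get large, which is where the sort-based pipeline has the advantage of not needing everything in memory at once.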


score 1 · Answer 2 · 2022-09-10

As an example, save the following as chop.py:

import sys
from Bio import SeqIO

# KMER size
SIZE = 11

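# Read FASTA records from standard input
recs = SeqIO.parse(sys.stdin, format="fasta")
for rec in recs:
    # +1 so that the final window, ending at the last residue, is included
    for i in range(len(rec.seq) - SIZE + 1):
        seq = rec.seq[i:i+SIZE]
        print(seq)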

then you can:

# Get some protein data
wget -nc http://ftp.ensembl.org/pub/current_fasta/accipiter_nisus/pep/Accipiter_nisus.Accipiter_nisus_ver1.0.pep.all.fa.gz

# Chop it into pieces
gunzip -c Accipiter_nisus.Accipiter_nisus_ver1.0.pep.all.fa.gz  | python chop.py > pieces.txt

# How many times did each piece occur
cat pieces.txt | sort | uniq -c | sort -rn > count.txt

looking at the end of the file:

tail count.txt

it prints:

  1 AAAAAAAAAGH
  1 AAAAAAAAAGD
  1 AAAAAAAAAFH
  1 AAAAAAAAAEK
  1 AAAAAAAAADL
  1 AAAAAAAAACR
  1 AAAAAAAAAAX
  1 AAAAAAAAAAN
  1 AAAAAAAAAAM
  1 AAAAAAAAAAD
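
To connect the counts back to the original question, here is a rough sketch of how one might look up the windows of a protein of interest in a count file like the one above. It assumes the target protein was itself part of the counted database (so a unique window has a count of exactly 1) and that the counts fit in memory. The names `load_counts` and `unique_regions` and the toy data are hypothetical, and k=3 is used only to keep the example short.

```python
def load_counts(lines):
    """Parse `uniq -c` style lines such as '   1 AAAAAAAAAGH' into a dict."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            counts[parts[1]] = int(parts[0])
    return counts

def unique_regions(target, counts, k=11):
    """Merge consecutive windows that occur exactly once into spans.

    Spans are (start, end) pairs, 0-based and end-exclusive.
    """
    regions = []
    for i in range(len(target) - k + 1):
        if counts.get(target[i:i + k], 0) != 1:
            continue
        # The previous unique window started at regions[-1][1] - k
        if regions and i == regions[-1][1] - k + 1:
            regions[-1] = (regions[-1][0], i + k)
        else:
            regions.append((i, i + k))
    return regions

# Toy count lines (hypothetical) in the same layout as count.txt
toy_lines = ["   2 AYI", "   2 YIA", "   2 IAK",
             "   1 MKT", "   1 KTA", "   1 TAY"]
toy_counts = load_counts(toy_lines)

target = "MKTAYIAK"
regions = unique_regions(target, toy_counts, k=3)
for start, end in regions:
    print(start, end, target[start:end])
```

Note that exact-match uniqueness says nothing about near matches (one or two substitutions), so a candidate region may still deserve a closer look before designing an antibody against it.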

The bottleneck is the sorting step, whose cost grows with the size of the proteome.

Instead of sorting everything as one file, split pieces.txt into multiple files, sort each file in parallel, then merge-sort the results.
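
The split, sort, and merge idea can be sketched in Python as follows. This is a simplified illustration, not a tuned implementation: the chunks are in-memory lists rather than files, and they are sorted one after another here, although each chunk is independent and could be handed to a separate process. The function names and the toy pieces are made up for the example.

```python
import heapq
from itertools import groupby, islice

def sorted_chunks(items, chunk_size):
    """Split an iterable into chunks and sort each chunk independently."""
    it = iter(items)
    while True:
        chunk = sorted(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def merged_counts(chunks):
    """K-way merge of already-sorted chunks, then count runs of equal items."""
    for kmer, run in groupby(heapq.merge(*chunks)):
        yield kmer, sum(1 for _ in run)

# Toy pieces (hypothetical); in practice these would be the lines of pieces.txt
pieces = ["TAY", "AYI", "MKT", "AYI", "KTA", "YIA", "AYI"]
chunks = list(sorted_chunks(pieces, chunk_size=3))
result = list(merged_counts(chunks))
for kmer, n in result:
    print(n, kmer)
```

On the command line, the merge step corresponds to `sort -m`, which merges files that are already sorted.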