Question: Extract Sequence From Fasta and remove some characters before printing
0
gravatar for Light
18 months ago by
Light20
Light20 wrote:

Hello,

I have a list of UniProt Id's in the first column and signal peptide length(eg 25 below) in the second column in a file (say seqid_len.txt). eg of a single entry

tr_GHAT8X                25
tr_GHAMNO                26

I want to extract fasta seq for each UniProt id which is a subset of seqid in a large fasta file(say fasta_db.fasta), print those seq in a new file after removing the length(from 1 to 25, in eq) of the signal peptides. How could I do this?

>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein 
MJHJKAHKJHSKBABXNXNELRVYAISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY
GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP
>tr|GHAmno|GHAMNO_9GHA Uncharacterized protein 
MJHAHLFLAJFLJALFNNCLAN;CNALNCLANCLNALNCLANLKLJKHJHKBBHBHBHBHBHBHBHBH
CLNALCNLANCKLNALKNCKDHKJAJGDAJSIRNP

Output file sample(1to25aa removed)

>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein 
ISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY
GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP

Please suggest me if there is an alternate way for the solution in python/perl.

Thanks

biopython python fasta • 1.5k views
ADD COMMENTlink modified 18 months ago by st.ph.n2.4k • written 18 months ago by Light20
1

using seqkit and bash: input fasta file:

$ cat test.fa

>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein 
MJHJKAHKJHSKBABXNXNELRVYAISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP
>tr|GHAmno|GHAMNO_9GHA Uncharacterized protein 
MJHAHLFLAJFLJALFNNCLAN;CNALNCLANCLNALNCLANLKLJKHJHKBBHBHBHBHBHBHBHBH CLNALCNLANCKLNALKNCKDHKJAJGDAJSIRNP

input ids:

$ cat ids.txt 
tr_GHAT8X 25
tr_GHAMNO 26

code:

$ sed 's/_/|/g' ids.txt | while read line;do grep -i -A1 ${line%%[1-9]*} test.fa | seqkit subseq -r ${line##[a-z]* }:-1 ; done

output:

>tr|GHAT8X|GHAT8X_9GHA Uncharacterized protein 
AISQLNELIADFGGNSARDYLESTISNEGAHPSIRNSAVFSY GKTFYFSDRNHAENFLKRFSSJKDKAJHKDHKJAJGDAJSIRNP
>tr|GHAmno|GHAMNO_9GHA Uncharacterized protein 
ALNCLANCLNALNCLANLKLJKHJHKBBHBHBHBHBHBHBHBH CLNALCNLANCKLNALKNCKDHKJAJGDAJSIRNP

Note: Clean up your fasta. It has special characters (;) and spaces in between AA.

Assumptions:

  1. Fasta sequences are linearized
  2. IDs and signal peptide lenght in ids.txt are separated by space.
ADD REPLYlink modified 18 months ago • written 18 months ago by cpad011211k

Thank You @cpad0112 for ur help. I was hoping to use python for the purpose.

ADD REPLYlink written 18 months ago by Light20

oh okay. Moving the post as comment.

ADD REPLYlink written 18 months ago by cpad011211k

This wouldn't be too hard in (bio)python, do you have any experience in programming (in python)?

ADD REPLYlink written 18 months ago by WouterDeCoster39k

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

Please check if the displayed format is still accurate.

ADD REPLYlink written 18 months ago by WouterDeCoster39k

Thank You. Everything seems in the correct format. I recently started to learn python basics commands.

ADD REPLYlink written 18 months ago by Light20
2
gravatar for st.ph.n
18 months ago by
st.ph.n2.4k
Philadelphia, PA
st.ph.n2.4k wrote:

LInearize fasta, and assumes tab-delimited seqid_len.txt.

#!/usr/bin/env python

# Empty dictionary; {key:value}; {seqid:position, seqid2:pos, ..} Calling lendict[seqid] give the pos in integer
lendict = {}

# open file with seid '\t' pos    
with open('seqid_len.txt', 'r') as lens:
    # for each line in the file
    for line in lens:
    # populate dictionary, with {seqid:pos}
        lendict[line.strip().split('\t')[0]] = int(line.strip().split('\t')[1]


# open intput fasta file for reading
with open('species.fasta', 'r') as f:
# open output file for writing
    with open('fasta_pruned.fasta', 'w') as out:
    # for line in input fasta
        for line in f:
        # if line starts with '>'; i.e is a header
            if line.startswith(">"):
            # print the header to the output file; and on the next line of the output file write the sequence, but use the position from the dictionary, based on key value; to print the sequence from that position to the end of the sequence
                out.write(line.strip, '\n', next(f)[lendict[('_'.join(line.strip().split('|')[:2])]:])
ADD COMMENTlink modified 18 months ago • written 18 months ago by st.ph.n2.4k

Thanks for the help @st.ph.n.

ADD REPLYlink written 18 months ago by Light20
0
gravatar for Asaf
18 months ago by
Asaf5.6k
Israel
Asaf5.6k wrote:

I think the best way would be to generate a bed file from the coordinates file you have, defining the region of the signal peptide, this could easily be done with awk for instance and then use bedtools maskfasta to remove these regions.

ADD COMMENTlink written 18 months ago by Asaf5.6k

Thanks @Asaf. But I wanted to perform using python.

ADD REPLYlink written 18 months ago by Light20
1

Then it will be much easier. Read the coordinates file to a dictionary and then read the sequences using SeqIO.parse and using the names of the sequences print the subsequences. Look at the biopython cookbook for more information

ADD REPLYlink written 18 months ago by Asaf5.6k

sounds easy but for new on python programming like me it looks a high wall :D

ADD REPLYlink written 18 months ago by Light20
1

I could just write the script but maybe if we do this step by step, then you'll learn more in the long run.

Step1:

We first need to read the file with the signal peptide lengths. For every line in the file we make an entry in a dictionary coupling the name to the length.

with open("seqid_len.txt") as seqlengths:
    # read the file line by line
    # construct the dictionary

We can go to step 2 if you have written the code for this - or let us know if you get stuck.

ADD REPLYlink written 18 months ago by WouterDeCoster39k
0
gravatar for Light
18 months ago by
Light20
Light20 wrote:
for line in seqlengths:
     split_IDlength = line.split(' ')
     ID = split_IDlength[0]
     Length = split_IDlength[1]
dict = {"ID": , "Length":}

Thank You for the help. I am trying.

ADD COMMENTlink modified 18 months ago • written 18 months ago by Light20

Print out the dictionary, it's not what you want to get. The start of your code was good, but you need to look at how to make a dictionary again.

I would suggest something like:

lengthdict = dict() 
for line in seqlengths:
     split_IDlength  = line.strip().split(' ')
     lengthdict[split_IDlength[0]] = split_IDlength[1]

Typing code on my phone is a bit hard, hope I didn't make ridiculous mistakes. Will follow up tomorrow morning (CET).

You should also have a look at the biopython tutorial and cookbook (which is great!) on how to parse and slice a fasta file.

ADD REPLYlink written 18 months ago by WouterDeCoster39k

Ok, will look at it.

ADD REPLYlink written 18 months ago by Light20

Light : Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLYlink written 18 months ago by genomax67k

Sorry about that. I will keep it in mind.

ADD REPLYlink written 18 months ago by Light20

Let us know when you need further help.

ADD REPLYlink written 18 months ago by WouterDeCoster39k
2
from Bio import SeqIO
from Bio import Seq

f1 = open('fasta_pruned.fasta','w')

lengthdict = dict() 
with open("seqid_len.txt") as seqlengths:
    for line in seqlengths:
        split_IDlength  = line.strip().split(' ')
        lengthdict[split_IDlength[0]] =int( split_IDlength[-1])


with open("species.fasta","rU") as spe:
    for record in SeqIO.parse(spe,"fasta"):
        if record[0] == '>' :
            split_header = line.split('|')
            accession_ID = split_header[1]
            if accession_ID in lengthdict:
                f1.writestrseq_record.id) + "\n")
                f1.write(str(seq_record.seq[split_IDlength[1]-1:]))



f1.close()

For 2nd step, i want to read the fasta file and split the seq id line extract the unique id for comparison. But stuck how to use it dictionary and remove chars. doubt my syntax also :(

ADD REPLYlink modified 18 months ago • written 18 months ago by Light20

You are getting pretty close, but I would suggest the following:

for record in SeqIO.parse("species.fasta", "fasta"):
    accession_ID = record.id.split('|')[1]
    trim_length = int(lengthdict[accession_ID])
    record.seq = record.seq[trim_length:]
    print(record.format("fasta")
ADD REPLYlink written 18 months ago by WouterDeCoster39k
trim_length = int(lengthdict[accession_ID]

Trim_length will hold accession id?? not understanding it.

ADD REPLYlink written 18 months ago by Light20

You do not know what a dictionary is eh...

Why don't you make the dictionary and play with it a bit to see how it works?

ADD REPLYlink written 18 months ago by WouterDeCoster39k

working on it now..Thanks

ADD REPLYlink written 18 months ago by Light20

I added comments to my answer above. hope it helps.

ADD REPLYlink written 18 months ago by st.ph.n2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1874 users visited in the last hour