Question: Given a concatenated FASTA file, how to find the individual range of locations in each protein sequence
4.5 years ago, User 6777 (United States) wrote:

Hi all,

I have a protein fasta file (protein.txt) like:

>a
mnspq
>b
rstuvw
>c
mnqa

Note that the lengths of proteins a, b and c are 5, 6 and 4 respectively (total length = 15).

Now I have extracted some ranges (the coordinates are relative to the total concatenated length) and saved them (file1.txt) as:

2-3
4-10
11-14

The range that each protein occupies within the total length, as derived from the protein file, is saved in another file (file2.txt) as:

a  1-5
b  6-11
c  12-15

Now, using the file1 values together with the file2 values, I want to calculate the individual ranges within each protein sequence. For the above input, the output will be:

a   2-3,4-5
b   1-5, 6
c   2-5

In other words, if I first concatenate all my sequences and determine some ranges on the concatenated sequence, how can I find the corresponding ranges of locations in each individual protein sequence?

Thanks for your consideration.

python perl fasta • 1.2k views

Well, just write a script.

written 4.5 years ago by shenwei356

As long as you have unique headers in your multi-fasta file, samtools faidx with a region should do the extraction part. See this: "Extract User Defined Region From An Fasta File". @Matt Shirley also has a Python-based pyfaidx solution.

I am not exactly certain what you are trying to do in the subsequent steps.

Edit: Re-reading your original post I am not sure this is what you need. But I will leave this here for now to see if it helps.

modified 4.5 years ago • written 4.5 years ago by GenoMax
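
A minimal sketch of the kind of script suggested above, in Python (one of the question's tags). It assumes that the ranges in file1.txt are 1-based and inclusive on the concatenated sequence, that FASTA headers are unique, and that records are concatenated in file order. The function names read_fasta_lengths and split_ranges are hypothetical, chosen for illustration; this is not code from any of the posters.

```python
from bisect import bisect_left
from itertools import accumulate


def read_fasta_lengths(lines):
    """Return [(name, length), ...] in file order from FASTA-formatted lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the name (assumed unique).
            name = (line[1:].split() or [""])[0]
            records.append([name, 0])
        elif records:
            records[-1][1] += len(line)
    return [(name, length) for name, length in records]


def split_ranges(lengths, ranges):
    """Map 1-based inclusive ranges on the concatenation to per-protein ranges."""
    names = [name for name, _ in lengths]
    # Global (concatenated) end coordinate of each protein, e.g. [5, 11, 15].
    ends = list(accumulate(length for _, length in lengths))
    total = ends[-1] if ends else 0
    result = {name: [] for name in names}
    for start, end in ranges:
        if not 1 <= start <= end <= total:
            raise ValueError(f"range {start}-{end} is outside 1-{total}")
        i = bisect_left(ends, start)  # first protein whose end is >= start
        while start <= end:
            offset = ends[i] - lengths[i][1]  # residues before protein i
            piece_end = min(end, ends[i])
            if piece_end >= start:  # skips empty records
                result[names[i]].append((start - offset, piece_end - offset))
                start = piece_end + 1
            i += 1
    return result


if __name__ == "__main__":
    # Example data copied from the question (protein.txt and file1.txt).
    fasta = [">a", "mnspq", ">b", "rstuvw", ">c", "mnqa"]
    ranges = [(2, 3), (4, 10), (11, 14)]
    per_protein = split_ranges(read_fasta_lengths(fasta), ranges)
    for name, pieces in per_protein.items():
        print(name, ",".join(f"{s}-{e}" for s, e in pieces), sep="\t")
```

For the example in the question this prints a 2-3,4-5, then b 1-5,6-6, then c 1-3. The last line differs from the c 2-5 shown in the question: since c has only 4 residues, 12-14 on the concatenation corresponds to 1-3 under the assumptions stated above, so the expected output in the post may contain a typo or may follow a different convention.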