How to retrieve sequences from a FASTA file using start and end coordinates from an Excel file?
3.8 years ago
akashbala0 ▴ 10

Hi! I have an Excel file with thousands of chromosome names, each with a transcription start and end point. Could someone please develop a Python program that retrieves the sequences from a genome file according to the start and end points listed in the Excel file?

The Excel file looks like this:

    chromosome  start   end
    KB317696.1  1361    1376
    KB317696.1  1594    1929
    KB317697.1  2033    2101
    KB317697.1  2159    2265
    KB317698.1  2319    2421
    KB317699.1  2513    2736
    KB317700.1  2789    2903
    KB317700.1  3157    3279
python biopython • 1.6k views

please develop a python program

That is not what the forum is here for. There is an expectation that you demonstrate some effort toward solving the problem yourself first. Moreover, this is not an uncommon task, so please search the forum; there will undoubtedly be existing solutions you can try.


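This can be done in the following steps:

- Import the sequences using Biopython. SeqIO.read() only accepts a file containing a single record, so for a genome with several scaffolds use SeqIO.parse() or SeqIO.to_dict() and look each sequence up by its chromosome name.
- Import the Excel file as a table using pandas, and subset the table to keep only the chromosome, start, and end columns.
- Go through the rows of this table in a for loop and slice the matching sequence using the start and end entries (example: seq_output = sequence[start_position:end_position]).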

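P.S. Python indexing starts from 0 and a slice excludes its end index, so if the coordinates in the Excel file are 1-based and inclusive, your start_position should really be start_position - 1, while end_position stays as it is.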
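The steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than a definitive implementation: the helper names `read_fasta` and `extract_regions`, the record names `chrA` and `chrB`, and the toy sequences are all made up for the example, and it assumes the start and end values are 1-based and inclusive (worth verifying for the real data).

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {record_id: sequence_string}."""
    parts_by_id = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            current_id = line[1:].split()[0]
            parts_by_id[current_id] = []
        elif current_id is not None:
            parts_by_id[current_id].append(line)
    return {name: "".join(parts) for name, parts in parts_by_id.items()}


def extract_regions(sequences, regions):
    """Yield (chromosome, start, end, subsequence) for each region.

    ASSUMPTION: start and end are 1-based and inclusive, so the
    equivalent Python slice is [start - 1:end].
    """
    for chromosome, start, end in regions:
        start, end = int(start), int(end)
        yield chromosome, start, end, sequences[chromosome][start - 1:end]


# Toy data standing in for the genome FASTA and the rows of the Excel file.
fasta_text = """>chrA example record
ACGTACGTAC
GGGTTTAAAC
>chrB
TTTTCCCCGGGGAAAA
"""
regions = [("chrA", 1, 4), ("chrA", 9, 12), ("chrB", 5, 8)]

sequences = read_fasta(fasta_text.splitlines())
results = list(extract_regions(sequences, regions))
for chromosome, start, end, subseq in results:
    print(f">{chromosome}:{start}-{end}")
    print(subseq)
```

With the real files, the dictionary could instead come from Biopython, for example SeqIO.to_dict(SeqIO.parse("genome.fasta", "fasta")), and the rows from pandas.read_excel(); the file name here is a placeholder. Slicing a SeqRecord returns another SeqRecord, so the sequence itself would then be in its .seq attribute.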

3.8 years ago

Just to close this post: you can check getfasta (for example, bedtools getfasta) or the Biopython documentation if you prefer Python. There is no need to reinvent the wheel; a simple Google search would have given you a quicker solution. Also, BED is the standard file format for storing this kind of interval data.
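For illustration, a table like the one in the question could be converted to BED lines along these lines. This is only a sketch: `to_bed_lines` is a hypothetical helper, and it assumes the coordinates in the spreadsheet are 1-based and inclusive, whereas BED uses a 0-based start and an exclusive end, so only the start is shifted by one.

```python
def to_bed_lines(regions):
    """Convert (chromosome, start, end) rows to tab-separated BED lines.

    ASSUMPTION: the input is 1-based and inclusive. BED starts are
    0-based and BED ends are exclusive, so start becomes start - 1
    and end is left unchanged.
    """
    for chromosome, start, end in regions:
        yield f"{chromosome}\t{int(start) - 1}\t{int(end)}"


# The first two rows from the table in the question.
regions = [("KB317696.1", 1361, 1376), ("KB317696.1", 1594, 1929)]
bed_lines = list(to_bed_lines(regions))
print("\n".join(bed_lines))
```

Written to a file, these lines could then be passed to a tool such as bedtools getfasta through its -fi (genome FASTA) and -bed (intervals) options.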
