Question: (Closed) editing bed file with python
6 days ago by
flogin110
FioCruz/Brazil
flogin110 wrote:

I have a file like this:

seq1,4205,6421
seq1,4205,6421
seq1,6367,7962
seq1,6367,7962
seq1,8527,9390
seq2,1612,4917
seq2,1612,4917
seq2,1612,4917
seq3,5813,6610
seq3,6676,8307
seq3,6676,8307

I want to remove the redundant sequence names and report, for each sequence, the lowest start and the highest end value, like this:

seq1,4205,9390
seq2,1612,4917
seq3,5813,8307

I wrote a Python script to try to do this using dictionaries (which I plan to convert to a DataFrame and finally write out as CSV).

# -*- coding: utf-8 -*-
#!/usr/bin/env python3
import argparse as ag
parser = ag.ArgumentParser("This program receives as input a bed file with redundancy in sequence names, and different positions of each domain in the same sequence, and return a bed file without redundancy, and considered the lower and greater region as start and end respectively")
parser.add_argument("--infile",type=ag.FileType('r', encoding='UTF-8'),required=True,help="Input File")
args = parser.parse_args()
input_file = args.infile
dicti = {}
list_dicti = []
aux = "" # a quick fix to compare ID of each line
aux_2 = "" # a quick fix to compare ID of each line
start_1 = "" # used to compare values of start and end
start_2 = ""
end_1 = ""
end_2 = ""
for line in input_file: 
    if aux_2 == "": # in the first time, the aux_2 will receive the ID name
        aux_2 = line.strip().split(",")[0]  
    else:
        aux_2 = line.strip().split(",")[0]
    aux = line.strip().split(",")[0] # aux also receive the name of the sequence
    if aux == aux_2: #
        if start_1 == "" and end_1 == "":
            start_1 = line.strip().split(",")[1] 
            end_1 = line.strip().split(",")[2] 
        if start_2 == "" and end_2 == "":
            start_2 = line.strip().split(",")[1] 
            end_2 = line.strip().split(",")[2] 
        else:
            if start_2 < start_1:
                start_1 = start_2
            if end_2 > end_1:
                end_1 = end_2
    start_2 = line.strip().split(",")[1]
    end_2 = line.strip().split(",")[2] 
    if [d[aux] for d in list_dicti]:
        pass
    else:   
        dicti[aux]=start_1,end_1
        list_dicti.append(dicti)
    dicti = {}
    start_1 = ""
    start_2 = ""
    end_1 = ""
    end_2 = ""

for i in list_dicti:
    print(i)

But, my output is:

Traceback (most recent call last):
  File "bel_domains.py", line 38, in <module>
    if [d[aux] for d in list_dictio]:
  File "bel_domains.py", line 38, in <listcomp>
    if [d[aux] for d in list_dictio]:
KeyError: 'seq1'

My logic was as follows: read each line of the file; if the sequence name is the same between lines, the sequence ID, the lowest start value and the highest end value should be stored in a dictionary, and the dictionary should be appended to a list. To avoid inserting several entries for the same sequence, I added these lines:

if [d[aux] for d in list_dicti]:
            pass

But that is exactly where the error occurs.

Can anyone explain this to me?
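
A note on the traceback itself: `d[aux]` raises a KeyError as soon as any dictionary in the list lacks the key `aux`, so the list comprehension cannot serve as an "already seen?" check. (The traceback mentions `list_dictio` while the posted script uses `list_dicti`, so it probably comes from a slightly different version of the script; the mechanism is the same.) The following is a minimal sketch, assuming a toy one-entry list and a hypothetical helper named `seen`, that contrasts indexing with a membership test:

```python
# Toy stand-in for list_dicti from the question: a list of one-key dicts.
list_dicti = [{"seq1": ("4205", "9390")}]


def seen(name, dicts):
    """Hypothetical helper: True if any dict in `dicts` has `name` as a key."""
    return any(name in d for d in dicts)


# Indexing raises KeyError as soon as one dict lacks the key ...
try:
    [d["seq2"] for d in list_dicti]
    raised = False
except KeyError:
    raised = True

# ... while a membership test simply answers True or False.
print(raised)
print(seen("seq1", list_dicti))
print(seen("seq2", list_dicti))
```

Replacing the condition with something like `any(aux in d for d in list_dicti)` would avoid the exception, although a single dictionary keyed by sequence name is usually simpler than a list of one-key dictionaries.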

Best,

Tags: redundancy, bed, python

Hello flogin!

We believe that this post does not fit the main topic of this site.

Zoomboi carefully helped me with the problem.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

written 6 days ago by flogin110
6 days ago by
Zoomboi40
Zoomboi40 wrote:

I'm sorry in advance if this doesn't address your question or if this isn't helpful. Here's another way to remove the redundancies and create a new csv with each sequence and its minimum and maximum values.

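import csv

# newline='' is what the csv module documentation recommends for file objects
with open('tempbed.csv', newline='') as csv_file:
    seq_dict = {}
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        if not row:  # skip blank lines
            continue
        # Convert to int so min()/max() compare numbers, not strings
        seq, mini, maxi = row[0], int(row[1]), int(row[2])
        if seq not in seq_dict:
            seq_dict[seq] = [mini, maxi]
        else:
            seq_dict[seq].extend([mini, maxi])

with open('tempbed_out.csv', mode='w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    # One output row per sequence: name, lowest value, highest value
    for sequ in seq_dict:
        writer.writerow([sequ, min(seq_dict[sequ]), max(seq_dict[sequ])])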
written 6 days ago by Zoomboi40

Thanks Zoomboi, this resolves the problem. I'm a beginner in programming and I'm trying to learn on my own, so I'll study your code to improve my programming skills.

Thanks

written 6 days ago by flogin110

I'm glad this helped, flogin! I'll explain the code a little to make it easier to follow.

Essentially, we're creating a dictionary with the sequence name (seq1, seq2, etc.) as the key, and a list of all the start and end values seen for that sequence as the value. For your file it would look something like this:

{seq1: [4205, 6421, 4205, 6421, 6367, 7962, ...], seq2: [1612, 4917, 1612, 4917, ...], ...}

Once the dictionary has been built, we go through each key and write out the key's name together with the minimum and maximum of the values stored under it.
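
The same grouping idea can be sketched without keeping every value around. This is a variant, not the code from the answer: it stores only a running (lowest start, highest end) pair per sequence, and it assumes comma-separated `name,start,end` lines with integer coordinates. The function name `merge_ranges` is made up for illustration.

```python
def merge_ranges(lines):
    """Collapse 'name,start,end' lines to one (lowest start, highest end) per name."""
    ranges = {}
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        name, start, end = line.split(",")
        start, end = int(start), int(end)  # compare as numbers, not text
        if name in ranges:
            lo, hi = ranges[name]
            ranges[name] = (min(lo, start), max(hi, end))
        else:
            ranges[name] = (start, end)
    return ranges


# Sample data copied from the question.
sample = [
    "seq1,4205,6421", "seq1,4205,6421", "seq1,6367,7962", "seq1,6367,7962",
    "seq1,8527,9390", "seq2,1612,4917", "seq2,1612,4917", "seq2,1612,4917",
    "seq3,5813,6610", "seq3,6676,8307", "seq3,6676,8307",
]

for name, (lo, hi) in merge_ranges(sample).items():
    print(f"{name},{lo},{hi}")
```

On the sample data this prints the three lines requested in the question (seq1,4205,9390, seq2,1612,4917 and seq3,5813,8307).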

The real heroes here are the csv library and the built-in functions min() and max().
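
One caveat with min() and max(): csv.reader yields every field as a string, and strings are compared character by character rather than by numeric value. The sample file happens to contain only four-digit coordinates, so the difference does not show up there. A small sketch with made-up values (10000 is not in the original data) shows why converting to int first is safer:

```python
# Hypothetical coordinates of different lengths, as csv.reader would return them.
as_text = ["9390", "10000"]
as_int = [int(v) for v in as_text]

# Text comparison looks at the first character: "9" sorts after "1".
print(max(as_text))  # 9390
# Numeric comparison gives the expected answer.
print(max(as_int))   # 10000
```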

written 6 days ago by Zoomboi40

Nice, thanks again !

written 6 days ago by flogin110