Question: How Can I Convert A Bed File Into A Tab File With Paired End Reads On The Same Row?
7.4 years ago by
Luke230
Turin, Italy
Luke230 wrote:

Hi guys!
I have a file like this one (obtained from a BED file using awk):

scaffold00002b  209798  209823  HWUSI-EAS1825_0024_FC:8:1:6009:1105
scaffold00002b  209802  209838  HWUSI-EAS1825_0024_FC:8:1:6009:1105  
scaffold00002d  43627   43652   HWUSI-EAS1825_0024_FC:8:1:8703:1105
scaffold00008e  22741   22767   HWUSI-EAS1825_0024_FC:8:1:14128:1104
scaffold00008e  22740   22768   HWUSI-EAS1825_0024_FC:8:1:14128:1104

(note that rows 1-2 and rows 4-5 have the same value in the 4th field).

I wish to convert it to a tab file like this one:

HWUSI-EAS1825_0024_FC:8:1:6009:1105  scaffold00002b  209798  209823  scaffold00002b  209802  209838
HWUSI-EAS1825_0024_FC:8:1:8703:1105  scaffold00002d  43627   43652
HWUSI-EAS1825_0024_FC:8:1:14128:1104 scaffold00008e  22741   22767   scaffold00008e  22740   22768

in which the fields from lines that share the same value in column $4 are printed on a single row.
Since rows with the same 4th field are always consecutive, I tried to test whether the 4th field of the previous row is equal to the 4th field of the current row, and to repeat this check over all the rows of my input file.

BUT...
unfortunately I have no idea how to print the fields of the current row alongside the fields of the previous row (when the "==" condition is satisfied).

Any ideas?

Thanks in advance, Luke

bed conversion file • 2.4k views
7.4 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

I created a file, 'test.txt', that contains your data as shown above. Here is a quick python solution:

#!/usr/bin/env python
import csv

with open('test.txt','r') as f:
    reader = csv.reader(f,delimiter='\t')
    prevrow=None
    for row in reader:
        if(prevrow is None):
            # initialize the first time through
            prevrow=row
            continue
        if(row[3]!=prevrow[3]):
            # single reads
            print "%s\t%s" % (row[3],"\t".join(prevrow[:3]))
            prevrow=row
        if(row[3]==prevrow[3]):
            # print pairs
            print "%s\t%s" % (row[3],"\t".join(prevrow[:3]+row[:3]))
            prevrow=None

Output is:

HWUSI-EAS1825_0024_FC:8:1:6009:1105 scaffold00002b  209798  209823  scaffold00002b  209802  209838
HWUSI-EAS1825_0024_FC:8:1:14128:1104    scaffold00002d  43627   43652
HWUSI-EAS1825_0024_FC:8:1:14128:1104    scaffold00008e  22741   22767   scaffold00008e  22741   22767

Dear Sean, thank you! I'm studying Python; it's a very powerful language! I tried your script, but it works correctly only for the first two records of my file. For the following rows it prints the first read, then prints the second read twice on the next row. Is it a problem with the iteration? Or with variable substitution?

Reply written 7.4 years ago by Luke230
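
The behaviour described in this reply appears to come from three things in the script above. The second `if` is still evaluated after the first branch has already set `prevrow=row`, so the "pairs" branch fires on the same row and prints it twice. The single-read branch prints `row[3]` where `prevrow[3]` was presumably intended. And a row left in `prevrow` when the file ends is never printed. Below is a minimal sketch of one possible alternative, not the original author's method: it swaps the `prevrow` bookkeeping for `itertools.groupby`, which relies on the stated assumption that rows sharing a 4th field are always consecutive. The function name `merge_reads` and the inline `sample` are illustrative, and the code is written for Python 3. Fields are split on any whitespace, which assumes read IDs contain no spaces.

```python
from itertools import groupby


def merge_reads(lines):
    """Yield one tab-joined line per run of consecutive rows sharing a read ID.

    Each input line is expected to hold: scaffold, start, end, read ID.
    """
    # Split on any whitespace so tabs, spaces and trailing blanks are tolerated;
    # blank lines are skipped.
    rows = (line.split() for line in lines if line.strip())
    # groupby only merges *adjacent* rows with the same key, so the input
    # must already have mates on consecutive lines (as in the question).
    for read_id, group in groupby(rows, key=lambda row: row[3]):
        merged = [read_id]
        for row in group:
            merged.extend(row[:3])  # scaffold, start, end
        yield "\t".join(merged)


# Sample rows copied from the question (hypothetical stand-in for test.txt).
sample = """\
scaffold00002b  209798  209823  HWUSI-EAS1825_0024_FC:8:1:6009:1105
scaffold00002b  209802  209838  HWUSI-EAS1825_0024_FC:8:1:6009:1105
scaffold00002d  43627   43652   HWUSI-EAS1825_0024_FC:8:1:8703:1105
scaffold00008e  22741   22767   HWUSI-EAS1825_0024_FC:8:1:14128:1104
scaffold00008e  22740   22768   HWUSI-EAS1825_0024_FC:8:1:14128:1104
"""

for line in merge_reads(sample.splitlines()):
    print(line)
```

To run it on a real file, an open file handle can be passed in place of `sample.splitlines()`, for example `merge_reads(open('test.txt'))`. Because `groupby` collects whole runs, a read ID that occurs once, twice or more than twice is handled by the same loop, and the last group is emitted without any special end-of-file case.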