How Can I Convert A Bed File Into A Tab File With Paired End Reads On The Same Row?
12.3 years ago
Luke ▴ 240

Hi guys!

I have a file like this one (obtained from a BED file using awk):

scaffold00002b  209798  209823  HWUSI-EAS1825_0024_FC:8:1:6009:1105
scaffold00002b  209802  209838  HWUSI-EAS1825_0024_FC:8:1:6009:1105  
scaffold00002d  43627   43652   HWUSI-EAS1825_0024_FC:8:1:8703:1105
scaffold00008e  22741   22767   HWUSI-EAS1825_0024_FC:8:1:14128:1104
scaffold00008e  22740   22768   HWUSI-EAS1825_0024_FC:8:1:14128:1104

(note that rows 1-2 and rows 4-5 share the same value in the 4th field).

I wish to convert it to a tab file like this one:

HWUSI-EAS1825_0024_FC:8:1:6009:1105  scaffold00002b  209798  209823  scaffold00002b  209802  209838
HWUSI-EAS1825_0024_FC:8:1:8703:1105  scaffold00002d  43627   43652
HWUSI-EAS1825_0024_FC:8:1:14128:1104 scaffold00008e  22741   22767   scaffold00008e  22740   22768

in which the fields from lines that share the same value in column $4 are printed on a single row.

Since rows with the same 4th field are always consecutive, I tried to test whether the 4th field of the previous row is equal to the 4th field of the current row, and to repeat this test over all the rows of my input file.

BUT...

unfortunately I have no idea how to print the fields of the current row alongside those of the previous row (when the "==" condition is satisfied).

Any idea?

Thanks in advance,
Luke

bed • 4.3k views
12.2 years ago

I created a file, 'test.txt', that contains your data as shown above. Here is a quick Python solution (written for Python 2, hence the print statements):

#!/usr/bin/env python
import csv

with open('test.txt','r') as f:
    reader = csv.reader(f,delimiter='\t')
    prevrow=None
    for row in reader:
        if(prevrow is None):
            # initialize the first time through
            prevrow=row
            continue
        if(row[3]!=prevrow[3]):
            # single reads
            print "%s\t%s" % (row[3],"\t".join(prevrow[:3]))
            prevrow=row
        if(row[3]==prevrow[3]):
            # print pairs
            print "%s\t%s" % (row[3],"\t".join(prevrow[:3]+row[:3]))
            prevrow=None

Output is:

HWUSI-EAS1825_0024_FC:8:1:6009:1105 scaffold00002b  209798  209823  scaffold00002b  209802  209838
HWUSI-EAS1825_0024_FC:8:1:14128:1104    scaffold00002d  43627   43652
HWUSI-EAS1825_0024_FC:8:1:14128:1104    scaffold00008e  22741   22767   scaffold00008e  22741   22767

Dear Sean, thank you! I'm studying Python, it's a very powerful language! I tried your script, but it works correctly only for the first two records of my file. For the following rows it prints the first read, then prints the second read twice on the next row. Is it a problem with the iteration, or with variable substitution?
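
It looks like a control-flow problem rather than a substitution one. In the script above the two `if` tests run one after the other: when the IDs differ, the first branch sets prevrow=row, so the second test (row[3]==prevrow[3]) is then always true and the current read gets printed paired with itself. The first branch also prints row[3] where prevrow[3] was intended, and a row still held in prevrow when the file ends is never printed. Below is a minimal corrected sketch, assuming Python 3, tab-delimited input, and that rows sharing a read ID are always consecutive. The function name merge_consecutive_reads and the inline SAMPLE string are only illustrative and are not part of the original script.

```python
import csv
import io

# Sample rows copied from the question, joined with tabs (illustrative only).
SAMPLE = (
    "scaffold00002b\t209798\t209823\tHWUSI-EAS1825_0024_FC:8:1:6009:1105\n"
    "scaffold00002b\t209802\t209838\tHWUSI-EAS1825_0024_FC:8:1:6009:1105\n"
    "scaffold00002d\t43627\t43652\tHWUSI-EAS1825_0024_FC:8:1:8703:1105\n"
    "scaffold00008e\t22741\t22767\tHWUSI-EAS1825_0024_FC:8:1:14128:1104\n"
    "scaffold00008e\t22740\t22768\tHWUSI-EAS1825_0024_FC:8:1:14128:1104\n"
)


def merge_consecutive_reads(rows):
    """Yield one list per read ID: [id, chrom, start, end, chrom, start, end, ...].

    Assumes rows with the same ID (4th field) are adjacent in the input.
    """
    current_id = None
    merged = None
    for row in rows:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, read_id = (field.strip() for field in row[:4])
        if read_id == current_id:
            # same read as the previous row: append its coordinates
            merged.extend([chrom, start, end])
        else:
            # new read: emit the finished one, then start a new output row
            if merged is not None:
                yield merged
            current_id = read_id
            merged = [read_id, chrom, start, end]
    # flush the last read, which the loop itself never emits
    if merged is not None:
        yield merged


reader = csv.reader(io.StringIO(SAMPLE), delimiter="\t")
for out in merge_consecutive_reads(reader):
    print("\t".join(out))
```

To run it on a real file, replace io.StringIO(SAMPLE) with an open file handle, for example open('test.txt', newline=''). Because the output row is built up until the ID changes, the same loop handles single reads, pairs, and IDs that appear on more than two consecutive rows.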
