Question: Extract start coordinate from read 1 and end coordinate from read 2 of paired end reads.
0
gravatar for rjobmc
7 months ago by
rjobmc0
rjobmc0 wrote:

Hi there. Im looking to extract chr, start coordinate of read 1 and the end coordinate of read 2 of paired-end NGS into one "bed" file. I have over 7 million reads and I am not sure that every paired-end read has a "pair". I have used bamtobed function of bedtools and sorted the file by read info (eg M01269....). Here is an example of what I need, For the following:

chrII   404128  404259  M01269:176:000000000-BW364:1:1101:10000:12221/1 60  -
chrII   404126  404251  M01269:176:000000000-BW364:1:1101:10000:12221/2 60  +
chrVII  350990  351120  M01269:176:000000000-BW364:1:1101:10000:24715/1 60  -
chrVII  350971  351093  M01269:176:000000000-BW364:1:1101:10000:24715/2 60  +
chrXII  527617  527747  M01269:176:000000000-BW364:1:1101:10000:26164/1 60  +
chrXII  527627  527753  M01269:176:000000000-BW364:1:1101:10000:26164/2 60  -
chrVII  826318  826449  M01269:176:000000000-BW364:1:1101:10000:8567/1  60  +
chrVII  826335  826461  M01269:176:000000000-BW364:1:1101:10000:8567/2  60  -
chrXII  880431  880562  M01269:176:000000000-BW364:1:1101:10001:14255/1 60  +
chrXII  880448  880574  M01269:176:000000000-BW364:1:1101:10001:14255/2 60  -

I need:

chrII    404128 404251
chrVII  350990  351093
chrXII  527617  527753
chrVII  826318 . 826461
chrXII  880431  880574

Any help would be very appreciated. Thanks

unix bedtools • 489 views
ADD COMMENTlink modified 7 months ago by finswimmer11k • written 7 months ago by rjobmc0
1

Hello rjobmc,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLYlink written 7 months ago by finswimmer11k

Hello Could you please explain what 60 and - or + stand for ?

ADD REPLYlink written 5 months ago by rania.hamdy10

probably length of the read and strand (+ and -) rania.hamdy1

ADD REPLYlink written 5 months ago by cpad011211k
2
gravatar for cpad0112
7 months ago by
cpad011211k
India
cpad011211k wrote:
$ sed 's/\/.*//g' test.txt | datamash -sg 1,4 min 2 max 3 | cut -f1,3,4
chrI    12085   12225
chrI    329 456


$ cat test.txt 
chrI    12085   12215   M01269:176:000000000-BW364:1:2114:9178:18263/1  60  +
chrI    12099   12225   M01269:176:000000000-BW364:1:2114:9178:18263/2  60  -
chrI    329 456 M01269:176:000000000-BW364:1:2114:9317:17550/1  60  +
chrI    330 456 M01269:176:000000000-BW364:1:2114:9317:17550/2  60  -
ADD COMMENTlink modified 7 months ago • written 7 months ago by cpad011211k

$ awk -v OFS="\t" '$4 ~ "/1" {print $1,$2,$3}' test.txt Unfortunately this didnt work, it just pulls out the end coordinate of read 1, when I need the read 2 end coordinate

ADD REPLYlink written 7 months ago by rjobmc0
$ sed 's/\/.*//g' test.txt | datamash -sg 1,4 min 2 max 3 | cut -f1,3,4

Worked perfectly! Thanks so much.

ADD REPLYlink modified 7 months ago • written 7 months ago by rjobmc0
1

btw, cross check if this holds true if your r1 is on - instead of +. rjobmc

ADD REPLYlink written 7 months ago by cpad011211k
1

assumption is that R1 and R2 are consecutive lines in input and if you want to sort by chromosome, add sort -k1,1 at the end.:

$ sed 's/\/.*//g' test.txt | datamash -g1,4 first 2 last 3 | cut -f1,3,4
chrII   404128  404251
chrVII  350990  351093
chrXII  527617  527753
chrVII  826318  826461
chrXII  880431  880574

$ cat test.txt 
chrII   404128  404259  M01269:176:000000000-BW364:1:1101:10000:12221/1 60  -
chrII   404126  404251  M01269:176:000000000-BW364:1:1101:10000:12221/2 60  +
chrVII  350990  351120  M01269:176:000000000-BW364:1:1101:10000:24715/1 60  -
chrVII  350971  351093  M01269:176:000000000-BW364:1:1101:10000:24715/2 60  +
chrXII  527617  527747  M01269:176:000000000-BW364:1:1101:10000:26164/1 60  +
chrXII  527627  527753  M01269:176:000000000-BW364:1:1101:10000:26164/2 60  -
chrVII  826318  826449  M01269:176:000000000-BW364:1:1101:10000:8567/1  60  +
chrVII  826335  826461  M01269:176:000000000-BW364:1:1101:10000:8567/2  60  -
chrXII  880431  880562  M01269:176:000000000-BW364:1:1101:10001:14255/1 60  +
chrXII  880448  880574  M01269:176:000000000-BW364:1:1101:10001:14255/2 60  -

BTW, what would happen to multi mapping reads?

ADD REPLYlink modified 7 months ago • written 7 months ago by cpad011211k

I checked

btw, cross check if this holds true if your r1 is on - instead of +

And it worked perfectly.

BTW, what would happen to multi mapping reads?

This was just an example but your command has worked perfectly for all reads (multi and unique). Thanks again :)

ADD REPLYlink modified 7 months ago • written 7 months ago by rjobmc0

Hello again,

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Upvote|Bookmark|Accept

ADD REPLYlink written 7 months ago by finswimmer11k
1
gravatar for jomo018
7 months ago by
jomo018470
jomo018470 wrote:

Python script will do it, just change ./results and ./inputFile as required. Single reads will output start and end of that single read. Blank lines (as in your input example) are not handled.

with open("./results",'w') as fho:
    with open("./inputFile",'r') as fhi:
        chr,startA,endA,idA,score,strand=fhi.readline().strip().split()
        for line in fhi:
            chr,startB,endB,idB,score,strand=line.strip().split()
            if idA[:-2]==idB[:-2]:
                endA=endB
                continue
            fho.write(chr+"\t"+startA+"\t"+endA+"\n")
            idA=idB
            startA=startB
            endA=endB   
        fho.write(chr+"\t"+startA+"\t"+endA+"\n")
ADD COMMENTlink written 7 months ago by jomo018470
0
gravatar for btsui
7 months ago by
btsui290
UCSD
btsui290 wrote:
cut -d' ' -f1-3 inputfile > outputfile
ADD COMMENTlink modified 7 months ago by finswimmer11k • written 7 months ago by btsui290
cut -d' ' -f1-3 inputfile > outputfile

This did not work

ADD REPLYlink modified 7 months ago by finswimmer11k • written 7 months ago by rjobmc0
0
gravatar for finswimmer
7 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Try this:

$ sort -k4,4 inputfile|paste - -|cut -f1,2,9|sort -k1,1V -k2,2g -k3,3g > outputfile
  • sort -k4,4 inputfile sort by read name. This ensures that read1 is directly followed by read 2
  • paste - - puts two lines in one. Doing so the data for a read pair is in one line
  • cut -f1,2,9 selects the columns 1, 2 and 9 which contains chr, start read1 and end read2
  • sort -k1,1V -k2,2g -k3,3g sort by chrom, start, end

fin swimmer

ADD COMMENTlink modified 7 months ago • written 7 months ago by finswimmer11k
$ sort -k4,4 inputfile|paste - -|cut -f1,2,9 > outputfile

Thanks fin swimmer, but it has not worked. The reads are sorted by reads and not chr, start, end and so they are not in numerical/chromosomal order.

Is there any way to check that the read pair "code" is the same, ie its the /2 of the /1 before extracting the start and end coordinates?

ADD REPLYlink modified 7 months ago • written 7 months ago by rjobmc0

The reads are sorted by reads and not chr, start, end and so they are not in numerical/chromosomal order.

Just add one more sorting step. I edited my answer and also add some comments.

ADD REPLYlink written 7 months ago by finswimmer11k
0
gravatar for Pierre Lindenbaum
7 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html and a BAM sorted on queryname.

ADD COMMENTlink written 7 months ago by Pierre Lindenbaum118k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1341 users visited in the last hour