Remove duplicate lines based on specific columns
0
0
3.7 years ago
dzisis1986 ▴ 40

I ran bedtools intersect like this:

bedtools intersect -a test.bed -b fragments.bed -wa -wb -loj > test.txt


fragments.bed is a file with 3 columns (chr, start, end). test.bed is a file with 4 columns, the last one holding counts (chr, start, end, count).

The result of the intersect looks something like this:

chr19   61030687    61040876    0   -1  -1  0
chr19   61040883    61041418    0   -1  -1  0
chr19   61041425    61041896    0   -1  -1  0
chr19   61041903    61042676    0   -1  -1  0
chr19   61042683    61044693    0   -1  -1  0
chr19   61044700    61045007    0   -1  -1  0
chr19   61045014    61048846    0   -1  -1  0
chr19   61048853    61051147    0   -1  -1  0
chr19   61051154    61051161    0   -1  -1  0
chr19   61051168    61055066    0   -1  -1  0
chr19   61055073    61059079    chr19   61057150    61059534    3
chr19   61059086    61065281    chr19   61057150    61059534    3
chr19   61065288    61065491    0   -1  -1  0
chr19   61065498    61069950    0   -1  -1  0
chr19   61069957    61070313    0   -1  -1  0
chr19   61070320    61071203    0   -1  -1  0
chr19   61071210    61074042    0   -1  -1  0
chr19   61074049    61076962    0   -1  -1  0
chr19   61076969    61078370    0   -1  -1  0
chr19   61078377    61084129    chr19   61079739    61080558    10
chr19   61084136    61085306    0   -1  -1  0
chr19   61110306    61112208    0   -1  -1  0
chr19   61130752    61131999    0   -1  -1  0
chr19   61132006    61139461    0   -1  -1  0
chr19   61139468    61142499    0   -1  -1  0
chr19   61142506    61144492    0   -1  -1  0
chr19   61144499    61144577    0   -1  -1  0
chr19   61144584    61147571    chr19   61146043    61148013    8
chr19   61147578    61147680    chr19   61146043    61148013    8
chr19   61147687    61148346    chr19   61146043    61148013    8
chr19   61148353    61149397    0   -1  -1  0
chr19   61149404    61149653    0   -1  -1  0
chr19   61149660    61150034    0   -1  -1  0


This is not correct because there are duplicate counts that are not in the original file. I would like to filter it in order to get something like this:

chr19   61030687    61040876    0   -1  -1  0
chr19   61040883    61041418    0   -1  -1  0
chr19   61041425    61041896    0   -1  -1  0
chr19   61041903    61042676    0   -1  -1  0
chr19   61042683    61044693    0   -1  -1  0
chr19   61044700    61045007    0   -1  -1  0
chr19   61045014    61048846    0   -1  -1  0
chr19   61048853    61051147    0   -1  -1  0
chr19   61051154    61051161    0   -1  -1  0
chr19   61051168    61055066    0   -1  -1  0
chr19   61055073    61059079    chr19   61057150    61059534    3
chr19   61059086    61065281    0   -1  -1  0
chr19   61065288    61065491    0   -1  -1  0
chr19   61065498    61069950    0   -1  -1  0
chr19   61069957    61070313    0   -1  -1  0
chr19   61070320    61071203    0   -1  -1  0
chr19   61071210    61074042    0   -1  -1  0
chr19   61074049    61076962    0   -1  -1  0
chr19   61076969    61078370    0   -1  -1  0
chr19   61078377    61084129    chr19   61079739    61080558    10
chr19   61084136    61085306    0   -1  -1  0
chr19   61110306    61112208    0   -1  -1  0
chr19   61130752    61131999    0   -1  -1  0
chr19   61132006    61139461    0   -1  -1  0
chr19   61139468    61142499    0   -1  -1  0
chr19   61142506    61144492    0   -1  -1  0
chr19   61144499    61144577    0   -1  -1  0
chr19   61144584    61147571    chr19   61146043    61148013    8
chr19   61147578    61147680    0   -1  -1  0
chr19   61147687    61148346    0   -1  -1  0
chr19   61148353    61149397    0   -1  -1  0
chr19   61149404    61149653    0   -1  -1  0
chr19   61149660    61150034    0   -1  -1  0


Any help doing this in R or Python? Thanks.
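One possible way to get the filtered listing described above in Python is sketched here. It assumes the intersect output is ordered as shown, that "no hit" is written as `0 -1 -1 0` exactly as in the listing, and that the first row carrying a given hit is the one to keep. The function name `blank_repeated_hits` is made up for illustration.

```python
def blank_repeated_hits(rows, null_hit=("0", "-1", "-1", "0")):
    """Keep the first row for each hit (columns 4-7); blank the hit on later rows."""
    seen = set()
    out = []
    for row in rows:
        fragment, hit = tuple(row[:3]), tuple(row[3:7])
        if hit == null_hit or hit not in seen:
            # First time this hit appears (or no hit at all): keep the row as is.
            seen.add(hit)
            out.append(fragment + hit)
        else:
            # Same test.bed record already reported: replace it with the null hit.
            out.append(fragment + null_hit)
    return out


# A few rows copied from the listing above (whitespace-separated).
lines = """chr19 61051168 61055066 0 -1 -1 0
chr19 61055073 61059079 chr19 61057150 61059534 3
chr19 61059086 61065281 chr19 61057150 61059534 3
chr19 61065288 61065491 0 -1 -1 0
chr19 61144584 61147571 chr19 61146043 61148013 8
chr19 61147578 61147680 chr19 61146043 61148013 8
chr19 61147687 61148346 chr19 61146043 61148013 8"""
rows = [line.split() for line in lines.splitlines()]

for row in blank_repeated_hits(rows):
    print("\t".join(row))
```

The first three columns are never modified; only columns 4-7 are reset when the same test.bed record has already been seen.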

r python intersect reads • 4.0k views
1

Do you have any experience in R or Python, so you can try something yourself?

0

Yes, I do. I tried some manipulations, but I can't see how to leave the first 3 columns unfiltered and filter only columns 4-6, while also keeping the rows with 0 and -1 as they are!

1

Please edit your post and show us what you tried, even if it does not work.

1

What do you mean by :

This is not correct because there are duplicate counts that are not in the original file.

chr19   61055073    61059079    chr19   61057150    61059534    3
chr19   61059086    61065281    chr19   61057150    61059534    3


This is correct: the 2 fragments below both overlap the test interval.

61057150 < 61059079 < 61059534 and 61057150 < 61059086 < 61059534

chr19   61055073    61059079
chr19   61059086    61065281


are in fragments.bed

chr19   61057150    61059534    3


is in test.bed
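The overlap check above can be reproduced with a minimal Python sketch. It uses the usual half-open interval test for BED coordinates; the function name `overlaps` and the small fragment list are only illustrative.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


# The test.bed record discussed above (count column omitted).
test_interval = (61057150, 61059534)

# Three consecutive fragments from fragments.bed.
fragments = [
    (61055073, 61059079),
    (61059086, 61065281),
    (61065288, 61065491),
]

# Collect every fragment that overlaps the test interval.
hits = [f for f in fragments if overlaps(*f, *test_interval)]
print(hits)
```

The first two fragments are reported and the third is not, which is why the same test.bed record appears on two lines of the intersect output.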

For this specific result, what do you want as output?

1

I suggest you specify the ultimate goal and optimize the bedtools command. As Bastien has pointed out, there's no error in the bedtools output and I have the feeling that using a different set of bedtools options may get you what you want. In order to help with that, we would need to know what exactly it is you're looking for and how you would decide what the "wrong" output line was.

For example this one -- which line would you keep and why?

chr19   61055073    61059079    chr19   61057150    61059534    3
chr19   61059086    61065281    chr19   61057150    61059534    3
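To make the question concrete, here is one possible decision rule, sketched in Python: keep the fragment that shares the most bases with the test.bed record. This rule is an assumption for illustration only, not something stated in the thread, and the helper names `overlap_length` and `best_fragment` are hypothetical.

```python
def overlap_length(a, b):
    """Number of bases shared by half-open intervals a and b (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def best_fragment(test_interval, fragments):
    """Return the fragment with the largest overlap (assumed rule; first one wins ties)."""
    return max(fragments, key=lambda f: overlap_length(f, test_interval))


test_interval = (61057150, 61059534)  # test.bed record with count 3
fragments = [(61055073, 61059079), (61059086, 61065281)]

for f in fragments:
    print(f, overlap_length(f, test_interval))

best = best_fragment(test_interval, fragments)
print("keep:", best)
```

Other rules (first fragment by position, fragment containing the midpoint, and so on) are just as defensible, so the choice depends on the ultimate goal. If overlap size is the criterion, `bedtools intersect -wo` reports the number of overlapping base pairs directly.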

0

Hello dzisis1986 ,

Please show us the exact command you used and what your input files look like.

Thanks.

fin swimmer

0
bedtools intersect -a test.bed -b fragments.bed -wa -wb -loj > test.txt

0

Could you also paste the first lines of test.bed and fragments.bed please

0

test.bed can be a file like this:

chr19   54453724    54456882    1
chr19   54938190    54939661    2
chr19   56118707    56122392    56
chr19   61057150    61059534    3
chr19   61079739    61080558    10
chr19   61146043    61148013    8


and fragments.bed is a file like this:

chr19   61055073    61059079
chr19   61059086    61065281
chr19   61065288    61065491
chr19   61065498    61069950
chr19   61069957    61070313
chr19   61070320    61071203
chr19   61071210    61074042
chr19   61074049    61076962
chr19   61076969    61078370
chr19   61078377    61084129
chr19   61084136    61085306
chr19   61085313    61087770
chr19   61087777    61088015
chr19   61088022    61089802
chr19   61089809    61094605
chr19   61094612    61100478
chr19   61100485    61100872

0

Probably because a single record from test.bed overlaps two ranges in fragments.bed. For example, the record chr19 61057150 61059534 3 (a single line in test.bed) overlaps two ranges in fragments.bed (two lines: chr19 61055073 61059079 and chr19 61059086 61065281). Hence you see lines like these, which look like duplicates but in fact are not:

chr19   61057150    61059534    3   chr19   61055073    61059079
chr19   61057150    61059534    3   chr19   61059086    61065281


or I got your issue wrong.

0

Yes, this is the reason, but I don't want to have this duplicate. That's why I want to remove it manually after the intersect.

0

What is the expected output, dzisis1986? Like this:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | datamash -sg1,2,3 unique 4
chr19   54453724    54456882    1
chr19   54938190    54939661    2
chr19   56118707    56122392    56
chr19   61057150    61059534    3
chr19   61079739    61080558    10
chr19   61146043    61148013    8

instead of:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | cut -f1-4
chr19   54453724    54456882    1
chr19   54938190    54939661    2
chr19   56118707    56122392    56
chr19   61057150    61059534    3
chr19   61057150    61059534    3
chr19   61079739    61080558    10
chr19   61146043    61148013    8


If you don't want to lose the information and at the same time want to remove the (supposedly) duplicate lines, you can do this:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | datamash -sg1,2,3 unique 4 collapse 5,6,7
chr19   54453724    54456882    1   .   -1  -1
chr19   54938190    54939661    2   .   -1  -1
chr19   56118707    56122392    56  .   -1  -1
chr19   61057150    61059534    3   chr19,chr19 61055073,61059086   61059079,61065281
chr19   61079739    61080558    10  chr19   61078377    61084129
chr19   61146043    61148013    8   .   -1  -1

1

I would go with this in case you need to reconstruct the ranges in future:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | awk -v OFS="\t" '{print $1,$2,$3,$4,$5":"$6"-"$7}' | datamash -sg 1,2,3 unique 4 collapse 5
chr19   54453724    54456882    1   .:-1--1
chr19   54938190    54939661    2   .:-1--1
chr19   56118707    56122392    56  .:-1--1
chr19   61057150    61059534    3   chr19:61055073-61059079,chr19:61059086-61065281
chr19   61079739    61080558    10  chr19:61078377-61084129
chr19   61146043    61148013    8   .:-1--1

If, for any reason, you want to count how many ranges each record overlaps, you can use the following code:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | awk -v OFS="\t" '{print $1,$2,$3,$4,$5":"$6"-"$7}' | datamash -sg 1,2,3 unique 4 count 5 collapse 5

chr19   54453724    54456882    1   1   .:-1--1
chr19   54938190    54939661    2   1   .:-1--1
chr19   56118707    56122392    56  1   .:-1--1
chr19   61057150    61059534    3   2   chr19:61055073-61059079,chr19:61059086-61065281
chr19   61079739    61080558    10  1   chr19:61078377-61084129
chr19   61146043    61148013    8   1   .:-1--1
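Since the question asked for Python, the same grouping can be sketched without datamash. This is a rough stand-in under stated assumptions: rows come from the `-wa -wb -loj` command above (4 test.bed columns followed by 3 fragment columns), groups are keyed on the first four columns rather than three, and input order is kept instead of sorting. The name `collapse_hits` is hypothetical.

```python
def collapse_hits(rows):
    """Group rows by the test.bed record and collapse its fragments into one field."""
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for chrom, start, end, count, f_chrom, f_start, f_end in rows:
        key = (chrom, start, end, count)
        # Encode each fragment as chrom:start-end so it can be reconstructed later.
        groups.setdefault(key, []).append(f"{f_chrom}:{f_start}-{f_end}")
    # One output row per record: record fields, number of rows, collapsed fragments.
    return [key + (str(len(hits)), ",".join(hits)) for key, hits in groups.items()]


# A few rows in the shape produced by the intersect command above.
lines = """chr19 56118707 56122392 56 . -1 -1
chr19 61057150 61059534 3 chr19 61055073 61059079
chr19 61057150 61059534 3 chr19 61059086 61065281
chr19 61079739 61080558 10 chr19 61078377 61084129"""
rows = [line.split() for line in lines.splitlines()]

collapsed = collapse_hits(rows)
for row in collapsed:
    print("\t".join(row))
```

As with `count 5` in datamash, a record with no overlap still gets a count of 1, because the `.:-1--1` placeholder row is counted like any other.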


ps: posted as a subpost as the previous post was getting bigger and confusing.

0

Do you want a single fragment in each test bin?

If you do something like this:

chr19   61057150    61059534    3   chr19   61055073    61059079
chr19   61057150    61059534    3   0   -1    -1


chr19   61059086    61065281

0

If you just want to remove duplicate lines, sort followed by uniq will do the job on Linux.

Eg.

cat input_file | sort | uniq > output_file

0

Sorry, I misunderstood the question.