Question: Remove duplicate lines based on specific columns
4 months ago, dzisis1986 wrote:

I used bedtools intersect like this:

bedtools intersect -a test.bed -b fragments.bed -wa -wb -loj > test.txt

fragments.bed is a file with 3 columns: chr, start, end. test.bed is a file with 4 columns, the last one holding counts: chr, start, end, count.

The result of the intersect looks something like this:

chr19   61030687    61040876    0   -1  -1  0
chr19   61040883    61041418    0   -1  -1  0
chr19   61041425    61041896    0   -1  -1  0
chr19   61041903    61042676    0   -1  -1  0
chr19   61042683    61044693    0   -1  -1  0
chr19   61044700    61045007    0   -1  -1  0
chr19   61045014    61048846    0   -1  -1  0
chr19   61048853    61051147    0   -1  -1  0
chr19   61051154    61051161    0   -1  -1  0
chr19   61051168    61055066    0   -1  -1  0
chr19   61055073    61059079    chr19   61057150    61059534    3
chr19   61059086    61065281    chr19   61057150    61059534    3
chr19   61065288    61065491    0   -1  -1  0
chr19   61065498    61069950    0   -1  -1  0
chr19   61069957    61070313    0   -1  -1  0
chr19   61070320    61071203    0   -1  -1  0
chr19   61071210    61074042    0   -1  -1  0
chr19   61074049    61076962    0   -1  -1  0
chr19   61076969    61078370    0   -1  -1  0
chr19   61078377    61084129    chr19   61079739    61080558    10
chr19   61084136    61085306    0   -1  -1  0
chr19   61110306    61112208    0   -1  -1  0
chr19   61130752    61131999    0   -1  -1  0
chr19   61132006    61139461    0   -1  -1  0
chr19   61139468    61142499    0   -1  -1  0
chr19   61142506    61144492    0   -1  -1  0
chr19   61144499    61144577    0   -1  -1  0
chr19   61144584    61147571    chr19   61146043    61148013    8
chr19   61147578    61147680    chr19   61146043    61148013    8
chr19   61147687    61148346    chr19   61146043    61148013    8
chr19   61148353    61149397    0   -1  -1  0
chr19   61149404    61149653    0   -1  -1  0
chr19   61149660    61150034    0   -1  -1  0

This is not what I want, because there are duplicate counts that are not in the original file. I would like to filter the output to get something like this:

    chr19   61030687    61040876    0   -1  -1  0
    chr19   61040883    61041418    0   -1  -1  0
    chr19   61041425    61041896    0   -1  -1  0
    chr19   61041903    61042676    0   -1  -1  0
    chr19   61042683    61044693    0   -1  -1  0
    chr19   61044700    61045007    0   -1  -1  0
    chr19   61045014    61048846    0   -1  -1  0
    chr19   61048853    61051147    0   -1  -1  0
    chr19   61051154    61051161    0   -1  -1  0
    chr19   61051168    61055066    0   -1  -1  0
    chr19   61055073    61059079    chr19   61057150    61059534    3
    chr19   61059086    61065281    0   -1  -1  0
    chr19   61065288    61065491    0   -1  -1  0
    chr19   61065498    61069950    0   -1  -1  0
    chr19   61069957    61070313    0   -1  -1  0
    chr19   61070320    61071203    0   -1  -1  0
    chr19   61071210    61074042    0   -1  -1  0
    chr19   61074049    61076962    0   -1  -1  0
    chr19   61076969    61078370    0   -1  -1  0
    chr19   61078377    61084129    chr19   61079739    61080558    10
    chr19   61084136    61085306    0   -1  -1  0
    chr19   61110306    61112208    0   -1  -1  0
    chr19   61130752    61131999    0   -1  -1  0
    chr19   61132006    61139461    0   -1  -1  0
    chr19   61139468    61142499    0   -1  -1  0
    chr19   61142506    61144492    0   -1  -1  0
    chr19   61144499    61144577    0   -1  -1  0
    chr19   61144584    61147571    chr19   61146043    61148013    8
    chr19   61147578    61147680    0   -1  -1  0
    chr19   61147687    61148346    0   -1  -1  0
    chr19   61148353    61149397    0   -1  -1  0
    chr19   61149404    61149653    0   -1  -1  0
    chr19   61149660    61150034    0   -1  -1  0

Any help doing this in R or Python? Thanks.
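
A minimal Python sketch of the filtering described above, under stated assumptions: the intersect output is tab-separated with the fragment in columns 1-3 and the test.bed record in columns 4-7, a row without a hit has -1 in column 5, and the placeholder `0 -1 -1 0` is copied from the sample output. The function name `blank_repeated_hits` is hypothetical. It keeps the first row that reports a given test.bed record and blanks that record on every later row.

```python
# Placeholder for "no hit", copied from the sample output (an assumption).
NO_HIT = ["0", "-1", "-1", "0"]


def blank_repeated_hits(lines):
    """Keep the first row reporting each test.bed record; blank later repeats."""
    seen = set()
    filtered = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        fragment, hit = fields[:3], fields[3:7]
        # A start of -1 marks a fragment with no overlapping record.
        if hit[1] != "-1":
            key = tuple(hit)
            if key in seen:
                hit = NO_HIT  # record already reported on an earlier row
            else:
                seen.add(key)
        filtered.append("\t".join(fragment + hit))
    return filtered


# Four rows taken from the intersect output shown in the question.
sample = [
    "chr19\t61051168\t61055066\t0\t-1\t-1\t0",
    "chr19\t61055073\t61059079\tchr19\t61057150\t61059534\t3",
    "chr19\t61059086\t61065281\tchr19\t61057150\t61059534\t3",
    "chr19\t61065288\t61065491\t0\t-1\t-1\t0",
]
for row in blank_repeated_hits(sample):
    print(row)
```

Blanking the repeats discards the fact that the later fragments also overlap the record, so it may be worth keeping the unfiltered file alongside the filtered one.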

Tags: python, intersect, R, reads
modified 4 months ago by RamRS • written 4 months ago by dzisis1986

Do you have any experience in R or Python to try something by yourself?

written 4 months ago by Bastien Hervé

Yes, I do. I tried some manipulations, but I can't see how to leave the first 3 columns untouched and filter only columns 4-6, while also keeping the rows with 0 and -1 as they are!

written 4 months ago by dzisis1986

Edit your post and show us what you did, please, even if it does not work.

written 4 months ago by Bastien Hervé

What do you mean by:

This is not correct because there are duplicate counts that are not in the original file.


chr19   61055073    61059079    chr19   61057150    61059534    3
chr19   61059086    61065281    chr19   61057150    61059534    3

This is correct: the 2 fragments below both overlap the test record.

61057150 < 61059079 < 61059534 and 61057150 < 61059086 < 61059534

chr19   61055073    61059079
chr19   61059086    61065281

are in fragments.bed

chr19   61057150    61059534    3

is in test.bed

For this specific result, what do you want as output?
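
The comparison above can be sketched in Python. This assumes both intervals are on the same chromosome and uses the half-open BED convention (start included, end excluded); `overlaps` is an illustrative helper, not a bedtools function.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


# The test.bed record and three fragments from the thread (all on chr19).
test_record = (61057150, 61059534)
fragments = [
    (61055073, 61059079),
    (61059086, 61065281),
    (61065288, 61065491),
]

# Collect every fragment that overlaps the test record.
hits = [frag for frag in fragments if overlaps(*frag, *test_record)]
print(hits)
```

Both of the first two fragments pass the check, which is why the record with count 3 appears on two output lines.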

modified 4 months ago • written 4 months ago by Bastien Hervé

I suggest you specify the ultimate goal and optimize the bedtools command. As Bastien has pointed out, there's no error in the bedtools output and I have the feeling that using a different set of bedtools options may get you what you want. In order to help with that, we would need to know what exactly it is you're looking for and how you would decide what the "wrong" output line was.

For example this one -- which line would you keep and why?

chr19   61055073    61059079    chr19   61057150    61059534    3
chr19   61059086    61065281    chr19   61057150    61059534    3
modified 4 months ago • written 4 months ago by Friederike

Hello dzisis1986,

please show us the exact command you used and what your input files look like.

Thanks.

fin swimmer

written 4 months ago by finswimmer
bedtools intersect -a test.bed -b fragments.bed -wa -wb -loj > test.txt
modified 4 months ago • written 4 months ago by dzisis1986

Could you also paste the first lines of test.bed and fragments.bed, please?

written 4 months ago by Bastien Hervé

test.bed looks like this:

chr19   54453724    54456882    1
chr19   54938190    54939661    2
chr19   56118707    56122392    56
chr19   61057150    61059534    3
chr19   61079739    61080558    10
chr19   61146043    61148013    8

and fragments.bed looks like this:

chr19   61055073    61059079
chr19   61059086    61065281
chr19   61065288    61065491
chr19   61065498    61069950
chr19   61069957    61070313
chr19   61070320    61071203
chr19   61071210    61074042
chr19   61074049    61076962
chr19   61076969    61078370
chr19   61078377    61084129
chr19   61084136    61085306
chr19   61085313    61087770
chr19   61087777    61088015
chr19   61088022    61089802
chr19   61089809    61094605
chr19   61094612    61100478
chr19   61100485    61100872
written 4 months ago by dzisis1986

This is probably because a single record from test.bed overlaps two ranges in fragments.bed. For example, the record chr19 61057150 61059534 3 from test.bed (a single line) overlaps two ranges in fragments.bed (two lines: chr19 61055073 61059079 and chr19 61059086 61065281). Hence you see lines like these, which look like duplicates but are not:

chr19   61057150    61059534    3   chr19   61055073    61059079
chr19   61057150    61059534    3   chr19   61059086    61065281

Or I misunderstood your issue.

modified 4 months ago • written 4 months ago by cpad0112

Yes, this is the reason, but I don't want these duplicates. That is why I want to remove them manually after the intersect.

written 4 months ago by dzisis1986

What is the expected output, dzisis1986? Like this:

 $ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | datamash -sg1,2,3 unique 4

chr19   54453724    54456882    1
chr19   54938190    54939661    2
chr19   56118707    56122392    56
chr19   61057150    61059534    3
chr19   61079739    61080558    10
chr19   61146043    61148013    8

instead of:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | cut -f1-4
chr19   54453724    54456882    1
chr19   54938190    54939661    2
chr19   56118707    56122392    56
chr19   61057150    61059534    3
chr19   61057150    61059534    3
chr19   61079739    61080558    10
chr19   61146043    61148013    8

If you don't want to lose the information but still want to remove the apparent duplicate lines, you can do this:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | datamash -sg1,2,3 unique 4 collapse 5,6,7

chr19   54453724    54456882    1   .   -1  -1
chr19   54938190    54939661    2   .   -1  -1
chr19   56118707    56122392    56  .   -1  -1
chr19   61057150    61059534    3   chr19,chr19 61055073,61059086   61059079,61065281
chr19   61079739    61080558    10  chr19   61078377    61084129
chr19   61146043    61148013    8   .   -1  -1
modified 4 months ago • written 4 months ago by cpad0112

I would go with this, in case you need to reconstruct the ranges in the future:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | awk -v OFS="\t" '{print $1,$2,$3,$4,$5":"$6"-"$7}' | datamash -sg 1,2,3 unique 4 collapse 5 

chr19   54453724    54456882    1   .:-1--1
chr19   54938190    54939661    2   .:-1--1
chr19   56118707    56122392    56  .:-1--1
chr19   61057150    61059534    3   chr19:61055073-61059079,chr19:61059086-61065281
chr19   61079739    61080558    10  chr19:61078377-61084129
chr19   61146043    61148013    8   .:-1--1

If, for any reason, you want to count how many ranges each record overlaps, you can use the following code:

$ bedtools intersect -a a.bed -b b.bed -wa -wb -loj | awk -v OFS="\t" '{print $1,$2,$3,$4,$5":"$6"-"$7}' | datamash -sg 1,2,3 unique 4 count 5 collapse 5 

chr19   54453724    54456882    1   1   .:-1--1
chr19   54938190    54939661    2   1   .:-1--1
chr19   56118707    56122392    56  1   .:-1--1
chr19   61057150    61059534    3   2   chr19:61055073-61059079,chr19:61059086-61065281
chr19   61079739    61080558    10  1   chr19:61078377-61084129
chr19   61146043    61148013    8   1   .:-1--1
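
For readers who prefer Python to datamash, a rough equivalent of the grouping step is sketched below. It assumes tab- or space-separated input with the test.bed record in columns 1-4 and the fragment in columns 5-7, and that each record carries a single count, so the count is used as part of the group key rather than passed through `unique`. `collapse_hits` is a hypothetical name, and unlike `datamash -s` it keeps the input order instead of sorting.

```python
def collapse_hits(lines):
    """Group rows by test.bed record; count and join the overlapping fragments."""
    groups = {}  # dicts keep insertion order, so records stay in input order
    for line in lines:
        chrom, start, end, count, f_chrom, f_start, f_end = line.split()[:7]
        key = (chrom, start, end, count)
        groups.setdefault(key, []).append(f"{f_chrom}:{f_start}-{f_end}")
    return [
        "\t".join(key + (str(len(hits)), ",".join(hits)))
        for key, hits in groups.items()
    ]


# Rows in the layout produced by the intersect command used in this answer.
sample = [
    "chr19\t61057150\t61059534\t3\tchr19\t61055073\t61059079",
    "chr19\t61057150\t61059534\t3\tchr19\t61059086\t61065281",
    "chr19\t61079739\t61080558\t10\tchr19\t61078377\t61084129",
    "chr19\t61146043\t61148013\t8\t.\t-1\t-1",
]
for row in collapse_hits(sample):
    print(row)
```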

PS: posted as a separate reply because the previous post was getting long and confusing.

modified 4 months ago • written 4 months ago by cpad0112

Do you want a single fragment in each test bin?

If you do something like this:

chr19   61057150    61059534    3   chr19   61055073    61059079
chr19   61057150    61059534    3   0   -1    -1

You will lose the information about this fragment:

chr19   61059086    61065281
written 4 months ago by Bastien Hervé

If you just want to remove duplicates, sort followed by uniq will do the job on Linux.

E.g.

sort input_file | uniq > output_file
modified 4 months ago by RamRS • written 4 months ago by karthic100

Sorry, I misunderstood the question.

written 4 months ago by karthic100