Question: bedtools intersect error?
star130 wrote:

I would like to intersect two files using bedtools intersect. The first file is MACS2 output (narrowPeak), which I converted to BED format using

cat A.narrowPeak | sort -k1,1 -k2,2n | cut -f 1-4,7,9 | sed -n '/^[0-9,X]/Ip' | sed 's/^/chr/' > A.bed

and the second file is a CSV file containing genome coordinates, which I saved in BED format using

write.table(b,  file="/path/b.bed", quote=F, sep="\t", row.names=F, col.names=F)

then sorted it using

sort -k1,1 -k2,2n  b.bed > sorted_b.bed

Then I ran the intersect using

 bedtools intersect -a /path/a.bed -b /path/sorted_b.bed -wao -f 0.8 > a_b_intersect.bed

but I got ***** ERROR: illegal character ' ' found in integer conversion of string "10966144 ". Exiting. and my output contains only chromosome 1.

The first three lines of each original file:

A.narrowpeak :

1   1624245 1624472 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_1   44  .   4.68168 8.44714 4.48990 36

 1  2143559 2143864 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_3   72  .   4.36351 11.59222    7.25891 182

1   2144136 2145751 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_4   165 .   8.28860 21.84743    16.59296    367

b.csv

chr start   end

chr16   86430087    86430726

chr16   80372593    80373755

chr16   78510608    78511944
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by star130
1

There is a whitespace character after the position 10966144.

Without knowing what A.narrowPeak or your CSV looks like, it's hard to say where it comes from.

ADD REPLYlink written 4 weeks ago by finswimmer11k

Thanks for your reply, I updated my post. But I do not have position 10966144. A.narrowpeak contains 12999 peaks and a.csv contains 1555091 positions.

ADD REPLYlink written 4 weeks ago by star130
3

but I do not have position 10966144.

What does grep tell you here?

$ grep "10966144" /path/a.bed
$ grep "10966144" /path/sorted_b.bed

You can simply remove all space characters from a file with sed (the tab separators are left in place):

$ sed 's/ //g' input > output

Or if you want to overwrite the original file:

$ sed -i 's/ //g' input
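
To see which lines are affected before changing anything, something like the following could help. It is only a sketch on a made-up three-line file (demo.bed and its contents are hypothetical, not taken from the files in this thread):

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Made-up tab-separated demo file; line 2 has a trailing space after the end coordinate.
printf 'chr1\t100\t200\n'             >  demo.bed
printf 'chr1\t10966000\t10966144 \n'  >> demo.bed
printf 'chr2\t300\t400\n'             >> demo.bed

# List (with line numbers) every line that contains a space character.
grep -n ' ' demo.bed > with_spaces.txt
cat with_spaces.txt

# Remove every space character; tabs are not touched.
sed 's/ //g' demo.bed > demo.clean.bed

# Count the lines that still contain a space after cleaning.
# grep -c exits non-zero when the count is 0, hence the "|| true".
remaining=$(grep -c ' ' demo.clean.bed || true)
echo "lines with spaces after cleaning: $remaining"
```

Writing to a new file first (rather than using -i) makes it easy to compare before and after.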
ADD REPLYlink written 4 weeks ago by finswimmer11k

Thanks, it worked. But now I get this error: ***** ERROR: too many digits/characters for integer conversion in string . Exiting...

ADD REPLYlink written 4 weeks ago by star130
2

According to this post, this error appears if you have duplicate entries in one of your files. Sort your files again after you have removed the whitespace, but this time use -u to remove duplicates:

$ sort -u -k1,1 -k2,2n input > output
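
One thing to keep in mind: when keys are given, sort -u treats lines as duplicates if the keys compare equal, not the whole line. A small sketch on made-up data (dups.bed is a hypothetical file) shows the difference between keying on chromosome and start only and also keying on the end column:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir"

# Made-up intervals: two share chromosome and start but differ in end,
# and one line is an exact duplicate of the first.
printf 'chr1\t100\t200\n'  >  dups.bed
printf 'chr1\t100\t300\n'  >> dups.bed
printf 'chr1\t100\t200\n'  >> dups.bed

# Keys on columns 1 and 2 only: lines with the same chromosome and start
# count as duplicates, so a distinct interval is dropped as well.
sort -u -k1,1 -k2,2n dups.bed > by_start.bed

# Adding column 3 as a key keeps intervals that differ only in their end.
sort -u -k1,1 -k2,2n -k3,3n dups.bed > by_start_end.bed

echo "keyed on chr,start:     $(awk 'END{print NR}' by_start.bed) line(s)"
echo "keyed on chr,start,end: $(awk 'END{print NR}' by_start_end.bed) line(s)"
```

If intervals with the same start but different ends should be kept, including the third column as a key is the safer choice.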
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by finswimmer11k

Thanks @finswimmer, I did it, but I still get the same error.

ADD REPLYlink written 4 weeks ago by star130
1

This explanation makes it clearer what's going on here.

Your BED files are malformed. There are lines where the second or third column doesn't contain valid coordinates. This will give you the lines where the second or third column doesn't consist of one or more digits:

$ awk -v FS="\t" '$2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/' a.bed
$ awk -v FS="\t" '$2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/' b.bed
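
As an illustration of what that check catches, here is a sketch on a made-up file (demo.bed is hypothetical). Printing NR alongside the record is an addition to the command above so the offending line numbers are visible:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir"

# Made-up demo: line 2 has a date-like start, line 3 has an empty start field.
printf 'chr16\t86430087\t86430726\n'  >  demo.bed
printf 'chr22\t9-Jan\t42462950\n'     >> demo.bed
printf 'chr2\t\t242335250\n'          >> demo.bed
printf 'chr5\t100\t200\n'             >> demo.bed

# Report every record whose start or end column is not made up purely of digits.
awk -v FS="\t" '$2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { print NR ": " $0 }' demo.bed > invalid.txt
cat invalid.txt
```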
ADD REPLYlink written 4 weeks ago by finswimmer11k

Thanks. I did it, and there are some malformed entries in one of my files.

chr2        242335250   
chr22   9-Jan   42462950    
chr5    15-Feb  132225350
chr7    20-Mar  35932850    
chr7    23-Feb  35917850

How can I ignore them?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by star130
3

Did you open and save this file in Excel?

ADD REPLYlink written 4 weeks ago by WouterDeCoster37k

Yes, unfortunately. I have an Excel file, and I think it changed everything.

ADD REPLYlink written 4 weeks ago by star130

Thanks for your help, I was able to fix it. Now I can run the intersect on some of the data without any error, but for some files I get Error: Sorted input specified, but the file /path/sorted_a.bed has the following out of order record chr10 1000824 1003242 sun_2016
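
A possible way to check a file against the chromosome-then-start ordering used earlier in this thread is sort -c, which only tests the order and writes nothing. This is a sketch on made-up data (demo.bed is hypothetical). Setting LC_ALL=C is an assumption here: it forces byte-wise comparison so that both files are ordered the same way regardless of locale.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir"

# Made-up, deliberately unsorted demo file.
printf 'chr2\t100\t200\n'          >  demo.bed
printf 'chr10\t1000824\t1003242\n' >> demo.bed
printf 'chr1\t50\t60\n'            >> demo.bed

# sort -c exits non-zero if the input is not already in the requested order.
if LC_ALL=C sort -c -k1,1 -k2,2n demo.bed 2>/dev/null; then
  before=sorted
else
  before=unsorted
fi

# Re-sort with the same keys and the same locale, then check again.
LC_ALL=C sort -k1,1 -k2,2n demo.bed > demo.sorted.bed
if LC_ALL=C sort -c -k1,1 -k2,2n demo.sorted.bed 2>/dev/null; then
  after=sorted
else
  after=unsorted
fi

echo "before: $before, after: $after"
```

Sorting both input files with the identical command avoids one file being ordered differently from the other.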

ADD REPLYlink written 4 weeks ago by star130
1

OK, let's do some more awk-voodoo.

The following code checks whether there are spaces in the sequence name and whether the start and end positions contain only digits. Lines that are not valid are written to bad.bed. All others go to good.bed, sorted and with duplicates removed:

$ awk -v FS="\t" '$1 ~ / / || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { print $0 > "bad.bed"; next; } {print $0|"sort -u -k1,1 -k2,2n -k3,3n > good.bed"}' input.bed
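
For illustration, the same command run on a small made-up input.bed (the contents below are hypothetical) splits the records like this:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d) && cd "$workdir"

# Made-up input: one Excel-style date in the start column, one exact duplicate.
printf 'chr7\t20-Mar\t35932850\n'  >  input.bed
printf 'chr2\t500\t600\n'          >> input.bed
printf 'chr1\t300\t400\n'          >> input.bed
printf 'chr1\t300\t400\n'          >> input.bed
printf 'chr1\t100\t200\n'          >> input.bed

# Invalid records go to bad.bed; valid ones are piped through sort into good.bed.
awk -v FS="\t" '
  $1 ~ / / || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { print $0 > "bad.bed"; next }
  { print $0 | "sort -u -k1,1 -k2,2n -k3,3n > good.bed" }
' input.bed

echo "good.bed:"; cat good.bed
echo "bad.bed:";  cat bad.bed
```

It is worth looking through bad.bed afterwards, since rows mangled by a spreadsheet program usually have to be recovered from the original data rather than just dropped.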
ADD REPLYlink written 4 weeks ago by finswimmer11k
1

It's not the line number "position"; it's one of your coordinates.

Just use grep "10966144 " on your two bed files.

ADD REPLYlink written 4 weeks ago by michael.ante3.2k

The error message is quite clear, no? Try to read them carefully, they often tell you exactly what is wrong.

ADD REPLYlink written 4 weeks ago by WouterDeCoster37k