4.1 years ago
star ▴ 330
I would like to intersect two files using bedtools intersect. The first file is the output of MACS2 (narrowPeak), which I converted to BED format using
cat A.narrowPeak | sort -k1,1 -k2,2n | cut -f 1-4,7,9 | sed -n '/^[0-9,X]/Ip' | sed 's/^/chr/' > A.bed
and the second file is a CSV file containing genome coordinates, which I saved in BED format using
write.table(b, file="/path/b.bed", quote=F, sep="\t", row.names=F, col.names=F)
then sorted it using
sort -k1,1 -k2,2n b.bed > sorted_b.bed
Then I ran the intersect using
bedtools intersect -a /path/a.bed -b /path/sorted_b.bed -wao -f 0.8 > a_b_intersect.bed
but I got this error:
***** ERROR: illegal character ' ' found in integer conversion of string "10966144 ". Exiting.
and my output contains only chromosome 1.
The first three lines of each original file:
A.narrowPeak:
1 1624245 1624472 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_1 44 . 4.68168 8.44714 4.48990 36
1 2143559 2143864 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_3 72 . 4.36351 11.59222 7.25891 182
1 2144136 2145751 GSM1554660_SRR5082154_7pcw_H3K27ac_rep1_narrow_peak_4 165 . 8.28860 21.84743 16.59296 367
CSV file:
chr start end
chr16 86430087 86430726
chr16 80372593 80373755
chr16 78510608 78511944
There is a white-space after the position 10966144.
Without knowing what your csv looks like, it's hard to say where this comes from.
Thanks for your reply, I updated my post. But I do not have a position 10966144. A.narrowPeak contains 12999 peaks and a.csv contains 1555091 positions.
What is grep telling you here?
You can simply remove all white-spaces in a file with
Or if you want to overwrite the original file:
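A minimal sketch of both variants, assuming a tab-separated file with the hypothetical name demo.bed. Only literal spaces are deleted, so the tabs between the columns survive:

```shell
# Hypothetical demo input: the start coordinate carries a trailing space.
printf 'chr16\t10966144 \t10966800\n' > demo.bed

# Variant 1: write a cleaned copy and leave the original untouched.
sed 's/ //g' demo.bed > demo.nospace.bed

# Variant 2: overwrite the original file in place.
# The -i.bak form keeps a backup (demo.bed.bak) and is accepted by GNU and BSD sed.
sed -i.bak 's/ //g' demo.bed
```

If a column can legitimately contain spaces (for example a free-text name field), restrict the substitution to the coordinate columns instead of the whole line.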
Thanks, it worked. But now I get this error:
***** ERROR: too many digits/characters for integer conversion in string . Exiting...
According to this post, this error appears if you have duplicate entries in one of your files. Sort your files again after you have removed the white-spaces, but this time use -u to remove duplicates:
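Something along these lines should do it; this is a sketch with hypothetical file names. Keep in mind that when -u is combined with -k, sort decides what counts as a duplicate from the sort keys only, so the end coordinate is added as a third key here. Records that share chromosome, start, and end are still collapsed even if later columns differ:

```shell
# Hypothetical demo input with one exact duplicate and two intervals sharing a start.
printf 'chr1\t200\t300\nchr1\t100\t150\nchr1\t200\t300\nchr1\t100\t180\n' > demo_dups.bed

# Sort by chromosome, then numerically by start and end; -u drops key-identical lines.
sort -k1,1 -k2,2n -k3,3n -u demo_dups.bed > demo_sorted.bed

cat demo_sorted.bed
```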
Thanks @finswimmer, I did it but I still get the same error.
This explanation makes it clearer what's going on here.
Your bed files are malformed. There are lines where the second or third column doesn't contain valid coordinates. This will give you the lines where the second or third column doesn't consist of one or more digits:
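A sketch of such a check with awk, assuming tab-separated columns; the file names and the demo records are hypothetical:

```shell
# Hypothetical demo input: line 2 has a start in scientific notation,
# line 3 has a trailing space after the start coordinate.
printf 'chr1\t100\t200\nchr1\t1.00E+06\t2000000\nchr2\t300 \t400\n' > demo_check.bed

# Print every line whose 2nd or 3rd column is not made up of digits only.
awk -F'\t' '$2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/' demo_check.bed > demo_malformed.txt

cat demo_malformed.txt
```

An empty result means both coordinate columns are purely numeric on every line.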
Thanks. I did it and there is some invalid data in one of my files.
How can I ignore those lines?
Did you open and save this file in Excel?
Yes, unfortunately. I have an Excel file, and I think it changed everything.
Thanks for your help, I was able to fix it. Now I can run intersect on some data without any error, but for some files I get
Error: Sorted input specified, but the file /path/sorted_a.bed has the following out of order record chr10 1000824 1003242 sun_2016
OK, let's do some more awk-voodoo.
The following code will check whether there are white-spaces in the sequence name and whether the start and end positions contain only numbers. Lines that are not valid will be written to bad.bed. All others go to good.bed, which is sorted and has its duplicates removed:
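A sketch of how this could look, assuming tab-separated input; demo_input.bed and its records are hypothetical, while bad.bed and good.bed are the output names described above:

```shell
# Hypothetical demo input: a space in the sequence name, a non-numeric start,
# an exact duplicate, and records that are out of order.
printf 'chr1\t300\t400\tpeak_b\nchr 1\t100\t200\tpeak_x\nchr1\t100\t200\tpeak_a\nchr1\t1e5\t200000\tpeak_y\nchr1\t100\t200\tpeak_a\n' > demo_input.bed

awk -F'\t' '
  # Invalid: a space in the sequence name, or start or end not digits only.
  $1 ~ / / || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { print > "bad.bed"; next }
  # Valid lines are passed on to sort.
  { print }
' demo_input.bed | sort -k1,1 -k2,2n -k3,3n -u > good.bed
```

Note that bad.bed is only created when at least one invalid line is found, and that -u treats records with identical chromosome, start, and end as duplicates.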
It's not the line number "position"; it's one of your coordinates.
Run grep "10966144 " on your two bed files.
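For instance, a sketch with hypothetical file names and demo records; -n prints the line number of each hit, and giving both files makes grep prefix each hit with the file it came from:

```shell
# Hypothetical demo files: only the second one contains the coordinate
# followed by a space.
printf 'chr1\t1624245\t1624472\n' > demo_a.bed
printf 'chr16\t10966144 \t10966800\n' > demo_b.bed

# Search both files for the coordinate with the trailing space.
grep -n "10966144 " demo_a.bed demo_b.bed
```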
The error message is quite clear, no? Try to read them carefully, they often tell you exactly what is wrong.