error reading bed files for bedtools intersect
4.3 years ago
rthapa ▴ 90

Hi, I am trying to use bedtools intersect to find the gene regions that overlap with the corresponding SNPs. I have two files: a SNP file and a gene annotation file. I get the following error when running bedtools intersect. I do not see an extra tab at the end of the lines, and I am not able to find the problem. I would appreciate any suggestions. Thanks!

Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"

cat -t genesgff1.bed|head

chr^Istart^Iend^Igenes
chr1^I1951^I2616^IS.001G000100
chr1^I11180^I14899^IS.001G000200
chr1^I23399^I24152^IS.001G000300
chr1^I22391^I42443^IS.001G000400
chr1^I52891^I53594^IS.001G000501
chr1^I53781^I63305^IS.001G000700
chr1^I62892^I69306^IS.001G000800
chr1^I79159^I81636^IS.001G000900
chr1^I81932^I83350^IS.001G001000

cat -t SNPs.bed|head

chr^Ipos^Ieffect^ISNP
chr9^I57068854^I0.355187213^IS1
chr9^I57068854^I21.59969981^IS2
chr9^I57068854^I0.326924349^IS3
chr6^I13897772^I^I3.351266271^IS4
chr9^I57068854^I18.61550849^IS5
chr2^I2244737^I1.158934285^IS6
chr9^I57068854^I26.81277167^IS7
chr2^I2244737^I5.017257342^IS8
chr9^I57054157^I26.5431411^IS9

Remove the header line: chr^Istart^Iend^Igenes

What is the output of

cat genesgff1.bed SNPs.bed| awk -F '\t' '{print NF;}' | uniq | sort | uniq
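Removing the header can be sketched as follows. This is only an illustration on a made-up two-line file; the file names `genes_with_header.bed` and `genes_noheader.bed` are hypothetical, and the rule "line 1 is a header if its second field is not a number" is an assumption that fits the files shown in the question.

```shell
# Scratch directory with a tiny stand-in for the gene file, header included.
workdir=$(mktemp -d)
printf 'chr\tstart\tend\tgenes\nchr1\t1951\t2616\tS.001G000100\n' > "$workdir/genes_with_header.bed"

# Drop line 1 only when its second field is not a plain integer,
# i.e. when it looks like a header rather than a data line.
awk -F '\t' 'NR==1 && $2 !~ /^[0-9]+$/ {next} {print}' \
    "$workdir/genes_with_header.bed" > "$workdir/genes_noheader.bed"

cat "$workdir/genes_noheader.bed"
```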

Thank you, I did remove the headers from both bed files. The output of

cat genesgff1.bed SNPs.bed| awk -F '\t' '{print NF;}' | uniq | sort | uniq

is

4
5
4.3 years ago

The output is 4 and 5, so some lines have more than 4 fields.

You can display those lines using:

awk -F '\t' '(NF!=4) {print FILENAME,NF,NR, $0;}'   genesgff1.bed SNPs.bed
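A variant that tallies field counts per file can make it clearer which file holds the odd lines. The sketch below runs on a few sample rows copied from the question (including the `chr6` line with two consecutive tabs); the file names `genes.bed` and `snps.bed` are stand-ins, not the real files.

```shell
# Scratch directory with small stand-ins for the two files.
workdir=$(mktemp -d)
printf 'chr1\t1951\t2616\tS.001G000100\n' > "$workdir/genes.bed"
printf 'chr9\t57068854\t0.355187213\tS1\nchr6\t13897772\t\t3.351266271\tS4\n' > "$workdir/snps.bed"

# For each file, count how many lines have each number of tab-separated fields.
# Output columns: file name, field count, number of lines.
counts=$(cd "$workdir" && awk -F '\t' '{n[FILENAME "\t" NF]++}
    END {for (k in n) print k "\t" n[k]}' genes.bed snps.bed | sort)
echo "$counts"
```

Any file that reports more than one field count has inconsistent lines; here that is the SNP file, because the empty field between the two tabs counts as a fifth column.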

There were many lines with more than 4 fields. So, to simplify the file, I removed the last two columns and retained only the first two columns.

cat -t SNPs.bed

Chr09^I57068854
Chr09^I57068854
Chr09^I57068854
Chr06^I13897772
Chr09^I57068854
Chr02^I2244737
Chr09^I57068854
Chr02^I2244737
Chr09^I57054157
Chr01^I60532342

awk -F '\t' '(NF!=2) {print FILENAME,NF,NR, $0;}'   SNPs.bed
awk -F '\t' '(NF!=4) {print FILENAME,NF,NR, $0;}'   genesgff1.bed

didn't display anything, which I think is expected. But bedtools is still not working:

bedtools intersect -a genesgff1.bed -b SNPs.bed -wa > output.txt

Error: unable to open file or unable to determine types for file SNPs.bed

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the 
  expected columns (e.g., cols 2 and 3 for BED).
4.3 years ago

Same idea: check for lines that have no integer in columns 2 and 3.

awk -F '\t' '(int($2)==0 || int($3)==0) {print FILENAME,NF,NR, $0;}'   SNPs.bed

It printed out a number of lines. Is the error due to this? How can I deal with it?

SNPs.bed 2 1 Chr09  57068854
SNPs.bed 2 2 Chr09  57068854
SNPs.bed 2 3 Chr09  57068854
SNPs.bed 2 4 Chr06  13897772
SNPs.bed 2 5 Chr09  57068854
SNPs.bed 2 6 Chr02  2244737
SNPs.bed 2 7 Chr09  57068854
SNPs.bed 2 8 Chr02  2244737

It looks like it printed out every line of the file.


what is the output of

file SNPs.bed  genesgff1.bed

?

SNPs.bed: ASCII text
genesgff1.bed: ASCII text

There are only two fields in line 1 of SNPs.bed (see also the other lines), so $3 is empty, int($3) is 0, and the test matches every line.
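To see why every line was printed, here is a small sketch on made-up single lines: a missing third field is an empty string, awk's `int("")` is 0, and so the condition is true for any two-column line. The three-column line is hypothetical and only there for contrast.

```shell
# Two columns: $3 does not exist, int($3) is 0, so the line is reported.
two_col=$(printf 'Chr09\t57068854\n' |
    awk -F '\t' '(int($2)==0 || int($3)==0) {print NF, $0}')
echo "two columns:   '${two_col}'"

# Three columns with integer coordinates: the same test reports nothing.
three_col=$(printf 'Chr09\t57068853\t57068854\n' |
    awk -F '\t' '(int($2)==0 || int($3)==0) {print NF, $0}')
echo "three columns: '${three_col}'"
```

Note that this check would also flag a legitimate start coordinate of 0, so a hit is a hint to look at the line rather than proof that it is malformed.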

awk -F '\t' '(int($2)==0 || int($3)==0) {print FILENAME,NF,NR, $0;}'   SNPs.bed

It didn't display anything, which I think is correct. But still, bedtools intersect is not working.


It looks like the problem comes from the SNPs.bed file. In this file I only have two columns, 1) chromosome number and 2) SNP position, but it looks like a BED file needs at least 3 columns. I want to find the corresponding genes in the gene annotation file by comparing gene locations with SNP positions. Do you have any suggestions for changing the file format to make it acceptable to bedtools?
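One possible way to do this, as a minimal sketch: BED intervals are 0-based and half-open, so a SNP at 1-based position `pos` is usually written with start `pos-1` and end `pos`. The sample rows below are copied from the trimmed file shown earlier, the output name `SNPs.fixed.bed` is hypothetical, and treating column 2 as a 1-based position is an assumption that should be checked against how the SNP file was produced.

```shell
# Scratch directory with a tiny stand-in for the two-column SNP file.
workdir=$(mktemp -d)
printf 'Chr09\t57068854\nChr06\t13897772\nChr02\t2244737\n' > "$workdir/SNPs.bed"

# Assumption: column 2 is a 1-based position.
# Write chrom, pos-1, pos so each SNP becomes a one-base BED interval.
awk -F '\t' 'BEGIN {OFS="\t"} {print $1, $2-1, $2}' \
    "$workdir/SNPs.bed" > "$workdir/SNPs.fixed.bed"

cat "$workdir/SNPs.fixed.bed"

# Then, with bedtools installed and the real gene file in place:
# bedtools intersect -a genesgff1.bed -b "$workdir/SNPs.fixed.bed" -wa > output.txt
```

bedtools compares chromosome names as exact strings. The gene file shown in the question uses names like `chr1`, while the trimmed SNP file uses `Chr09`, so if that is still the case the two files would need the same naming before any overlaps are reported.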
