Question: condition bedtools on name column
0
gravatar for burnsro
3.2 years ago by
burnsro20
Austria
burnsro20 wrote:

Bedtools accepts files with formats like 'chr start end name score strand'

However, is it possible to take these columns into account (e.g name column) when using bedtools intersect or bedtools merge.

For example I have a bed file like so, with a 'name' column for transposable elements

FileA:

scaffold_1  913038  914038  DNA_transposon  
scaffold_1  925018  926018  LTR

File B

scaffold_1  1026586 1027586 Mariner 
scaffold_1  925211  935660  DNA_transposon  
scaffold_1  925880  926300  LTR

but I would only like bedtools to intersect the 2nd line from FileA with the 3rd line from FileB, ignoring the 2nd line in FileB because it has "DNA_transposon" and not "LTR" in column4.

I can't find something like this in the manual file for bedtools. Knowing how to do it in bedtools would be helpful because it has the option to compare one bed file with many others (options -a and -b), in addition to many other options.

bedtools • 1.3k views
ADD COMMENTlink modified 3.2 years ago by Alex Reynolds29k • written 3.2 years ago by burnsro20
0
gravatar for EagleEye
3.2 years ago by
EagleEye6.5k
Sweden
EagleEye6.5k wrote:

1) Why don't you grep 'LTR' lines from both the files

grep -w "LTR" FileA.bed > FileA_greppedLTR.bed

grep -w "LTR" FileB.bed > FileB_greppedLTR.bed

2) Use bed tools on grepped files

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by EagleEye6.5k

True, it would just be a lot of "grepped" files to make for each unique name in column4. This example I made is just simplification.

ADD REPLYlink written 3.2 years ago by burnsro20

EDIT: (The -F is the delimiter but it's not needed for tab-separate, see below). While this doesn't help not make lots of files, you could just do awk -F '{print > "file_"$4}' file.bed and that'll make you a new file for each unique string in that column, then you've not got to spit each on individually.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Daniel3.7k

By the way. I need some clarification, the AWK command you mentioned does it actually work ?

ADD REPLYlink written 3.2 years ago by EagleEye6.5k

oops, I just realised I removed the file delimiter but not the flag (you don't need the -F if it's tab-separated). But other than that, yeah it works as this:

awk '{print > "file_"$4}' file.bed
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Daniel3.7k

Great Awesome.......

ADD REPLYlink written 3.2 years ago by EagleEye6.5k
0
gravatar for Pierre Lindenbaum
3.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

using sqlite3

$ awk 'BEGIN {printf("create table if not exists TA (C TEXT,S int,E int, T text); BEGIN TRANSACTION;\n");} { printf("insert into TA(C,S,E,T) values (\"%s\",%s,%s,\"%s\");\n",$1,$2,$3,$4);} END {printf("commit;\n");}' fileA | sqlite3 db.sqlite
$ awk 'BEGIN {printf("create table if not exists TB (C TEXT,S int,E int, T text); BEGIN TRANSACTION;\n");} { printf("insert into TB(C,S,E,T) values (\"%s\",%s,%s,\"%s\");\n",$1,$2,$3,$4);} END {printf("commit;\n");}' fileB | sqlite3 db.sqlite
$ sqlite3 db.sqlite 'select * from TA,TB where TA.C=TB.C and TA.T=TB.T and NOT(TA.E <= TB.S or TB.E <= TA.S)'
ADD COMMENTlink written 3.2 years ago by Pierre Lindenbaum123k

It would work. I am only hesitant to use something other than bedtools because bedtools has many useful option such as -c to count the number of files that has an interval, -f for matching only a percentage of an interval, -wa and -wb etc.

ADD REPLYlink written 3.2 years ago by burnsro20
0
gravatar for EagleEye
3.2 years ago by
EagleEye6.5k
Sweden
EagleEye6.5k wrote:

Sorry I could not able to add as a reply to your comment because of the format of the message. I am writing it in the answer section

Example input (BED tab-delimited):

scaffold_1 1026586 1027586 Mariner

scaffold_1 925211 935660 DNA_transposon

scaffold_1 925880 926300 LTR

scaffold_1 11111 22324 DNA_transposon

scaffold_1 1026586 1027586 Mariner

scaffold_1 21432 21444 Mariner

SCRIPT ( splitFile.sh ) :

file=$1

cols=$(cat $file | cut -f4 | awk '!x[$0]++' | tr '\n' ' ' | sed -e 's/(.*) /\1/g');

for name in $cols

{

IFS='/' read -a file_name <<< "$file";

fname=$(echo $file_name | cut -f 1 -d '.')

grep -w "$name" $file > ${fname}_${name}.bed

}

Run script:

./splitFile.sh FileB.bed

Output files:

1) FileB_DNA_transposon.bed

scaffold_1 925211 935660 DNA_transposon

scaffold_1 11111 22324 DNA_transposon

2) FileB_LTR.bed

scaffold_1 925880 926300 LTR

3) FileB_Mariner.bed

scaffold_1 1026586 1027586 Mariner

scaffold_1 1026586 1027586 Mariner

scaffold_1 21432 21444 Mariner

Run this script for FileA.bed and then compare appropriate pairs using BedTools. This is just a suggestion (idea), you can modify according to your needs.

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by EagleEye6.5k

Yes I understood what you meant by using grep in the first comment, but it would lead to a lot of pairwise comparisons to make especially with many bed files to compare. With bedtools you can use one file to compare with many at once, thats why I wrote it in the question.

ADD REPLYlink written 3.2 years ago by burnsro20

Now the question is not clear. You wanted to compare File A and File B rows having 'same/matching' column#4 using bedtools, that's what I understood.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by EagleEye6.5k

File A and File B are a general example, but ideally I'd like to scale up to many FileAs and FileBs. I agree your way works its just a lot of intermediates to generate.

ADD REPLYlink written 3.2 years ago by burnsro20
0
gravatar for Alex Reynolds
3.2 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Generating intermediate files is unnecessary. Doing pairwise comparisons is unnecessary.

The bedops toolkit supports multiple standard input streams, so for one "condition" or ID field, you can use a process substitution to generate a stream of subsetted elements, wherever you would have otherwise specified a path to a regular file ("filename"):

$ bedops --whatever-set-operation <(awk '$4 == "DNA_transposon"' A.bed) <(awk '$4 == "DNA_transposon"' B.bed) > answer.DNA_transposon.bed

More generally:

$ id="DNA_transposon" && bedops --whatever-set-operation <(awk -vid=$id '$4==id' A.bed) <(awk -vid=$id '$4==id' B.bed) > answer.$id.bed

From here, it is easy to deal with automating the even more general case of multiple IDs.

Just set up a loop that walks through a list of unique IDs from one of the inputs via cut, sort anduniq:

$ for id in `cut -f4 A.bed | sort | uniq`; do bedops --whatever-set-operation <(awk -vid=$id '$4==id' A.bed) <(awk -vid=$id '$4==id' B.bed) > answer.$id.bed; done

If you want a list of IDs from two or more files, just use cat to concatenate them to build the ID list:

$ for id in `cat A.bed B.bed | cut -f4 | sort | uniq`; do bedops --whatever-set-operation <(awk -vid=$id '$4==id' A.bed) <(awk -vid=$id '$4==id' B.bed) > answer.$id.bed; done

Thanks to default support for Unix input streams in bedops tools and to functionality in the bash shell, all of these approaches generate no intermediate (regular) files and should run pretty quickly.

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1021 users visited in the last hour