Question

Sequence extraction based on coordinates

0

Entering edit mode

6.9 years ago

leo1985.arnab ▴ 50

I have a file with sequences in first column and start coordinate for these sequences in the 3rd column. I want to extract all rows where the starting loci of sequences vary by 5 nt of each other. Following is an example

  Sequence     Column header_xxx          Start coordinate           Column header_yyy      
    ABC                                           500
    XYZ                                           502
    DEF                                           12050
    PQR                                           12055
    abc                                           400
    mno                                           456

In the above example the script I am trying to come up with should be able to extract rows with seq ABC, XYZ, DEF and PQR, but not abc and mno.

Any solution preferably in bash or perl will be very helpful.

Thanks and regards

sequence • 1.3k views

ADD COMMENT • link updated 6.9 years ago by Alex Reynolds 35k • written 6.9 years ago by leo1985.arnab ▴ 50

score 3 · Answer 1 · 2017-05-19

As a pre-requisite, install BEDOPS to get bedmap and sort-bed in your environment.

Convert the (headerless) file to a pseudo-BED file:

$ awk 'BEGIN{ OFS="\t"; }{ print "chrPseudo",$3,($3+1),$1 }' seq.txt | sort-bed - > seq.bed

Use bedmap to map elements of seq.bed that are within 5 bases of each other and retrieve qualifying seq.bed ID values. Write them out on separate lines, using sort and uniq to build a text file of unique ID names, written to a file called seqIDs.txt:
```
$ bedmap --echo-map-id --range 5 --multi-delim '\n' seq.bed | sort | uniq > seqIDs.txt
```
Use grep and the seqIDs.txt file to filter the original seq.txt file:
```
$ grep -f seqIDs.txt seq.txt > seqFiltered.txt
```

The file seqFiltered.txt should contain a subset of lines from seq.txt that meet your loci requirements.

Once you see what each step does, you can do all of this in a one-liner:

$ awk 'BEGIN{ OFS="\t"; }{ print "chrPseudo",$3,($3+1),$1 }' seq.txt | sort-bed - | bedmap --echo-map-id --range 5 --multi-delim '\n' - | sort | uniq | grep -f - seq.txt > seqFiltered.txt

This would run faster, by avoiding the creation of intermediate files.