Question: Sequence extraction based on coordinates
0
gravatar for leo1985.arnab
5 weeks ago by
leo1985.arnab0 wrote:

I have a file with sequences in first column and start coordinate for these sequences in the 3rd column. I want to extract all rows where the starting loci of sequences vary by 5 nt of each other. Following is an example

  Sequence     Column header_xxx          Start coordinate           Column header_yyy      
    ABC                                           500
    XYZ                                           502
    DEF                                           12050
    PQR                                           12055
    abc                                           400
    mno                                           456

In the above example the script I am trying to come up with should be able to extract rows with seq ABC, XYZ, DEF and PQR, but not abc and mno.

Any solution preferably in bash or perl will be very helpful.

Thanks and regards

sequence • 151 views
ADD COMMENTlink modified 5 weeks ago by Alex Reynolds19k • written 5 weeks ago by leo1985.arnab0
3
gravatar for Alex Reynolds
5 weeks ago by
Alex Reynolds19k
Seattle, WA USA
Alex Reynolds19k wrote:

As a pre-requisite, install BEDOPS to get bedmap and sort-bed in your environment.

  1. Convert the (headerless) file to a pseudo-BED file:

    $ awk 'BEGIN{ OFS="\t"; }{ print "chrPseudo",$3,($3+1),$1 }' seq.txt | sort-bed - > seq.bed
    
  2. Use bedmap to map elements of seq.bed that are within 5 bases of each other and retrieve qualifying seq.bed ID values. Write them out on separate lines, using sort and uniq to build a text file of unique ID names, written to a file called seqIDs.txt:

    $ bedmap --echo-map-id --range 5 --multi-delim '\n' seq.bed | sort | uniq > seqIDs.txt
    
  3. Use grep and the seqIDs.txt file to filter the original seq.txt file:

    $ grep -f seqIDs.txt seq.txt > seqFiltered.txt
    

The file seqFiltered.txt should contain a subset of lines from seq.txt that meet your loci requirements.

Once you see what each step does, you can do all of this in a one-liner:

$ awk 'BEGIN{ OFS="\t"; }{ print "chrPseudo",$3,($3+1),$1 }' seq.txt | sort-bed - | bedmap --echo-map-id --range 5 --multi-delim '\n' - | sort | uniq | grep -f - seq.txt > seqFiltered.txt

This would run faster, by avoiding the creation of intermediate files.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Alex Reynolds19k

Thanks a lot for the solution Alex !!

ADD REPLYlink written 5 weeks ago by leo1985.arnab0

If this answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

ADD REPLYlink written 5 weeks ago by WouterDeCoster19k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 608 users visited in the last hour