4.9 years ago by
Wales, UK
Since I'm working on a cluster and don't have bedtools installed, or the privileges to install it (either on the cluster or on my local machine), I came up with this workaround:
1. Change the space-separated .out file into a tab-delimited file:
cat FILE.out | tr -s ' ' | sed 's/^ *//g' | tr ' ' '\t' > FILE.out.tab
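2. Skip the first three lines (the header of the .out file: two lines of column titles and a blank line), then extract the fifth column with sequence names and get rid of the duplicates. The header has to be dropped before sorting, otherwise its entries get sorted in among the real names:
tail -n +4 FILE.out.tab | cut -f5 | sort -u > repeat.sequence.names.list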
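3. Put each sequence of your .masked file on a single line for easier manipulation (the line break inside the second sed expression is intentional: you do have to type the command across two lines, with the second line starting with the > sign):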
sed '/>/s/$/</g' < FILE.masked | tr -d '\n' | tr '<' '\n'| sed 's/>/\
>/g' | grep . > FILE.masked.1
4. Use the reformatted file (FILE.masked.1) and the name list from step 2 to pull out the sequences with repeats:
grep -A1 -f repeat.sequence.names.list FILE.masked.1 | grep -v "^--$" > masked.sequences.repeat.fasta
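One caveat with the grep in step 4: each name is used as a pattern that can match anywhere in a line, so a name like seq1 would also pull out seq10. Below is a minimal awk sketch of steps 2 and 4 that compares whole names instead. It assumes the sequence name is the first whitespace-delimited word of each FASTA header and that the .out file starts with three header lines. The function names are made up for illustration.

```shell
# Hypothetical helper names; the file names follow the steps above.

# Steps 1-2 in one go: awk splits on runs of whitespace by default,
# so the .out file does not need to be made tab-delimited first.
# NR > 3 skips the three header lines (assumed layout of the .out file).
repeat_names() {
    awk 'NR > 3 && NF >= 5 { print $5 }' "$1" | sort -u
}

# Step 4 with exact matching: load the names from the list file, then
# print every FASTA record whose header name is in that list.
# Works on the multi-line .masked file as well as on FILE.masked.1.
extract_records() {
    awk '
        FILENAME == ARGV[1] {            # first file: the list of names
            if (NF) wanted[$1] = 1       # ignore blank lines
            next
        }
        /^>/ {                           # FASTA header line
            name = substr($1, 2)         # drop the leading ">"
            keep = (name in wanted)
        }
        keep                             # print lines of wanted records
    ' "$1" "$2"
}

# Usage:
#   repeat_names FILE.out > repeat.sequence.names.list
#   extract_records repeat.sequence.names.list FILE.masked > masked.sequences.repeat.fasta
```

Because the names are looked up as whole strings rather than as patterns, characters such as . or | in a sequence name are not treated as regular-expression syntax.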
There's also the option of running grep on multiple CPU cores if you have parallel installed. See how to here.
modified 4.9 years ago • written 4.9 years ago by san.san • 160