How to pull out sequences with repeat elements from RepeatMasker output file?
8.1 years ago
san.san ▴ 190

Hi all,

I've successfully run RepeatMasker with hard mask and soft mask parameters and have been asked to pull out sequences which have masked repeat elements.

I'm new to the command line and can only use grep, awk, etc. in a very basic way.

Would anyone be able to help me with this?

Thanks!

repeatmasker sequence sorting filtering • 5.4k views
8.1 years ago
san.san ▴ 190

Since I'm working on a cluster and don't have bedtools installed, or the privileges to install it (either on the cluster or on my local machine), I came up with this workaround:

1. Change the space-separated .out file into a tab-delimited file:

cat FILE.out | tr -s ' ' | sed 's/^ *//g' | tr ' ' '\t' > FILE.out.tab

2. Skip the first three lines (the RepeatMasker header), extract the fifth column with the sequence names, and get rid of the duplicates. The header has to be dropped before sorting, otherwise its lines end up mixed in with the sorted names:

tail -n +4 FILE.out.tab | cut -f5 | sort -u > repeat.sequence.names.list

3. Put each sequence in your .masked file on a single line for easier manipulation (you do have to press Enter after the backslash and type the > sign on the second line):

sed '/>/s/$/</g' < FILE.masked | tr -d '\n' | tr '<' '\n' | sed 's/>/\
>/g' | grep . > FILE.masked.1

4. Use the one-line .masked file to pull out sequences with repeats:

grep -A1 -f repeat.sequence.names.list FILE.masked.1 | grep -v "^--$" > masked.sequences.repeat.fasta

There's also the option of running grep on multiple CPU cores if you have parallel installed. See how to here.
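
As a possible alternative to steps 3 and 4, a single awk pass could select records by exact sequence name without flattening the FASTA file first. This is only a sketch: the function name extract_by_name is made up here, and it assumes the sequence name is the first whitespace-delimited word of each header line, matching what the .out file has in column 5.

```shell
# extract_by_name NAMES FASTA
# Print every FASTA record whose ID (first word after ">") is listed in NAMES.
# Hypothetical helper for illustration, not part of RepeatMasker.
extract_by_name() {
    awk '
        # First file: remember each sequence name (one per line).
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }
        # Header line: strip ">" from the first word and look it up.
        /^>/ { id = substr($1, 2); keep = (id in wanted) }
        # Print header and sequence lines of wanted records only.
        keep
    ' "$1" "$2"
}
```

It could be called as extract_by_name repeat.sequence.names.list FILE.masked > masked.sequences.repeat.fasta. Because the lookup is an exact match on the ID, a name such as seq1 would not also pull out seq10, which a plain grep -f can do.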


FYI, you can always install software into your home directory (typically ~/bin). You don't need elevated privileges for that.
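
A sketch of what that might look like, assuming you already have a prebuilt executable somewhere and use a Bourne-style shell; install_to_home_bin is a made-up helper name, not a standard command:

```shell
# install_to_home_bin PATH_TO_EXECUTABLE
# Hypothetical helper: copy an executable into ~/bin and make sure ~/bin is on PATH.
install_to_home_bin() {
    # Create ~/bin if needed and copy the executable into it.
    mkdir -p "$HOME/bin"
    cp "$1" "$HOME/bin/" && chmod u+x "$HOME/bin/$(basename "$1")"
    # Put ~/bin on PATH for the current session if it is not there already.
    case ":$PATH:" in
        *":$HOME/bin:"*) ;;
        *) PATH="$HOME/bin:$PATH"; export PATH ;;
    esac
}
```

The PATH change only lasts for the current session; to keep it, the same PATH line would go in a startup file such as ~/.bashrc.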


Ah, yes! Would have saved me a lot of time. Thanks!

8.1 years ago

Since you mention being familiar with awk:

  1. Convert the RepeatMasker text output to a BED file (something like awk '{OFS="\t"}{print $6, $7-1, $8}', though you should check the columns and consider including the strand).
  2. Use bedtools getfasta with a fasta file and the BED file from step one.
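
A rough sketch of step 1 that keeps the strand. It assumes the standard RepeatMasker .out layout (three header lines, then rows with the query name in column 5, begin and end in columns 6 and 7, the strand in column 9 as + or C, and the repeat name in column 10), so do check it against your own file. With that layout the coordinates sit one column to the left of those in the one-liner above. out_to_bed is a name invented for this example.

```shell
# out_to_bed FILE.out
# Hypothetical helper: write a 6-column BED (chrom, start, end, name, score, strand).
out_to_bed() {
    awk 'BEGIN { OFS = "\t" }
         NR > 3 && NF >= 10 {
             # RepeatMasker writes C for the complement strand.
             strand = ($9 == "C") ? "-" : "+"
             # BED is 0-based and half-open: start = begin - 1, end unchanged.
             # The score column is filled with a placeholder 0.
             print $5, $6 - 1, $7, $10, 0, strand
         }' "$1"
}
```

The result could then feed step 2, for example out_to_bed FILE.out > repeats.bed followed by bedtools getfasta -s -fi genome.fa -bed repeats.bed -fo repeats.fa, where genome.fa stands for the unmasked input FASTA and -s makes getfasta respect the strand column.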

I'm sorry, what do you mean by checking the strand? Thanks!


Look at the columns, one of them has a strand, which you might want to include.


By the way, RepeatMasker's .out file isn't a tab-delimited file, unfortunately. Otherwise I reckon I could cut and sort -u the .out file that contains the names of sequences with repeat elements and grep those from my .masked file :/


That's the benefit of awk: it splits on runs of whitespace by default, so it handles the fixed-width nature of the file (granted, you could fix this with sed too).
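
For instance, the list of sequence names could come straight from the .out file with no conversion to tabs. A minimal sketch, assuming the usual RepeatMasker layout of three header lines followed by rows with the query sequence name in column 5; list_repeat_names is a name made up for this example:

```shell
# list_repeat_names FILE.out
# Hypothetical helper: unique names of sequences that have at least one repeat hit.
list_repeat_names() {
    # Skip the three header lines, print column 5 (query sequence), deduplicate.
    awk 'NR > 3 && NF >= 5 { print $5 }' "$1" | sort -u
}
```

Something like list_repeat_names FILE.out > repeat.sequence.names.list would then give one name per line, ready to be matched against the headers of the .masked file.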


I used the awk command you kindly provided on my .out file and ended up with this:

http://postimg.org/image/k0kt25hg7/

Which doesn't have the info I need :/ But I came up with another workaround, so it's all good.
