Question: How to pull out sequences with repeat elements from RepeatMasker output file?
0
4.6 years ago by
san.san160
Wales, UK
san.san160 wrote:

Hi all,

I've successfully run RepeatMasker with hard mask and soft mask parameters and have been asked to pull out sequences which have masked repeat elements.

I'm new to the command line and can only use grep, awk, etc. in a very basic way.

Would anyone be able to help me with this?

Thanks!

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by san.san160
3
4.6 years ago by
san.san160
Wales, UK
san.san160 wrote:

Since I'm working on a cluster and don't have bedtools installed or the privileges to install it (either on the cluster or on my local machine), I came up with this workaround:

1. Change the space-separated .out file into a tab-delimited file:

cat FILE.out | tr -s ' ' | sed 's/^ *//g' | tr ' ' '\t' > FILE.out.tab

2. Skip the first three lines (the RepeatMasker column headers and a blank line), extract the fifth column with the sequence names, and get rid of the duplicates:

tail -n +4 FILE.out.tab | cut -f5 | sort -u > repeat.sequence.names.list

3. Put each sequence of your .masked file on a single line for easier manipulation (you do have to press Enter after the backslash and type the > sign at the start of the second line):

sed '/>/s/$/</g' < FILE.masked | tr -d '\n' | tr '<' '\n' | sed 's/>/\
>/g' | grep . > FILE.masked.1

4. Use the one-line .masked file to pull out sequences with repeats:

grep -A1 -f repeat.sequence.names.list FILE.masked.1 | grep -v "^--$" > masked.sequences.repeat.fasta

There's an option of using grep with multiple CPU cores if you have parallel installed. See how to here.
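The four steps above can also be sketched with awk alone, which skips the tab conversion and the one-line reformatting. This is only a sketch: the function names repeat_names and extract_records are made up for illustration, and it assumes the names in column 5 of the .out file match the FASTA IDs (header text up to the first space) exactly.

```shell
# Print the unique query-sequence names (column 5) of a RepeatMasker .out
# file, skipping its three header lines. awk splits on runs of whitespace,
# so the file does not need to be tab-delimited first.
repeat_names() {
    awk 'NR > 3 && NF >= 5 { print $5 }' "$1" | sort -u
}

# Print the FASTA records whose ID is listed in the names file ($1).
# IDs are compared exactly, so "seq1" does not also select "seq10".
# Multi-line sequences are fine; no one-line conversion is needed.
extract_records() {
    awk '
        NR == FNR { wanted[$1] = 1; next }                    # first file: load the names
        /^>/      { id = substr($1, 2); keep = (id in wanted) } # header: decide for this record
        keep                                                  # print header and sequence lines
    ' "$1" "$2"
}

# Usage with the file names from the steps above:
#   repeat_names FILE.out > repeat.sequence.names.list
#   extract_records repeat.sequence.names.list FILE.masked > masked.sequences.repeat.fasta
```

Matching whole IDs rather than grepping for patterns is a deliberate choice here, since a name used as a grep pattern also matches any longer name that contains it.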

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by san.san160
1

FYI, you can always install software into your home directory (typically ~/bin). You don't need elevated privileges for that.
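As a rough sketch of what that can look like for a prebuilt executable: the helper name install_to_home_bin is hypothetical, and the target directory defaults to ~/bin as suggested above.

```shell
# Copy an executable into a directory you own and make it runnable.
# $1 = path to the executable, $2 = target directory (default: ~/bin).
install_to_home_bin() {
    src=$1
    dest=${2:-$HOME/bin}
    mkdir -p "$dest" &&
    cp "$src" "$dest/" &&
    chmod u+x "$dest/$(basename "$src")"
}

# Afterwards, make the shell search that directory, e.g. in ~/.bashrc or ~/.profile:
#   export PATH="$HOME/bin:$PATH"
```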

ADD REPLYlink written 4.6 years ago by Devon Ryan97k

Ah, yes! Would have saved me a lot of time. Thanks!

ADD REPLYlink written 4.6 years ago by san.san160
1
4.6 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

Since you mention being familiar with awk:

  1. Convert the RepeatMasker text output to a BED file (something like awk 'BEGIN{OFS="\t"} NR>3 {print $5, $6-1, $7}' FILE.out, though you should check whether you also want to include the strand).
  2. Use bedtools getfasta with a FASTA file and the BED file from step one.
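A slightly fuller sketch of step 1 that also carries the strand. The function name rmout_to_bed is made up, and the column numbers assume the standard RepeatMasker .out layout as awk sees it ($1 score, $5 sequence, $6 begin, $7 end, $9 strand, $10 repeat name); check them against your own file first. FILE.fasta in the comment is a placeholder for the unmasked input.

```shell
# Convert a RepeatMasker .out file to 6-column BED:
# sequence, start (0-based), end, repeat name, score, strand.
rmout_to_bed() {
    awk 'BEGIN { OFS = "\t" }
         NR > 3 && NF >= 10 {
             strand = ($9 == "C") ? "-" : "+"        # .out writes "C" for the complement strand
             print $5, $6 - 1, $7, $10, $1, strand   # BED starts are 0-based, .out starts are 1-based
         }' "$1"
}

# Possible follow-up for step 2, if bedtools is available
# (-s reverse-complements features on the "-" strand):
#   rmout_to_bed FILE.out > repeats.bed
#   bedtools getfasta -s -fi FILE.fasta -bed repeats.bed -fo repeats.fasta
```

Note that this route yields only the repeat regions themselves, not the whole sequences that contain them.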
ADD COMMENTlink written 4.6 years ago by Devon Ryan97k

I'm sorry, what do you mean by checking the strand? Thanks!

ADD REPLYlink written 4.6 years ago by san.san160

Look at the columns: one of them holds the strand, which you might want to include.

ADD REPLYlink written 4.6 years ago by Devon Ryan97k

By the way, RepeatMasker's .out file isn't a tab-delimited file, unfortunately. Otherwise I reckon I could cut and sort -u the .out file that contains the names of sequences with repeat elements and grep those from my .masked file :/

ADD REPLYlink written 4.6 years ago by san.san160

That's the benefit of awk: it'll handle the fixed-width nature of the file (granted, you could fix this with sed too).
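A small illustration of that point, using a made-up line padded the way .out records are (the variable names are arbitrary):

```shell
# A made-up, right-aligned line in the style of a RepeatMasker .out record.
line='   18    5.0  0.0  0.0  seq10        5    20   (10) C  AluY  SINE/Alu'

# awk splits on runs of spaces and ignores leading ones, so $5 is the sequence name.
name_with_awk=$(printf '%s\n' "$line" | awk '{ print $5 }')

# cut treats every single space as a delimiter, so its "field 5" is just padding (empty).
name_with_cut=$(printf '%s\n' "$line" | cut -d ' ' -f 5)

printf 'awk: [%s]  cut: [%s]\n' "$name_with_awk" "$name_with_cut"
```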

ADD REPLYlink written 4.6 years ago by Devon Ryan97k

I used your kindly provided awk command on my .out file and ended up with this:

http://postimg.org/image/k0kt25hg7/

Which doesn't have the info I need :/ But I came up with another workaround, so it's all good.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by san.san160