Question

Remove repetitive sequence of variable length from reads

0

7.6 years ago

rm16 • 0

I am working with a FASTA file in which each read contains a repetitive sequence of variable length at the 5' end. For instance, in the file below:

>seq1
CCCCAAAACCCCAAAACCCCGATGATCATGGATC
>seq2
CCCCAAAACCCCGATGGCATCATTCA
>seq3
CCCCAAAACCCCAAAATATGTTGCTACTAG

I would like to remove the repetitive sequence of C's and A's from the 5' end of each read, but whatever solution I use should take into account that there may be any number of repetitive units, including a repetitive C block without a subsequent A block (see "seq2" above).

If this can be done in the Mac OSX command line, that would be optimal. I am also interested in software packages that may be able to accomplish this. Thank you for any help you can offer.

fasta sequencing osx • 1.8k views

ADD COMMENT • link updated 7.6 years ago by igor 13k • written 7.6 years ago by rm16 • 0

score 3 · Accepted Answer · 2016-09-08

7.6 years ago

igor 13k

I think fastx_clipper can do this. I am not sure about how it treats repeats. If it only removes the first one, I suppose you could just run it multiple times. Docs: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

Of course, there is always the sed solution, but then you need to be careful about the formatting of your FASTA file (sequences may span multiple lines, for example): sed -E 's/^(CCCC(AAAA)?)+//' file.fasta
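To deal with the multi-line concern, one option is to join wrapped sequence lines before trimming. The following is only a minimal sketch: the function names linearize_fasta and trim_repeats are made up for illustration, and it assumes the repeat always comes in whole blocks of exactly CCCC and AAAA, as in the reads from the question.

```shell
# Hypothetical helper: join wrapped sequence lines so that each record is
# one header line followed by a single sequence line.
linearize_fasta() {
  awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
       { seq = seq $0 }
       END { if (seq != "") print seq }' "$@"
}

# Hypothetical helper: strip leading CCCC or CCCCAAAA units from sequence
# lines only; header lines (starting with ">") are left untouched.
# Assumption: blocks are exactly four C's, optionally followed by four A's.
trim_repeats() {
  sed -E '/^>/!s/^(CCCC(AAAA)?)+//'
}

# Demo on the reads from the question, with seq1 wrapped over two lines.
linearize_fasta <<'EOF' | trim_repeats
>seq1
CCCCAAAACCCCAAAA
CCCCGATGATCATGGATC
>seq2
CCCCAAAACCCCGATGGCATCATTCA
>seq3
CCCCAAAACCCCAAAATATGTTGCTACTAG
EOF
```

On a real file this would be something like linearize_fasta file.fasta | trim_repeats > trimmed.fasta, which leaves the original file intact. Note that a read whose genuine sequence happens to start with CCCC right after the repeat would be over-trimmed, so it is worth spot-checking the output.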

ADD COMMENT • link 7.6 years ago by igor 13k

1

While sed -i is useful, it overwrites the original file, which may not be what the OP wants. I advise against using -i unless you are sure that your command is the right one and you no longer need the original file.

In addition, this sed will not remove CCCCCAAAAACCCCCAAAAA completely...

ADD REPLY • link 7.6 years ago by WouterDeCoster 47k

1

I suppose the -i is arguable, but I removed it. It was really meant as a suggestion. I assume people don't just copy and paste random commands from the internet, but that's a big assumption.

And fixed the pattern. I forgot the ^ was there.

ADD REPLY • link 7.6 years ago by igor 13k

0

I was able to figure it out using extended regular expressions and just running a couple of different scripts to make sure I removed every repetitive instance:

sed -E 's/^CCCCAAAA*//g' file.fasta

Thanks a lot, everyone.

ADD REPLY • link 7.6 years ago by rm16 • 0

1

The * would only apply to the previous character (A), not the entire string.

Note: -E works on BSD sed (the default on macOS). On GNU sed the traditional flag is -r, although recent GNU sed versions also accept -E.
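To illustrate the point about the quantifier, here is a small sketch comparing an ungrouped and a grouped pattern on seq1 from the question. The function names trim_ungrouped and trim_grouped are made up for this example, the repeat is assumed to come in whole CCCC and AAAA blocks, and sed -E is assumed to be available.

```shell
read_seq='CCCCAAAACCCCAAAACCCCGATGATCATGGATC'

# Ungrouped: the * binds only to the final A, so the pattern means
# "CCCC, then AAA, then zero or more A" and removes a single unit.
trim_ungrouped() {
  sed -E 's/^CCCCAAAA*//'
}

# Grouped: the parentheses make the + repeat the whole unit, and the
# inner (AAAA)? allows a trailing C block without an A block.
trim_grouped() {
  sed -E 's/^(CCCC(AAAA)?)+//'
}

printf '%s\n' "$read_seq" | trim_ungrouped   # prints CCCCAAAACCCCGATGATCATGGATC
printf '%s\n' "$read_seq" | trim_grouped     # prints GATGATCATGGATC
```

The ungrouped version would have to be run repeatedly to peel off one unit at a time, whereas the grouped version removes the whole run in one pass.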

ADD REPLY • link 7.6 years ago by igor 13k