Question: Remove repetitive sequence of variable length from reads
rm16 wrote:

I am working with a FASTA file in which each read contains a repetitive sequence of variable length at the 5' end. For instance, in the file below:

>seq1
CCCCAAAACCCCAAAACCCCGATGATCATGGATC
>seq2
CCCCAAAACCCCGATGGCATCATTCA
>seq3
CCCCAAAACCCCAAAATATGTTGCTACTAG

I would like to remove the repetitive sequence of C's and A's from the 5' end of each read, but whatever solution I use should take into account that there may be any number of repetitive units, including a repetitive C block without a subsequent A block (see "seq2" above).

If this can be done in the Mac OSX command line, that would be optimal. I am also interested in software packages that may be able to accomplish this. Thank you for any help you can offer.

sequencing osx fasta
written 4.1 years ago by rm16
igor wrote:

I think fastx_clipper can do this. I am not sure about how it treats repeats. If it only removes the first one, I suppose you could just run it multiple times. Docs: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

Of course, there is always the sed solution, but then you need to pay attention to how your FASTA file is formatted (sequences may span multiple lines, for example): sed 's/^\(CCCCAAAA\)*\(CCCC\)\{0,1\}//' file.fasta
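
To make the multi-line concern concrete, here is a minimal sketch that first joins wrapped sequence lines with awk and only then trims the repeat with sed. It assumes the repeat unit is exactly CCCCAAAA, optionally followed by one more CCCC block, as in the three reads from the question. The file names and the trim_wrapped function are made up for illustration.

```shell
# Demo input (hypothetical file): seq1 is wrapped over two lines,
# with the line break falling inside the repeat.
demo_dir=$(mktemp -d)
cat > "$demo_dir/wrapped.fasta" <<'EOF'
>seq1
CCCCAAAACC
CCAAAACCCCGATGATCATGGATC
>seq2
CCCCAAAACCCCGATGGCATCATTCA
EOF

# Step 1 (awk): put each record's sequence on a single line.
# Step 2 (sed): trim the leading repeat from non-header lines only.
# trim_wrapped is a hypothetical helper name, not part of any toolkit.
trim_wrapped() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$1" |
    sed -E '/^>/!s/^(CCCCAAAA)*(CCCC)?//'
}

trim_wrapped "$demo_dir/wrapped.fasta"
```

Note that a read whose genuine sequence begins with CCCC right after the repeat would lose that block as well; the pattern cannot tell the two cases apart.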

written 4.1 years ago by igor

While sed -i is useful, it overwrites the original file, which may not be what the OP wants. I advise against using -i unless you are sure that your command is correct and you no longer need the original file.

In addition, this sed will not remove CCCCCAAAAACCCCCAAAAA completely...
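
For the point about -i, here is a sketch of two ways to keep the original file safe: let sed write to a new file, or edit in place with a backup suffix. It assumes the repeat unit is CCCCAAAA with an optional trailing CCCC, as in the question's examples, and all file names are hypothetical. Writing the suffix directly after -i (as in -i.bak) is, as far as I know, accepted by both BSD and GNU sed.

```shell
# Demo copy of two reads from the question (hypothetical file names).
workdir=$(mktemp -d)
printf '%s\n' '>seq1' 'CCCCAAAACCCCAAAACCCCGATGATCATGGATC' \
              '>seq2' 'CCCCAAAACCCCGATGGCATCATTCA' > "$workdir/reads.fasta"

# Option 1: no -i at all. sed prints to stdout and the shell
# redirects the result into a new file; the input is never modified.
sed -E 's/^(CCCCAAAA)*(CCCC)?//' "$workdir/reads.fasta" > "$workdir/reads.trimmed.fasta"

# Inspect what changed before replacing anything.
# diff exits non-zero when the files differ, hence the "|| true".
diff "$workdir/reads.fasta" "$workdir/reads.trimmed.fasta" || true

# Option 2: edit in place, but keep the untouched original as <name>.bak.
cp "$workdir/reads.fasta" "$workdir/reads_inplace.fasta"
sed -E -i.bak 's/^(CCCCAAAA)*(CCCC)?//' "$workdir/reads_inplace.fasta"
```

With option 2 the original content survives in reads_inplace.fasta.bak, so a wrong pattern can still be undone.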

written 4.1 years ago by WouterDeCoster

I suppose the -i is arguable, but I removed it. It was really meant as a suggestion. I assume people don't just copy and paste random commands from the internet, but that's a big assumption.

And fixed the pattern. I forgot the ^ was there.

written 4.1 years ago by igor

I was able to figure it out using extended regular expressions and just running a couple of different scripts to make sure I removed every repetitive instance:

sed -E 's/^CCCCAAAA*//g' file.fasta

Thanks a lot, everyone.

written 4.1 years ago by rm16

The * would only apply to the previous character (the last A), not the entire CCCCAAAA string. To repeat the whole unit, group it in parentheses: (CCCCAAAA)*.

Note: -E works on BSD sed (the macOS default). On GNU sed the traditional flag is -r, although recent GNU sed versions accept -E as well.
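
A small comparison on the three reads from the question may help show the difference between the ungrouped and the grouped pattern. This is only a sketch: it assumes the repeat is built from exact CCCCAAAA units with an optional trailing CCCC, and the variable names are made up.

```shell
# The three example sequences from the question.
reads='CCCCAAAACCCCAAAACCCCGATGATCATGGATC
CCCCAAAACCCCGATGGCATCATTCA
CCCCAAAACCCCAAAATATGTTGCTACTAG'

# Ungrouped: * binds only to the final A, so the pattern means
# "CCCCAAA followed by zero or more A" and removes one unit at most.
ungrouped=$(printf '%s\n' "$reads" | sed -E 's/^CCCCAAAA*//')

# Grouped: (CCCCAAAA)* repeats the whole unit, and (CCCC)? allows
# a final C block with no A block after it (as in seq2).
grouped=$(printf '%s\n' "$reads" | sed -E 's/^(CCCCAAAA)*(CCCC)?//')

printf 'ungrouped:\n%s\n\ngrouped:\n%s\n' "$ungrouped" "$grouped"
```

The ungrouped version leaves CCCC... at the start of every read, which is why it had to be run several times; the grouped version removes the whole repeat in one pass.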

written 4.1 years ago by igor