Question: awk to clip off telomeres
rm16 wrote 2.6 years ago:

I want to use the Mac OS X terminal to clip repetitive sequences from the beginning of each sequence in a FASTA file. For example, I would like to make the following file:

>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCATGGCATCATTCA
>seq3
CCCCAAAACCCCATGTTGCTACTAG

become:

>seq1
ATGATCATGGATC
>seq2
ATGGCATCATTCA
>seq3
ATGTTGCTACTAG

by clipping off the CCCCAAAACCCC at the beginning of each sequence. Is there a way I can do this in the OSX terminal?

Tags: awk, terminal, osx, fasta, telomere

Why don't you use something like fastx? http://hannonlab.cshl.edu/fastx_toolkit/index.html

written 2.6 years ago by Benn

That seems pretty handy. Thank you for the tip. I'll try it out.

written 2.6 years ago by rm16
coleman_jonathan (European Union) wrote 2.6 years ago:

This seems like more of a sed query than an awk one. Specific solution to your query below:

sed -e 's/^CCCCAAAACCCC//' inputfile.txt > outputfile.txt

(This removes CCCCAAAACCCC from the beginning of any line that starts with it. FASTA header lines begin with > and so are left unchanged.)
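Since the question asked about awk, here is a minimal sketch of the same fixed-prefix clipping in plain awk. It skips header lines explicitly instead of relying on them not matching. The file names seqs.fa and clipped.fa are made up for illustration, and the sample input is just the example from the question.

```shell
# Hypothetical sample input, copied from the example in the question.
cat > seqs.fa <<'EOF'
>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCATGGCATCATTCA
>seq3
CCCCAAAACCCCATGTTGCTACTAG
EOF

# Header lines (starting with ">") are printed unchanged.
# On every other line, sub() deletes the prefix if the line starts with it.
awk '/^>/ { print; next } { sub(/^CCCCAAAACCCC/, ""); print }' seqs.fa > clipped.fa

cat clipped.fa
```

Like the sed command, this works line by line, so in a wrapped FASTA file only a line that itself starts with the prefix is clipped.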

The general solution (which I imagine is what you really want) is harder because you need to set criteria for what the sequence can comprise. For example, if you just wanted to strip the first 12 nucleotides from your sequences, the following would work:

sed -e 's/^[ACGT]\{12\}//' inputfile.txt > outputfile.txt

...but I imagine that is too simplistic?

Either way, you should check out sed substitutions and regular expressions.
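As one possible middle ground between the two commands above, here is a sketch that clips a variable number of leading repeats. It assumes the repeat unit is CCCCAAAA and that a run of up to four trailing Cs belongs to the telomere; both are guesses based on the example in the question, not something the original poster confirmed. The file names are hypothetical.

```shell
# Hypothetical sample input: one, two, and zero leading repeat units.
cat > telo.fa <<'EOF'
>seq1
CCCCAAAACCCCATGATCATGGATC
>seq2
CCCCAAAACCCCAAAACCATGGCATCATTCA
>seq3
ATGTTGCTACTAG
EOF

# -E enables extended regular expressions (supported by both BSD sed on OS X and GNU sed).
# /^>/! restricts the substitution to lines that are not headers.
# The pattern needs at least one full CCCCAAAA unit, then up to four more Cs.
sed -E '/^>/!s/^(CCCCAAAA)+C{0,4}//' telo.fa > telo_clipped.fa

cat telo_clipped.fa
```

With this input, seq1 and seq2 are clipped back to ATG... and seq3 is left alone because it has no leading repeat. Note that a sequence whose real first bases are C would be over-clipped, and a partial unit ending in A (such as CCCCAA) would be left in place, so the pattern would need adjusting for real data.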


That sounds like a good starting place. I think I can adapt this. Thanks a lot!

written 2.6 years ago by rm16
Farbod (Toronto) wrote 2.6 years ago:

Hi,

Have you tried bioawk?

https://github.com/lh3/bioawk

shenwei356 (China) wrote 2.6 years ago:

You can also try SeqKit, using its subseq command.

To keep the region from the 13th base to the last base (-1):

seqkit subseq -r 13:-1 seq.fa > out.fa