Question

Selecting the last 100 nt from sequences

0

3.7 years ago

far.zi ▴ 10

Hi,

I have a FASTA file containing about 500 intron sequences. How can I extract just the last 100 bases of each sequence using awk on the command line?

Thanks, Farid

RNA-Seq • 1.0k views

ADD COMMENT • link 3.7 years ago by far.zi ▴ 10

2

Use bioawk or substr in awk with pre-calculated sequence lengths (you can use length() for that). Read: https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

ADD REPLY • link 3.7 years ago by Ram 43k

0

Thanks RamRS. I think this command works when I know the length of the introns. But I have different lengths and I want the last 100 bases from each sequence.

ADD REPLY • link 3.7 years ago by far.zi ▴ 10

1

You could use length($seq) in bioawk to calculate the length on the fly. I don't see why you need to know the length before you start the entire operation. It just needs to be calculated before the substr step.
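As a rough sketch of that idea in plain awk rather than bioawk, assuming single-line FASTA: length() is evaluated for each sequence as it is read, so nothing has to be known in advance. The function name `last_n_single_line` and the `n` variable are made up for illustration.

```shell
# Hypothetical helper: print each header unchanged and the last n bases
# (default 100) of each sequence line.
# Assumes every sequence sits on exactly one line.
last_n_single_line() {
    awk -v n="${2:-100}" '
        /^>/ { print; next }                      # headers pass through untouched
        {
            len = length($0)                      # computed on the fly, per sequence
            start = (len > n) ? len - n + 1 : 1   # sequences shorter than n are kept whole
            print substr($0, start)
        }
    ' "$1"
}
```

Usage would be something like `last_n_single_line introns.fa 100`, where `introns.fa` is a placeholder file name.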

ADD REPLY • link 3.7 years ago by Ram 43k

0

Thanks everyone. The sed -Ee 's/^.*(.{100})$/\1/' file.fasta worked great. Appreciate it.

ADD REPLY • link updated 3.7 years ago by Ram 43k • written 3.7 years ago by far.zi ▴ 10

0

Please accept answers that worked for you. You can accept more than one answer if they all work.

ADD REPLY • link 3.7 years ago by Ram 43k

score 2 · Answer 1 · 2020-08-19

With seqkit:

$ seqkit subseq -r -100:-1 input.fa

If each FASTA sequence is on a single line, with awk:

$ awk -v OFS="\n" '{getline seq} {print $0, substr(seq, length(seq)-99)}' test.fa

If each FASTA sequence is on a single line, with sed:

$ sed -n '/^>/p;/^>/! s/.*\(.\{100\}\)/\1/p' test.fa

score 1 · Answer 2 · 2020-08-19

1

3.7 years ago

zorbax ▴ 610

Maybe this will work for you, unless you have IDs longer than 100 characters.

perl -pe 'if(/\>/){s/\n/\t/}; s/\n//; s/\>/\n\>/' file.fasta | sed -Ee '/^$/d ; s/^.*(.{100})$/\1/'

This puts each FASTA record on a single line, removes the empty lines, and keeps the last 100 characters of each line.

ADD COMMENT • link 3.7 years ago by zorbax ▴ 610

1

You're making another assumption here: that the FASTA entries are single-line. If you have multi-line FASTA, this script won't work. Plus, this script cannot handle empty lines. It would also look a lot cleaner with extended regex sed:

sed -Ee 's/^.*(.{100})$/\1/' file.fasta
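For multi-line FASTA with possible empty lines, one option is to join each record's lines before trimming. The following is only a sketch under those assumptions, written in plain awk; the names `last_n_multiline`, `emit`, and `n` are hypothetical. It prints headers unchanged, whatever their length.

```shell
# Hypothetical helper: last n bases (default 100) of each record,
# for FASTA files whose sequences may span several lines.
last_n_multiline() {
    awk -v n="${2:-100}" '
        # Print the buffered record: header, then the tail of the joined sequence.
        function emit() {
            if (header != "") {
                len = length(seq)
                print header
                print substr(seq, (len > n) ? len - n + 1 : 1)
            }
        }
        /^>/      { emit(); header = $0; seq = ""; next }  # new record starts
        /^[ \t]*$/ { next }                                # skip empty lines
                  { seq = seq $0 }                         # join wrapped sequence lines
        END       { emit() }                               # flush the final record
    ' "$1"
}
```

Records shorter than n bases are printed whole rather than dropped.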

ADD REPLY • link 3.7 years ago by Ram 43k

0

You're right, I modified my answer.

ADD REPLY • link 3.7 years ago by zorbax ▴ 610

1

Cool, thanks for following up.

ADD REPLY • link 3.7 years ago by Ram 43k