Selecting the last 100 nt from sequences
3.7 years ago
far.zi ▴ 10

Hi,

I have a FASTA file containing about 500 intron loci. How can I keep just the last 100 bases of each sequence using an awk command line?

Thanks, Farid

RNA-Seq • 1.0k views

Use bioawk, or substr in awk with pre-calculated sequence lengths (you can use length() for that). Read: https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html


Thanks RamRS. I think this command works when I know the length of the introns. But I have different lengths and I want the last 100 bases from each sequence.


You could use length($seq) in bioawk to calculate the length on the fly. I don't see why you need to know the length before you start the entire operation. It just needs to be calculated before the substr step.
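The same idea can be sketched in plain awk, without bioawk: join the lines of each record, then take the tail with substr once the record is complete, so the length is computed per sequence. This is only a sketch; the function name last_n_bases and the helper emit are made up for illustration, and it assumes a regular FASTA file whose headers start with ">". Sequences shorter than N are printed whole.

```shell
# last_n_bases N FILE: print each FASTA record with only its last N bases.
# Hypothetical helper; handles single-line and multi-line (wrapped) FASTA.
last_n_bases() {
    awk -v n="$1" '
        # Print the stored header and the tail of the accumulated sequence.
        # A start position below 1 makes substr return the whole string.
        function emit() {
            if (header != "") {
                print header
                print substr(seq, length(seq) - n + 1)
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }  # a new record starts
        /^$/ { next }                                 # skip empty lines
             { seq = seq $0 }                         # join wrapped sequence lines
        END  { emit() }                               # do not forget the last record
    ' "$2"
}
```

Usage would look like `last_n_bases 100 introns.fa > introns_last100.fa`, where both file names are placeholders.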


Thanks everyone. The sed -Ee 's/^.*(.{100})$/\1/' file.fasta worked great. Appreciate it.


Please accept answers that worked for you. You can accept more than one answer if they all work.

3.7 years ago

with seqkit:

$ seqkit subseq -r -100:-1 input.fa

If each FASTA sequence is on a single line, with awk:

$ awk -v OFS="\n" '{getline seq; print $0, substr(seq, length(seq)-99)}' test.fa

If each FASTA sequence is on a single line, with sed (sequences shorter than 100 bases are not printed):

$ sed -n '/^>/p;/^>/! s/.*\(.\{100\}\)/\1/p' test.fa
3.7 years ago
zorbax ▴ 610

Maybe this will work for you, unless you have IDs longer than 100 characters.

perl -pe 'if(/\>/){s/\n/\t/}; s/\n//; s/\>/\n\>/' file.fasta | sed -Ee '/^$/d ; s/^.*(.{100})$/\1/'

This puts each FASTA record on a single line, removes the empty lines, and keeps the last 100 characters of each line.


You're making another assumption here: that the FASTA entries are single-line. If you have multi-line FASTA, this script won't work. Plus, this script cannot handle empty lines. It would also look a lot cleaner with extended regex sed:

sed -Ee 's/^.*(.{100})$/\1/' file.fasta
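One caveat with the command above: the substitution is applied to every line, so a header longer than 100 characters gets truncated as well. A minimal sketch that leaves header lines alone, still assuming one sequence line per record (the function name tail100_skip_headers is made up for illustration):

```shell
# Keep only the last 100 characters of sequence lines.
# Lines starting with ">" are passed through unchanged, and lines
# shorter than 100 characters do not match, so they are printed as-is.
tail100_skip_headers() {
    sed -E '/^>/!s/^.*(.{100})$/\1/' "$1"
}
```

It would be called as `tail100_skip_headers file.fasta`.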

You're right, I modified my answer.


Cool, thanks for following up.
