extract sequences from fasta starting with a specific nucleotide
2
0
Entering edit mode
8.0 years ago
Chris ▴ 30

Hi, have a fasta file of 500 sequences, I want to extract the sequences starting with a specific nucleotide in the example A. Can anyone help?

E.g.

>seq_1
ACACACCGCTTCTAGCTG
>seq_2
ACAGGCAGAATTCTACAAGGA
>seq_3
CAAATATAATGACTATGGAATACC
>seq_4
CAATCGCCCGCTCACCTAGGTCT
>seq_5-493
TAACAGGCAGAATTCTACAAGGAC

Desired output:

>seq_1
ACACACCGCTTCTAGCTG
>seq_2
ACAGGCAGAATTCTACAAGGA

thank you in advance for your help

next-gen RNA-Seq sequence • 3.3k views
ADD COMMENT
0
Entering edit mode

Thank you all for your answers. Both methods are working!

ADD REPLY
2
Entering edit mode
8.0 years ago
venu 7.1k

Something like following should work

grep '^A' -B 1 file.fa | sed '/--/d' > new_file.fa

Update: (Credits - Pierre)

grep '^A' -B 1 --no-group-separator file.fa > new_file.fa
ADD COMMENT
2
Entering edit mode

A yes, grep -B1 ! :-) , you know there is a secret option in grep to remove the double hyphen: --no-group-separator

ADD REPLY
1
Entering edit mode

Aww..this is cool. Thank you Pierre.

ADD REPLY
0
Entering edit mode

--no-group-separator is not available in all implementations of grep. So having a sed/grep -v to exclude separators is a safe bet.

ADD REPLY
0
Entering edit mode

It is not there in man page also. (grep (GNU grep) 2.16).

ADD REPLY
0
Entering edit mode
8.0 years ago

assuming there are only 2 lines per record:

cat input.fa |paste - - | awk  '($2 ~ /^A/)' | tr "\t" "\n"
ADD COMMENT
0
Entering edit mode

What if I want to extract sequences with T at the 10th position? Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 2006 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6