How do I extract the FASTA sequences that start with "G"?
6.0 years ago
bright602 ▴ 50

Hi there,

Suppose I have FASTA sequences like the following:

chr3:181879479-181879497 CGTTCCTCCTGGCGAGAG chr3:181879488-181879506 TACTTATTTCGTTCCTCC chr3:181879507-181879525 GAGGAGTGGGCATGAGGA chr3:181879549-181879567 AACCCTAAATGTCAATTA

How do I extract only the sequences that start with "G", so that the output is:

chr3:181879507-181879525 GAGGAGTGGGCATGAGGA

Thanks a lot.

genome sequence • 889 views

It seems better to work with proper FASTA records, so add a ">" sign before each 'chr'.

Start from the first ">" and read every character up to the next ">". The spaces play the role of newline characters, don't they?

Create and open a new, empty file, and write everything that has been read into it.

Check the first letter after the space, " ". If it is 'G', keep the file with the current output, then continue with the next record.

If it is not 'G', discard the latest output instead of saving it.
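The steps above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the input is whitespace-separated tokens that strictly alternate between a header and a one-line sequence (as in the question), and the function name `filter_g_records` is made up for illustration. Instead of writing and discarding temporary files, it simply prints the records it keeps, FASTA-style with a ">" in front of the header.

```sh
# Hypothetical helper: keep only records whose sequence starts with "G".
# Assumption: tokens alternate header, sequence, header, sequence, ...
# (separated by spaces or newlines), and each sequence is a single token.
filter_g_records() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                n++                              # running token counter across all lines
                if (n % 2 == 1) {                # odd tokens are headers
                    header = $i
                    sub(/^>/, "", header)        # tolerate headers that already carry ">"
                    continue
                }
                if ($i ~ /^G/) {                 # even tokens are sequences
                    print ">" header
                    print $i
                }
            }
        }
    ' "$@"
}

# Usage (file names are placeholders):
#   filter_g_records file > outfile
```

Real FASTA files with spaces in the header or sequences wrapped over several lines would break the alternating-token assumption and need a proper parser.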

6.0 years ago
Daniel ★ 3.9k

I also couldn't resist. But is this just poor formatting on upload, with each space actually being a newline? If so:

grep -B 1 '^G' file >outfile

Otherwise, turn the spaces into newlines first (sed -i 's/ /\n/g' file), then run the grep command above.


True. A sed + grep combination seems even more obvious than a Perl one-liner, and it will also work on a valid FASTA file, if that turns out to be the case:

sed 's/ /\n/g' inFile | grep -B1 '^G' >outFile
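One caveat with the context option: when several matches are not adjacent, GNU grep prints a line containing `--` between the groups, which would end up in the output file (GNU grep can suppress it with `--no-group-separator`). A possible portable alternative is sketched below with awk, which remembers the previous line and prints it together with each matching line. The function name `g_with_header` is made up for illustration, and the sketch assumes one header line followed by one sequence line per record.

```sh
# Hypothetical helper: print each line starting with "G" together with the
# line before it, without the "--" group separators that grep -B1 can emit.
# Assumption: each record is exactly one header line plus one sequence line.
g_with_header() {
    awk '
        /^G/ {
            if (NR > 1) print prev   # the preceding line is taken to be the header
            print
        }
        { prev = $0 }                # remember the current line for the next one
    ' "$@"
}

# Usage (file names are placeholders):
#   tr ' ' '\n' < inFile | g_with_header > outFile
```

Like the grep version, this would also match a header line that happens to start with "G", so it relies on headers starting with "chr" or ">".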
6.0 years ago
  1. That is not FASTA format.

  2. This sounds like homework.

  3. I couldn't resist solving it anyway: perl -lne 'while (/(chr\S+\sG\S+)/g) { print $1 }' file
