Question

Concatenate Sequence Fragments in Multiple Alignment Fasta

0

7.3 years ago

bhanratt ▴ 50

I am using UCSC's multiz 100-species vertebrate multiple alignment FASTA for hg19. The file is refGene.exonAA.fa, available here: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/multiz46way/alignments/

The sequences seem to be broken up into fragments. For example the first sequence is:

>NM_152486.2_hg19_1_13 24 0 0 chr1:861322-861393+
MSKGILQVHPPICDCPGCRISSPV

In this example, NM_152486.2_hg19_1_13 is fragment 1 of 13. Further down the file there are 2_13, 3_13, etc.

I would like to concatenate all 13 fragments into 1 sequence for each refseq ID.

Is there existing software or a script that can perform this task?

sequence • 2.7k views

ADD COMMENT • link updated 7.3 years ago by Chun-Jie Liu ▴ 280 • written 7.3 years ago by bhanratt ▴ 50

score 0 · Answer 1 · 2017-01-10

0

7.3 years ago

Chun-Jie Liu ▴ 280

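You could try: sed -n '/NM_152486\.2_hg19/{n;p}' refGene.fa | tr -d '\n'

This prints the line following each matching header and strips the newlines, so the fragments for that one ID come out as a single sequence.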

ADD COMMENT • link 7.3 years ago by Chun-Jie Liu ▴ 280

0

Thanks for your response. I guess I didn't explain it very well. I need to do this for all IDs and all species, and am just asking whether anyone knows of an existing method. Otherwise I can write one myself.

Thanks though!

ADD REPLY • link 7.3 years ago by bhanratt ▴ 50

0

A bash loop combined with grep and a sed one-liner can handle this. I gave the sed part above.

ADD REPLY • link 7.3 years ago by Chun-Jie Liu ▴ 280
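
For the all-IDs, all-species case asked about in the thread, a single pass in Python may be simpler than looping sed over every ID. The following is only a minimal sketch, not an existing tool: it assumes every header name has the layout <refseq_id>_<species>_<index>_<total> (as in NM_152486.2_hg19_1_13) and that species names contain no underscores. The function names parse_fasta and concatenate_fragments are made up for this example.

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concatenate_fragments(lines):
    """Return {(refseq_id, species): sequence} with fragments joined by index."""
    fragments = defaultdict(dict)
    for header, seq in parse_fasta(lines):
        name = header.split()[0]  # drop length, frame and coordinate fields
        # Assumed layout: <refseq_id>_<species>_<index>_<total>.
        # Splitting from the right keeps the underscore inside "NM_152486.2".
        # A name with fewer than four parts raises ValueError here.
        refseq_id, species, index, _total = name.rsplit("_", 3)
        fragments[(refseq_id, species)][int(index)] = seq
    # Join fragments in numeric order (1, 2, ..., 13), not file order.
    return {
        key: "".join(parts[i] for i in sorted(parts))
        for key, parts in fragments.items()
    }
```

Passing an open file handle for the uncompressed alignment file to concatenate_fragments should give one joined sequence per (RefSeq ID, species) pair, which can then be written back out as FASTA. The whole file is held in memory, so a very large alignment may need a streaming variant instead.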