Question

Collapse multifasta file by specific chromosome names


17 months ago

YOUSEUFS ▴ 30

I have a multifasta file whose headers start with identifiers ('SUBJECT.1', 'SUBJECT.2', etc.) that are shared by several entries, like this:

>SUBJECT.1:1203-2742(+)
AAATTT
>SUBJECT.1:354-700(+)
CCCGGG
>SUBJECT.2:789-2000(+)
GGGCCC
>SUBJECT.2:2012-2742(+)
TTTAAA

How would I extract every sequence associated with each unique identifier and concatenate them to produce an output file that looks like this:

>SUBJECT.1
AAATTTCCCGGG
>SUBJECT.2
GGGCCCTTTAAA

fasta • 586 views



Maybe something along these lines:

1) Simplify headers:

cut -d':' -f1 input.fa > output.fa

2) Concatenate entries with the same ID using seqkit, specifically seqkit concat

Before merging, I'd make 100% sure that the entries in the FASTA file are ordered properly and that you don't have duplicated IDs.
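A rough way to run that sanity check, as a sketch only: it assumes headers of the form >ID:start-end(strand) as in the question, and the file names input.fa, order_report.txt and duplicate_headers.txt are made up for illustration. It flags any fragment whose start coordinate is lower than that of the previous fragment with the same ID, and lists headers that occur more than once.

```shell
# Hypothetical sample input, modelled on the headers in the question.
cat > input.fa <<'EOF'
>SUBJECT.1:1203-2742(+)
AAATTT
>SUBJECT.1:354-700(+)
CCCGGG
>SUBJECT.2:789-2000(+)
GGGCCC
>SUBJECT.2:2012-2742(+)
TTTAAA
EOF

# Report fragments that start before the previous fragment of the same ID.
awk '/^>/ {
    split(substr($0, 2), h, ":")   # h[1] = ID, h[2] = "start-end(strand)"
    split(h[2], c, "-")            # c[1] = start coordinate
    id = h[1]; start = c[1] + 0
    if ((id in last) && start < last[id])
        print "out of order: " $0
    last[id] = start
}' input.fa > order_report.txt
cat order_report.txt

# List full header lines that appear more than once (empty if none do).
grep '^>' input.fa | sort | uniq -d > duplicate_headers.txt
cat duplicate_headers.txt
```

On this sample the report contains the second SUBJECT.1 fragment, because it starts at 354 while the fragment before it starts at 1203. Whether that matters depends on whether file order or coordinate order is the one you want in the merged sequence.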

Reply • 17 months ago by iraun 6.2k


seqkit concat only works when merging two or more files; this is a single file.

Reply • 17 months ago by YOUSEUFS ▴ 30

score 2 · Accepted Answer · 2022-11-15


17 months ago

iraun 6.2k

Then use awk:

cut -d':' -f1 input.fa > output.fa

awk '/^>/ { id = $0; if (!seen[id]++) order[++n] = id; next } { seq[id] = seq[id] $0 } END { for (i = 1; i <= n; i++) print order[i] "\n" seq[order[i]] }' output.fa > output_collapsed.fa

In this example I have assumed that the IDs you want to collapse are the part of each header before the ':'; adapt the code to your own IDs as needed. And as I said, mind the order of the sequences: fragments are concatenated in the order in which they appear in the file.
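If the fragments should be joined by start coordinate rather than by file order, one possible variant is sketched below. It is not part of the answer above and rests on assumptions: headers look like >ID:start-end(strand), IDs contain no ':' or tab, and the file names are hypothetical. It linearizes each record to "ID, start, sequence", sorts by ID and then numerically by start, and concatenates consecutive records that share an ID.

```shell
# Hypothetical sample input, modelled on the headers in the question.
cat > input.fa <<'EOF'
>SUBJECT.1:1203-2742(+)
AAATTT
>SUBJECT.1:354-700(+)
CCCGGG
>SUBJECT.2:789-2000(+)
GGGCCC
>SUBJECT.2:2012-2742(+)
TTTAAA
EOF

# 1) Linearize: one line per record as "ID <tab> start <tab> sequence".
# 2) Sort by ID, then numerically by start coordinate.
# 3) Concatenate the sequences of consecutive records with the same ID.
awk '/^>/ { if (seq != "") print id "\t" start "\t" seq
            split(substr($0, 2), h, ":"); split(h[2], c, "-")
            id = h[1]; start = c[1]; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print id "\t" start "\t" seq }' input.fa |
  sort -t "$(printf '\t')" -k1,1 -k2,2n |
  awk -F'\t' '$1 != prev { if (NR > 1) printf "\n"; printf ">%s\n", $1; prev = $1 }
              { printf "%s", $3 }
              END { if (NR > 0) printf "\n" }' > sorted_collapsed.fa
cat sorted_collapsed.fa
```

Note that on this sample SUBJECT.1 comes out as CCCGGGAAATTT, not AAATTTCCCGGG, because the fragment starting at 354 now precedes the one starting at 1203. IDs are sorted as plain text, so SUBJECT.10 would be listed before SUBJECT.2; the sequence within each ID is unaffected by that.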