Question

How to extract and concatenate FASTA records that match a substring?

18 months ago

YOUSEUFS ▴ 30

I have a list of unique identifiers

identifiers = ['subject_1', 'subject_2']

and a multi-fasta file containing

>CDS::subject_1::123
AAATTT
>CDS::subject_1::354
CCCGGG
>CDS::subject_2::789
GGGCCC
>CDS::subject_2::765
TTTAAA

How would I extract every sequence associated with each unique identifier and concatenate them to form an output file that looks like this?

>subject_1
AAATTTCCCGGG
>subject_2
GGGCCCTTTAAA

fasta python • 584 views

ADD COMMENT • link updated 18 months ago by rpolicastro 13k • written 18 months ago by YOUSEUFS ▴ 30

18 months ago

Pierre Lindenbaum 161k

cat input.fa  | paste - - | sed 's/>CDS:://;s/::[^\t]*//' | awk '{seq[$1]=sprintf("%s%s",seq[$1],$2);} END{for(n in seq) printf(">%s\n%s\n",n,seq[n]);}'

>subject_1
AAATTTCCCGGG
>subject_2
GGGCCCTTTAAA
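
Since the question is tagged python, the same group-and-concatenate idea can also be sketched in Python. This is only a minimal sketch, not part of the pipeline above: the function name `concat_by_identifier` is hypothetical, and it assumes every header is a list of `::`-separated fields, one of which equals the identifier exactly.

```python
def concat_by_identifier(fasta_lines, identifiers):
    """Concatenate sequences whose header contains one of the identifiers.

    Hypothetical helper: assumes headers look like '>CDS::subject_1::123',
    i.e. '::'-separated fields, one of which equals the identifier.
    """
    wanted = set(identifiers)
    merged = {}      # identifier -> list of sequence chunks, in file order
    current = None   # identifier of the record being read, or None if unwanted
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split("::")
            # Exact field match, so 'subject_1' does not also match 'subject_10'.
            current = next((f for f in fields if f in wanted), None)
            if current is not None:
                merged.setdefault(current, [])
        elif current is not None:
            merged[current].append(line)
    return {ident: "".join(chunks) for ident, chunks in merged.items()}


# The example data from the question.
example = """>CDS::subject_1::123
AAATTT
>CDS::subject_1::354
CCCGGG
>CDS::subject_2::789
GGGCCC
>CDS::subject_2::765
TTTAAA
""".splitlines()

identifiers = ["subject_1", "subject_2"]
result = concat_by_identifier(example, identifiers)
for ident, seq in result.items():
    print(f">{ident}\n{seq}")
```

Unlike the awk `for (n in seq)` loop, whose iteration order is not guaranteed, a Python dict keeps the identifiers in the order they first appear in the file.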

ADD COMMENT • link 18 months ago by Pierre Lindenbaum 161k

I'm having trouble getting this to work. Perhaps I should have stated this more clearly: the FASTA file looks more like

>CDS::NC_005291.1:5877-7537(-)
AAATTT
>CDS::NC_005291.1:7650-7800(-)
CCCGGG
>CDS::NC_007641.1:5877-7537(-)
AAATTT
>CDS::NC_007641.1:7650-7800(-)
CCCGGG

which I'm trying to turn into

>NC_005291.1
AAATTTCCCGGG
>NC_007641.1
AAATTTCCCGGG

ADD REPLY • link 18 months ago by YOUSEUFS ▴ 30

Change the sed expression so that it matches the new header format.
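
To illustrate what the header expression has to capture, here is a minimal sketch in Python rather than sed. It assumes every header starts with `>CDS::` and that the identifier runs up to the next colon, which holds for both header styles shown in this thread. The names `HEADER_RE` and `merge_cds` are hypothetical.

```python
import re

# Capture everything between '>CDS::' and the next ':' (assumed header layout).
HEADER_RE = re.compile(r"^>CDS::([^:]+):")


def merge_cds(fasta_text):
    """Return FASTA text with the sequences of each identifier concatenated."""
    merged = {}   # identifier -> list of sequence chunks, in file order
    key = None    # identifier of the record being read, or None if unparsed
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            match = HEADER_RE.match(line)
            key = match.group(1) if match else None
            if key is not None:
                merged.setdefault(key, [])
        elif line and key is not None:
            merged[key].append(line)
    return "".join(f">{k}\n{''.join(v)}\n" for k, v in merged.items())


# The example data from the reply above.
fasta = """>CDS::NC_005291.1:5877-7537(-)
AAATTT
>CDS::NC_005291.1:7650-7800(-)
CCCGGG
>CDS::NC_007641.1:5877-7537(-)
AAATTT
>CDS::NC_007641.1:7650-7800(-)
CCCGGG
"""

output = merge_cds(fasta)
print(output, end="")
```

Records whose header does not match the assumed layout are skipped silently in this sketch; raising an error instead may be safer for real data.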

ADD REPLY • link 18 months ago by Pierre Lindenbaum 161k

score 2 · Accepted Answer · 2022-10-06

18 months ago

rpolicastro 13k

seqkit and csvtk answer

seqkit replace -p ".+::(\S+):.+" -r "\$1" test.fasta |
  seqkit fx2tab |
  csvtk fold -tH -f1 -v2 -s"," |
  sed 's/,//g' |
  seqkit tab2fx
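
To see what the pattern passed to `seqkit replace` captures, the same regular expression can be tried in Python. This is only an illustration: Python's `re` module stands in for seqkit's own regex engine, which is assumed to behave the same way for this pattern, and the helper name `rename` is hypothetical.

```python
import re

# Same pattern as in the seqkit command; the header is given without the leading '>'.
SEQKIT_PATTERN = re.compile(r".+::(\S+):.+")


def rename(header):
    """Replace the whole header with capture group 1, like -r "$1"."""
    return SEQKIT_PATTERN.sub(r"\1", header)


new_style = rename("CDS::NC_005291.1:5877-7537(-)")
old_style = rename("CDS::subject_1::123")
print(new_style)
print(old_style)
```

With the headers from the follow-up comment this yields the bare accession. With the `::`-delimited headers from the original question, the greedy `\S+` keeps a trailing colon (`subject_1:`), so the pattern would need adjusting for that format.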

ADD COMMENT • link 18 months ago by rpolicastro 13k