Question

How to replace/fill "Ns" in fasta with reference file having same coordinates

2.4 years ago

Adnan • 0

Dear community,

I hope you are doing well. As the title asks, is there a way to replace the N bases in a FASTA file with the bases found at the same positions in a reference file?

For example

INPUT

Fasta with Ns

>fasta1
ACTGGCATCATGNNNNACTTTTGACC

Reference Fasta

>reference
ACTGGCATCATGTCAGACTTTTGACC

OUTPUT

>fasta1
ACTGGCATCATG**TCAG**ACTTTTGACC

I would really appreciate any help with this.

Kind regards, Ad

awk sed fasta • 1.8k views

ADD COMMENT • link updated 2.4 years ago by shenwei356 8.4k • written 2.4 years ago by Adnan • 0

You need to explain how your problem differs from simply running: cp ref.fa user.fa

ADD REPLY • link 2.4 years ago by Pierre Lindenbaum 161k

The difference is that cp copies the whole of ref.fa over user.fa, whereas I want to change the "N" regions only. For example, my reference is ACTGGCATCATGTTTTACTTTTGACC and my file is ACTGGCATCATGNNNNACTTTTGACC. I want only the Ns to be replaced by TTTT (as given in the reference), with every other character left unchanged. I hope that answers your question.

ADD REPLY • link 2.4 years ago by Adnan • 0

Is each entry in the "Fasta with Ns" always going to be exactly the same length as the equivalent entry in the "reference fasta"?

ADD REPLY • link 2.4 years ago by i.sudbery 19k

Yes, it is the same, coordinate-wise. However, the length of the N runs may differ across the complete FASTA.

ADD REPLY • link 2.4 years ago by Adnan • 0

cross posted: https://stackoverflow.com/questions/70135133/

ADD REPLY • link 2.4 years ago by Pierre Lindenbaum 161k

score 0 · Answer 1 · 2021-11-28

I assume the query file contains only one short sequence, not contigs/scaffolds or a whole assembly.

Here's a semi-automatic way:

Searching in the reference

 seqkit locate --degenerate --pattern-file test.fasta ref.fasta 
 seqID   patternName     pattern strand  start   end     matched
 reference       fasta1  ACTGGCATCATGNNNNACTTTTGACC      +       7       32      ACTGGCATCATGTCAGACTTTTGACC

Replacing queries with matched sequences

 seqkit replace --by-seq -p ACTGGCATCATGNNNNACTTTTGACC -r ACTGGCATCATGTCAGACTTTTGACC test.fasta
 >fasta1
 ACTGGCATCATGTCAGACTTTTGACC

Regarding "However the length of Ns might be different across the complete FASTA":

If there are lots of records, a script is needed. For every record:

Extract the subsequences around each run of Ns (N+).
Locate those subsequences on the reference.
Return the corresponding subsequences from the reference.
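
The per-record steps above could be sketched in Python roughly as follows. This is only an illustration, not a tested pipeline: the helper names read_fasta and fill_ns are hypothetical, the reference is assumed to fit in memory, only the forward strand is searched, and the first hit on the reference is used. Instead of extracting flanks separately, it treats each run of Ns as a fixed-length wildcard (similar in spirit to the --degenerate search above) and copies reference bases only where the query has N.

```python
import re


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fill_ns(query, reference):
    """Fill the Ns in `query` with the bases from its first match on `reference`.

    Assumption: N is a wildcard for exactly one base, and all non-N bases
    must match the reference exactly. The query is returned unchanged if it
    contains no N or if no match is found.
    """
    if "N" not in query.upper():
        return query
    # Turn every run of Ns into a fixed-length wildcard, e.g. NNNN -> .{4}
    pattern = re.sub(r"N+", lambda m: ".{%d}" % len(m.group()), query.upper())
    hit = re.search(pattern, reference.upper())
    if hit is None:
        return query
    matched = hit.group()
    # Keep the original bases; take the reference base only where the query has N
    return "".join(r if q in "Nn" else q for q, r in zip(query, matched))


if __name__ == "__main__":
    queries = read_fasta(">fasta1\nACTGGCATCATGNNNNACTTTTGACC\n")
    ref = read_fasta(">reference\nACTGGCATCATGTCAGACTTTTGACC\n")[0][1]
    for name, seq in queries:
        print(">" + name)
        print(fill_ns(seq, ref))
    # prints:
    # >fasta1
    # ACTGGCATCATGTCAGACTTTTGACC
```

If every record really has the same length and coordinates as its reference entry, as stated in the comments, the search step is not needed and a position-by-position copy of the reference base wherever the query has N would be enough. For very long sequences a regex search like this may be slow, so a dedicated tool is probably a better choice there.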