How to replace/fill "Ns" in fasta with reference file having same coordinates
7 weeks ago

Dear community,

I hope you are doing well. As the title says, is there a way to fill in or replace the Ns in a FASTA file using a reference file that has the same coordinates?

For example

# INPUT

Fasta with Ns

>fasta1
ACTGGCATCATGNNNNACTTTTGACC


Reference Fasta

>reference
ACTGGCATCATGTCAGACTTTTGACC


# OUTPUT

>fasta1
ACTGGCATCATG**TCAG**ACTTTTGACC


I would really appreciate any help in this regard.


You have to explain how your problem is different from `cp ref.fa user.fa`.


It is different because `cp` would copy the whole of ref.fa into user.fa, whereas I want to change only the "N" regions. For example, if my reference is ACTGGCATCATGTTTTACTTTTGACC and my file is ACTGGCATCATGNNNNACTTTTGACC, I want only the Ns replaced by TTTT (as specified in the reference), with no other characters changed. I hope that answers your question.


Is each entry in the "Fasta with Ns" always going to be exactly the same length as the equivalent entry in the "reference fasta"?


Yes, it is the same, coordinate-wise. However, the length of the N runs may differ across the complete FASTA.

7 weeks ago

I assume the query file has only one short sequence, neither contigs/scaffolds nor the whole assembly.

Here's a semi-automatic way:

1. Searching in the reference

 seqkit locate --degenerate --pattern-file test.fasta ref.fasta
seqID   patternName     pattern strand  start   end     matched
reference       fasta1  ACTGGCATCATGNNNNACTTTTGACC      +       7       32      ACTGGCATCATGTCAGACTTTTGACC

2. Replacing queries with matched sequences

 seqkit replace --by-seq -p ACTGGCATCATGNNNNACTTTTGACC -r ACTGGCATCATGTCAGACTTTTGACC test.fasta
>fasta1
ACTGGCATCATGTCAGACTTTTGACC
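
For readers without seqkit, the two steps above can be roughly approximated in Python. This is only a sketch of the idea (treat each N as a wildcard, search the reference, take the matched text), not a description of how seqkit works internally. The function name `locate_degenerate` is made up for illustration, it handles N only (no other IUPAC codes), and it searches the forward strand only.

```python
import re


def locate_degenerate(query, reference):
    """Find the first forward-strand hit of `query` in `reference`.

    Every N in the query matches any single character of the reference.
    Returns (start, end, matched) with 1-based inclusive coordinates,
    or None when there is no hit.
    """
    # Build a regex: N becomes a wildcard, every other base must match itself.
    pattern = "".join("." if base == "N" else re.escape(base)
                      for base in query.upper())
    hit = re.search(pattern, reference.upper())
    if hit is None:
        return None
    return hit.start() + 1, hit.end(), hit.group(0)


query = "ACTGGCATCATGNNNNACTTTTGACC"
reference = "ACTGGCATCATGTCAGACTTTTGACC"

result = locate_degenerate(query, reference)
if result is not None:
    start, end, matched = result
    # The matched reference text is the query with its Ns filled in.
    print(">fasta1")
    print(matched)
```

Because the matched text comes straight from the reference, any mismatch outside the N run makes the search fail rather than silently overwrite the query.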


"However the length of Ns might be different across the complete FASTA"

If there are lots of records, a script is needed. For every record:

1. Extract the subsequences around each run of Ns (N+).
2. Locate those subsequences on the reference.
3. Return the matching subsequences from the reference.
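
Since the asker confirmed in the comments that each record has exactly the same length and coordinates as its reference, a simpler position-wise fill may be enough for the scripted case. Note that this swaps the flank search described in the steps above for a direct coordinate lookup, so it is only valid under that equal-length assumption. The function name `fill_ns` is hypothetical, and pairing each record with the right reference sequence is left to the caller.

```python
def fill_ns(query, reference):
    """Replace every N (or n) in `query` with the reference base at the same position.

    Assumes both sequences share coordinates, i.e. have equal length.
    All non-N characters of the query are kept unchanged.
    """
    if len(query) != len(reference):
        raise ValueError(
            "query and reference differ in length: "
            f"{len(query)} vs {len(reference)}"
        )
    return "".join(
        ref_base if query_base in ("N", "n") else query_base
        for query_base, ref_base in zip(query, reference)
    )


# Example sequences taken from the question.
query = "ACTGGCATCATGNNNNACTTTTGACC"
reference = "ACTGGCATCATGTCAGACTTTTGACC"

print(">fasta1")
print(fill_ns(query, reference))
```

This handles any number of N runs of any length in one pass, which covers the case where the length of the Ns varies across the file. If the lengths ever differ, the function raises an error instead of guessing an alignment.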