How to replace/fill "Ns" in fasta with reference file having same coordinates
8 weeks ago
Adnan • 0

Dear community,

I hope you are doing well. As the title asks: is there a way to fill in or replace the Ns in a FASTA file using a reference file that has the same coordinates?

For example

# INPUT

Fasta with Ns

>fasta1
ACTGGCATCATGNNNNACTTTTGACC


Reference Fasta

>reference
ACTGGCATCATGTCAGACTTTTGACC


## OUTPUT

>fasta1
ACTGGCATCATG**TCAG**ACTTTTGACC


I would really appreciate any help with this.

Kind regards, Ad

awk sed fasta • 584 views

You have to explain how your problem differs from cp ref.fa user.fa


It is different because cp copies the whole of ref.fa into user.fa, whereas I only want to change the "N" regions. For example, my ref file is ACTGGCATCATGTTTTACTTTTGACC and my user file is ACTGGCATCATGNNNNACTTTTGACC. I want only the Ns to be replaced by TTTT (as specified in the reference), with no other characters changed. I hope this answers your question.


Is each entry in the "Fasta with Ns" always going to be exactly the same length as the equivalent entry in the "reference fasta"?


Yes, it's the same, coordinate-wise. However, the length of the N runs might differ across the complete FASTA.

8 weeks ago

I assume the query file contains only one short sequence, not contigs/scaffolds or a whole assembly.

Here's a semi-automatic way:

1. Searching in the reference

 seqkit locate --degenerate --pattern-file test.fasta ref.fasta
seqID   patternName     pattern strand  start   end     matched
reference       fasta1  ACTGGCATCATGNNNNACTTTTGACC      +       7       32      ACTGGCATCATGTCAGACTTTTGACC

2. Replacing queries with matched sequences

 seqkit replace --by-seq -p ACTGGCATCATGNNNNACTTTTGACC -r ACTGGCATCATGTCAGACTTTTGACC test.fasta
>fasta1
ACTGGCATCATGTCAGACTTTTGACC
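
Since the asker confirmed in the comments that each record has the same length and coordinates as its reference entry, the Ns could also be filled by position without searching at all. Below is a minimal Python sketch of that idea; the function name `fill_by_position` is hypothetical, and it assumes both sequences are already loaded as plain strings (FASTA parsing is left out).

```python
def fill_by_position(query: str, reference: str) -> str:
    """Replace every N in `query` with the base at the same position in `reference`.

    Assumption: both sequences share the same coordinates, so they must have
    equal length. All non-N characters of the query are left untouched.
    """
    if len(query) != len(reference):
        raise ValueError("query and reference must have the same length")
    return "".join(
        ref_base if query_base in "Nn" else query_base
        for query_base, ref_base in zip(query, reference)
    )


if __name__ == "__main__":
    # Sequences taken from the question's example.
    print(fill_by_position("ACTGGCATCATGNNNNACTTTTGACC",
                           "ACTGGCATCATGTCAGACTTTTGACC"))
```

This only works when the two records really are aligned base for base; if the reference is longer than the query, the search-based approach is the safer choice.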


"However the length of Ns might be different across the complete FASTA"

If there are lots of records, a script is needed. For every record:

1. Extract the subsequences around each run of Ns (N+).
2. Locate those subsequences on the reference.
3. Return the matching subsequences from the reference.
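
The three steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a tested pipeline: the function name `fill_n_runs` and the `flank` parameter are hypothetical, sequences are assumed to be uppercase strings already read from the FASTA files, and an N run is left unchanged whenever its flanking window does not match the reference exactly once.

```python
import re


def fill_n_runs(query: str, reference: str, flank: int = 10) -> str:
    """Fill each run of Ns in `query` using the matching region of `reference`."""
    filled = list(query)
    for run in re.finditer(r"N+", query):
        start, end = run.span()
        left = max(0, start - flank)
        right = min(len(query), end + flank)

        # Step 1: take the subsequence around the N run; every N is a wildcard.
        window = query[left:right]
        pattern = "".join("." if base == "N" else re.escape(base) for base in window)

        # Step 2: locate the window on the reference (lookahead finds overlaps too).
        hits = list(re.finditer(f"(?=({pattern}))", reference))
        if len(hits) != 1:
            continue  # no hit or an ambiguous hit: keep the Ns as they are

        # Step 3: copy the reference bases that sit where the N run was.
        matched = hits[0].group(1)
        offset = start - left
        filled[start:end] = matched[offset:offset + (end - start)]
    return "".join(filled)


if __name__ == "__main__":
    # The reference here is padded on both sides to show that offsets are handled.
    ref = "GGGGGG" + "ACTGGCATCATGTCAGACTTTTGACC" + "TTT"
    print(fill_n_runs("ACTGGCATCATGNNNNACTTTTGACC", ref))
```

A larger `flank` makes ambiguous hits less likely but requires more exactly matching sequence around each N run.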