How to replace/fill "Ns" in fasta with reference file having same coordinates
2.4 years ago
Adnan • 0

Dear community,

I hope you are doing well. As the title asks, is there a way to replace the N or Ns in a FASTA file with the bases found at the same coordinates in a reference file?

For example

INPUT

Fasta with Ns

>fasta1
ACTGGCATCATGNNNNACTTTTGACC

Reference Fasta

>reference
ACTGGCATCATGTCAGACTTTTGACC

OUTPUT

>fasta1
ACTGGCATCATG**TCAG**ACTTTTGACC

I would really appreciate any help with this.

Kind regards, Ad

awk sed fasta • 1.8k views

You have to explain how your problem is different from `cp ref.fa user.fa`.


It is different because that command copies the whole of ref.fa into user.fa, whereas I want to change the "N" regions only. For example, my ref file is ACTGGCATCATGTTTTACTTTTGACC and my user file is ACTGGCATCATGNNNNACTTTTGACC. I want only the Ns to be replaced by TTTT (as specified in the reference), with no other characters changed. I hope that answers your query.


Is each entry in the "Fasta with Ns" always going to be exactly the same length as the equivalent entry in the "Reference Fasta"?


Yes, it's the same, coordinate-wise. However, the length of the N runs might differ across the complete FASTA.

2.4 years ago

I assume the query file has only one short sequence, rather than contigs/scaffolds or a whole assembly.

Here's a semi-automatic way:

  1. Searching in the reference

     seqkit locate --degenerate --pattern-file test.fasta ref.fasta 
     seqID   patternName     pattern strand  start   end     matched
     reference       fasta1  ACTGGCATCATGNNNNACTTTTGACC      +       7       32      ACTGGCATCATGTCAGACTTTTGACC
    
  2. Replacing queries with matched sequences

     seqkit replace --by-seq -p ACTGGCATCATGNNNNACTTTTGACC -r ACTGGCATCATGTCAGACTTTTGACC test.fasta
     >fasta1
     ACTGGCATCATGTCAGACTTTTGACC
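
The two seqkit steps above can also be imitated in a few lines of Python. This is only a rough sketch, not how seqkit works internally: the function name `fill_from_reference` is made up for illustration, it treats only N as a wildcard (no other IUPAC codes), searches the forward strand only, and takes the first hit in the reference.

```python
import re

def fill_from_reference(query: str, reference: str) -> str:
    """Fill the Ns in query with the bases of its first match in reference."""
    # Build a regex from the query: every N matches any single base.
    pattern = "".join("." if base in "Nn" else re.escape(base) for base in query)
    match = re.search(pattern, reference, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("query not found in reference")
    hit = match.group(0)
    # Keep the query's own bases; borrow from the reference only at N positions.
    return "".join(r if q in "Nn" else q for q, r in zip(query, hit))

print(fill_from_reference("ACTGGCATCATGNNNNACTTTTGACC",
                          "ACTGGCATCATGTCAGACTTTTGACC"))
# prints ACTGGCATCATGTCAGACTTTTGACC
```

Because the query is searched for inside the reference, this also works when the reference is longer than the query, as long as the non-N bases match exactly.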
    

> However the length of Ns might be different across the complete FASTA

If there are lots of records, a script is needed. For every record:

  1. Extract the subsequences flanking each run of Ns (N+).
  2. Locate those subsequences on the reference.
  3. Return the matching subsequences from the reference.
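
Since the asker confirmed that every record has the same length and coordinates as its reference entry, the flank search in the steps above could be skipped in favour of a plain position-by-position fill. Below is a minimal sketch of that simpler substitute, not the flank-based procedure itself. It assumes (this is not stated in the thread) that the records in the two files are paired by order; the helper names `read_fasta`, `fill_ns` and `fill_records` are hypothetical.

```python
from itertools import zip_longest

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)

def fill_ns(query, reference):
    """Replace each N in query with the reference base at the same position."""
    if len(query) != len(reference):
        raise ValueError("query and reference lengths differ")
    return "".join(r if q in "Nn" else q for q, r in zip(query, reference))

def fill_records(query_lines, ref_lines):
    """Pair records by order (an assumption) and fill the Ns in each one."""
    pairs = zip_longest(read_fasta(query_lines), read_fasta(ref_lines))
    for q_rec, r_rec in pairs:
        if q_rec is None or r_rec is None:
            raise ValueError("files hold different numbers of records")
        yield q_rec[0], fill_ns(q_rec[1], r_rec[1])

# In-memory stand-ins for the two files from the question.
query = [">fasta1", "ACTGGCATCATGNNNN", "ACTTTTGACC"]
ref = [">reference", "ACTGGCATCATGTCAG", "ACTTTTGACC"]
for name, seq in fill_records(query, ref):
    print(f">{name}\n{seq}")
```

For real data, open file handles can be passed in place of the lists, e.g. `fill_records(open("test.fasta"), open("ref.fasta"))`. N runs of any length are handled, because every position is checked independently.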