Question: How to append two fasta files?
0
gravatar for Dineshkumar K
7 days ago by
Kasaragod, Kerala, India
Dineshkumar K10 wrote:

I have two fasta files as shown below,

File:1

Contig_1:90600-91187 AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGGAATTGATGACGGTC Contig_98:35323-35886 GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG Contig_24:26615-28387 GCTGCGGCGCTGATCCTGGCGGCCCGCGCCGAGGAGATCGCCCGTTTGGAGCGCGGCGAA

File:2

Contig_1:90600-91187
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT
Contig_24:26615-28387
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG

Both files are having the same fasta headers but vary in terms of sequences. Hence, I need to replace File:2 sequences in File:1 as shown below,

Expected outcome:

Contig_1:90600-91187
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT
Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCG
Contig_24:26615-28387
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG

I tried with cat command, but it is concatenating all the sequences instead of replacing the sequences as I mentioned above.

ADD COMMENTlink modified 7 days ago by Corentin290 • written 7 days ago by Dineshkumar K10
0
gravatar for Corentin
7 days ago by
Corentin290
Corentin290 wrote:

Hi,

You can use "samtools faidx"

samtools faidx file1 Contig_1:90600-91187  > result.fa

samtools faidx file2 Contig_98:35323-35886 >> result.fa
samtools faidx file2 Contig_24:26615-28387 >> result.fa

However, the name format "contig:start-stop" could cause some issues, faidx might interpret it as a command to extract just a portion of the sequences. You can rename the contigs as "name_start_stop" to avoid this.

ADD COMMENTlink written 7 days ago by Corentin290

Thank you Corentin. However, I have large fasta file, in that case it would be tedious to implement samtools faidx.

ADD REPLYlink modified 7 days ago • written 7 days ago by Dineshkumar K10

There is an option to give a list of regions to faidx:

-r, --region-file FILE   File of regions.  Format is chr:from-to. One per line.

You could create two files, one with a list of regions from file1, the other from file2 and run two faidx commands

ADD REPLYlink written 7 days ago by Corentin290

Thank you Corentin, Is it possible to extract the sequences based on the reverse order coordinates by samtools faidx. For example, Instead of proper order

Contig_98:35323-35886

If it like this,

Contig_98:35886-35323

Will it work?

ADD REPLYlink modified 7 days ago • written 7 days ago by Dineshkumar K10
1

I am not exactly sure what you want to do, if you want the reverse complement, faidx also have an option for that:

   -i, --reverse-complement
    Output the sequence as the reverse complement. `When this option is used, “/rc” will be appended to the sequence names. To turn this off or change the string appended, use the --mark-strand option`.

You can see the documentation here : http://www.htslib.org/doc/samtools.html

ADD REPLYlink written 7 days ago by Corentin290

Dear Corentin, If the coordinates are in reverse order, will samtools work? Let me put in this way, I have the series of coordinates to be extracted from the given data,

Contig_98:35323-35886
Contig_1:90600-91187
Contig_24:26615-28387
Contig_91:690-450

The above mentioned coordinates are well formatted except the last one Contig_91:690-450 . Because, the last coordinate is in reverse order. In that case will samtool works?

ADD REPLYlink written 7 days ago by Dineshkumar K10
1

Apparently faidx is giving this error if the coordinates are reversed:

[faidx] Zero length sequence:
ADD REPLYlink written 7 days ago by Corentin290

Thank you Corentin for your time and help. I have implemented your suggestions and all works well.

ADD REPLYlink written 6 days ago by Dineshkumar K10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1939 users visited in the last hour