How to create a FASTA file from a list of DNA sequences
20 months ago
Alex S ▴ 20

I have a file with the following structure:

Lcn.Chr1:75500000-95000000:1393900-1393947  gaaatgatttaattagattatttgaggtttgatgattaggattagag 1648480
Lcn.Chr1:75500000-95000000:1393980-1394025  AAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC   1648480
Lcn.Chr1:75500000-95000000:1394080-1394127  caccccaacttttataattgctatttaaattaattaattagtattgt 1648480

I've extracted the sequences using cut -f 2, and now I need to convert them to FASTA format so I can use them as a database for a BLAST analysis. Any tips on how to add a FASTA header to each sequence? The IDs could simply be numbers: 001, 002, 003, ...

linux fasta • 733 views
20 months ago
Dave Carlson ★ 1.7k

Using your example file (let's call it seqs.txt):

while read -r id seq _; do printf ">%s\n%s\n" "$id" "$seq"; done < seqs.txt

produces:

>Lcn.Chr1:75500000-95000000:1393900-1393947
gaaatgatttaattagattatttgaggtttgatgattaggattagag
>Lcn.Chr1:75500000-95000000:1393980-1394025
AAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC
>Lcn.Chr1:75500000-95000000:1394080-1394127
caccccaacttttataattgctatttaaattaattaattagtattgt

That's a pretty ugly solution, but it should work.

20 months ago
Shred ★ 1.4k

In Python 3:

import sys

with open(sys.argv[1], 'r') as sequences:
    # Number records from 1 so the IDs come out as 001, 002, 003, ...
    for idx, line in enumerate(sequences, start=1):
        print(f">{idx:03d}")
        # The sequence is the second whitespace-separated column
        print(line.split()[1])

Launch it as

python3 script.py your_input_file > output.fasta

It produces

>001
gaaatgatttaattagattatttgaggtttgatgattaggattagag
>002
AAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC
..
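
If you later want to trace BLAST hits back to their coordinates, one option is to keep the original name in the header as a description after the numeric ID (FASTA readers generally treat everything up to the first space as the ID). Below is a minimal sketch of that idea, assuming whitespace-separated columns as in the example. The function name tsv_to_fasta, the width parameter, and the inline sample rows are only for illustration and are not part of either answer above.

```python
def tsv_to_fasta(lines, width=3):
    """Yield FASTA lines: a zero-padded numeric ID followed by the original name."""
    count = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Skip blank or malformed lines instead of failing on them
            continue
        count += 1
        name, seq = fields[0], fields[1]
        # The header is ">ID description"; the ID ends at the first space
        yield f">{count:0{width}d} {name}"
        yield seq


# Two rows copied from the question, plus a blank line to show it is skipped
sample = [
    "Lcn.Chr1:75500000-95000000:1393900-1393947\tgaaatgatttaattagattatttgaggtttgatgattaggattagag\t1648480\n",
    "\n",
    "Lcn.Chr1:75500000-95000000:1393980-1394025\tAAATATGAACTCAGGGTTTTGAGATAAGCCAAACAACGATTCCAC\t1648480\n",
]
records = list(tsv_to_fasta(sample))
print("\n".join(records))
```

To run it on a real file, pass an open file handle instead of the sample list, e.g. tsv_to_fasta(open("seqs.txt")), and write the yielded lines to output.fasta.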