Adding gene names to fasta file
3.3 years ago
mm2568 • 0

I have file 1 (a FASTA file):

>dmel_X type=golden_path_region; loc=X:2270008..2271068; ID=X; dbxref=GB:AE014298,GB:AE014298,REFSEQ:NC_004354;   release=r6.37; species=Dmel;
CTCTTTGTTAGCCTACGCTTTCTGCTGAGTTTGTTATTTTTGTCTGCTCCCCACAAGGATATTGTTACAGAGAAAAAGCT
CGAATTGAAGGGAAAATGGAGACAAATAAGAAAACCCATGACAAAGAGGAAAGTTTCAAATATGGGCAATCGAAAAAATC
GAGAAGTGAGCCAATTTTTTTTTCGCCGAGGCTCCACTGTTCCCAGCTGCATAACTGTTTTCCCTCGGCACCTCTCTTTT

>dmel_3L type=golden_path_region; loc=3L:20341634..20342694; ID=3L; dbxref=GB:AE014296,GB:AE014296,REFSEQ:NT_037436;   release=r6.37; species=Dmel;
ATTAGTATATAGGCATATGCTTAAGTCTTAGGGTCTTATGGATATGTCACTATATATATATATAATTGCATAAATAGAGA
TATAATAATAGAGGGAGATAATATATTGAAAGCTTTTAATTGCTTCATACAAATTGATGACATCTCAATATCAAATACAA
TGTTGGATTACACACAAACCGTTTATGTCAATAAGAAAATAACTAAATGGGAAGATCTTTCTATATAAGAATATATAGAG

And I have file 2 (gene names):

CG2918
Spn77Bc

How can I replace the ID after the ">" in each FASTA header (the "dmel_..." string) with the corresponding unique gene name? The real files are longer, but the output should look like:

>CG2918 type=golden_path_region; loc=X:2270008..2271068; ID=X; dbxref=GB:AE014298,GB:AE014298,REFSEQ:NC_004354;   release=r6.37; species=Dmel;
CTCTTTGTTAGCCTACGCTTTCTGCTGAGTTTGTTATTTTTGTCTGCTCCCCACAAGGATATTGTTACAGAGAAAAAGCT
CGAATTGAAGGGAAAATGGAGACAAATAAGAAAACCCATGACAAAGAGGAAAGTTTCAAATATGGGCAATCGAAAAAATC
GAGAAGTGAGCCAATTTTTTTTTCGCCGAGGCTCCACTGTTCCCAGCTGCATAACTGTTTTCCCTCGGCACCTCTCTTTT

>Spn77Bc type=golden_path_region; loc=3L:20341634..20342694; ID=3L; dbxref=GB:AE014296,GB:AE014296,REFSEQ:NT_037436;   release=r6.37; species=Dmel;
ATTAGTATATAGGCATATGCTTAAGTCTTAGGGTCTTATGGATATGTCACTATATATATATATAATTGCATAAATAGAGA
TATAATAATAGAGGGAGATAATATATTGAAAGCTTTTAATTGCTTCATACAAATTGATGACATCTCAATATCAAATACAA
TGTTGGATTACACACAAACCGTTTATGTCAATAAGAAAATAACTAAATGGGAAGATCTTTCTATATAAGAATATATAGAG

Thank you so much!

fasta motif-search • 1.2k views

Are the gene names in the file in the same order as the FASTA entries they correspond to?


Yes, the gene names in the file are in the same order as in the FASTA!


What is the relationship between CG2918 and dmel_X? Are they simply in order as @rpolicastro asked, or do you have some sort of mapping file?


See my reply above: they are in the same order, so I was hoping to iterate line by line and substitute each name from the file into the matching FASTA header.

3.3 years ago
Joe 21k

Here's a Biopython solution:

from Bio import SeqIO
import sys

with open(sys.argv[1], 'r') as fh:
    # Pair each gene name with the FASTA record at the same position.
    for name, record in zip(fh, SeqIO.parse(sys.argv[2], 'fasta')):
        # Remove the old ID from the description so it is not repeated in the header.
        record.description = record.description.replace(record.id + ' ', '')
        record.id = name.strip()

        print(record.format('fasta'))

Run as python scriptname.py id_file.txt sequences.fasta
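
For anyone who would rather not depend on Biopython, the same in-order renaming can be sketched with the standard library alone. This is only a minimal sketch and not part of the answer above: the function name `rename_fasta_headers` is hypothetical, and it assumes the gene-name file has one name per line in the same order as the FASTA records, as confirmed in the comments. Unlike `zip`, which stops silently at the shorter input, it raises an error when the number of names and records differ.

```python
def rename_fasta_headers(fasta_lines, names):
    """Yield FASTA lines with each header's first token replaced by the next name.

    fasta_lines and names are iterables of strings (for example open file handles).
    Hypothetical helper: assumes names are listed in the same order as the records.
    """
    # Ignore blank lines in the name list and strip trailing newlines.
    names = iter(n.strip() for n in names if n.strip())
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            try:
                new_id = next(names)
            except StopIteration:
                raise ValueError("more FASTA records than gene names") from None
            # Split off the old ID (first whitespace-delimited token) and keep the rest.
            parts = line[1:].split(None, 1)
            rest = parts[1] if len(parts) > 1 else ""
            line = f">{new_id} {rest}".rstrip()
        yield line
    # Any name left over means the two inputs were not the same length.
    if next(names, None) is not None:
        raise ValueError("more gene names than FASTA records")


# Small in-memory demonstration using shortened headers.
demo_fasta = [">dmel_X type=a; loc=X:1..2;\n", "ACGT\n", "\n", ">dmel_3L type=b;\n", "TTGA\n"]
demo_names = ["CG2918\n", "Spn77Bc\n"]
for out_line in rename_fasta_headers(demo_fasta, demo_names):
    print(out_line)
```

With real files, pass the two open file handles in place of the demo lists and write each yielded line to the output.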


This is awesome, it worked. Thank you so much for your help!
