How to extract genes from a FASTA file in groups of 2 (query and object) and store each pair in a separate file
4.2 years ago
bio90029 ▴ 10

Hi, I am running out of ideas on how to do this and would appreciate some help, please. I have 2 FASTA files from two different bacterial strains with 1000 genes each. An example of the files:

file A:             file B:
query seq.id        query seq.id
query seq           query seq
objA seq.id         objB seq.id
objA seq            objB seq
query_1 seq.id      query_1 seq.id
query_1 seq         query_1 seq
obj_1A seq.id       obj_1B seq.id
obj_1A seq          obj_1B seq


What I would like to get is this:

file_1              file_2
query seq.id        query_1 seq.id
query seq           query_1 seq
objA seq.id         obj_1A seq.id
objA seq            obj_1A seq
objB seq.id         obj_1B seq.id
objB seq            obj_1B seq


But I just don't know how to split the FASTA files. I was trying to do this using biopython SeqIO but I am quite lost.

python biopython • 870 views

If I understand correctly, you have 2 large .fa files whose individual entries you would like to split into separate folders?


In fact, I have 100 FASTA files that each contain about 1000 genes, but if I manage to do it for 2, I will manage for all of them. Each file contains the query reference genes and the object genes they match. What I would like to do is split the FASTA files, or sort them, so that I have one file per query gene with all the matching object genes.
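
The grouping described above could look roughly like the sketch below. It assumes every input file strictly alternates query record, object record, and that the same query ID appears in each strain's file. `group_by_query` is a hypothetical helper name, and records are plain `(id, sequence)` tuples so the logic can be tried without any FASTA files.

```python
def group_by_query(per_file_records):
    """Collect, for each query ID, the query record followed by its object from every file.

    per_file_records: one list per input file, each alternating
    (query_id, seq), (object_id, seq), ... (an assumption about the layout).
    """
    groups = {}
    for records in per_file_records:
        # Step through the file two records at a time; a trailing unpaired record is skipped.
        for i in range(0, len(records) - 1, 2):
            query, obj = records[i], records[i + 1]
            # Keep the query once; append each file's matching object after it.
            groups.setdefault(query[0], [query]).append(obj)
    return groups


file_a = [("query", "ATG"), ("objA", "ATGA"), ("query_1", "CCC"), ("obj_1A", "CCCA")]
file_b = [("query", "ATG"), ("objB", "ATGC"), ("query_1", "CCC"), ("obj_1B", "CCCG")]
groups = group_by_query([file_a, file_b])
```

With Biopython, each per-file list could be built as `[(r.id, str(r.seq)) for r in SeqIO.parse(path, "fasta")]`, and every entry of `groups` then written to its own output file.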


Please show the Biopython code you're trying, with errors, and we can help correct any errors. This would be the most beneficial for you, as a learning experience, and also keep s/o from writing the code for you as Biostars is not a coding service.


I was trying to do this using biopython SeqIO but I am quite lost.

Post what you have tried. Also, your use of terms here is a little confusing. Are there '>' characters in your FASTA file headers that are just not shown here? Perhaps post a couple of examples from your files, if you can share them.


Yes, the gene IDs all contain the '>'.

    import glob
    import os
    from Bio import SeqIO

    def sorting_fasta_files():
        # `files` is a list of folder paths defined elsewhere in the script
        for file in files:
            my_file = glob.glob(file + '/*.fa')
            #print my_file[1]
            filename = 'outfile%s.fasta'
            records = list(SeqIO.parse(my_file[1], 'fasta'))
            query_gene = records[0].id
            print(query_gene)
            for record in range(0, len(records), 2):
                with open(os.path.join('/path/files/', filename % query_gene), 'w') as output_handle:
                    SeqIO.write(records, output_handle, 'fasta')

    sorting_fasta_files()


But I don't get the right output. In fact, it places all the genes in the new FASTA file when I only want 2 genes per file.
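
Judging from the posted loop, two things probably cause this: `SeqIO.write(records, ...)` writes the whole list on every pass instead of the current pair, and `query_gene` is taken once from `records[0]`, so every pass overwrites the same output file. A minimal sketch of the pairing logic follows. To keep it runnable without Biopython, a hypothetical `read_fasta` helper stands in for `SeqIO.parse`, and `write_pairs` is likewise an illustrative name.

```python
import os


def read_fasta(path):
    """Stand-in for SeqIO.parse: yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_pairs(fasta_path, out_dir):
    """Write each consecutive (query, object) pair to its own file; return the paths."""
    records = list(read_fasta(fasta_path))
    written = []
    for i in range(0, len(records), 2):
        pair = records[i:i + 2]            # only the current two records
        query_id = pair[0][0].split()[0]   # first word of this pair's query header
        out_path = os.path.join(out_dir, "outfile%s.fasta" % query_id)
        with open(out_path, "w") as out:
            for header, seq in pair:
                out.write(">%s\n%s\n" % (header, seq))
        written.append(out_path)
    return written
```

In the original Biopython loop, the equivalent change would be to set `query_gene = records[record].id` inside the inner loop and call `SeqIO.write(records[record:record + 2], output_handle, 'fasta')`.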

4.2 years ago
bio90029 ▴ 10

The answer is this little Biopython script. I am posting the link in case someone else needs to do the same as me: http://biopython.org/wiki/Split_large_file
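
For readers who cannot follow the link, the general idea of batching an iterator can be sketched as below. This is an illustration in the same spirit as the linked recipe, not a copy of it, and it works on any Python iterator rather than on sequence records specifically.

```python
from itertools import islice


def batch_iterator(iterator, batch_size):
    """Yield lists of up to batch_size consecutive items from any iterator."""
    iterator = iter(iterator)
    while True:
        # Pull the next batch_size items without loading the whole input.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


pairs = list(batch_iterator(["query", "objA", "query_1", "obj_1A"], 2))
```

With Biopython, the input would be `SeqIO.parse(path, "fasta")` and each yielded batch could be passed to `SeqIO.write(batch, handle, "fasta")`, giving one output file per query-object pair.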