Split fasta based on symbol and numeric position in header
0
0
Entering edit mode
25 days ago

Hi, I have discovered that there is a problem with the database I created, which is that my pipeline is not handling multiple genes in a single fasta entry. Carbohydrate active enzyme (CAZyme) genes often contain a GH and a CBM close to one another, and in my current database that looks like:

>NZ_LR732827.1_2243__GH81(350-867)+CBM32(1336-1449)+CBM32(1571-1687)+CBM32(1705-1820) Curtobacterium_sp_8I-2
WGDWNVSWRMPQDGPGGQYMDVTSVQGSPY... [truncated for ease of viewing]


The number in parentheses reflects the position of the gene in the sequence that follows it. I have read through lots of solutions to similar problems but I can't work out how to apply them to my specific situation.

My required output would contain:

>NZ_LR732827.1_2243__GH81(350-867)
WGDWNVSWR
>NZ_LR732827.1_2243__CBM32(1336-1449)
MDVTSVQGR


With the sequence corresponding to the position in parentheses. I understand that bedtools getfasta will let me split based on position so if it's not possible to get the above in one step, if I could simply duplicate the whole sequence for each new entry then I will be able to extract the relevant part afterwards.

I also would like this code to look through my whole fasta which contains mostly one gene per header and ignore those, for those are already in the format I want, I just want it to create a new line and duplicate sequence for every "+" it sees.

I believe the best tool for this would be biopython, but my current attempt:

from Bio import SeqIO

for record in SeqIO.parse("file.fa", "fasta"):
species = record.id.split('+')[0]
with open(f"{species}.fa", "a") as f:
SeqIO.write(record, f, "fasta")


Is simply cutting off everything after the first "+" and also is missing the Curtobacterium name at the end as it is separated by a space I think.

Thank you so much for your help!! Please let me know if there is anything that is unclear

python fasta sequence genome split • 357 views
0
Entering edit mode

I don't know if this will be helpful, but in my years of experience with human genomics, I've only seen FASTA files as storage of raw sequencing data and reference files that contain chromosomes or contigs. Genes are usually specified in GTF files that contain the position and other metadata about genes. GTF files are basically in BED format which makes them very easy to work with using BEDtools or something similar. The only tools I've seen for FASTA files involve fetching the sequence given a position, like bedtools nuc. Storing gene positions in fasta headers in parenthesis seems like it will lead to some difficulties.

0
Entering edit mode

Thanks for getting back to me - I could see why that would be advantageous. Is there a tool that I could use if I went back a step and put them in GTF format instead?

0
Entering edit mode

You need to explain what functions you want the database to have. if you use GTF format than this will be readable by alignment programs such as STAR and BWA. If the genes are in BED format than you can perform genome arithmetic with BEDtools, and visualize gene locations with IGV or the genome browser. It's likely a good idea to have the genes in both GTF and BED format. Are you working on non model organisms? GTF files are available for many species at the UCSC table browser, you could download one there and then modify it to your liking.

0
Entering edit mode

Okay this is super helpful thank you. My only need for the database is for it to have a fasta entry for each CAZyme so I can blast metagenomes against it (with a very very stringent filter). I ended up with my database the way it is above because I am new to bioinformatics and I'm thinking I've created a problem with my previous sub-optimal step. If I go back a step, I have data in the form of

>NZ_LR732827.1_2243 # 2371281 # 2376752 # -1 # ID=1_2243;partial=00;start_type=GTG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.722


And then also the tool gives the CAZyme info (with the plus signs for each hit) in the form of

NZ_LR732827.1_2243  CBM38(188-311)+GH81(350-867)+CBM32(1336-1449)+CBM32(1571-1687)+CBM32(1705-1820)


So I've extracted only the information in the first file that matches the NZ_etc in the second file. This left me with a database containing multiple genes per sequence.

I'm guessing the best solution to my problem would be at this step, by splitting up the plus signs at this point into new lines so that it looks like

NZ_LR732827.1_2243  CBM38(188-311)
NZ_LR732827.1_2243  GH81(350-867)


And then create the database. I tried the below suggestion of

cut -d "+" -f1 your_file.fa


But it still is returning only the top line (with CBM38). This seems like it will have an easier solution though I will see if I can find one!

0
Entering edit mode

You may be able to do

\$ cut -d "+" -f1 your_file.fa
>NZ_LR732827.1_2243__GH81(350-867)
WGDWNVSWRMPQDGPGGQYMDVTSVQGSPY

0
Entering edit mode

Thanks for the response - I tried this and it looks like it gives me the same output as the biopython script, which is the sequence and the information but only for the sequence header before the first "+" - is there a way to have it duplicate sequence for each time it sees the + delimiter?

0
Entering edit mode

Just to clarify you want it to keep track of the aa number and then create a new record for each ID separated by + by taking into account the actual number in brackets? That would certainly need custom code.

0
Entering edit mode

If possible in one step that would be great but I have already worked out how to do the second part of that with bedtools, for this I just want to duplicate the NZ number and create a new line every time there is a + so that

NZ_LR732827.1_2243  CBM38(188-311)+GH81(350-867)


Turns into

NZ_LR732827.1_2243  CBM38(188-311)
NZ_LR732827.1_2243  GH81(350-867)