Question

Downloading Microsyntenic Fasta Sequences with Varying Chromosome Formats

9 months ago

Nicolas • 0

I have been analyzing microsyntenic regions between different species using the OMA Python API (https://github.com/DessimozLab/pyomadb). Now I would like to download the FASTA sequences of these regions with a script, but the chromosome identifiers are formatted differently across species, which complicates the extraction.

For example, for species like Bos taurus, I can find and fetch chromosome 13 from RefSeq without any issues. However, for other species, such as Ailuropoda melanoleuca, the chromosome is given as an "unplaced genomic scaffold" with the accession number GL192479.1, and the same approach doesn't work.

I am relatively new to working with this type of data, so I may be overlooking something. If you have any other suggestions or programs to accomplish this task, I would greatly appreciate your input.

Thanks!

microsyntenic-region fasta oma chromosomes • 485 views

score 1 · Answer 1 · 2023-08-07

9 months ago

GenoMax 142k

Have you tried removing the word "scaffold" and using just the accession?

For example, you can use EntrezDirect like this:

$ efetch -db nuccore -id GL192479.1 -seq_start 1792869 -seq_stop 1792900 -format fasta
>GL192479.1:1792869-1792900 Ailuropoda melanoleuca unplaced genomic scaffold scaffold26, whole genome shotgun sequence
TATCCAGCTCACATAGAAGACATTGACTACGA
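
To call this from a script, one option is to wrap the same efetch command in Python. The following is a minimal sketch, not a tested pipeline: the function names `build_efetch_command` and `fetch_region` are hypothetical, and it assumes EntrezDirect is installed and on the PATH. It only uses the flags shown in the command above, with 1-based, inclusive coordinates.

```python
import subprocess


def build_efetch_command(accession, start, stop):
    """Return the efetch argument list for a 1-based, inclusive region."""
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    return [
        "efetch",
        "-db", "nuccore",
        "-id", accession,
        "-seq_start", str(start),
        "-seq_stop", str(stop),
        "-format", "fasta",
    ]


def fetch_region(accession, start, stop):
    """Run efetch and return the FASTA text.

    Assumes EntrezDirect is installed and network access is available.
    """
    result = subprocess.run(
        build_efetch_command(accession, start, stop),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if efetch fails
    )
    return result.stdout
```

Calling `fetch_region("GL192479.1", 1792869, 1792900)` should then return the same FASTA record as the command above, provided the accession is available in nuccore.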

The thing is that I have cases where the value in the chromosome column is just "15", not an accession number, and sometimes there is an accession without the word "scaffold". So I guess I will need to handle this with an if statement and a regex to distinguish each case and treat them differently.

Thanks!

ADD REPLY • link 8 months ago by Nicolas • 0
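
The if-plus-regex idea from the reply above could look roughly like the sketch below. The function name `classify_chrom_label` and both patterns are assumptions made for illustration: the accession pattern is a loose approximation of GenBank/RefSeq-style identifiers (such as GL192479.1), not an official specification, and the real values in the chromosome column may need different rules.

```python
import re

# Assumed accession shape: 1-6 uppercase letters, an optional underscore,
# at least 5 digits, and an optional ".version" suffix (e.g. GL192479.1).
ACCESSION_RE = re.compile(r"\b([A-Z]{1,6}_?\d{5,}(?:\.\d+)?)\b")

# Assumed chromosome-name shape: a number or X/Y/W/Z/M/MT,
# optionally prefixed with "chr".
CHROM_RE = re.compile(r"^(?:chr)?(\d+|[XYWZ]|MT?)$", re.IGNORECASE)


def classify_chrom_label(value):
    """Return (kind, identifier) for a value from the chromosome column."""
    text = str(value).strip()

    # Case 1: an accession appears somewhere in the text,
    # with or without surrounding words such as "scaffold".
    match = ACCESSION_RE.search(text)
    if match:
        return ("accession", match.group(1))

    # Case 2: the whole value is a plain chromosome name such as "15".
    match = CHROM_RE.match(text)
    if match:
        return ("chromosome", match.group(1).upper())

    # Anything else is flagged so it can be inspected by hand.
    return ("unknown", text)
```

Values classified as "accession" can be passed to efetch directly, while plain chromosome names like "15" still need to be mapped to an accession for the relevant assembly before they can be fetched.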

score 1 · Answer 2 · 2023-08-07

Hi Nicolas,

On the OMA Browser you can also download all the CDS and protein sequences, either as a single FASTA file or via the API. You can load the sequences for a specific protein with c.proteins[<id>]. If you also need the intergenic sequences, I think the approach suggested by GenoMax might be a solution, but since the data in OMA originates from many different sources, it might not always be possible to use EntrezDirect.

Best wishes Adrian