Downloading Microsyntenic Fasta Sequences with Varying Chromosome Formats
9 months ago
Nicolas • 0

I have been analyzing microsyntenic regions between different species using the OMA Python API (https://github.com/DessimozLab/pyomadb). Now I would like to download the FASTA sequences of these regions with a script, but the chromosome formats vary across species, which makes the extraction more complex.

[Image: Dataframe]

For example, for species like Bos taurus, I can find and fetch chromosome 13 from RefSeq without any issues. However, for other species, such as Ailuropoda melanoleuca, the chromosome is represented as an "unplaced genomic scaffold" with the accession number GL192479.1, and the previous approach doesn't work.

I am relatively new to working with this type of data, so I may well be overlooking something. If you have any other suggestions or programs to accomplish this task, I would greatly appreciate your input.

Thanks!

microsyntenic-region fasta oma chromosomes • 485 views
9 months ago
GenoMax 142k

Have you tried removing the word "scaffold" and using just the accession?

You can use EntrezDirect like this, for example:

$ efetch -db nuccore -id GL192479.1 -seq_start 1792869 -seq_stop 1792900 -format fasta
>GL192479.1:1792869-1792900 Ailuropoda melanoleuca unplaced genomic scaffold scaffold26, whole genome shotgun sequence
TATCCAGCTCACATAGAAGACATTGACTACGA
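
If you want to drive this from a script, a minimal Python sketch could look like the one below. It only reuses the efetch options shown in the command above; the function names `build_efetch_command` and `fetch_region` are hypothetical, and it assumes EntrezDirect is installed and on your PATH.

```python
import shutil
import subprocess


def build_efetch_command(accession, start, stop):
    """Build the efetch argument list for one region.

    Coordinates are passed through unchanged, so they are expected to be
    1-based and inclusive, as in the command-line example above.
    """
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    return [
        "efetch",
        "-db", "nuccore",
        "-id", accession,
        "-seq_start", str(start),
        "-seq_stop", str(stop),
        "-format", "fasta",
    ]


def fetch_region(accession, start, stop):
    """Run efetch and return the FASTA text it prints (needs network access)."""
    if shutil.which("efetch") is None:
        raise RuntimeError("efetch not found; install EntrezDirect first")
    result = subprocess.run(
        build_efetch_command(accession, start, stop),
        capture_output=True,
        text=True,
        check=True,  # raise if efetch exits with an error
    )
    return result.stdout
```

Calling `fetch_region("GL192479.1", 1792869, 1792900)` should then run the same query as the command above. Building the command as a list, rather than a shell string, avoids quoting problems if a value from the dataframe contains spaces.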

The thing is that I have cases where the value in the chromosome column is just "15", not an accession number, and sometimes there is an accession without the word "scaffold". So I guess I will need an if statement with a regex to discriminate between these cases and treat each one differently.
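
A rough sketch of that regex-based branching is below. The two patterns are assumptions based only on the three cases described in this thread (a bare chromosome name, a bare accession, and an accession embedded in a longer description), and `classify_chromosome` is a hypothetical helper name; the real column may contain other formats that need extra patterns.

```python
import re

# Assumed shape of a versioned accession: 1-6 capital letters, an optional
# underscore, at least 5 digits, a dot and a version (GL192479.1, NC_037340.1).
ACCESSION_RE = re.compile(r"\b([A-Z]{1,6}_?\d{5,}\.\d+)\b")

# Assumed shape of a plain chromosome name: 15, chr15, X, Y, MT, ...
CHROMOSOME_RE = re.compile(r"^(?:chr)?(\d{1,3}|[XYZW]|MT?)$", re.IGNORECASE)


def classify_chromosome(value):
    """Return a (kind, cleaned_value) pair for one chromosome-column entry."""
    text = str(value).strip()
    # An accession wins, whether or not words like "scaffold" surround it.
    match = ACCESSION_RE.search(text)
    if match:
        return ("accession", match.group(1))
    match = CHROMOSOME_RE.match(text)
    if match:
        return ("chromosome", match.group(1).upper())
    # Anything else is left for manual inspection.
    return ("unknown", text)


for entry in ["15", "GL192479.1", "unplaced genomic scaffold GL192479.1"]:
    print(entry, "->", classify_chromosome(entry))
```

Entries classified as "accession" can go straight to efetch. A bare chromosome name such as "15" would still have to be mapped to the matching sequence accession for that species' assembly first, since efetch needs an identifier rather than a chromosome number.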

Thanks!

8 months ago

Hi Nicolas,

On the OMA Browser you can also download all the CDS and protein sequences, either as a single FASTA file or via the API. You can load the sequences for a specific protein with c.proteins[<id>]. If you also need the intergenic sequences, I think GenoMax's approach might be a solution, but as the data in OMA originates from many different sources, it might not always be possible to use EntrezDirect.

Best wishes, Adrian
