Convert hg38 to a nested array for binary search
11 months ago

Hi all, I am trying to convert the hg38 GTF file into a nested array so that I can run a binary search on it. I want to nest the array by position, where the first array holds the chromosomes in sorted order:

# autosomes 1-22
chromosomes = list(range(1, 23))

The second array would be the strand (+, -):

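strands = ['+', '-']

# Give each chromosome its own copy of the strand list; assigning `strands`
# itself would make all 22 entries point at one shared list.
for i in range(len(chromosomes)):
    chromosomes[i] = list(strands)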

The third array would hold the (start, end) positions,

and the final array would be a list of attributes such as transcript_id and gene_id.

I am not sure of the best way to iterate through the GTF file that I loaded (as a pandas DataFrame, sorted_df) in order to fill my array of arrays. I have this so far, but I cannot tell whether it is working or just taking a long time because of the file size:

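# Single pass over the rows: nested[chrom][strand] is a list of (start, end).
# Assumes the 'chr' column holds '1'..'22'; adjust the keys if it holds 'chr1'..'chr22'.
chrom_index = {str(c): c - 1 for c in range(1, 23)}
strand_index = {'+': 0, '-': 1}
nested = [[[], []] for _ in range(22)]
for index, row in sorted_df.iterrows():
    c = chrom_index.get(str(row['chr']))
    s = strand_index.get(row['strand'])
    if c is None or s is None:
        continue  # skip other contigs and unstranded features
    # store a tuple; list.append takes one argument, and clearing a
    # shared list afterwards would also empty the stored entry
    nested[c][s].append((row['start'], row['end']))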

Is this the right way of thinking about this problem or is there a better way to approach it? Any help would be appreciated.
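For reference, the lookup side of this idea could be sketched roughly as below. This is only a minimal illustration, not a tested solution for the full annotation: the names `build_index` and `find_overlapping` are hypothetical, records are assumed to be (chrom, strand, start, end, attrs) tuples with 1-based inclusive coordinates as in GTF, and a dict keyed by (chrom, strand) stands in for the two outer arrays. Because features can overlap, a running maximum of the end positions is kept next to the sorted starts so the backward scan after the binary search knows when to stop.

```python
from bisect import bisect_right


def build_index(records):
    """Group (chrom, strand, start, end, attrs) records and sort them by start."""
    grouped = {}
    for chrom, strand, start, end, attrs in records:
        grouped.setdefault((chrom, strand), []).append((start, end, attrs))

    index = {}
    for key, intervals in grouped.items():
        # Sort on coordinates only, so the attribute dicts are never compared.
        intervals.sort(key=lambda iv: (iv[0], iv[1]))
        starts = [iv[0] for iv in intervals]
        # max_end[i] is the largest end among intervals[0..i].
        max_end, running = [], 0
        for iv in intervals:
            running = max(running, iv[1])
            max_end.append(running)
        index[key] = (starts, max_end, intervals)
    return index


def find_overlapping(index, chrom, strand, pos):
    """Return every interval on (chrom, strand) with start <= pos <= end."""
    if (chrom, strand) not in index:
        return []
    starts, max_end, intervals = index[(chrom, strand)]
    # Binary search: last interval whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    hits = []
    # Walk left while some interval at or before i can still reach pos.
    while i >= 0 and max_end[i] >= pos:
        if intervals[i][1] >= pos:
            hits.append(intervals[i])
        i -= 1
    hits.reverse()
    return hits
```

The index would be built once from the rows of the loaded table and then queried per position, for example `find_overlapping(index, '1', '+', 123456)`.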

python hg38 genome binary search • 288 views
11 months ago

I think you're re-inventing the wheel. You should have a look at htslib/tabix.

See also: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system
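To give a feel for what the binning scheme on that page does, here is a rough Python sketch of the standard five-level scheme (128 kb finest bins, each level 8 times larger, covering up to 512 Mb, which is enough for any hg38 chromosome). The function names `bin_from_range` and `overlapping_bins` are my own for illustration, and coordinates are assumed to be 0-based half-open, so a GTF feature (1-based, inclusive) would be passed as `(start - 1, end)`.

```python
# Offsets of the first bin at each level, from the finest (128 kb) to the coarsest (512 Mb).
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
FIRST_SHIFT = 17  # 2**17 = 128 kb, the size of the finest bins
NEXT_SHIFT = 3    # each coarser level is 2**3 = 8 times larger


def bin_from_range(start, end):
    """Smallest bin that fully contains the 0-based half-open range [start, end)."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError(f"range {start}-{end} does not fit in the 512 Mb scheme")


def overlapping_bins(start, end):
    """All bins, at every level, that could hold a feature overlapping [start, end)."""
    bins = []
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    return bins
```

Each feature is stored under `bin_from_range(start, end)`, and a query only has to inspect the features whose bin appears in `overlapping_bins(query_start, query_end)` rather than scanning the whole chromosome.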
