Question

Convert hg38 to a nested array for binary search


3.4 years ago

noahhelton98 • 0

Hi all, I am trying to convert the hg38 GTF file into a nested array so that I can run a binary search on it. I want to build the nested array by position, where the first level is the list of chromosomes in sorted order:

chromosomes = []
for i in range(1, 23):
    chromosomes.append(i)

The second level would be the strand (+, -):

strands = ['+', '-']

for i in range(0,len(chromosomes)):
    chromosomes[i] = strands

The third level would be the start and end positions,

and the final level would be a list of attributes such as transcript_id and gene_id.

I am not sure of the best way to iterate through the GTF file that I loaded in order to fill in my current array of arrays. I have this so far, but I am not sure whether it is working or just taking a long time because of the file size:

for i in range(0, len(chromosomes)):
    for index, row in sorted_df.iterrows():
        if (str(i) == row['chr']) & (chromosomes[i][0] == row['strand']):
            positions = []
            positions.append((row['start'], row['end']))
            chromosomes[i][0] = positions
            positions.clear()

Is this the right way of thinking about this problem or is there a better way to approach it? Any help would be appreciated.
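
For reference, the structure described above can be sketched without the nested loops: group the features by (chromosome, strand) in a dictionary, sort each group by start position, and use the standard-library bisect module for the search. This is only a minimal sketch under some assumptions: the names build_index and overlapping and the three example lines are hypothetical, the input is assumed to be plain tab-separated GTF text, and the overlap check after the binary search is still linear in the worst case.

```python
import bisect
from collections import defaultdict

# Hypothetical three-line GTF excerpt, for illustration only.
EXAMPLE_GTF = [
    'chr1\tsrc\tgene\t100\t500\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tsrc\tgene\t300\t800\t.\t+\t.\tgene_id "G2";\n',
    'chr1\tsrc\tgene\t900\t1200\t.\t-\t.\tgene_id "G3";\n',
]


def build_index(gtf_lines):
    """Map (chrom, strand) to (sorted starts, sorted features)."""
    grouped = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        chrom, strand, attrs = fields[0], fields[6], fields[8]
        start, end = int(fields[3]), int(fields[4])  # 1-based, inclusive
        grouped[(chrom, strand)].append((start, end, attrs))

    index = {}
    for key, features in grouped.items():
        features.sort()  # sorts by start, then end
        starts = [feature[0] for feature in features]
        index[key] = (starts, features)
    return index


def overlapping(index, chrom, strand, pos):
    """Return the features on (chrom, strand) that contain position pos."""
    starts, features = index.get((chrom, strand), ([], []))
    # Binary search: every feature before `hi` starts at or before pos.
    hi = bisect.bisect_right(starts, pos)
    # Keep only the candidates that also end at or after pos.
    return [feature for feature in features[:hi] if feature[1] >= pos]


index = build_index(EXAMPLE_GTF)
hits = overlapping(index, "chr1", "+", 400)
```

Reading the file once and grouping as you go avoids rescanning the whole table for every chromosome. Because features can overlap, sorting by start alone does not give a pure binary search for "what covers this position", which is why the final filter on the end coordinate is needed.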

python hg38 genome binary search • 610 views


score 1 · Answer 1 · 2020-12-07


3.4 years ago

Pierre Lindenbaum 161k

I think you're re-inventing the wheel. You should have a look at htslib/tabix.

See also: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system
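
To give an idea of what the binning scheme does, here is a rough Python sketch of the bin calculation. The constants (smallest bin of 128 kb, five levels, each level 8 times larger) are my understanding of the standard, non-extended UCSC scheme, so check them against the linked page before relying on them. The function names are hypothetical, and coordinates are assumed to be 0-based with an exclusive end, so a GTF record would need start - 1.

```python
# Assumed constants for the standard UCSC scheme (covers up to 512 Mb).
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17  # smallest bins span 2**17 = 128 kb
BIN_NEXT_SHIFT = 3    # each level up is 2**3 = 8 times larger


def bin_from_range(start, end):
    """Smallest bin that fully contains [start, end), 0-based half-open."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("range does not fit in the 512 Mb standard scheme")


def overlapping_bins(start, end):
    """All bins, at every level, that may hold features overlapping [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    bins = []
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins
```

Each feature is stored with the bin from bin_from_range, and a query only has to look at the handful of bins returned by overlapping_bins instead of scanning the whole chromosome. tabix uses the same idea (with different constants) on a bgzip-compressed, position-sorted file, so in practice you get this for free by indexing the GTF.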
