Data filtering by creating dictionaries in python
1
0
Entering edit mode
3 months ago
mdgn ▴ 10

Hello,

I am trying to do partitioning my plastid alignment using Arabidopsis.gff file as reference. I wanted to first make a position file which had assigned keys and values to each position in the AT reference sequence.

NC_000932.1 RefSeq  exon    527 551 .   +   .   ID=exon-ArthCt098-1;Parent=rna-ArthCt098;Dbxref=GeneID:1466263;gbkey=tRNA;gene=trnT
NC_000932.1 RefSeq  CDS 554 570 .   +   0   ID=cds-NP_051054.1;Parent=gene-ArthCp017;Dbxref=Genbank:NP_051054.1,GeneID:844775;Name=NP_051054.1;gbkey=CDS;gene=psbD
NC_000932.1 RefSeq  CDS 610 612 .   +   0   ID=cds-NP_051055.1;Parent=gene-ArthCp018;Dbxref=Genbank:NP_051055.1,GeneID:844773;Name=NP_051055.1;Note=CP43;gbkey=CDS;gene=psbC


I wanted to define a range of numbers (or count with while loop), and give each location in the sequence a value according to the reference. So the script should search for the intervals in 4th and 5th columns and between that interval assigning each number as a key and 3rd column for respective values.

My desired output is below;

    "1": "na"
"2": "na"
"3": "na"
.
.
"526": "na"
"527": "exon"
"528": "exon"
"529": "exon"
.
.
.
"550": "exon"
"551": "exon"
"552": "na"
"553": "na"
"554": "exon"
"555": "exon"
.
.
.
"570":"exon"
"571":"na"
.
.
.
"610": "na"
"611": "CDS"
"612": "CDS"
.
.
.
until the range ends.


I believe a while loop with a count is the correct way, but I am failing to end it correctly and it goes into an endless loop. I tried directly assigning the values as strings but I realize library is what I need because the data is too big to handle. I would be glad if you can show a way to solve this.

python alignment partitioning • 197 views
1
Entering edit mode
3 months ago

I think this should work:

#!/usr/bin/env python

import sys

d = {}
previous_stop_idx = 1
for line in sys.stdin:
elems = line.rstrip().split('\t')
start_idx = int(elems[3])
stop_idx = int(elems[4])
feature = elems[2]
for idx in range(previous_stop_idx, start_idx):
d[idx] = 'na'
for idx in range(start_idx, stop_idx + 1): # note: inclusive stop index
d[idx] = feature
previous_stop_idx = stop_idx + 1
print(d)


Example:

$./create_dict.py < annotations.gff {1: 'na', 2: 'na', ..., 527: 'exon', 528: 'exon', ..., 611: 'CDS', 612: 'CDS'}  Note: This is very redundant, perhaps too much so. If the dictionary you are making ends up being too large to fit in memory, you could perhaps look at instead specifying the start point of a "contiguous block" of a feature category (e.g. na, exon, CDS, etc.). This approach takes out the redundancy of consecutive keys, and so would use a lot less memory. With this structure, you could do a binary search with bisect on the starting point, to get to a key that returns the desired feature value. A solution of this kind is discussed here: https://stackoverflow.com/a/39358118/19410 To go with the morning coffee, I tried out a bisect-based implementation: #!/usr/bin/env python import sys import bisect table = {} idxs = [] previous_stop_idx = 1 na_feature = 'na' oob_feature = 'out_of_bounds' ''' Read standard input into the "table" dictionary and "idxs" list. ''' for line in sys.stdin: elems = line.rstrip().split('\t') start_idx = int(elems[3]) stop_idx = int(elems[4]) feature = elems[2] if previous_stop_idx < start_idx: table[previous_stop_idx] = na_feature idxs.append(previous_stop_idx) table[start_idx] = feature idxs.append(start_idx) previous_stop_idx = stop_idx + 1 table[previous_stop_idx] = oob_feature idxs.append(previous_stop_idx) ''' Query table for a given index, using a binary search on the list "idxs". This should return either 'na' or the desired feature, or raise a ValueError exception for out-of-bounds conditions. ''' def query_table_by_index(query_idx): # # if query_idx is less than 1, then we are out of bounds # if query_idx < 1: raise ValueError("Out of bounds [{}]".format(query_idx)) # # get the index of idxs that is to the left of query_idx # r = bisect.bisect(idxs, query_idx) - 1 # # if we get the end index, then we are out of bounds ("oob_feature") # if r == len(idxs) - 1: raise ValueError("Out of bounds [{}]".format(query_idx)) # # if we are at this point, we can return the desired feature # k = idxs[r] return table[k] ''' Some sample queries ''' # # we expect this to fail # try: print(query_table_by_index(-1)) except ValueError as e: print(e) pass assert(query_table_by_index(1) == 'na') print(query_table_by_index(1)) assert(query_table_by_index(526) == 'na') print(query_table_by_index(526)) assert(query_table_by_index(527) == 'exon') print(query_table_by_index(527)) assert(query_table_by_index(528) == 'exon') print(query_table_by_index(528)) assert(query_table_by_index(609) == 'na') print(query_table_by_index(609)) assert(query_table_by_index(610) == 'CDS') print(query_table_by_index(610)) # # we expect this to fail # try: query_table_by_index(613) except ValueError as e: print(e) pass  The output: % ./create_dict_v2.py < annotations.gff Out of bounds [-1] na na exon exon na CDS Out of bounds [613]  This second approach will use much, much less memory than the first approach. If you have a large number of annotations, the second approach will also read in those annotations much more quickly, because Python will take less time when building a smaller dictionary. If you have lots of queries to do, the difference between the two approaches will be in query speed. Doing an index lookup on the first approach will be O(1) — constant time. That's the advantage of having a dictionary, if you can afford the memory overhead. Doing an index lookup on the second approach will take O(log n) time to do the binary search, where n is roughly twice the number of features (features along with the na gaps in between). Lookups will be slightly slower, but the underlying data will use less memory. ADD COMMENT 0 Entering edit mode Thank you for your recommendation. However I am getting IndexError: list index out of range on start_idx = int(elems[3]) line. I didn't understand why its happening. ADD REPLY 0 Entering edit mode Run cat -te on your input file. It should look something like this: % cat -te annotations.gff NC_000932.1^IRefSeq^Iexon^I527^I551^I.^I+^I.^IID=exon-ArthCt098-1;Parent=rna-ArthCt098;Dbxref=GeneID:1466263;gbkey=tRNA;gene=trnT$
NC_000932.1^IRefSeq^ICDS^I554^I570^I.^I+^I0^IID=cds-NP_051054.1;Parent=gene-ArthCp017;Dbxref=Genbank:NP_051054.1,GeneID:844775;Name=NP_051054.1;gbkey=CDS;gene=psbD$NC_000932.1^IRefSeq^ICDS^I610^I612^I.^I+^I0^IID=cds-NP_051055.1;Parent=gene-ArthCp018;Dbxref=Genbank:NP_051055.1,GeneID:844773;Name=NP_051055.1;Note=CP43;gbkey=CDS;gene=psbC$


There should be tab characters (^I) in between columns.

If you have spaces in between column values, instead of a tab, then you need to replace those spaces with one tab.

Also, I added a bisect-based script, which should run much faster on real-world input.

0
Entering edit mode

One more thought: If you're using GFF or BED or other similar file formats for genomic data, this approach will work on data for one chromosome or contig (i.e., the first column). It is easy to modify variables table etc. to handle this case. Mainly, the point here is to demonstrate reading in the data and querying it.