Summarization¶
A key application of BioNumPy is to extract features from sequence data sets. A large set of interesting features can be computed as functions from sequences to scalar values. Examples are kmer-hashing (kmer->hash-value), minimizers(window->hash-value), string/motif-matching (sequence->bool), Position Weight Matrix scores (sequence->float). BioNumPy provides functionality to apply such functions to rolling windows across large sequence sets, through the RollableFunction class. By specifying a broadcastable function in the __call__ method, the rolling_window method will apply the function to all windows in a sequence set. Take the PositionWeightMatrix class for instance:
class PositionWeightMatrix(RollableFunction):
def __init__(self, matrix, encoding=ACTGEncoding):
self._matrix = np.asanyarray(matrix)
self.window_size = self._matrix.shape[-1]
self._indices = np.arange(self.window_size)
self._encoding = ACTGEncoding
def __call__(self, sequence: Sequence) -> float:
sequence = as_sequence_array(sequence, self._encoding)
scores = self._matrix[sequence, self._indices]
return scores.sum(axis=-1)
It’s __call__ method specifies how to calculate the score of a sequence. Calling it’s rolling_window function will calculate the scores for all windows in a data set:
>>> import numpy as np
>>> sequences = bnp.as_sequence_array(["acgttgta", "gcttca", "gttattc"], encoding=bnp.encodings.ACTGEncoding)
>>>
>>> matrix = np.log([[0.1, 0.2],
... [0.2, 0.3],
... [0.4, 0.1],
... [0.3, 0.4]])
>>> pwm = bnp.position_weight_matrix.PositionWeightMatrix(matrix)
>>> pwm("ac")
-3.506557897319982
>>> pwm(["ac", "cg"])
array([-3.5065579 , -2.52572864])
>>> pwm.rolling_window(sequences)
RaggedArray([[-3.506557897319982, -2.525728644308255, -3.506557897319982, -3.2188758248682006, -1.83258146374831, -3.506557897319982, -2.525728644308255], [-2.4079456086518722, -3.9120230054281455, -3.2188758248682006, -2.120263536200091, -3.2188758248682006], [-3.506557897319982, -3.2188758248682006, -2.525728644308255, -4.605170185988091, -3.2188758248682006, -2.120263536200091]])
Further processing can be done with numpy functions, for instance finding the max score for each sequence in the set:
>>> pwm.rolling_window(sequences).max(axis=-1)
array([-1.83258146, -2.12026354, -2.12026354])