# Analysis

Analysis helper functions

## Utils

mesmerize.analysis.utils.get_array_size(transmission: mesmerize.analysis.data_types.Transmission, data_column: str) → int

Returns the size of the 1D arrays in the specified data column. Raises an exception if the array sizes do not match.

Parameters:

- transmission (Transmission) – Desired Transmission
- data_column (str) – Data column of the Transmission from which to retrieve the size

Returns: Size of the 1D arrays of the specified data column

Return type: int

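The size check described above can be illustrated with plain NumPy. This is only a sketch, assuming a data column behaves like a sequence of 1D arrays; `common_array_size` is a hypothetical helper and not part of Mesmerize.

```python
import numpy as np


def common_array_size(arrays):
    """Return the shared length of all 1D arrays, or raise if they differ.

    Hypothetical helper for illustration; not the Mesmerize implementation.
    """
    sizes = {np.asarray(a).size for a in arrays}
    if len(sizes) != 1:
        raise ValueError(f"Arrays differ in size: {sorted(sizes)}")
    return sizes.pop()


# Made-up data column: three 1D arrays of equal length
column = [np.zeros(100), np.ones(100), np.arange(100)]
size = common_array_size(column)
```
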
mesmerize.analysis.utils.get_frequency_linspace(transmission: mesmerize.analysis.data_types.Transmission) → Tuple[numpy.ndarray, float]

Get the frequency linspace.

Raises an exception if the datablocks do not all have the same linspace and Nyquist frequencies.

Parameters:

- transmission – Transmission containing data from which to get frequency linspace

Returns: tuple: (frequency linspace as a 1D numpy array, nyquist frequency)

Return type: Tuple[np.ndarray, float]

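For orientation, the snippet below shows how a frequency linspace and a Nyquist frequency generally relate to a sampling rate. It is a generic NumPy sketch, not how Mesmerize derives these values from a Transmission; the sampling rate and the number of frequency bins are made-up values.

```python
import numpy as np

sampling_rate = 10.0  # Hz, assumed value for illustration
n_bins = 6            # assumed number of frequency bins

# The Nyquist frequency is half the sampling rate
nyquist = sampling_rate / 2

# Evenly spaced frequencies from 0 Hz up to the Nyquist frequency
freqs = np.linspace(0, nyquist, n_bins)
```
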
mesmerize.analysis.utils.get_proportions(xs: Union[pandas.core.series.Series, numpy.ndarray, list], ys: Union[pandas.core.series.Series, numpy.ndarray], xs_name: str = 'xs', ys_name: str = 'ys', swap: bool = False, percentages: bool = True) → pandas.core.frame.DataFrame

Get the proportions of xs vs ys.

xs and ys are categorical data.

Parameters:

- xs (Union[pd.Series, np.ndarray, list]) – data plotted on the x axis
- ys (Union[pd.Series, np.ndarray]) – proportions of unique elements in ys are calculated per xs
- xs_name (str) – name for the xs data, useful for labeling the axis in plots
- ys_name (str) – name for the ys data, useful for labeling the axis in plots
- swap (bool) – swap x and y

Returns: DataFrame that can be plotted in a proportions bar graph

Return type: pd.DataFrame

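As an illustration of what "proportions of ys per xs" means, the following sketch builds a comparable table with `pandas.crosstab`. It is not the Mesmerize implementation, and the category names are made up.

```python
import pandas as pd

# Made-up categorical data for illustration
xs = pd.Series(["ctrl", "ctrl", "ctrl", "drug", "drug"], name="treatment")
ys = pd.Series(["neuron", "neuron", "glia", "glia", "glia"], name="cell_type")

# Rows: unique xs, columns: unique ys,
# values: percentage of each ys category within that xs category
proportions = pd.crosstab(xs, ys, normalize="index") * 100
```

Each row of such a table sums to 100, so it can be drawn as a stacked bar graph, for example with `proportions.plot(kind="bar", stacked=True)`.
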
mesmerize.analysis.utils.get_sampling_rate(transmission: mesmerize.analysis.data_types.Transmission, tolerance: Optional[float] = 0.1) → float

Returns the mean sampling rate of all data in a Transmission if the variation between samples is within the specified tolerance. Otherwise raises an exception.

Parameters:

- transmission (Transmission) – Transmission object of the data from which sampling rate is obtained.
- tolerance (float) – Maximum tolerance (in Hertz) of sampling rate variation between different samples

Returns: The mean sampling rate of all data in the Transmission

Return type: float

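The tolerance check can be sketched as follows. `mean_sampling_rate` is a hypothetical helper, and measuring the variation as the difference between the largest and smallest rate is an assumption; Mesmerize may define it differently.

```python
import numpy as np


def mean_sampling_rate(rates, tolerance=0.1):
    """Mean of `rates` (Hz) if their spread is within `tolerance`, else raise.

    Hypothetical helper; using max - min as the spread is an assumption.
    """
    rates = np.asarray(rates, dtype=float)
    spread = rates.max() - rates.min()
    if spread > tolerance:
        raise ValueError(
            f"Sampling rates vary by {spread:.3f} Hz, tolerance is {tolerance} Hz"
        )
    return float(rates.mean())


# Made-up sampling rates of three samples, all within 0.1 Hz of each other
rate = mean_sampling_rate([10.00, 10.02, 9.99])
```
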
mesmerize.analysis.utils.organize_dataframe_columns(columns: Iterable[str]) → Tuple[List[str], List[str], List[str]]

Organizes DataFrame columns into data columns, categorical label columns, and uuid columns.

Parameters:

- columns – All DataFrame columns

Returns: (data_columns, categorical_columns, uuid_columns)

Return type: Tuple[List[str], List[str], List[str]]

mesmerize.analysis.utils.pad_arrays(a: numpy.ndarray, method: str = 'random', output_size: int = None, mode: str = 'minimum', constant: Any = None) → numpy.ndarray

Pad all the input arrays so that they are of the same length. The length is determined by the largest input array. By default, the padding value for each input array is the minimum value in that array.

Each input array is padded either after its last index, up to the length of the largest input array (method ‘fill-size’), or with the padding randomly distributed on either side of the input array (method ‘random’) for easier visualization.

Parameters:

- a (np.ndarray) – 1D array where each element is a 1D array
- method (str) – one of ‘fill-size’ or ‘random’, see docstring for details
- output_size – not used
- mode (str) – one of either ‘constant’ or ‘minimum’. If ‘minimum’, the min value of the array is used as the padding value. If ‘constant’, the value passed to the “constant” argument is used as the padding value.
- constant (Any) – padding value if ‘mode’ is set to ‘constant’

Returns: Arrays padded according to the chosen method. 2D array of shape [n_arrays, size of largest input array]

Return type: np.ndarray
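The ‘fill-size’ method with the ‘minimum’ mode can be sketched with `numpy.pad`. `pad_fill_size` is a hypothetical helper written for illustration and is not the Mesmerize implementation.

```python
import numpy as np


def pad_fill_size(arrays):
    """Right-pad each 1D array with its own minimum up to the longest length.

    Hypothetical helper mirroring method 'fill-size' with mode 'minimum'.
    """
    target = max(len(a) for a in arrays)
    padded = [
        # (0, n) pads nothing on the left and n values on the right
        np.pad(a, (0, target - len(a)), mode="minimum")
        for a in arrays
    ]
    return np.vstack(padded)


# Two made-up curves of different lengths
curves = [np.array([3.0, 1.0, 2.0]), np.array([5.0, 4.0, 6.0, 7.0, 8.0])]
out = pad_fill_size(curves)
```

The shorter curve is extended with its minimum value, and the result is a 2D array with one row per input array.
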

## Cross correlation

### functions

Helper functions. Uses tslearn.cycc

mesmerize.analysis.math.cross_correlation.ncc_c(x: numpy.ndarray, y: numpy.ndarray) → numpy.ndarray

Must pass a 1D array to both x and y.

Parameters:

- x – Input array [x1, x2, x3, … xn]
- y – Input array [y1, y2, y3, … yn]

Returns: the normalized cross-correlation function (as an array) of the two input vector arguments “x” and “y”

Return type: np.ndarray

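To make the idea of a normalized cross-correlation function concrete, here is a minimal NumPy sketch that scales the full cross-correlation by the product of the vector norms. `normalized_cc` is a hypothetical helper; Mesmerize relies on tslearn.cycc, whose result may differ in detail.

```python
import numpy as np


def normalized_cc(x, y):
    """Cross-correlation of two 1D arrays, scaled by the product of their norms.

    Hypothetical helper for illustration; not the tslearn or Mesmerize code.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        # An all-zero input has no defined normalization; return zeros
        return np.zeros(x.size + y.size - 1)
    return np.correlate(x, y, mode="full") / denom


# Correlating a made-up signal with itself peaks in the middle with value 1
x = np.array([0.0, 1.0, 0.0, 0.0])
cc = normalized_cc(x, x)
```
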
mesmerize.analysis.math.cross_correlation.get_omega(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → int

Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”.

Parameters:

- x – Input array [x1, x2, x3, … xn]
- y – Input array [y1, y2, y3, … yn]
- cc – cross-correlation function represented as an array [c1, c2, c3, … cn]

Returns: index (x-axis position) of the global maximum of the cross-correlation function

Return type: int

mesmerize.analysis.math.cross_correlation.get_lag(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → float

Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”.

Parameters:

- x – Input array [x1, x2, x3, … xn]
- y – Input array [y1, y2, y3, … yn]
- cc – cross-correlation function represented as an array [c1, c2, c3, … cn]

Returns: Position of the maximum of the cross-correlation function with respect to the middle point of the cross-correlation function

Return type: float

mesmerize.analysis.math.cross_correlation.get_epsilon(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → float

Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”.

Parameters:

- x – Input array [x1, x2, x3, … xn]
- y – Input array [y1, y2, y3, … yn]
- cc – cross-correlation function represented as an array [c1, c2, c3, … cn]

Returns: Magnitude of the global maximum of the cross-correlation function

Return type: float

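The three quantities described above (index of the maximum, lag relative to the middle point, and magnitude of the maximum) can be read off a cross-correlation array as sketched below. The array is made up, and both the definition of the middle point and the sign of the lag are assumptions that may not match Mesmerize exactly.

```python
import numpy as np

# A made-up cross-correlation function with 7 points; the middle index is 3
cc = np.array([0.0, 0.1, 0.2, 0.4, 0.3, 0.9, 0.1])

omega = int(np.argmax(cc))   # index of the global maximum
middle = (cc.size - 1) / 2   # assumed definition of the middle point
lag = omega - middle         # offset of the maximum from the middle
epsilon = float(cc.max())    # magnitude of the global maximum
```
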
mesmerize.analysis.math.cross_correlation.get_lag_matrix(curves: numpy.ndarray = None, ccs: numpy.ndarray = None) → numpy.ndarray

Get a 2D matrix of lags. Can pass either a 2D array of 1D curves or cross-correlations.

Parameters:

- curves – 2D array of 1D curves
- ccs – 2D array of 1D cross-correlation functions represented by arrays

Returns: 2D matrix of lag values, shape is [n_curves, n_curves]

Return type: np.ndarray

mesmerize.analysis.math.cross_correlation.get_epsilon_matrix(curves: numpy.ndarray = None, ccs: numpy.ndarray = None) → numpy.ndarray

Get a 2D matrix of maxima. Can pass either a 2D array of 1D curves or cross-correlations.

Parameters:

- curves – 2D array of 1D curves
- ccs – 2D array of 1D cross-correlation functions represented by arrays

Returns: 2D matrix of maxima values, shape is [n_curves, n_curves]

Return type: np.ndarray

mesmerize.analysis.math.cross_correlation.compute_cc_data(curves: numpy.ndarray) → mesmerize.analysis.math.cross_correlation.CC_Data

Compute cross-correlation data (cc functions, lag and maxima matrices).

Parameters:

- curves – input curves as a 2D array, shape is [n_samples, curve_size]

Returns: cross-correlation data for the input curves as a CC_Data instance

Return type: CC_Data

mesmerize.analysis.math.cross_correlation.compute_ccs(a: numpy.ndarray) → numpy.ndarray

Compute cross-correlations between all 1D curves in a 2D input array.

Parameters:

- a – 2D input array of 1D curves, shape is [n_samples, curve_size]

Return type: np.ndarray
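The matrix functions above all operate on pairwise cross-correlations between curves. The sketch below shows how such data could be assembled with NumPy, producing an array of cross-correlation functions plus lag and maxima matrices of shape [n_curves, n_curves]. `pairwise_cc_data` is a hypothetical helper; the normalization and the sign convention of the lag are assumptions.

```python
import numpy as np


def pairwise_cc_data(curves):
    """Return (ccs, lag_matrix, epsilon_matrix) for a 2D array of 1D curves.

    Hypothetical helper; normalization and lag sign are assumptions.
    """
    curves = np.asarray(curves, dtype=float)
    n_curves, curve_size = curves.shape
    func_length = 2 * curve_size - 1  # length of a full cross-correlation
    norms = np.linalg.norm(curves, axis=1)

    ccs = np.zeros((n_curves, n_curves, func_length))
    for i in range(n_curves):
        for j in range(n_curves):
            denom = norms[i] * norms[j]
            if denom > 0:
                ccs[i, j] = np.correlate(curves[i], curves[j], mode="full") / denom

    middle = curve_size - 1  # index of zero lag in a full cross-correlation
    lag_matrix = ccs.argmax(axis=2) - middle
    epsilon_matrix = ccs.max(axis=2)
    return ccs, lag_matrix, epsilon_matrix


# Made-up curves: the second is the first shifted by two samples
curves = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 2.0, 1.0, 0.0, 0.0],
])
ccs, lag_matrix, epsilon_matrix = pairwise_cc_data(curves)
```

In this example the lag between the first two curves has magnitude 2, and every curve correlates with itself at zero lag with a maximum of 1.
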

### CC_Data

Data container

Warning

All arguments MUST be of type numpy.ndarray for CC_Data to be saveable as an hdf5 file. Set numpy.unicode as the dtype for the curve_uuids and labels arrays. If the dtype is 'O' (object), the to_hdf5() method will fail.
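The dtype requirement can be checked before constructing a CC_Data instance. The sketch below converts object arrays to a fixed-width unicode dtype. It passes the builtin `str` as the dtype, on the assumption that this is equivalent to `numpy.unicode`, an alias that newer NumPy releases no longer provide. The UUIDs and labels are made up.

```python
import numpy as np
from uuid import uuid4

# Object arrays (dtype 'O') are what string data often ends up as
curve_uuids_obj = np.array([str(uuid4()) for _ in range(3)], dtype=object)
labels_obj = np.array(["control", "treated", "control"], dtype=object)

# Convert to a fixed-width unicode dtype (kind 'U') before building CC_Data
curve_uuids = curve_uuids_obj.astype(str)
labels = labels_obj.astype(str)

# dtype.kind is 'O' for object arrays and 'U' for unicode arrays
kinds = (labels_obj.dtype.kind, labels.dtype.kind, curve_uuids.dtype.kind)
```
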

class mesmerize.analysis.cross_correlation.CC_Data(ccs: numpy.ndarray = None, lag_matrix: numpy.ndarray = None, epsilon_matrix: numpy.ndarray = None, curve_uuids: numpy.ndarray = None, labels: numpy.ndarray = None)
__init__(ccs: numpy.ndarray = None, lag_matrix: numpy.ndarray = None, epsilon_matrix: numpy.ndarray = None, curve_uuids: numpy.ndarray = None, labels: numpy.ndarray = None)

Object for organizing cross-correlation data

All argument types must be numpy.ndarray to be compatible with hdf5.

Parameters:

- ccs (np.ndarray) – array of cross-correlation functions, shape: [n_curves, n_curves, func_length]
- lag_matrix (np.ndarray) – the lag matrix, shape: [n_curves, n_curves]
- epsilon_matrix (np.ndarray) – the maxima matrix, shape: [n_curves, n_curves]
- curve_uuids (np.ndarray) – uuids (str representation) for each of the curves, length: n_curves
- labels (np.ndarray) – labels for each curve, length: n_curves

ccs = None

array of cross-correlation functions

lag_matrix = None

lag matrix

curve_uuids = None

uuids for each curve

labels = None

labels for each curve

get_threshold_matrix(matrix_type: str, lag_thr: float, max_thr: float, lag_thr_abs: bool = True) → numpy.ndarray

Get lag or maxima matrix with thresholds applied. Values outside the threshold are set to NaN

Parameters:

- matrix_type – one of ‘lag’ or ‘maxima’
- lag_thr – lag threshold
- max_thr – maxima threshold
- lag_thr_abs – threshold with the absolute value of lag

Returns: the requested matrix with the thresholds applied to it.

Return type: np.ndarray

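Thresholding of this kind can be sketched with `numpy.where`. The matrices and thresholds below are made up, and the reading of "outside the threshold" (absolute lag above `lag_thr`, or maximum below `max_thr`) is an assumption about the method's behavior.

```python
import numpy as np

# Made-up lag and maxima matrices for two curves
lag_matrix = np.array([[0.0, -3.0], [3.0, 0.0]])
epsilon_matrix = np.array([[1.0, 0.4], [0.4, 1.0]])
lag_thr, max_thr = 2.0, 0.5

# Assumed reading of "outside the threshold":
# the absolute lag is too large or the maximum is too small
outside = (np.abs(lag_matrix) > lag_thr) | (epsilon_matrix < max_thr)

# Entries outside the thresholds become NaN, the rest are kept
thresholded_lag = np.where(outside, np.nan, lag_matrix)
thresholded_maxima = np.where(outside, np.nan, epsilon_matrix)
```
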
classmethod from_dict(d: dict)

to_hdf5(path: str)

Save as an HDF5 file

Parameters: path – path to save the hdf5 file to, file must not exist.
classmethod from_hdf5(path: str)

Load cross-correlation data from an hdf5 file

Parameters: path – path to the hdf5 file

## Clustering metrics

mesmerize.analysis.clustering_metrics.get_centerlike(cluster_members: numpy.ndarray, metric: Union[str, callable, None] = None, dist_matrix: Optional[numpy.ndarray] = None) → Tuple[numpy.ndarray, int]

Finds the 1D time-series within a cluster that is the most centerlike.

Parameters:

- cluster_members – 2D numpy array in the form [n_samples, 1D time_series]
- metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
- dist_matrix – Distance matrix of the cluster members

Returns: The cluster member which is most centerlike, and its index in the cluster_members array

mesmerize.analysis.clustering_metrics.get_cluster_radius(cluster_members: numpy.ndarray, metric: Union[str, callable, None] = None, dist_matrix: Optional[numpy.ndarray] = None, centerlike_index: Optional[int] = None) → float

Returns the cluster radius according to the chosen distance metric.

Parameters:

- cluster_members – 2D numpy array in the form [n_samples, 1D time_series]
- metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
- dist_matrix – Distance matrix of the cluster members
- centerlike_index – Index of the centerlike cluster member within the cluster_members array

Returns: The cluster radius, i.e. the average distance between the most centerlike member and all other members

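The notions of a centerlike member and a cluster radius can be sketched with a distance matrix built in NumPy. The data are made up, Euclidean distance stands in for whatever metric is chosen, and taking the member with the smallest summed distance to the others as "most centerlike" is an assumption.

```python
import numpy as np

# Three made-up 1D time series in one cluster, shape [n_samples, series_length]
cluster_members = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
])

# Pairwise Euclidean distances, shape [n_samples, n_samples]
diffs = cluster_members[:, None, :] - cluster_members[None, :, :]
dist_matrix = np.linalg.norm(diffs, axis=2)

# Assumption: the most centerlike member has the smallest summed distance
centerlike_index = int(np.argmin(dist_matrix.sum(axis=0)))
centerlike = cluster_members[centerlike_index]

# Radius: mean distance from the centerlike member to all other members
others = np.delete(dist_matrix[centerlike_index], centerlike_index)
radius = float(others.mean())
```
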
mesmerize.analysis.clustering_metrics.davies_bouldin_score(data: numpy.ndarray, cluster_labels: numpy.ndarray, metric: Union[str, callable]) → float

Adapted from sklearn.metrics.davies_bouldin_score to use any distance metric.

Parameters:

- data – Data that was used for clustering, [n_samples, 1D time_series]
- metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
- cluster_labels – Cluster labels

Returns: Davies-Bouldin score computed with the chosen metric
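A Davies-Bouldin score with an arbitrary metric can be sketched as below. This is not the Mesmerize code: `davies_bouldin_sketch` and `euclidean` are hypothetical names, the metric is assumed to be a callable taking two 1D arrays, and using the most centerlike member in place of a centroid is an assumption.

```python
import numpy as np


def davies_bouldin_sketch(data, cluster_labels, metric):
    """Davies-Bouldin score with a callable pairwise metric.

    Hypothetical sketch; requires at least two clusters. The centerlike
    member (smallest summed distance) stands in for the centroid.
    """
    data = np.asarray(data, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    centers, radii = [], []
    for label in np.unique(cluster_labels):
        members = data[cluster_labels == label]
        dists = np.array([[metric(a, b) for b in members] for a in members])
        c_ix = int(np.argmin(dists.sum(axis=0)))
        centers.append(members[c_ix])
        # Mean distance of all members to the center (intra-cluster term)
        radii.append(dists[c_ix].mean())

    k = len(centers)
    worst_ratios = []
    for i in range(k):
        # Similarity of cluster i to every other cluster; keep the worst case
        ratios = [
            (radii[i] + radii[j]) / metric(centers[i], centers[j])
            for j in range(k) if j != i
        ]
        worst_ratios.append(max(ratios))
    return float(np.mean(worst_ratios))


def euclidean(a, b):
    return float(np.linalg.norm(a - b))


# Two made-up, well separated clusters of two points each
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
score = davies_bouldin_sketch(data, labels, euclidean)
```

Lower scores indicate compact, well separated clusters.
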