Analysis

Analysis helper functions

Utils

mesmerize.analysis.utils.get_array_size(transmission: mesmerize.analysis.data_types.Transmission, data_column: str) → int[source]

Returns the size of the 1D arrays in the specified data column. Throws an exception if the array sizes do not match.

Parameters:
  • transmission (Transmission) – Desired Transmission
  • data_column (str) – Data column of the Transmission from which to retrieve the size
Returns:

Size of the 1D arrays of the specified data column

Return type:

int
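The size check described above can be sketched with plain NumPy. This is an illustrative sketch only, not the library's implementation: `array_size_sketch` is a hypothetical name, and it takes a plain sequence of arrays rather than a Transmission and a data column.

```python
import numpy as np


def array_size_sketch(arrays):
    """Return the common size of a sequence of 1D arrays.

    Hypothetical helper: raises ValueError if the sizes differ
    (or if the sequence is empty).
    """
    sizes = {np.asarray(a).size for a in arrays}
    if len(sizes) != 1:
        raise ValueError(f"Arrays do not share one size: {sorted(sizes)}")
    return sizes.pop()


# Three curves of equal length, standing in for one data column
column = [np.zeros(100), np.ones(100), np.arange(100)]
size = array_size_sketch(column)
```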

mesmerize.analysis.utils.get_frequency_linspace(transmission: mesmerize.analysis.data_types.Transmission) → Tuple[numpy.ndarray, float][source]

Get the frequency linspace.

Throws an exception if all datablocks do not have the same linspace & Nyquist frequencies

Parameters:
  • transmission – Transmission containing data from which to get frequency linspace
Returns:

tuple: (frequency linspace as a 1D numpy array, Nyquist frequency)

Return type:

Tuple[np.ndarray, float]

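As a rough illustration of what a frequency linspace and Nyquist frequency look like, the sketch below builds them from a number of points and a sampling rate. Both inputs and the name `frequency_linspace_sketch` are assumptions for illustration; the library function derives its values from the Transmission, and how it does so is not shown here.

```python
import numpy as np


def frequency_linspace_sketch(n_points, sampling_rate):
    """Hypothetical helper: evenly spaced frequencies from 0 Hz to Nyquist."""
    nyquist = sampling_rate / 2.0  # Nyquist frequency is half the sampling rate
    freqs = np.linspace(0.0, nyquist, n_points)
    return freqs, nyquist


freqs, nyquist = frequency_linspace_sketch(n_points=51, sampling_rate=10.0)
```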
mesmerize.analysis.utils.get_proportions(xs: Union[pandas.core.series.Series, numpy.ndarray, list], ys: Union[pandas.core.series.Series, numpy.ndarray], xs_name: str = 'xs', ys_name: str = 'ys', swap: bool = False, percentages: bool = True) → pandas.core.frame.DataFrame[source]

Get the proportions of xs vs ys.

xs & ys are categorical data.

Parameters:
  • xs (Union[pd.Series, np.ndarray]) – data plotted on the x axis
  • ys (Union[pd.Series, np.ndarray]) – proportions of unique elements in ys are calculated per xs
  • xs_name (str) – name for the xs data, useful for labeling the axis in plots
  • ys_name (str) – name for the ys data, useful for labeling the axis in plots
  • swap (bool) – swap x and y
Returns:

DataFrame that can be plotted in a proportions bar graph

Return type:

pd.DataFrame
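The idea of "proportions of ys per xs" can be sketched with the standard library alone. This is not the library's implementation: `proportions_sketch` is a hypothetical name, it returns nested dicts instead of a DataFrame, and it assumes that `percentages=True` means scaling each proportion to 0–100.

```python
from collections import Counter, defaultdict


def proportions_sketch(xs, ys, percentages=True):
    """Hypothetical helper: share of each unique y value within each x group."""
    counts = defaultdict(Counter)
    for x, y in zip(xs, ys):
        counts[x][y] += 1

    scale = 100.0 if percentages else 1.0  # assumption: percentages are 0-100
    result = {}
    for x, counter in counts.items():
        total = sum(counter.values())
        result[x] = {y: scale * n / total for y, n in counter.items()}
    return result


# Made-up categorical data: a group label (xs) and a cluster label (ys)
xs = ["ctrl", "ctrl", "ctrl", "ctrl", "drug", "drug"]
ys = ["a", "a", "a", "b", "a", "b"]
table = proportions_sketch(xs, ys)
```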

mesmerize.analysis.utils.get_sampling_rate(transmission: mesmerize.analysis.data_types.Transmission, tolerance: Optional[float] = 0.1) → float[source]

Returns the mean sampling rate of all data in a Transmission if it is within the specified tolerance. Otherwise throws an exception.

Parameters:
  • transmission (Transmission) – Transmission object of the data from which sampling rate is obtained.
  • tolerance (float) – Maximum tolerance (in Hertz) of sampling rate variation between different samples
Returns:

The mean sampling rate of all data in the Transmission

Return type:

float
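A minimal sketch of the tolerance check, assuming the per-sample sampling rates are already available as a list of floats. The name `mean_sampling_rate_sketch` is hypothetical, as is the choice to compare the spread (max minus min) against the tolerance.

```python
import numpy as np


def mean_sampling_rate_sketch(rates, tolerance=0.1):
    """Hypothetical helper: mean rate if all rates agree within `tolerance` Hz."""
    rates = np.asarray(rates, dtype=float)
    spread = rates.max() - rates.min()  # assumption: spread is the variation measure
    if spread > tolerance:
        raise ValueError(
            f"Sampling rates vary by {spread:.3f} Hz, more than {tolerance} Hz"
        )
    return float(rates.mean())


rate = mean_sampling_rate_sketch([10.0, 10.05, 9.98])
```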

mesmerize.analysis.utils.organize_dataframe_columns(columns: Iterable[str]) → Tuple[List[str], List[str], List[str]][source]

Organizes DataFrame columns into data column, categorical label columns, and uuid columns.

Parameters:
  • columns – All DataFrame columns
Returns:

(data_columns, categorical_columns, uuid_columns)

Return type:

Tuple[List[str], List[str], List[str]]

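The three-way split can be sketched as below. The sorting rules are assumptions, not taken from the library: data columns are taken to start with an underscore, and uuid columns to contain "uuid" in their name. The function name and the column names are hypothetical.

```python
def organize_columns_sketch(columns):
    """Hypothetical helper: split column names into data, categorical and uuid."""
    columns = list(columns)
    # Assumption: uuid columns contain "uuid" in their name
    uuid_columns = [c for c in columns if "uuid" in c.lower()]
    # Assumption: data columns are prefixed with an underscore
    data_columns = [
        c for c in columns if c.startswith("_") and c not in uuid_columns
    ]
    # Everything left over is treated as a categorical label column
    categorical_columns = [
        c for c in columns if c not in data_columns and c not in uuid_columns
    ]
    return data_columns, categorical_columns, uuid_columns


cols = ["_RAW_CURVE", "_SPLICE_ARRAYS", "genotype", "stimulus", "uuid_curve"]
data_cols, cat_cols, uuid_cols = organize_columns_sketch(cols)
```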
mesmerize.analysis.utils.pad_arrays(a: numpy.ndarray, method: str = 'random', output_size: int = None, mode: str = 'minimum', constant: Any = None) → numpy.ndarray[source]

Pad all the input arrays so that they are of the same length. The length is determined by the largest input array. By default, the padding value for each input array is the minimum value in that array.

Padding for each input array is either done after the array’s last index to fill up to the length of the largest input array (method ‘fill-size’) or the padding is randomly flanked to the input array (method ‘random’) for easier visualization.

Parameters:
  • a (np.ndarray) – 1D array where each element is a 1D array
  • method (str) – one of ‘fill-size’ or ‘random’, see the description above for details
  • output_size – not used
  • mode (str) – one of either ‘constant’ or ‘minimum’. If ‘minimum’, the min value of the array is used as the padding value. If ‘constant’, the value passed to the “constant” argument is used as the padding value.
  • constant (Any) – padding value if ‘mode’ is set to ‘constant’
Returns:

Arrays padded according to the chosen method. 2D array of shape [n_arrays, size of largest input array]

Return type:

np.ndarray
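The ‘fill-size’ behaviour described above can be sketched with `np.pad`. This is an illustrative sketch, not the library's code: `pad_fill_size_sketch` is a hypothetical name, it accepts a plain list of arrays, and it does not cover the ‘random’ method.

```python
import numpy as np


def pad_fill_size_sketch(arrays, mode="minimum", constant=None):
    """Hypothetical helper: pad after each array's last index to a common length."""
    if mode not in ("minimum", "constant"):
        raise ValueError("mode must be 'minimum' or 'constant'")
    if mode == "constant" and constant is None:
        raise ValueError("a constant is required when mode is 'constant'")

    target = max(len(a) for a in arrays)  # length of the largest input array
    padded = []
    for a in arrays:
        a = np.asarray(a, dtype=float)
        fill = a.min() if mode == "minimum" else constant
        # Pad only on the right-hand side, up to the target length
        padded.append(
            np.pad(a, (0, target - a.size), mode="constant", constant_values=fill)
        )
    return np.vstack(padded)  # shape: [n_arrays, size of largest input array]


out = pad_fill_size_sketch([[3, 1, 2], [5, 4, 6, 7, 8]])
```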

Cross correlation

Functions

Helper functions. Uses tslearn.cycc

mesmerize.analysis.math.cross_correlation.ncc_c(x: numpy.ndarray, y: numpy.ndarray) → numpy.ndarray[source]

Must pass 1D array to both x and y

Parameters:
  • x – Input array [x1, x2, x3, … xn]
  • y – Input array [y1, y2, y3, … yn]
Returns:

Returns the normalized cross correlation function (as an array) of the two input vector arguments “x” and “y”

Return type:

np.ndarray
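A normalized cross-correlation of two 1D arrays can be sketched with `np.correlate`. This is a stand-in for illustration, not the tslearn-based routine the module uses: `ncc_sketch` is a hypothetical name, and the ordering of lags in its output may differ from the library's.

```python
import numpy as np


def ncc_sketch(x, y):
    """Hypothetical helper: full cross-correlation scaled by the L2 norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        # At least one all-zero input: define the result as all zeros
        return np.zeros(x.size + y.size - 1)
    return np.correlate(x, y, mode="full") / denom


x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
cc = ncc_sketch(x, x)  # correlating a curve with itself
```

For two inputs of length n the result has length 2n - 1, and correlating a curve with itself peaks at the middle index with a value of 1.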

mesmerize.analysis.math.cross_correlation.get_omega(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → int[source]

Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”

Parameters:
  • x – Input array [x1, x2, x3, … xn]
  • y – Input array [y1, y2, y3, … yn]
  • cc – cross-correlation function represented as an array [c1, c2, c3, … cn]
Returns:

index (x-axis position) of the global maximum of the cross-correlation function

Return type:

int

mesmerize.analysis.math.cross_correlation.get_lag(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → float[source]

Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”

Parameters:
  • x – Input array [x1, x2, x3, … xn]
  • y – Input array [y1, y2, y3, … yn]
  • cc – cross-correlation function represented as an array [c1, c2, c3, … cn]
Returns:

Position of the maximum of the cross-correlation function with respect to the middle point of the cross-correlation function

Return type:

float

mesmerize.analysis.math.cross_correlation.get_epsilon(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → float[source]

Must pass a 1D vector to either both “x” and “y” or a cross-correlation function to “cc”

Parameters:
  • x – Input array [x1, x2, x3, … xn]
  • y – Input array [y1, y2, y3, … yn]
  • cc – cross-correlation function represented as an array [c1, c2, c3, … cn]
Returns:

Magnitude of the global maximum of the cross-correlation function

Return type:

float
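The three quantities above (omega, lag and epsilon) can be sketched from a cross-correlation array as follows. The function names are hypothetical, and taking the middle point as `len(cc) // 2` is an assumption about the convention, not a statement of how the library computes it.

```python
import numpy as np


def omega_sketch(cc):
    """Index of the global maximum of the cross-correlation function."""
    return int(np.argmax(cc))


def lag_sketch(cc):
    """Position of the maximum relative to the middle point.

    Assumption: the middle point is index len(cc) // 2.
    """
    return omega_sketch(cc) - len(cc) // 2


def epsilon_sketch(cc):
    """Magnitude of the global maximum."""
    return float(np.max(cc))


# Made-up cross-correlation function of length 7 (middle index 3)
cc = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.2])
omega, lag, epsilon = omega_sketch(cc), lag_sketch(cc), epsilon_sketch(cc)
```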

mesmerize.analysis.math.cross_correlation.get_lag_matrix(curves: numpy.ndarray = None, ccs: numpy.ndarray = None) → numpy.ndarray[source]

Get a 2D matrix of lags. Can pass either a 2D array of 1D curves or cross-correlations

Parameters:
  • curves – 2D array of 1D curves
  • ccs – 2D array of 1D cross-correlation functions represented by arrays
Returns:

2D matrix of lag values, shape is [n_curves, n_curves]

Return type:

np.ndarray

mesmerize.analysis.math.cross_correlation.get_epsilon_matrix(curves: numpy.ndarray = None, ccs: numpy.ndarray = None) → numpy.ndarray[source]

Get a 2D matrix of maxima. Can pass either a 2D array of 1D curves or cross-correlations

Parameters:
  • curves – 2D array of 1D curves
  • ccs – 2D array of 1D cross-correlation functions represented by arrays
Returns:

2D matrix of maxima values, shape is [n_curves, n_curves]

Return type:

np.ndarray

mesmerize.analysis.math.cross_correlation.compute_cc_data(curves: numpy.ndarray) → mesmerize.analysis.math.cross_correlation.CC_Data[source]

Compute cross-correlation data (cc functions, lag and maxima matrices)

Parameters:
  • curves – input curves as a 2D array, shape is [n_samples, curve_size]
Returns:

cross correlation data for the input curves as a CC_Data instance

Return type:

CC_Data

mesmerize.analysis.math.cross_correlation.compute_ccs(a: numpy.ndarray) → numpy.ndarray[source]

Compute cross-correlations between all 1D curves in a 2D input array

Parameters:
  • a – 2D input array of 1D curves, shape is [n_samples, curve_size]
Return type:

np.ndarray
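The pairwise computation, and the lag and maxima matrices derived from it, can be sketched with NumPy alone. This is an illustration under assumptions, not the library's implementation: `pairwise_cc_sketch` is a hypothetical name, `np.correlate` stands in for the tslearn routine, and lags are measured from index `length // 2`.

```python
import numpy as np


def pairwise_cc_sketch(curves):
    """Hypothetical helper: all-pairs normalized cross-correlations.

    Returns (ccs, lag_matrix, epsilon_matrix) with shapes
    [n, n, 2m - 1], [n, n] and [n, n] for input of shape [n, m].
    """
    curves = np.asarray(curves, dtype=float)
    n, m = curves.shape
    norms = np.linalg.norm(curves, axis=1)
    ccs = np.zeros((n, n, 2 * m - 1))
    for i in range(n):
        for j in range(n):
            denom = norms[i] * norms[j]
            if denom > 0:  # leave all-zero curves as zero correlation
                ccs[i, j] = np.correlate(curves[i], curves[j], mode="full") / denom

    # Assumption: lag is the argmax position relative to the middle index
    lag_matrix = ccs.argmax(axis=2) - ccs.shape[2] // 2
    epsilon_matrix = ccs.max(axis=2)
    return ccs, lag_matrix, epsilon_matrix


# Two unit pulses, one sample apart
curves = np.array([[0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
ccs, lag_matrix, epsilon_matrix = pairwise_cc_sketch(curves)
```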

CC_Data

Data container

Warning

All arguments to CC_Data MUST be of type numpy.ndarray for the data to be saveable as an hdf5 file. Set numpy.unicode as the dtype for the curve_uuids and labels arrays. If the dtype is 'O' (object), the to_hdf5() method will fail.
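A short sketch of converting object-dtype string arrays to a unicode dtype before building a CC_Data instance. It uses `np.str_`, on the assumption that it is an acceptable stand-in for the `numpy.unicode` named in the warning; the label and uuid values are made up.

```python
import numpy as np

# Arrays built from DataFrame columns often end up with dtype 'O' (object)
labels_obj = np.array(["control", "treated", "control"], dtype=object)
uuids_obj = np.array(["a1b2", "c3d4", "e5f6"], dtype=object)

# Convert to a fixed-width unicode dtype
labels = labels_obj.astype(np.str_)
curve_uuids = uuids_obj.astype(np.str_)

# dtype.kind is 'U' for unicode strings and 'O' for object arrays
kinds = (labels_obj.dtype.kind, labels.dtype.kind, curve_uuids.dtype.kind)
```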

class mesmerize.analysis.cross_correlation.CC_Data(ccs: numpy.ndarray = None, lag_matrix: numpy.ndarray = None, epsilon_matrix: numpy.ndarray = None, curve_uuids: numpy.ndarray = None, labels: numpy.ndarray = None)
__init__(ccs: numpy.ndarray = None, lag_matrix: numpy.ndarray = None, epsilon_matrix: numpy.ndarray = None, curve_uuids: numpy.ndarray = None, labels: numpy.ndarray = None)

Object for organizing cross-correlation data

types must be numpy.ndarray to be compatible with hdf5

Parameters:
  • ccs (np.ndarray) – array of cross-correlation functions, shape: [n_curves, n_curves, func_length]
  • lag_matrix (np.ndarray) – the lag matrix, shape: [n_curves, n_curves]
  • epsilon_matrix (np.ndarray) – the maxima matrix, shape: [n_curves, n_curves]
  • curve_uuids (np.ndarray) – uuids (str representation) for each of the curves, length: n_curves
  • labels (np.ndarray) – labels for each curve, length: n_curves
ccs = None

array of cross-correlation functions

lag_matrix = None

lag matrix

curve_uuids = None

uuids for each curve

labels = None

labels for each curve

get_threshold_matrix(matrix_type: str, lag_thr: float, max_thr: float, lag_thr_abs: bool = True) → numpy.ndarray

Get lag or maxima matrix with thresholds applied. Values outside the threshold are set to NaN

Parameters:
  • matrix_type – one of ‘lag’ or ‘maxima’
  • lag_thr – lag threshold
  • max_thr – maxima threshold
  • lag_thr_abs – threshold with the absolute value of lag
Returns:

the requested matrix with the thresholds applied to it.

Return type:

np.ndarray
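The thresholding can be sketched as a masking operation. The direction of each threshold is an assumption here, not taken from the library: entries are kept when the lag is at most `lag_thr` and the maximum is at least `max_thr`. The function name is hypothetical and takes the two matrices as explicit arguments.

```python
import numpy as np


def threshold_matrix_sketch(lag_matrix, epsilon_matrix, matrix_type,
                            lag_thr, max_thr, lag_thr_abs=True):
    """Hypothetical helper: return the lag or maxima matrix with NaN outside thresholds."""
    if matrix_type not in ("lag", "maxima"):
        raise ValueError("matrix_type must be 'lag' or 'maxima'")
    lag = np.asarray(lag_matrix, dtype=float)
    eps = np.asarray(epsilon_matrix, dtype=float)

    lag_for_test = np.abs(lag) if lag_thr_abs else lag
    # Assumption: keep small lags and large maxima
    keep = (lag_for_test <= lag_thr) & (eps >= max_thr)

    source = lag if matrix_type == "lag" else eps
    return np.where(keep, source, np.nan)


lags = np.array([[0, 5], [-5, 0]])
maxima = np.array([[1.0, 0.9], [0.9, 1.0]])
thresholded = threshold_matrix_sketch(lags, maxima, "lag", lag_thr=2, max_thr=0.5)
```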

classmethod from_dict(d: dict)

Load data from a dict

to_hdf5(path: str)

Save as an HDF5 file

Parameters:
  • path – path to save the hdf5 file to; the file must not already exist.

classmethod from_hdf5(path: str)

Load cross-correlation data from an hdf5 file

Parameters:
  • path – path to the hdf5 file

Clustering metrics

mesmerize.analysis.clustering_metrics.get_centerlike(cluster_members: numpy.ndarray, metric: Union[str, callable, None] = None, dist_matrix: Optional[numpy.ndarray] = None) → Tuple[numpy.ndarray, int][source]

Finds the 1D time-series within a cluster that is the most centerlike

Parameters:
  • cluster_members – 2D numpy array in the form [n_samples, 1D time_series]
  • metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
  • dist_matrix – Distance matrix of the cluster members
Returns:

The cluster member which is most centerlike, and its index in the cluster_members array
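One way to read "most centerlike" is the member with the smallest total distance to all other members. The sketch below makes that assumption and uses Euclidean distance computed with NumPy rather than `sklearn.metrics.pairwise_distances`; `centerlike_sketch` is a hypothetical name.

```python
import numpy as np


def centerlike_sketch(cluster_members):
    """Hypothetical helper: member with the smallest summed distance to the rest."""
    members = np.asarray(cluster_members, dtype=float)
    # Pairwise Euclidean distance matrix, shape [n_samples, n_samples]
    diffs = members[:, None, :] - members[None, :, :]
    dist_matrix = np.sqrt((diffs ** 2).sum(axis=2))
    index = int(np.argmin(dist_matrix.sum(axis=0)))
    return members[index], index


# Three short made-up "time series"
members = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
center, center_index = centerlike_sketch(members)
```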

mesmerize.analysis.clustering_metrics.get_cluster_radius(cluster_members: numpy.ndarray, metric: Union[str, callable, None] = None, dist_matrix: Optional[numpy.ndarray] = None, centerlike_index: Optional[int] = None) → float[source]

Returns the cluster radius according to chosen distance metric

Parameters:
  • cluster_members – 2D numpy array in the form [n_samples, 1D time_series]
  • metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
  • dist_matrix – Distance matrix of the cluster members
  • centerlike_index – Index of the centerlike cluster member within the cluster_members array
Returns:

The cluster radius, the average distance between the most centerlike member and all other members
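The radius described above can be sketched as the mean distance from the centerlike member to every other member. The sketch assumes Euclidean distance and excludes the centerlike member itself from the average; `cluster_radius_sketch` is a hypothetical name.

```python
import numpy as np


def cluster_radius_sketch(cluster_members, centerlike_index):
    """Hypothetical helper: mean distance from the centerlike member to the others."""
    members = np.asarray(cluster_members, dtype=float)
    center = members[centerlike_index]
    others = np.delete(members, centerlike_index, axis=0)
    if others.shape[0] == 0:
        return 0.0  # a single-member cluster has no other members
    return float(np.linalg.norm(others - center, axis=1).mean())


members = np.array([[0.0, 0.0], [3.0, 4.0], [-3.0, -4.0]])
radius = cluster_radius_sketch(members, centerlike_index=0)
```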

mesmerize.analysis.clustering_metrics.davies_bouldin_score(data: numpy.ndarray, cluster_labels: numpy.ndarray, metric: Union[str, callable]) → float[source]

Adapted from sklearn.metrics.davies_bouldin_score to use any distance metric

Parameters:
  • data – Data that was used for clustering, [n_samples, 1D time_series]
  • metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
  • cluster_labels – Cluster labels
Returns:

Davies-Bouldin score computed with the given distance metric