DawidSkene

class DawidSkene(answers: dict[Hashable, dict[Hashable, Hashable]], n_workers: Annotated[int, Ge(ge=1)], n_classes: Annotated[int, Ge(ge=1)])

Dawid and Skene model (1979).

Assumptions:
  • independent workers

Using:
  • EM algorithm

Estimating:
  • one confusion matrix per worker

__init__(answers: dict[Hashable, dict[Hashable, Hashable]], n_workers: Annotated[int, Ge(ge=1)], n_classes: Annotated[int, Ge(ge=1)]) None

Dawid and Skene strategy: estimate confusion matrix for each worker.

Assuming that workers are independent, the model posits that

\[(y_i^{(j)}\ | y_i^{\star} = k) \sim \mathcal{M}\left(\pi^{(j)}_{k,\cdot}\right)\]

and maximizes the log likelihood of the model using an EM algorithm.

\[\underset{\rho,\pi,T}{\mathrm{argmax}}\prod_{i\in [n_{\texttt{task}}]}\prod_{k \in [K]}\bigg[\rho_k\prod_{j\in [n_{\texttt{worker}}]}\prod_{\ell\in [K]}\big(\pi^{(j)}_{k, \ell}\big)^{\mathbf{1}_{\{y_i^{(j)}=\ell\}}}\bigg]^{T_{i,k}},\]

where \(\rho\) denotes the class marginals, \(\pi\) the workers' confusion matrices, and \(T\) the indicator variables of class membership.
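As a hedged sketch of the objective above (not the library's internal implementation), one EM iteration can be written with NumPy. Here `votes` is assumed to be a one-hot array of shape (n_task, n_worker, n_classes), all-zero where a worker did not answer a task, and `T` holds the current soft labels:

```python
import numpy as np

def em_step(votes, T):
    """One EM iteration for the Dawid-Skene likelihood (sketch).

    votes: (n_task, n_worker, n_classes) one-hot answers,
           all-zero where a worker did not answer a task.
    T:     (n_task, n_classes) current soft labels.
    """
    # M-step: class marginals rho and per-worker confusion matrices pi
    rho = T.mean(axis=0)                                    # shape (K,)
    pi = np.einsum("ik,ijl->jkl", T, votes)                 # (n_worker, K, K)
    pi /= np.clip(pi.sum(axis=2, keepdims=True), 1e-12, None)
    # E-step: T_{i,k} proportional to rho_k * prod_{j,l} pi[j,k,l]^{1{y_i^(j)=l}}
    log_T = np.log(np.clip(rho, 1e-12, None)) \
        + np.einsum("ijl,jkl->ik", votes, np.log(np.clip(pi, 1e-12, None)))
    T = np.exp(log_T - log_T.max(axis=1, keepdims=True))
    return T / T.sum(axis=1, keepdims=True), rho, pi
```

Iterating `em_step` from a majority-vote initialization of `T` alternates the maximization over \((\rho, \pi)\) with the posterior update of \(T\).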

Parameters:
  • answers (dict) –

    Dictionary of workers answers with format

    {
        task0: {worker0: label, worker1: label},
        task1: {worker1: label}
    }
    

  • sparse – If the number of workers, tasks, or labels is large (\(>10^{6}\) for at least one), use sparse=True to run the computation per task

  • n_classes (int, optional) – Number of possible classes, defaults to 2
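For illustration, the answers dict above can be converted to a dense one-hot array; the helper below is a hedged sketch, not part of the library, and it assumes worker and label keys are integer indices:

```python
import numpy as np

def answers_to_matrix(answers, n_workers, n_classes):
    """Turn the answers dict into a one-hot array of shape
    (n_task, n_worker, n_classes); missing answers stay all-zero."""
    tasks = sorted(answers)
    mat = np.zeros((len(tasks), n_workers, n_classes))
    for i, task in enumerate(tasks):
        for worker, label in answers[task].items():
            mat[i, worker, label] = 1
    return mat

# Two tasks: worker 1 answered both, worker 0 only the first
answers = {0: {0: 1, 1: 1}, 1: {1: 0}}
mat = answers_to_matrix(answers, n_workers=2, n_classes=2)
```

The all-zero slice `mat[1, 0]` encodes the missing answer of worker 0 on task 1.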

classmethod from_crowd_matrix(crowd_matrix: ndarray, **kwargs: dict[str, Any]) Self
run(epsilon: Annotated[float, Ge(ge=0)] = 1e-06, maxiter: Annotated[int, Ge(ge=0)] = 50) tuple[list[float], int]

Run the EM optimization.

Parameters:
  • epsilon (float, optional) – Stopping criterion (\(\ell_1\) norm between two iterates of the log likelihood), defaults to 1e-6

  • maxiter (int, optional) – Maximum number of steps, defaults to 50

  • verbose – Verbosity level, defaults to False

Returns:

Log likelihood values and number of steps taken

Return type:

tuple[list[float], int]
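A hedged sketch of the stopping rule described above (not the library's code): iterate until the change in log likelihood drops below epsilon or maxiter is reached, and return the history together with the number of steps taken:

```python
def run_em(step, log_likelihood, epsilon=1e-6, maxiter=50):
    """Generic EM driver (sketch): `step` performs one EM update,
    `log_likelihood` returns the current log likelihood.
    Stops when the change drops below `epsilon` or after `maxiter` steps."""
    ll = [log_likelihood()]
    steps = 0
    for _ in range(maxiter):
        step()
        ll.append(log_likelihood())
        steps += 1
        if abs(ll[-1] - ll[-2]) < epsilon:
            break
    return ll, steps
```

Returning the full log-likelihood history makes convergence easy to inspect, e.g. by checking that the sequence is non-decreasing.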

get_answers() ndarray[tuple[int, ...], dtype[_ScalarType_co]]

Get the most probable label for each task.

get_probas() ndarray[tuple[int, ...], dtype[_ScalarType_co]]

Get the soft label distribution for each task.
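The two getters are related: assuming the soft labels form an (n_task, n_classes) array whose rows sum to 1 (a hedged sketch with hypothetical values, not the library's output), the hard labels are the row-wise argmax:

```python
import numpy as np

# Hypothetical soft labels, one row per task (rows sum to 1)
T = np.array([[0.90, 0.10],
              [0.20, 0.80],
              [0.55, 0.45]])
hard = T.argmax(axis=1)  # most probable class for each task
```

Ties in a row would be broken toward the lowest class index by `argmax`.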