Datasets
With peerannot, one of our goals is to standardize the formats of crowdsourcing datasets. To do so, we first recall some notation.
Notations
Recall that $y_i^{(j)}\in[K]$ is the label given by worker $w_j$ for task $x_i$ (e.g., an image). The label is encoded as an integer identifying one of the $K$ possible classes.
Datasets structure
Each dataset is defined as:
dataset
├── train
│ ├── class0
│ │ ├─ task0-<vote_index_0>.png
│ │ ├─ task1-<vote_index_1>.png
│ │ ├─ ...
│ │ └─ taskn0-<vote_index_n0>.png
│ ├── class1
│ ├── ...
│ └── classK
├── val
├── test
├── dataset.py
├── metadata.json
└── answers.json
The crowdsourced labels for each training task are stored in the answers.json file, formatted as follows:
{
    0: {<worker_id>: <label>, <another_worker_id>: <label>},
    1: {<yet_another_worker_id>: <label>}
}
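As a minimal sketch of this format, the snippet below writes a small answers file with Python's json module and reads it back. The worker identifiers and labels are made up for illustration, and integer worker identifiers are an assumption. Since JSON object keys are always strings, the integer task and worker indices come back as strings after a round trip and have to be converted explicitly.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical votes: task index -> {worker id -> label}
answers = {
    0: {0: 2, 1: 2, 4: 1},
    1: {3: 0},
}


def save_answers(answers, path):
    """Write the votes dictionary to a JSON file."""
    with open(path, "w") as f:
        json.dump(answers, f, ensure_ascii=False, indent=3)


def load_answers(path):
    """Read the votes back, converting the string keys to integers."""
    with open(path) as f:
        raw = json.load(f)
    return {
        int(task): {int(worker): int(label) for worker, label in votes.items()}
        for task, votes in raw.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "answers.json"
    save_answers(answers, path)
    restored = load_answers(path)

print(restored)
```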
Note that the task index in the answers.json file might not match the order of the tasks in the train folder. Hence, each task's file name contains the index of its entry in the votes file. The number of tasks in the train folder must match the number of keys in the answers.json file.
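These two constraints can be checked mechanically. The sketch below is not part of peerannot: the function names are hypothetical, and it assumes that every task file is named `<anything>-<vote_index>.<extension>` and sits in `train/<class folder>/`, as in the tree above.

```python
import json
import tempfile
from pathlib import Path


def vote_index(filename):
    """Extract the vote index from a name such as 'task3-17.png'."""
    stem = Path(filename).stem
    _, sep, index = stem.rpartition("-")
    if not sep or not index.isdigit():
        raise ValueError(f"no vote index found in {filename!r}")
    return int(index)


def check_dataset(root):
    """Check that the train tasks and the answers.json keys agree."""
    root = Path(root)
    with open(root / "answers.json") as f:
        answer_keys = sorted(int(key) for key in json.load(f))
    task_indices = sorted(
        vote_index(p.name) for p in (root / "train").glob("*/*") if p.is_file()
    )
    if len(task_indices) != len(answer_keys):
        raise ValueError(
            f"{len(task_indices)} tasks but {len(answer_keys)} answer entries"
        )
    if task_indices != answer_keys:
        raise ValueError("task vote indices do not match the answers.json keys")
    return True


def build_demo(root):
    """Create a tiny dataset whose folder order differs from the vote order."""
    root = Path(root)
    for class_name, filename in (("class0", "task0-1.png"), ("class1", "task1-0.png")):
        (root / "train" / class_name).mkdir(parents=True, exist_ok=True)
        (root / "train" / class_name / filename).touch()
    with open(root / "answers.json", "w") as f:
        json.dump({0: {3: 1}, 1: {0: 0, 2: 0}}, f)


with tempfile.TemporaryDirectory() as tmp:
    build_demo(tmp)
    ok = check_dataset(tmp)
```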
The metadata.json file contains general information about the dataset. A minimal example would be:
{
    "name": <dataset>,
    "n_classes": K,
    "n_workers": <n_workers>
}
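One possible way to produce these fields is to derive them from the votes dictionary. In the sketch below, the function name and the dataset name are hypothetical, and counting n_workers as the number of distinct worker identifiers that appear in the answers is an assumption rather than a documented rule.

```python
import json


def build_metadata(name, n_classes, answers):
    """Assemble the minimal metadata fields from a votes dictionary."""
    # Assumption: n_workers counts the distinct workers who voted at least once
    workers = {worker for votes in answers.values() for worker in votes}
    return {"name": name, "n_classes": n_classes, "n_workers": len(workers)}


# Hypothetical votes: task index -> {worker id -> label}
answers = {0: {0: 2, 1: 2, 4: 1}, 1: {3: 0}}
metadata = build_metadata("toy_dataset", 3, answers)
metadata_json = json.dumps(metadata, indent=3)
print(metadata_json)
```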
In a dataset without tasks (for example, when only the crowdsourced labels are available), the train, val and test folders are omitted. However, many learning strategies are then unavailable.
Download and install real datasets easily
The dataset.py file is not mandatory, but it facilitates the dataset's installation procedure. A minimal example:
import json
from pathlib import Path


class mydataset:
    def __init__(self):
        self.DIR = Path(__file__).parent.resolve()
        # download the data needed
        # ...

    def setfolders(self):
        print(f"Loading data folders at {self.DIR}")
        train_path = self.DIR / "train"
        test_path = self.DIR / "test"
        valid_path = self.DIR / "val"
        # Create train/val/test tasks with matching index
        # ...
        print("Created:")
        for set_name, path in zip(
            ("train", "val", "test"), [train_path, valid_path, test_path]
        ):
            print(f"- {set_name}: {path}")
        self.get_crowd_labels()
        print(f"Train crowd labels are in {self.DIR / 'answers.json'}")

    def get_crowd_labels(self):
        # create the answers.json dictionary in the format presented above
        # ...
        dictionary = {}  # placeholder, to be filled by the step above
        with open(self.DIR / "answers.json", "w") as answ:
            json.dump(dictionary, answ, ensure_ascii=False, indent=3)
Any access requirements or restrictions for the dataset can also be handled in this installation file.
Simulate datasets
The simulate module of peerannot allows creating simulated labels given a strategy.
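To give an idea of what simulating labels under a strategy can look like, here is a sketch that does not call the simulate module and makes no claim about its interface. It draws a true label for each task, then lets each selected worker answer correctly with a worker-specific probability and pick uniformly among the other classes otherwise. The function name, its parameters, and the error model are all assumptions chosen for illustration; the output follows the votes format described earlier.

```python
import random


def simulate_answers(n_tasks, n_workers, n_classes, votes_per_task, accuracies, seed=0):
    """Simulate votes where worker j is correct with probability accuracies[j]."""
    rng = random.Random(seed)
    # Hypothetical ground truth, drawn uniformly among the classes
    truth = [rng.randrange(n_classes) for _ in range(n_tasks)]
    answers = {}
    for task, true_label in enumerate(truth):
        answers[task] = {}
        # Each task is labeled by a random subset of distinct workers
        for worker in rng.sample(range(n_workers), votes_per_task):
            if rng.random() < accuracies[worker]:
                label = true_label
            else:
                wrong = [k for k in range(n_classes) if k != true_label]
                label = rng.choice(wrong)
            answers[task][worker] = label
    return truth, answers


truth, answers = simulate_answers(
    n_tasks=5,
    n_workers=4,
    n_classes=3,
    votes_per_task=2,
    accuracies=[0.9, 0.8, 0.6, 0.5],
)
print(answers)
```

The returned dictionary can be written to answers.json as is, which makes it convenient for testing a pipeline before real crowdsourced labels are available.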
Examples
Examples of simulation strategies can be found in the following: