Datasets


One of the goals of peerannot is to standardize the formats of crowdsourcing datasets. To do so, we first recall some notation.

Notations

Recall that $y_i^{(j)}\in[K]$ is the label given by worker $w_j$ to task $x_i$ (e.g., an image). The label is encoded as an integer identifying one of the $K$ possible classes.

Datasets structure

Each dataset is organized as follows:

dataset
      ├── train
      │     ├── class0
      │     │     ├─ task0-<vote_index_0>.png
      │     │     ├─ task1-<vote_index_1>.png
      │     │     ├─ ...
      │     │     └─ taskn0-<vote_index_n0>.png
      │     ├── class1
      │     ├── ...
      │     └── classK
      ├── val
      ├── test
      ├── dataset.py
      ├── metadata.json
      └── answers.json

The crowdsourced labels for each training task are stored in the answers.json file. They are formatted as follows:

{
      0: {<worker_id>: <label>, <another_worker_id>: <label>},
      1: {<yet_another_worker_id>: <label>}
}

Note that the task index in the answers.json file may not match the order of the tasks in the train folder. For this reason, each task's file name contains the index of its entry in the votes file. The number of tasks in the train folder must match the number of keys in the answers.json file.
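The two constraints above can be checked with a few lines of Python. The following is only a minimal sketch, not part of peerannot: the helper names vote_index and check_train_matches_answers are hypothetical, and the sketch assumes that the vote index is the integer after the last "-" in the file name, as in the task0-<vote_index_0>.png pattern.

```python
import json
import tempfile
from pathlib import Path


def vote_index(task_path):
    """Return the integer found after the last '-' in the task's file name."""
    return int(Path(task_path).stem.rsplit("-", 1)[1])


def check_train_matches_answers(dataset_dir):
    """Map each vote index to its train file, after checking both sides agree."""
    dataset_dir = Path(dataset_dir)
    with open(dataset_dir / "answers.json") as f:
        # JSON object keys are always strings: convert them back to integers
        answer_keys = {int(key) for key in json.load(f)}
    # Expected layout: train/<class folder>/<task file>
    files = [p for p in (dataset_dir / "train").glob("*/*") if p.is_file()]
    tasks = {vote_index(p): p for p in files}
    if len(tasks) != len(files) or set(tasks) != answer_keys:
        raise ValueError(
            f"{len(files)} train tasks do not match "
            f"{len(answer_keys)} answers.json entries"
        )
    return tasks


# Toy dataset: the file order (task0, task1) differs from the answers.json keys
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("class0/task0-1.png", "class1/task1-0.png"):
        (root / "train" / name).parent.mkdir(parents=True, exist_ok=True)
        (root / "train" / name).touch()
    (root / "answers.json").write_text(
        json.dumps({0: {"0": 1, "1": 1}, 1: {"2": 0}})
    )
    matched = check_train_matches_answers(root)
    matched_names = {index: path.name for index, path in matched.items()}

print(matched_names)
```

Here task0-1.png is matched to entry 1 of the votes file even though it comes first in the folder, which is the situation that the index in the file name is meant to handle.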

The metadata.json file contains general information about the dataset. A minimal example would be:

{
      "name": <dataset>,
      "n_classes": K,
      "n_workers": <n_workers>
}
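As an illustration, such a file could be generated from the crowdsourced labels. This is a sketch under assumptions: build_metadata is a hypothetical helper, not a peerannot function, and it takes n_workers to be the number of distinct worker identifiers appearing in the answers. The number of classes is passed explicitly, since some classes may never receive a vote.

```python
import json
import tempfile
from pathlib import Path


def build_metadata(name, n_classes, answers):
    """Build the minimal metadata dictionary from a {task: {worker: label}} mapping."""
    # Assumption: n_workers counts the distinct worker ids seen in the answers
    workers = {worker for votes in answers.values() for worker in votes}
    return {"name": name, "n_classes": n_classes, "n_workers": len(workers)}


# Toy answers with three distinct workers (w0, w1, w2)
answers = {0: {"w0": 2, "w1": 2}, 1: {"w2": 0}}
metadata = build_metadata("toy-dataset", n_classes=3, answers=answers)

# Write the file, then read it back to confirm that it is valid JSON
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.json"
    with open(path, "w") as f:
        json.dump(metadata, f, ensure_ascii=False, indent=3)
    reloaded = json.loads(path.read_text())

print(reloaded)
```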

In a dataset without tasks (for example, when only the crowdsourced labels are available), the train, val and test folders are omitted. In that case, however, many learning strategies are not available.

Download and install real datasets easily

The dataset.py file is not mandatory, but it facilitates the dataset's installation procedure. A minimal example:

import json
from pathlib import Path


class mydataset:
    def __init__(self):
        self.DIR = Path(__file__).parent.resolve()
        # download the data needed
        # ...

    def setfolders(self):
        print(f"Loading data folders at {self.DIR}")
        train_path = self.DIR / "train"
        test_path = self.DIR / "test"
        valid_path = self.DIR / "val"

        # Create train/val/test tasks with matching index
        # ...

        print("Created:")
        # `split` is used as the loop variable to avoid shadowing the built-in `set`
        for split, path in zip(
            ("train", "val", "test"), [train_path, valid_path, test_path]
        ):
            print(f"- {split}: {path}")
        self.get_crowd_labels()
        print(f"Train crowd labels are in {self.DIR / 'answers.json'}")

    def get_crowd_labels(self):
        # Build the answers.json dictionary in the format presented above
        dictionary = {}  # placeholder, to be filled as {task_index: {worker_id: label}}
        # ...
        with open(self.DIR / "answers.json", "w") as answ:
            json.dump(dictionary, answ, ensure_ascii=False, indent=3)
This installation file can also handle any access requirements or restrictions that the dataset needs.

Simulate datasets

The simulate module of peerannot creates simulated labels according to a given strategy.
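To give an idea of what simulating labels under a strategy means, here is a standalone sketch. It does not use or reproduce the simulate module: the function simulate_answers and its parameters are hypothetical, labels are assumed to lie in 0, ..., K-1, and the strategy (each worker answers correctly with a fixed probability, and otherwise picks another class uniformly at random) is chosen only for illustration. The output follows the answers.json format described earlier.

```python
import json
import random


def simulate_answers(true_labels, n_classes, accuracies, workers_per_task, seed=0):
    """Simulate crowd labels: worker j is correct with probability accuracies[j].

    Requires n_classes >= 2 so that a wrong answer can be drawn.
    """
    rng = random.Random(seed)
    answers = {}
    for task, truth in enumerate(true_labels):
        # Draw, without replacement, the workers who label this task
        workers = rng.sample(range(len(accuracies)), workers_per_task)
        votes = {}
        for worker in workers:
            if rng.random() < accuracies[worker]:
                votes[worker] = truth
            else:
                # Wrong answer: uniform over the classes other than the truth
                wrong = [k for k in range(n_classes) if k != truth]
                votes[worker] = rng.choice(wrong)
        answers[task] = votes
    return answers


# Workers 0 and 1 are always right, worker 2 is always wrong
true_labels = [0, 2, 1, 1]
answers = simulate_answers(
    true_labels, n_classes=3, accuracies=[1.0, 1.0, 0.0], workers_per_task=2
)
print(json.dumps(answers, indent=3))
```

Fixing the seed keeps the simulated votes reproducible, which is convenient when comparing aggregation strategies on the same simulated dataset.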

Examples

Examples of simulation strategies can be found in the following:
