vision_unlearning.datasets.others
These datasets do not yet follow the standard Dataset interface. They are closer to a set of loosely related functions, but they may still help you get started.
Functions
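| create_metadata_jsonl | Create metadata.jsonl for all .jpg images in folder. |
| jsonl_dump | |
| jsonl_load | |
| balanced_subsample_lib | Subsample df without replacement to produce target rows (or as many as available) |
| count_classes_dataset_lfw | |
| download_dataset_lfw | Downloads and already splits (TODO: separate that into two functions) |
| count_classes_dataset_taras_breeds | |
| split_dataset_taras_breeds | Given an already downloaded Taras Dog Breeds dataset at downloaded_folder (one folder per class), split images into forget and retain sets. |
| download_dataset_taras_breeds | Does not perform splits. |
| split_dataset_sun | Given an already downloaded SUN dataset at downloaded_folder (one folder per class), split images into forget and retain sets. |
| download_dataset_sun | |
| normalize_string | |
| akc_find_closest_match | |
| download_dataset_akc | |
| download_dataset_pantheon | |
| pantheon_find_closest_match | |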
Module Contents
- vision_unlearning.datasets.others.create_metadata_jsonl(folder: pathlib.Path)
Create metadata.jsonl for all .jpg images in folder. Each line: {"file_name": "<filename>", "text": "<class_name>"}
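The line format above can be illustrated with a small standalone sketch. This is not the library's implementation: the helper name `write_metadata_jsonl` is hypothetical, and taking the class name from the folder name is an assumption made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_metadata_jsonl(folder: Path, class_name: Optional[str] = None) -> Path:
    """Write one JSON object per .jpg file in `folder` to folder/metadata.jsonl."""
    # Assumption: when no class name is given, the folder name is used as the class.
    label = class_name if class_name is not None else folder.name
    out_path = folder / "metadata.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for image in sorted(folder.glob("*.jpg")):
            record = {"file_name": image.name, "text": label}
            f.write(json.dumps(record) + "\n")
    return out_path


# Demo on a throwaway folder with two images and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp) / "beagle"
    demo_folder.mkdir()
    for name in ("a.jpg", "b.jpg", "notes.txt"):
        (demo_folder / name).touch()
    metadata_path = write_metadata_jsonl(demo_folder)
    lines = metadata_path.read_text(encoding="utf-8").splitlines()

for line in lines:
    print(line)
```

Only the two .jpg files produce a line; the text file is ignored.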
- vision_unlearning.datasets.others.jsonl_dump(data: list, path: str) -> None
- vision_unlearning.datasets.others.jsonl_load(path: str) -> list
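Neither function is documented, so the following is only a sketch of what a JSON Lines dump/load pair with these signatures could look like. The `_sketch` names are hypothetical, and the behaviour (one JSON value per line, blank lines skipped on load) is an assumption, not the library's confirmed behaviour.

```python
import json
import os
import tempfile
from typing import Any, List


def jsonl_dump_sketch(data: List[Any], path: str) -> None:
    """Write each item of `data` as one JSON value per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in data:
            f.write(json.dumps(item) + "\n")


def jsonl_load_sketch(path: str) -> List[Any]:
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip through a temporary file.
records = [
    {"file_name": "a.jpg", "text": "beagle"},
    {"file_name": "b.jpg", "text": "pug"},
]
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "metadata.jsonl")
    jsonl_dump_sketch(records, jsonl_path)
    loaded = jsonl_load_sketch(jsonl_path)
```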
- vision_unlearning.datasets.others.balanced_subsample_lib(df: pandas.DataFrame, group_cols: List[str], priority_col: str, target: int = 100, random_state: int = 42, dropna: bool = True) -> pandas.DataFrame
Subsample df without replacement to produce target rows (or as many as available), balanced as evenly as possible across the combinations of group_cols (i.e. strata).
Within each stratum, the top rows are selected by highest dataset_n_original. The final output is globally ordered by decreasing priority_col.
Every stratum (combination of group_cols) gets as close as possible to an equal share, with no preference for any one group. Balancing is done stratum by stratum (intersectional fairness), which may leave overall groups underrepresented if there isn't enough data.
The index is dropped (TODO: refactor?).
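The balancing idea can be sketched without pandas. The version below is an illustration only, not the library's code: it works on plain dicts instead of a DataFrame, ranks rows within a stratum by `priority_col` (an assumption), uses a simple round-robin to share the target across strata, and ignores `random_state`. The name `balanced_subsample_sketch` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def balanced_subsample_sketch(
    rows: List[Row], group_cols: List[str], priority_col: str, target: int = 100
) -> List[Row]:
    # Bucket rows by stratum; rows with a missing group value are dropped.
    strata: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for row in rows:
        key = tuple(row.get(col) for col in group_cols)
        if any(value is None for value in key):
            continue
        strata[key].append(row)

    # Highest-priority rows come first within each stratum.
    for bucket in strata.values():
        bucket.sort(key=lambda r: r[priority_col], reverse=True)

    # Round-robin: one row per stratum per pass, until the target is reached
    # or every stratum is exhausted. Small strata simply run out early.
    selected: List[Row] = []
    depth = 0
    while len(selected) < target:
        took_any = False
        for bucket in strata.values():
            if depth < len(bucket) and len(selected) < target:
                selected.append(bucket[depth])
                took_any = True
        if not took_any:
            break
        depth += 1

    # Final output is ordered globally by decreasing priority.
    selected.sort(key=lambda r: r[priority_col], reverse=True)
    return selected


rows = (
    [{"group": "A", "score": s} for s in (1, 2, 3, 4, 5)]
    + [{"group": "B", "score": s} for s in (10, 20)]
    + [{"group": "C", "score": 7}]
)
sample = balanced_subsample_sketch(rows, ["group"], "score", target=6)
counts = Counter(r["group"] for r in sample)
scores = [r["score"] for r in sample]
```

With a target of 6, the small strata B and C contribute everything they have and A fills the remainder, which shows how scarce strata limit how balanced the result can be.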
- vision_unlearning.datasets.others.count_classes_dataset_lfw()
- vision_unlearning.datasets.others.download_dataset_lfw(dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: List[str] | None = None) -> Dict[str, int]
Downloads the dataset and splits it in the same call (TODO: separate that into two functions).
@param forget_max_img: if > 0, no more than this number of images will be saved for the forget set.
@param retain_max_img_per_class: if > 0, stratifies the retain set so that no more than this number of images of any one class are saved.
@param restrict_labels: if not None, save only those entities.
@return how many classes of each entity were saved.
- vision_unlearning.datasets.others.count_classes_dataset_taras_breeds(dataset_base_path: str) -> List[Tuple[str, int]]
- vision_unlearning.datasets.others.split_dataset_taras_breeds(downloaded_folder: str, dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: List[str] | None = None) -> Dict[str, int]
Given an already downloaded Taras Dog Breeds dataset at downloaded_folder (one folder per class), split images into forget and retain sets.
@param forget_max_img: if > 0, no more than this number of images will be saved for the forget set.
@param retain_max_img_per_class: if > 0, stratifies the retain set so that no more than this number of images of any one class are saved.
@param restrict_labels: if not None, save only those entities.
@return how many classes of each entity were saved.
- vision_unlearning.datasets.others.download_dataset_taras_breeds(dataset_base_path: str, cache_folder: str) -> None
Does not perform splits. Converts images to JPG as part of the download. Does not overwrite files that already exist (TODO: parametrize?).
- vision_unlearning.datasets.others.split_dataset_sun(downloaded_folder: str, dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: List[str] | None = None) -> Dict[str, int]
Given an already downloaded SUN dataset at downloaded_folder (one folder per class), split images into forget and retain sets.
@param forget_max_img: if > 0, no more than this number of images will be saved for the forget set.
@param retain_max_img_per_class: if > 0, stratifies the retain set so that no more than this number of images of any one class are saved.
@param restrict_labels: if not None, save only those entities.
@return how many classes of each entity were saved.
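The forget/retain split shared by the split functions can be sketched on a folder-per-class layout. This is a standalone illustration, not the library's implementation. Several points are assumptions: the hypothetical `forget_dir` and `retain_dir` arguments stand in for the dataset names, the returned dict is read as images saved per class, images are picked in sorted order, and `restrict_labels` is applied to every class including the target.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def split_forget_retain_sketch(
    downloaded_folder: str,
    forget_dir: str,
    retain_dir: str,
    target: str,
    forget_max_img: int = 0,
    retain_max_img_per_class: int = 0,
    restrict_labels: Optional[List[str]] = None,
) -> Dict[str, int]:
    saved: Dict[str, int] = {}
    for class_dir in sorted(Path(downloaded_folder).iterdir()):
        if not class_dir.is_dir():
            continue
        label = class_dir.name
        if restrict_labels is not None and label not in restrict_labels:
            continue
        is_target = label == target
        images = sorted(class_dir.glob("*.jpg"))
        # A limit of 0 means "no cap", matching the documented defaults.
        limit = forget_max_img if is_target else retain_max_img_per_class
        if limit > 0:
            images = images[:limit]
        # The target class goes to the forget set, everything else to retain.
        out_dir = Path(forget_dir if is_target else retain_dir) / label
        out_dir.mkdir(parents=True, exist_ok=True)
        for image in images:
            shutil.copy(image, out_dir / image.name)
        saved[label] = len(images)
    return saved


# Demo with three tiny fake classes.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "downloaded"
    for label, n_images in (("beagle", 4), ("pug", 3), ("husky", 2)):
        (source / label).mkdir(parents=True)
        for i in range(n_images):
            (source / label / f"{i}.jpg").touch()
    forget_root = Path(tmp) / "forget"
    retain_root = Path(tmp) / "retain"
    split_counts = split_forget_retain_sketch(
        str(source), str(forget_root), str(retain_root),
        target="beagle", forget_max_img=2, retain_max_img_per_class=1,
    )
    forget_files = sorted(p.name for p in (forget_root / "beagle").glob("*.jpg"))
    retain_classes = sorted(p.name for p in retain_root.iterdir())
```

Here `beagle` is the forget class capped at two images, while each retain class is capped at one image.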
- vision_unlearning.datasets.others.download_dataset_sun()
- vision_unlearning.datasets.others.normalize_string(s: str) -> str
- vision_unlearning.datasets.others.akc_find_closest_match(df, name: str, group: str) -> str | None
- vision_unlearning.datasets.others.download_dataset_akc(output_path: str) -> pandas.DataFrame
- vision_unlearning.datasets.others.download_dataset_pantheon(url: str = 'https://storage.googleapis.com/pantheon-public-data/person_2025_update.csv.bz2', bz2_path: str = 'assets/datasets/person_2025_update.csv.bz2', csv_path: str = 'assets/datasets/person_2025_update.csv') -> pandas.DataFrame
- vision_unlearning.datasets.others.pantheon_find_closest_match(df, name: str) -> str | None