vision_unlearning.datasets.others
=================================

.. py:module:: vision_unlearning.datasets.others

.. autoapi-nested-parse::

   These datasets do not yet follow the standard Dataset interface.
   They are more like a set of loosely related functions.
   In any case, they may help you get started.


Functions
---------

.. autoapisummary::

   vision_unlearning.datasets.others.create_metadata_jsonl
   vision_unlearning.datasets.others.jsonl_dump
   vision_unlearning.datasets.others.jsonl_load
   vision_unlearning.datasets.others.balanced_subsample_lib
   vision_unlearning.datasets.others.count_classes_dataset_lfw
   vision_unlearning.datasets.others.download_dataset_lfw
   vision_unlearning.datasets.others.count_classes_dataset_taras_breeds
   vision_unlearning.datasets.others.split_dataset_taras_breeds
   vision_unlearning.datasets.others.download_dataset_taras_breeds
   vision_unlearning.datasets.others.split_dataset_sun
   vision_unlearning.datasets.others.download_dataset_sun
   vision_unlearning.datasets.others.normalize_string
   vision_unlearning.datasets.others.akc_find_closest_match
   vision_unlearning.datasets.others.download_dataset_akc
   vision_unlearning.datasets.others.download_dataset_pantheon
   vision_unlearning.datasets.others.pantheon_find_closest_match


Module Contents
---------------

.. py:function:: create_metadata_jsonl(folder: pathlib.Path)

   Create metadata.jsonl for all .jpg images in `folder`.
   Each line: {"file_name": "", "text": ""}


.. py:function:: jsonl_dump(data: list, path: str) -> None

.. py:function:: jsonl_load(path: str) -> list

.. py:function:: balanced_subsample_lib(df: pandas.DataFrame, group_cols: List[str], priority_col: str, target: int = 100, random_state: int = 42, dropna: bool = True) -> pandas.DataFrame

   Subsample `df` without replacement to produce `target` rows (or as many as available),
   balanced as evenly as possible across the combinations of `group_cols` (i.e. strata).
   Within each stratum, the top rows are selected by highest `priority_col`.
   The final output is globally ordered by decreasing `priority_col`.

   Every stratum (combination of `group_cols`) gets as close as possible to an equal share.
   Does not give preference to any one group.
   Balances stratum by stratum (intersectional fairness), which may cause overall groups
   to be underrepresented if there isn't enough data.

   The index is dropped (TODO: refactor?).


.. py:function:: count_classes_dataset_lfw()

.. py:function:: download_dataset_lfw(dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: Optional[List[str]] = None) -> Dict[str, int]

   Downloads the dataset and immediately splits it (TODO: separate this into two functions).

   @param forget_max_img: if >0, no more than this number of images will be saved for the forget set
   @param retain_max_img_per_class: if >0, stratifies the retain set so that no more than this number of images per class are saved
   @param restrict_labels: if not None, save only these entities
   @return how many classes of each entity were saved


.. py:function:: count_classes_dataset_taras_breeds(dataset_base_path: str) -> List[Tuple[str, int]]

.. py:function:: split_dataset_taras_breeds(downloaded_folder: str, dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: Optional[List[str]] = None) -> Dict[str, int]

   Given an already downloaded Taras Dog Breeds dataset at `downloaded_folder` (one folder per class),
   split images into forget and retain sets.

   @param forget_max_img: if >0, no more than this number of images will be saved for the forget set
   @param retain_max_img_per_class: if >0, stratifies the retain set so that no more than this number of images per class are saved
   @param restrict_labels: if not None, save only these entities
   @return how many classes of each entity were saved


.. py:function:: download_dataset_taras_breeds(dataset_base_path: str, cache_folder: str) -> None

   Does not perform splits.
   Converts images to JPG as part of the download.
   Does not overwrite files that already exist (TODO: parametrize?).


.. py:function:: split_dataset_sun(downloaded_folder: str, dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: Optional[List[str]] = None) -> Dict[str, int]

   Given an already downloaded SUN dataset at `downloaded_folder` (one folder per class),
   split images into forget and retain sets.

   @param forget_max_img: if >0, no more than this number of images will be saved for the forget set
   @param retain_max_img_per_class: if >0, stratifies the retain set so that no more than this number of images per class are saved
   @param restrict_labels: if not None, save only these entities
   @return how many classes of each entity were saved


.. py:function:: download_dataset_sun()

.. py:function:: normalize_string(s: str) -> str

.. py:function:: akc_find_closest_match(df, name: str, group: str) -> Optional[str]

.. py:function:: download_dataset_akc(output_path: str) -> pandas.DataFrame

.. py:function:: download_dataset_pantheon(url: str = 'https://storage.googleapis.com/pantheon-public-data/person_2025_update.csv.bz2', bz2_path: str = 'assets/datasets/person_2025_update.csv.bz2', csv_path: str = 'assets/datasets/person_2025_update.csv') -> pandas.DataFrame

.. py:function:: pantheon_find_closest_match(df, name: str) -> Optional[str]
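
The JSONL helpers listed above (`create_metadata_jsonl`, `jsonl_dump`, `jsonl_load`) carry little or no description, so the following is only a minimal sketch of what such helpers might look like, assuming the usual JSON Lines convention of one JSON object per line. The `*_sketch` names are hypothetical, and writing the image file name into `file_name` is an assumption; this is not the library's implementation.

```python
import json
from pathlib import Path


def jsonl_dump_sketch(data: list, path: str) -> None:
    """Write one JSON document per line (hypothetical stand-in for jsonl_dump)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def jsonl_load_sketch(path: str) -> list:
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def create_metadata_jsonl_sketch(folder: Path) -> Path:
    """Write metadata.jsonl with one record per .jpg image in `folder`.

    Assumption: `file_name` holds the image's file name and `text` starts
    out empty, to be filled in with a caption later.
    """
    records = [
        {"file_name": image.name, "text": ""}
        for image in sorted(folder.glob("*.jpg"))
    ]
    out_path = folder / "metadata.jsonl"
    jsonl_dump_sketch(records, str(out_path))
    return out_path
```

Sorting the file names keeps the output stable across runs, which makes the resulting metadata file easier to diff.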
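
The balancing behaviour described for `balanced_subsample_lib` can be illustrated without pandas. The sketch below is a hypothetical, simplified version that works on a list of dicts and assumes a round-robin allocation: each pass takes at most one row from every stratum that still has rows left, so shares differ by at most one row unless a stratum runs out of data. The function name and the round-robin strategy are assumptions, not the library's actual code, and `dropna` handling is omitted.

```python
import random
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Sequence


def balanced_subsample_sketch(
    rows: Sequence[Dict[str, Any]],
    group_cols: List[str],
    priority_col: str,
    target: int = 100,
    random_state: int = 42,
) -> List[Dict[str, Any]]:
    """Pick up to `target` rows, spread as evenly as possible across strata."""
    # Bucket rows by stratum, i.e. by the combination of values in `group_cols`.
    strata: Dict[Hashable, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        strata[tuple(row[col] for col in group_cols)].append(row)

    # Within each stratum, put the highest-priority rows first.
    for bucket in strata.values():
        bucket.sort(key=lambda r: r[priority_col], reverse=True)

    # Visit strata in a seeded random order so that no stratum is
    # systematically favoured when the last pass cannot serve everyone.
    keys = list(strata)
    random.Random(random_state).shuffle(keys)

    picked: List[Dict[str, Any]] = []
    depth = 0  # how many rows have already been taken from each live stratum
    while len(picked) < target:
        took_any = False
        for key in keys:
            bucket = strata[key]
            if depth < len(bucket) and len(picked) < target:
                picked.append(bucket[depth])
                took_any = True
        if not took_any:  # every stratum is exhausted
            break
        depth += 1

    # The final output is ordered globally by decreasing priority.
    picked.sort(key=lambda r: r[priority_col], reverse=True)
    return picked
```

With five rows in one stratum, one row in another and `target=4`, this sketch returns the single row of the small stratum plus the three highest-priority rows of the large one, which shows how a scarce stratum ends up underrepresented relative to an equal share.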
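
`normalize_string`, `akc_find_closest_match` and `pantheon_find_closest_match` are undocumented here. As a rough illustration of how name normalization and closest-match lookup could fit together, the sketch below uses only the standard library (`unicodedata`, `re`, `difflib`). It takes a plain list of candidate names instead of a DataFrame, and the function names, the normalization rules and the `cutoff` value are all assumptions rather than the library's behaviour.

```python
import difflib
import re
import unicodedata
from typing import List, Optional


def normalize_string_sketch(s: str) -> str:
    """Lowercase, strip accents and collapse punctuation/whitespace runs."""
    decomposed = unicodedata.normalize("NFKD", s)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def find_closest_match_sketch(
    candidates: List[str], name: str, cutoff: float = 0.8
) -> Optional[str]:
    """Return the candidate whose normalized form is closest to `name`.

    Returns None when no candidate reaches the similarity `cutoff`.
    """
    # Map normalized form -> original spelling so the original can be returned.
    by_normalized = {normalize_string_sketch(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize_string_sketch(name), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None
```

Returning `None` below the cutoff mirrors the `Optional[str]` return type in the signatures above and lets the caller decide how to treat unmatched names.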