vision_unlearning.datasets.others
=================================

.. py:module:: vision_unlearning.datasets.others

.. autoapi-nested-parse::

   These datasets do not yet follow the standard Dataset interface;
   they are closer to a set of loosely related functions.
   Even so, they may help you get started.


Functions
---------

.. autoapisummary::

   vision_unlearning.datasets.others.create_metadata_jsonl
   vision_unlearning.datasets.others.jsonl_dump
   vision_unlearning.datasets.others.jsonl_load
   vision_unlearning.datasets.others.balanced_subsample_lib
   vision_unlearning.datasets.others.count_classes_dataset_lfw
   vision_unlearning.datasets.others.download_dataset_lfw
   vision_unlearning.datasets.others.count_classes_dataset_taras_breeds
   vision_unlearning.datasets.others.split_dataset_taras_breeds
   vision_unlearning.datasets.others.download_dataset_taras_breeds
   vision_unlearning.datasets.others.split_dataset_sun
   vision_unlearning.datasets.others.download_dataset_sun
   vision_unlearning.datasets.others.normalize_string
   vision_unlearning.datasets.others.akc_find_closest_match
   vision_unlearning.datasets.others.download_dataset_akc
   vision_unlearning.datasets.others.download_dataset_pantheon
   vision_unlearning.datasets.others.pantheon_find_closest_match


Module Contents
---------------

.. py:function:: create_metadata_jsonl(folder: pathlib.Path)

   Create ``metadata.jsonl`` for all ``.jpg`` images in `folder`.
   Each line has the form ``{"file_name": "<filename>", "text": "<class_name>"}``.
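
   The file layout described above can be sketched as follows. This is a minimal
   illustration and not the library's implementation: the helper name
   ``create_metadata_jsonl_sketch`` is hypothetical, and taking the class name
   from the folder name is an assumption, since the docstring does not say where
   ``<class_name>`` comes from.

   .. code-block:: python

      import json
      import tempfile
      from pathlib import Path


      def create_metadata_jsonl_sketch(folder: Path) -> Path:
          """Write one JSON object per .jpg file into folder/metadata.jsonl."""
          # ASSUMPTION: the class name is the name of the folder holding the images.
          class_name = folder.name
          out_path = folder / "metadata.jsonl"
          with out_path.open("w", encoding="utf-8") as fh:
              for image in sorted(folder.glob("*.jpg")):
                  record = {"file_name": image.name, "text": class_name}
                  fh.write(json.dumps(record) + "\n")
          return out_path


      # Usage on a throwaway folder with two images and one unrelated file.
      with tempfile.TemporaryDirectory() as tmp:
          folder = Path(tmp) / "beagle"
          folder.mkdir()
          for name in ("a.jpg", "b.jpg", "notes.txt"):
              (folder / name).touch()
          metadata_path = create_metadata_jsonl_sketch(folder)
          lines = metadata_path.read_text(encoding="utf-8").splitlines()

      records = [json.loads(line) for line in lines]
      print(records)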


.. py:function:: jsonl_dump(data: list, path: str) -> None

.. py:function:: jsonl_load(path: str) -> list
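
   Neither ``jsonl_dump`` nor ``jsonl_load`` is documented, so the following is
   only a sketch of what a JSON Lines round trip usually looks like, assuming one
   JSON object per line. The ``*_sketch`` names are hypothetical, and the real
   functions may differ, for example in encoding or in how blank lines are handled.

   .. code-block:: python

      import json
      import os
      import tempfile


      def jsonl_dump_sketch(data: list, path: str) -> None:
          """Write each item of `data` as one JSON document per line."""
          with open(path, "w", encoding="utf-8") as fh:
              for record in data:
                  fh.write(json.dumps(record) + "\n")


      def jsonl_load_sketch(path: str) -> list:
          """Read a JSON Lines file back into a list, skipping blank lines."""
          with open(path, encoding="utf-8") as fh:
              return [json.loads(line) for line in fh if line.strip()]


      records = [
          {"file_name": "a.jpg", "text": "beagle"},
          {"file_name": "b.jpg", "text": "pug"},
      ]
      with tempfile.TemporaryDirectory() as tmp:
          path = os.path.join(tmp, "metadata.jsonl")
          jsonl_dump_sketch(records, path)
          loaded = jsonl_load_sketch(path)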

.. py:function:: balanced_subsample_lib(df: pandas.DataFrame, group_cols: List[str], priority_col: str, target: int = 100, random_state: int = 42, dropna: bool = True) -> pandas.DataFrame

   Subsample `df` without replacement to produce `target` rows (or as many as available)
   balanced as evenly as possible across the combinations of `group_cols` (i.e. strata).

   Within each stratum, the top rows are selected by highest `priority_col`.
   The final output is globally ordered by decreasing `priority_col`.

   Every stratum (combination of `group_cols`) gets as close as possible to an equal share; no group is given preference.
   Balancing is done stratum by stratum (intersectional fairness), which may cause some overall groups to be underrepresented if there isn't enough data.

   The index is dropped (TODO refactor?).
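
   One way to obtain this kind of balance is a round-robin over strata: rank the
   rows of each stratum by priority, then take rows in order of rank until
   `target` is reached. The sketch below illustrates that idea with pandas. It is
   not the library's implementation: ``balanced_subsample_sketch`` is a
   hypothetical name, it is deterministic (no `random_state`), and it assumes
   there are no missing values in the grouping columns.

   .. code-block:: python

      import pandas as pd


      def balanced_subsample_sketch(df, group_cols, priority_col, target=100):
          # Rank rows inside each stratum: 0 is the highest-priority row.
          ranked = df.sort_values(priority_col, ascending=False, kind="stable")
          ranked = ranked.assign(_rank=ranked.groupby(group_cols).cumcount())
          # Round-robin: all rank-0 rows first, then rank-1 rows, and so on.
          # Ties at the same rank are broken by higher priority.
          picked = ranked.sort_values(
              ["_rank", priority_col], ascending=[True, False]
          ).head(target)
          picked = picked.drop(columns="_rank")
          # Final output: ordered by decreasing priority, index dropped.
          return picked.sort_values(
              priority_col, ascending=False, kind="stable"
          ).reset_index(drop=True)


      df = pd.DataFrame(
          {
              "gender": ["f", "f", "f", "m", "m"],
              "continent": ["EU", "EU", "EU", "EU", "AS"],
              "score": [10, 9, 8, 7, 1],
          }
      )
      out = balanced_subsample_sketch(df, ["gender", "continent"], "score", target=4)
      print(out["score"].tolist())  # [10, 9, 7, 1]

   With three strata and ``target=4``, each stratum contributes its best row
   first, and the single remaining slot goes to the only stratum that still has
   rows left.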


.. py:function:: count_classes_dataset_lfw()

.. py:function:: download_dataset_lfw(dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: Optional[List[str]] = None) -> Dict[str, int]

   Downloads the dataset and splits it in the same call (TODO: separate that into two functions).

   :param forget_max_img: if > 0, at most this many images are saved for the forget set
   :param retain_max_img_per_class: if > 0, the retain set is stratified so that at most this many images are saved per class
   :param restrict_labels: if not ``None``, only these entities are saved
   :return: how many images of each entity were saved


.. py:function:: count_classes_dataset_taras_breeds(dataset_base_path: str) -> List[Tuple[str, int]]

.. py:function:: split_dataset_taras_breeds(downloaded_folder: str, dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: Optional[List[str]] = None) -> Dict[str, int]

   Given an already downloaded Taras Dog Breeds dataset at `downloaded_folder` (one folder per class), split images into forget and retain sets.

   :param forget_max_img: if > 0, at most this many images are saved for the forget set
   :param retain_max_img_per_class: if > 0, the retain set is stratified so that at most this many images are saved per class
   :param restrict_labels: if not ``None``, only these entities are saved
   :return: how many images of each entity were saved
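
   The effect of the two caps and of `restrict_labels` can be illustrated on
   in-memory file lists. This is a sketch under stated assumptions and not the
   library's code: ``split_forget_retain_sketch`` is a hypothetical helper, it
   assumes `target` names the class that goes into the forget set, that every
   other class goes into the retain set, and that `restrict_labels` (when given)
   must also list the target.

   .. code-block:: python

      from typing import Dict, List, Optional, Tuple


      def split_forget_retain_sketch(
          images_by_class: Dict[str, List[str]],
          target: str,
          forget_max_img: int = 0,
          retain_max_img_per_class: int = 0,
          restrict_labels: Optional[List[str]] = None,
      ) -> Tuple[List[str], Dict[str, List[str]], Dict[str, int]]:
          forget: List[str] = []
          retain: Dict[str, List[str]] = {}
          for label, files in images_by_class.items():
              # ASSUMPTION: labels outside restrict_labels are skipped entirely.
              if restrict_labels is not None and label not in restrict_labels:
                  continue
              if label == target:
                  # A cap of 0 means "no limit".
                  forget = files[:forget_max_img] if forget_max_img > 0 else list(files)
              else:
                  cap = retain_max_img_per_class
                  retain[label] = files[:cap] if cap > 0 else list(files)
          # Count of saved images per label, forget class included.
          counts = {target: len(forget)}
          counts.update({label: len(files) for label, files in retain.items()})
          return forget, retain, counts


      images_by_class = {
          "beagle": ["b1.jpg", "b2.jpg", "b3.jpg"],
          "pug": ["p1.jpg", "p2.jpg"],
          "husky": ["h1.jpg"],
      }
      forget, retain, counts = split_forget_retain_sketch(
          images_by_class,
          target="beagle",
          forget_max_img=2,
          retain_max_img_per_class=1,
          restrict_labels=["beagle", "pug"],
      )
      print(counts)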


.. py:function:: download_dataset_taras_breeds(dataset_base_path: str, cache_folder: str) -> None

   Downloads the dataset without performing any split.
   Images are converted to JPG as part of the download.
   Files that already exist are not overwritten (TODO: parametrize?).


.. py:function:: split_dataset_sun(downloaded_folder: str, dataset_forget_name: str, dataset_retain_name: str, target: str, forget_max_img: int = 0, retain_max_img_per_class: int = 0, restrict_labels: Optional[List[str]] = None) -> Dict[str, int]

   Given an already downloaded SUN dataset at `downloaded_folder` (one folder per class), split images into forget and retain sets.

   :param forget_max_img: if > 0, at most this many images are saved for the forget set
   :param retain_max_img_per_class: if > 0, the retain set is stratified so that at most this many images are saved per class
   :param restrict_labels: if not ``None``, only these entities are saved
   :return: how many images of each entity were saved


.. py:function:: download_dataset_sun()

.. py:function:: normalize_string(s: str) -> str

.. py:function:: akc_find_closest_match(df, name: str, group: str) -> Optional[str]

.. py:function:: download_dataset_akc(output_path: str) -> pandas.DataFrame

.. py:function:: download_dataset_pantheon(url: str = 'https://storage.googleapis.com/pantheon-public-data/person_2025_update.csv.bz2', bz2_path: str = 'assets/datasets/person_2025_update.csv.bz2', csv_path: str = 'assets/datasets/person_2025_update.csv') -> pandas.DataFrame

.. py:function:: pantheon_find_closest_match(df, name: str) -> Optional[str]