vision_unlearning.datasets
Submodules
Attributes
Classes
GeneratedDataset | Abstraction over generated image dataset folders. |
Functions
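get_logger | |
get_target_preprocessed | |
get_target_overwrite | Returns (preprocessed target, target_overwrite). |
get_metadata_filtered_path | |
get_metadata_filtered | |
save_metadata_filtered | |
exists_metadata_filtered | |
get_attribute_for_entity | |
get_unlearned_model_folder | |
exists_unlearned_model | |
get_generated_dataset_folder | |
get_generated_dataset_file | |
exists_unlearned_dataset | Return True if the entity dataset folder contains all expected on_* images. |
get_shared_baseline_folder | Return the task-level shared baseline folder path. |
get_off_image_path | Return the path to a baseline (lora_state='off') image for a given entity/seed/prompt. |
get_similarity_clip_path | |
get_similarity_clip_df | |
calculate_similarity_clip | |
plot_heatmap | Plot a heatmap for a square DataFrame with all labels visible. |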
Package Contents
- vision_unlearning.datasets.get_logger(name: str, level=logging.INFO) logging.Logger
- vision_unlearning.datasets.logger
- vision_unlearning.datasets._type_task
- vision_unlearning.datasets._type_method
- vision_unlearning.datasets.get_target_preprocessed(task: Literal['scenes', 'objects', 'breeds', 'people'], target: str) str
- vision_unlearning.datasets.get_target_overwrite(task: Literal['scenes', 'objects', 'breeds', 'people'], method: Literal['munba', 'uce', 'distil'], target: str) Tuple[str, str]
Returns a tuple of (preprocessed target, target_overwrite).
- vision_unlearning.datasets.get_metadata_filtered_path(task: Literal['scenes', 'objects', 'breeds', 'people'], base_folder: str = 'assets') str
- vision_unlearning.datasets.get_metadata_filtered(task: Literal['scenes', 'objects', 'breeds', 'people'], base_folder: str = 'assets') List[Dict[str, Any]]
- vision_unlearning.datasets.save_metadata_filtered(task: Literal['scenes', 'objects', 'breeds', 'people'], metadata_filtered: List[Dict[str, Any]], base_folder: str = 'assets')
- vision_unlearning.datasets.exists_metadata_filtered(task: Literal['scenes', 'objects', 'breeds', 'people'], base_folder: str = 'assets') bool
- vision_unlearning.datasets.get_attribute_for_entity(metadata_filtered: List[Dict[str, Any]], entity_name: str, attribute: str) Any
- vision_unlearning.datasets.task_to_dataset_map: Dict[Literal['scenes', 'objects', 'breeds', 'people'], str]
- vision_unlearning.datasets.get_unlearned_model_folder(task: Literal['scenes', 'objects', 'breeds', 'people'], method: Literal['munba', 'uce', 'distil'], num_train_epochs: int, target: str, base_folder: str = 'assets') str
- vision_unlearning.datasets.exists_unlearned_model(task: Literal['scenes', 'objects', 'breeds', 'people'], method: Literal['munba', 'uce', 'distil'], num_train_epochs: int, target: str, base_folder: str = 'assets') bool
- vision_unlearning.datasets.get_generated_dataset_folder(task: Literal['scenes', 'objects', 'breeds', 'people'], method: Literal['munba', 'uce', 'distil'], num_train_epochs: int, target: str, base_folder: str = 'assets') str
- vision_unlearning.datasets.get_generated_dataset_file(lora_state: Literal['on', 'off'], seed: int, prompt: str) str
- vision_unlearning.datasets.exists_unlearned_dataset(generated_dataset_output_path: str, generate_dataset_seeds: List[int], prompts: List[str]) bool
Return True if the entity dataset folder contains all expected on_* images.
Only on_* (unlearned model) images are counted. off_* files that may exist in legacy entity folders (pre-baseline-refactor data) are ignored so that old datasets remain valid without requiring a re-generation pass.
Baseline lora_state='off' images live in the shared baseline folder; see get_shared_baseline_folder() and get_off_image_path().
Expected: len(seeds) * len(prompts) on_*.png files + 1 metadata.jsonl.
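The completeness contract above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name `looks_complete`, the `on_{seed}_{prompt}.png` file naming, and the strict equality comparison are assumptions. Only the rule "count on_*.png files, ignore off_* files, require one metadata.jsonl" comes from the description.

```python
import glob
import os
import tempfile
from typing import List


def looks_complete(folder: str, seeds: List[int], prompts: List[str]) -> bool:
    """Hypothetical check: all expected on_*.png files plus metadata.jsonl exist."""
    # Only on_* images are counted; legacy off_* files never match this pattern.
    on_images = glob.glob(os.path.join(folder, "on_*.png"))
    has_metadata = os.path.isfile(os.path.join(folder, "metadata.jsonl"))
    return has_metadata and len(on_images) == len(seeds) * len(prompts)


def demo() -> bool:
    """Build a throwaway entity folder with one legacy off_* file and check it."""
    seeds = [0, 1]
    prompts = ["a photo of a dog", "a painting of a dog"]
    with tempfile.TemporaryDirectory() as folder:
        for seed in seeds:
            for prompt in prompts:
                # Assumed file naming; the real naming may differ.
                open(os.path.join(folder, f"on_{seed}_{prompt}.png"), "wb").close()
        # A legacy baseline image left over from pre-refactor data.
        open(os.path.join(folder, "off_0_a photo of a dog.png"), "wb").close()
        open(os.path.join(folder, "metadata.jsonl"), "w").close()
        return looks_complete(folder, seeds, prompts)
```

In this sketch the stray off_* file does not affect the result, which mirrors the stated goal of keeping old datasets valid without a re-generation pass.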
- vision_unlearning.datasets.get_shared_baseline_folder
Return the task-level shared baseline folder path.
A single shared folder per task holds ALL method-agnostic baseline images (generated by 0_generate_dataset_original.py, run once per task, with no LoRA). Images are independent of which entity is being forgotten, so one folder serves all entities and all methods.
Convention: assets/datasets/generated_{task}_baseline/
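The path convention can be written as a one-line helper. This is a sketch under assumptions: the name `shared_baseline_folder` is hypothetical, and it assumes the leading `assets` component of the convention is the `base_folder` argument used elsewhere in this module.

```python
import os


def shared_baseline_folder(task: str, base_folder: str = "assets") -> str:
    """Hypothetical helper following the documented folder convention."""
    # One folder per task, shared by all entities and all methods.
    return os.path.join(base_folder, "datasets", f"generated_{task}_baseline")
```

For example, `shared_baseline_folder("breeds")` yields `assets/datasets/generated_breeds_baseline` on POSIX systems.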
- vision_unlearning.datasets.get_off_image_path(task: Literal['scenes', 'objects', 'breeds', 'people'], target: str, method: Literal['munba', 'uce', 'distil'], num_train_epochs: int, seed: int, prompt: str, base_folder: str = 'assets', seeds: List[int] | None = None, prompts: List[str] | None = None) str
Return the path to a baseline (lora_state='off') image for a given entity/seed/prompt.
Note
This module-level function and GeneratedDataset.get_off_image_path() (classmethod) provide identical functionality. Both exist because the module-level version predates the GeneratedDataset class; the classmethod delegates to this function. New code should prefer the classmethod for consistency with the OO abstraction, but the module-level function is NOT vestigial: it is used by legacy callers and remains the implementation backing both entry points.
Fallback / download cascade:
1. If the shared task-level baseline folder exists locally, use it (preferred).
2. If seeds and prompts (the full task-level lists) are provided and the baseline folder is absent locally, attempt to download it from HuggingFace via GeneratedDataset(task, method=None).compute(seeds, prompts). This mirrors the OO cascade: local → HF → scratch. If HF has the data it is downloaded; if not, _compute_from_scratch is called (which requires the base SD pipeline).
3. Otherwise fall back to the legacy entity folder (get_generated_dataset_folder), which was the pre-refactor location for both on_* and off_* images.
- Parameters:
task – Used for the legacy entity-folder fallback (step 3) and to identify the baseline.
target – Used for the legacy entity-folder fallback (step 3) and to identify the baseline.
method – Used for the legacy entity-folder fallback (step 3) and to identify the baseline.
num_train_epochs – Used for the legacy entity-folder fallback (step 3) and to identify the baseline.
seed – Identify the specific image file to return.
prompt – Identify the specific image file to return.
base_folder – Root assets directory.
seeds – Full task-level seed list. Together with prompts, it is required for exists() and compute() on the shared baseline. When both are provided, the function attempts an HF download if the baseline folder is missing locally (step 2). If omitted, the function skips the download attempt and falls back directly to the entity folder (backward-compatible).
prompts – Full task-level prompt list. Same role and behavior as seeds.
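The three-step cascade can be illustrated with a small stand-alone function. This is a sketch, not the library code: `resolve_off_image_path`, `baseline_folder`, `legacy_entity_folder`, and the injected `fetch_baseline` callback (standing in for the HuggingFace download or from-scratch generation) are hypothetical names. The folder and `off_{seed}_{prompt}.png` naming follow the conventions documented for GeneratedDataset.

```python
import os
from typing import Callable, List, Optional


def baseline_folder(task: str, base_folder: str = "assets") -> str:
    return os.path.join(base_folder, "datasets", f"generated_{task}_baseline")


def legacy_entity_folder(task: str, target: str, method: str,
                         num_train_epochs: int, base_folder: str = "assets") -> str:
    name = f"generated_{task}_{target}_{method}_{num_train_epochs:03d}"
    return os.path.join(base_folder, "datasets", name)


def resolve_off_image_path(
    task: str, target: str, method: str, num_train_epochs: int,
    seed: int, prompt: str, base_folder: str = "assets",
    seeds: Optional[List[int]] = None, prompts: Optional[List[str]] = None,
    fetch_baseline: Optional[Callable[[str, List[int], List[str]], None]] = None,
) -> str:
    """Hypothetical re-statement of the documented fallback cascade."""
    file_name = f"off_{seed}_{prompt}.png"
    shared = baseline_folder(task, base_folder)
    # Step 1: the shared baseline folder exists locally.
    if os.path.isdir(shared):
        return os.path.join(shared, file_name)
    # Step 2: full task-level lists were given, so try to obtain the baseline.
    if seeds is not None and prompts is not None and fetch_baseline is not None:
        fetch_baseline(shared, seeds, prompts)
        return os.path.join(shared, file_name)
    # Step 3: fall back to the legacy entity folder.
    entity = legacy_entity_folder(task, target, method, num_train_epochs, base_folder)
    return os.path.join(entity, file_name)
```

Omitting `seeds` and `prompts` keeps the pre-refactor behavior: with no local baseline folder, the path points into the entity folder.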
- class vision_unlearning.datasets.GeneratedDataset(/, **data: Any)
Bases: pydantic.BaseModel
Abstraction over generated image dataset folders.
Represents exactly one dataset folder — either the shared task-level baseline or a method-specific entity dataset.
Folder conventions
- Shared baseline (method=None): assets/datasets/generated_{task}_baseline/
All method-agnostic off-images for the whole task live here. Generated once per task by 0_generate_dataset_original.py.
- Entity dataset (method=<str>, target=<str>): assets/datasets/generated_{task}_{target}_{method}_{epochs:03d}/
Contains on_* unlearned images (and possibly legacy off_*).
compute() resolves data in priority order:
1. Already complete locally → return immediately.
2. Present in HuggingFace → download, then return.
3. Neither → call _compute_from_scratch(), which generates images from scratch using the Stable Diffusion pipeline.
After _compute_from_scratch() completes, if upload_if_recomputed is True the dataset folder is uploaded to HuggingFace.
The get_off_image_path class method encapsulates the full fallback chain for a baseline image: shared baseline → entity folder.
- task: _type_task
- target: str | None = None
- method: _type_method | None = None
- num_train_epochs: int | None = None
- base_folder: str = 'assets'
- remote_repository_name: str = 'LeonardoBenitez/VisionUnlearningEvaluationTestbeds'
- recompute_if_exists: bool = False
- upload_if_recomputed: bool = False
- _validate_consistency() GeneratedDataset
- property is_baseline: bool
True when this dataset holds baseline (lora-off) images.
- property folder_path: str
Local path to the dataset folder.
- Replaces:
get_shared_baseline_folder()
get_generated_dataset_folder()
- property hf_config_name: str
HuggingFace config / folder name (basename of folder_path).
This is the bare folder name used for local path computation. Use hf_path_in_repo when you need the full HF-side path.
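The relationship between is_baseline, folder_path, and hf_config_name can be sketched with a plain dataclass. This is an illustration under assumptions: `DatasetFolderSketch` is a hypothetical stand-in that omits pydantic and the consistency validation of the real class, and it assumes `assets` in the documented conventions corresponds to `base_folder`.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetFolderSketch:
    """Hypothetical stand-in for the path-related parts of GeneratedDataset."""
    task: str
    target: Optional[str] = None
    method: Optional[str] = None
    num_train_epochs: Optional[int] = None
    base_folder: str = "assets"

    @property
    def is_baseline(self) -> bool:
        # The shared baseline is the dataset with no unlearning method attached.
        return self.method is None

    @property
    def folder_path(self) -> str:
        if self.is_baseline:
            name = f"generated_{self.task}_baseline"
        else:
            # Epochs are zero-padded to three digits, per the documented convention.
            name = (f"generated_{self.task}_{self.target}_"
                    f"{self.method}_{self.num_train_epochs:03d}")
        return os.path.join(self.base_folder, "datasets", name)

    @property
    def hf_config_name(self) -> str:
        # Bare folder name, i.e. the basename of folder_path.
        return os.path.basename(self.folder_path)
```

In this sketch an entity dataset without `num_train_epochs` would fail at formatting time; the real class presumably rejects such combinations earlier through _validate_consistency().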
- property hf_path_in_repo: str
Full path inside the HuggingFace repository where this dataset lives.
All generated datasets (baseline and entity) live under the datasets/ prefix in the HF repo, matching the convention used by the legacy synchronisation notebook (0b. Synchronize.ipynb).
Example: "datasets/generated_breeds_baseline"
- file_path(lora_state: Literal['on', 'off'], seed: int, prompt: str) str
Full path to one image file inside this dataset folder.
Replaces get_generated_dataset_file() when used together with a GeneratedDataset instance.
- Note: lora_state='on' is only valid for entity datasets (method set).
lora_state='off' is valid for all dataset types.
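A minimal sketch of the file-path rule follows. The function name `image_file_path`, the shared `{lora_state}_{seed}_{prompt}.png` pattern for on-images, and the choice to raise ValueError for an invalid combination are all assumptions. The documentation only states the `off_{seed}_{prompt}.png` convention and that 'on' is valid for entity datasets only.

```python
import os
from typing import Optional


def image_file_path(folder_path: str, method: Optional[str],
                    lora_state: str, seed: int, prompt: str) -> str:
    """Hypothetical path builder for one image inside a dataset folder."""
    if lora_state not in ("on", "off"):
        raise ValueError(f"unknown lora_state: {lora_state!r}")
    # A baseline dataset (method=None) holds only lora-off images.
    if lora_state == "on" and method is None:
        raise ValueError("lora_state='on' requires an entity dataset (method set)")
    return os.path.join(folder_path, f"{lora_state}_{seed}_{prompt}.png")
```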
- exists(seeds: List[int], prompts: List[str]) bool
Return True if all expected images and metadata are present locally.
For entity datasets, only on_* images are counted (off_* legacy files are ignored — same contract as exists_unlearned_dataset()). For baseline datasets, only off_* images are counted.
Replaces exists_unlearned_dataset() for entity datasets and provides the equivalent for baseline folders.
WARNING (shared baseline): The shared baseline folder contains images for ALL entities in the task (N_entities * len(seeds) images total), not just the entities in the prompts argument. This method counts existing off_* files and compares against len(seeds) * len(prompts).
If prompts is a partial (subset) list of the full task prompts, exists() will count more images than expected and incorrectly return False, triggering a full re-generation. Always pass the COMPLETE prompt list for the task when calling exists() on a shared baseline dataset.
For entity datasets this restriction does not apply because the entity folder contains only the images for that specific entity.
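The partial-prompt pitfall can be reproduced with a small counting check. This is a sketch, not the actual method: `baseline_exists_sketch` is a hypothetical name, and the strict equality comparison is an assumption consistent with the behavior described in the warning.

```python
import glob
import os
import tempfile
from typing import List, Tuple


def baseline_exists_sketch(folder: str, seeds: List[int], prompts: List[str]) -> bool:
    """Hypothetical count-based check for a shared baseline folder."""
    off_images = glob.glob(os.path.join(folder, "off_*.png"))
    has_metadata = os.path.isfile(os.path.join(folder, "metadata.jsonl"))
    return has_metadata and len(off_images) == len(seeds) * len(prompts)


def demo_partial_prompts() -> Tuple[bool, bool]:
    """Return (result with the full prompt list, result with a subset)."""
    seeds = [0, 1]
    all_prompts = ["a photo of a pug", "a photo of a beagle", "a photo of a poodle"]
    with tempfile.TemporaryDirectory() as folder:
        for seed in seeds:
            for prompt in all_prompts:
                open(os.path.join(folder, f"off_{seed}_{prompt}.png"), "wb").close()
        open(os.path.join(folder, "metadata.jsonl"), "w").close()
        full = baseline_exists_sketch(folder, seeds, all_prompts)
        # Six files are on disk, but only 2 * 1 = 2 are expected for the subset.
        partial = baseline_exists_sketch(folder, seeds, all_prompts[:1])
    return full, partial
```

With the full list the folder is reported as complete. With the one-prompt subset the same folder is reported as incomplete, which is the false negative the warning describes.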
- _compute_from_scratch(seeds: List[int], prompts: List[str], batch_size: int = 16) str
Generate images from scratch and return the folder path.
For the shared baseline (method=None): loads the base SD pipeline once and generates all off-images for all (seed, prompt) pairs, storing them in folder_path with the off_{seed}_{prompt}.png filename convention.
For entity datasets (method set): loads the already-trained unlearned model identified by (task, target, method, num_train_epochs) and generates on-images. Raises FileNotFoundError if the trained model does not exist on disk; the caller must run 1_unlearn_from_metadata.py first to produce the model weights before calling compute().
In both cases the method returns self.folder_path after generation.
Note on metadata.jsonl (entity datasets): generate_dataset() writes metadata.jsonl to self.folder_path as its last step. This is verified end-to-end for the shared baseline path. For entity datasets, generate_dataset() itself writes the file in both the LoRA and UCE paths (see vision_unlearning/utils/data_generation.py line 165), but the unit tests for this method mock generate_dataset and therefore do not exercise the actual file write. If the generate_dataset implementation changes and stops writing metadata.jsonl, the entity path here would silently produce an incomplete dataset.
- Parameters:
seeds (list of int) – Generation seeds.
prompts (list of str) – Text prompts — one per image template, excluding seed variation.
batch_size (int) – Number of prompts per pipeline call. Default 16 (optimal for 8–12 GB VRAM on this hardware; see perf test in PLAN-TASK-2026-05-19-Baseline.md).
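How batch_size could partition the work is sketched below. The helper `iter_batches` is hypothetical, and grouping batches per seed is an assumption. The documentation only says that batch_size is the number of prompts per pipeline call.

```python
from typing import Iterator, List, Tuple


def iter_batches(seeds: List[int], prompts: List[str],
                 batch_size: int = 16) -> Iterator[Tuple[int, List[str]]]:
    """Yield (seed, prompt_batch) pairs with at most batch_size prompts each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for seed in seeds:
        # Slice the prompt list into consecutive chunks for this seed.
        for start in range(0, len(prompts), batch_size):
            yield seed, prompts[start:start + batch_size]
```

Each yielded pair would correspond to one pipeline call in this sketch, so smaller batch sizes trade throughput for lower peak memory.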
- compute(seeds: List[int], prompts: List[str], batch_size: int = 16) str
Ensure the dataset is available locally and return its folder path.
Resolution order:
1. Already complete locally → return immediately.
2. Present in HuggingFace → download, return.
3. Neither → call _compute_from_scratch().
After generation completes, if upload_if_recomputed=True, upload the folder to HuggingFace.
- Parameters:
seeds (list of int) – Generation seeds.
prompts (list of str) – Text prompts. For shared baseline datasets, this MUST be the complete prompt list for the task (all entities). Passing a partial list will cause exists() to return False and trigger unnecessary re-generation. See exists() docstring for details.
batch_size (int) – Prompts per pipeline call, forwarded to _compute_from_scratch(). Ignored if the data is already available locally or on HuggingFace. Default 16 (optimal for 8–12 GB VRAM; see perf results in PLAN-TASK-2026-05-19-Baseline.md).
- Returns:
The local folder path to the (now complete) dataset.
- Return type:
str
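The resolution order of compute() can be sketched with injected callables in place of the real local check, HuggingFace download, generation, and upload steps. All names here (`resolve_dataset` and its parameters) are hypothetical. The sketch also leaves out recompute_if_exists, whose exact semantics are not described here.

```python
from typing import Callable


def resolve_dataset(
    folder_path: str,
    exists_locally: Callable[[], bool],
    download_from_hub: Callable[[], bool],
    compute_from_scratch: Callable[[], None],
    upload_to_hub: Callable[[], None],
    upload_if_recomputed: bool = False,
) -> str:
    """Hypothetical local → HF → scratch cascade returning the folder path."""
    # 1. Already complete locally: nothing to do.
    if exists_locally():
        return folder_path
    # 2. Present remotely: the callable returns True when the download succeeded.
    if download_from_hub():
        return folder_path
    # 3. Neither: generate, then optionally publish the fresh result.
    compute_from_scratch()
    if upload_if_recomputed:
        upload_to_hub()
    return folder_path
```

The upload happens only on the from-scratch branch, matching the statement that it follows generation.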
- classmethod get_off_image_path(task: _type_task, target: str, method: _type_method, num_train_epochs: int, seed: int, prompt: str, base_folder: str = 'assets', seeds: List[int] | None = None, prompts: List[str] | None = None) str
Return the path to a baseline (lora_state='off') image.
Fallback / download cascade:
1. Shared task-level baseline folder present locally (preferred).
2. If seeds and prompts (full task-level lists) are provided and the baseline folder is absent, download it from HuggingFace via GeneratedDataset(task, method=None).compute(seeds, prompts).
3. Legacy entity folder (pre-refactor mixed on_* + off_* format).
This class method delegates to the module-level get_off_image_path(), which implements the same cascade. Both exist; prefer this classmethod for new code using GeneratedDataset.
- Parameters:
task – Used for the legacy entity-folder fallback (step 3).
target – Used for the legacy entity-folder fallback (step 3).
method – Used for the legacy entity-folder fallback (step 3).
num_train_epochs – Used for the legacy entity-folder fallback (step 3).
seed – Identify the specific image file.
prompt – Identify the specific image file.
base_folder – Root assets directory.
seeds – Full task-level seed and prompt lists. Required to enable the HF download cascade (step 2). When omitted the function falls back directly to the entity folder (backward-compatible).
prompts – Full task-level seed and prompt lists. Required to enable the HF download cascade (step 2). When omitted the function falls back directly to the entity folder (backward-compatible).
- vision_unlearning.datasets.get_similarity_clip_path(task: Literal['scenes', 'objects', 'breeds', 'people'], base_folder: str = 'assets') str
- vision_unlearning.datasets.get_similarity_clip_df(task: Literal['scenes', 'objects', 'breeds', 'people'], base_folder: str = 'assets') pandas.DataFrame
- vision_unlearning.datasets.calculate_similarity_clip(task: Literal['scenes', 'objects', 'breeds', 'people'], labels: List[str], base_folder: str = 'assets') pandas.DataFrame
- vision_unlearning.datasets.plot_heatmap(df, figsize=None, cmap='viridis', title='Heatmap')
Plot a heatmap for a square DataFrame with all labels visible.
- Parameters:
df (pd.DataFrame) – A square DataFrame with same string labels for index and columns.
figsize (tuple) – Figure size (width, height). Increase if labels overlap.
cmap (str) – Colormap name for matplotlib.
title (str) – Title shown above the heatmap.