
Dataset Profiling Workflow

This workflow orchestrates the profiling lifecycle of a dataset, from profile generation to metadata integration and cleanup. It produces two complementary profiles, a light one and a heavy one, and ensures that both are correctly persisted before profiling resources are released.

Input

Initially, the workflow is provided with the following information:

| Parameter | Type | Special Format | Mandatory | Description |
| --- | --- | --- | --- | --- |
| id | string | uuid | yes | A globally unique identifier of the dataset to be profiled. This identifier links profiling results to the correct dataset entity in downstream systems. |
| code | string | | no | An optional short or human-friendly code used to reference the dataset. Primarily intended for cataloging or internal identification. |
| name | string | | no | The human-readable name of the dataset. Used for identification and traceability throughout the profiling lifecycle. |
| description | string | | no | A detailed textual description of the dataset. Provides contextual information that may be associated with profiling results. |
| headline | string | | no | A concise, high-level summary of the dataset. Typically used for display purposes in user interfaces. |
| fields_of_science | array of strings | | no | A list of scientific or domain classifications associated with the dataset. Used for categorization and analytical context. |
| languages | array of strings | | no | The languages represented in the dataset content. Relevant for interpreting profiling metrics, especially for textual data. |
| keywords | string | | no | Free-form keywords associated with the dataset. Support discoverability and semantic interpretation of profiling results. |
| countries | array of strings | | no | A list of countries relevant to the dataset. May indicate geographic coverage, origin, or regulatory context. |
| publishedUrl | string | uri | no | A public or external URL where the dataset is described or published. May reference documentation, landing pages, or external repositories. |
| citeAs | string | | no | A recommended citation string for the dataset. Used to support attribution and reuse in publications. |
| conformsTo | string | | no | A reference to a standard, schema, or specification that the dataset adheres to. Provides structural or semantic context for profiling outputs. |
| license | string | | no | The license under which the dataset is distributed. Defines legal and usage constraints relevant to downstream consumers. |
| size | number | integer | no | An approximate size indicator for the dataset. May represent file size, record count, or another agreed-upon metric. |
| version | string | | no | The version identifier of the dataset. Used to distinguish profiling results across different dataset revisions. |
| mime_type | string | | no | The MIME type describing the dataset's format (e.g. text/csv, application/json). Helps the profiling system select appropriate analysis strategies. |
| date_published | string | date | no | The publication date of the dataset. Used for provenance tracking and temporal context. |
| userId | string | | no | An identifier representing the user or actor who initiated the profiling workflow. Used for attribution, auditing, and traceability. |
| data_store_kind | 0 or 1 | | yes | An indicator of how the dataset is physically stored. This value determines how the profiler accesses the dataset (e.g. file-based vs. database-backed) and influences profiling execution behavior. |
| archivedAt | string | path | yes | The path or location where the dataset is archived and accessed for profiling. This serves as the authoritative source from which profiling jobs read the dataset. |
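
For illustration only, an invocation might receive a payload like the sketch below, expressed as a Python dict. All values are hypothetical, and the 0/1 interpretation of data_store_kind follows the file-based vs. database-backed example above:

```python
# Hypothetical example of the workflow's input payload.
# Only id, data_store_kind, and archivedAt are mandatory.
profiling_input = {
    "id": "2f6b1c3e-8a4d-4e2b-9f1a-7c5d0e9b3a21",  # dataset UUID
    "name": "Air Quality Measurements 2023",
    "mime_type": "text/csv",
    "languages": ["en"],
    "userId": "user-42",
    "data_store_kind": 0,  # e.g. 0 = file-based, 1 = database-backed
    "archivedAt": "/archive/datasets/2f6b1c3e/air_quality.csv",
}
```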

Tasks

The workflow follows the diagram below, which illustrates parallel execution, synchronization points, and cleanup:

(Workflow diagram: light and heavy profiling branches executing in parallel, synchronizing once both profiles are persisted, and converging into a final cleanup step.)

Trigger Profiles

Execution is split into two parallel task queues, one for light profiling and one for heavy profiling, as shown in the diagram. The first tasks in each branch run in parallel: each obtains the corresponding authorization token and calls the Profiler service to trigger generation of its profile. This parallelization allows lightweight metadata extraction and more intensive analysis to proceed independently and efficiently.
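
The sketch below illustrates this step under assumed conventions: a REST-style Profiler exposed at a hypothetical URL, a placeholder get_token helper standing in for the platform's auth service, and an assumed job_id response field. It reuses the profiling_input example from the Input section:

```python
import concurrent.futures

import requests  # assumed HTTP client; the real service interface may differ

PROFILER_URL = "https://profiler.example.org"  # hypothetical endpoint

def get_token(branch: str) -> str:
    """Placeholder for the platform's auth step: each branch obtains
    its own authorization token before contacting the Profiler."""
    raise NotImplementedError

def trigger_profile(branch: str, dataset: dict) -> str:
    """Ask the Profiler service to start generating one profile."""
    token = get_token(branch)
    resp = requests.post(
        f"{PROFILER_URL}/profiles/{branch}",  # branch is "light" or "heavy"
        json=dataset,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job_id"]  # assumed response shape

# Both branches are triggered concurrently, mirroring the parallel queues.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    light_job, heavy_job = pool.map(
        lambda branch: trigger_profile(branch, profiling_input),
        ["light", "heavy"],
    )
```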

Check if the Profiles are ready

Once profiling jobs have been triggered, each branch enters a polling phase. In this phase the workflow periodically checks the status of the corresponding profiling job. Execution is paused between checks to avoid unnecessary load. The workflow advances only when the profile has reached a "ready" state. If a profiling job fails, or is cleaned up prematurely, the workflow terminates with a failure, ensuring that incomplete or invalid profiles are never propagated. This step guarantees that downstream tasks only operate on fully generated and valid profiling results.
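
Continuing the same sketch, a minimal polling loop might look as follows; the status endpoint, the state names (ready, failed, cleaned), and the interval are all assumptions for illustration:

```python
import time

POLL_INTERVAL_SECONDS = 30  # illustrative; the real interval is a workflow setting

def wait_until_ready(job_id: str, token: str) -> None:
    """Poll the Profiler until the job is ready; fail fast on bad states."""
    while True:
        resp = requests.get(
            f"{PROFILER_URL}/jobs/{job_id}/status",
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()
        status = resp.json()["status"]  # assumed response shape
        if status == "ready":
            return
        if status in ("failed", "cleaned"):
            # Incomplete or invalid profiles must never propagate downstream.
            raise RuntimeError(f"profiling job {job_id} ended in state {status!r}")
        time.sleep(POLL_INTERVAL_SECONDS)  # pause between checks to limit load
```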

Fetch Profiles

After a profile is reported as ready, the workflow retrieves its contents from the Profiler service. For each profile (light and heavy), the complete profiling result is fetched and preserved in its original structured form; no transformation or interpretation is applied at this stage. This separation ensures that profile generation and profile consumption remain decoupled.
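
A minimal fetch step, continuing the sketch above (the profile endpoint and response handling are assumed):

```python
def fetch_profile(job_id: str, token: str) -> dict:
    """Retrieve the finished profile exactly as the Profiler produced it."""
    resp = requests.get(
        f"{PROFILER_URL}/jobs/{job_id}/profile",
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    resp.raise_for_status()
    # The structured result is returned untouched: no transformation,
    # no interpretation -- consumption stays decoupled from generation.
    return resp.json()

# Each branch retrieves its own result once ready.
light_profile = fetch_profile(light_job, get_token("light"))
heavy_profile = fetch_profile(heavy_job, get_token("heavy"))
```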

Update Data Management

Once profiling data has been retrieved, the workflow updates the dataset representation in the Data Model Management system. The light and heavy profiles are handled as distinct but complementary updates, allowing consumers to benefit from different levels of detail. Successful completion of this step means the dataset is now profile-aware within the platform.
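
Continuing the sketch, the update step might apply each profile as its own call to a hypothetical Data Model Management endpoint; the URL, route, and PATCH semantics are assumptions:

```python
DMM_URL = "https://data-model-management.example.org"  # hypothetical endpoint

def attach_profile(dataset_id: str, kind: str, profile: dict, token: str) -> None:
    """Persist one profile on the dataset's representation in Data Model Management."""
    resp = requests.patch(
        f"{DMM_URL}/datasets/{dataset_id}/profiles/{kind}",
        json=profile,
        headers={"Authorization": f"Bearer {token}"},
        timeout=60,
    )
    resp.raise_for_status()

# Light and heavy profiles are applied as distinct, complementary updates.
attach_profile(profiling_input["id"], "light", light_profile, get_token("light"))
attach_profile(profiling_input["id"], "heavy", heavy_profile, get_token("heavy"))
```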

Profile Cleanup

After both profiling branches have completed successfully, the workflow converges into a final cleanup step that releases the profiling resources. Cleanup is intentionally executed only after all profiling data has been safely persisted, ensuring no loss of information.
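
A final cleanup sketch under the same assumptions, releasing the Profiler-side jobs only after both updates have been persisted:

```python
def cleanup_profiling(job_ids: list[str], token: str) -> None:
    """Release Profiler-side resources for the finished jobs."""
    for job_id in job_ids:
        resp = requests.delete(
            f"{PROFILER_URL}/jobs/{job_id}",
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()

# Invoked only once both branches have succeeded and both updates are stored.
cleanup_profiling([light_job, heavy_job], get_token("cleanup"))
```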