MetadataCombiner

pydantic model sdgx.data_models.combiner.MetadataCombiner

Combines the metadata of multiple tables and describes the relationships between them.

Parameters:
  • version (str) – version of the combiner format; defaults to "1.0"

  • named_metadata (Dict[str, Any]) – mapping from table name to that table's metadata

  • relationships (List[Any]) – list of relationships between the tables

JSON schema:
{
   "title": "MetadataCombiner",
   "description": "Combine different tables with relationship, used for describing the relationship between tables.\n\nArgs:\n    version (str): version\n    named_metadata (Dict[str, Any]): pairs of table name and metadata\n    relationships (List[Any]): list of relationships",
   "type": "object",
   "properties": {
      "version": {
         "default": "1.0",
         "title": "Version",
         "type": "string"
      },
      "named_metadata": {
         "additionalProperties": {
            "$ref": "#/$defs/Metadata"
         },
         "default": {},
         "title": "Named Metadata",
         "type": "object"
      },
      "relationships": {
         "default": [],
         "items": {
            "$ref": "#/$defs/Relationship"
         },
         "title": "Relationships",
         "type": "array"
      }
   },
   "$defs": {
      "CategoricalEncoderType": {
         "enum": [
            "onehot",
            "label",
            "frequency"
         ],
         "title": "CategoricalEncoderType",
         "type": "string"
      },
      "KeyTuple": {
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "title": "Parent"
            },
            {
               "title": "Child"
            }
         ],
         "type": "array"
      },
      "Metadata": {
         "description": "Metadata is mainly used to describe the data types of all columns in a single data table.\n\nFor each column, there should be an instance of the Data Type object.\n\n.. Note::\n\n    Use ``get``, ``set``, ``add``, ``delete`` to update tags in the metadata. And use `query` for querying a column for its tags.\n\nArgs:\n    primary_keys(List[str]): The primary key, a field used to uniquely identify each row in the table.\n    The primary key of each row must be unique and not empty.\n\n    column_list(list[str]): list of the comlumn name in the table, other columns lists are used to store column information.",
         "properties": {
            "primary_keys": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Primary Keys",
               "type": "array",
               "uniqueItems": true
            },
            "column_list": {
               "items": {
                  "type": "string"
               },
               "title": "The List of Column Names",
               "type": "array"
            },
            "column_inspect_level": {
               "additionalProperties": {
                  "type": "integer"
               },
               "title": "Column Inspect Level",
               "type": "object"
            },
            "pii_columns": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Pii Columns",
               "type": "array",
               "uniqueItems": true
            },
            "id_columns": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Id Columns",
               "type": "array",
               "uniqueItems": true
            },
            "int_columns": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Int Columns",
               "type": "array",
               "uniqueItems": true
            },
            "float_columns": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Float Columns",
               "type": "array",
               "uniqueItems": true
            },
            "bool_columns": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Bool Columns",
               "type": "array",
               "uniqueItems": true
            },
            "discrete_columns": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Discrete Columns",
               "type": "array",
               "uniqueItems": true
            },
            "datetime_columns": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Datetime Columns",
               "type": "array",
               "uniqueItems": true
            },
            "const_columns": {
               "default": [],
               "items": {
                  "type": "string"
               },
               "title": "Const Columns",
               "type": "array",
               "uniqueItems": true
            },
            "datetime_format": {
               "title": "Datetime Format",
               "type": "object"
            },
            "numeric_format": {
               "title": "Numeric Format",
               "type": "object"
            },
            "categorical_encoder": {
               "anyOf": [
                  {
                     "additionalProperties": {
                        "$ref": "#/$defs/CategoricalEncoderType"
                     },
                     "type": "object"
                  },
                  {
                     "type": "null"
                  }
               ],
               "title": "Categorical Encoder"
            },
            "categorical_threshold": {
               "anyOf": [
                  {
                     "additionalProperties": {
                        "$ref": "#/$defs/CategoricalEncoderType"
                     },
                     "type": "object"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Categorical Threshold"
            },
            "version": {
               "default": "1.0",
               "title": "Version",
               "type": "string"
            }
         },
         "title": "Metadata",
         "type": "object"
      },
      "Relationship": {
         "description": "Relationship between tables\n\nFor parent table, we don't need define primary key here.\nThe primary key is pre-defined in parent table's metadata.\n\nChild table's foreign key should be defined here.",
         "properties": {
            "version": {
               "default": "1.0",
               "title": "Version",
               "type": "string"
            },
            "parent_table": {
               "title": "Parent Table",
               "type": "string"
            },
            "child_table": {
               "title": "Child Table",
               "type": "string"
            },
            "foreign_keys": {
               "items": {
                  "$ref": "#/$defs/KeyTuple"
               },
               "title": "Foreign Keys",
               "type": "array"
            }
         },
         "required": [
            "parent_table",
            "child_table",
            "foreign_keys"
         ],
         "title": "Relationship",
         "type": "object"
      }
   }
}
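
To make the schema above more concrete, here is a minimal sketch of a document that follows its shape, together with a small standard-library check of the `Relationship` and `KeyTuple` constraints (three required keys; each foreign key is an array of exactly two items, parent then child). The table and column names and the helper `relationship_shape_errors` are hypothetical and are not part of sdgx; in practice the pydantic model itself performs validation.

```python
import json

# Hypothetical payload; table and column names are made up for illustration.
payload = json.loads("""
{
  "version": "1.0",
  "named_metadata": {
    "users": {"primary_keys": ["user_id"], "column_list": ["user_id", "name"]},
    "orders": {"primary_keys": ["order_id"], "column_list": ["order_id", "user_id"]}
  },
  "relationships": [
    {
      "parent_table": "users",
      "child_table": "orders",
      "foreign_keys": [["user_id", "user_id"]]
    }
  ]
}
""")

# Keys listed under "required" in the Relationship definition of the schema.
REQUIRED_RELATIONSHIP_KEYS = ("parent_table", "child_table", "foreign_keys")


def relationship_shape_errors(rel):
    """Return a list of problems found in one relationship dict (empty if none)."""
    errors = [
        f"missing required key: {key}"
        for key in REQUIRED_RELATIONSHIP_KEYS
        if key not in rel
    ]
    for i, pair in enumerate(rel.get("foreign_keys", [])):
        # KeyTuple: an array with minItems == maxItems == 2, (parent, child).
        if not isinstance(pair, (list, tuple)) or len(pair) != 2:
            errors.append(f"foreign_keys[{i}] must have exactly 2 items")
    return errors


all_errors = [
    error
    for rel in payload["relationships"]
    for error in relationship_shape_errors(rel)
]
print(all_errors)
```

This only checks the outer shape; it does not validate the per-table `Metadata` objects.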

Fields:
field named_metadata: Dict[str, Metadata] = {}
field relationships: List[Relationship] = []
field version: str = '1.0'
check()

Perform the necessary checks:

  • Whether the number of tables matches the relationships.

  • Whether the table names match those used in the relationships between tables.
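
The two checks listed above can be read as set comparisons between the table names that have metadata and the table names mentioned in relationships. The following is a sketch of one plausible reading, not the library's implementation: the function `check_tables`, the use of `ValueError`, and the rule that every table must appear in at least one relationship are all assumptions.

```python
def check_tables(table_names, relationships):
    """Validate that tables and relationships agree (illustrative only).

    table_names: iterable of table names that have metadata.
    relationships: iterable of (parent_table, child_table) pairs.
    Raises ValueError on the first inconsistency found.
    """
    tables = set(table_names)

    # Collect every table name mentioned by any relationship.
    referenced = set()
    for parent, child in relationships:
        referenced.update((parent, child))

    # Every table named in a relationship needs metadata.
    unknown = referenced - tables
    if unknown:
        raise ValueError(
            f"relationships mention tables without metadata: {sorted(unknown)}"
        )

    # Assumed rule: every table with metadata takes part in a relationship.
    unlinked = tables - referenced
    if unlinked:
        raise ValueError(
            f"tables not covered by any relationship: {sorted(unlinked)}"
        )


check_tables(["users", "orders"], [("users", "orders")])  # passes silently
```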

classmethod from_dataframe(dataframes: list[DataFrame], names: list[str], metadata_from_dataloader_kwargs: None | dict = None, relationshipe_inspector: None | str | type[Inspector] = 'SubsetRelationshipInspector', relationships_inspector_kwargs: None | dict = None, relationships: None | list[Relationship] = None) -> MetadataCombiner

Combine multiple dataframes and their relationships into a MetadataCombiner.

Parameters:
  • dataframes (list[pd.DataFrame]) – list of dataframes

  • names (list[str]) – list of table names, one per dataframe

  • metadata_from_dataloader_kwargs (dict) – kwargs for Metadata.from_dataloader()

  • relationshipe_inspector (str | type[Inspector]) – relationship inspector

  • relationships_inspector_kwargs (dict) – kwargs for InspectorManager.init()

  • relationships (list[Relationship]) – list of relationships
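
The default inspector name, 'SubsetRelationshipInspector', suggests that relationships may be inferred when the values of a column in one table form a subset of the values of a column in another. The source does not describe the inspector's actual rules, so the sketch below only illustrates that general idea using plain dictionaries of column values. The function `subset_key_candidates`, the restriction to same-named columns, and the sample data are all assumptions.

```python
def subset_key_candidates(parent, child):
    """Return (parent_column, child_column) pairs that look like foreign keys.

    parent, child: dicts mapping column name -> list of values.
    A same-named column is a candidate when every child value also
    occurs in the parent column (child values are a subset of parent values).
    """
    candidates = []
    for column in parent.keys() & child.keys():  # columns present in both tables
        if set(child[column]) <= set(parent[column]):
            candidates.append((column, column))
    return sorted(candidates)


# Hypothetical tables: every orders.user_id value exists in users.user_id.
users = {"user_id": [1, 2, 3], "name": ["a", "b", "c"]}
orders = {"order_id": [10, 11], "user_id": [1, 3]}

print(subset_key_candidates(users, orders))
```

When relationships are already known, passing them explicitly through the `relationships` parameter avoids relying on inference.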

classmethod from_dataloader(dataloaders: list[DataLoader], metadata_from_dataloader_kwargs: None | dict = None, relationshipe_inspector: None | str | type[Inspector] = 'SubsetRelationshipInspector', relationships_inspector_kwargs: None | dict = None, relationships: None | list[Relationship] = None)

Combine multiple dataloaders and their relationships into a MetadataCombiner.

Parameters:
  • dataloaders (list[DataLoader]) – list of dataloaders

  • max_chunk (int) – maximum chunk count for the relationship inspector (documented here, but not present in the signature above)

  • metadata_from_dataloader_kwargs (dict) – kwargs for Metadata.from_dataloader()

  • relationshipe_inspector (str | type[Inspector]) – relationship inspector

  • relationships_inspector_kwargs (dict) – kwargs for InspectorManager.init()

  • relationships (list[Relationship]) – list of relationships

classmethod load(save_dir: str | Path, metadata_subdir: str = 'metadata', relationship_subdir: str = 'relationship', version: str | None = None) -> MetadataCombiner

Load metadata and relationships from JSON files.

Parameters:
  • save_dir (str | Path) – directory to load from

  • metadata_subdir (str) – subdirectory for metadata, default is “metadata”

  • relationship_subdir (str) – subdirectory for relationship, default is “relationship”

  • version (str) – version to use; if not specified, the version is read from the version file

save(save_dir: str | Path, metadata_subdir: str = 'metadata', relationship_subdir: str = 'relationship')

Save metadata and relationships to JSON files.

This creates subdirectories for the metadata and for the relationships.

Parameters:
  • save_dir (str | Path) – directory to save to

  • metadata_subdir (str) – subdirectory for metadata, default is “metadata”

  • relationship_subdir (str) – subdirectory for relationship, default is “relationship”
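
The directory layout implied by `metadata_subdir` and `relationship_subdir` can be illustrated with a standard-library round trip. This is a sketch under stated assumptions, not the library's code: the helpers `save_tables` and `load_tables`, the one-file-per-table naming, and the numbered relationship files are hypothetical, and the version file mentioned for `load` is omitted.

```python
import json
import tempfile
from pathlib import Path


def save_tables(save_dir, named_metadata, relationships,
                metadata_subdir="metadata", relationship_subdir="relationship"):
    """Write one JSON file per table and per relationship (assumed layout)."""
    meta_dir = Path(save_dir) / metadata_subdir
    rel_dir = Path(save_dir) / relationship_subdir
    meta_dir.mkdir(parents=True, exist_ok=True)
    rel_dir.mkdir(parents=True, exist_ok=True)
    for name, meta in named_metadata.items():
        (meta_dir / f"{name}.json").write_text(json.dumps(meta))
    for i, rel in enumerate(relationships):
        (rel_dir / f"{i}.json").write_text(json.dumps(rel))


def load_tables(save_dir, metadata_subdir="metadata",
                relationship_subdir="relationship"):
    """Read back what save_tables wrote, keyed by file stem."""
    meta_dir = Path(save_dir) / metadata_subdir
    rel_dir = Path(save_dir) / relationship_subdir
    named = {p.stem: json.loads(p.read_text()) for p in meta_dir.glob("*.json")}
    # Sort numerically so relationships come back in their original order.
    rel_files = sorted(rel_dir.glob("*.json"), key=lambda p: int(p.stem))
    return named, [json.loads(p.read_text()) for p in rel_files]


# Hypothetical data for the round trip.
named = {"users": {"primary_keys": ["user_id"]}, "orders": {"primary_keys": ["order_id"]}}
rels = [{"parent_table": "users", "child_table": "orders",
         "foreign_keys": [["user_id", "user_id"]]}]

with tempfile.TemporaryDirectory() as tmp:
    save_tables(tmp, named, rels)
    loaded_named, loaded_rels = load_tables(tmp)
    round_trip_ok = (loaded_named, loaded_rels) == (named, rels)

print(round_trip_ok)
```

Keeping metadata and relationships in separate subdirectories lets each be inspected or replaced independently, which is consistent with both methods accepting the two subdirectory names as parameters.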

classmethod upgrade(old_version: str, named_metadata: dict[str, Metadata], relationships: list[Relationship]) -> None

Upgrade metadata from an old version to the new version.

Metadata.upgrade and Relationship.upgrade are attempted automatically when loading, so this method only performs the combiner's own upgrade.

property fields: Iterable[str]

Return all fields in MetadataCombiner.