MetadataCombiner
- pydantic model sdgx.data_models.combiner.MetadataCombiner
Combines the metadata of several tables and describes the relationships between those tables.
- Parameters:
version (str) – version of the combined metadata, default is “1.0”
named_metadata (Dict[str, Any]) – mapping from table name to that table's metadata
relationships (List[Any]) – list of relationships between the tables
Show JSON schema
{ "title": "MetadataCombiner", "description": "Combine different tables with relationship, used for describing the relationship between tables.\n\nArgs:\n version (str): version\n named_metadata (Dict[str, Any]): pairs of table name and metadata\n relationships (List[Any]): list of relationships", "type": "object", "properties": { "version": { "default": "1.0", "title": "Version", "type": "string" }, "named_metadata": { "additionalProperties": { "$ref": "#/$defs/Metadata" }, "default": {}, "title": "Named Metadata", "type": "object" }, "relationships": { "default": [], "items": { "$ref": "#/$defs/Relationship" }, "title": "Relationships", "type": "array" } }, "$defs": { "CategoricalEncoderType": { "enum": [ "onehot", "label", "frequency" ], "title": "CategoricalEncoderType", "type": "string" }, "KeyTuple": { "maxItems": 2, "minItems": 2, "prefixItems": [ { "title": "Parent" }, { "title": "Child" } ], "type": "array" }, "Metadata": { "description": "Metadata is mainly used to describe the data types of all columns in a single data table.\n\nFor each column, there should be an instance of the Data Type object.\n\n.. Note::\n\n Use ``get``, ``set``, ``add``, ``delete`` to update tags in the metadata. 
And use `query` for querying a column for its tags.\n\nArgs:\n primary_keys(List[str]): The primary key, a field used to uniquely identify each row in the table.\n The primary key of each row must be unique and not empty.\n\n column_list(list[str]): list of the comlumn name in the table, other columns lists are used to store column information.", "properties": { "primary_keys": { "default": [], "items": { "type": "string" }, "title": "Primary Keys", "type": "array", "uniqueItems": true }, "column_list": { "items": { "type": "string" }, "title": "The List of Column Names", "type": "array" }, "column_inspect_level": { "additionalProperties": { "type": "integer" }, "title": "Column Inspect Level", "type": "object" }, "pii_columns": { "default": [], "items": { "type": "string" }, "title": "Pii Columns", "type": "array", "uniqueItems": true }, "id_columns": { "default": [], "items": { "type": "string" }, "title": "Id Columns", "type": "array", "uniqueItems": true }, "int_columns": { "default": [], "items": { "type": "string" }, "title": "Int Columns", "type": "array", "uniqueItems": true }, "float_columns": { "default": [], "items": { "type": "string" }, "title": "Float Columns", "type": "array", "uniqueItems": true }, "bool_columns": { "default": [], "items": { "type": "string" }, "title": "Bool Columns", "type": "array", "uniqueItems": true }, "discrete_columns": { "default": [], "items": { "type": "string" }, "title": "Discrete Columns", "type": "array", "uniqueItems": true }, "datetime_columns": { "default": [], "items": { "type": "string" }, "title": "Datetime Columns", "type": "array", "uniqueItems": true }, "const_columns": { "default": [], "items": { "type": "string" }, "title": "Const Columns", "type": "array", "uniqueItems": true }, "datetime_format": { "title": "Datetime Format", "type": "object" }, "numeric_format": { "title": "Numeric Format", "type": "object" }, "categorical_encoder": { "anyOf": [ { "additionalProperties": { "$ref": 
"#/$defs/CategoricalEncoderType" }, "type": "object" }, { "type": "null" } ], "title": "Categorical Encoder" }, "categorical_threshold": { "anyOf": [ { "additionalProperties": { "$ref": "#/$defs/CategoricalEncoderType" }, "type": "object" }, { "type": "null" } ], "default": null, "title": "Categorical Threshold" }, "version": { "default": "1.0", "title": "Version", "type": "string" } }, "title": "Metadata", "type": "object" }, "Relationship": { "description": "Relationship between tables\n\nFor parent table, we don't need define primary key here.\nThe primary key is pre-defined in parent table's metadata.\n\nChild table's foreign key should be defined here.", "properties": { "version": { "default": "1.0", "title": "Version", "type": "string" }, "parent_table": { "title": "Parent Table", "type": "string" }, "child_table": { "title": "Child Table", "type": "string" }, "foreign_keys": { "items": { "$ref": "#/$defs/KeyTuple" }, "title": "Foreign Keys", "type": "array" } }, "required": [ "parent_table", "child_table", "foreign_keys" ], "title": "Relationship", "type": "object" } } }
- Fields:
- field relationships: List[Relationship] = []
- field version: str = '1.0'
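To make the JSON schema above concrete, the following sketch builds a plain dictionary in that shape and checks the two constraints the schema states explicitly: a `Relationship` requires `parent_table`, `child_table` and `foreign_keys`, and each `KeyTuple` holds exactly two items (parent, then child). The table and column names are invented for illustration, and `validate_relationship` is a hypothetical helper that is not part of sdgx.

```python
import json

# Hypothetical payload shaped like the MetadataCombiner schema.
payload = {
    "version": "1.0",
    "named_metadata": {
        "users": {"primary_keys": ["user_id"], "column_list": ["user_id", "name"]},
        "orders": {"primary_keys": ["order_id"], "column_list": ["order_id", "user_id"]},
    },
    "relationships": [
        {
            "parent_table": "users",
            "child_table": "orders",
            # Each foreign key is a KeyTuple: [parent column, child column].
            "foreign_keys": [["user_id", "user_id"]],
        }
    ],
}

REQUIRED = ("parent_table", "child_table", "foreign_keys")


def validate_relationship(rel: dict) -> list:
    """Return the problems found; an empty list means the constraints hold."""
    problems = [f"missing key: {key}" for key in REQUIRED if key not in rel]
    for pair in rel.get("foreign_keys", []):
        if len(pair) != 2:
            problems.append(f"foreign key must have 2 items, got {len(pair)}")
    return problems


errors = [validate_relationship(rel) for rel in payload["relationships"]]
# The payload is plain JSON, so it survives a dump/load round trip unchanged.
round_tripped = json.loads(json.dumps(payload))
```

This only mirrors the structural rules visible in the schema; the pydantic model itself performs its own validation when instantiated.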
- check()
Perform the necessary checks:
Whether the number of tables corresponds to the relationships.
Whether the table names correspond to those used in the relationships between tables.
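One plausible reading of the table-name check is that every table referenced by a relationship must also have an entry in `named_metadata`. The sketch below illustrates that idea with plain dictionaries; `find_unknown_tables` is a hypothetical helper, and the actual rules enforced by `check()` may differ.

```python
def find_unknown_tables(table_names, relationships):
    """Return table names used by relationships but absent from the metadata."""
    known = set(table_names)
    referenced = set()
    for rel in relationships:
        referenced.add(rel["parent_table"])
        referenced.add(rel["child_table"])
    return sorted(referenced - known)


tables = ["users", "orders"]
rels = [
    {"parent_table": "users", "child_table": "orders"},
    {"parent_table": "orders", "child_table": "items"},  # "items" has no metadata
]
unknown = find_unknown_tables(tables, rels)
```

A non-empty result would be the kind of inconsistency such a check is meant to surface before the combined metadata is used.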
- classmethod from_dataframe(dataframes: list[DataFrame], names: list[str], metadata_from_dataloader_kwargs: None | dict = None, relationshipe_inspector: None | str | type[Inspector] = 'SubsetRelationshipInspector', relationships_inspector_kwargs: None | dict = None, relationships: None | list[Relationship] = None) → MetadataCombiner
Combine multiple dataframes and the relationships between them.
- Parameters:
dataframes (list[pd.DataFrame]) – list of dataframes
names (list[str]) – list of names
metadata_from_dataloader_kwargs (dict) – kwargs for Metadata.from_dataloader()
relationshipe_inspector (str | type[Inspector]) – relationship inspector
relationships_inspector_kwargs (dict) – kwargs for InspectorManager.init()
relationships (list[Relationship]) – list of relationships
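The default inspector is named `SubsetRelationshipInspector`. Judging by the name alone (an assumption, since its behavior is not described here), it may look for child columns whose values are all contained in a parent key column. The sketch below shows that subset test on plain Python lists; `is_subset_key` and the sample tables are hypothetical and do not use the sdgx API.

```python
def is_subset_key(parent_values, child_values):
    """True when every non-missing child value also appears among the parent values."""
    parent = set(parent_values)
    return all(value in parent for value in child_values if value is not None)


# Two toy tables stored as column-name -> list-of-values mappings.
users = {"user_id": [1, 2, 3], "name": ["a", "b", "c"]}
orders = {"order_id": [10, 11, 12], "user_id": [1, 1, 3]}

# Pair the parent's (assumed) primary key with every child column that passes the test.
candidates = [
    (parent_col, child_col)
    for parent_col in ["user_id"]
    for child_col in orders
    if is_subset_key(users[parent_col], orders[child_col])
]
```

Here only `orders.user_id` qualifies, which matches the (parent, child) pair shape of a `KeyTuple`. Relationships that are already known can instead be passed through the `relationships` parameter.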
- classmethod from_dataloader(dataloaders: list[DataLoader], metadata_from_dataloader_kwargs: None | dict = None, relationshipe_inspector: None | str | type[Inspector] = 'SubsetRelationshipInspector', relationships_inspector_kwargs: None | dict = None, relationships: None | list[Relationship] = None)
Combine multiple dataloaders and the relationships between them.
- Parameters:
dataloaders (list[DataLoader]) – list of dataloaders
max_chunk (int) – max chunk count for relationship inspector.
metadata_from_dataloader_kwargs (dict) – kwargs for Metadata.from_dataloader()
relationshipe_inspector (str | type[Inspector]) – relationship inspector
relationships_inspector_kwargs (dict) – kwargs for InspectorManager.init()
relationships (list[Relationship]) – list of relationships
- classmethod load(save_dir: str | Path, metadata_subdir: str = 'metadata', relationship_subdir: str = 'relationship', version: str | None = None) → MetadataCombiner
Load metadata and relationships from JSON files.
- Parameters:
save_dir (str | Path) – directory to load from
metadata_subdir (str) – subdirectory for metadata, default is “metadata”
relationship_subdir (str) – subdirectory for relationship, default is “relationship”
version (str) – version to use; if not specified, the version is read from the version file
- save(save_dir: str | Path, metadata_subdir: str = 'metadata', relationship_subdir: str = 'relationship')
Save metadata to JSON files.
This creates separate subdirectories for the metadata and the relationships.
- Parameters:
save_dir (str | Path) – directory to save
metadata_subdir (str) – subdirectory for metadata, default is “metadata”
relationship_subdir (str) – subdirectory for relationship, default is “relationship”
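The directory layout described for save and load can be sketched with the standard library alone. In this minimal sketch, `save_combined` and `load_combined` are hypothetical stand-ins, and the file names (one JSON file per table, numbered files per relationship) are assumptions; the names sdgx actually writes are not stated here.

```python
import json
import tempfile
from pathlib import Path


def save_combined(save_dir, named_metadata, relationships,
                  metadata_subdir="metadata", relationship_subdir="relationship"):
    """Write one JSON file per table and per relationship (file names are assumed)."""
    root = Path(save_dir)
    (root / metadata_subdir).mkdir(parents=True, exist_ok=True)
    (root / relationship_subdir).mkdir(parents=True, exist_ok=True)
    for name, meta in named_metadata.items():
        (root / metadata_subdir / f"{name}.json").write_text(json.dumps(meta))
    for index, rel in enumerate(relationships):
        (root / relationship_subdir / f"{index}.json").write_text(json.dumps(rel))


def load_combined(save_dir, metadata_subdir="metadata", relationship_subdir="relationship"):
    """Read back everything written by save_combined."""
    root = Path(save_dir)
    named = {path.stem: json.loads(path.read_text())
             for path in sorted((root / metadata_subdir).glob("*.json"))}
    rels = [json.loads(path.read_text())
            for path in sorted((root / relationship_subdir).glob("*.json"))]
    return named, rels


named_metadata = {"users": {"primary_keys": ["user_id"]},
                  "orders": {"primary_keys": ["order_id"]}}
relationships = [{"parent_table": "users", "child_table": "orders",
                  "foreign_keys": [["user_id", "user_id"]]}]

# Round trip through a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    save_combined(tmp, named_metadata, relationships)
    loaded_named, loaded_rels = load_combined(tmp)
```

With the real class, the equivalent round trip would go through `save(save_dir)` on an instance and `MetadataCombiner.load(save_dir)`, following the signatures documented on this page.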
- classmethod upgrade(old_version: str, named_metadata: dict[str, Metadata], relationships: list[Relationship]) → None
Upgrade metadata from an old version to the new version.
Metadata.upgrade and Relationship.upgrade already attempt an upgrade when loading, so this method only performs the Combiner's own upgrade.
- property fields: Iterable[str]
Return all fields in MetadataCombiner.