Metadata#

class sdgx.data_models.metadata.CategoricalEncoderType(value)[source]#: An enumeration.

pydantic model sdgx.data_models.metadata.Metadata[source]#

Metadata is mainly used to describe the data types of all columns in a single data table.

For each column, there should be an instance of the Data Type object.

Note

Use get, set, add, and delete to update tags in the metadata, and use query to look up the tags of a column.

Parameters:

primary_keys (List[str]) – The primary key, a field used to uniquely identify each row in the table. The primary key of each row must be unique and not empty.
column_list (list[str]) – The list of column names in the table. The other column lists are used to store column information.

JSON schema:

{
   "title": "Metadata",
   "description": "Metadata is mainly used to describe the data types of all columns in a single data table.\n\nFor each column, there should be an instance of the Data Type object.\n\n.. Note::\n\n    Use ``get``, ``set``, ``add``, ``delete`` to update tags in the metadata. And use `query` for querying a column for its tags.\n\nArgs:\n    primary_keys(List[str]): The primary key, a field used to uniquely identify each row in the table.\n    The primary key of each row must be unique and not empty.\n\n    column_list(list[str]): list of the comlumn name in the table, other columns lists are used to store column information.",
   "type": "object",
   "properties": {
      "primary_keys": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Primary Keys",
         "type": "array",
         "uniqueItems": true
      },
      "column_list": {
         "items": {
            "type": "string"
         },
         "title": "The List of Column Names",
         "type": "array"
      },
      "column_inspect_level": {
         "additionalProperties": {
            "type": "integer"
         },
         "title": "Column Inspect Level",
         "type": "object"
      },
      "pii_columns": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Pii Columns",
         "type": "array",
         "uniqueItems": true
      },
      "id_columns": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Id Columns",
         "type": "array",
         "uniqueItems": true
      },
      "int_columns": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Int Columns",
         "type": "array",
         "uniqueItems": true
      },
      "float_columns": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Float Columns",
         "type": "array",
         "uniqueItems": true
      },
      "bool_columns": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Bool Columns",
         "type": "array",
         "uniqueItems": true
      },
      "discrete_columns": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Discrete Columns",
         "type": "array",
         "uniqueItems": true
      },
      "datetime_columns": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Datetime Columns",
         "type": "array",
         "uniqueItems": true
      },
      "const_columns": {
         "default": [],
         "items": {
            "type": "string"
         },
         "title": "Const Columns",
         "type": "array",
         "uniqueItems": true
      },
      "datetime_format": {
         "title": "Datetime Format",
         "type": "object"
      },
      "numeric_format": {
         "title": "Numeric Format",
         "type": "object"
      },
      "categorical_encoder": {
         "anyOf": [
            {
               "additionalProperties": {
                  "$ref": "#/$defs/CategoricalEncoderType"
               },
               "type": "object"
            },
            {
               "type": "null"
            }
         ],
         "title": "Categorical Encoder"
      },
      "categorical_threshold": {
         "anyOf": [
            {
               "additionalProperties": {
                  "$ref": "#/$defs/CategoricalEncoderType"
               },
               "type": "object"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Categorical Threshold"
      },
      "version": {
         "default": "1.0",
         "title": "Version",
         "type": "string"
      }
   },
   "$defs": {
      "CategoricalEncoderType": {
         "enum": [
            "onehot",
            "label",
            "frequency"
         ],
         "title": "CategoricalEncoderType",
         "type": "string"
      }
   }
}
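As a rough illustration of what the schema above constrains, the following sketch checks a plain dict against a few of its rules (arrays of unique strings, and the three encoder values listed under `$defs`). The helper `check_metadata_dict` is hypothetical and is not part of sdgx; it uses only the standard library and covers only part of the schema.

```python
# Hypothetical helper, not part of sdgx: checks a plain dict against a few
# constraints taken from the JSON schema above.
ENCODER_TYPES = {"onehot", "label", "frequency"}  # enum values from "$defs"

# Array-of-string properties that the schema marks with "uniqueItems": true.
UNIQUE_STRING_ARRAYS = (
    "primary_keys", "pii_columns", "id_columns", "int_columns",
    "float_columns", "bool_columns", "discrete_columns",
    "datetime_columns", "const_columns",
)


def check_metadata_dict(doc: dict) -> list:
    """Return a list of problem descriptions; an empty list means none found."""
    problems = []
    for name in UNIQUE_STRING_ARRAYS:
        items = doc.get(name, [])  # the schema gives these a default of []
        if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
            problems.append(f"{name}: expected an array of strings")
        elif len(set(items)) != len(items):
            problems.append(f"{name}: items must be unique")

    encoders = doc.get("categorical_encoder")
    if isinstance(encoders, dict):
        for column, encoder in encoders.items():
            if not isinstance(encoder, str) or encoder not in ENCODER_TYPES:
                problems.append(
                    f"categorical_encoder[{column!r}]: unknown encoder {encoder!r}"
                )
    elif encoders is not None:
        problems.append("categorical_encoder: expected an object or null")
    return problems


ok = check_metadata_dict(
    {"id_columns": ["user_id"], "categorical_encoder": {"city": "onehot"}}
)
duplicated = check_metadata_dict({"id_columns": ["a", "a"]})
```

Here `ok` is an empty list, while `duplicated` reports the repeated item in `id_columns`.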

Fields:

bool_columns (Set[str])
categorical_encoder (Dict[str, sdgx.data_models.metadata.CategoricalEncoderType] | None)
categorical_threshold (Dict[int, sdgx.data_models.metadata.CategoricalEncoderType] | None)
column_inspect_level (Dict[str, int])
column_list (List[str])
const_columns (Set[str])
datetime_columns (Set[str])
datetime_format (Dict)
discrete_columns (Set[str])
float_columns (Set[str])
id_columns (Set[str])
int_columns (Set[str])
numeric_format (Dict)
pii_columns (Set[str])
primary_keys (Set[str])
version (str)

Validators:

check_column_list » column_list

field bool_columns: Set[str] = {}#

field categorical_encoder: Dict[str, CategoricalEncoderType] | None = {}#

field categorical_threshold: Dict[int, CategoricalEncoderType] | None = None#

field column_inspect_level: Dict[str, int] = {}#: column_inspect_level is used to store every inspector’s level, to specify the true type of each column.

field column_list: List[str] [Optional]#

column_list is the actual value of self.column_list.

Validated by:

check_column_list

field const_columns: Set[str] = {}#

field datetime_columns: Set[str] = {}#

field datetime_format: Dict = {}#

field discrete_columns: Set[str] = {}#

field float_columns: Set[str] = {}#

field id_columns: Set[str] = {}#

field int_columns: Set[str] = {}#

field numeric_format: Dict = {}#

field pii_columns: Set[str] = {}#: pii_columns stores the names of all PII columns.

field primary_keys: Set[str] = {}#: primary_keys stores a single primary key or a composite primary key.

field version: str = '1.0'#

add(key: str, values: str | Iterable[str])[source]#

Add tags.

Parameters:

key (str) – The key to add.
values (str | Iterable[str]) – The value to add.

Example

# Add all id columns
m.add("id_columns", "user_id")
m.add("id_columns", "ticket_id")
# OR
m.add("id_columns", ["user_id", "ticket_id"])
# OR
# add datetime format
m.add('datetime_format',{"col_1": "%Y-%m-%d %H:%M:%S", "col_2": "%d %b %Y"})
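The call shapes in the example above suggest that `add` accepts a single string, an iterable of strings, or a mapping for format fields. The sketch below shows one way such normalization could work on a plain dict. The function `add_tag` is a hypothetical stand-in and its merging rules are an assumption, not the sdgx implementation.

```python
from collections.abc import Iterable, Mapping


def add_tag(store: dict, key: str, values) -> None:
    """Hypothetical stand-in for Metadata.add, operating on a plain dict."""
    if isinstance(values, Mapping):
        # Assumption: mapping values (e.g. datetime formats) are merged key by key.
        store.setdefault(key, {}).update(values)
        return
    # A str is itself iterable, so it must be checked before the Iterable test.
    if isinstance(values, str) or not isinstance(values, Iterable):
        values = [values]  # a single tag becomes a one-element list
    store.setdefault(key, set()).update(values)


store = {}
add_tag(store, "id_columns", "user_id")
add_tag(store, "id_columns", ["user_id", "ticket_id"])  # duplicates collapse
add_tag(store, "datetime_format", {"col_1": "%Y-%m-%d %H:%M:%S"})
```

Because the tag values are kept in a set, adding `"user_id"` twice leaves a single entry.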

change_column_type(column_names: str | List[str], column_original_type: str, column_new_type: str)[source]#: Change the type of one or more columns.

check()[source]#

Checks column info.

When passing as input to the next module, perform necessary checks, including:

- Is the primary key correctly defined (in the column list), and does it have the ID data type?
- Is the definition of any column in the table missing?
- Are there any unknown columns that have been incorrectly updated?
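The checks described for `check` can be sketched with plain set operations. The function `find_column_problems` below is hypothetical and returns a list of messages. How sdgx itself reports a failed check is not stated here, so the return style is an assumption.

```python
def find_column_problems(column_list, primary_keys, id_columns, typed_columns):
    """Hypothetical stand-in for the checks described above.

    typed_columns is the set of all columns that have some data type assigned.
    """
    columns = set(column_list)
    typed = set(typed_columns)
    problems = []

    # 1. Every primary key must be a known column and carry the ID data type.
    for key in sorted(primary_keys):
        if key not in columns:
            problems.append(f"primary key {key!r} is not in column_list")
        if key not in id_columns:
            problems.append(f"primary key {key!r} does not have the ID data type")

    # 2. Every column in the table needs a data type definition.
    for name in sorted(columns - typed):
        problems.append(f"column {name!r} has no data type")

    # 3. No data type may be assigned to a column the table does not have.
    for name in sorted(typed - columns):
        problems.append(f"unknown column {name!r} has a data type")

    return problems


clean = find_column_problems(
    ["user_id", "age"], {"user_id"}, {"user_id"}, {"user_id", "age"}
)
broken = find_column_problems(
    ["user_id", "age"], {"user_id"}, set(), {"user_id", "age", "ghost"}
)
```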

validator check_column_list » column_list[source]#

check_single_primary_key(input_key: str)[source]#

Check whether a primary key is in column_list and has the ID data type.

Parameters:: input_key (str) – the input primary_key str

delete(key: str, value: str)[source]#

Delete tags.

Parameters:

key (str) – The key to delete.
value (str) – The value to delete.

Example

# Delete a misidentified id column
m.delete("id_columns", "not_an_id_columns")

dump()[source]#

Dump the model as a dict that can be used in downstream processes, such as a processor.

Returns:: dumped dict.
Return type:: dict

classmethod from_dataframe(df: DataFrame, include_inspectors: list[str] | None = None, exclude_inspectors: list[str] | None = None, inspector_init_kwargs: None | dict[str, Any] = None, check: bool = False) → Metadata[source]#

Initialize metadata from a DataFrame and Inspectors.

Parameters:

df (pd.DataFrame) – the input DataFrame.
include_inspectors (list[str]) – data type inspectors used in this metadata (table).
exclude_inspectors (list[str]) – data type inspectors NOT used in this metadata (table).
inspector_init_kwargs (dict) – inspector args.

classmethod from_dataloader(dataloader: DataLoader, max_chunk: int = 10, primary_keys: Set[str] | None = None, include_inspectors: Iterable[str] | None = None, exclude_inspectors: Iterable[str] | None = None, inspector_init_kwargs: None | dict[str, Any] = None, check: bool = False) → Metadata[source]#

Initialize metadata from a DataLoader and Inspectors.

Parameters:

dataloader (DataLoader) – the input DataLoader.
max_chunk (int) – max chunk count.
primary_keys (Set[str]) – primary keys, see Metadata for more details.
include_inspectors (list[str]) – data type inspectors used in this metadata (table).
exclude_inspectors (list[str]) – data type inspectors NOT used in this metadata (table).
inspector_init_kwargs (dict) – inspector args.

get(key: str) → Set[str][source]#

Get all tags by key.

Parameters:: key (str) – The key to get.

Example

# Get all id columns
m.get("id_columns") == {"user_id", "ticket_id"}

get_all_data_type_columns()[source]#

Get all column names from self.xxx_columns.

All Lists with the suffix _columns in model fields and extend fields need to be collected. All defined column names will be counted.

Returns:: all_dtype_cols, the set of all column names.
Return type:: set
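The collection rule described for `get_all_data_type_columns` (take every field whose name ends in `_columns` and union the contents) can be sketched as follows. The function `all_data_type_columns` and the sample `fields` dict are hypothetical stand-ins for the model and its extended fields.

```python
def all_data_type_columns(fields: dict) -> set:
    """Union of every value whose key ends with "_columns" (hypothetical stand-in)."""
    collected = set()
    for name, columns in fields.items():
        if name.endswith("_columns"):
            collected |= set(columns)
    return collected


fields = {
    "column_list": ["user_id", "age", "height", "note"],  # no "_columns" suffix: skipped
    "id_columns": {"user_id"},
    "int_columns": {"user_id", "age"},
    "float_columns": {"height"},
    "datetime_format": {"joined": "%Y-%m-%d"},  # a format field: skipped
}
typed = all_data_type_columns(fields)
```

In this sample, `"note"` appears in `column_list` but in no `_columns` field, so it is absent from `typed`.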

get_column_data_type(column_name: str)[source]#

Get the exact type of a specific column.

Parameters:: column_name (str) – The name of the column to query.

Returns:: The data type query result.
Return type:: str

get_column_encoder_by_categorical_threshold(num_categories: int) → CategoricalEncoderType | None[source]#
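This method has no description here. Given that `categorical_threshold` maps integers to encoder types, one plausible reading is that the method picks the encoder attached to the largest threshold that the category count exceeds. The sketch below implements that reading only as an assumption. `EncoderType` and `encoder_for` are hypothetical stand-ins, and the strict `>` comparison is a guess.

```python
from enum import Enum


class EncoderType(str, Enum):
    """Stand-in enum using the three values listed in the JSON schema."""
    ONEHOT = "onehot"
    LABEL = "label"
    FREQUENCY = "frequency"


def encoder_for(num_categories, thresholds):
    """Assumed rule: use the encoder of the largest threshold below num_categories."""
    if thresholds is None:
        return None
    chosen = None
    for threshold in sorted(thresholds):
        if num_categories > threshold:
            chosen = thresholds[threshold]
        else:
            break  # thresholds are sorted, so no later one can match
    return chosen


thresholds = {
    0: EncoderType.ONEHOT,
    20: EncoderType.LABEL,
    100: EncoderType.FREQUENCY,
}
small = encoder_for(5, thresholds)
medium = encoder_for(50, thresholds)
```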

get_column_encoder_by_name(column_name) → CategoricalEncoderType | None[source]#

get_column_pii(column_name: str)[source]#

Return whether a column is a PII column.

Parameters:: column_name (str) – The name of the column to query.

Returns:: The PII query result.
Return type:: bool

classmethod load(path: str | Path) → Metadata[source]#: Load metadata from json file.

classmethod loads(attributes)[source]#

model_post_init(context: Any, /) → None#

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:

self – The BaseModel instance.
context – The context.

query(field: str) → Iterable[str][source]#

Query all tags of a field.

Parameters:: field (str) – The field to query.

Example

# Assume that user_id looks like 1,2,3,4
set(m.query("user_id")) == {"id_columns", "numeric_columns"}
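Conceptually, `query` is the reverse of `get`: instead of listing the columns under a tag, it lists the tags that contain a column. A minimal sketch on a plain dict of sets, where `query_tags` is a hypothetical stand-in rather than the sdgx method:

```python
def query_tags(tag_fields: dict, column: str):
    """Lazily yield every tag field whose set contains `column` (hypothetical)."""
    return (name for name, columns in tag_fields.items() if column in columns)


tags = {
    "id_columns": {"user_id", "ticket_id"},
    "int_columns": {"user_id"},
    "pii_columns": {"email"},
}
# The result is a lazy iterable, so materialize it before comparing or reusing it.
result = sorted(query_tags(tags, "user_id"))
```

Since the documented return type is `Iterable[str]`, it seems safest to wrap the result in `list`, `set`, or `sorted` before comparing it with a literal.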

remove_column(column_names: List[str] | str)[source]#

Remove columns from all column types.

Parameters:: column_names (List[str] | str) – The names of the columns to remove.

save(path: str | Path)[source]#: Save metadata to json file.
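Since `save` and `load` use a JSON file while many fields are sets, a round trip has to convert sets to arrays and back. The sketch below shows that idea with the standard library only. `save_fields`, `load_fields`, and the `set_fields` parameter are hypothetical, and the actual file layout written by sdgx is not specified here.

```python
import json
import tempfile
from pathlib import Path


def save_fields(fields: dict, path) -> None:
    """Write fields as JSON; sets become sorted lists because JSON has no set type."""
    serializable = {
        key: sorted(value) if isinstance(value, (set, frozenset)) else value
        for key, value in fields.items()
    }
    Path(path).write_text(json.dumps(serializable, indent=2), encoding="utf-8")


def load_fields(path, set_fields=("primary_keys", "id_columns")) -> dict:
    """Read fields back, restoring the named keys to sets."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {key: set(value) if key in set_fields else value for key, value in raw.items()}


original = {
    "primary_keys": {"user_id"},
    "id_columns": {"user_id", "ticket_id"},
    "column_list": ["user_id", "ticket_id"],
    "version": "1.0",
}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metadata.json"
    save_fields(original, target)
    restored = load_fields(target)
```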

set(key: str, value: Any)[source]#

Set tags. The value is converted to a set if it is not one already.

Parameters:

key (str) – The key to set.
value (Any) – The value to set.

Example

# Set all id columns
m.set("id_columns", {"user_id", "ticket_id"})

update(attributes: dict[str, Any])[source]#: Update tags.

update_primary_key(primary_keys: Iterable[str] | str)[source]#

Update the primary key of the table.

When the primary key is updated, the original primary key is erased.

Parameters:: primary_keys (Iterable[str]) – the primary keys of this table.
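The replace-not-merge behavior described for `update_primary_key` can be illustrated on a plain dict. `replace_primary_keys` is a hypothetical stand-in, and treating a bare string as a single key is an assumption based on the `Iterable[str] | str` signature.

```python
def replace_primary_keys(fields: dict, primary_keys) -> None:
    """Hypothetical stand-in: overwrite the primary keys instead of adding to them."""
    if isinstance(primary_keys, str):
        primary_keys = [primary_keys]  # a bare string is treated as one key
    fields["primary_keys"] = set(primary_keys)  # the old keys are discarded


fields = {"primary_keys": {"user_id"}}
replace_primary_keys(fields, ["order_id", "line_no"])  # composite key
composite = set(fields["primary_keys"])
replace_primary_keys(fields, "ticket_id")  # single key given as a string
```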

classmethod upgrade(old_version: str, fields: dict[str, Any]) → None[source]#

property format_fields: Iterable[str]#: Return all format fields in this metadata.

property tag_fields: Iterable[str]#: Return all tag fields in this metadata.
