Metadata
- pydantic model sdgx.data_models.metadata.Metadata
Metadata is mainly used to describe the data types of all columns in a single data table.
For each column, there should be an instance of the Data Type object.
Note
Use get, set, add, and delete to update tags in the metadata, and use query to look up the tags of a column.
- Parameters:
primary_keys (List[str]) – The primary key, a field used to uniquely identify each row in the table. The primary key of each row must be unique and not empty.
column_list (list[str]) – The list of column names in the table. The other column lists store per-column information.
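As a rough illustration of how these parameters fit together, the sketch below builds a plain dictionary with the same keys as the model's JSON schema and checks two constraints that schema states: the column arrays hold unique items, and every categorical_encoder value is one of onehot, label, or frequency. The column names and the helper check_payload are hypothetical. The sketch uses only the standard library and is not how sdgx validates metadata.

```python
import json

# Allowed values, taken from the CategoricalEncoderType enum in the JSON schema.
ENCODER_TYPES = {"onehot", "label", "frequency"}


def check_payload(payload: dict) -> list:
    """Return a list of human-readable problems found in a metadata-like dict."""
    problems = []
    for key, value in payload.items():
        # The schema marks primary_keys and every *_columns array as uniqueItems.
        is_unique_array = key.endswith("_columns") or key == "primary_keys"
        if is_unique_array and len(value) != len(set(value)):
            problems.append(f"{key} contains duplicate column names")
    for column, encoder in (payload.get("categorical_encoder") or {}).items():
        if encoder not in ENCODER_TYPES:
            problems.append(f"{column}: unknown encoder {encoder!r}")
    return problems


# Hypothetical table with three columns.
payload = {
    "primary_keys": ["user_id"],
    "column_list": ["user_id", "city", "signup_time"],
    "id_columns": ["user_id"],
    "discrete_columns": ["city"],
    "datetime_columns": ["signup_time"],
    "datetime_format": {"signup_time": "%Y-%m-%d %H:%M:%S"},
    "categorical_encoder": {"city": "onehot"},
    "version": "1.0",
}

problems = check_payload(payload)
# The payload contains only JSON-native types, so it survives a round trip.
round_tripped = json.loads(json.dumps(payload))
```

An empty problems list means the dictionary satisfies the two constraints checked here; the real model applies its own validation.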
Show JSON schema
{ "title": "Metadata", "description": "Metadata is mainly used to describe the data types of all columns in a single data table.\n\nFor each column, there should be an instance of the Data Type object.\n\n.. Note::\n\n Use ``get``, ``set``, ``add``, ``delete`` to update tags in the metadata. And use `query` for querying a column for its tags.\n\nArgs:\n primary_keys(List[str]): The primary key, a field used to uniquely identify each row in the table.\n The primary key of each row must be unique and not empty.\n\n column_list(list[str]): list of the comlumn name in the table, other columns lists are used to store column information.", "type": "object", "properties": { "primary_keys": { "default": [], "items": { "type": "string" }, "title": "Primary Keys", "type": "array", "uniqueItems": true }, "column_list": { "items": { "type": "string" }, "title": "The List of Column Names", "type": "array" }, "column_inspect_level": { "additionalProperties": { "type": "integer" }, "title": "Column Inspect Level", "type": "object" }, "pii_columns": { "default": [], "items": { "type": "string" }, "title": "Pii Columns", "type": "array", "uniqueItems": true }, "id_columns": { "default": [], "items": { "type": "string" }, "title": "Id Columns", "type": "array", "uniqueItems": true }, "int_columns": { "default": [], "items": { "type": "string" }, "title": "Int Columns", "type": "array", "uniqueItems": true }, "float_columns": { "default": [], "items": { "type": "string" }, "title": "Float Columns", "type": "array", "uniqueItems": true }, "bool_columns": { "default": [], "items": { "type": "string" }, "title": "Bool Columns", "type": "array", "uniqueItems": true }, "discrete_columns": { "default": [], "items": { "type": "string" }, "title": "Discrete Columns", "type": "array", "uniqueItems": true }, "datetime_columns": { "default": [], "items": { "type": "string" }, "title": "Datetime Columns", "type": "array", "uniqueItems": true }, "const_columns": { "default": [], "items": { 
"type": "string" }, "title": "Const Columns", "type": "array", "uniqueItems": true }, "datetime_format": { "title": "Datetime Format", "type": "object" }, "numeric_format": { "title": "Numeric Format", "type": "object" }, "categorical_encoder": { "anyOf": [ { "additionalProperties": { "$ref": "#/$defs/CategoricalEncoderType" }, "type": "object" }, { "type": "null" } ], "title": "Categorical Encoder" }, "categorical_threshold": { "anyOf": [ { "additionalProperties": { "$ref": "#/$defs/CategoricalEncoderType" }, "type": "object" }, { "type": "null" } ], "default": null, "title": "Categorical Threshold" }, "version": { "default": "1.0", "title": "Version", "type": "string" } }, "$defs": { "CategoricalEncoderType": { "enum": [ "onehot", "label", "frequency" ], "title": "CategoricalEncoderType", "type": "string" } } }
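- Fields: bool_columns, categorical_encoder, categorical_threshold, column_inspect_level, column_list, const_columns, datetime_columns, datetime_format, discrete_columns, float_columns, id_columns, int_columns, numeric_format, pii_columns, primary_keys, version
- Validators: check_column_list » column_list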
- field bool_columns: Set[str] = {}
- field categorical_encoder: Dict[str, CategoricalEncoderType] | None = {}
- field categorical_threshold: Dict[int, CategoricalEncoderType] | None = None
- field column_inspect_level: Dict[str, int] = {}
column_inspect_level is used to store every inspector’s level, to specify the true type of each column.
- field column_list: List[str] [Optional]
column_list is the actual value of self.column_list.
- Validated by: check_column_list
- field const_columns: Set[str] = {}
- field datetime_columns: Set[str] = {}
- field datetime_format: Dict = {}
- field discrete_columns: Set[str] = {}
- field float_columns: Set[str] = {}
- field id_columns: Set[str] = {}
- field int_columns: Set[str] = {}
- field numeric_format: Dict = {}
- field pii_columns: Set[str] = {}
pii_columns stores the names of all PII columns.
- field primary_keys: Set[str] = {}
primary_keys stores a single primary key or a composite primary key.
- field version: str = '1.0'
- add(key: str, values: str | Iterable[str])
Add tags.
- Parameters:
key (str) – The key to add.
values (str | Iterable[str]) – The value to add.
Example
# Add all id columns
m.add("id_columns", "user_id")
m.add("id_columns", "ticket_id")
# OR
m.add("id_columns", ["user_id", "ticket_id"])
# OR
# add datetime format
m.add('datetime_format', {"col_1": "%Y-%m-%d %H:%M:%S", "col_2": "%d %b %Y"})
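To make the "one value or several" behavior concrete, here is a minimal standalone sketch that stores tags in a plain dictionary of sets. The helper add_tags is hypothetical and only mirrors the string and iterable cases from the example; it does not cover the dictionary form used for datetime_format, and it is not sdgx's implementation.

```python
from collections.abc import Iterable


def add_tags(tags: dict, key: str, values) -> None:
    """Add one column name or several column names under a tag key."""
    # A bare string is a single column name, not an iterable of characters.
    if isinstance(values, str) or not isinstance(values, Iterable):
        values = [values]
    tags.setdefault(key, set()).update(values)


tags = {}
add_tags(tags, "id_columns", "user_id")
# Names that are already present are ignored because the store is a set.
add_tags(tags, "id_columns", ["user_id", "ticket_id"])
```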
- change_column_type(column_names: str | List[str], column_original_type: str, column_new_type: str)
Change the type of one or more columns.
- check()
Checks column info.
- When passing metadata as input to the next module, perform the necessary checks, including:
  - Is the primary key correctly defined (present in the column list), and does it have the ID data type?
  - Is the definition of any column in the table missing?
  - Are there any unknown columns that have been incorrectly updated?
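The three checks listed above can be sketched on plain Python collections as follows. The function check_columns, its argument layout, and the use of ValueError are assumptions made for illustration; the library may report failures differently.

```python
def check_columns(column_list, primary_keys, typed_columns):
    """Run the three checks; typed_columns maps a tag name to a set of column names."""
    all_typed = set().union(*typed_columns.values())
    id_columns = typed_columns.get("id_columns", set())

    # 1. Every primary key is a known column and is tagged as an ID column.
    for key in primary_keys:
        if key not in column_list:
            raise ValueError(f"primary key {key!r} is not in column_list")
        if key not in id_columns:
            raise ValueError(f"primary key {key!r} does not have the ID data type")

    # 2. Every column in the table has at least one data type.
    missing = set(column_list) - all_typed
    if missing:
        raise ValueError(f"columns without a data type: {sorted(missing)}")

    # 3. No tag refers to a column that is not in the table.
    unknown = all_typed - set(column_list)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")


typed = {"id_columns": {"user_id"}, "discrete_columns": {"city"}}
check_columns(["user_id", "city"], {"user_id"}, typed)  # passes silently
```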
- validator check_column_list » column_list
- check_single_primary_key(input_key: str)
Check whether a primary key is in column_list and has the ID data type.
- Parameters:
input_key (str) – the input primary_key str
- delete(key: str, value: str)
Delete tags.
- Parameters:
key (str) – The key to delete.
value (str) – The value to delete.
Example
# Delete a misidentified id column
m.delete("id_columns", "not_an_id_columns")
- dump()
Dump the model as a dict, which can be used in downstream processes such as a processor.
- Returns:
dumped dict.
- Return type:
dict
- classmethod from_dataframe(df: DataFrame, include_inspectors: list[str] | None = None, exclude_inspectors: list[str] | None = None, inspector_init_kwargs: None | dict[str, Any] = None, check: bool = False) Metadata
Initialize metadata from a DataFrame and inspectors.
- Parameters:
df (pd.DataFrame) – the input DataFrame.
include_inspectors (list[str]) – data type inspectors used in this metadata (table).
exclude_inspectors (list[str]) – data type inspectors NOT used in this metadata (table).
inspector_init_kwargs (dict) – inspector args.
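One plausible reading of include_inspectors and exclude_inspectors is an allow-list followed by a deny-list, with None meaning "no filter". The sketch below shows that reading on plain lists of names. The function select_inspectors and the inspector names are hypothetical, and the way sdgx actually combines the two arguments is an assumption here.

```python
def select_inspectors(available, include=None, exclude=None):
    """Keep names listed in `include` (if given), then drop names listed in `exclude`."""
    selected = list(available)
    if include is not None:
        allowed = set(include)
        selected = [name for name in selected if name in allowed]
    if exclude is not None:
        denied = set(exclude)
        selected = [name for name in selected if name not in denied]
    return selected


# Hypothetical inspector names, used only for illustration.
available = ["numeric", "datetime", "discrete"]
only_numeric = select_inspectors(available, include=["numeric"])
without_datetime = select_inspectors(available, exclude=["datetime"])
```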
- classmethod from_dataloader(dataloader: DataLoader, max_chunk: int = 10, primary_keys: Set[str] | None = None, include_inspectors: Iterable[str] | None = None, exclude_inspectors: Iterable[str] | None = None, inspector_init_kwargs: None | dict[str, Any] = None, check: bool = False) Metadata
Initialize metadata from a DataLoader and inspectors.
- Parameters:
dataloader (DataLoader) – the input DataLoader.
max_chunk (int) – max chunk count.
primary_keys (list[str]) – primary keys, see Metadata for more details.
include_inspectors (list[str]) – data type inspectors used in this metadata (table).
exclude_inspectors (list[str]) – data type inspectors NOT used in this metadata (table).
inspector_init_kwargs (dict) – inspector args.
- get(key: str) Set[str]
Get all tags by key.
- Parameters:
key (str) – The key to get.
Example
# Get all id columns
m.get("id_columns") == {"user_id", "ticket_id"}
- get_all_data_type_columns()
Get all column names from self.xxx_columns.
All collections with the suffix _columns, in both the model fields and the extended fields, are collected, and every column name defined in them is counted.
- Returns:
all_dtype_cols, the set of all column names.
- Return type:
set
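The collection step described above amounts to a union over every key that ends in _columns. A minimal sketch on a plain dictionary, with the hypothetical helper all_data_type_columns standing in for the method:

```python
def all_data_type_columns(fields: dict) -> set:
    """Union the column names stored under every key that ends with '_columns'."""
    collected = set()
    for name, value in fields.items():
        if name.endswith("_columns"):
            collected.update(value)
    return collected


fields = {
    "column_list": ["user_id", "age", "city"],  # ends with "_list", so skipped
    "id_columns": {"user_id"},
    "int_columns": {"user_id", "age"},
    "discrete_columns": {"city"},
    "datetime_format": {},  # a format field, so skipped
}
all_cols = all_data_type_columns(fields)
```

A column tagged under several types, such as user_id here, appears once in the result because the result is a set.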
- get_column_data_type(column_name: str)
Get the exact type of a specific column.
- Parameters:
column_name (str) – The name of the column to query.
- Returns:
The data type query result.
- Return type:
str
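Because a column can carry several type tags, something has to break the tie. The sketch below assumes that column_inspect_level is keyed by tag field name, that levels are positive, and that the tag with the highest level wins; the helper resolve_column_type, the level values, and the KeyError are hypothetical and may not match the library's behavior.

```python
def resolve_column_type(column, typed_columns, inspect_level):
    """Pick the tag with the highest inspect level among those containing `column`."""
    best_tag, best_level = None, 0
    for tag, columns in typed_columns.items():
        level = inspect_level.get(tag, 0)
        if column in columns and level > best_level:
            best_tag, best_level = tag, level
    if best_tag is None:
        raise KeyError(f"column {column!r} has no data type")
    # Strip the suffix: "id_columns" becomes "id".
    return best_tag[: -len("_columns")]


typed = {"int_columns": {"user_id", "age"}, "id_columns": {"user_id"}}
levels = {"int_columns": 10, "id_columns": 20}  # hypothetical levels

user_id_type = resolve_column_type("user_id", typed, levels)
age_type = resolve_column_type("age", typed, levels)
```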
- get_column_encoder_by_categorical_threshold(num_categories: int) CategoricalEncoderType | None
- get_column_encoder_by_name(column_name) CategoricalEncoderType | None
- get_column_pii(column_name: str)
Return whether a column is a PII column.
- Parameters:
column_name (str) – The name of the column to query.
- Returns:
The PII query result.
- Return type:
bool
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- query(field: str) Iterable[str]
Query all tags of a field.
- Parameters:
field (str) – The field to query.
Example
# Assume that user_id looks like 1,2,3,4
m.query("user_id") == ["id_columns", "numeric_columns"]
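query is the reverse of get: instead of listing the columns under a tag, it lists the tags that contain a column. A minimal standalone sketch of that reverse lookup follows; the helper query_tags is hypothetical, and returning the tags in sorted order is a choice made here so the result is deterministic, not something the library is known to guarantee.

```python
def query_tags(typed_columns, column):
    """Return, in sorted order, every tag whose column set contains `column`."""
    return sorted(tag for tag, columns in typed_columns.items() if column in columns)


typed = {
    "id_columns": {"user_id"},
    "int_columns": {"user_id", "age"},
    "discrete_columns": {"city"},
}
user_id_tags = query_tags(typed, "user_id")
```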
- remove_column(column_names: List[str] | str)
Remove one or more columns from all column types.
- Parameters:
column_names (List[str] | str) – The name, or list of names, of the columns to remove.
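Removing a column means dropping it from every type tag that mentions it. The sketch below does this on plain collections and returns new objects rather than mutating its inputs. The helper remove_columns is hypothetical, and also dropping the names from column_list is an assumption about what a complete removal would involve.

```python
def remove_columns(column_list, typed_columns, names):
    """Return copies of column_list and typed_columns without the given names."""
    # Accept either a single name or a collection of names.
    names = {names} if isinstance(names, str) else set(names)
    remaining = [column for column in column_list if column not in names]
    remaining_typed = {tag: columns - names for tag, columns in typed_columns.items()}
    return remaining, remaining_typed


columns = ["user_id", "age", "city"]
typed = {"id_columns": {"user_id"}, "int_columns": {"user_id", "age"}}
new_columns, new_typed = remove_columns(columns, typed, "user_id")
```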
- set(key: str, value: Any)
Set tags. The value is converted to a set if it is not already one.
- Parameters:
key (str) – The key to set.
value (Any) – The value to set.
Example
# Set all id columns
m.set("id_columns", {"user_id", "ticket_id"})
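The difference from add is that a set-style operation replaces what was stored under the key instead of merging into it. The sketch below illustrates that reading on a plain dictionary; the helper set_tags is hypothetical, and the overwrite behavior is an assumption drawn from the method name and description.

```python
def set_tags(tags: dict, key: str, value) -> None:
    """Replace the tags under `key`, converting `value` to a set when needed."""
    # Note: set("user_id") would split the string into characters,
    # so pass a single name wrapped in a list or set.
    tags[key] = value if isinstance(value, set) else set(value)


tags = {"id_columns": {"old_id"}}
set_tags(tags, "id_columns", ["user_id", "ticket_id", "user_id"])
```

After the call, old_id is gone and the duplicate user_id has collapsed into a single entry.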
- update_primary_key(primary_keys: Iterable[str] | str)
Update the primary key of the table.
When the primary key is updated, the original primary key is erased.
- Parameters:
primary_keys (Iterable[str]) – the primary keys of this table.
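The replace-not-merge behavior described above can be sketched as follows, accepting either one key or several to cover single and composite primary keys. This standalone function operates on a plain dictionary; validating the keys against column_list and raising ValueError are assumptions added for illustration.

```python
def update_primary_key(meta: dict, primary_keys) -> None:
    """Replace meta['primary_keys'] with the given key or keys."""
    keys = {primary_keys} if isinstance(primary_keys, str) else set(primary_keys)
    unknown = keys - set(meta["column_list"])
    if unknown:
        raise ValueError(f"primary keys not in column_list: {sorted(unknown)}")
    # The previous keys are dropped, not merged with the new ones.
    meta["primary_keys"] = keys


meta = {"column_list": ["user_id", "order_id", "item_id"], "primary_keys": {"user_id"}}
update_primary_key(meta, ["order_id", "item_id"])  # composite key replaces user_id
```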
- property format_fields: Iterable[str]
Return all format fields in this metadata.
- property tag_fields: Iterable[str]
Return all tag fields in this metadata.