China Mainland Post Code#
- class sdgx.data_models.inspectors.personal.ChinaMainlandPostCode(pattern: str | None = None, data_type_name: str | None = None, match_percentage: float | None = None, *args, **kwargs)#
Bases: RegexInspector
- _fit_column(column_data: Series)#
Regular expression matching for a single column, returning the matching ratio.
- Parameters:
column_data (pd.Series) – the column data.
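The matching ratio described above can be sketched in plain Python. This is an illustrative approximation, not the library's implementation: the helper name `match_ratio` is hypothetical, a plain list stands in for the `pd.Series`, and skipping `None` and empty strings is an assumption about how missing values are handled.

```python
import re

# Pattern documented for this inspector: exactly six digits.
POSTCODE_PATTERN = re.compile(r"^[0-9]{6}$")


def match_ratio(values, pattern=POSTCODE_PATTERN):
    """Return the fraction of non-empty values that match the pattern.

    Hypothetical helper: dropping None and empty strings before counting
    is an assumption, not documented behavior.
    """
    samples = [str(v) for v in values if v is not None and str(v) != ""]
    if not samples:
        return 0.0
    matched = sum(1 for s in samples if pattern.match(s))
    return matched / len(samples)


# Three of the five samples are six-digit strings.
ratio = match_ratio(["100000", "200120", "51800", "abc123", "310000"])
```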
- _inspect_level: int = 20#
Private variable used to store property inspect_level’s value.
- _match_percentage: float = 0.95#
Because a postcode is indistinguishable from any other six-digit integer, match_percentage is raised to 0.95 here so that pure integer columns are less likely to be recognized as postcodes.
- data_type_name: str = 'china_mainland_postcode'#
data_type_name is the name of the data type, such as email, US address, HKID etc.
- domain_verification(each_sample: str)#
The function domain_verification is used to add custom domain verification logic. When a sample matches a regular expression, the domain_verification function is executed for further verification.
Additional logic checks can be performed beyond regular expressions, making it more flexible. For example, in a company name, there may be address information. When determining the type of address, if the sample ends with “Company”, domain_verification can return False to avoid misclassification, thus improving the accuracy of the inspector.
This function has the power to veto. When the function outputs False, the sample will be classified as not matching the corresponding data type of the inspector.
If this function is not overridden, domain_verification returns True by default.
- Parameters:
each_sample (str) – string of each sample.
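The veto behavior described above might look like the following minimal sketch. The classes `SketchInspector` and `NoAllZeroPostCode` and the method `matches` are hypothetical stand-ins rather than sdgx classes, and rejecting "000000" is purely an illustrative rule, not a statement about real postcodes.

```python
import re


class SketchInspector:
    """Hypothetical stand-in for a regex-based inspector."""

    pattern = r"^[0-9]{6}$"

    def domain_verification(self, each_sample: str) -> bool:
        # Default behavior: never veto a sample that matched the regex.
        return True

    def matches(self, each_sample: str) -> bool:
        # A sample counts only if the regex matches AND the veto hook agrees.
        if re.match(self.pattern, each_sample) is None:
            return False
        return self.domain_verification(each_sample)


class NoAllZeroPostCode(SketchInspector):
    def domain_verification(self, each_sample: str) -> bool:
        # Illustrative rule only: veto the all-zero placeholder value.
        return each_sample != "000000"
```

With the default hook, "000000" counts as a match; with the overriding subclass, the same sample is vetoed even though it satisfies the regular expression.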
- fit(input_raw_data: DataFrame, *args, **kwargs)#
Fit the inspector.
Finds the list of regex columns from the tabular data (in pd.DataFrame).
- Parameters:
input_raw_data (pd.DataFrame) – the raw tabular data.
- inspect(*args, **kwargs) dict[str, Any]#
Inspect raw data and generate metadata.
- property inspect_level#
Inspect level is a concept introduced in version 0.1.6. A single column in a table may be marked by several inspectors at the same time. For example, an email column may be recognized as email, as an id column, and also as a discrete column, which causes confusion in subsequent processing. The inspect_level is used to determine the specific type of such a column.
Different inspectors are preset with different levels: more specific inspectors usually get higher levels, and general inspectors (like discrete) get a lower inspect_level.
The value of inspect_level is limited to 1-100. For the base class and the bool, discrete and numeric types, inspect_level is set to 10. For the datetime and id types, inspect_level is set to 20.
The current preset values make it easier for developers to insert a custom inspector at an intermediate level.
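As a rough illustration of how levels could settle a conflict between inspectors, the sketch below picks the data type with the highest inspect_level. The function `resolve_column_type` is hypothetical, and the way the library actually resolves conflicts is not shown in this page; only the 1-100 range and the example level values come from the text above.

```python
def resolve_column_type(candidates):
    """Pick the data type whose inspector has the highest inspect_level.

    `candidates` maps a data type name to the inspect_level of the
    inspector that marked the column. Hypothetical helper for illustration.
    """
    for name, level in candidates.items():
        if not 1 <= level <= 100:
            raise ValueError(f"inspect_level for {name!r} must be within 1-100")
    return max(candidates, key=candidates.get)


# A column marked both as discrete (level 10) and as a postcode (level 20).
chosen = resolve_column_type({"discrete": 10, "china_mainland_postcode": 20})
```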
- property match_percentage#
The match_percentage should be greater than 0.5 and less than 1.
Because a column may contain empty or erroneous values, match_percentage is the proportion of samples that must match the current regular expression. When the proportion of matching samples is higher than this value, the column is considered to fit the current data type.
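The threshold rule can be sketched as follows. Both function names are hypothetical, the bounds follow the range stated above, and the strict comparison follows the "higher than" wording; the library's own checks may differ.

```python
def validate_match_percentage(value: float) -> float:
    """Accept only thresholds strictly between 0.5 and 1 (as stated above)."""
    if not 0.5 < value < 1:
        raise ValueError("match_percentage should be > 0.5 and < 1")
    return value


def column_fits(matched_ratio: float, match_percentage: float = 0.95) -> bool:
    """Return True when the matching proportion exceeds the threshold."""
    return matched_ratio > validate_match_percentage(match_percentage)
```

For instance, with the default of 0.95 used by this inspector, a column in which 96% of samples match would be accepted, while one in which 60% match would not.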
- pattern: str = '^[0-9]{6}$'#
pattern is the regular expression string of current inspector.
- pii = False#
pii indicates whether a column contains private or sensitive information.
- ready: bool = False#
Indicates whether the inspector has completed its inference.
When completed, ready == True.