Zingg package
Zingg Python APIs for entity resolution, record linkage, data mastering and deduplication (https://www.zingg.ai)
Requires Python 3.6+ and Spark 3.1.2. Otherwise, zingg.client.Zingg() cannot be executed.
zingg.client
This module is the main entry point of the Zingg Python API
- class zingg.client.Arguments[source]
Bases:
object
This class helps supply match arguments to Zingg. There are 3 basic steps in any match process.
- Defining: specifying information about data location, fields, and our notion of similarity.
- Training: making Zingg learn the matching rules.
- Matching: running the models on the entire dataset.
- static createArgumentsFromJSON(fileName, phase)[source]
Method to create an object of this class from the JSON file and phase parameter value.
- Parameters
fileName (String) – The CONF parameter value of ClientOption object
phase (String) – The PHASE parameter value of ClientOption object
- Returns
A pointer containing the address of this class object
- Return type
pointer(Arguments)
- getArgs()[source]
Method to get pointer address of this class
- Returns
The pointer containing the address of this class object
- Return type
pointer(Arguments)
- setArgs(argumentsObj)[source]
Method to set this class object
- Parameters
argumentsObj (pointer(Arguments)) – Argument object to set this object
- setData(*pipes)[source]
Method to set the file path of the file to be matched.
- Parameters
pipes (Pipe[]) – input data pipes separated by comma e.g. (pipe1,pipe2,..)
- setFieldDefinition(fieldDef)[source]
Method to convert Python objects to Java FieldDefinition objects and set the field definitions associated with this client
- Parameters
fieldDef (List(FieldDefinition)) – Python FieldDefinition object list
- setLabelDataSampleSize(labelDataSampleSize)[source]
Method to set the labelDataSampleSize parameter value. This sets the fraction of the complete data set to be used for seeding the labeled data. Labelling is costly, so we want a fast, approximate way of looking at a small sample of the records and identifying expected matches and non-matches.
- Parameters
labelDataSampleSize (float) – value between 0.0 and 1.0 denoting portion of dataset to use in generating seed samples
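As a rough illustration of what a fraction-valued sample size means, the sketch below validates a value against the documented 0.0 to 1.0 range and estimates how many records that fraction would cover. The helper name `expected_sample_rows` is hypothetical and is not part of Zingg. Rounding up is an assumption, since this section does not describe how Zingg draws its sample.

```python
import math


def expected_sample_rows(total_rows: int, label_data_sample_size: float) -> int:
    """Estimate how many rows a sampling fraction would cover (illustrative only)."""
    if not 0.0 <= label_data_sample_size <= 1.0:
        raise ValueError("labelDataSampleSize must be between 0.0 and 1.0")
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    # Assumption: round up, so a non-zero fraction of a non-empty
    # dataset always covers at least one row.
    return math.ceil(total_rows * label_data_sample_size)


print(expected_sample_rows(1000, 0.5))
```

A small fraction keeps the labelling rounds quick on large inputs, at the cost of seeing fewer candidate pairs per round.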
- setModelId(id)[source]
Method to set the model id, which names the location where the model's output will be saved
- Parameters
id (String) – model id value
- setNumPartitions(numPartitions)[source]
Method to set the numPartitions parameter value. Sample size to use for seeding labeled data. We don’t want to run over all the data, as we want a quick way to seed some labeled data that we can manually edit.
- Parameters
numPartitions (int) – number of partitions for given data pipes
- setOutput(*pipes)[source]
Method to set the output directory where the match result will be saved
- Parameters
pipes (Pipe[]) – output data pipes separated by comma e.g. (pipe1,pipe2,..)
- setStopWordsCutoff(stopWordsCutoff)[source]
Method to set the stopWordsCutoff parameter value. By default, Zingg extracts 10% of the high-frequency unique words from a dataset. Users who want a different selection should set the stopWordsCutoff property.
- Parameters
stopWordsCutoff (float) – The stop words cutoff parameter value of ClientOption object or file address of json file
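One possible reading of the cutoff is "keep the given fraction of unique words, ranked by how often they occur". The sketch below implements that reading in plain Python. It is not Zingg's implementation: the helper name `high_frequency_words`, the whitespace tokenization, the lowercasing, and the round-up rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def high_frequency_words(records, cutoff=0.1):
    """Return the top `cutoff` fraction of unique words, most frequent first."""
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be between 0.0 and 1.0")
    # Assumption: words are whitespace-separated and compared case-insensitively.
    counts = Counter(word.lower() for record in records for word in record.split())
    keep = math.ceil(len(counts) * cutoff)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:keep]]


names = ["acme inc", "globex inc", "initech inc", "umbrella corp inc", "hooli corp"]
print(high_frequency_words(names, cutoff=0.1))
```

Raising the cutoff selects more words as stop-word candidates, and lowering it selects fewer.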
- class zingg.client.ClientOptions(args=None)[source]
Bases:
object
Class that contains client options for the Zingg object.
- Parameters
phase (String) – trainMatch, train, match, link, findAndLabel, findTrainingData, etc.
args (List(String) or None) – optional list of Zingg command line option values, e.g. "--location", used to initialize this class
- CONF = <py4j.java_gateway.JavaPackage object>
conf parameter for this class
- Type
CONF
- EMAIL = <py4j.java_gateway.JavaPackage object>
e-mail parameter for this class
- Type
EMAIL
- LICENSE = <py4j.java_gateway.JavaPackage object>
license parameter for this class
- Type
LICENSE
- LOCATION = <py4j.java_gateway.JavaPackage object>
location parameter for this class
- Type
LOCATION
- PHASE = <py4j.java_gateway.JavaPackage object>
phase parameter for this class
- Type
PHASE
- getClientOptions()[source]
Method to get pointer address of this class
- Returns
A pointer containing the address of this class object
- Return type
pointer(ClientOptions)
- getLocation()[source]
Method to get LOCATION value
- Returns
The LOCATION parameter value
- Return type
String
- getOptionValue(option)[source]
Method to get value for the key option
- Parameters
option (String) – key whose value is to be retrieved
- Returns
The value which is mapped for given key
- Return type
String
- hasLocation()[source]
Method to check whether the LOCATION parameter is set (that is, not None)
- Returns
A boolean indicating whether the LOCATION parameter is present
- Return type
Bool
- setOptionValue(option, value)[source]
Method to map option key to the given value
- Parameters
option (String) – key that is mapped with value
value (String) – value to be set for given key
- setPhase(newValue)[source]
Method to set PHASE value
- Parameters
newValue (String) – name of the phase
- Returns
A pointer containing the address of this class object after setting the phase
- Return type
pointer(ClientOptions)
- class zingg.client.FieldDefinition(name, dataType, *matchType)[source]
Bases:
object
This class defines each field that we use in matching. We can use it to configure the properties of each field we use for matching in Zingg.
- Parameters
name (String) – name of the field
dataType (String) – type of the data e.g. string, float, etc.
matchType (MatchType) – match type of this field e.g. FUZZY, EXACT, etc.
- getFieldDefinition()[source]
Method to get pointer address of this class
- Returns
The pointer containing the address of this class object
- Return type
pointer(FieldDefinition)
- class zingg.client.Zingg(args, options)[source]
Bases:
object
This class is the main point of interface with the Zingg matching product. Construct a client to Zingg using provided arguments and spark master. If running locally, set the master to local.
- Parameters
args (Arguments) – arguments for training and matching
options (ClientOptions) – client option for this class object
- getArguments()[source]
Method to get the arguments of this class object
- Returns
The pointer containing address of the Arguments object of this class object
- Return type
pointer(Arguments)
- getDfFromDs(data)[source]
Method to convert spark dataset to dataframe
- Parameters
data (DataSet) – provide spark dataset
- Returns
converted spark dataframe
- Return type
DataFrame
- getMarkedRecords()[source]
Method to get marked record dataset from the inputpipe
- Returns
spark dataset containing marked records
- Return type
Dataset<Row>
- getMarkedRecordsStat(markedRecords, value)[source]
Method to get the number of records that are marked
- Parameters
markedRecords (Dataset<Row>) – spark dataset containing marked records
value (long) – flag value to check if markedRecord is initially matched or not
- Returns
The no. of marked records
- Return type
int
- getMatchedMarkedRecordsStat()[source]
Method to get No. of records that are marked and matched
- Returns
The no. of matched marked records
- Return type
int
- getOptions()[source]
Method to get client options of this class object
- Returns
The pointer containing the address of the ClientOptions object of this class object
- Return type
pointer(ClientOptions)
- getPandasDfFromDs(data)[source]
Method to convert spark dataset to pandas dataframe
- Parameters
data (DataSet) – provide spark dataset
- Returns
converted pandas dataframe
- Return type
DataFrame
- getUnmarkedRecords()[source]
Method to get unmarked record dataset from the inputpipe
- Returns
spark dataset containing unmarked records
- Return type
Dataset<Row>
- getUnmatchedMarkedRecordsStat()[source]
Method to get No. of records that are marked and unmatched
- Returns
The no. of unmatched marked records
- Return type
int
- getUnsureMarkedRecordsStat()[source]
Method to get the number of records that are marked as Not Sure (it is unclear whether they match)
- Returns
The no. of Not Sure marked records
- Return type
int
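The marked-record statistics (matched, unmatched, and not sure) can be pictured as tallies over a per-record label flag. The sketch below uses plain Python dictionaries in place of a Spark dataset and does not call Zingg. The column name `z_isMatch` and the flag values (1 for match, 0 for non-match, 2 for not sure) are assumptions for illustration, since this section does not state them, and `marked_record_stats` is a hypothetical helper.

```python
from collections import Counter

# Assumed flag values; this section does not document them.
MATCH, NON_MATCH, NOT_SURE = 1, 0, 2


def marked_record_stats(marked_records, flag_column="z_isMatch"):
    """Count marked records by label flag (illustrative stand-in for the *Stat methods)."""
    counts = Counter(record[flag_column] for record in marked_records)
    return {
        "matched": counts[MATCH],
        "unmatched": counts[NON_MATCH],
        "unsure": counts[NOT_SURE],
        "total": sum(counts.values()),
    }


labelled = [{"id": i, "z_isMatch": flag} for i, flag in enumerate([1, 1, 0, 2, 1])]
print(marked_record_stats(labelled))
```

Tracking these counts between labelling rounds is a simple way to see whether the training data contains enough examples of each kind.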
- setArguments(args)[source]
Method to set Arguments
- Parameters
args (Arguments) – provide arguments for this class object
- setOptions(options)[source]
Method to set the client options of this class object
- Parameters
options (ClientOptions) – provide client options for this class object
- Returns
The pointer containing address of the ClientOptions object of this class object
- Return type
pointer(ClientOptions)
- class zingg.client.ZinggWithSpark(args, options)[source]
Bases:
Zingg
This class is the main point of interface with the Zingg matching product. Construct a client to Zingg using provided arguments and spark master. If running locally, set the master to local.
- Parameters
args (Arguments) – arguments for training and matching
options (ClientOptions) – client option for this class object
- zingg.client.parseArguments(argv)[source]
This method is used for checking mandatory arguments and creating an arguments list from Command line arguments
- Parameters
argv (List) – Values that are passed during the calling of the program along with the calling statement.
- Returns
a list containing necessary arguments to run any phase
- Return type
List
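A minimal sketch of this kind of mandatory-argument check, written with Python's standard `argparse` module. It is not the library's implementation. The flag spellings `--phase` and `--conf` are assumptions modeled on the PHASE and CONF options of ClientOptions, and the function name `parse_arguments_sketch` is hypothetical.

```python
import argparse


def parse_arguments_sketch(argv):
    """Check that the assumed mandatory options are present and return them as a flat list."""
    parser = argparse.ArgumentParser(description="Illustrative Zingg-style argument check")
    parser.add_argument("--phase", required=True, help="phase to run, e.g. train or match")
    parser.add_argument("--conf", required=True, help="path to a JSON configuration file")
    # parse_known_args ignores options this sketch does not know about,
    # and exits with an error message if a required option is missing.
    known, _unknown = parser.parse_known_args(argv)
    return ["--phase", known.phase, "--conf", known.conf]


print(parse_arguments_sketch(["--conf", "config.json", "--phase", "match"]))
```

Returning a flat list keeps the result in the same shape as the `args` list that ClientOptions accepts.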
zingg.pipes
This module is a submodule of Zingg for working with different types of pipes. The classes in this module inherit from the Pipe class and use it to create many different types of pipes.
- class zingg.pipes.BigQueryPipe(name)[source]
Bases:
Pipe
Pipe Class for working with BigQuery pipeline
- Parameters
name (String) – name of the pipe.
- CREDENTIAL_FILE = 'credentialsFile'
- TABLE = 'table'
- TEMP_GCS_BUCKET = 'temporaryGcsBucket'
- VIEWS_ENABLED = 'viewsEnabled'
- setCredentialFile(file)[source]
Method to set Credential file to the pipe
- Parameters
file (String) – credential file name
- setTable(table)[source]
Method to set Table to the pipe
- Parameters
table (String) – provide table parameter
- class zingg.pipes.CsvPipe(name, location=None, schema=None)[source]
Bases:
Pipe
Class CsvPipe: used for working with text files that use a pipe symbol to separate units of text that belong in different columns.
- Parameters
name (String) – name of the pipe.
location (String or None) – (optional) location from where we read data
schema (Schema or None) – (optional) json schema for the pipe
- setDelimiter(delimiter)[source]
This method is used to define delimiter of CsvPipe
- Parameters
delimiter (String) – a sequence of one or more characters for specifying the boundary between separate, independent regions in data streams
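To illustrate what the delimiter controls, the sketch below reads pipe-delimited text with Python's standard `csv` module. It does not call Zingg. The sample rows and the helper name `read_delimited` are made up for the example.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Parse delimited text with a header row into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


sample = "id|fname|lname\n1|Thomas|Smith\n2|Tomas|Smyth\n"
rows = read_delimited(sample, delimiter="|")
for row in rows:
    print(row["id"], row["fname"], row["lname"])
```

Reading the same text with the default comma delimiter would leave each line as a single unsplit column, which is why the delimiter has to match the file.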
- class zingg.pipes.InMemoryPipe(name, df=None)[source]
Bases:
Pipe
Pipe Class for working with InMemory pipeline
- Parameters
name (String) – name of the pipe
df (Dataset or None) – provide dataset for this pipe (optional)
- class zingg.pipes.Pipe(name, format)[source]
Bases:
object
Pipe class for working with different data pipelines. The actual pipe definition is held in the arguments. One pipe can be used in multiple places with different tables, locations, queries, etc.
- Parameters
name (String) – name of the pipe
format (Format) – format of the pipe, e.g. bigquery, InMemory, etc.
- addProperty(name, value)[source]
Method for adding different properties of pipe
- Parameters
name (String) – name of the property
value (String) – value you want to set for the property
- getPipe()[source]
Method to get Pipe
- Returns
pipe parameter values in the format of a list of string
- Return type
- class zingg.pipes.SnowflakePipe(name)[source]
Bases:
Pipe
Pipe Class for working with Snowflake pipeline
- Parameters
name (String) – name of the pipe
- DATABASE = 'sfDatabase'
- DBTABLE = 'dbtable'
- PASSWORD = 'sfPassword'
- SCHEMA = 'sfSchema'
- URL = 'sfUrl'
- USER = 'sfUser'
- WAREHOUSE = 'sfWarehouse'
- setDatabase(db)[source]
Method to set Database to the pipe
- Parameters
db (Database) – provide Database parameter.
- setPassword(passwd)[source]
Method to set Password to the pipe
- Parameters
passwd (String) – provide Password parameter.
- setSFSchema(schema)[source]
Method to set Schema to the pipe
- Parameters
schema (Schema) – provide schema parameter.
- setURL(url)[source]
Method to set url to the pipe
- Parameters
url (String) – provide url for this pipe
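The SnowflakePipe class constants read like the property names of a Snowflake connection. The sketch below collects them in a plain dictionary and reports which ones have no value yet. It does not call Zingg. Treating all seven properties as required is an assumption, the connection values are placeholders, and `missing_snowflake_properties` is a hypothetical helper.

```python
# Property names copied from the SnowflakePipe constants documented above.
SNOWFLAKE_KEYS = {
    "URL": "sfUrl",
    "USER": "sfUser",
    "PASSWORD": "sfPassword",
    "DATABASE": "sfDatabase",
    "SCHEMA": "sfSchema",
    "WAREHOUSE": "sfWarehouse",
    "DBTABLE": "dbtable",
}


def missing_snowflake_properties(props):
    """Return the documented property names that are absent or empty in props."""
    return sorted(key for key in SNOWFLAKE_KEYS.values() if not props.get(key))


# Placeholder values for illustration only.
props = {
    SNOWFLAKE_KEYS["URL"]: "example.snowflakecomputing.com",
    SNOWFLAKE_KEYS["USER"]: "analyst",
    SNOWFLAKE_KEYS["DATABASE"]: "CUSTOMERS",
}
missing = missing_snowflake_properties(props)
print(missing)
```

A check like this, run before a job is submitted, surfaces an incomplete pipe configuration early rather than at connection time.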