Zingg Entity Resolution Package

Zingg Python APIs for entity resolution, record linkage, data mastering and deduplication using ML (https://www.zingg.ai)

Requires Python 3.6+ and Spark 3.5.0; otherwise, zingg.client.Zingg() cannot be executed.

zingg.client

This module is the main entry point of the Zingg Python API

class zingg.client.Arguments[source]

Bases: object

This class helps supply match arguments to Zingg. There are 3 basic steps in any match process.

Defining:

Specifying information about data location, fields, and our notion of similarity.

Training:

Making Zingg learn the matching rules.

Matching:

Running the models on the entire dataset.
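The three steps above can be related to Zingg phases roughly as sketched below. This is a minimal sketch under stated assumptions, not the library's prescribed workflow: the phase names come from the ClientOptions description and the label phase documented in this reference, while WORKFLOW, phases_to_run, and run_phase are hypothetical helper names. run_phase imports zingg lazily because calling it needs Zingg, Spark, and a JVM.

```python
# Hypothetical grouping of Zingg phases under the three steps described above.
# "defining" runs no phase: it only populates an Arguments object.
WORKFLOW = [
    ("defining", []),
    ("training", ["findTrainingData", "label", "train"]),
    ("matching", ["match"]),
]


def phases_to_run(workflow=WORKFLOW):
    """Flatten the workflow into the ordered list of phase names."""
    return [phase for _step, phases in workflow for phase in phases]


def run_phase(args, phase):
    """Run one phase with an already populated Arguments object.

    Imports zingg lazily: calling this requires Zingg, Spark and a JVM.
    """
    from zingg.client import ClientOptions, Zingg

    options = ClientOptions([ClientOptions.PHASE, phase])
    Zingg(args, options).initAndExecute()
```

With a configured Arguments object, each name returned by phases_to_run() could be passed to run_phase in turn; the label phase is interactive, so in practice it is usually run separately.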

copyArgs(phase)[source]
static createArgumentsFromJSON(fileName, phase)[source]

Method to create an object of this class from the JSON file and phase parameter value.

Parameters:
  • fileName (String) – The CONF parameter value of ClientOption object

  • phase (String) – The PHASE parameter value of ClientOption object

Returns:

The pointer containing the address of this class object

Return type:

pointer(Arguments)

static createArgumentsFromJSONString(jsonArgs, phase)[source]
getArgs()[source]

Method to get pointer address of this class

Returns:

The pointer containing the address of this class object

Return type:

pointer(Arguments)

getModelId()[source]
getZinggBaseModelDir()[source]
getZinggBaseTrainingDataDir()[source]

Method to get the location of the folder where Zingg saves the training data found by findTrainingData

getZinggModelDir()[source]
getZinggTrainingDataMarkedDir()[source]

Method to get the location of the folder where Zingg saves the marked training data labeled by the user

getZinggTrainingDataUnmarkedDir()[source]

Method to get the location of the folder where Zingg saves the training data found by findTrainingData

setArgs(argumentsObj)[source]

Method to set this class object

Parameters:

argumentsObj (pointer(Arguments)) – Argument object to set this object

setColumn(column)[source]

Method to set the column parameter value: the column whose stop words are to be recommended through Zingg

Parameters:

column (String) – name of the column whose stop words are to be recommended

setData(*pipes)[source]

Method to set the file path of the file to be matched.

Parameters:

pipes (Pipe[]) – input data pipes separated by comma e.g. (pipe1,pipe2,..)

setFieldDefinition(fieldDef)[source]

Method to convert Python objects to Java FieldDefinition objects and set the field definitions associated with this client

Parameters:

fieldDef (List(FieldDefinition)) – python FieldDefinition object list

setLabelDataSampleSize(labelDataSampleSize)[source]

Method to set the labelDataSampleSize parameter value: the fraction of the complete data set used for seeding the labeled data. Labelling is costly, so we want a fast, approximate way of looking at a small sample of the records and identifying expected matches and nonmatches.

Parameters:

labelDataSampleSize (float) – value between 0.0 and 1.0 denoting portion of dataset to use in generating seed samples
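Because the value must lie between 0.0 and 1.0, it can be worth validating it before it reaches Zingg. The following is a minimal sketch; checked_sample_fraction and set_label_sample_size are hypothetical helper names, and args is assumed to be an Arguments object (or anything exposing setLabelDataSampleSize).

```python
def checked_sample_fraction(value):
    """Return value as a float if it lies in [0.0, 1.0], else raise ValueError."""
    fraction = float(value)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(
            f"labelDataSampleSize must be in [0.0, 1.0], got {fraction}"
        )
    return fraction


def set_label_sample_size(args, value):
    """Validate value, then pass it to args.setLabelDataSampleSize."""
    args.setLabelDataSampleSize(checked_sample_fraction(value))
```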

setModelId(id)[source]

Method to set the model id. ZINGG_DIR/MODEL_ID is used to save the model

Parameters:

id (String) – model id value

setNumPartitions(numPartitions)[source]

Method to set the numPartitions parameter value: the number of partitions to use for the given data pipes

Parameters:

numPartitions (int) – number of partitions for given data pipes

setOutput(*pipes)[source]

Method to set the output directory where the match result will be saved

Parameters:

pipes (Pipe[]) – output data pipes separated by comma e.g. (pipe1,pipe2,..)

setStopWordsCutoff(stopWordsCutoff)[source]

Method to set the stopWordsCutoff parameter value. By default, Zingg extracts 10% of the high frequency unique words from a dataset. If the user wants a different selection, they should set the stopWordsCutoff property.

Parameters:

stopWordsCutoff (float) – the stop words cutoff parameter value

setTrainingSamples(*pipes)[source]

Method to set existing training samples to be matched.

Parameters:

pipes (Pipe[]) – input training data pipes separated by comma e.g. (pipe1,pipe2,..)

setZinggDir(f)[source]

Method to set the location for Zingg to save its internal computations and models. Please set it to a place where the program has write access.

Parameters:

f (String) – Zingg directory name of the models

writeArgumentsToJSON(fileName)[source]

Method to write JSON file from the object of this class

Parameters:

fileName (String) – The CONF parameter value of ClientOption object or file address of json file

writeArgumentsToJSONString()[source]

Method to write the object of this class out as a JSON string

Returns:

The JSON string representation of this class object

Return type:

String

class zingg.client.ClientOptions(argsSent=None)[source]

Bases: object

Class that contains client options for the Zingg object

Parameters:
  • phase (String) – trainMatch, train, match, link, findAndLabel, findTrainingData, recommend etc.

  • args (List(String) or None) – list of Zingg command line options and their values, e.g. "--location"; optional argument for initializing this class

COLUMN = None

Column whose stop words are to be recommended through Zingg

Type:

COLUMN

CONF = None

conf parameter for this class

Type:

CONF

EMAIL = None

e-mail parameter for this class

Type:

EMAIL

LICENSE = None

license parameter for this class

Type:

LICENSE

LOCATION = None

location parameter for this class

Type:

LOCATION

MODEL_ID = None

ZINGG_DIR/MODEL_ID is used to save the model

Type:

MODEL_ID

PHASE = None

phase parameter for this class

Type:

PHASE

REMOTE = None

remote option used internally for running on Databricks

Type:

REMOTE

ZINGG_DIR = None

location where Zingg saves the model, training data etc

Type:

ZINGG_DIR

getClientOptions()[source]

Method to get pointer address of this class

Returns:

The pointer containing the address of this class object

Return type:

pointer(ClientOptions)

getConf()[source]

Method to get CONF value

Returns:

The CONF parameter value

Return type:

String

getLocation()[source]

Method to get LOCATION value

Returns:

The LOCATION parameter value

Return type:

String

getOptionValue(option)[source]

Method to get value for the key option

Parameters:

option (String) – key whose value is to be retrieved

Returns:

The value which is mapped for given key

Return type:

String

getPhase()[source]

Method to get PHASE value

Returns:

The PHASE parameter value

Return type:

String

hasLocation()[source]

Method to check whether the LOCATION parameter of this class is set (that is, not None)

Returns:

True if the LOCATION parameter is present, False otherwise

Return type:

Bool

setOptionValue(option, value)[source]

Method to map option key to the given value

Parameters:
  • option (String) – key that is mapped with value

  • value (String) – value to be set for given key

setPhase(newValue)[source]

Method to set PHASE value

Parameters:

newValue (String) – name of the phase

Returns:

The pointer containing the address of this class object after setting the phase

Return type:

pointer(ClientOptions)

class zingg.client.FieldDefinition(name, dataType, *matchType)[source]

Bases: object

This class defines each field that we use in matching. We can use it to configure the properties of each field we use for matching in Zingg.

Parameters:
  • name (String) – name of the field

  • dataType (String) – type of the data e.g. string, float, etc.

  • matchType (MatchType) – match type of this field e.g. FUZZY, EXACT, etc.
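A set of field definitions can be built from a plain list of specs, as in the minimal sketch below. The field names and the helper build_field_definitions are hypothetical. The sketch also assumes that MatchType is importable from zingg.client and exposes FUZZY and EXACT members, which may differ between Zingg versions.

```python
# Hypothetical field specs: (field name, data type, match type name).
FIELD_SPECS = [
    ("fname", "string", "FUZZY"),
    ("lname", "string", "FUZZY"),
    ("ssn", "string", "EXACT"),
]


def build_field_definitions(specs=FIELD_SPECS):
    """Turn (name, dataType, matchType) tuples into FieldDefinition objects.

    Assumes MatchType is importable from zingg.client; needs Zingg and a
    JVM at call time, so the import happens inside the function.
    """
    from zingg.client import FieldDefinition, MatchType

    return [
        FieldDefinition(name, data_type, getattr(MatchType, match_name))
        for name, data_type, match_name in specs
    ]
```

The resulting list can be handed to Arguments.setFieldDefinition, which expects a list of Python FieldDefinition objects.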

getFieldDefinition()[source]

Method to get pointer address of this class

Returns:

The pointer containing the address of this class object

Return type:

pointer(FieldDefinition)

setStopWords(stopWords)[source]

Method to add stopwords to this class object

Parameters:

stopWords (String) – location of the CSV file containing the stop words

stringify(str)[source]

Method to stringify the dataType before it is set in the FieldDefinition object

Parameters:

str (String) – dataType of the FieldDefinition

Returns:

The stringified value of the dataType

Return type:

String

class zingg.client.Zingg(args, options)[source]

Bases: object

This class is the main point of interface with the Zingg matching product. Construct a client to Zingg using provided arguments and spark master. If running locally, set the master to local.

Parameters:
  • args (Arguments) – arguments for training and matching

  • options (ClientOptions) – client option for this class object

execute()[source]

Method to execute this class object

executeLabel()[source]

Method to run label phase

executeLabelUpdate()[source]

Method to run label update phase

getArguments()[source]

Method to get arguments of this class object

Returns:

The pointer containing address of the Arguments object of this class object

Return type:

pointer(Arguments)

getMarkedRecords()[source]

Method to get marked record dataset from the inputpipe

Returns:

spark dataset containing marked records

Return type:

Dataset<Row>

getMarkedRecordsStat(markedRecords, value)[source]

Method to get the no. of records that are marked

Parameters:
  • markedRecords (Dataset<Row>) – spark dataset containing marked records

  • value (long) – flag value to check if markedRecord is initially matched or not

Returns:

The no. of marked records

Return type:

int

getMatchedMarkedRecordsStat()[source]

Method to get No. of records that are marked and matched

Returns:

The no. of matched marked records

Return type:

int

getOptions()[source]

Method to get client options of this class object

Returns:

The pointer containing the address of the ClientOptions object of this class object

Return type:

pointer(ClientOptions)

getUnmarkedRecords()[source]

Method to get unmarked record dataset from the inputpipe

Returns:

spark dataset containing unmarked records

Return type:

Dataset<Row>

getUnmatchedMarkedRecordsStat()[source]

Method to get No. of records that are marked and unmatched

Returns:

The no. of unmatched marked records

Return type:

int

getUnsureMarkedRecordsStat()[source]

Method to get the no. of records that are marked as Not Sure, i.e. where the user could not tell whether they match

Returns:

The no. of Not Sure marked records

Return type:

int
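The three marked-record counters can be combined into a quick labeling summary. This is a minimal sketch: label_summary is a hypothetical helper, client is assumed to be an initialized Zingg object (or anything exposing the three methods with the signatures documented here), and the counts are assumed to be plain integers.

```python
def label_summary(client):
    """Collect the marked-record counts from a Zingg client into a dict."""
    matched = client.getMatchedMarkedRecordsStat()
    unmatched = client.getUnmatchedMarkedRecordsStat()
    unsure = client.getUnsureMarkedRecordsStat()
    total = matched + unmatched + unsure
    return {
        "matched": matched,
        "unmatched": unmatched,
        "unsure": unsure,
        "total": total,
        # Share of labeled pairs that are matches; None when nothing is labeled.
        "match_ratio": matched / total if total else None,
    }
```

A very low or very high match_ratio can be a hint that more labeling rounds are needed before training.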

init()[source]

Method to initialize zingg client by reading internal configurations and functions

initAndExecute()[source]

Method to run both init and execute methods consecutively

processRecordsCli(unmarkedRecords, args)[source]

Method to get user input on unmarked records

Returns:

spark dataset containing updated records

Return type:

Dataset<Row>

processRecordsCliLabelUpdate(lines, args)[source]
setArguments(args)[source]

Method to set Arguments

Parameters:

args (Arguments) – provide arguments for this class object

setOptions(options)[source]

Method to set client options of this class object

Parameters:

options (ClientOptions) – provide client options for this class object

Returns:

The pointer containing address of the ClientOptions object of this class object

Return type:

pointer(ClientOptions)

writeLabelledOutput(updatedRecords, args)[source]

Method to write updated records after user input

writeLabelledOutputFromPandas(candidate_pairs_pd, args)[source]

Method to write updated records (as pandas df) after user input

class zingg.client.ZinggWithSpark(args, options)[source]

Bases: Zingg

This class is the main point of interface with the Zingg matching product. Construct a client to Zingg using provided arguments and spark master. If running locally, set the master to local.

Parameters:
  • args (Arguments) – arguments for training and matching

  • options (ClientOptions) – client option for this class object

zingg.client.getDfFromDs(data)[source]

Method to convert spark dataset to dataframe

Parameters:

data (DataSet) – provide spark dataset

Returns:

converted spark dataframe

Return type:

DataFrame

zingg.client.getGateway()[source]
zingg.client.getJVM()[source]
zingg.client.getPandasDfFromDs(data)[source]

Method to convert spark dataset to pandas dataframe

Parameters:

data (DataSet) – provide spark dataset

Returns:

converted pandas dataframe

Return type:

DataFrame

zingg.client.getSparkContext()[source]
zingg.client.getSparkSession()[source]
zingg.client.getSqlContext()[source]
zingg.client.initClient()[source]
zingg.client.initDataBricksConectClient()[source]
zingg.client.initSparkClient()[source]
zingg.client.parseArguments(argv)[source]

This method is used for checking mandatory arguments and creating an arguments list from Command line arguments

Parameters:

argv (List) – Values that are passed during the calling of the program along with the calling statement.

Returns:

a list containing necessary arguments to run any phase

Return type:

List
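The idea of checking mandatory arguments and returning an argument list can be illustrated with the standard library alone. This is not Zingg's implementation of parseArguments. It is a sketch of the same technique using argparse, under the assumption that --phase and --conf are the mandatory options. parse_mandatory is a hypothetical name.

```python
import argparse


def parse_mandatory(argv):
    """Require --phase and --conf; return them followed by any extra options."""
    parser = argparse.ArgumentParser(
        description="sketch of mandatory-argument checking"
    )
    # Assumption: these two options are the mandatory ones.
    parser.add_argument("--phase", required=True)
    parser.add_argument("--conf", required=True)
    # Unknown options are kept so they can be forwarded unchanged.
    known, extra = parser.parse_known_args(argv)
    return ["--phase", known.phase, "--conf", known.conf] + extra
```

When a mandatory option is missing, argparse prints a usage message and exits, which is the usual behavior for a command line entry point.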

zingg.pipes

This module is a submodule of zingg for working with different types of pipes. Classes of this module inherit from the Pipe class and use it to create many different types of pipes.

class zingg.pipes.BigQueryPipe(name)[source]

Bases: Pipe

Pipe Class for working with BigQuery pipeline

Parameters:

name (String) – name of the pipe.

CREDENTIAL_FILE = 'credentialsFile'
TABLE = 'table'
TEMP_GCS_BUCKET = 'temporaryGcsBucket'
VIEWS_ENABLED = 'viewsEnabled'
setCredentialFile(file)[source]

Method to set Credential file to the pipe

Parameters:

file (String) – credential file name

setTable(table)[source]

Method to set Table to the pipe

Parameters:

table (String) – provide table parameter

setTemporaryGcsBucket(bucket)[source]

Method to set TemporaryGcsBucket to the pipe

Parameters:

bucket (String) – provide bucket parameter

setViewsEnabled(isEnabled)[source]

Method to set if viewsEnabled parameter is Enabled or not

Parameters:

isEnabled (Bool) – boolean that defines whether the viewsEnabled option is enabled or not

class zingg.pipes.CsvPipe(name, location=None, schema=None)[source]

Bases: Pipe

Class CsvPipe: used for working with delimited text files, in which a delimiter (see setDelimiter) separates units of text that belong in different columns.

Parameters:
  • name (String) – name of the pipe.

  • location (String or None) – (optional) location from where we read data

  • schema (Schema or None) – (optional) json schema for the pipe

setDelimiter(delimiter)[source]

This method is used to define delimiter of CsvPipe

Parameters:

delimiter (String) – a sequence of one or more characters for specifying the boundary between separate, independent regions in data streams

setHeader(header)[source]

Method to set header property of pipe

Parameters:

header (Boolean) – true if the pipe has a header, false otherwise

setLocation(location)[source]

Method to set location of pipe

Parameters:

location (String) – location from where we read data
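The three CsvPipe setters are typically applied together. Below is a minimal sketch with a hypothetical helper, configure_csv_pipe, that works on any object exposing setLocation, setDelimiter, and setHeader. It passes header as a Python bool, following the Boolean type documented here. The pipe name and file path in the trailing comment are made up.

```python
def configure_csv_pipe(pipe, location, delimiter=",", header=True):
    """Apply location, delimiter and header settings to a CsvPipe-like object."""
    pipe.setLocation(location)
    pipe.setDelimiter(delimiter)
    pipe.setHeader(header)
    return pipe


# With Zingg installed (pipe name and path are hypothetical):
#   from zingg.pipes import CsvPipe
#   input_pipe = configure_csv_pipe(CsvPipe("customers"), "data/customers.csv")
```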

class zingg.pipes.InMemoryPipe(name, df=None)[source]

Bases: Pipe

Pipe Class for working with InMemory pipeline

Parameters:
  • name (String) – name of the pipe

  • df (Dataset or None) – provide dataset for this pipe (optional)

getDataset()[source]

Method to get Dataset from pipe

Returns:

dataset of the pipe in the format of spark dataset

Return type:

Dataset<Row>

setDataset(df)[source]

Method to set DataFrame of the pipe

Parameters:

df (DataFrame) – pandas or spark dataframe for the pipe

class zingg.pipes.Pipe(name, format)[source]

Bases: object

Pipe class for working with different data pipelines. The actual pipe definition is supplied in the args. One pipe can be used at multiple places with different tables, locations, queries, etc.

Parameters:
  • name (String) – name of the pipe

  • format (Format) – format of the pipe e.g. bigquery, InMemory, etc.

addProperty(name, value)[source]

Method for adding different properties of pipe

Parameters:
  • name (String) – name of the property

  • value (String) – value you want to set for the property

getPipe()[source]

Method to get Pipe

Returns:

pipe parameter values in the format of a list of string

Return type:

Pipe

setSchema(s)[source]

Method to set pipe schema value

Parameters:

s (Schema) – json schema for the pipe

toString()[source]

Method to get pipe parameter values

Returns:

pipe information in list format

Return type:

List[String]

class zingg.pipes.SnowflakePipe(name)[source]

Bases: Pipe

Pipe Class for working with Snowflake pipeline

Parameters:

name (String) – name of the pipe

DATABASE = 'sfDatabase'
DBTABLE = 'dbtable'
PASSWORD = 'sfPassword'
SCHEMA = 'sfSchema'
URL = 'sfUrl'
USER = 'sfUser'
WAREHOUSE = 'sfWarehouse'
setDatabase(db)[source]

Method to set Database to the pipe

Parameters:

db (Database) – provide Database parameter.

setDbTable(dbtable)[source]

Method to set the database table (dbtable) of the pipe

Parameters:

dbtable (String) – provide dbtable parameter.

setPassword(passwd)[source]

Method to set Password to the pipe

Parameters:

passwd (String) – provide Password parameter.

setSFSchema(schema)[source]

Method to set Schema to the pipe

Parameters:

schema (Schema) – provide schema parameter.

setURL(url)[source]

Method to set url to the pipe

Parameters:

url (String) – provide url for this pipe

setUser(user)[source]

Method to set User to the pipe

Parameters:

user (String) – provide User parameter.

setWarehouse(warehouse)[source]

Method to set warehouse parameter to the pipe

Parameters:

warehouse (String) – provide warehouse parameter.
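The SnowflakePipe setters can be driven from a settings mapping, with the password kept out of source code. This is a minimal sketch: configure_snowflake_pipe, the setting keys, and the SNOWFLAKE_PASSWORD environment variable name are all hypothetical, and pipe is assumed to be a SnowflakePipe (or any object exposing the setters documented above).

```python
import os


def configure_snowflake_pipe(pipe, settings, password_env="SNOWFLAKE_PASSWORD"):
    """Apply connection settings to a SnowflakePipe-like object.

    settings maps the keys below to string values; the password is read from
    the environment variable named by password_env (a hypothetical name).
    """
    setters = {
        "url": pipe.setURL,
        "user": pipe.setUser,
        "database": pipe.setDatabase,
        "schema": pipe.setSFSchema,
        "warehouse": pipe.setWarehouse,
        "dbtable": pipe.setDbTable,
    }
    unknown = set(settings) - set(setters)
    if unknown:
        raise KeyError(f"unsupported settings: {sorted(unknown)}")
    for key, value in settings.items():
        setters[key](value)
    # Raises KeyError if the environment variable is not set.
    pipe.setPassword(os.environ[password_env])
    return pipe
```

Rejecting unknown keys up front keeps a typo such as "wharehouse" from being silently ignored.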