Zingg package

Contents

Zingg Python APIs for entity resolution, record linkage, data mastering and deduplication (https://www.zingg.ai)

Requires Python 3.6+ and Spark 3.1.2; otherwise, zingg.client.Zingg() cannot be executed.

zingg.client

This module is the main entry point of the Zingg Python API.

class zingg.client.Arguments[source]

Bases: object

This class helps supply match arguments to Zingg. There are 3 basic steps in any match process.

Defining

Specifying information about data location, fields, and our notion of similarity.

Training

Making Zingg learn the matching rules.

Matching

Running the models on the entire dataset.
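The three steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive recipe: the column names, file paths, pipe names and model id are hypothetical; `MatchType` is assumed to be importable from `zingg.client` (FieldDefinition refers to it, but it is not listed on this page); and the schema is passed as a "name type" string, which is also an assumption, since this page only describes it as a JSON schema. The Zingg imports are deferred into the functions so the helper code can be loaded without a Spark/JVM setup.

```python
# Hypothetical columns of the input file; "id" is carried along but not matched on.
ID_FIELD = "id"
MATCH_FIELDS = ["fname", "lname", "city"]


def schema_string(columns):
    """Render column names as 'name string, ...', treating every column as a string."""
    return ", ".join(f"{name} string" for name in columns)


def define(input_csv, output_dir):
    """Step 1, defining: data location, fields and notion of similarity."""
    # Deferred imports: zingg.client needs Spark and a JVM to be available.
    from zingg.client import Arguments, FieldDefinition, MatchType  # MatchType: assumed export
    from zingg.pipes import CsvPipe

    args = Arguments()
    args.setFieldDefinition(
        [FieldDefinition(name, "string", MatchType.FUZZY) for name in MATCH_FIELDS]
    )
    args.setModelId("100")      # hypothetical model id
    args.setZinggDir("models")  # must be a location the program can write to
    schema = schema_string([ID_FIELD] + MATCH_FIELDS)  # assumed schema format
    args.setData(CsvPipe("customers", input_csv, schema))
    args.setOutput(CsvPipe("customersOut", output_dir))
    return args


def run_phase(args, phase):
    """Steps 2 and 3, training and matching: run one phase such as 'train' or 'match'."""
    from zingg.client import ClientOptions, Zingg

    options = ClientOptions([ClientOptions.PHASE, phase])
    Zingg(args, options).initAndExecute()
```

With Zingg installed, a run might look like `args = define("data/customers.csv", "/tmp/customersOut")` followed by `run_phase(args, "train")` and `run_phase(args, "match")`.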

static createArgumentsFromJSON(fileName, phase)[source]

Method to create an object of this class from the JSON file and phase parameter value.

Parameters
  • fileName (String) – The CONF parameter value of the ClientOptions object

  • phase (String) – The PHASE parameter value of the ClientOptions object

Returns

A pointer containing the address of this class object

Return type

pointer(Arguments)

getArgs()[source]

Method to get pointer address of this class

Returns

The pointer containing the address of this class object

Return type

pointer(Arguments)

setArgs(argumentsObj)[source]

Method to set this class object

Parameters

argumentsObj (pointer(Arguments)) – Argument object to set this object

setData(*pipes)[source]

Method to set the file path of the file to be matched.

Parameters

pipes (Pipe[]) – input data pipes separated by comma e.g. (pipe1,pipe2,..)

setFieldDefinition(fieldDef)[source]

Method to convert Python objects to Java FieldDefinition objects and set the field definitions associated with this client

Parameters

fieldDef (List(FieldDefinition)) – Python FieldDefinition object list

setLabelDataSampleSize(labelDataSampleSize)[source]

Method to set the labelDataSampleSize parameter value. This sets the fraction of the complete data set to be used for seeding the labeled data. Labelling is costly, so we want a fast, approximate way of looking at a small sample of the records and identifying expected matches and non-matches.

Parameters

labelDataSampleSize (float) – value between 0.0 and 1.0 denoting portion of dataset to use in generating seed samples
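Because the value is a fraction, it can help to validate it before handing it to Zingg. A minimal sketch follows; the helper name `checked_fraction` is hypothetical and not part of the Zingg API.

```python
def checked_fraction(value):
    """Return value as a float, raising ValueError unless 0.0 <= value <= 1.0."""
    fraction = float(value)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(
            f"labelDataSampleSize must be between 0.0 and 1.0, got {fraction}"
        )
    return fraction


# Intended use, given an Arguments object named args:
# args.setLabelDataSampleSize(checked_fraction(0.1))
```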

setModelId(id)[source]

Method to set the model id

Parameters

id (String) – model id value

setNumPartitions(numPartitions)[source]

Method to set the numPartitions parameter value: the number of partitions to use for the given data pipes

Parameters

numPartitions (int) – number of partitions for given data pipes

setOutput(*pipes)[source]

Method to set the output directory where the match result will be saved

Parameters

pipes (Pipe[]) – output data pipes separated by comma e.g. (pipe1,pipe2,..)

setStopWordsCutoff(stopWordsCutoff)[source]

Method to set the stopWordsCutoff parameter value. By default, Zingg extracts 10% of the high-frequency unique words from a dataset. If the user wants a different selection, they should set the stopWordsCutoff property.

Parameters

stopWordsCutoff (float) – the stop words cutoff parameter value

setZinggDir(f)[source]

Method to set the location for Zingg to save its internal computations and models. Please set it to a place where the program has write access.

Parameters

f (String) – Zingg directory name of the models

writeArgumentsToJSON(fileName)[source]

Method to write JSON file from the object of this class

Parameters

fileName (String) – The CONF parameter value of the ClientOptions object, or the path of the JSON file

class zingg.client.ClientOptions(args=None)[source]

Bases: object

Class that contains client options for the Zingg object.

Parameters
  • phase (String) – trainMatch, train, match, link, findAndLabel, findTrainingData, etc.

  • args (List(String) or None) – list of Zingg command-line options and their values, e.g. “--location”; optional argument for initializing this class
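As a rough illustration, the options list alternates option names and values. The sketch below builds such a list from pairs and passes it to ClientOptions; the helper names `flatten_options` and `make_options` are hypothetical, and the Zingg import is deferred so the helper can be loaded without Spark.

```python
def flatten_options(pairs):
    """Turn [(option, value), ...] into the flat [option, value, ...] list."""
    flat = []
    for option, value in pairs:
        flat.extend([option, value])
    return flat


def make_options(phase, conf=None):
    """Build ClientOptions for a phase, optionally pointing at a JSON config file."""
    from zingg.client import ClientOptions  # needs Spark and a JVM

    pairs = [(ClientOptions.PHASE, phase)]
    if conf is not None:
        pairs.append((ClientOptions.CONF, conf))
    return ClientOptions(flatten_options(pairs))
```

With Zingg installed, `make_options("match").getPhase()` would be expected to return the phase that was passed in.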

CONF = <py4j.java_gateway.JavaPackage object>

conf parameter for this class

Type

CONF

EMAIL = <py4j.java_gateway.JavaPackage object>

e-mail parameter for this class

Type

EMAIL

LICENSE = <py4j.java_gateway.JavaPackage object>

license parameter for this class

Type

LICENSE

LOCATION = <py4j.java_gateway.JavaPackage object>

location parameter for this class

Type

LOCATION

PHASE = <py4j.java_gateway.JavaPackage object>

phase parameter for this class

Type

PHASE

getClientOptions()[source]

Method to get pointer address of this class

Returns

A pointer containing the address of this class object

Return type

pointer(ClientOptions)

getConf()[source]

Method to get CONF value

Returns

The CONF parameter value

Return type

String

getLocation()[source]

Method to get LOCATION value

Returns

The LOCATION parameter value

Return type

String

getOptionValue(option)[source]

Method to get value for the key option

Parameters

option (String) – key whose value is to be returned

Returns

The value which is mapped for given key

Return type

String

getPhase()[source]

Method to get PHASE value

Returns

The PHASE parameter value

Return type

String

hasLocation()[source]

Method to check whether the LOCATION parameter of this class is set (that is, not None)

Returns

True if the LOCATION parameter is present, False otherwise

Return type

Bool

setOptionValue(option, value)[source]

Method to map option key to the given value

Parameters
  • option (String) – key that is mapped with value

  • value (String) – value to be set for given key

setPhase(newValue)[source]

Method to set PHASE value

Parameters

newValue (String) – name of the phase

Returns

A pointer containing the address of this class object after setting the phase

Return type

pointer(ClientOptions)

class zingg.client.FieldDefinition(name, dataType, *matchType)[source]

Bases: object

This class defines each field that we use in matching. We can use it to configure the properties of each field we use for matching in Zingg.

Parameters
  • name (String) – name of the field

  • dataType (String) – type of the data e.g. string, float, etc.

  • matchType (MatchType) – match type of this field e.g. FUZZY, EXACT, etc.
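Since matchType is a variable-length argument, field definitions can be generated from a small table. The sketch below is illustrative only: the field names are hypothetical, `MatchType` is assumed to be importable from `zingg.client`, and only the match type names mentioned on this page are used. Whether particular combinations of match types are meaningful is not stated here, so each field uses a single one.

```python
# (field name, data type, match type names); hypothetical fields for illustration.
FIELD_SPECS = [
    ("fname", "string", ["FUZZY"]),
    ("lname", "string", ["FUZZY"]),
    ("ssn", "string", ["EXACT"]),
]


def match_type_names(specs):
    """Return the sorted, de-duplicated match type names used across all specs."""
    return sorted({name for _, _, names in specs for name in names})


def field_definitions(specs):
    """Build one FieldDefinition per spec, passing each match type positionally."""
    from zingg.client import FieldDefinition, MatchType  # MatchType: assumed export

    return [
        FieldDefinition(field, dtype, *(getattr(MatchType, name) for name in names))
        for field, dtype, names in specs
    ]
```

The resulting list is the kind of value that Arguments.setFieldDefinition expects.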

getFieldDefinition()[source]

Method to get pointer address of this class

Returns

The pointer containing the address of this class object

Return type

pointer(FieldDefinition)

setStopWords(stopWords)[source]

Method to add stopwords to this class object

Parameters

stopWords (String) – location of the CSV file containing the stop words

stringify(str)[source]

Method to stringify the dataType before it is set in the FieldDefinition object

Parameters

str (String) – dataType of the FieldDefinition

Returns

The stringified value of the dataType

Return type

String

class zingg.client.Zingg(args, options)[source]

Bases: object

This class is the main point of interface with the Zingg matching product. Construct a client to Zingg using provided arguments and spark master. If running locally, set the master to local.

Parameters
  • args (Arguments) – arguments for training and matching

  • options (ClientOptions) – client option for this class object

execute()[source]

Method to execute this class object

getArguments()[source]

Method to get the arguments of this class object

Returns

The pointer containing the address of the Arguments object of this class object

Return type

pointer(Arguments)

getDfFromDs(data)[source]

Method to convert spark dataset to dataframe

Parameters

data (DataSet) – provide spark dataset

Returns

converted spark dataframe

Return type

DataFrame

getMarkedRecords()[source]

Method to get the marked record dataset from the input pipe

Returns

spark dataset containing marked records

Return type

Dataset<Row>

getMarkedRecordsStat(markedRecords, value)[source]

Method to get the number of records that are marked

Parameters
  • markedRecords (Dataset<Row>) – spark dataset containing marked records

  • value (long) – flag value to check if markedRecord is initially matched or not

Returns

The number of marked records

Return type

int

getMatchedMarkedRecordsStat()[source]

Method to get the number of records that are marked and matched

Returns

The number of matched marked records

Return type

int

getOptions()[source]

Method to get client options of this class object

Returns

The pointer containing the address of the ClientOptions object of this class object

Return type

pointer(ClientOptions)

getPandasDfFromDs(data)[source]

Method to convert spark dataset to pandas dataframe

Parameters

data (DataSet) – provide spark dataset

Returns

converted pandas dataframe

Return type

DataFrame

getUnmarkedRecords()[source]

Method to get the unmarked record dataset from the input pipe

Returns

spark dataset containing unmarked records

Return type

Dataset<Row>

getUnmatchedMarkedRecordsStat()[source]

Method to get the number of records that are marked and unmatched

Returns

The number of unmatched marked records

Return type

int

getUnsureMarkedRecordsStat()[source]

Method to get the number of records that are marked as Not Sure, that is, where it is unclear whether they match

Returns

The number of Not Sure marked records

Return type

int
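The marked-record methods above can be combined into a small progress report. This is a sketch only: the helper names are hypothetical, `client` is any object exposing the methods documented here, and it is assumed that the Zingg client has already been initialised with init() before these calls are made.

```python
def label_summary(client):
    """Collect the marked-record counts of a Zingg client into a dict."""
    counts = {
        "matched": client.getMatchedMarkedRecordsStat(),
        "unmatched": client.getUnmatchedMarkedRecordsStat(),
        "unsure": client.getUnsureMarkedRecordsStat(),
    }
    # The total is computed from the three counts before it is added to the dict.
    counts["total"] = sum(counts.values())
    return counts


def marked_records_as_pandas(client):
    """Fetch the marked records and convert them to a pandas DataFrame."""
    return client.getPandasDfFromDs(client.getMarkedRecords())
```

Keeping the helpers free of Zingg imports means they can be exercised with a stand-in object when Spark is not available.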

init()[source]

Method to initialize zingg client by reading internal configurations and functions

initAndExecute()[source]

Method to run both init and execute methods consecutively

setArguments(args)[source]

Method to set Arguments

Parameters

args (Arguments) – provide arguments for this class object

setOptions(options)[source]

Method to set the client options of this class object

Parameters

options (ClientOptions) – provide client options for this class object

Returns

The pointer containing the address of the ClientOptions object of this class object

Return type

pointer(ClientOptions)

class zingg.client.ZinggWithSpark(args, options)[source]

Bases: Zingg

This class is the main point of interface with the Zingg matching product. Construct a client to Zingg using provided arguments and spark master. If running locally, set the master to local.

Parameters
  • args (Arguments) – arguments for training and matching

  • options (ClientOptions) – client option for this class object

zingg.client.parseArguments(argv)[source]

This method is used for checking mandatory arguments and creating an arguments list from the command-line arguments

Parameters

argv (List) – Values that are passed during the calling of the program along with the calling statement.

Returns

a list containing necessary arguments to run any phase

Return type

List

zingg.pipes

This module is a submodule of zingg for working with different types of pipes. Classes of this module inherit the Pipe class, and use that class to create many different types of pipes.

class zingg.pipes.BigQueryPipe(name)[source]

Bases: Pipe

Pipe Class for working with BigQuery pipeline

Parameters

name (String) – name of the pipe.

CREDENTIAL_FILE = 'credentialsFile'
TABLE = 'table'
TEMP_GCS_BUCKET = 'temporaryGcsBucket'
VIEWS_ENABLED = 'viewsEnabled'

setCredentialFile(file)[source]

Method to set Credential file to the pipe

Parameters

file (String) – credential file name

setTable(table)[source]

Method to set Table to the pipe

Parameters

table (String) – provide table parameter

setTemporaryGcsBucket(bucket)[source]

Method to set TemporaryGcsBucket to the pipe

Parameters

bucket (String) – provide bucket parameter

setViewsEnabled(isEnabled)[source]

Method to set if viewsEnabled parameter is Enabled or not

Parameters

isEnabled (Bool) – boolean that defines whether the viewsEnabled option is enabled

class zingg.pipes.CsvPipe(name, location=None, schema=None)[source]

Bases: Pipe

Class CsvPipe: used for working with delimited text files, in which a delimiter (see setDelimiter) separates units of text that belong in different columns.

Parameters
  • name (String) – name of the pipe.

  • location (String or None) – (optional) location from where we read data

  • schema (Schema or None) – (optional) json schema for the pipe

setDelimiter(delimiter)[source]

This method is used to define delimiter of CsvPipe

Parameters

delimiter (String) – a sequence of one or more characters for specifying the boundary between separate, independent regions in data streams

setHeader(header)[source]

Method to set header property of pipe

Parameters

header (Boolean) – true if the pipe has a header, false otherwise

setLocation(location)[source]

Method to set location of pipe

Parameters

location (String) – location from where we read data
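The three setters above can be applied together. The following is a minimal sketch: the helper name `configure_csv_pipe`, the pipe name and the file path are hypothetical, and the header value is passed through unchanged as the Boolean this page documents.

```python
def configure_csv_pipe(pipe, location, delimiter=",", header=True):
    """Apply location, delimiter and header settings to a CsvPipe-like object."""
    pipe.setLocation(location)
    pipe.setDelimiter(delimiter)
    pipe.setHeader(header)
    return pipe


# Intended use with Zingg installed (hypothetical pipe name and path):
# from zingg.pipes import CsvPipe
# pipe = configure_csv_pipe(CsvPipe("customers"), "data/customers.psv", delimiter="|")
```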

class zingg.pipes.InMemoryPipe(name, df=None)[source]

Bases: Pipe

Pipe Class for working with InMemory pipeline

Parameters
  • name (String) – name of the pipe

  • df (Dataset or None) – provide dataset for this pipe (optional)

getDataset()[source]

Method to get Dataset from pipe

Returns

dataset of the pipe in the format of spark dataset

Return type

Dataset<Row>

setDataset(df)[source]

Method to set DataFrame of the pipe

Parameters

df (DataFrame) – pandas or spark dataframe for the pipe

class zingg.pipes.Pipe(name, format)[source]

Bases: object

Pipe class for working with different data pipelines. It holds the actual pipe definition used in the arguments. One pipe can be used in multiple places with different tables, locations, queries, etc.

Parameters
  • name (String) – name of the pipe

  • format (Format) – format of the pipe, e.g. bigquery, InMemory, etc.

addProperty(name, value)[source]

Method for adding different properties of pipe

Parameters
  • name (String) – name of the property

  • value (String) – value you want to set for the property

getPipe()[source]

Method to get Pipe

Returns

pipe parameter values in the format of a list of string

Return type

Pipe

setSchema(s)[source]

Method to set pipe schema value

Parameters

s (Schema) – json schema for the pipe

toString()[source]

Method to get pipe parameter values

Returns

pipe information in list format

Return type

List[String]

class zingg.pipes.SnowflakePipe(name)[source]

Bases: Pipe

Pipe Class for working with Snowflake pipeline

Parameters

name (String) – name of the pipe

DATABASE = 'sfDatabase'
DBTABLE = 'dbtable'
PASSWORD = 'sfPassword'
SCHEMA = 'sfSchema'
URL = 'sfUrl'
USER = 'sfUser'
WAREHOUSE = 'sfWarehouse'

setDatabase(db)[source]

Method to set Database to the pipe

Parameters

db (Database) – provide Database parameter.

setDbTable(dbtable)[source]

Method to set DbTable to the pipe

Parameters

dbtable (String) – provide dbtable parameter.

setPassword(passwd)[source]

Method to set Password to the pipe

Parameters

passwd (String) – provide Password parameter.

setSFSchema(schema)[source]

Method to set Schema to the pipe

Parameters

schema (Schema) – provide schema parameter.

setURL(url)[source]

Method to set url to the pipe

Parameters

url (String) – provide url for this pipe

setUser(user)[source]

Method to set User to the pipe

Parameters

user (String) – provide User parameter.

setWarehouse(warehouse)[source]

Method to set warehouse parameter to the pipe

Parameters

warehouse (String) – provide warehouse parameter.
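One way to use the SnowflakePipe setters without hard-coding credentials is to read the values from the environment. This is a sketch under stated assumptions: the helper name and the environment variable names are hypothetical, and every value is passed to its setter as a plain string.

```python
import os

# Setter name -> environment variable; the variable names are hypothetical.
SNOWFLAKE_ENV = {
    "setURL": "SNOWFLAKE_URL",
    "setUser": "SNOWFLAKE_USER",
    "setPassword": "SNOWFLAKE_PASSWORD",
    "setDatabase": "SNOWFLAKE_DATABASE",
    "setSFSchema": "SNOWFLAKE_SCHEMA",
    "setWarehouse": "SNOWFLAKE_WAREHOUSE",
    "setDbTable": "SNOWFLAKE_TABLE",
}


def configure_snowflake_pipe(pipe, env=None):
    """Call each SnowflakePipe setter with the value of its environment variable."""
    env = os.environ if env is None else env
    missing = [var for var in SNOWFLAKE_ENV.values() if var not in env]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    for setter, var in SNOWFLAKE_ENV.items():
        getattr(pipe, setter)(env[var])
    return pipe


# Intended use with Zingg installed (hypothetical pipe name):
# from zingg.pipes import SnowflakePipe
# pipe = configure_snowflake_pipe(SnowflakePipe("customersSF"))
```

Failing early on missing variables avoids constructing a half-configured pipe.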