Pipeline Module Documentation

config_utilities

Config module to manage global variables.

This module establishes default values and argument parsers for commonly used variables.

Contains module functions:

add_run_config_args(parser)
add_file_config_args(parser)
add_mysql_config_args(parser)
add_redis_config_args(parser)
add_config_args(parser)
config_args()
pretty_name(orig_name, endlen=63)
Default values for different configuration options
config_utilities.add_config_args(parser)[source]

Add global configuration options to command line arguments.

If global arguments are not specified, supplies their default values.

Parameters:parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns:parser with appended global options
Return type:argparse.ArgumentParser
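
A minimal usage sketch (assuming config_utilities is importable and the flags listed in the tables below; the override value is illustrative):

import argparse
import config_utilities as cf

parser = argparse.ArgumentParser(description='my pipeline step')
parser = cf.add_config_args(parser)                # appends the global options
args = parser.parse_args(['-wd', '/mnt/knownet'])  # override --working_dir
print(args.working_dir)
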
config_utilities.add_file_config_args(parser)[source]

Add global configuration options to command line arguments.

If global arguments are not specified, supplies their default values.
parameter      type  flag  description
--working_dir  str   -wd   absolute path to toplevel working directory
--code_path    str   -cp   absolute path of code directory
--storage_dir  str   -sd   absolute path to toplevel shared storage directory
--data_path    str   -dp   relative path of data directory from toplevel
--logs_path    str   -lp   relative path of logs directory from toplevel
--export_path  str   -ep   relative path of export directory from toplevel
--src_path     str   -sp   relative path of srcClass directory from code_path

Parameters:parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns:parser with appended global options
Return type:argparse.ArgumentParser
config_utilities.add_mysql_config_args(parser)[source]

Add global configuration options to command line arguments.

If global arguments are not specified, supplies their default values.
parameter     type  flag   description
--mysql_host  str   -myh   address of mySQL db
--mysql_port  str   -myp   port for mySQL db
--mysql_dir   str   -myd   absolute directory for MySQL db files
--mysql_mem   str   -mym   memory for deploying MySQL container
--mysql_cpu   str   -myc   cpus for deploying MySQL container
--mysql_conf  str   -mycf  relative config dir for deploying MySQL
--mysql_user  str   -myu   user for mySQL db
--mysql_pass  str   -myps  password for mySQL db

Parameters:parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns:parser with appended global options
Return type:argparse.ArgumentParser
config_utilities.add_redis_config_args(parser)[source]

Add global configuration options to command line arguments.

If global arguments are not specified, supplies their default values.
parameter     type  flag  description
--redis_host  str   -rh   address of Redis db
--redis_port  str   -rp   port for Redis db
--redis_dir   str   -rd   absolute directory for Redis db files
--redis_mem   str   -rm   memory for deploying Redis container
--redis_cpu   str   -rc   cpus for deploying Redis container
--redis_pass  str   -rps  password for Redis db

Parameters:parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns:parser with appended global options
Return type:argparse.ArgumentParser
config_utilities.add_run_config_args(parser)[source]

Add global configuration options to command line arguments.

If global arguments are not specified, supplies their default values.
parameter      type  flag   description
--chronos      str   -c     url of chronos scheduler or LOCAL or DOCKER
--marathon     str   -m     url of marathon scheduler
--build_image  str   -i     docker image name to use for kn_build pipeline
--ens_species  str   -es    ‘,,’ separated ensembl species to run in setup pipeline
--src_classes  str   -srcs  ‘,,’ separated source keywords to run in parse pipeline
--force_fetch  bool  -ff    fetch even if file exists and is unchanged from last run
--test_mode    bool  -tm    run in test mode by only printing commands

Parameters:parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns:parser with appended global options
Return type:argparse.ArgumentParser
config_utilities.config_args()[source]

Create a default parser with option defaults

Returns:args as populated namespace
Return type:Namespace
config_utilities.pretty_name(orig_name, endlen=63)[source]

Shortens name strings and removes problematic characters

Parameters:
  • orig_name (str) – name string before conversion
  • endlen (int) – max length of final pretty string
Returns:

string after formatting changes

Return type:

str
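
A hedged example of calling pretty_name (the exact character substitutions are internal to config_utilities; only the length bound comes from the signature above):

from config_utilities import pretty_name

short = pretty_name('dip.PPI.raw_line.1.txt')
assert len(short) <= 63  # endlen defaults to 63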

check_utilities

Module for checking if a source needs to be updated in the Knowledge Network (KN).

Contains the class SrcClass which serves as the base class for each supported source in the KN.

Contains module functions:

get_SrcClass(args)
compare_versions(SrcClass)
check(module, args=None)
main_parse_args()

Examples

To run check on a single source (e.g. dip):

$ python3 code/check_utilities.py dip

To view all optional arguments that can be specified:

$ python3 code/check_utilities.py -h
class check_utilities.SrcClass(src_name, base_url, aliases, args=None)[source]

Base class to be extended by each supported source in KnowEnG.

This SrcClass provides default functions that should be extended or overridden by any source which is added to the Knowledge Network (KN).

name

str – The name of the remote source to be included in the KN.

url_base

str – The base url of the remote source, which may need additional processing to provide an actual download link (see get_remote_url).

aliases

dict – A dictionary with subsets of the source which will be included in the KN as the keys (e.g. different species, data types, or interaction types), and a short string with information about the alias as the value.

remote_file

str – The name of the file to extract if the remote source is a directory

version

dict – The release version of each alias in the source.

source_url

str – The website for the source.

reference

str – The citation for the source.

pmid

str – The pubmed ID for the source.

license

str – The license for the source.

create_mapping_dict(filename, key_col=3, value_col=4)[source]

Return a mapping dictionary for the provided file.

This returns a dictionary for use in mapping nodes or edge types from the file specified by filename. By default it opens the file specified by filename and creates a dictionary using the key_col column as the key and the value_col column as the value.

Parameters:
  • filename (str) – The name of the file containing the information needed to produce the mapping dictionary.
  • key_col (int) – The column containing the key for creating the dictionary. By default this is column 3.
  • value_col (int) – The column containing the value for creating the dictionary. By default this is column 4.
Returns:

A dictionary for use in mapping nodes or edge types.

Return type:

dict
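
As a sketch, the documented default behavior is roughly equivalent to the following (tab-separated columns are an assumption based on the pipeline’s other file formats):

def create_mapping_dict(filename, key_col=3, value_col=4):
    """Map column key_col to column value_col for each line of filename."""
    mapping = dict()
    with open(filename) as infile:
        for line in infile:
            columns = line.rstrip('\n').split('\t')
            mapping[columns[key_col]] = columns[value_col]
    return mapping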

get_aliases(args=<default Namespace from config_utilities>)[source]

Helper function for producing the alias dictionary.

This returns a dictionary where alias names are keys and alias info are the values. This helper function uses the species specific information for the build of the Knowledge Network, which is produced by ensembl.py during setup utilities and is located at cf.DEFAULT_MAP_PATH/species/species.json, in order to fetch all matching species specific aliases from the source.

Parameters:args (Namespace) – args as populated namespace or ‘None’ for defaults
Returns:A dictionary of species:(taxid, division) values
Return type:dict
get_dependencies(alias)[source]

Return a list of other aliases that the provided alias depends on.

This returns a list of other aliases that must be processed before full processing of the provided alias can be completed. By default, returns a list of all aliases which are considered mapping files (see is_map).

Parameters:alias (str) – An alias defined in self.aliases.
Returns:
The other aliases defined in self.aliases that the provided
alias depends on.
Return type:list
get_local_file_info(alias)[source]

Return a dictionary with the local file information for the alias.

This returns the local file information for a given source alias, which will always contain the following keys:

'local_file_name' (str):        name of the file locally
'local_file_exists' (bool):     boolean if file exists at path
                                indicated by 'local_file_name'

and will also contain the following if ‘local_file_exists’ is True:

'local_size' (int):     size of local file in bytes
'local_date' (float):   time of last modification time of local
                        file in seconds since the epoch
Parameters:alias (str) – An alias defined in self.aliases.
Returns:The local file information for a given source alias.
Return type:dict
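
The documented return shape can be sketched with os.path (deriving local_file_name from the alias is omitted here):

import os

def local_file_info(local_file_name):
    info = {'local_file_name': local_file_name,
            'local_file_exists': os.path.isfile(local_file_name)}
    if info['local_file_exists']:
        info['local_size'] = os.path.getsize(local_file_name)   # bytes
        info['local_date'] = os.path.getmtime(local_file_name)  # seconds since the epoch
    return info
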
get_local_version_info(alias, args)[source]

Return a dictionary with the local information for the alias.

This returns the local information for a given source alias, as retrieved from the mysql database and formatted as a dictionary object (see mysql_utilities.get_file_meta). It adds local_file_name and local_file_exists to the fields retrieved from the database, which are the name of the file locally and a boolean indicating whether it already exists on disk, respectively.

Parameters:alias (str) – An alias defined in self.aliases.
Returns:The local file information for a given source alias.
Return type:dict
get_remote_file_modified(alias)[source]

Return the remote file date modified.

This returns the remote file date modified as specified by the ‘last-modified’ page header.

Parameters:alias (str) – An alias defined in self.aliases.
Returns:
time of last modification time of remote file in seconds
since the epoch
Return type:float
get_remote_file_size(alias)[source]

Return the remote file size.

This returns the remote file size as specified by the ‘content-length’ page header. If the remote file size is unknown, this value should be -1.

Parameters:alias (str) – An alias defined in self.aliases.
Returns:The remote file size in bytes.
Return type:int
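
Both get_remote_file_modified and get_remote_file_size read standard HTTP headers; a minimal sketch with urllib (an assumption; the module defines its own opener, AppURLopener, in fetch_utilities):

import urllib.request
from email.utils import parsedate_to_datetime

def remote_headers(remote_url):
    """Return (modified time in epoch seconds or None, size in bytes or -1)."""
    with urllib.request.urlopen(remote_url) as resp:
        modified = resp.headers.get('last-modified')
        size = int(resp.headers.get('content-length', -1))
    mtime = parsedate_to_datetime(modified).timestamp() if modified else None
    return mtime, size
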
get_remote_url(alias)[source]

Return the remote url needed to fetch the file corresponding to the alias.

This returns the url needed to fetch the file corresponding to the alias. By default this returns self.base_url.

Parameters:alias (str) – An alias defined in self.aliases.
Returns:The url needed to fetch the file corresponding to the alias.
Return type:str
get_source_version(alias)[source]

Return the release version of the remote source:alias.

This returns the release version of the remote source for a specific alias. This value will be the same for every alias unless the alias can have a different release version than the source (this will be source dependent). This value is stored in the self.version dictionary object. If the value does not already exist, all alias versions are initialized to ‘unknown’.

Parameters:alias (str) – An alias defined in self.aliases.
Returns:The remote version of the source.
Return type:str
is_map(alias)[source]

Return a boolean representing if the provided alias is used for source specific mapping of nodes or edges.

This returns a boolean representing if the alias corresponds to a file used for mapping. By default this returns True if the alias ends in ‘_map’ and False otherwise.

Parameters:alias (str) – An alias defined in self.aliases.
Returns:Whether or not the alias is used for mapping.
Return type:bool
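
The documented default reduces to a one-line check (a sketch; sources override this when their mapping aliases are named differently):

def is_map(alias):
    # True only when the alias name ends in '_map'
    return alias.endswith('_map')
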
table(raw_line, version_dict)[source]

Uses the provided raw_lines file to produce a table file, an edge_meta file, and a node_meta file (only for property nodes).

This returns nothing but produces the table formatted files from the provided raw_lines file:

raw_lines (file, line num, line_chksum, raw_line)
table (line_cksum, n1name, n1hint, n1type, n1spec,
                n2name, n2hint, n2type, n2spec, et_hint, score)
edge_meta (line_cksum, info_type, info_desc)
node_meta (node_id,
           info_type (alt_alias, relationship, experiment, or link),
           info_desc (text))

By default this function does nothing (must be overridden)

Parameters:
  • raw_line (str) – The path to the raw_lines file
  • version_dict (dict) – A dictionary describing the attributes of the alias for a source.
check_utilities.check(module, args=None)[source]

Runs compare_versions(SrcClass) on a ‘module’ object

This runs the compare_versions function on a ‘module’ object to find the version information of the source and determine if a fetch is needed. The version information is also printed.

Parameters:
  • module (str) – string name of module defining source specific class
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
Returns:

A nested dictionary describing the version information for each alias described in source.

Return type:

dict

check_utilities.compare_versions(src_obj, args=None)[source]

Return a dictionary with the version information for each alias in the source and write a dictionary for each alias to file.

This returns a nested dictionary describing the version information of each alias in the source. The version information is also printed.

Parameters:
  • src_obj (SrcClass) – A SrcClass object for which the comparison should be performed.
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
Returns:

A nested dictionary describing the version information for each alias described in src_obj. For each alias the following keys are defined:

'source' (str):                 The source name,
'alias' (str):                  The alias name,
'alias_info' (str):             A short string with information
                                about the alias,
'is_map' (bool):                See is_map,
'dependencies' (list):          See get_dependencies,
'remote_url' (str):             See get_remote_url,
'remote_date' (float):          See get_remote_file_modified,
'remote_version' (str):         See get_source_version,
'remote_file' (str):            File to extract if remote file
                                location is a directory,
'remote_size' (int):            See get_remote_file_size,
'local_file_name' (str):        See get_local_version_info,
'file_exists' (bool):           See get_local_version_info,
'fetch_needed' (bool):          True if file needs to be downloaded
                                from remote source. A fetch will
                                be needed if the local file does
                                not exist, or if the local and
                                remote files have different date
                                modified or file sizes.

Return type:

dict

check_utilities.get_SrcClass(args, *posargs, **kwargs)[source]

Returns an object of the source class.

This returns an object of the source class to allow access to its functions if the module is imported.

Parameters:args (Namespace) – args as populated namespace or ‘None’ for defaults
Returns:a source class object
Return type:SrcClass
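
A sketch of the programmatic use described above (‘dip’ is taken from the Examples section; config_args() supplies the defaults):

import config_utilities as cf
import check_utilities

args = cf.config_args()                            # default global options
version_info = check_utilities.check('dip', args)  # runs compare_versions and prints it
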
check_utilities.main_parse_args()[source]

Processes command line arguments.

Expects one positional argument (module) and a number of optional arguments. If arguments are missing, supplies default values.

Returns:args as populated namespace
Return type:Namespace

fetch_utilities

Utilities for fetching and formatting a source for the Knowledge Network (KN) that has been updated.

Contains module functions:

download(version_dict)
chunk(filename, total_lines)
format_raw_line(filename)
get_md5_hash(filename)
get_line_count(filename)
main_parse_args()
main(version_json, args=None)
fetch_utilities.ARCHIVES

list – list of supported archive formats.

fetch_utilities.DIR

str – the relative path to data/source/alias/ from location of script execution

fetch_utilities.MAX_CHUNKS

int – maximum number of chunks to split file into

Examples

To run fetch on a single source (e.g. dip) after check complete:

$ cd data/dip/PPI
$ python3 ../../../code/fetch_utilities.py file_metadata.json

To view all optional arguments that can be specified:

$ python3 code/fetch_utilities.py -h
class fetch_utilities.AppURLopener(*args, **kwargs)[source]

URLopener to open with a custom user-agent.

fetch_utilities.chunk(filename, total_lines, chunksize=500000)[source]

Splits the provided file into ceiling(total_lines/chunksize) chunks of at most chunksize lines each.

This takes the path to a file and reads through the file, splitting it into equal chunks of at most chunksize lines. It then returns the number of chunks and sets up the raw_lines table in the format: (file, line num, line_chksum, raw_line)

Parameters:
  • filename (str) – the file to split into chunks
  • total_lines (int) – the number of lines in the file at filename
  • chunksize (int) – max size of a single chunk. Defaults to 500000.
Returns:

the number of chunks filename was split into

Return type:

int
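
An illustrative sketch of the chunking arithmetic (the output naming is hypothetical, and the real function also populates the raw_lines table):

import math

def chunk(filename, total_lines, chunksize=500000):
    num_chunks = math.ceil(total_lines / chunksize)
    with open(filename) as infile:
        for i in range(1, num_chunks + 1):
            outname = '{0}.chunk.{1}.txt'.format(filename, i)  # hypothetical name
            with open(outname, 'w') as outfile:
                for _ in range(chunksize):
                    line = infile.readline()
                    if not line:
                        break
                    outfile.write(line)
    return num_chunks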

fetch_utilities.download(version_dict)[source]

Returns the standardized path to the local file after downloading it from the source and unarchiving if needed.

This returns the standardized path (path/source.alias.txt) for the source alias described in version_dict. If a download is needed (as determined by the check step), the remote file will be downloaded.

Parameters:version_dict (dict) – A dictionary describing the attributes of the alias for a source.
Returns:The relative path to the newly downloaded file.
Return type:str
fetch_utilities.format_raw_line(filename)[source]

Creates the raw_line table from the provided file and returns the path to the output file.

This takes the path to a file and reads through the file, adding three tab separated columns to the beginning, saving to disk, and then returning the output file path. Output looks like: raw_lines table (line_hash, line_num, file_id, line_str)

Parameters:filename (str) – the file to convert to raw_line table format
Returns:the path to the output file
Return type:str
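
A sketch of the documented transformation (the output file name and the file_id derivation are assumptions):

import hashlib
import os

def format_raw_line(filename):
    outfile = filename + '.raw_line'       # hypothetical output name
    file_id = os.path.basename(filename)   # hypothetical file id
    with open(filename) as src, open(outfile, 'w') as dst:
        for line_num, line in enumerate(src, 1):
            line_str = line.rstrip('\n')
            line_hash = hashlib.md5(line_str.encode()).hexdigest()
            dst.write('\t'.join([line_hash, str(line_num), file_id, line_str]) + '\n')
    return outfile
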
fetch_utilities.get_line_count(filename)[source]

Returns the number of lines in the file at filename.

This takes the path to a file and reads through the file line by line, producing a count of the number of lines.

Parameters:filename (str) – the file to count the lines of
Returns:the number of lines in the file at filename
Return type:int
fetch_utilities.get_md5_hash(filename)[source]

Returns the md5 hash of the file at filename.

This takes the path to a file and reads through the file line by line, producing both the md5 hash and a count of the number of lines.

Parameters:filename (str) – the file to hash and count
Returns:the md5 hash of the file at filename and the number of lines in the file at filename
Return type:(str, int)
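
The single pass described above can be sketched with hashlib:

import hashlib

def md5_and_line_count(filename):
    md5 = hashlib.md5()
    line_count = 0
    with open(filename, 'rb') as infile:
        for line in infile:
            md5.update(line)
            line_count += 1
    return md5.hexdigest(), line_count
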
fetch_utilities.main(version_json, args=None)[source]

Fetches and chunks the source:alias described by version_json.

This takes the path to a version_json (source.alias.json) and runs fetch (see fetch). If the source is ensembl, it runs the ensembl specific fetch (see ensembl.fetch). If the alias is a data file, it then runs raw_line (see raw_line) and then runs chunk (see chunk) on the output. If the alias is a mapping file, it runs create_mapping_dict (see create_mapping_dict in SRC.py). It also updates version_json to include the total lines in and md5 checksum of the fetched file. It then saves the updated version_json to file.

Parameters:
  • version_json (str) – path to a json file describing the source:alias
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
fetch_utilities.main_parse_args()[source]

Processes command line arguments.

Expects one positional argument (metadata_json) and a number of optional arguments. If arguments are missing, supplies default values.

Returns:args as populated namespace
Return type:Namespace

table_utilities

Utilities for tabling a source for the Knowledge Network (KN) that has been fetched and chunked.

Contains module functions:

csu(infile, outfile, columns=list())
main_parse_args()
main(chunkfile, version_json, args=None)

Examples

To run table on a single source (e.g. dip) after fetch complete:

$ cd data/dip/PPI
$ python3 ../../../code/table_utilities.py chunks/dip.PPI.raw_line.1.txt file_metadata.json

To view all optional arguments that can be specified:

$ python3 code/table_utilities.py -h
table_utilities.csu(infile, outfile, columns=None)[source]

Performs a cut | sort | uniq on infile using the provided columns and stores it into outfile.

Takes a file in tsv format and sorts by the provided columns using the unix sort command and then removes duplicate elements.

Parameters:
  • infile (str) – the file to sort
  • outfile (str) – the file to save the result into
  • columns (list) – the columns to use in cut or an empty list if all columns should be used
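
A sketch of the cut | sort | uniq pipeline (the real implementation’s exact sort flags are not documented here):

import subprocess

def csu(infile, outfile, columns=None):
    cmd = 'sort -u'
    if columns:  # e.g. columns=[1, 2, 5] keeps only those tab-separated fields
        cmd = 'cut -f{0} | {1}'.format(','.join(str(c) for c in columns), cmd)
    with open(infile) as src, open(outfile, 'w') as dst:
        subprocess.check_call(cmd, shell=True, stdin=src, stdout=dst)
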
table_utilities.main(chunkfile, version_json, args=None)[source]

Tables the source:alias described by version_json.

This takes the path to a chunked (see fetch_utilities.chunk) raw_line file and its corresponding version_json (source.alias.json) and runs the source specific table command (see SrcClass.table) if the alias is a data file. If it is a mapping file, it does nothing:

raw_line (line_hash, line_num, file_id, raw_line)
table_file (line_hash, n1name, n1hint, n1type, n1spec,
            n2name, n2hint, n2type, n2spec, et_hint, score, table_hash)
edge_meta (line_hash, info_type, info_desc)
node_meta (node_id,
           info_type (evidence, relationship, experiment, or link),
           info_desc (text))
node (node_id, n_alias, n_type)

Parameters:
  • chunkfile (str) – path to a chunk file in raw_line format
  • version_json (str) – path to a json file describing the source:alias
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
table_utilities.main_parse_args()[source]

Processes command line arguments.

Expects two positional arguments (chunkfile, metadata_json) and a number of optional arguments. If arguments are missing, supplies default values.

Returns:args as populated namespace
Return type:Namespace

conv_utilities

Utilities for mapping the gene identifiers in an edge file.

Contains module functions:

map_list(namefile, args=None)
main_parse_args()
main(tablefile, args=None)
conv_utilities.DEFAULT_HINT

str – the default mapping hint for converting identifiers

conv_utilities.DEFAULT_TAXON

int – the default taxon id to use for converting identifiers

Examples

To run conv on a single source (e.g. dip) after table complete:

$ python3 code/conv_utilities.py data/dip/PPI/chunks/dip.PPI.edge.1.txt

To run conv on a file of gene names:

$ python3 code/conv_utilities.py -mo LIST list_of_gene_names.txt

To view all optional arguments that can be specified:

$ python3 code/conv_utilities.py -h
conv_utilities.main(tablefile, args=None)[source]

Maps the nodes for the source:alias tablefile.

This takes the path to a tablefile (see table_utilities.main) and maps the nodes in it using the Redis DB. It then outputs a status file in the format (table_hash, n1, n2, edge_type, weight, edge_hash, line_hash, status, status_desc), where status is production if both nodes mapped and unmapped otherwise. It also outputs an edge file containing all rows where status is production, in the format (edge_hash, n1, n2, edge_type, weight), and an edge2line file in the format (edge_hash, line_hash).

Parameters:
  • tablefile (str) – path to a tablefile to be mapped
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
conv_utilities.main_parse_args()[source]

Processes command line arguments.

Expects one positional argument (infile) and a number of optional arguments. If arguments are missing, supplies default values.

Returns:args as populated namespace
Return type:Namespace
conv_utilities.map_list(namefile, args=None)[source]

Maps the nodes for the provided namefile.

This takes the path to a namefile and maps the nodes in it using the Redis DB. It then outputs a mapped file in the format (mapped, original).

Parameters:
  • namefile (str) – path to a namefile to be mapped
  • args (Namespace) – args as populated namespace or ‘None’ for defaults

import_utilities

Utilities for importing edge, edge_meta, and node_meta into the KnowEnG MySQL database.

Contains module functions:

import_file(file_name, table, ld_cmd='', dup_cmd='', args=None)
import_filemeta(version_dict, args=None)
update_filemeta(version_dict, args=None)
import_edge(edgefile, args=None)
import_nodemeta(nmfile, args=None)
import_pnode(filename, args=None)
import_utilities.enable_keys(args=None)[source]

Re-enables the keys for the KnowEnG MySQL database.

Turns autocommit, unique_checks, and foreign_key_checks back on after a keyless import (see import_file_nokeys and mysql_utilities.MySQL.enable_keys).

Parameters:args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.import_edge(edgefile, args=None)[source]

Imports the provided edge file and any corresponding meta files into the KnowEnG MySQL database.

Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it updates to the maximum edge score if it is an edge file, and ignores if it is metadata.

Parameters:
  • edgefile (str) – path to the file to be imported
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.import_file(file_name, table, ld_cmd='', dup_cmd='', args=None)[source]

Imports the provided file into the KnowEnG MySQL database.

Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it uses the provided behavior to handle. If no behavior is provided, it replaces into the table.

Parameters:
  • file_name (str) – path to the file to be imported
  • table (str) – name of the permanent table to import to
  • ld_cmd (str) – optional additional command for loading data
  • dup_cmd (str) – command for handling duplicates
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.import_file_nokeys(file_name, table, ld_cmd='', args=None)[source]

Imports the provided file into the KnowEnG MySQL database using optimal settings.

Starts a transaction and changes some MySQL settings for optimization, which disables the keys. It then loads the data into the provided table in MySQL. Note that the keys are not re-enabled after import. To do this call enable_keys(args).

Parameters:
  • file_name (str) – path to the file to be imported
  • table (str) – name of the permanent table to import to
  • ld_cmd (str) – optional additional command for loading data
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.import_filemeta(version_dict, args=None)[source]

Imports the provided version_dict into the KnowEnG MySQL database.

Loads the data from a version dictionary into the raw_file table.

Parameters:
  • version_dict (dict) – version dictionary describing a downloaded file
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.import_nodemeta(nmfile, args=None)[source]

Imports the provided node_meta file and any corresponding meta files into the KnowEnG MySQL database.

Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it updates to the maximum edge score if it is an edge file, and ignores if it is metadata.

Parameters:
  • nmfile (str) – path to the file to be imported
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.import_pnode(filename, args=None)[source]

Imports the provided property node file into the KnowEnG MySQL database.

Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it updates to the maximum edge score if it is an edge file, and ignores if it is metadata.

Parameters:
  • filename (str) – path to the file to be imported
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.import_production_edges(args=None)[source]

Query production edges from status table into the edge table.

Queries the KnowNet status table and copies all distinct production edges to the edge table. If a duplication occurs during the query, it updates to the maximum edge score and keeps the edge hash for that edge.

Parameters:args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.import_status(statusfile, args=None)[source]

Imports the provided status file and any corresponding meta files into the KnowEnG MySQL database.

Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it updates to the maximum edge score if it is an edge file, and ignores if it is metadata.

Parameters:
  • statusfile (str) – path to the file to be imported
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.main()[source]

Imports according to the given arguments.

import_utilities.main_parse_args()[source]

Processes command line arguments.

Expects one positional argument (status_file) and a number of optional arguments. If arguments are missing, supplies default values.

Returns:args as populated namespace
Return type:Namespace
import_utilities.merge(merge_key, args)[source]

Uses sort to merge and unique the already sorted files of the table type and stores the results into outfile.

This takes a table type (one of: node, node_meta, edge2line, status, or edge_meta) and merges them using the unix sort command while removing any duplicate elements.

Parameters:
  • merge_key (str) – table type (one of: node, node_meta, edge2line, status, edge, or edge_meta)
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
import_utilities.merge_logs(args)[source]

Merge all log files into a single file that contains all the information about the run.

import_utilities.update_filemeta(version_dict, args=None)[source]

Updates the provided filemeta in the KnowEnG MySQL database.

Updates the data from a version dictionary in the raw_file table.

Parameters:
  • version_dict (dict) – version dictionary describing a downloaded file
  • args (Namespace) – args as populated namespace or ‘None’ for defaults

export_utilities

export_utilities.convert_nodes(args, nodes)[source]

Uses redis_utilities to convert a set of nodes.

export_utilities.figure_out_class(db, et)[source]

Determines the class and bidirectionality of the edge_type.

export_utilities.get_gg(db, et, taxon)[source]

Get gene-gene nodes.

export_utilities.get_metadata(db, edges, nodes, lines, sp, et, args)[source]

Retrieves the metadata for a subnetwork.

export_utilities.get_pg(db, et, taxon)[source]

Get property-gene nodes.

export_utilities.get_sources(edges)[source]

Given a list of edges, determines the set of sources included.

export_utilities.main()[source]

Parses arguments and then exports the specified subnetworks.

export_utilities.norm_edges(edges, args)[source]

Normalizes and cleans edges according to the specified arguments.

export_utilities.num_connected_components(edges, nodes)[source]

Count the number of connected components in a graph given the edges and the nodes.

export_utilities.should_skip(cls, res)[source]

Determine if the subnetwork is especially small, and if we should skip it.

mysql_utilities

Utilities for interacting with the KnowEnG MySQL db through python.

Contains the class MySQL, which provides functionality for interacting with the MySQL database.

Contains module functions:

combine_tables(alias, args=None)
create_dictionary(results)
import_nodes(version_dict, args=None)
query_all_mappings(version_dict, args=None)
create_mapping_dicts(version_dict, args=None)
get_database(db=None, args=None)
get_insert_cmd(step)
import_ensembl(alias, args=None)
class mysql_utilities.MySQL(database=None, args=None)[source]

Class providing functionality for interacting with the MySQL database.

This class serves as a wrapper for interacting with the KnowEnG MySQL database.

host

str – the MySQL db hostname

user

str – the MySQL db username

port

str – the MySQL db port

passw

str – the MySQL db password

database

str – the MySQL database to connect to

conn

object – connection object for the database

cursor

object – cursor object for the database

close()[source]

Close connection to the MySQL server.

This commits any changes remaining and closes the connection to the MySQL server.

copy_table(old_database, old_table, new_database, new_table)[source]

Copy a table in the MySQL database

Copies the provided table from old_database to new_database.

Parameters:
  • old_database (str) – name of the database to move from
  • old_table (str) – name of the table to move from
  • new_database (str) – name of the database to move to
  • new_table (str) – name of the table to move to
create_db(database)[source]

Add a database to the MySQL server

Adds the provided database to the MySQL server.

Parameters:database (str) – name of the database to add to the MySQL server
create_table(tablename, cmd='')[source]

Add a table to the MySQL database.

Adds the provided tablename to the MySQL database. If cmd is specified, it will create the table using the provided cmd.

Parameters:
  • tablename (str) – name of the table to add to the MySQL database
  • cmd (str) – optional string to overwrite default create table
create_temp_table(tablename, cmd='')[source]

Add a table to the MySQL database.

Adds the provided tablename to the MySQL database. If cmd is specified, it will create the table using the provided cmd.

Parameters:
  • tablename (str) – name of the table to add to the MySQL database
  • cmd (str) – optional additional command
disable_keys()[source]

Disables keys for faster operations.

Turns off autocommit, unique_checks, and foreign_key_checks for the MySQL database.

drop_db(database)[source]

Remove a database from the MySQL server

Drops the provided database from the MySQL server.

Parameters:database (str) – name of the database to remove from the MySQL server
drop_table(tablename)[source]

Remove a table from the MySQL database

Drops the provided tablename from the MySQL database.

Parameters:tablename (str) – name of the table to remove from the MySQL database
drop_temp_table(tablename)[source]

Remove a temporary table from the MySQL database

Drops the provided tablename from the MySQL database.

Parameters:tablename (str) – name of the table to remove from the MySQL database
dump_table(table, file)[source]

Dump the data for the table in the provided file(name).

enable_keys()[source]

Enables keys for safer operations.

Turns on autocommit, unique_checks, and foreign_key_checks for the MySQL database.

import_schema(database, sqlfile)[source]

Import the schema for the provided database from sqlfile.

Removes the provided database if it exists, creates a new one, and imports the schema as defined in the provided sqlfile.

Parameters:
  • database (str) – name of the database to add to the MySQL server
  • sqlfile (str) – name of the sql file specifying the format for the database
import_table(database, tablefile, import_flags='--delete')[source]

Import the data for the table in the provided database described by tablefile.

Imports the data as defined in the provided tablefile.

Parameters:
  • database (str) – name of the database to add to the MySQL server
  • tablefile (str) – name of the txt file specifying the data for the table
  • import_flags (str) – additional flags to pass to mysqlimport
init_knownet()[source]

Inits the Knowledge Network MySQL DB.

Creates the KnowNet database and all of its tables if they do not already exist. Also imports the edge_type, node_type, and species files, but ignores any lines that have the same unique key as those already in the tables.

insert(tablename, cmd)[source]

Insert into tablename using cmd.

Parameters:
  • tablename (str) – name of the table to add to the MySQL database
  • cmd (str) – a valid SQL command to use for inserting into tablename
insert_ignore(tablename, cmd='')[source]

Insert ignore into tablename using cmd.

Parameters:
  • tablename (str) – name of the table to add to the MySQL database
  • cmd (str) – a valid SQL command to use for inserting into tablename
load_data(filename, tablename, cmd='', sep='\\t', enc='"')[source]

Import data into table in the MySQL database.

Loads the data located on the local machine into the provided MySQL table. Uses the LOAD DATA LOCAL INFILE command.

Parameters:
  • filename (str) – name of the file to import from
  • tablename (str) – name of the table to import into
  • sep (str) – separator for fields in file
  • enc (str) – enclosing character for fields in file
  • cmd (str) – optional additional command
move_table(old_database, old_table, new_database, new_table)[source]

Move a table in the MySQL database

Moves the provided table from old_database to new_database.

Parameters:
  • old_database (str) – name of the database to move from
  • old_table (str) – name of the table to move from
  • new_database (str) – name of the database to move to
  • new_table (str) – name of the table to move to
query_distinct(query, table, cmd='')[source]

Run the provided DISTINCT query in MySQL.

This runs the provided distinct query from the provided table with the optional extra cmd using the current MySQL connection and cursor. It then returns the fetched results.

Parameters:
  • query (str) – the SQL query to run on the MySQL server
  • table (str) – the table to query from
  • cmd (str) – the additional SQL command to run on the MySQL server (optional)
Returns:

the fetched results

Return type:

list

replace(tablename, cmd)[source]

Replace into tablename using cmd.

Parameters:
  • tablename (str) – name of the table to add to the MySQL database
  • cmd (str) – a valid SQL command to use for inserting into tablename
replace_safe(tablename, cmd, values)[source]

Replace into tablename using cmd, safely substituting values.

Parameters:
  • tablename (str) – name of the table to add to the MySQL database
  • cmd (str) – a valid SQL command with placeholders for inserting into tablename
  • values (list) – the values to substitute into the placeholders of cmd
run(cmd)[source]

Run the provided command in MySQL.

This runs the provided command using the current MySQL connection and cursor.

Parameters:cmd (str) – the SQL command to run on the MySQL server
Returns:the fetched results
Return type:list
set_isolation(duration='', level='REPEATABLE READ')[source]

Sets the transaction isolation level.

Modify the transaction isolation level to modulate lock status behavior. Default InnoDB is repeatable read. For other levels check online at https://dev.mysql.com/doc/refman/5.7/en/set-transaction.html

Parameters:
  • duration (str) – time for isolation level to be used. Can be empty, GLOBAL, or SESSION
  • level (str) – isolation level. In order of locking level: SERIALIZABLE, REPEATABLE READ, READ COMMITTED, READ UNCOMMITTED
start_transaction(level='REPEATABLE READ')[source]

Starts a mysql transaction with the provided isolation level

Uses the provided isolation level to start a MySQL transaction using the current connection. Transaction persists until the next commit.

Parameters:level (str) – isolation level. In order of locking level: SERIALIZABLE, REPEATABLE READ, READ COMMITTED, READ UNCOMMITTED
use_db(database)[source]

Use a database from the MySQL server

Use the provided database from the MySQL server.

Parameters:database (str) – name of the database to use from the MySQL server
mysql_utilities.combine_tables(alias, args=None)[source]

Combine all of the data imported from ensembl for the provided alias into a single database.

This combines the imported tables into a single table knownet_mappings with information from genes, transcripts, and translations. It then merges this table into the KnowNet database for use in gene identifier mapping.

Parameters:
  • alias (str) – An alias defined in ensembl.aliases.
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
mysql_utilities.create_KnowNet(args=None)[source]

Returns an object of the MySQL class with KnowNet db.

This returns an object of the MySQL class to allow access to its functions if the module is imported.

Parameters:args (Namespace) – args as populated namespace or ‘None’ for defaults
Returns:a MySQL object connected to the KnowNet database
Return type:MySQL

mysql_utilities.create_dictionary(results)[source]

Creates a dictionary from a MySQL fetched results.

This returns a dictionary from the MySQL results after a query from the DB. It assumes there are two columns in the results and reads through all of the results, making them into a dictionary.

Parameters:results (list) – a list of the results returned from a MySQL query
Returns:dictionary with first column as key and second as values
Return type:dict
mysql_utilities.create_mapping_dicts(version_dict, args=None)[source]

Creates the mapping dictionaries for the provided alias.

Produces the ensembl stable mappings dictionary and the all unique mappings dictionary for the provided alias. It then saves them as json objects to file.

Parameters:
  • version_dict (dict) – the version dictionary describing the source:alias
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
mysql_utilities.deploy_container(args=None)[source]

Deploys a container with marathon running MySQL using the specified args.

This replaces the placeholder args in the json describing how to deploy a container running MySQL with those supplied in the user’s arguments.

Parameters:args (Namespace) – args as populated namespace or ‘None’ for defaults
mysql_utilities.get_database(db=None, args=None)[source]

Returns an object of the MySQL class.

This returns an object of the MySQL class to allow access to its functions if the module is imported.

Parameters:
  • db (str) – optional db to connect to
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
Returns:

a MySQL object for the provided database

Return type:

MySQL
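
A sketch of the import-and-query pattern the wrapper supports (the query is illustrative; run() returns the fetched results):

import config_utilities as cf
import mysql_utilities

args = cf.config_args()
db = mysql_utilities.get_database('KnowNet', args)
tables = db.run('SHOW TABLES')  # list of fetched rows
db.close()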

mysql_utilities.get_file_meta(file_id, args=None)[source]

Returns the metadata for the provided file_id if it exists.

This returns the metadata for the provided file_id (in the format of “source.alias”) present locally in the MySQL database from a previous run of the pipeline. It formats this output as a dictionary, which will always contain the following keys:

'file_id' (str):      "source.alias", the key used in the SQL raw_file table
'file_exists' (bool): whether the file with the above file_id exists in the
                      SQL raw_file table

and will additionally contain the following keys if file_exists is True:

'size' (int):         size of file in bytes
'date' (float):       time of last modification of the file in seconds since
                      the epoch
'version' (str):      the remote version of the source

Parameters:file_id (str) – The file_id for the raw_file in the format of “source.alias”
Returns:The file_meta information for a given source alias.
Return type:dict
mysql_utilities.get_insert_cmd(step)[source]

Returns the command to be used with an insert for the provided step.

This takes a predefined step to determine which type of insert is being performed during the production of the knownet_mappings combined tables. Based on this step, it returns a MySQL command to be used with an INSERT INTO statement.

Parameters:step (str) – the step indicating the step during the production of the combined knownet_mapping tables
Returns:the command to be used with an INSERT INTO statement at this step
Return type:str
mysql_utilities.import_ensembl(alias, args=None)[source]

Imports the ensembl data for the provided alias into the KnowEnG database.

This produces the local copy of the fetched ensembl database for alias. It drops the existing database, creates a new database, imports the relevant ensembl sql schema, and imports the table.

Parameters:
  • alias (str) – An alias defined in ensembl.aliases.
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
mysql_utilities.import_nodes(version_dict, args=None)[source]

Imports the gene nodes into the KnowNet nodes and node_species tables.

Queries the imported ensembl nodes and uses the stable ids as nodes for the KnowNet nodes table and uses the taxid to create the corresponding node_species table.

Parameters:
  • version_dict (dict) – the version dictionary describing the source:alias
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
mysql_utilities.main()[source]

Deploy a MySQL container using marathon with the provided command line arguments.

This uses the provided command line arguments and the defaults found in config_utilities to launch a MySQL docker container using marathon.

mysql_utilities.query_all_mappings(version_dict, args=None)[source]

Creates the all mappings dictionary for the provided alias.

Produces a dictionary of ensembl stable mappings and the all unique mappings for the provided alias. It then saves them as json objects to file.

Parameters:
  • version_dict (dict) – the version dictionary describing the source:alias
  • args (Namespace) – args as populated namespace or ‘None’ for defaults

redis_utilities

Utilities for interacting with the KnowEnG Redis db through python.

Contains module functions:

get_database(args=None)
import_ensembl(alias, args=None)
conv_gene(rdb, foreign_key, hint, taxid)
redis_utilities.conv_gene(rdb, fk_array, hint, taxid)[source]

Uses the redis database to convert a gene to ensembl stable id

This checks first if there is a unique name for the provided foreign key. If not, it uses the hint and taxid to try to filter the foreign key possibilities to find a matching stable id.

Parameters:
  • rdb (redis object) – redis connection to the mapping db
  • fk_array (list) – the foreign gene identifiers to be translated
  • hint (str) – a hint for conversion
  • taxid (str) – the species taxid, ‘unknown’ if unknown
Returns:

result of searching for gene in redis DB

Return type:

str
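
A hedged single-key sketch of the lookup order implied by import_ensembl below; the key formats 'unique:<fk>' and '<taxid>:<hint>:<fk>' are inferred from that description and may not match the real encoding:

def conv_gene_single(rdb, foreign_key, hint, taxid):
    stable = rdb.get('unique:' + foreign_key)  # assumed key format
    if stable is None or stable == b'unmapped:many':
        stable = rdb.get('{0}:{1}:{2}'.format(taxid, hint, foreign_key))
    # fallback label here is illustrative
    return stable.decode() if stable is not None else 'unmapped:none'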

redis_utilities.deploy_container(args=None)[source]

Deploys a container with marathon running Redis using the specified args.

This replaces the placeholder args in the json describing how to deploy a container running Redis with those supplied in the user’s arguments.

Parameters:args (Namespace) – args as populated namespace or ‘None’ for defaults
redis_utilities.get_database(args=None)[source]

Returns a Redis database connection.

This returns a Redis database connection to allow access to its functions if the module is imported.

Parameters:args (Namespace) – args as populated namespace or ‘None’ for defaults
Returns:a redis connection object
Return type:StrictRedis
redis_utilities.get_node_info(rdb, fk_array, ntype, hint, taxid)[source]

Uses the redis database to convert a node alias to KN internal id

Figures out the type of node for each id in fk_array and then returns all of the metadata associated or unmapped-*

Parameters:
  • rdb (redis object) – redis connection to the mapping db
  • fk_array (list) – the array of foreign gene identifiers to be translated
  • ntype (str) – ‘Gene’ or ‘Property’ or None
  • hint (str) – a hint for conversion
  • taxid (str) – the species taxid, None if unknown
Returns:

list of lists containing 5 col info for each mapped gene

Return type:

list

redis_utilities.import_ensembl(alias, args=None)[source]

Imports the ensembl data for the provided alias into the Redis database.

This stores the foreign key to ensembl stable ids in the Redis database. It uses the all mappings dictionary created by mysql.query_all_mappings for alias. This then iterates through each foreign_key. If the foreign_key has not been seen before, it sets unique:foreign_key as the stable id. If the key has been seen before and maps to a different ensembl stable id, it sets the value for unique:foreign_key as unmapped:many. In each case, it sets the value of taxid:hint:foreign_key as the stable_id, and appends taxid:hint to the set with foreign_key as the key.

Parameters:
  • alias (str) – An alias defined in ensembl.aliases.
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
redis_utilities.import_gene_nodes(node_table, args=None)[source]

Import gene node metadata into redis.

redis_utilities.import_node_meta(nmfile, args=None)[source]

Import node metadata into redis.

redis_utilities.main()[source]

Deploy a Redis container using marathon with the provided command line arguments.

This uses the provided command line arguments and the defaults found in config_utilities to launch a Redis docker container using marathon.

redis_utilities.node_desc(rdb, stable_array)[source]

Uses the redis database to find metadata about a node given its stable id

Return all metadata for each element of stable_array

Parameters:
  • rdb (redis object) – redis connection to the mapping db
  • stable_array (str) – the array of stable identifiers to be searched
Returns:

list of lists containing 4 col info for each mapped node

Return type:

list

job_utilities

Utilities for the Job class, which stores all important information for each job to run.

Classes:
Job: Stores all important information for each job to run on cluster

Contains module functions:

queue_starter_job(args, jobname='starter-jobname', dummy=1)
run_job_step(args, job_type, tmpdict)

run_local_fetch(args)
curl_handler(args, jobname, job_str)
chronos_parent_str(parentlist)
job_utilities.CURL_PREFIX

list – parts of the chronos curl command

class job_utilities.Job(jobtype, args)[source]

Base class for each job to be run in pipeline.

This Job class provides attributes and default functions to store information about and perform operations on job.

jobtype

str – the type of job to be referenced in components.json

jobname

str – name of job as appears on chronos

tmpdict

dict – dictionary of default tmp variable substitutions

cjobfile

str – chronos local file name of json job descriptor

cjobstr

str – contents of json job descriptor as single string.

args

namespace – command line arguments and default arguments to method

print_chronos_job()[source]

Prints out job description to .json file

This creates a directory and in it prints a .json file containing self.cjobstr. It saves the created file as self.cjobfile

queue_chronos_job()[source]

puts the job on the chronos queue

Using the chronos url from args.chronos, this creates a tmp .sh job that runs the curl statement to send the job to chronos

replace_jobtmp(tmpdict)[source]

Replaces temporary strings in self.cjobstr with specific values

This loops through all keys in tmpdict and replaces any placeholder matches in self.cjobstr with the key values. Also, adds tmpdict to self.tmpdict.

Parameters:tmpdict (dict) – dictionary of default tmp variable substitutions

run_docker_job()[source]

runs the job locally using docker

Using the args, tmpdict, and cjobstr, creates a command line call to docker run that executes the job and removes itself

run_job()[source]

Sends job to chronos job queue

Using the chronos url from args.chronos, this creates a tmp .sh job that runs the curl statement to send the job to chronos

run_local_job()[source]

prints and runs the job in the local environment

Using the args, tmpdict, and cjobstr, creates a command line call that executes the job

job_utilities.chronos_parent_str(parentlist)[source]

Returns correct string for parent dependencies.

Formatting of returned string depends on number of parents

Parameters:parentlist (list) – names of parent jobs

Returns: string

job_utilities.queue_starter_job(args, jobname='starter-jobname', dummy=1)[source]

Queues a starter job.

If dummy=1, creates and queues a dummy job that will never run; otherwise it queues a simple job with a single print statement that will run immediately and on which other jobs can depend.

Parameters:
  • args (Namespace) – args as populated namespace
  • jobname (str) – name of the job as it will appear on chronos
  • dummy (int) – 1 to queue a job that does not run, 0 to queue a job that does

Returns: Job object

job_utilities.run_job_step(args, job_type, tmpdict)[source]

Creates and runs a job.

Using the tmpdict description of the job, this will create and queue a new job that runs in the correct mode when its dependencies finish

Parameters:
  • args (namespace) – arguments from main_parse_args().
  • job_type (string) – type of job to be created
  • tmpdict (dict) – dictionary with all of the arguments values required

Returns: Job object

workflow_utilities

Utilities for running single or multiple steps of the setup or data pipeline either locally, in docker, or on the cloud.

Contains module functions:

list_sources(args)
generic_dict(args, ns_parent=None)
run_check(args)
run_fetch(args)
run_table(args)
run_map(args)
main_parse_args()
main()
workflow_utilities.DEFAULT_START_STEP

str – first step of setup

workflow_utilities.POSSIBLE_STEPS

list – list of all steps

workflow_utilities.SETUP_FILES

list – list of setup SrcClasses

workflow_utilities.SPECIAL_MODES

list – list of modes that run breadth first

Examples

To view all optional arguments that can be specified:

$ python3 code/workflow_utilities.py -h

To run just check step of one setup src (e.g. ppi) locally:

$ python3 code/workflow_utilities.py CHECK -su -os -c LOCAL -p ppi

To run all steps of setup on cloud:

$ python3 code/workflow_utilities.py CHECK -su

To run all steps one pipeline src (e.g. kegg) locally:

$ python3 code/workflow_utilities.py CHECK -os -c LOCAL -p kegg
workflow_utilities.generic_dict(args, ns_parent=None)[source]

Creates a dictionary to specify variables for a job

Creates a dictionary used to substitute temporary job variables in the specification of the command line call. ns_parent should be defined only for a next step caller job.

Parameters:args (Namespace) – args as populated namespace from parse_args
Returns:tmp substitution dictionary with appropriate values depending on args
Return type:dict
workflow_utilities.list_sources(args)[source]

Creates a list of all sources for the step to process.

Depending on args.setup, loops through all sources in the srccode directory pulling out valid names, or returns SETUP_FILES.

Parameters:args (Namespace) – args as populated namespace from parse_args
workflow_utilities.main()[source]

Runs the ‘start_step’ step of the main or args.setup pipeline on the args.chronos location, and all subsequent steps if not args.one_step

Parses the arguments and runs the specified part of the pipeline using the specified local or cloud resources.

workflow_utilities.main_parse_args()[source]

Processes command line arguments.

Expects one argument (start_step) and a number of optional arguments. If argument is missing, supplies default value.
parameter          type  flag  description
[start_step]                   string indicating which pipeline stage to start with
--setup                  -su   run db inits instead of source specific pipelines
--one_step               -os   run for a single step instead of rest of pipeline
--step_parameters  str   -p    parameters to specify calls of a single step in pipeline
--no_ensembl             -ne   do not run ensembl in setup pipeline
--dependencies     str   -d    names of parent jobs that must finish

Returns:args as populated namespace
Return type:Namespace
workflow_utilities.run_check(args)[source]

Runs checks for all sources.

This loops through args.parameters sources, creates a job for each that calls check_utilities clean() (and if not args.one_step, calls workflow_utilities FETCH), and runs job in args.chronos location.

Parameters:args (Namespace) – args as populated namespace from parse_args
workflow_utilities.run_export(args)[source]

TODO: Documentation.

Parameters:args (Namespace) – args as populated namespace from parse_args, specify --step_parameters (-p) as ‘,,’ separated list of files to import or the allowed possible SQL table names: node, node_meta, edge2line, status, or edge_meta. If not specified, by default it will try to import all tables.
workflow_utilities.run_fetch(args)[source]

Runs fetches for all aliases of a single source.

This loops through aliases of args.parameters sources, creates a job for each that calls fetch_utilities main() (and if not args.one_step, calls workflow_utilities TABLE), and runs job in args.chronos location.

Parameters:args (Namespace) – args as populated namespace from parse_args, must specify --step_parameters (-p) as ‘,,’ separated list of sources
workflow_utilities.run_import(args)[source]

Merges sorted files and runs import on output file on the cloud.

This loops through args.step_parameters (see Args below), and creates a job for each that merges the already sorted and unique files found in the data path (if args.merge is True), then calls import_utilities main().

Parameters:args (Namespace) – args as populated namespace from parse_args, specify --step_parameters (-p) as ‘,,’ separated list of files to import or the allowed possible SQL table names: node, node_meta, edge2line, status, or edge_meta. If not specified, by default it will try to import all tables.
workflow_utilities.run_map(args)[source]

Runs id conversion for a single .table. file on the cloud.

This loops through args.parameters tablefiles, creates a job for each that calls conv_utilities main(), and runs job in args.chronos location.

Parameters:args (Namespace) – args as populated namespace from parse_args, must specify --step_parameters (-p) as ‘,,’ separated list of ‘source.alias.table.chunk.txt’ file names
workflow_utilities.run_table(args)[source]

Runs tables for all chunks of a single source alias.

This loops through chunks of args.parameters aliases, creates a job for each that calls table_utilities main() (and if not args.one_step, calls workflow_utilities MAP), and runs job in args.chronos location.

Parameters:args (Namespace) – args as populated namespace from parse_args, must specify --step_parameters (-p) as ‘,,’ separated list of ‘source,alias’

sanitize_utilities

sanitize_utilities.add_config_args(parser)[source]

Add arguments specific to this module.

Parameters:parser (argparse.parser) – the parser to add arguments to
Returns:the parser with the arguments added
Return type:argparse.parser
sanitize_utilities.drop_duplicates_by_type_or_node(n_df, n1, n2, typ)[source]

Drop the duplicates in the network, by type or by node.

For each set of “duplicate” edges, only the edge with the maximum weight will be kept.

By type, the duplicates are where n1, n2, and typ are identical; by node, the duplicates are where n1 and n2 are identical.

Parameters:
  • n_df (list) – the data
  • n1 (int) – the column for the first node
  • n2 (int) – the column for the second node
  • typ (int) – the column for the type
Returns:

the modified data

Return type:

list

sanitize_utilities.make_network_undirected(n_df)[source]

Make the network undirected; that is, the network should be symmetric, but only the edges in one direction are included. So make the edges in the other direction explicit in the network. This assumes that the first two columns are the two nodes.

Parameters:n_df (list) – the data
Returns:the modified data
Return type:list
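
For a list-of-rows network this is a minimal sketch (assuming, per the description, that columns 0 and 1 are the two nodes):

def make_network_undirected(n_df):
    reversed_edges = [[row[1], row[0]] + row[2:] for row in n_df]
    return n_df + reversed_edges
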
sanitize_utilities.make_network_unweighted(n_df, wgt)[source]

Make the network unweighted, by setting the weights on all the edges to the same value (1).

Parameters:
  • n_df (list) – the data
  • wgt (int) – the weight column
Returns:

the modified data

Return type:

list

sanitize_utilities.normalize_network_by_type(n_df, typ, wgt)[source]

Normalize the network.

Currently the only normalization method implemented is by type.

Parameters:
  • n_df (list) – the data
  • typ (int) – the type column
  • wgt (int) – the weight column
Returns:

the modified data

Return type:

list

sanitize_utilities.sort_network(n_df)[source]

Sort the network.

Parameters:n_df (list) – the data
Returns:the modified data
Return type:list
sanitize_utilities.upper_triangle(n_df, n1, n2)[source]

Makes a (sparse) matrix upper triangular.

Parameters:
  • n_df (list) – the data
  • n1 (int) – the column for the first node
  • n2 (int) – the column for the second node