Pipeline Module Documentation¶
config_utilities¶
Config module to manage global variables.
This module establishes default values and argument parsers for commonly used variables.
Contains module functions:
add_run_config_args(parser)
add_file_config_args(parser)
add_mysql_config_args(parser)
add_redis_config_args(parser)
add_config_args(parser)
config_args()
pretty_name(orig_name, endlen=63)

Default values for the different configuration options.

config_utilities.add_config_args(parser)
Add global configuration options to command line arguments.
If global arguments are not specified, supplies their default values.
Parameters: parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns: parser with appended global options
Return type: argparse.ArgumentParser
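
A minimal usage sketch, assuming config_utilities is importable from the pipeline's code/ directory and that add_config_args appends every option group documented below:

import argparse
import config_utilities as cf

parser = argparse.ArgumentParser(description='example pipeline step')
parser = cf.add_config_args(parser)      # append the global options
args = parser.parse_args(['-tm'])        # e.g. enable test mode
print(args.test_mode, args.redis_port)   # unspecified options get defaults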

config_utilities.add_file_config_args(parser)
Add global configuration options to command line arguments.
If global arguments are not specified, supplies their default values.

parameter       type  flag  description
--working_dir   str   -wd   absolute path to toplevel working directory
--code_path     str   -cp   absolute path of code directory
--storage_dir   str   -sd   absolute path to toplevel shared storage directory
--data_path     str   -dp   relative path of data directory from toplevel
--logs_path     str   -lp   relative path of logs directory from toplevel
--export_path   str   -ep   relative path of export directory from toplevel
--src_path      str   -sp   relative path of srcClass directory from code_path

Parameters: parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns: parser with appended global options
Return type: argparse.ArgumentParser

config_utilities.add_mysql_config_args(parser)
Add global configuration options to command line arguments.
If global arguments are not specified, supplies their default values.

parameter      type  flag   description
--mysql_host   str   -myh   address of MySQL db
--mysql_port   str   -myp   port for MySQL db
--mysql_dir    str   -myd   absolute directory for MySQL db files
--mysql_mem    str   -mym   memory for deploying MySQL container
--mysql_cpu    str   -myc   cpus for deploying MySQL container
--mysql_conf   str   -mycf  relative config dir for deploying MySQL
--mysql_user   str   -myu   user for MySQL db
--mysql_pass   str   -myps  password for MySQL db

Parameters: parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns: parser with appended global options
Return type: argparse.ArgumentParser

config_utilities.add_redis_config_args(parser)
Add global configuration options to command line arguments.
If global arguments are not specified, supplies their default values.

parameter     type  flag  description
--redis_host  str   -rh   address of Redis db
--redis_port  str   -rp   port for Redis db
--redis_dir   str   -rd   absolute directory for Redis db files
--redis_mem   str   -rm   memory for deploying Redis container
--redis_cpu   str   -rc   cpus for deploying Redis container
--redis_pass  str   -rps  password for Redis db

Parameters: parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns: parser with appended global options
Return type: argparse.ArgumentParser

config_utilities.add_run_config_args(parser)
Add global configuration options to command line arguments.
If global arguments are not specified, supplies their default values.

parameter      type  flag   description
--chronos      str   -c     url of chronos scheduler, or LOCAL or DOCKER
--marathon     str   -m     url of marathon scheduler
--build_image  str   -i     docker image name to use for kn_build pipeline
--ens_species  str   -es    ',,' separated ensembl species to run in setup pipeline
--src_classes  str   -srcs  ',,' separated source keywords to run in parse pipeline
--force_fetch  bool  -ff    fetch even if file exists and is unchanged from last run
--test_mode    bool  -tm    run in test mode by only printing commands

Parameters: parser (argparse.ArgumentParser) – a parser to add global config opts to
Returns: parser with appended global options
Return type: argparse.ArgumentParser
check_utilities¶
Module for checking whether a source needs to be updated in the Knowledge Network (KN).
Contains the class SrcClass which serves as the base class for each supported source in the KN.
Contains module functions:
get_SrcClass(args)
compare_versions(SrcClass)
check(module, args=None)
main_parse_args()
Examples
To run check on a single source (e.g. dip):
$ python3 code/check_utilities.py dip
To view all optional arguments that can be specified:
$ python3 code/check_utilities.py -h

class check_utilities.SrcClass(src_name, base_url, aliases, args=None)
Base class to be extended by each supported source in KnowEnG.
This SrcClass provides default functions that should be extended or overridden by any source added to the Knowledge Network (KN).

Attributes:
name (str) – The name of the remote source to be included in the KN.
url_base (str) – The base url of the remote source, which may need additional processing to provide an actual download link (see get_remote_url).
aliases (dict) – A dictionary whose keys are subsets of the source to include in the KN (e.g. different species, data types, or interaction types) and whose values are short strings with information about each alias.
remote_file (str) – The name of the file to extract if the remote source is a directory.
version (dict) – The release version of each alias in the source.
source_url (str) – The website for the source.
reference (str) – The citation for the source.
pmid (str) – The PubMed ID for the source.
license (str) – The license for the source.

create_mapping_dict(filename, key_col=3, value_col=4)
Return a mapping dictionary for the provided file.
This returns a dictionary for use in mapping nodes or edge types from the file specified by filename. By default it opens the file specified by filename and creates a dictionary using the key_col column as the key and the value_col column as the value.
Parameters: - filename (str) – The name of the file containing the information needed to produce the mapping dictionary.
- key_col (int) – The column containing the key for creating the dictionary. By default this is column 3.
- value_col (int) – The column containing the value for creating the dictionary. By default this is column 4.
Returns: A dictionary for use in mapping nodes or edge types.
Return type: dict
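
A minimal sketch of the default behavior, assuming a tab separated file (the delimiter and zero-based column indexing are assumptions):

import csv

def create_mapping_dict(filename, key_col=3, value_col=4):
    """Map values in key_col to values in value_col, one entry per line."""
    mapping = dict()
    with open(filename, encoding='utf-8') as infile:
        for row in csv.reader(infile, delimiter='\t'):
            mapping[row[key_col]] = row[value_col]
    return mapping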

get_aliases(args=None)
Helper function for producing the alias dictionary.
This returns a dictionary where alias names are keys and alias info strings are the values. This helper function uses the species specific information for the build of the Knowledge Network, which is produced by ensembl.py during setup utilities and is located at cf.DEFAULT_MAP_PATH/species/species.json, in order to fetch all matching species specific aliases from the source.
Parameters: args (Namespace) – args as populated namespace or 'None' for defaults
Returns: A dictionary of species:(taxid, division) values
Return type: dict

get_dependencies(alias)
Return a list of other aliases that the provided alias depends on.
This returns a list of other aliases that must be processed before full processing of the provided alias can be completed. By default, it returns a list of all aliases which are considered mapping files (see is_map).
Parameters: alias (str) – An alias defined in self.aliases.
Returns: The other aliases defined in self.aliases that the provided alias depends on.
Return type: list

get_local_file_info(alias)
Return a dictionary with the local file information for the alias.
This returns the local file information for a given source alias, which will always contain the following keys:
'local_file_name' (str): name of the file locally
'local_file_exists' (bool): True if the file exists at the path indicated by 'local_file_name'
and will also contain the following if 'local_file_exists' is True:
'local_size' (int): size of the local file in bytes
'local_date' (float): time of last modification of the local file in seconds since the epoch
Parameters: alias (str) – An alias defined in self.aliases.
Returns: The local file information for a given source alias.
Return type: dict

get_local_version_info(alias, args)
Return a dictionary with the local information for the alias.
This returns the local information for a given source alias, as retrieved from the MySQL database and formatted as a dictionary object (see mysql_utilities.get_file_meta). It adds local_file_name and local_file_exists to the fields retrieved from the database, which are the name of the file locally and a boolean indicating whether it already exists on disk, respectively.
Parameters: alias (str) – An alias defined in self.aliases.
Returns: The local file information for a given source alias.
Return type: dict

get_remote_file_modified(alias)
Return the remote file date modified.
This returns the remote file date modified, as specified by the 'last-modified' page header.
Parameters: alias (str) – An alias defined in self.aliases.
Returns: time of last modification of the remote file in seconds since the epoch
Return type: float

get_remote_file_size(alias)
Return the remote file size.
This returns the remote file size, as specified by the 'content-length' page header. If the remote file size is unknown, this value should be -1.
Parameters: alias (str) – An alias defined in self.aliases.
Returns: The remote file size in bytes.
Return type: int

get_remote_url(alias)
Return the remote url needed to fetch the file corresponding to the alias.
This returns the url needed to fetch the file corresponding to the alias. By default this returns self.url_base.
Parameters: alias (str) – An alias defined in self.aliases.
Returns: The url needed to fetch the file corresponding to the alias.
Return type: str

get_source_version(alias)
Return the release version of the remote source:alias.
This returns the release version of the remote source for a specific alias. This value will be the same for every alias unless the alias can have a different release version than the source (this is source dependent). This value is stored in the self.version dictionary. If the value does not already exist, all alias versions are initialized to 'unknown'.
Parameters: alias (str) – An alias defined in self.aliases.
Returns: The remote version of the source.
Return type: str

is_map(alias)
Return a boolean representing whether the provided alias is used for source specific mapping of nodes or edges.
This returns a boolean representing whether the alias corresponds to a file used for mapping. By default this returns True if the alias ends in '_map' and False otherwise.
Parameters: alias (str) – An alias defined in self.aliases.
Returns: Whether or not the alias is used for mapping.
Return type: bool

table(raw_line, version_dict)
Uses the provided raw_line file to produce a table file, an edge_meta file, and a node_meta file (only for property nodes).
This returns nothing but produces the table formatted files from the provided raw_line file:
raw_line (file, line num, line_chksum, raw_line)
table (line_cksum, n1name, n1hint, n1type, n1spec, n2name, n2hint, n2type, n2spec, et_hint, score)
edge_meta (line_cksum, info_type, info_desc)
node_meta (node_id, info_type (alt_alias, relationship, experiment, or link), info_desc (text))
By default this function does nothing and must be overridden.
Parameters: - raw_line (str) – The path to the raw_line file
- version_dict (dict) – A dictionary describing the attributes of the alias for a source.
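
A sketch of how a new source might extend SrcClass. The source name, url, aliases, and the module-level get_SrcClass factory are hypothetical; only the overridden methods correspond to the extension points described above:

import check_utilities as cu

class Toysrc(cu.SrcClass):
    """Hypothetical source illustrating the SrcClass extension points."""
    def __init__(self, args=None):
        name = 'toysrc'                              # hypothetical source
        url_base = 'https://example.org/downloads/'
        aliases = {'ppi': 'interactions', 'gene_map': 'id mapping'}
        super(Toysrc, self).__init__(name, url_base, aliases, args)

    def get_remote_url(self, alias):
        # default returns self.url_base; build an alias specific link instead
        return self.url_base + alias + '.txt.gz'

def get_SrcClass(args):
    """Module-level factory, as returned by check_utilities.get_SrcClass."""
    return Toysrc(args)

Note that the default is_map already treats 'gene_map' as a mapping file, since the alias ends in '_map'.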

check_utilities.check(module, args=None)
Runs compare_versions(SrcClass) on a 'module' object.
This runs the compare_versions function on a 'module' object to find the version information of the source and determine whether a fetch is needed. The version information is also printed.
Parameters: - module (str) – string name of the module defining the source specific class
- args (Namespace) – args as populated namespace or 'None' for defaults
Returns: A nested dictionary describing the version information for each alias described in the source.
Return type: dict

check_utilities.compare_versions(src_obj, args=None)
Return a dictionary with the version information for each alias in the source and write a dictionary for each alias to file.
This returns a nested dictionary describing the version information of each alias in the source. The version information is also printed.
Parameters: - src_obj (SrcClass) – A SrcClass object for which the comparison should be performed.
- args (Namespace) – args as populated namespace or 'None' for defaults
Returns: A nested dictionary describing the version information for each alias described in src_obj. For each alias the following keys are defined:
'source' (str): The source name
'alias' (str): The alias name
'alias_info' (str): A short string with information about the alias
'is_map' (bool): See is_map
'dependencies' (list): See get_dependencies
'remote_url' (str): See get_remote_url
'remote_date' (float): See get_remote_file_modified
'remote_version' (str): See get_source_version
'remote_file' (str): File to extract if the remote file location is a directory
'remote_size' (int): See get_remote_file_size
'local_file_name' (str): See get_local_version_info
'file_exists' (bool): See get_local_version_info
'fetch_needed' (bool): True if the file needs to be downloaded from the remote source. A fetch is needed if the local file does not exist, or if the local and remote files have different modification dates or file sizes.
Return type: dict

check_utilities.get_SrcClass(args, *posargs, **kwargs)
Returns an object of the source class.
This returns an object of the source class to allow access to its functions if the module is imported.
Parameters: args (Namespace) – args as populated namespace or 'None' for defaults
Returns: a source class object
Return type: SrcClass
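
A sketch of programmatic use, assuming the modules are importable and a srcClass module named 'dip' exists (matching the shell examples above); the key layout of the returned dictionary follows compare_versions:

import config_utilities as cf
import check_utilities as cu

args = cf.config_args()               # defaults for every global option
version_info = cu.check('dip', args)  # compare_versions on the dip SrcClass
for alias, info in version_info.items():
    print(alias, info['fetch_needed'])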
fetch_utilities¶
Utilities for fetching and formatting a source for the Knowledge Network (KN) that has been updated.
Contains module functions:
download(version_dict)
chunk(filename, total_lines)
format_raw_line(filename)
get_md5_hash(filename)
get_line_count(filename)
main_parse_args()
main(version_json, args=None)

fetch_utilities.ARCHIVES (list) – list of supported archive formats.
fetch_utilities.DIR (str) – the relative path to data/source/alias/ from the location of script execution
fetch_utilities.MAX_CHUNKS (int) – maximum number of chunks to split a file into
Examples
To run fetch on a single source (e.g. dip) after check complete:
$ cd data/dip/PPI
$ python3 ../../../code/fetch_utilities.py file_metadata.json
To view all optional arguments that can be specified:
$ python3 code/fetch_utilities.py -h

class fetch_utilities.AppURLopener(*args, **kwargs)
URLopener to open with a custom user-agent.

fetch_utilities.chunk(filename, total_lines, chunksize=500000)
Splits the provided file into chunks of at most chunksize lines each.
This takes the path to a file and reads through it, splitting it into ceiling(total_lines/chunksize) chunks. It then returns the number of chunks and sets up the raw_lines table in the format: (file, line num, line_chksum, raw_line)
Parameters: - filename (str) – the file to split into chunks
- total_lines (int) – the number of lines in the file at filename
- chunksize (int) – max size of a single chunk. Defaults to 500000.
Returns: the number of chunks filename was split into
Return type: int
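
A minimal sketch of the splitting arithmetic (the output naming is hypothetical, and the real function also caps the count at MAX_CHUNKS and records the raw_lines bookkeeping):

import math

def chunk(filename, total_lines, chunksize=500000):
    """Split filename into ceiling(total_lines/chunksize) chunk files."""
    num_chunks = math.ceil(total_lines / chunksize)
    with open(filename, encoding='utf-8') as infile:
        for i in range(1, num_chunks + 1):
            chunkname = '{0}.chunk.{1}.txt'.format(filename, i)  # hypothetical
            with open(chunkname, 'w', encoding='utf-8') as out:
                for _ in range(chunksize):
                    line = infile.readline()
                    if not line:
                        break
                    out.write(line)
    return num_chunks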

fetch_utilities.download(version_dict)
Returns the standardized path to the local file after downloading it from the source and unarchiving if needed.
This returns the standardized path (path/source.alias.txt) for the source alias described in version_dict. If a download is needed (as determined by the check step), the remote file will be downloaded.
Parameters: version_dict (dict) – A dictionary describing the attributes of the alias for a source.
Returns: The relative path to the newly downloaded file.
Return type: str

fetch_utilities.format_raw_line(filename)
Creates the raw_line table from the provided file and returns the path to the output file.
This takes the path to a file and reads through it, adding three tab separated columns to the beginning of each line, saving the result to disk, and returning the output file path. The output format is the raw_lines table: (line_hash, line_num, file_id, line_str)
Parameters: filename (str) – the file to convert to raw_line table format
Returns: the path to the output file
Return type: str
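
A sketch of the per-line transformation; hashing each line with md5 and deriving file_id from the file name are assumptions:

import hashlib

def format_raw_line(filename):
    """Prefix each line with line_hash, line_num, and file_id columns."""
    file_id = filename.rsplit('/', 1)[-1]        # hypothetical file_id
    outname = filename + '.raw_line.txt'         # hypothetical naming
    with open(filename, encoding='utf-8') as infile, \
         open(outname, 'w', encoding='utf-8') as outfile:
        for line_num, line in enumerate(infile, 1):
            line = line.rstrip('\n')
            line_hash = hashlib.md5(line.encode('utf-8')).hexdigest()
            outfile.write('\t'.join([line_hash, str(line_num), file_id, line]) + '\n')
    return outname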

fetch_utilities.get_line_count(filename)
Returns the number of lines in the file at filename.
This takes the path to a file and reads through the file line by line, producing a count of the number of lines.
Parameters: filename (str) – the file whose lines should be counted
Returns: the number of lines in the file at filename
Return type: int

fetch_utilities.get_md5_hash(filename)
Returns the md5 hash of the file at filename.
This takes the path to a file and reads through the file line by line, producing both the md5 hash and a count of the number of lines.
Parameters: filename (str) – the file to hash
Returns: str: the md5 hash of the file at filename; int: the number of lines in the file at filename
Return type: str, int
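
The description implies a single pass that hashes while counting; a minimal sketch:

import hashlib

def get_md5_hash(filename):
    """Return (md5 hexdigest, line count) for the file at filename."""
    md5 = hashlib.md5()
    line_count = 0
    with open(filename, 'rb') as infile:
        for line in infile:
            md5.update(line)
            line_count += 1
    return md5.hexdigest(), line_count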

fetch_utilities.main(version_json, args=None)
Fetches and chunks the source:alias described by version_json.
This takes the path to a version_json (source.alias.json) and runs the fetch (see download). If the source is ensembl, it runs the ensembl specific fetch (see ensembl.fetch). If the alias is a data file, it then runs format_raw_line (see format_raw_line) and chunk (see chunk) on the output. If the alias is a mapping file, it runs create_mapping_dict (see SrcClass.create_mapping_dict). It also updates version_json to include the total line count and md5 checksum of the fetched file, and then saves the updated version_json to file.
Parameters: - version_json (str) – path to a json file describing the source:alias
- args (Namespace) – args as populated namespace or 'None' for defaults
table_utilities¶
Utilities for tabling a source for the Knowledge Network (KN) that has been fetched and chunked.
Contains module functions:
csu(infile, outfile, columns=list())
main_parse_args()
main(chunkfile, version_json, args=None)
Examples
To run table on a single source (e.g. dip) after fetch complete:
$ cd data/dip/PPI
$ python3 ../../../code/table_utilities.py chunks/dip.PPI.raw_line.1.txt file_metadata.json
To view all optional arguments that can be specified:
$ python3 code/table_utilities.py -h

table_utilities.csu(infile, outfile, columns=None)
Performs a cut | sort | uniq on infile using the provided columns and stores the result in outfile.
Takes a file in tsv format, cuts it to the provided columns, sorts it using the unix sort command, and removes duplicate elements.
Parameters: - infile (str) – the file to sort
- outfile (str) – the file to save the result into
- columns (list) – the columns to use in cut, or an empty list if all columns should be used
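
A sketch of the described shell pipeline, assuming 1-based column numbers as understood by cut:

import subprocess

def csu(infile, outfile, columns=None):
    """Run cut | sort | uniq on infile and write the result to outfile."""
    if columns:
        fields = ','.join(str(col) for col in columns)  # e.g. [1, 2] -> '1,2'
        cmd = 'cut -f{0} {1} | sort -u'.format(fields, infile)
    else:
        cmd = 'sort -u {0}'.format(infile)
    with open(outfile, 'w') as out:
        subprocess.check_call(cmd, shell=True, stdout=out)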

table_utilities.main(chunkfile, version_json, args=None)
Tables the source:alias described by version_json.
This takes the path to a chunked (see fetch_utilities.chunk) raw_line file and its corresponding version_json (source.alias.json) and runs the source specific table command (see SrcClass.table) if the alias is a data file. If it is a mapping file, it does nothing. The file formats are:
raw_line (line_hash, line_num, file_id, raw_line)
table_file (line_hash, n1name, n1hint, n1type, n1spec, n2name, n2hint, n2type, n2spec, et_hint, score, table_hash)
edge_meta (line_hash, info_type, info_desc)
node_meta (node_id, info_type (evidence, relationship, experiment, or link), info_desc (text))
node (node_id, n_alias, n_type)
Parameters: - chunkfile (str) – path to a chunk file in raw_line format
- version_json (str) – path to a json file describing the source:alias
- args (Namespace) – args as populated namespace or 'None' for defaults
conv_utilities¶
Utilities for mapping the gene identifiers in an edge file.
Contains module functions:
map_list(namefile, args=None)
main_parse_args()
main(tablefile, args=None)

conv_utilities.DEFAULT_HINT (str) – the default mapping hint for converting identifiers
conv_utilities.DEFAULT_TAXON (int) – the default taxon id to use for converting identifiers
Examples
To run conv on a single source (e.g. dip) after table complete:
$ python3 code/conv_utilities.py data/dip/PPI/chunks/dip.PPI.edge.1.txt
To run conv on a file of gene names:
$ python3 code/conv_utilities.py -mo LIST list_of_gene_names.txt
To view all optional arguments that can be specified:
$ python3 code/conv_utilities.py -h

conv_utilities.main(tablefile, args=None)
Maps the nodes for the source:alias tablefile.
This takes the path to a tablefile (see table_utilities.main) and maps the nodes in it using the Redis DB. It then outputs a status file in the format (table_hash, n1, n2, edge_type, weight, edge_hash, line_hash, status, status_desc), where status is production if both nodes mapped and unmapped otherwise. It also outputs an edge file containing all rows where status is production, in the format (edge_hash, n1, n2, edge_type, weight), and an edge2line file in the format (edge_hash, line_hash).
Parameters: - tablefile (str) – path to a tablefile to be mapped
- args (Namespace) – args as populated namespace or 'None' for defaults

conv_utilities.main_parse_args()
Processes command line arguments.
Expects one positional argument (infile) and a number of optional arguments. If arguments are missing, supplies default values.
Returns: args as populated namespace
Return type: Namespace

conv_utilities.map_list(namefile, args=None)
Maps the nodes for the provided namefile.
This takes the path to a namefile and maps the nodes in it using the Redis DB. It then outputs a mapped file in the format (mapped, original).
Parameters: - namefile (str) – path to a namefile to be mapped
- args (Namespace) – args as populated namespace or 'None' for defaults
import_utilities¶
Utilities for importing edge, edge_meta, and node_meta files into the KnowEnG MySQL database.
Contains module functions:
import_file(file_name, table, ld_cmd='', dup_cmd='', args=None)
import_filemeta(version_dict, args=None)
update_filemeta(version_dict, args=None)
import_edge(edgefile, args=None)
import_nodemeta(nmfile, args=None)
import_pnode(filename, args=None)

import_utilities.enable_keys(args=None)
Re-enables keys on the KnowEnG MySQL database.
The no-keys import functions (see import_file_nokeys) disable keys for speed and do not re-enable them; this calls mysql_utilities.get_database('KnowNet', args).enable_keys() to restore them.
Parameters: args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.import_edge(edgefile, args=None)
Imports the provided edge file and any corresponding meta files into the KnowEnG MySQL database.
Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it updates to the maximum edge score if it is an edge file, and ignores the duplicate if it is metadata.
Parameters: - edgefile (str) – path to the file to be imported
- args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.import_file(file_name, table, ld_cmd='', dup_cmd='', args=None)
Imports the provided file into the KnowEnG MySQL database.
Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it uses the provided behavior to handle it. If no behavior is provided, it replaces into the table.
Parameters: - file_name (str) – path to the file to be imported
- table (str) – name of the permanent table to import to
- ld_cmd (str) – optional additional command for loading data
- dup_cmd (str) – command for handling duplicates
- args (Namespace) – args as populated namespace or 'None' for defaults
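
A sketch of the temporary-table pattern these import functions describe, built on the MySQL wrapper documented under mysql_utilities; the temp-table naming and the exact cmd strings are assumptions:

import config_utilities as cf
import mysql_utilities as mu

def import_file(file_name, table, ld_cmd='', dup_cmd='', args=None):
    """Load file_name into a temp table, then merge into the permanent one."""
    if args is None:
        args = cf.config_args()
    db = mu.get_database('KnowNet', args)
    tmptable = table + '_temp'                       # hypothetical naming
    db.create_temp_table(tmptable, 'LIKE ' + table)  # hypothetical cmd
    db.load_data(file_name, tmptable, ld_cmd)
    if dup_cmd:
        # e.g. ON DUPLICATE KEY UPDATE weight = GREATEST(weight, VALUES(weight))
        db.insert(table, 'SELECT * FROM ' + tmptable + ' ' + dup_cmd)
    else:
        db.replace(table, 'SELECT * FROM ' + tmptable)
    db.drop_temp_table(tmptable)
    db.close()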

import_utilities.import_file_nokeys(file_name, table, ld_cmd='', args=None)
Imports the provided file into the KnowEnG MySQL database using optimal settings.
Starts a transaction and changes some MySQL settings for optimization, which disables the keys. It then loads the data into the provided table in MySQL. Note that the keys are not re-enabled after import; to do this, call enable_keys(args).
Parameters: - file_name (str) – path to the file to be imported
- table (str) – name of the permanent table to import to
- ld_cmd (str) – optional additional command for loading data
- args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.import_filemeta(version_dict, args=None)
Imports the provided version_dict into the KnowEnG MySQL database.
Loads the data from a version dictionary into the raw_file table.
Parameters: - version_dict (dict) – version dictionary describing a downloaded file
- args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.import_nodemeta(nmfile, args=None)
Imports the provided node_meta file and any corresponding meta files into the KnowEnG MySQL database.
Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it updates to the maximum edge score if it is an edge file, and ignores the duplicate if it is metadata.
Parameters: - nmfile (str) – path to the file to be imported
- args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.import_pnode(filename, args=None)
Imports the provided property node file into the KnowEnG MySQL database.
Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it updates to the maximum edge score if it is an edge file, and ignores the duplicate if it is metadata.
Parameters: - filename (str) – path to the file to be imported
- args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.import_production_edges(args=None)
Query production edges from the status table into the edge table.
Queries the KnowNet status table and copies all distinct production edges to the edge table. If a duplication occurs during the query, it updates to the maximum edge score and keeps the edge hash for that edge.
Parameters: args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.import_status(statusfile, args=None)
Imports the provided status file and any corresponding meta files into the KnowEnG MySQL database.
Loads the data into a temporary table in MySQL. It then queries from the temporary table into the corresponding permanent table. If a duplication occurs during the query, it updates to the maximum edge score if it is an edge file, and ignores the duplicate if it is metadata.
Parameters: - statusfile (str) – path to the file to be imported
- args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.main_parse_args()
Processes command line arguments.
Expects one positional argument (status_file) and a number of optional arguments. If arguments are missing, supplies default values.
Returns: args as populated namespace
Return type: Namespace

import_utilities.merge(merge_key, args)
Uses sort to merge and deduplicate the already sorted files of the given table type and stores the results into an output file.
This takes a table type (one of: node, node_meta, edge2line, status, edge, or edge_meta) and merges the corresponding files using the unix sort command, removing any duplicate elements.
Parameters: - merge_key (str) – table type (one of: node, node_meta, edge2line, status, edge, or edge_meta)
- args (Namespace) – args as populated namespace or 'None' for defaults

import_utilities.merge_logs(args)
Merge all log files into a single file that contains all the information about the run.

import_utilities.update_filemeta(version_dict, args=None)
Updates the provided filemeta in the KnowEnG MySQL database.
Updates the data from a version dictionary in the raw_file table.
Parameters: - version_dict (dict) – version dictionary describing a downloaded file
- args (Namespace) – args as populated namespace or 'None' for defaults
export_utilities¶

export_utilities.convert_nodes(args, nodes)
Uses redis_utilities to convert a set of nodes.

export_utilities.figure_out_class(db, et)
Determines the class and bidirectionality of the edge_type.

export_utilities.get_metadata(db, edges, nodes, lines, sp, et, args)
Retrieves the metadata for a subnetwork.

export_utilities.get_sources(edges)
Given a list of edges, determines the set of sources included.

export_utilities.norm_edges(edges, args)
Normalizes and cleans edges according to the specified arguments.
mysql_utilities¶
Utilities for interacting with the KnowEnG MySQL db through python.
Contains the class MySQL, which provides functionality for interacting with the MySQL database.
Contains module functions:
combine_tables(alias, args=None)
create_dictionary(results)
import_nodes(version_dict, args=None)
query_all_mappings(version_dict, args=None)
create_mapping_dicts(version_dict, args=None)
get_database(db=None, args=None)
get_insert_cmd(step)
import_ensembl(alias, args=None)

class mysql_utilities.MySQL(database=None, args=None)
Class providing functionality for interacting with the MySQL database.
This class serves as a wrapper for interacting with the KnowEnG MySQL database.

Attributes:
host (str) – the MySQL db hostname
user (str) – the MySQL db username
port (str) – the MySQL db port
passw (str) – the MySQL db password
database (str) – the MySQL database to connect to
conn (object) – connection object for the database
cursor (object) – cursor object for the database

close()
Close the connection to the MySQL server.
This commits any remaining changes and closes the connection to the MySQL server.

copy_table(old_database, old_table, new_database, new_table)
Copy a table in the MySQL database.
Copies the provided table from the old database to the new database.
Parameters: - old_database (str) – name of the database to copy from
- old_table (str) – name of the table to copy from
- new_database (str) – name of the database to copy to
- new_table (str) – name of the table to copy to

create_db(database)
Add a database to the MySQL server.
Adds the provided database to the MySQL server.
Parameters: database (str) – name of the database to add to the MySQL server

create_table(tablename, cmd='')
Add a table to the MySQL database.
Adds the provided tablename to the MySQL database. If cmd is specified, the table is created using the provided cmd.
Parameters: - tablename (str) – name of the table to add to the MySQL database
- cmd (str) – optional string to overwrite the default create table command

create_temp_table(tablename, cmd='')
Add a temporary table to the MySQL database.
Adds the provided tablename to the MySQL database as a temporary table. If cmd is specified, the table is created using the provided cmd.
Parameters: - tablename (str) – name of the table to add to the MySQL database
- cmd (str) – optional additional command

disable_keys()
Disables keys for faster operations.
Turns off autocommit, unique_checks, and foreign_key_checks for the MySQL database.

drop_db(database)
Remove a database from the MySQL server.
Drops the provided database from the MySQL server.
Parameters: database (str) – name of the database to remove from the MySQL server

drop_table(tablename)
Remove a table from the MySQL database.
Drops the provided tablename from the MySQL database.
Parameters: tablename (str) – name of the table to remove from the MySQL database

drop_temp_table(tablename)
Remove a temporary table from the MySQL database.
Drops the provided temporary tablename from the MySQL database.
Parameters: tablename (str) – name of the temporary table to remove from the MySQL database

enable_keys()
Enables keys for safer operations.
Turns on autocommit, unique_checks, and foreign_key_checks for the MySQL database.

import_schema(database, sqlfile)
Import the schema for the provided database from sqlfile.
Removes the provided database if it exists, creates a new one, and imports the schema as defined in the provided sqlfile.
Parameters: - database (str) – name of the database to add to the MySQL server
- sqlfile (str) – name of the sql file specifying the format for the database

import_table(database, tablefile, import_flags='--delete')
Import the data for the table in the provided database described by tablefile.
Imports the data as defined in the provided tablefile.
Parameters: - database (str) – name of the database to add to the MySQL server
- tablefile (str) – name of the txt file specifying the data for the table
- import_flags (str) – additional flags to pass to mysqlimport

init_knownet()
Initializes the Knowledge Network MySQL DB.
Creates the KnowNet database and all of its tables if they do not already exist. Also imports the edge_type, node_type, and species files, but ignores any lines that have the same unique key as those already in the tables.

insert(tablename, cmd)
Insert into tablename using cmd.
Parameters: - tablename (str) – name of the table to insert into
- cmd (str) – a valid SQL command to use for inserting into tablename

insert_ignore(tablename, cmd='')
Insert ignore into tablename using cmd.
Parameters: - tablename (str) – name of the table to insert into
- cmd (str) – a valid SQL command to use for inserting into tablename

load_data(filename, tablename, cmd='', sep='\\t', enc='"')
Import data into a table in the MySQL database.
Loads the data located on the local machine into the provided MySQL table using the LOAD DATA LOCAL INFILE command.
Parameters: - filename (str) – name of the file to import from
- tablename (str) – name of the table to import into
- sep (str) – separator for fields in the file
- enc (str) – enclosing character for fields in the file
- cmd (str) – optional additional command

move_table(old_database, old_table, new_database, new_table)
Move a table in the MySQL database.
Moves the provided table from the old database to the new database.
Parameters: - old_database (str) – name of the database to move from
- old_table (str) – name of the table to move from
- new_database (str) – name of the database to move to
- new_table (str) – name of the table to move to

query_distinct(query, table, cmd='')
Run the provided distinct query in MySQL.
This runs the provided distinct query against the provided table with the optional extra cmd using the current MySQL connection and cursor. It then returns the fetched results.
Parameters: - query (str) – the SQL query to run on the MySQL server
- table (str) – the table to query from
- cmd (str) – an additional SQL command to run on the MySQL server (optional)
Returns: the fetched results
Return type: list
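
The likely shape of the assembled statement, as a sketch (the exact SQL template is an assumption):

def query_distinct(self, query, table, cmd=''):
    """Run SELECT DISTINCT query FROM table cmd and fetch the results."""
    sql = 'SELECT DISTINCT {0} FROM {1} {2}'.format(query, table, cmd)
    self.cursor.execute(sql)
    return self.cursor.fetchall()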

replace(tablename, cmd)
Replace into tablename using cmd.
Parameters: - tablename (str) – name of the table to replace into
- cmd (str) – a valid SQL command to use for replacing into tablename

replace_safe(tablename, cmd, values)
Replace into tablename using cmd, with values substituted safely.
Parameters: - tablename (str) – name of the table to replace into
- cmd (str) – a valid SQL command to use for replacing into tablename
- values – the values to be substituted into cmd

run(cmd)
Run the provided command in MySQL.
This runs the provided command using the current MySQL connection and cursor.
Parameters: cmd (str) – the SQL command to run on the MySQL server
Returns: the fetched results
Return type: list

set_isolation(duration='', level='REPEATABLE READ')
Sets the transaction isolation level.
Modifies the transaction isolation level to modulate lock status behavior. The InnoDB default is REPEATABLE READ. For other levels, see https://dev.mysql.com/doc/refman/5.7/en/set-transaction.html
Parameters: - duration (str) – time for the isolation level to be used. Can be empty, GLOBAL, or SESSION
- level (str) – isolation level. In order of locking level: SERIALIZABLE, REPEATABLE READ, READ COMMITTED, READ UNCOMMITTED

start_transaction(level='REPEATABLE READ')
Starts a MySQL transaction with the provided isolation level.
Uses the provided isolation level to start a MySQL transaction using the current connection. The transaction persists until the next commit.
Parameters: level (str) – isolation level. In order of locking level: SERIALIZABLE, REPEATABLE READ, READ COMMITTED, READ UNCOMMITTED

mysql_utilities.combine_tables(alias, args=None)
Combine all of the data imported from ensembl for the provided alias into a single database.
This combines the imported tables into a single knownet_mappings table with information from genes, transcripts, and translations. It then merges this table into the KnowNet database for use in gene identifier mapping.
Parameters: - alias (str) – An alias defined in ensembl.aliases.
- args (Namespace) – args as populated namespace or 'None' for defaults

mysql_utilities.create_KnowNet(args=None)
Returns an object of the MySQL class connected to the KnowNet db.
This returns an object of the MySQL class to allow access to its functions if the module is imported.
Parameters: args (Namespace) – args as populated namespace or 'None' for defaults
Returns: a MySQL object connected to the KnowNet database
Return type: MySQL

mysql_utilities.create_dictionary(results)
Creates a dictionary from MySQL fetched results.
This returns a dictionary from the MySQL results after a query of the DB. It assumes there are two columns in the results and reads through all of the results, making them into a dictionary.
Parameters: results (list) – a list of the results returned from a MySQL query
Returns: dictionary with the first column as keys and the second as values
Return type: dict
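
The described behavior reduces to a two-column fold; a minimal sketch:

def create_dictionary(results):
    """Build {first column: second column} from MySQL fetched results."""
    return {row[0]: row[1] for row in results}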

mysql_utilities.create_mapping_dicts(version_dict, args=None)
Creates the mapping dictionaries for the provided alias.
Produces the ensembl stable mappings dictionary and the all unique mappings dictionary for the provided alias. It then saves them as json objects to file.
Parameters: - version_dict (dict) – the version dictionary describing the source:alias
- args (Namespace) – args as populated namespace or 'None' for defaults

mysql_utilities.deploy_container(args=None)
Deploys a container with marathon running MySQL using the specified args.
This replaces the placeholder args in the json describing how to deploy a container running MySQL with those supplied in the user's arguments.
Parameters: args (Namespace) – args as populated namespace or 'None' for defaults

mysql_utilities.get_database(db=None, args=None)
Returns an object of the MySQL class.
This returns an object of the MySQL class to allow access to its functions if the module is imported.
Parameters: - db (str) – optional db to connect to
- args (Namespace) – args as populated namespace or 'None' for defaults
Returns: a MySQL object
Return type: MySQL

mysql_utilities.get_file_meta(file_id, args=None)
Returns the metadata for the provided file_id if it exists.
This returns the metadata for the provided file_id (in the format "source.alias") present locally in the MySQL database from a previous run of the pipeline. It formats this output as a dictionary, which will always contain the following keys:
'file_id' (str): "source.alias", the key used in the SQL raw_file table
'file_exists' (bool): True if a file with the above file_id exists in the SQL raw_file table
and will additionally contain the following keys if file_exists is True:
'size' (int): size of the file in bytes
'date' (float): time of last modification of the file in seconds since the epoch
'version' (str): the remote version of the source
Parameters: file_id (str) – The file_id for the raw_file in the format "source.alias"
Returns: The file_meta information for a given source alias.
Return type: dict

mysql_utilities.get_insert_cmd(step)
Returns the command to be used with an insert for the provided step.
This takes a predefined step to determine which type of insert is being performed during the production of the combined knownet_mappings tables. Based on this step, it returns a MySQL command to be used with an INSERT INTO statement.
Parameters: step (str) – the step during the production of the combined knownet_mapping tables
Returns: the command to be used with an INSERT INTO statement at this step
Return type: str

mysql_utilities.import_ensembl(alias, args=None)
Imports the ensembl data for the provided alias into the KnowEnG database.
This produces the local copy of the fetched ensembl database for the alias. It drops the existing database, creates a new database, imports the relevant ensembl sql schema, and imports the table.
Parameters: - alias (str) – An alias defined in ensembl.aliases.
- args (Namespace) – args as populated namespace or 'None' for defaults

mysql_utilities.import_nodes(version_dict, args=None)
Imports the gene nodes into the KnowNet nodes and node_species tables.
Queries the imported ensembl nodes and uses the stable ids as nodes for the KnowNet nodes table, and uses the taxid to create the corresponding node_species table.
Parameters: - version_dict (dict) – the version dictionary describing the source:alias
- args (Namespace) – args as populated namespace or 'None' for defaults

mysql_utilities.main()
Deploy a MySQL container using marathon with the provided command line arguments.
This uses the provided command line arguments and the defaults found in config_utilities to launch a MySQL docker container using marathon.

mysql_utilities.query_all_mappings(version_dict, args=None)
Creates the all mappings dictionary for the provided alias.
Produces a dictionary of ensembl stable mappings and all unique mappings for the provided alias. It then saves them as json objects to file.
Parameters: - version_dict (dict) – the version dictionary describing the source:alias
- args (Namespace) – args as populated namespace or 'None' for defaults
redis_utilities¶
Utilities for interacting with the KnowEnG Redis db through python.
Contains module functions:
get_database(args=None)
import_ensembl(alias, args=None)
conv_gene(rdb, foreign_key, hint, taxid)

redis_utilities.conv_gene(rdb, fk_array, hint, taxid)
Uses the redis database to convert a gene to an ensembl stable id.
This first checks whether there is a unique name for the provided foreign key. If not, it uses the hint and taxid to try to filter the foreign key possibilities and find a matching stable id.
Parameters: - rdb (redis object) – redis connection to the mapping db
- fk_array (list) – the foreign gene identifiers to be translated
- hint (str) – a hint for conversion
- taxid (str) – the species taxid, 'unknown' if unknown
Returns: the result of searching for the gene in the redis DB
Return type: str

redis_utilities.deploy_container(args=None)
Deploys a container with marathon running Redis using the specified args.
This replaces the placeholder args in the json describing how to deploy a container running Redis with those supplied in the user's arguments.
Parameters: args (Namespace) – args as populated namespace or 'None' for defaults

redis_utilities.get_database(args=None)
Returns a Redis database connection.
This returns a Redis database connection, allowing access to its functions if the module is imported.
Parameters: args (Namespace) – args as populated namespace or 'None' for defaults
Returns: a redis connection object
Return type: StrictRedis

redis_utilities.get_node_info(rdb, fk_array, ntype, hint, taxid)
Uses the redis database to convert a node alias to a KN internal id.
Figures out the type of node for each id in fk_array and then returns all of the associated metadata, or unmapped-*.
Parameters: - rdb (redis object) – redis connection to the mapping db
- fk_array (list) – the array of foreign gene identifiers to be translated
- ntype (str) – 'Gene' or 'Property' or None
- hint (str) – a hint for conversion
- taxid (str) – the species taxid, None if unknown
Returns: list of lists containing 5 col info for each mapped gene
Return type: list

redis_utilities.import_ensembl(alias, args=None)
Imports the ensembl data for the provided alias into the Redis database.
This stores the foreign key to ensembl stable id mappings in the Redis database. It uses the all mappings dictionary created by mysql_utilities.query_all_mappings for the alias, and iterates through each foreign_key. If the foreign_key has not been seen before, it sets unique:foreign_key as the stable id. If the key has been seen before and maps to a different ensembl stable id, it sets the value for unique:foreign_key to unmapped:many. In each case, it sets the value of taxid:hint:foreign_key to the stable_id, and appends taxid:hint to the set keyed by foreign_key.
Parameters: - alias (str) – An alias defined in ensembl.aliases.
- args (Namespace) – args as populated namespace or 'None' for defaults
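
A sketch of the key schema that import_ensembl describes and that conv_gene reads back, using the redis-py API and the default connection settings from config_utilities; the example identifiers and the decode handling are assumptions:

import redis

rdb = redis.StrictRedis(host='127.0.0.1', port=6379, password='KnowEnG', db=0)

# writing (import_ensembl): unique:<fk> holds the stable id or 'unmapped:many'
fk, stable_id, taxid, hint = 'A2M', 'ENSG00000175899', '9606', 'uniprot'
seen = rdb.get('unique:' + fk)
if seen is None:
    rdb.set('unique:' + fk, stable_id)
elif seen.decode() != stable_id:
    rdb.set('unique:' + fk, 'unmapped:many')
rdb.set('{0}:{1}:{2}'.format(taxid, hint, fk), stable_id)
rdb.sadd(fk, '{0}:{1}'.format(taxid, hint))

# reading (conv_gene): try the unique key first, fall back to taxid:hint:fk
hit = rdb.get('unique:' + fk)
if hit is None or hit.decode().startswith('unmapped'):
    hit = rdb.get('{0}:{1}:{2}'.format(taxid, hint, fk))
print(hit.decode() if hit else 'unmapped-none')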

redis_utilities.import_gene_nodes(node_table, args=None)
Import gene node metadata into redis.

redis_utilities.main()
Deploy a Redis container using marathon with the provided command line arguments.
This uses the provided command line arguments and the defaults found in config_utilities to launch a Redis docker container using marathon.

redis_utilities.node_desc(rdb, stable_array)
Uses the redis database to find metadata about a node given its stable id.
Returns all metadata for each element of stable_array.
Parameters: - rdb (redis object) – redis connection to the mapping db
- stable_array (list) – the array of stable identifiers to be searched
Returns: list of lists containing 4 col info for each mapped node
Return type: list
job_utilities¶
Utilities for the Job class, which stores all important information for each job to run.
Classes:
Job: Stores all important information for each job to run on the cluster
Contains module functions:
queue_starter_job(args, jobname='starter-jobname', dummy=1)
run_job_step(args, job_type, tmpdict)
run_local_fetch(args)
curl_handler(args, jobname, job_str)
chronos_parent_str(parentlist)

job_utilities.CURL_PREFIX (list) – parts of the chronos curl command

class job_utilities.Job(jobtype, args)
Base class for each job to be run in the pipeline.
This Job class provides attributes and default functions to store information about and perform operations on a job.

Attributes:
jobtype (str) – the type of job to be referenced in components.json
jobname (str) – name of the job as it appears on chronos
tmpdict (dict) – dictionary of default tmp variable substitutions
cjobfile (str) – chronos local file name of the json job descriptor
cjobstr (str) – contents of the json job descriptor as a single string
args (Namespace) – command line arguments and default arguments to method

print_chronos_job()
Prints the job description to a .json file.
This creates a directory and prints into it a .json file containing self.cjobstr. It saves the created file name as self.cjobfile.

queue_chronos_job()
Puts the job on the chronos queue.
Using the chronos url from args.chronos, this creates a tmp .sh job that runs the curl statement to send the job to chronos.

replace_jobtmp(tmpdict)
Replaces temporary strings in self.cjobstr with specific values.
This loops through all keys in tmpdict and replaces any placeholder matches in self.cjobstr with the corresponding values. It also adds tmpdict to self.tmpdict.
Parameters: tmpdict (dict) – dictionary of default tmp variable substitutions

run_docker_job()
Runs the job locally using docker.
Using the args, tmpdict, and cjobstr, this creates a command line call to docker run that executes the job and removes itself.

job_utilities.chronos_parent_str(parentlist)
Returns the correct string for parent dependencies.
The formatting of the returned string depends on the number of parents.
Parameters: parentlist (list) – names of parent jobs
Returns: string
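
One plausible implementation, assuming the returned string is spliced into a chronos job descriptor (the exact template is an assumption):

import json

def chronos_parent_str(parentlist):
    """Format parent job names for a chronos dependent-job descriptor."""
    if not parentlist:
        return ''
    # chronos dependency jobs take a JSON array of parent job names
    return '"parents": {0}'.format(json.dumps(parentlist))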

job_utilities.queue_starter_job(args, jobname='starter-jobname', dummy=1)
Queues a starter job.
If dummy=1, creates and queues a dummy job that will never run; otherwise it queues a simple job with a single print statement that will run immediately and on which other jobs will depend.
Parameters: - jobname (str) – name of the job as it appears on chronos
- dummy (bool) – 1 to queue a job that does not run, 0 to queue a job that does
Returns: Job object

job_utilities.run_job_step(args, job_type, tmpdict)
Creates and runs a job.
Using the tmpdict description of the job, this creates and queues a new job that runs when its dependencies finish in the correct mode.
Parameters: - args (namespace) – arguments from main_parse_args()
- job_type (str) – type of job to be created
- tmpdict (dict) – dictionary with all of the required argument values
Returns: Job object
workflow_utilities¶
Utilities for running single or multiple steps of the setup or data pipeline, either locally, in docker, or on the cloud.
Contains module functions:
list_sources(args)
generic_dict(args, ns_parent=None)
run_check(args)
run_fetch(args)
run_table(args)
run_map(args)
main_parse_args()
main()

workflow_utilities.DEFAULT_START_STEP (str) – first step of setup
workflow_utilities.POSSIBLE_STEPS (list) – list of all steps
workflow_utilities.SETUP_FILES (list) – list of setup SrcClasses
workflow_utilities.SPECIAL_MODES (list) – list of modes that run breadth first
Examples
To view all optional arguments that can be specified:
$ python3 code/workflow_utilities.py -h
To run just check step of one setup src (e.g. ppi) locally:
$ python3 code/workflow_utilities.py CHECK -su -os -c LOCAL -p ppi
To run all steps of setup on cloud:
$ python3 code/workflow_utilities.py CHECK -su
To run all steps of one pipeline src (e.g. kegg) locally:
$ python3 code/workflow_utilities.py CHECK -os -c LOCAL -p kegg

workflow_utilities.generic_dict(args, ns_parent=None)
Creates a dictionary to specify variables for a job.
Creates a dictionary used to substitute temporary job variables in the specification of the command line call. ns_parent should be defined only for a next step caller job.
Parameters: args (Namespace) – args as populated namespace from parse_args
Returns: tmp substitution dictionary with appropriate values depending on args
Return type: dict

workflow_utilities.list_sources(args)
Creates a list of all sources for the step to process.
Depending on args.setup, loops through all sources in the srccode directory pulling out valid names, or returns SETUP_FILES.
Parameters: args (Namespace) – args as populated namespace from parse_args

workflow_utilities.main()
Runs the 'start_step' step of the main or args.setup pipeline on the args.chronos location, and all subsequent steps if not args.one_step.
Parses the arguments and runs the specified part of the pipeline using the specified local or cloud resources.

workflow_utilities.main_parse_args()
Processes command line arguments.
Expects one argument (start_step) and a number of optional arguments. If an argument is missing, supplies a default value.

parameter          type  flag  description
[start_step]       str         indicates which pipeline stage to start with
--setup                  -su   run db inits instead of source specific pipelines
--one_step               -os   run for a single step instead of rest of pipeline
--step_parameters  str   -p    parameters to specify calls of a single step in pipeline
--no_ensembl             -ne   do not run ensembl in setup pipeline
--dependencies     str   -d    names of parent jobs that must finish

Returns: args as populated namespace
Return type: Namespace

workflow_utilities.run_check(args)
Runs checks for all sources.
This loops through args.parameters sources, creates a job for each that calls check_utilities main() (and, if not args.one_step, calls the workflow_utilities FETCH step), and runs the job in the args.chronos location.
Parameters: args (Namespace) – args as populated namespace from parse_args

workflow_utilities.run_export(args)
TODO: Documentation.
Parameters: args (Namespace) – args as populated namespace from parse_args; specify --step_parameters(-p) as a ',,' separated list of files to export or the allowed possible SQL table names: node, node_meta, edge2line, status, or edge_meta. If not specified, by default it will try to export all tables.

workflow_utilities.run_fetch(args)
Runs fetches for all aliases of a single source.
This loops through the aliases of args.parameters sources, creates a job for each that calls fetch_utilities main() (and, if not args.one_step, calls the workflow_utilities TABLE step), and runs the job in the args.chronos location.
Parameters: args (Namespace) – args as populated namespace from parse_args; must specify --step_parameters(-p) as a ',,' separated list of sources

workflow_utilities.run_import(args)
Merges sorted files and runs import on the output file on the cloud.
This loops through args.step_parameters (see below), and creates a job for each that merges the already sorted and unique files found in the data path (if args.merge is True), then calls import_utilities main().
Parameters: args (Namespace) – args as populated namespace from parse_args; specify --step_parameters(-p) as a ',,' separated list of files to import or the allowed possible SQL table names: node, node_meta, edge2line, status, or edge_meta. If not specified, by default it will try to import all tables.

workflow_utilities.run_map(args)
Runs id conversion for a single .table. file on the cloud.
This loops through args.parameters tablefiles, creates a job for each that calls conv_utilities main(), and runs the job in the args.chronos location.
Parameters: args (Namespace) – args as populated namespace from parse_args; must specify --step_parameters(-p) as a ',,' separated list of 'source.alias.table.chunk.txt' file names

workflow_utilities.run_table(args)
Runs tables for all chunks of a single source alias.
This loops through chunks of args.parameters aliases, creates a job for each that calls table_utilities main() (and, if not args.one_step, calls the workflow_utilities MAP step), and runs the job in the args.chronos location.
Parameters: args (Namespace) – args as populated namespace from parse_args; must specify --step_parameters(-p) as a ',,' separated list of 'source,alias' pairs
sanitize_utilities¶

sanitize_utilities.add_config_args(parser)
Add arguments specific to this module.
Parameters: parser (argparse.parser) – the parser to add arguments to
Returns: the parser with the arguments added
Return type: argparse.parser

sanitize_utilities.drop_duplicates_by_type_or_node(n_df, n1, n2, typ)
Drop the duplicates in the network, by type or by node.
For each set of "duplicate" edges, only the edge with the maximum weight will be kept. By type, duplicates are rows where n1, n2, and typ are identical; by node, duplicates are rows where n1 and n2 are identical.
Parameters: - n_df (list) – the data
- n1 (int) – the column for the first node
- n2 (int) – the column for the second node
- typ (int) – the column for the type
Returns: the modified data
Return type: list
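
A sketch of the max-weight deduplication, treating the data as a list of row-lists; the weight column index wgt and the typ=None convention for by-node deduplication are assumptions:

def drop_duplicates_by_type_or_node(n_df, n1, n2, typ, wgt=3):
    """Keep only the maximum-weight edge for each duplicate key."""
    best = {}
    for row in n_df:
        key = (row[n1], row[n2]) if typ is None else (row[n1], row[n2], row[typ])
        if key not in best or float(row[wgt]) > float(best[key][wgt]):
            best[key] = row
    return list(best.values())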

sanitize_utilities.make_network_undirected(n_df)
Make the network undirected; that is, the network should be symmetric, but only the edges in one direction are included, so make the edges in the other direction explicit in the network. This assumes that the first two columns are the two nodes.
Parameters: n_df (list) – the data
Returns: the modified data
Return type: list
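
A minimal sketch of the symmetrization, again treating the data as a list of row-lists:

def make_network_undirected(n_df):
    """Add the reversed copy of every edge; first two columns are the nodes."""
    reversed_edges = [[row[1], row[0]] + list(row[2:]) for row in n_df]
    return n_df + reversed_edges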

sanitize_utilities.make_network_unweighted(n_df, wgt)
Make the network unweighted by setting the weights on all the edges to the same value (1).
Parameters: - n_df (list) – the data
- wgt (int) – the weight column
Returns: the modified data
Return type: list

sanitize_utilities.normalize_network_by_type(n_df, typ, wgt)
Normalize the network.
Currently the only normalization method implemented is by type.
Parameters: - n_df (list) – the data
- typ (int) – the type column
- wgt (int) – the weight column
Returns: the modified data
Return type: list
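
The docs do not spell out the normalization formula; one plausible per-type scaling, dividing each weight by the maximum weight of its edge type, as a hedged sketch:

from collections import defaultdict

def normalize_network_by_type(n_df, typ, wgt):
    """Scale each edge weight by the maximum weight of its edge type."""
    max_by_type = defaultdict(float)
    for row in n_df:
        max_by_type[row[typ]] = max(max_by_type[row[typ]], float(row[wgt]))
    for row in n_df:
        if max_by_type[row[typ]] > 0:
            row[wgt] = float(row[wgt]) / max_by_type[row[typ]]
    return n_df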