Creating a new SrcClass

Writing the Class

The easiest way to create a new SrcClass is to modify an existing one to suit your needs. If there is already a source with the same or a similar format as the one you are trying to add, start with that one. This gives you a good starting point for the table() method, which is the most complex part of each SrcClass. SrcClasses can be found in the src/code/srcClass/ directory, and that is also where new ones should go.

The __init__() Method

To begin, you will need to modify the __init__() method to initialize some attributes that describe metadata about the source, making sure to call __init__() of the superclass:

SrcClass
class check_utilities.SrcClass(src_name, base_url, aliases, args=None)

Base class to be extended by each supported source in KnowEnG.

This SrcClass provides default functions that should be extended or overridden by any source which is added to the Knowledge Network (KN).

name

str – The name of the remote source to be included in the KN.

url_base

str – The base url of the remote source, which may need additional processing to provide an actual download link (see get_remote_url).

aliases

dict – A dictionary with subsets of the source which will be included in the KN as the keys (e.g. different species, data types, or interaction types), and a short string with information about the alias as the value.

remote_file

str – The name of the file to extract if the remote source is a directory.

version

dict – The release version of each alias in the source.

source_url

str – The website for the source.

reference

str – The citation for the source.

pmid

str – The pubmed ID for the source.

license

str – The license for the source.
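
As a rough illustration of the attributes listed above, a subclass constructor might look like the following minimal sketch. The stand-in SrcClass is defined here only so the example runs without the KN_Builder code. ExampleSource, its URLs, its alias name, and its file name are all hypothetical.

```python
class SrcClass:
    """Stand-in for check_utilities.SrcClass (assumption, for illustration)."""

    def __init__(self, src_name, base_url, aliases, args=None):
        self.name = src_name
        self.url_base = base_url
        self.aliases = aliases
        self.remote_file = ''
        self.version = dict()
        self.args = args


class ExampleSource(SrcClass):
    """Hypothetical source; every value below is made up."""

    def __init__(self, args=None):
        name = 'example'
        url_base = 'https://example.org/downloads/'
        # Keys are the subsets of the source to include, values describe them.
        aliases = {'interactions': 'all interactions from the example source'}
        # Let the superclass store the shared attributes first.
        super().__init__(name, url_base, aliases, args)
        # Then fill in the source-specific metadata.
        self.remote_file = 'interactions.txt'
        self.source_url = 'https://example.org/'
        self.reference = ''
        self.pmid = ''
        self.license = ''


src = ExampleSource()
```

Calling the superclass constructor first means the subclass only has to set the attributes that differ from the defaults.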

Other Methods

You will then need to override or modify the other methods of SrcClass if the existing ones do not work for your new source. The most important methods are:

table
SrcClass.table(raw_line, version_dict)

Uses the provided raw_lines file to produce a table file, an edge_meta file, and a node_meta file (only for property nodes).

This returns nothing but produces the table formatted files from the provided raw_lines file:

raw_lines (file, line num, line_chksum, raw_line)
table (line_cksum, n1name, n1hint, n1type, n1spec,
       n2name, n2hint, n2type, n2spec, et_hint, score)
edge_meta (line_cksum, info_type, info_desc)
node_meta (node_id,
           info_type (alt_alias, relationship, experiment, or link),
           info_desc (text))

By default this function does nothing and must be overridden.

Parameters:
  • raw_line (str) – The path to the raw_lines file
  • version_dict (dict) – A dictionary describing the attributes of the alias for a source.
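
The column layouts described above can be sketched as a small table() override. This is only an illustration under several assumptions: the raw_lines file is tab-separated, each source line carries three columns (two node names and a score), and the output path, hint, type, and species values are placeholders. ExampleSource is hypothetical, and a real class would extend SrcClass.

```python
import os
import tempfile


class ExampleSource:
    """Hypothetical source; a real class would extend SrcClass."""

    def table(self, raw_line, version_dict):
        """Write a table file for the raw_lines file at path raw_line."""
        table_path = raw_line + '.table'  # hypothetical output name
        with open(raw_line) as infile, open(table_path, 'w') as outfile:
            for line in infile:
                fields = line.rstrip('\n').split('\t')
                # Expect: file, line num, line checksum, then the three
                # source columns assumed for this sketch.
                if len(fields) < 6:
                    continue
                line_cksum = fields[2]
                n1name, n2name, score = fields[3:6]
                # Placeholder hints, types, and species for both nodes.
                row = [line_cksum,
                       n1name, 'UNKNOWN', 'gene', 'unknown',
                       n2name, 'UNKNOWN', 'gene', 'unknown',
                       'example_edge', score]
                outfile.write('\t'.join(row) + '\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, 'example.raw_line.txt')
    with open(raw_path, 'w') as handle:
        handle.write('example.txt\t1\tabc123\tGENE1\tGENE2\t0.9\n')
        handle.write('example.txt\t2\tdef456\tmalformed\n')
    ExampleSource().table(raw_path, {})
    with open(raw_path + '.table') as handle:
        table_rows = [line.rstrip('\n').split('\t') for line in handle]
```

The sketch writes only the table file. The edge_meta and node_meta files would be produced in the same loop if the source provides that information.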

get_remote_url
SrcClass.get_remote_url(alias)

Return the remote url needed to fetch the file corresponding to the alias.

This returns the url needed to fetch the file corresponding to the alias. By default this returns self.url_base.

Parameters:
  • alias (str) – An alias defined in self.aliases.
Returns: The url needed to fetch the file corresponding to the alias.
Return type: str
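
When each alias lives in its own remote file, the default can be overridden to build a per-alias link. The following is a minimal sketch: the stand-in SrcClass only mimics the default behavior described above, and ExampleSource, its URL, and its file naming scheme are hypothetical.

```python
class SrcClass:
    """Stand-in for check_utilities.SrcClass (assumption, for illustration)."""

    def __init__(self, src_name, base_url, aliases, args=None):
        self.name = src_name
        self.url_base = base_url
        self.aliases = aliases

    def get_remote_url(self, alias):
        # Default described in the text: return the base url unchanged.
        return self.url_base


class ExampleSource(SrcClass):
    """Hypothetical source with one downloadable file per alias."""

    def get_remote_url(self, alias):
        # Made-up naming scheme: <base url><alias>.txt
        return self.url_base + alias + '.txt'


base = SrcClass('base', 'https://example.org/downloads/', {'human': 'info'})
src = ExampleSource('example', 'https://example.org/downloads/',
                    {'human': 'info'})
default_url = base.get_remote_url('human')
alias_url = src.get_remote_url('human')
```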

get_source_version
SrcClass.get_source_version(alias)

Return the release version of the remote source:alias.

This returns the release version of the remote source for a specific alias. This value will be the same for every alias unless the alias can have a different release version than the source (this is source dependent). The value is stored in the self.version dictionary. If the value does not already exist, the versions of all aliases are initialized to ‘unknown’.

Parameters:
  • alias (str) – An alias defined in self.aliases.
Returns: The remote version of the source.
Return type: str
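
One possible shape for an override is to ask the superclass first and only determine the version when it is still ‘unknown’. The sketch below assumes the default behaves as described above. The stand-in SrcClass, ExampleSource, and the fixed release string are hypothetical, and a real override would look the release up from the remote source instead.

```python
class SrcClass:
    """Stand-in for check_utilities.SrcClass (assumption, for illustration)."""

    def __init__(self, src_name, base_url, aliases, args=None):
        self.name = src_name
        self.url_base = base_url
        self.aliases = aliases
        self.version = dict()

    def get_source_version(self, alias):
        # Default described in the text: initialize every alias to
        # 'unknown' if no version has been stored yet.
        if alias not in self.version:
            for alias_name in self.aliases:
                self.version[alias_name] = 'unknown'
        return self.version[alias]


class ExampleSource(SrcClass):
    """Hypothetical source whose aliases all share one release."""

    def get_source_version(self, alias):
        version = super().get_source_version(alias)
        if version == 'unknown':
            # Placeholder: a real override would fetch or parse the release
            # from the remote source here.
            for alias_name in self.aliases:
                self.version[alias_name] = 'release-1'
            version = self.version[alias]
        return version


aliases = {'human': 'info', 'mouse': 'info'}
default_version = SrcClass('base', 'https://example.org/', aliases) \
    .get_source_version('human')
src = ExampleSource('example', 'https://example.org/', aliases)
example_version = src.get_source_version('human')
```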

get_aliases
SrcClass.get_aliases(args=Namespace(...))

Helper function for producing the alias dictionary.

This returns a dictionary where alias names are the keys and alias info is the value. It uses the species-specific information for the Knowledge Network build to fetch all matching species-specific aliases from the source. That information is produced by ensembl.py during setup utilities and is located at cf.DEFAULT_MAP_PATH/species/species.json.

Parameters:
  • args (Namespace) – args as populated namespace or ‘None’ for defaults
Returns: A dictionary of species:(taxid, division) values
Return type: dict

Most of the time the defaults can be used for the other methods of the SrcClass subclass.

Testing and Running the Class

If you want to run the code using Docker (the official method), you can build a local Docker image that includes your code:

docker build path/to/KN_Builder/src/ -t knoweng/kn_builder:latest

If you do so, remember to remove the image before trying to run the official release again.

You can then run the pipeline as normal, making sure to specify your source in the arguments:

cd path/to/Knownet_Pipeline_Tools/
make SOURCES=<source>