System Requirements
===================

To run the Knowledge Network Build Pipeline with the KN_Builder, you must have three tools installed:

1) make
2) Docker
3) Docker Compose: https://docs.docker.com/compose/install/#install-compose

Your system must also meet the following requirements:

a) Minimum of 4 CPUs, 16GB RAM, 2TB disk
b) Must not have Mesos/Zookeeper/Chronos/Marathon or anything else running at their default ports (5050/2181/8888/8080)

Quick Start
===========

First, check out the quick run repo:

.. code::

    git clone https://github.com/KnowEnG/KnowNet_Pipeline_Tools
    cd KnowNet_Pipeline_Tools

.. note::
    Depending on your setup, some of the following commands may require root.
    This is because Docker, by default, does not allow non-root processes to start jobs.
    In addition, the jobs run as root inside Docker, so all output and intermediate files will be owned by root.

Then, running the pipeline is as simple as running :code:`make`:

.. code::

    make knownet

This will start up our Mesos environment and then run the build pipeline for all officially supported species and sources.

Overview of Build Pipeline
==========================

The make command will produce a large amount of output. First, it will show the status of starting up Mesos and Chronos, and then of starting up the databases. After it finishes that phase, it will start the build pipeline and periodically print the status of the pipeline. It should return when either an error occurs or the pipeline finishes running.

The build pipeline consists of several stages:

1) SETUP: Downloads and imports Ensembl and sets up gene mapping information.
2) CHECK: Downloads and processes the rest of the sources. This consists of several substeps.

   a) fetch: Downloads the source data files.
   b) table: Takes the source files and reformats them into our table file format.
   c) map: Maps the identifiers in the source to our internal identifiers.

3) IMPORT: Imports all of the files into the MySQL and Redis databases.
4) EXPORT: Exports the Knowledge Network into flat files and dumps the MySQL and Redis databases.

Output Files
------------

Running the pipeline will create several directories:

========== =========================================
Directory  Contents
========== =========================================
kn-final   Stores the final processed output files.
kn-logs    Stores the log files.
kn-rawdata Stores the downloaded and processed data.
kn-mysql   Stores the MySQL database.
kn-redis   Stores the Redis database.
========== =========================================

Information about the output and intermediate file and database formats can be found :ref:`here `.

Clean Up After Build
====================

To clean up the files (except :code:`kn-logs` and :code:`kn-final`), as well as Chronos, Marathon, and Mesos, run:

.. code::

    make clean
    make destroy

Primary Parameters
==================

To build the Knowledge Network for only a subset of species or sources, you can specify them as :code:`,,`-separated variables, like so:

.. code::

    make knownet SPECIES=homo_sapiens,,mus_musculus SOURCES=kegg,,stringdb

SPECIES names should be all lowercase, with spaces replaced by underscores. The possible SOURCES names can be found here: SrcClasses_

Additional Resources
====================

1) Summary_ of Current Knowledge Network Contents.
2) Details_ of Current Knowledge Network Contents.
3) List of Related Knowledge Network Tools_.

.. _Summary: https://knoweng.org/kn-overview/
.. _Details: https://knoweng.org/kn-data-references/
.. _Tools: https://knoweng.org/kn-tools/
.. _SrcClasses: https://github.com/KnowEnG/KN_Builder/tree/master/src/code/srcClass

Basic Troubleshooting
=====================

If you run into errors when building the Knowledge Network, you can look at the status of all remaining jobs on Chronos:

.. code::

    curl -L -s -X GET 127.0.0.1:8888/scheduler/graph/csv | grep node, | \
        awk -F, '{print $3"\t"$4"\t"$1"\t"$2}' | sort | uniq | grep -v succ

For any failed job (e.g. JOBNAME), you can look at:

1) the original Chronos command at: :code:`kn-logs/chronos_jobs/JOBNAME.json`, or
2) the captured output log at: :code:`kn-logs/JOBNAME.json`.

These may tell you why the job is failing. If the original source has changed its format, you can rerun the pipeline using the SOURCES parameter, specifying all sources except the problematic ones.
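
Typing out a long :code:`,,`-separated SOURCES list by hand is error-prone, so a small helper can build it for you. The following is a minimal sketch, not part of the pipeline tools: the function name :code:`build_sources` is hypothetical, and :code:`broken_source` is a placeholder for whichever source is failing. The real source names should be taken from the SrcClasses directory linked above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with KnowNet_Pipeline_Tools):
# prints every source from the first list that is not in the second list,
# joined with the ",," separator that the make variables expect.
# Usage: build_sources "space-separated sources" "space-separated exclusions"
build_sources() {
    local all="$1"
    local excluded="$2"
    local result=""
    local src ex skip
    for src in $all; do
        skip=0
        for ex in $excluded; do
            if [ "$src" = "$ex" ]; then
                skip=1
                break
            fi
        done
        if [ "$skip" -eq 0 ]; then
            if [ -z "$result" ]; then
                result="$src"
            else
                result="$result,,$src"
            fi
        fi
    done
    printf '%s\n' "$result"
}

# Placeholder lists: kegg and stringdb appear in the example above;
# broken_source stands in for the source whose format changed.
ALL_SOURCES="kegg stringdb broken_source"
FAILED_SOURCES="broken_source"

# Print the command rather than running it, so it can be reviewed first.
echo "make knownet SOURCES=$(build_sources "$ALL_SOURCES" "$FAILED_SOURCES")"
```

The script only prints the resulting :code:`make` command; copy it into the shell once the list looks right.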