System Requirements

To run the Knowledge Network Build Pipeline with KN_Builder, you must have three tools installed:
  1. make
  2. Docker
  3. Docker Compose: https://docs.docker.com/compose/install/#install-compose
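
To confirm the tools are installed, you can print their versions (with Compose v2, the last command is docker compose version instead):

make --version
docker --version
docker-compose --version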
Your system must also meet the following requirements:
  1. Minimum of 4 CPUs, 16GB RAM, 2TB disk
  2. Nothing may be running on the default ports used by Mesos (5050), Zookeeper (2181), Chronos (8888), or Marathon (8080); a quick check is sketched below
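
As a sanity check, you can verify that nothing is listening on those ports (a sketch assuming a Linux host with the ss utility):

ss -ltn | grep -E ':(5050|2181|8888|8080)\b' || echo "Ports appear free."

If the grep prints any lines, stop the conflicting services before continuing.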

Quick Start

First, clone the quick-run repository:

git clone https://github.com/KnowEnG/KnowNet_Pipeline_Tools
cd KnowNet_Pipeline_Tools

Note

Depending on your setup, some of the following commands may require root, because by default Docker does not allow non-root users to start containers. In addition, the jobs run as root inside Docker, so all output and intermediate files will be owned by root.
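
If you would rather not prefix every command with sudo, the standard Docker workaround is to add your user to the docker group (you must log out and back in for the change to take effect):

sudo usermod -aG docker "$USER"

Note that files written by the jobs will still be owned by root.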

Then, running the pipeline is as simple as running make:

make knownet

This will start up our Mesos environment and then run the build pipeline for all officially supported species and sources.
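
Once the startup phase settles, you can confirm the core services are responding. This sketch uses the ports from the requirements above and the standard Mesos and Chronos API endpoints:

curl -sf 127.0.0.1:5050/health && echo "Mesos is up"
curl -sf 127.0.0.1:8888/scheduler/jobs > /dev/null && echo "Chronos is up"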

Overview of Build Pipeline

The make command will produce a large amount of output. First it shows the status of Mesos and Chronos starting up, then of the databases. After that phase finishes, it starts the build pipeline and periodically prints the pipeline's status. It returns when either an error occurs or the pipeline finishes running.
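
If you would like a live count of unfinished jobs while the pipeline runs, you can poll the same Chronos endpoint used in Basic Troubleshooting below (a sketch; adjust the interval to taste):

watch -n 60 'curl -L -s -X GET 127.0.0.1:8888/scheduler/graph/csv | grep node, | grep -vc succ'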

The build pipeline consists of several stages:

  1. SETUP: Downloads and imports Ensembl and sets up gene mapping information.
  2. CHECK: Downloads and processes the rest of the sources. This consists of several substeps.
    1. fetch: Downloads the source data files.
    2. table: Takes the source files and reformats them into our table file format.
    3. map: Maps the identifiers in the source to our internal identifiers.
  3. IMPORT: Imports all of the files into the MySQL and Redis databases.
  4. EXPORT: Exports the Knowledge Network into flat files and dumps the MySQL and Redis databases.

Output Files

Running the pipeline will create several directories:

Directory    Contents
kn-final     Stores the final processed output files.
kn-logs      Stores the log files.
kn-rawdata   Stores the downloaded and processed data.
kn-mysql     Stores the MySQL database.
kn-redis     Stores the Redis database.
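
After a successful run, a quick way to sanity-check the results is to list the final outputs and see how much space each directory uses:

ls kn-final
du -sh kn-final kn-logs kn-rawdata kn-mysql kn-redis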

Information about the output and intermediate file and database formats can be found here.

Clean Up After Build

To clean up the files (except kn-logs and kn-final), as well as Chronos, Marathon, and Mesos, run:

make clean
make destroy
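
Afterwards, you can verify what remains:

ls -d kn-*

If the cleanup succeeded, this should show little beyond kn-logs and kn-final.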

Primary Parameters

To build the Knowledge Network for only a subset of species or sources, specify them as double-comma (,,) separated lists, like so:

make knownet SPECIES=homo_sapiens,,mus_musculus SOURCES=kegg,,stringdb

SPECIES names should be all lowercase, with spaces replaced by underscores.
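
For example, a valid SPECIES value can be derived from a scientific name with standard shell tools:

echo "Mus musculus" | tr '[:upper:]' '[:lower:]' | tr ' ' '_'

This prints mus_musculus, matching the form used above.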

The possible SOURCES names can be found here: SrcClasses

Additional Resources

  1. Summary of Current Knowledge Network Contents.
  2. Details of Current Knowledge Network Contents.
  3. List of Related Knowledge Network Tools.

Basic Troubleshooting

If you run into errors when building the Knowledge Network, you can check the status of all remaining jobs on Chronos:

curl -L -s -X GET 127.0.0.1:8888/scheduler/graph/csv | grep node, | \
  awk -F, '{print $3"\t"$4"\t"$1"\t"$2}' | sort | uniq | grep -v succ

For any failed job (e.g., JOBNAME), you can look at:
  1. the original Chronos command at: kn-logs/chronos_jobs/JOBNAME.json or
  2. the captured output log at: kn-logs/JOBNAME.json.
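
For instance, to inspect both files for a given job (substitute the actual job name for JOBNAME):

cat kn-logs/chronos_jobs/JOBNAME.json
less kn-logs/JOBNAME.json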

These may reveal why the job is failing. If the original source has changed its format, you can rerun using the SOURCES parameter, specifying all sources except the problematic ones, as in the example below.
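
For example, if one source is failing, list every other source you need and omit the failing one (the source names here are illustrative; see SrcClasses for the full list):

make knownet SPECIES=homo_sapiens SOURCES=kegg,,stringdb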