Getting started

Requirements

  • Git client >= 2.1.0
  • Python >= 3.12 with pip and venv
  • Optionally, a modern package manager such as uv (recommended) or poetry
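
The minimum versions above can be checked before cloning. The following is only a sketch: version_ge is a hypothetical helper (not part of corpus), it assumes a sort that supports -V (GNU coreutils or a recent BSD/macOS), and it assumes the interpreter is available as python3.

```shell
# version_ge A B succeeds when version A is greater than or equal to version B.
# Hypothetical helper; relies on "sort -V" for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Check the Git client (requirement: >= 2.1.0).
if command -v git >/dev/null 2>&1; then
  git_version=$(git --version | awk '{print $3}')
  if version_ge "$git_version" 2.1.0; then
    echo "git $git_version: OK"
  else
    echo "git $git_version: too old, need >= 2.1.0"
  fi
else
  echo "git: not found"
fi

# Check Python (requirement: >= 3.12).
if command -v python3 >/dev/null 2>&1; then
  py_version=$(python3 -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])')
  if version_ge "$py_version" 3.12.0; then
    echo "python $py_version: OK"
  else
    echo "python $py_version: too old, need >= 3.12"
  fi
else
  echo "python3: not found"
fi
```

A plain string comparison would rank 3.9 above 3.12, which is why the sketch uses version-aware sorting.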

Running the corpus CLI tool

If you use uv, all you need to do is clone this repository and run the corpus command:

git clone <URL of this Git repository> corpus
cd corpus
uv run corpus

Otherwise, you need to install the dependencies and package first:

git clone <URL of this Git repository> corpus
cd corpus
python -m venv .venv  # Create a virtual environment
source .venv/bin/activate  # Activate the environment
pip install .  # Install dependencies declared in pyproject.toml
corpus  # Run the corpus CLI, should display a help message

To view the help page, run corpus or corpus --help.

Basic configuration

There are three files that are often needed to run corpus:

  1. gitlab.cfg (mandatory)
  2. filters.yaml (required when using corpus filter or corpus build)
  3. neo4j.cfg (optional)

By default, corpus looks for these files in the ./resources/ directory. The locations of these files can also be passed to corpus commands as arguments.
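
To see which of these files are already present in the default location, a small check like the following may help. It is only a sketch: check_resources is a hypothetical helper, not part of the corpus CLI, and it uses the file names from the list above.

```shell
# check_resources is a hypothetical helper, not part of the corpus CLI.
# It reports which of the three configuration files exist in a directory
# (default: ./resources, the directory corpus searches by default).
check_resources() {
  dir="${1:-resources}"
  for f in gitlab.cfg filters.yaml neo4j.cfg; do
    if [ -f "$dir/$f" ]; then
      echo "found:   $dir/$f"
    else
      echo "missing: $dir/$f"
    fi
  done
}

# Run from the root of the cloned repository.
check_resources
```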

gitlab.cfg - configuration of the GitLab instance to work with

If you need help with the content of this configuration file, see the python-gitlab docs. A known issue is that execution sometimes stops with a ReadTimeout error. So far, the only workaround is to set the timeout value in the configuration file to a higher value.
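
As an illustration, a minimal gitlab.cfg in the python-gitlab configuration format with a raised timeout could be created as follows. The section name mygitlab, the URL, and the timeout of 60 seconds are assumptions chosen for this sketch; replace them with the values for your instance.

```shell
# Writes an example configuration to the default location.
# Caution: this overwrites an existing resources/gitlab.cfg.
mkdir -p resources
cat > resources/gitlab.cfg <<'EOF'
[global]
# Name of the section below that is used by default (placeholder name)
default = mygitlab
ssl_verify = true
# Timeout in seconds; raise this value if you run into ReadTimeout errors
timeout = 60

[mygitlab]
url = https://gitlab.example.com
private_token = <your personal access token>
api_version = 4
EOF
```

If ReadTimeout errors persist, increasing the timeout value further is the only workaround described above.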

filters.yaml - configure the filtering stage of corpus building

If you want to use the corpus build or corpus filter commands, you need to specify a filter file. For more information on how to write a filter file, see How to write a filter file.

neo4j.cfg - configure the Neo4j export functionality

To use the Neo4j export functionality, you need to create a neo4j.cfg file. For more information on how to write the Neo4j configuration file, see How to write the Neo4j configuration.

Information

If you use corpus build or corpus extract with the --all-elements parameter, the run may take some time. Pipelines that extract a huge number of projects and export them to Neo4j, in particular, can take up to several hours. I therefore recommend running corpus extract --all-elements only once. Afterwards, you can use corpus filter --out=path/to/file.json. This prevents your previously extracted corpus from being overwritten, as you probably do not want to crawl all projects again every time you try a new filter.

You can find interesting templates for filters here: filter templates.