Getting started

Requirements

Git client >= 2.1.0
Python >= 3.12 with pip and venv
Optionally, a modern package manager such as uv (recommended) or poetry

Running the `corpus` CLI tool

If you use uv, all you need to do is clone this repository and run the corpus command:

git clone <URL of this Git repository> corpus
cd corpus
uv run corpus

Otherwise, you need to install the dependencies and package first:

git clone <URL of this Git repository> corpus
cd corpus
python -m venv .venv  # Create a virtual environment
source .venv/bin/activate  # Activate the environment
pip install .  # Install dependencies declared in pyproject.toml
corpus  # Run the corpus CLI, should display a help message

To view the help page, run corpus or corpus --help.

Basic configuration

There are three files that are often needed to run corpus:

gitlab.cfg (mandatory)
filter.yaml (required when using corpus filter or corpus build)
neo4j.cfg (optional)

By default, corpus looks for these files in the ./resources/ directory. The location of each file can also be passed to corpus commands as an argument.

`gitlab.cfg` - configuration of the GitLab instance to work with

If you need help with the content of this configuration file, see the python-gitlab docs. A known issue is that execution sometimes stops with a ReadTimeout error. So far, the only workaround is to set the timeout value in the configuration file to a higher value.
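As a rough illustration of the timeout workaround, the following sketch writes a minimal gitlab.cfg in the format documented by python-gitlab. The section name mygitlab, the URL, the token placeholder, and the value of 60 seconds are assumptions for illustration only; replace them with the values for your own GitLab instance.

```shell
# Sketch only: this overwrites any existing resources/gitlab.cfg.
mkdir -p resources

cat > resources/gitlab.cfg <<'EOF'
[global]
# Name of the server section to use by default (hypothetical name)
default = mygitlab
ssl_verify = true
# Seconds to wait for a response; raise this if ReadTimeout errors occur
timeout = 60

[mygitlab]
# Placeholder URL and token, replace with your own values
url = https://gitlab.example.com
private_token = <your personal access token>
api_version = 4
EOF
```

If ReadTimeout errors persist, the timeout value can be raised further; which value works depends on the GitLab instance and the size of the crawl.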

`filter.yaml` - configure the filtering stage of corpus building

If you want to use the corpus build or corpus filter commands, you need to provide a filter file. For details on how to write one, see How to write a filter file.

`neo4j.cfg` - configure the Neo4J export functionality

To use the Neo4J export functionality, you need to create a neo4j.cfg file. For details on how to write it, see How to write the Neo4J configuration.

Information

Running corpus build or corpus extract with the --all-elements parameter may take some time. Pipelines that extract a huge number of projects and export them to Neo4J can take up to several hours. I therefore recommend running corpus extract --all-elements only once. Afterwards, use corpus filter --out=path/to/file.json. This writes the filtered result to a separate file, so the previously extracted corpus is not overwritten and you do not have to crawl all projects again every time you try a new filter.
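The extract-once, filter-many workflow described above can be sketched as follows. This assumes the virtual environment is activated so that corpus is on the PATH (with uv, prefix each command with uv run instead) and that the configuration files are in ./resources/. The FILTERED_OUT variable is only an illustrative name for the output path.

```shell
# Illustrative output path for the filtered corpus
FILTERED_OUT="path/to/file.json"

if command -v corpus >/dev/null 2>&1; then
  # Expensive step: crawl all projects once
  corpus extract --all-elements

  # Cheap step: repeat this for every new filter you want to try.
  # --out keeps the previously extracted corpus from being overwritten.
  corpus filter --out="$FILTERED_OUT"
else
  echo "corpus not found on PATH; activate the virtual environment first" >&2
fi
```

Using a different output path for each filter run keeps earlier filtered results available for comparison.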

You can find interesting templates for filters here: filter templates.
