Getting started
Requirements
- Git client >= 2.1.0
- Python >= 3.12 with pip and venv
- Optionally, a modern package manager (`uv` (recommended), `poetry`, or similar)
Running the corpus CLI tool
If you use `uv`, all you need to do is clone this repository and run the `corpus` command:
```shell
git clone <URL of this Git repository> corpus
cd corpus
uv run corpus
```
Otherwise, you need to install the package and its dependencies first:
```shell
git clone <URL of this Git repository> corpus
cd corpus
python -m venv .venv         # Create a virtual environment
source .venv/bin/activate    # Activate the environment
pip install .                # Install dependencies declared in pyproject.toml
corpus                       # Run the corpus CLI; should display a help message
```
To view the help page, run `corpus` or `corpus --help`.
Basic configuration
There are three files that are often needed to run corpus:
- `gitlab.cfg` (mandatory)
- `filters.yaml` (required when using `corpus filter` or `corpus build`)
- `neo4j.cfg` (optional)
By default, corpus looks for these files in the `./resources/` directory.
The locations of these files can also be passed to corpus commands as arguments.
`gitlab.cfg` - configuration of the GitLab instance to work with
If you need help with the content of that configuration file, see the
python-gitlab docs.
It is a known bug that execution sometimes stops with a ReadTimeout
error. So far, the only workaround is to set the timeout value in the
configuration file to a higher value.
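The python-gitlab configuration is a standard INI file whose `[global]` section accepts a `timeout` option, so raising it there is the workaround. The sketch below shows such a file with a raised timeout, parsed with the standard-library `configparser` just to illustrate the structure; the section name `my-gitlab`, the URL, and the token are placeholders, not values from this project:

```python
import configparser

# Sketch of a gitlab.cfg with a raised timeout. The server section name,
# URL, and token below are placeholders - fill in your own instance.
GITLAB_CFG = """
[global]
default = my-gitlab
ssl_verify = true
; Raise this value if execution stops with a ReadTimeout error.
timeout = 60

[my-gitlab]
url = https://gitlab.example.com
private_token = <your-access-token>
"""

config = configparser.ConfigParser()
config.read_string(GITLAB_CFG)

# The timeout python-gitlab would pick up from this file:
print(config.getint("global", "timeout"))
```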
`filters.yaml` - configure the filtering stage of corpus building
If you want to use the `corpus build` or `corpus filter` commands,
you must specify a filter file. For more information on how to write
a filter file, read here: How to write a filter file.
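The actual schema is defined in the guide linked above; purely as an illustration of the kind of criteria such a file might express, a hypothetical sketch (every key below is invented, not the real schema):

```yaml
# Hypothetical filters.yaml - consult "How to write a filter file"
# for the actual schema; these keys are illustrative only.
filters:
  min_stars: 10
  languages:
    - Python
```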
`neo4j.cfg` - configure the Neo4J export functionality
To use the Neo4J export functionality, you need to create a `neo4j.cfg`
file. For more information on how to write the Neo4J configuration
file, read here: How to write the Neo4J configuration.
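Such a file typically carries the connection details for the Neo4J instance. A hypothetical sketch (the section and option names here are assumptions; follow the linked guide for the actual keys):

```ini
; Hypothetical neo4j.cfg - see "How to write the Neo4J configuration"
; for the real format; section and key names are illustrative only.
[NEO4J]
uri = bolt://localhost:7687
user = neo4j
password = <your-password>
```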
Information
If you use `corpus build` or `corpus extract` with the parameter
`--all-elements`, it may take some time; in particular, pipelines that
extract a huge number of projects and export them to Neo4J can take up
to several hours. So I really recommend that you run
`corpus extract --all-elements` only once. Afterwards, you can
use `corpus filter --out=path/to/file.json`. This prevents your
previously extracted corpus from being overwritten, as you probably do
not want to crawl all projects again every time you try a new filter.
You can find interesting templates for filters here: filter templates.