Getting started
Requirements
- Git client >= 2.1.0
- Python >= 3.12 with pip and venv
- Optionally, a modern package manager (uv (recommended), poetry, or similar)
Running the corpus CLI tool
If you use uv, all you need to do is clone this repository and run the corpus command:
git clone <URL of this Git repository> corpus
cd corpus
uv run corpus
Otherwise, you need to install the dependencies and package first:
git clone <URL of this Git repository> corpus
cd corpus
python -m venv .venv # Create a virtual environment
source .venv/bin/activate # Activate the environment
pip install . # Install dependencies declared in pyproject.toml
corpus # Run the corpus CLI, should display a help message
To view the help page, simply use corpus or corpus --help.
Basic configuration
There are three files that are often needed to run corpus:
- gitlab.cfg (mandatory)
- filters.yaml (required when using corpus filter or corpus build)
- neo4j.cfg (optional)
By default, corpus looks for these files in the ./resources/ directory.
The locations of these files can also be passed to corpus commands as arguments.
gitlab.cfg - configuration of the GitLab instance to work with
If you need help with the content of that configuration file, read the
docs here: python-gitlab docs.
It is a known issue that execution sometimes stops with a ReadTimeout
error. So far, the only workaround is to set the timeout value in the
configuration file to a higher value.
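A minimal sketch of raising the timeout, assuming gitlab.cfg follows the python-gitlab configuration layout (a [global] section whose timeout key is given in seconds). The section name my-gitlab, the URL, the token, and the value 60 are placeholders, and raise_timeout is a hypothetical helper written for this illustration; it is not part of corpus.

```python
import configparser
import io

# Hypothetical gitlab.cfg content. The section name, URL, and token are
# placeholders; replace them with the values for your GitLab instance.
EXAMPLE_CFG = """\
[global]
default = my-gitlab
ssl_verify = true
timeout = 5

[my-gitlab]
url = https://gitlab.example.com
private_token = <your access token>
api_version = 4
"""


def raise_timeout(cfg_text: str, seconds: int) -> str:
    """Return cfg_text with the [global] timeout set to `seconds`."""
    # Interpolation is disabled so that a '%' in a token is kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("global"):
        parser.add_section("global")
    parser.set("global", "timeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Print the configuration with a larger timeout (60 is an arbitrary example).
    print(raise_timeout(EXAMPLE_CFG, 60))
```

In practice it is usually enough to edit the timeout line in gitlab.cfg by hand; the helper only shows which key is meant and where it lives.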
filters.yaml - configure the filtering stage of corpus building
If you want to use the corpus build or corpus filter commands, you need
to specify a filter file. For more information on how to write a filter
file, see: How to write a filter file.
neo4j.cfg - configure the Neo4J export functionality
To use the Neo4J export functionality, you need to create a neo4j.cfg
file. For more information on how to write the Neo4J configuration file,
see: How to write the Neo4J configuration.
Information
If you use corpus build or corpus extract with the --all-elements
parameter, it may take some time. Pipelines that extract a huge number
of projects and export them to Neo4J, in particular, can take up to
several hours. I therefore recommend running corpus extract --all-elements
only once. Afterwards, you can use corpus filter --out=path/to/file.json.
This prevents your previously extracted corpus from being overwritten,
as you probably do not want to crawl all projects again every time you
try a new filter.
You can find interesting templates for filters here: filter templates.