Skip to content

Getting Started

Requirements

Name Installation Purpose
Python ^3.11 Download The library is Python-based.

Install Dependencies

pip install .

Set Up Your Workspace Variables

  • config.ini contains the environment variables and settings required to run the pipeline. You can modify this file to change the settings for the pipeline.

Quickstart Guide

If you just want to get started quickly, follow the steps provided here. When wanting to dive deeper, read Initialization instead.

  1. Download the repo onto a machine running docker
  2. Configure the following settings in the config file:

    • Point [general] data_dir to your data or move the data into a subfolder /dir of this folder
    • Change [arangodb] url to use the ip of your machine. A docker container will start the database on port 13401
    • Change [database] host to the ip of your machine. A docker container will start a postgres db on port 13404
    • Change [elastic] url to use the ip of your machine. A docker container will start elastic db on port 13403
    • Change [LLM] base_url to an Ollama endpoint of your choice. This service will not be started when running this setup. For small datasets, you might use https://ollama.nimbus.dlr.de/ollama (the url of the DLR-intern Ollama nimbus server)
    • If needed, insert an api_key into [LLM] static_api_key. When using the nimbus server, this is required
  3. Make sure, that the specified Ollama service runs the model llama3.2:latest (recommended) or switch to another model by changing [LLM] model_name

  4. Provide file read and write rights to the database data directory, so the docker containers can create needed files. This can be done with chmod a+rw db/* -R
  5. Run ./run.sh init in the current directory. This starts the data initialization and may take multiple days. To check the status of the initialization, read the inizialization log, created on startup in the resources folder
  6. Once the initialization is completed - check the initialization log for a >>> Initialization completed! message at the end - shutdown any remaining docker resources by running ./run.sh down

Initialization with docker

GARAG & Naive GraphRAG

Before running the Initialization, follow these steps to make sure, that everything is setup:

  1. Make sure the data is in the right format:
    • This project requires a single folder as an input for the data
    • This folder may contain any number of files or folders
    • Each folder may include multiple subfolders, zip archives and documents
    • For a list of valid file types look at valid file types
  2. Check the docker setup to follow your guidelines:
  3. Check your Ollama provider:
    • Either access an already existing ollama service, or create a new one to be used for this project
    • Make sure, the Ollama instance is running
    • We recommend the model llama3.2:latest. See the documentation for installation steps.
  4. Check the config file:
    • Point [general] data_dir to your data or move the data into a subfolder /dir of this folder
    • Change [arangodb] url to the correct url. The default port is 13401
    • Change [elastic] url to use the ip of your machine. The default port is 13403
    • Change [LLM] base_url to the correct url of the provider. See the LLM section for more information
    • Change [LLM] model_name to the correct model name. For Llama3.2 this defaults to llama3.2:latest
    • When using an Ollama instance with an API key access restriction, input your key into [LLM] static_api_key
  5. Provide file read and write rights to the database data directory, so the docker containers can create needed files. This can be done with chmod a+rw db/* -R
  6. Run the setup script:
    • Run ./run.sh init. This script might take multiple days to complete. Just be patient
    • Any errors will be printed into the initialization log file
    • After completion, run ./run.sh down to remove any leftover docker resources

An in depth description of all values in the config file is provided in the section Config.ini.

Initialization without docker

GARAG & Naive GraphRAG

Before running the Initialization, follow these steps to make sure, that everything is setup:

  1. Make sure the data is in the right format:
    • This project requires a single folder as an input for the data
    • This folder may contain any number of files or folders
    • Each folder may include multiple subfolders, zip archives and documents
    • For a list of valid file types look at valid file types
  2. Start database services:
  3. Check your Ollama provider:
    • Either access an already existing ollama service, or create a new one to be used for this project
    • Make sure, the Ollama instance is running
    • We recommend the model llama3.2:latest. See the documentation for installation steps.
  4. Check the config file:
    • Point [general] data_dir to your data or move the data into a subfolder /dir of this folder
    • Change [arangodb] url to the correct url
    • Change [arangodb] db_name to a fitting name. This db will store the provided data in various formats
    • Change [elastic] url to use the ip of your machine
    • Change [elastic] RAG_index_name & GARAG_index_name to a fitting name. These indices will be used for RAG and GARAG retrieval
    • Change [LLM] url to the correct url of the provider. See the LLM section for more information
    • Change [LLM] model_name to the correct model name. For Llama3.2 this defaults to llama3.2:latest
    • When using an Ollama instance with an API key access restriction, input your key into [LLM] static_api_key
  5. Install the required python dependencies:
    • Setup a new conda environment with python >= 3.10
    • Install the dependencies, listed in requirements.txt
  6. Run the setup script:
    • Run python Quickstart.py
    • This script might take multiple days to complete. Just be patient
    • Any errors will be printed into the initialization log file

An in depth description of all values in the config file is provided in the section Config.ini.

Config.ini

The config.ini file controlls the entire projekt. Each value is directly used by the programm. This is a list of all the values, their meaning and their default value.

The config file itself can be found here. An example config file of a woring setup can be seen here.

general

General settings affecting core parts of the program - data_dir (path): The path to the folder containing the data, that will be used for the chatbot. This value has to be set by the user, when running the script KG_1_LoadData.py during initialization. Default: not set

  • parallel_limit (int): The maximum amount of threads running in parallel during the program. Also represents the maximum number of threads simoultaniously waiting for a response from a large language model. Default: 8

  • default_rag_method (str): RAG method used if the RetrievalRequest does not specify one. Can be set to any RetrievalMethodId. Default: "GARAG#783493"

  • default_depth (int): Default depth used if the RetrievalRequest does not specify one and the RAG method requires a depth parameter. Default: 1

security

  • ssl_cert_path (path): The path to the certification file for https encryption. When using http, leave this value empty. Default: not set

  • ssl_key_path (path): The path to the key file for https encryption. When using http, leave this value empty. Default: not set

arangodb

General settings to access the Arango database - username (str): The name of the user being used to manage the database. This user has to have read, write and collection and graph create access. Default: not set

  • password (str): If set, this password will be used to register as the user on the ArangoDB. If None, the user will be asked to enter a password at the start of the program execution (Only works during initialization without docker). Default: not set

  • url (url): The url that will be used to access the ArangoDB. Default: not set

database

General settings to access the Postgres database

  • username (str): Username for the database login. Default: postgres
  • password (str): Username for the database login. Default: root
  • host (hostname): Domain used for accessing the database. Default: not set
  • port (int): Port used for accessing the database. Default: not set
  • database_name (str): Name of the database. Default: postgres

elastic

  • url (url): The url used to store the index data at. Default: not set

LLM (index/query)

Settings controlling the large language model used Settings controlling the large language model used by the program..

  • base_url (str): The url used to communicate with the llm.
  • model_name (str): The name of the model used when communicating with an Ollama server.
  • api_key (str): When provided, this api key will be used for all Ollama llm requests.
  • options (dict): llm configuration

Valid file types

This is a list of al file types, recognised by KG_1_LoadData.py:

  • pdf
  • docx
  • txt
  • md

Files not included during reading:

  • Files starting with ~$...: These files are usually temporary files used, while the file is open and thus don't provide any information and are ignored.

All other file types raise a warning, which may be examined by the user afterwards in the created log file.

Dive Deeper