# Getting Started

## Requirements
| Name | Installation | Purpose |
|---|---|---|
| Python ^3.11 | Download | The library is Python-based. |
## Install Dependencies

## Set Up Your Workspace Variables

`config.ini` contains the environment variables and settings required to run the pipeline. You can modify this file to change the settings for the pipeline.
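For orientation, a minimal `config.ini` skeleton might look like the following. The section and key names match those referenced throughout this guide; the values are placeholders for a local Docker setup, not working defaults:

```ini
[general]
data_dir = /path/to/your/data
parallel_limit = 8

[arangodb]
username = root
password =
url = http://127.0.0.1:13401

[database]
username = postgres
password = root
host = 127.0.0.1
port = 13404

[elastic]
url = http://127.0.0.1:13403

[LLM]
base_url = https://ollama.nimbus.dlr.de/ollama
model_name = llama3.2:latest
static_api_key =
```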
## Quickstart Guide

If you just want to get started quickly, follow the steps provided here. If you want to dive deeper, read Initialization instead.
- Download the repo onto a machine running Docker
- Configure the following settings in the config file:
  - Point `[general] data_dir` to your data, or move the data into a subfolder `/dir` of this folder
  - Change `[arangodb] url` to use the IP of your machine. A Docker container will start the database on port `13401`
  - Change `[database] host` to the IP of your machine. A Docker container will start a Postgres database on port `13404`
  - Change `[elastic] url` to use the IP of your machine. A Docker container will start an Elastic database on port `13403`
  - Change `[LLM] base_url` to an Ollama endpoint of your choice. This service will not be started by this setup. For small datasets, you might use `https://ollama.nimbus.dlr.de/ollama` (the URL of the DLR-internal Ollama Nimbus server)
  - If needed, insert an API key into `[LLM] static_api_key`. This is required when using the Nimbus server
- Make sure that the specified Ollama service runs the model `llama3.2:latest` (recommended), or switch to another model by changing `[LLM] model_name`
- Provide read and write permissions on the database data directory so the Docker containers can create the files they need. This can be done with `chmod a+rw db/* -R`
- Run `./run.sh init` in the current directory. This starts the data initialization and may take multiple days. To check the status of the initialization, read the initialization log created on startup in the resources folder
- Once the initialization is completed (check the initialization log for a `>>> Initialization completed!` message at the end), shut down any remaining Docker resources by running `./run.sh down`
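The Ollama model check above can also be done programmatically: Ollama lists its installed models at the `/api/tags` endpoint. The sketch below is an illustrative helper (not project code) assuming the standard Ollama HTTP API:

```python
import json
import urllib.request

def model_available(tags_json: dict, model_name: str) -> bool:
    """Return True if model_name appears in an Ollama /api/tags response."""
    return any(m.get("name") == model_name for m in tags_json.get("models", []))

def check_ollama(base_url: str, model_name: str = "llama3.2:latest") -> bool:
    """Query the Ollama server's model list and check for the required model."""
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/api/tags") as resp:
        return model_available(json.load(resp), model_name)

# Example (requires a reachable server; use your [LLM] base_url):
# check_ollama("http://localhost:11434")
```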
## Initialization with Docker

### GARAG & Naive GraphRAG

Before running the initialization, follow these steps to make sure everything is set up:
- Make sure the data is in the right format:
  - This project requires a single folder as input for the data
  - This folder may contain any number of files or folders
  - Each folder may include multiple subfolders, zip archives and documents
  - For a list of valid file types, see Valid file types
- Check that the Docker setup follows your guidelines:
  - Check that all Dockerfiles follow your local naming conventions
  - Check the ports in the db compose file
- Check your Ollama provider:
  - Either access an already existing Ollama service, or create a new one to be used for this project
  - Make sure the Ollama instance is running
  - We recommend the model `llama3.2:latest`. See the documentation for installation steps.
- Check the config file:
  - Point `[general] data_dir` to your data, or move the data into a subfolder `/dir` of this folder
  - Change `[arangodb] url` to the correct URL. The default port is `13401`
  - Change `[elastic] url` to use the IP of your machine. The default port is `13403`
  - Change `[LLM] base_url` to the correct URL of the provider. See the LLM section for more information
  - Change `[LLM] model_name` to the correct model name. For `Llama3.2` this defaults to `llama3.2:latest`
  - When using an Ollama instance with API-key access restriction, input your key into `[LLM] static_api_key`
- Provide read and write permissions on the database data directory so the Docker containers can create the files they need. This can be done with `chmod a+rw db/* -R`
- Run the setup script:
  - Run `./run.sh init`. This script might take multiple days to complete, so be patient
  - Any errors will be printed to the initialization log file
  - After completion, run `./run.sh down` to remove any leftover Docker resources
- Run
An in-depth description of all values in the config file is provided in the section Config.ini.
## Initialization without Docker

### GARAG & Naive GraphRAG

Before running the initialization, follow these steps to make sure everything is set up:
- Make sure the data is in the right format:
  - This project requires a single folder as input for the data
  - This folder may contain any number of files or folders
  - Each folder may include multiple subfolders, zip archives and documents
  - For a list of valid file types, see Valid file types
- Start the database services:
  - Start an ArangoDB instance
  - Start an Elasticsearch instance
- Check your Ollama provider:
  - Either access an already existing Ollama service, or create a new one to be used for this project
  - Make sure the Ollama instance is running
  - We recommend the model `llama3.2:latest`. See the documentation for installation steps.
- Check the config file:
  - Point `[general] data_dir` to your data, or move the data into a subfolder `/dir` of this folder
  - Change `[arangodb] url` to the correct URL
  - Change `[arangodb] db_name` to a fitting name. This database will store the provided data in various formats
  - Change `[elastic] url` to use the IP of your machine
  - Change `[elastic] RAG_index_name` & `GARAG_index_name` to fitting names. These indices will be used for RAG and GARAG retrieval
  - Change `[LLM] url` to the correct URL of the provider. See the LLM section for more information
  - Change `[LLM] model_name` to the correct model name. For `Llama3.2` this defaults to `llama3.2:latest`
  - When using an Ollama instance with API-key access restriction, input your key into `[LLM] static_api_key`
- Install the required Python dependencies:
  - Set up a new conda environment with `python >= 3.10`
  - Install the dependencies listed in `requirements.txt`
- Run the setup script:
  - Run `python Quickstart.py`
  - This script might take multiple days to complete, so be patient
  - Any errors will be printed to the initialization log file
An in-depth description of all values in the config file is provided in the section Config.ini.
## Config.ini

The `config.ini` file controls the entire project. Each value is used directly by the program. This is a list of all the values, their meaning, and their default value.
The config file itself can be found here. An example config file of a working setup can be seen here.
### general

General settings affecting core parts of the program

- `data_dir` (path): The path to the folder containing the data that will be used for the chatbot. This value has to be set by the user when running the script `KG_1_LoadData.py` during initialization. Default: not set
- `parallel_limit` (int): The maximum number of threads running in parallel during the program. This is also the maximum number of threads simultaneously waiting for a response from a large language model. Default: `8`
- `default_rag_method` (str): RAG method used if the RetrievalRequest does not specify one. Can be set to any RetrievalMethodId. Default: `"GARAG#783493"`
- `default_depth` (int): Default depth used if the RetrievalRequest does not specify one and the RAG method requires a depth parameter. Default: `1`
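The effect of `parallel_limit` can be pictured as a bound on a thread pool: at most that many workers, and hence at most that many concurrent LLM requests, run at once. A simplified sketch (our illustration, not project code; `ask_llm` is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_LIMIT = 8  # mirrors [general] parallel_limit

def ask_llm(prompt: str) -> str:
    """Stand-in for a blocking LLM request."""
    return f"answer to: {prompt}"

def run_all(prompts):
    """Process prompts with at most PARALLEL_LIMIT concurrent workers."""
    with ThreadPoolExecutor(max_workers=PARALLEL_LIMIT) as pool:
        # map preserves input order even though calls run concurrently
        return list(pool.map(ask_llm, prompts))
```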
### security

- `ssl_cert_path` (path): The path to the certificate file for HTTPS encryption. When using HTTP, leave this value empty. Default: not set
- `ssl_key_path` (path): The path to the key file for HTTPS encryption. When using HTTP, leave this value empty. Default: not set
### arangodb

General settings to access the Arango database

- `username` (str): The name of the user used to manage the database. This user has to have read, write, and collection and graph create access. Default: not set
- `password` (str): If set, this password will be used to log in as the user on the ArangoDB. If `None`, the user will be asked to enter a password at the start of the program execution (only works during initialization without Docker). Default: not set
- `url` (url): The URL that will be used to access the ArangoDB. Default: not set
### database

General settings to access the Postgres database

- `username` (str): Username for the database login. Default: `postgres`
- `password` (str): Password for the database login. Default: `root`
- `host` (hostname): Domain used for accessing the database. Default: not set
- `port` (int): Port used for accessing the database. Default: not set
- `database_name` (str): Name of the database. Default: `postgres`
### elastic

- `url` (url): The URL at which the index data is stored. Default: not set
### LLM (index/query)

Settings controlling the large language model used by the program.

- `base_url` (str): The URL used to communicate with the LLM.
- `model_name` (str): The name of the model used when communicating with an Ollama server.
- `api_key` (str): When provided, this API key will be used for all Ollama LLM requests.
- `options` (dict): LLM configuration options.
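A plausible `[LLM]` block is shown below with placeholder values. The exact keys accepted inside `options` depend on your Ollama setup; the ones shown here are only an example, not documented defaults of this project:

```ini
[LLM]
base_url = https://ollama.nimbus.dlr.de/ollama
model_name = llama3.2:latest
api_key =
options = {"temperature": 0.2, "num_ctx": 4096}
```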
## Valid file types

This is a list of all file types recognized by `KG_1_LoadData.py`:

- docx
- txt
- md

Files not included during reading:

- Files starting with `~$...`: these are usually temporary files used while a document is open; they provide no information and are therefore ignored.

All other file types raise a warning, which the user may examine afterwards in the created log file.
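The filtering rules above can be sketched as follows. This is an illustrative reimplementation, not the actual logic of `KG_1_LoadData.py`:

```python
from pathlib import Path

VALID_SUFFIXES = {".docx", ".txt", ".md"}

def classify(path: str) -> str:
    """Return 'read', 'ignored', or 'warning' for a single file path."""
    p = Path(path)
    if p.name.startswith("~$"):          # temporary Office lock files
        return "ignored"
    if p.suffix.lower() in VALID_SUFFIXES:
        return "read"
    return "warning"                     # unrecognized type: logged for review
```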
## Dive Deeper

- For more details about configuring the pipeline, see the configuration documentation.
- To learn more about initialization, refer to the Initialization documentation.