rag
Synopsis
ramalama rag [options] documents destination
Description
Convert documents into a Qdrant vector database and package the result as an
OCI container image. Instead of relying on a heavyweight container with
PyTorch and the full Docling stack, this command uses lightweight llama.cpp
servers to perform document conversion (via the Granite Docling VLM) and
text embedding (via the EmbeddingGemma model). The resulting container
image contains only the vector database and can be used with
ramalama serve --rag.
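As a rough illustration of that last step, the sketch below prints a `ramalama serve --rag` invocation that points at an image built by this command. The image name `myrag:latest` and the model name `granite` are placeholders chosen for illustration, not values taken from this page; see ramalama-serve(1) for the authoritative syntax.

```shell
#!/bin/sh
RAG_IMAGE=myrag:latest   # assumed: image produced earlier by "ramalama rag"
MODEL=granite            # assumed: placeholder for any model ramalama can serve

# Print the command for review; drop "echo" to actually start the server.
echo ramalama serve --rag "$RAG_IMAGE" "$MODEL"
```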
The pipeline:
- Text files (.txt, .md, .html) are read directly.
- PDFs and images are converted page-by-page through the Granite Docling VLM served by llama.cpp.
- All content is chunked by section headings.
- Chunks are embedded via the EmbeddingGemma model served by llama.cpp.
- Embeddings are stored in a Qdrant on-disk collection.
- The Qdrant database is packaged into a FROM scratch OCI image.
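The first two steps route each file by type. A minimal sketch of that routing follows, assuming the file extension alone decides the path; the `classify` helper and the `skip` result for unrecognized files are hypothetical, and the detection logic inside ramalama may differ.

```shell
#!/bin/sh
# Hypothetical helper: decide how a file would be handled, by extension only.
classify() {
    # Lowercase the name so .PDF and .pdf are treated alike.
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        *.txt|*.md|*.html)        echo "text" ;;  # read directly
        *.pdf|*.png|*.jpg|*.jpeg) echo "vlm"  ;;  # converted via the VLM
        *)                        echo "skip" ;;  # assumption: not processed
    esac
}

for f in README.md report.pdf scan.PNG notes.txt archive.zip; do
    printf '%s -> %s\n' "$f" "$(classify "$f")"
done
```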
Two containers work together: a llama.cpp container serves the AI models, and a lightweight RAG container runs the document processing pipeline.
This command requires a container engine (podman or docker).
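The chunking step in the pipeline can be approximated as follows. This is a sketch under stated assumptions, not the actual implementation: it treats lines starting with `#` as section headings (Markdown only), counts whitespace-separated words as a stand-in for tokens, and uses a hypothetical `chunk_by_heading` function whose argument plays the role of `--chunk-size`.

```shell
#!/bin/sh
# Hypothetical sketch: start a new chunk at every heading, and also whenever
# the current chunk reaches MAX words (words approximate tokens here).
chunk_by_heading() {
    awk -v max="$1" '
        function flush() {
            # Emit the buffered chunk, if any, then reset the buffer.
            if (n > 0) printf "--- chunk %d (%d words)\n%s\n", ++count, n, buf
            buf = ""; n = 0
        }
        /^#/ { flush() }                  # a heading closes the previous chunk
        {
            for (i = 1; i <= NF; i++) {
                if (n == max) flush()     # size cap reached: close the chunk
                buf = (n == 0 ? $i : buf " " $i)
                n++
            }
        }
        END { flush() }
    '
}

printf '# Intro\none two three four five\n# Usage\nsix seven\n' | chunk_by_heading 4
```

With a cap of 4, the sample input yields three chunks: the first section is split once because it exceeds the cap, and the second section fits in a single chunk. Heading words count toward the cap in this sketch.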
Positional arguments:
DOCUMENTS File or directory containing PDFs, images (PNG, JPG, etc.), or text files (TXT, MD, HTML) to be processed.
DESTINATION Name of the output container image, or a local path.
Options
--chunk-size=integer
Maximum tokens per chunk for embedding (default: 400). Smaller chunks are faster to embed but may lose context; larger chunks preserve more context but require more embedding capacity.
--ctx-size, -c=integer
Context size for the VLM server (default: 8192). Increase if processing complex PDF pages that produce many visual tokens.
--docling-model=model
Granite Docling GGUF model used for document conversion (default: hf://ibm-granite/granite-docling-258M-GGUF).
--embed-ctx-size=integer
Context size for the embedding server (default: 0, auto-detected by llama.cpp based on the embedding model).
--help, -h
Print usage message.
--image=IMAGE
OCI container image to use for the llama.cpp inference servers. Defaults to the accelerator-appropriate ramalama image.
--ngl=value
Number of layers to store in VRAM: a number, auto, or all.
When omitted, llama-server defaults to auto.
--rag-image=IMAGE
OCI container image for the RAG processing container. Defaults to the accelerator-appropriate ramalama-rag image.
--threads, -t=integer
Number of CPU threads to use for llama.cpp inference. Defaults to half the available cores.
Examples
Convert a directory of documents into a RAG image
$ ramalama rag ./docs/ myrag:latest
Found 5 file(s): 2 need VLM, 3 text-only
Reading README.md (1/3)...
Chunking documents...
Embedding chunks via llama.cpp...
Stored vectors in Qdrant.
Building container image 'myrag:latest'...
RAG image 'myrag:latest' created successfully.
Convert a single PDF
$ ramalama rag ./report.pdf quay.io/myuser/report-rag
Use a custom number of GPU layers
$ ramalama rag --ngl all ./docs/ my-rag-image
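Several of the tuning options above can be combined in one invocation. The sketch below assembles such a command and prints it for review rather than running it; the option names come from the Options section, while the values, the input directory, and the image name are illustrative assumptions, not recommendations.

```shell
#!/bin/sh
DOCS=./docs/          # assumed: input directory
IMAGE=myrag:latest    # assumed: output image name

# Smaller chunks, a larger VLM context, and a fixed thread count.
cmd="ramalama rag --chunk-size 200 --ctx-size 16384 --threads 8 $DOCS $IMAGE"

# Print the command for review; run it with: eval "$cmd"
echo "$cmd"
```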
See Also
ramalama(1), ramalama-serve(1)
History
Dec 2024, Originally compiled by Dan Walsh <dwalsh@redhat.com>
Mar 2026, Rewritten to use llama.cpp-based pipeline by Brian Mahabirsingh <bmahabir@bu.edu>