Developer Guide

Repository Layout

qoco/
├── src/                  # Core solver (backend-agnostic)
│   ├── qoco_api.c        # Public API: setup, solve, cleanup
│   ├── kkt.c             # KKT matrix construction and RHS assembly
│   ├── cone.c / cone.cu  # Cone operations (CPU or GPU, selected at build time)
│   ├── equilibration.c   # Ruiz scaling
│   ├── common_linalg.c   # Linalg helpers that don't depend on backend
│   └── qoco_utils.c      # Printing, stopping criteria, solution copy
├── include/              # Public and internal headers
│   ├── structs.h         # All struct definitions including LinSysBackend
│   └── qoco_linalg.h     # Backend-agnostic linalg interface (types + ops)
├── algebra/
│   ├── CMakeLists.txt    # Selects backend, sets compile definitions
│   ├── builtin/          # CPU backend (QDLDL + AMD)
│   └── cuda/             # GPU backend (cuDSS + cuSPARSE + cuBLAS)
├── tests/
│   ├── unit_tests/       # Component-level tests
│   ├── simple_tests/     # Small end-to-end problems
│   ├── ocp/              # Optimal control problem tests
│   └── portfolio/        # Portfolio optimization tests
├── devtools/             # Local developer scripts
├── benchmarks/           # Benchmark runner and configs
└── .github/workflows/    # CI definitions

Backend Architecture

The solver core in src/ is completely backend-agnostic. It interacts with the linear algebra layer only through two abstractions:

  • Opaque types: QOCOMatrix, QOCOVectorf, and QOCOVectori are forward-declared in include/qoco_linalg.h and defined differently by each backend.

  • Function pointer table: LinSysBackend in include/structs.h holds pointers to the backend’s setup, factor, solve, and cleanup functions. The solver calls these through solver->linsys->....
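The opaque-type idea can be illustrated with a minimal, single-file sketch. Everything named Demo* or demo_* below is hypothetical and is not part of QOCO; the three sections stand in for a shared header, core code, and one backend's definition.

```c
#include <stdlib.h>

/* --- What a backend-agnostic header might declare (hypothetical names) --- */
typedef struct DemoVectorf_ DemoVectorf; /* opaque: layout hidden from the core */
DemoVectorf* demo_vectorf_new(const double* x, int len);
double* demo_vectorf_data(DemoVectorf* v);
int demo_vectorf_len(const DemoVectorf* v);
void demo_vectorf_free(DemoVectorf* v);

/* --- Core-style code: written only against the declarations above --- */
double demo_sum(DemoVectorf* v)
{
  double s = 0.0;
  double* d = demo_vectorf_data(v);
  for (int i = 0; i < demo_vectorf_len(v); ++i) {
    s += d[i];
  }
  return s;
}

/* --- What a CPU-style backend might define --- */
struct DemoVectorf_ {
  double* data;
  int len;
};

DemoVectorf* demo_vectorf_new(const double* x, int len)
{
  DemoVectorf* v = malloc(sizeof(*v));
  if (!v) return NULL;
  v->data = malloc(sizeof(double) * (size_t)len);
  if (!v->data) {
    free(v);
    return NULL;
  }
  for (int i = 0; i < len; ++i) v->data[i] = x[i];
  v->len = len;
  return v;
}

double* demo_vectorf_data(DemoVectorf* v) { return v->data; }
int demo_vectorf_len(const DemoVectorf* v) { return v->len; }

void demo_vectorf_free(DemoVectorf* v)
{
  if (v) {
    free(v->data);
    free(v);
  }
}
```

Because demo_sum never touches the struct fields, a different backend could redefine struct DemoVectorf_ (for example with separate host and device pointers) without any change to the core-style code.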

Backend Interface

LinSysBackend is defined in include/structs.h:

typedef struct {
  const char* (*linsys_name)();
  LinSysData* (*linsys_setup)(QOCOProblemData*, QOCOSettings*, QOCOInt Wnnz);
  void (*linsys_set_nt_identity)(LinSysData*, QOCOInt m);
  void (*linsys_update_nt)(LinSysData*, QOCOVectorf* WtW_vec,
                           QOCOFloat kkt_static_reg, QOCOInt m);
  void (*linsys_update_data)(LinSysData*, QOCOProblemData*);
  void (*linsys_factor)(LinSysData*, QOCOInt n, QOCOFloat kkt_dynamic_reg);
  void (*linsys_solve)(LinSysData*, QOCOWorkspace*, QOCOVectorf* b,
                       QOCOVectorf* x, QOCOInt iter_ref_iters);
  void (*linsys_cleanup)(LinSysData*);
} LinSysBackend;

Each backend exports a global of type LinSysBackend, named backend, that is linked into the final binary. The solver calls linsys_setup once at startup and then calls linsys_factor and linsys_solve in each iteration to solve the KKT system.
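The calling pattern (setup once, then factor and solve every iteration, all through a table of function pointers) can be sketched with a toy backend. This is an illustration only: DemoBackend, DemoData, and the diagonal "solver" are hypothetical stand-ins, not QOCO's LinSysBackend or its QDLDL/cuDSS implementations.

```c
#include <stdlib.h>

/* Hypothetical per-backend state for a diagonal system diag(k) x = b. */
typedef struct {
  double* diag;   /* copy of the diagonal */
  double* inv;    /* "factorization": reciprocals of the regularized diagonal */
  int n;
  int num_factor; /* counts factor calls, for illustration */
} DemoData;

/* Hypothetical function-pointer table in the spirit of LinSysBackend. */
typedef struct {
  const char* (*name)(void);
  DemoData* (*setup)(const double* diag, int n);
  void (*factor)(DemoData*, double reg);
  void (*solve)(DemoData*, const double* b, double* x);
  void (*cleanup)(DemoData*);
} DemoBackend;

static const char* diag_name(void) { return "diag-demo"; }

static DemoData* diag_setup(const double* diag, int n)
{
  DemoData* d = malloc(sizeof(*d));
  if (!d) return NULL;
  d->diag = malloc(sizeof(double) * (size_t)n);
  d->inv = malloc(sizeof(double) * (size_t)n);
  if (!d->diag || !d->inv) {
    free(d->diag);
    free(d->inv);
    free(d);
    return NULL;
  }
  for (int i = 0; i < n; ++i) d->diag[i] = diag[i];
  d->n = n;
  d->num_factor = 0;
  return d;
}

static void diag_factor(DemoData* d, double reg)
{
  for (int i = 0; i < d->n; ++i) d->inv[i] = 1.0 / (d->diag[i] + reg);
  d->num_factor++;
}

static void diag_solve(DemoData* d, const double* b, double* x)
{
  for (int i = 0; i < d->n; ++i) x[i] = d->inv[i] * b[i];
}

static void diag_cleanup(DemoData* d)
{
  free(d->diag);
  free(d->inv);
  free(d);
}

/* The one global each "backend" would export. */
DemoBackend demo_backend = {diag_name, diag_setup, diag_factor, diag_solve,
                            diag_cleanup};

/* Solver-side loop: touches the backend only through the table.
   Returns the number of factorizations performed, or -1 on setup failure. */
int demo_run(const DemoBackend* be, const double* diag, const double* b,
             double* x, int n, int iters)
{
  DemoData* d = be->setup(diag, n);
  if (!d) return -1;
  for (int k = 0; k < iters; ++k) {
    be->factor(d, 0.0);
    be->solve(d, b, x);
  }
  int nf = d->num_factor;
  be->cleanup(d);
  return nf;
}
```

Swapping in another backend then amounts to linking a different definition of the global table; demo_run stays unchanged.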

Backend selection happens at configure time via the CMake variable QOCO_ALGEBRA_BACKEND (default: builtin). algebra/CMakeLists.txt validates the choice, adds the corresponding directory to the include path, defines either QOCO_ALGEBRA_BACKEND_BUILTIN or QOCO_ALGEBRA_BACKEND_CUDA, and calls add_subdirectory on the backend folder. The root CMakeLists.txt then picks src/cone.c (builtin) or src/cone.cu (CUDA) accordingly.

CPU (Builtin) Backend

Location: algebra/builtin/

| File | Purpose |
| --- | --- |
| builtin_types.h | Concrete struct definitions for QOCOMatrix, QOCOVectorf, QOCOVectori |
| builtin_linalg.c | All linalg operations: SpMv, norms, element-wise ops, etc. |
| qdldl_backend.c | LinSysBackend implementation: setup, factor, solve, cleanup |

Type layout (builtin_types.h):

struct QOCOVectorf_ { QOCOFloat* data; QOCOInt len; };
struct QOCOVectori_ { QOCOInt*   data; QOCOInt len; };
struct QOCOMatrix_  { QOCOCscMatrix* csc; };

Everything lives on the CPU. get_data_vectorf(v) returns v->data directly.

Linear system (qdldl_backend.c):

linsys_setup builds the KKT matrix from P, A, G using construct_kkt (src/kkt.c), computes an AMD reordering for fill reduction, and permutes the matrix to PKPt. Index mappings (PregtoKKT, AttoKKT, GttoKKT, nt2kkt, ntdiag2kkt) are stored so that subsequent NT scaling updates can write directly into the correct entries of PKPt without rebuilding it from scratch.
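The benefit of storing index mappings is that an update becomes a scatter into the existing value array. The sketch below is a guess at the general shape, not the solver's code: the function name is hypothetical, the parameter names only echo nt2kkt and ntdiag2kkt from the text, and the sign convention (writing the negated scaling block and subtracting the static regularization on its diagonal) is an assumption made for illustration.

```c
/* Hypothetical in-place update of the scaling block of a permuted KKT matrix.
 *   kkt_vals   : nonzero values of the permuted matrix
 *   nt2kkt     : position in kkt_vals of each stored entry of the block
 *   ntdiag2kkt : position in kkt_vals of each diagonal entry of the block
 *   wtw        : new block values, in the same order as nt2kkt
 * ASSUMPTION: the block is stored as -(WtW) - static_reg * I. */
void demo_update_nt_block(double* kkt_vals, const int* nt2kkt,
                          const int* ntdiag2kkt, const double* wtw, int wnnz,
                          int m, double static_reg)
{
  /* Overwrite every stored entry of the block. */
  for (int i = 0; i < wnnz; ++i) {
    kkt_vals[nt2kkt[i]] = -wtw[i];
  }
  /* Apply static regularization on the block's diagonal. */
  for (int i = 0; i < m; ++i) {
    kkt_vals[ntdiag2kkt[i]] -= static_reg;
  }
}
```

No sparsity pattern is rebuilt and no permutation is reapplied; only the values at the mapped positions change, which is what makes a per-iteration update cheap.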

linsys_factor calls QDLDL_factor on the permuted KKT matrix.

linsys_solve calls QDLDL_solve then runs up to iter_ref_iters steps of iterative refinement to improve accuracy.
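Iterative refinement repeatedly solves for a correction using the residual of the current solution. The following sketch shows the loop on a toy diagonal system whose "factorization" is deliberately inexact (it includes a regularization term); DemoSystem and the demo_* functions are hypothetical and unrelated to the QDLDL calls above.

```c
/* Hypothetical system K = diag(k), solved inexactly with (K + reg*I). */
typedef struct {
  const double* k; /* diagonal of K */
  double reg;      /* perturbation present in the inexact solve */
} DemoSystem;

/* y = K x */
void demo_matvec(const DemoSystem* s, const double* x, double* y, int n)
{
  for (int i = 0; i < n; ++i) y[i] = s->k[i] * x[i];
}

/* out = (K + reg*I)^{-1} rhs: stands in for a solve with a regularized
   factorization. */
void demo_inexact_solve(const DemoSystem* s, const double* rhs, double* out,
                        int n)
{
  for (int i = 0; i < n; ++i) out[i] = rhs[i] / (s->k[i] + s->reg);
}

/* Refines x in place: x <- x + solve(b - K x), repeated iters times.
   r and dx are caller-provided scratch arrays of length n. */
void demo_refine(const DemoSystem* s, const double* b, double* x, double* r,
                 double* dx, int n, int iters)
{
  for (int it = 0; it < iters; ++it) {
    demo_matvec(s, x, r, n);                       /* r = K x      */
    for (int i = 0; i < n; ++i) r[i] = b[i] - r[i]; /* r = b - K x */
    demo_inexact_solve(s, r, dx, n);               /* dx ~ K^{-1} r */
    for (int i = 0; i < n; ++i) x[i] += dx[i];
  }
}
```

In this toy setting each refinement step shrinks the error in component i by the factor reg / (k[i] + reg), so a handful of steps recovers the solution of the unregularized system to high accuracy.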

Dependencies: QDLDL (lib/qdldl/), AMD (lib/amd/), both built as part of the CMake project.

GPU (CUDA) Backend

Location: algebra/cuda/

| File | Purpose |
| --- | --- |
| cuda_types.h | Concrete struct definitions; each type holds both host and device pointers |
| cuda_linalg.cu | CUDA kernels for SpMv, norms, element-wise ops, etc. |
| cudss_backend.cu | LinSysBackend implementation using cuDSS |

Type layout (cuda_types.h):

struct QOCOVectorf_ { QOCOFloat* host; QOCOFloat* device; QOCOInt len; };
struct QOCOVectori_ { QOCOInt*   host; QOCOInt*   device; QOCOInt len; };
struct QOCOMatrix_  {
  QOCOCscMatrix* csc_host;     // CSC on host
  CusparseMatrix* csr_device;  // CSR on device (data)
  CusparseMatrix* csr_meta;    // CSR on device (metadata/structure)
};

CPU mode flag: A thread-local cpu_mode flag controls which pointer get_data_vectorf() returns. Core solver code that runs on the CPU calls set_cpu_mode(1) before accessing data, ensuring it gets the host pointer. GPU kernel launches use set_cpu_mode(0).
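A minimal sketch of the flag mechanism, in plain C with no CUDA involved: the "device" pointer is just a second host buffer here, and DemoVectorf, demo_set_cpu_mode, and demo_get_data_vectorf are hypothetical stand-ins for the real types and accessors. It assumes a C11 compiler for _Thread_local.

```c
/* Hypothetical vector holding two views of the same data. */
typedef struct {
  double* host;   /* CPU copy */
  double* device; /* in the real backend this would be GPU memory */
  int len;
} DemoVectorf;

/* One flag per thread, so threads cannot disturb each other's mode. */
static _Thread_local int demo_cpu_mode = 0;

void demo_set_cpu_mode(int on) { demo_cpu_mode = on; }

/* Returns the pointer that matches the calling thread's current mode. */
double* demo_get_data_vectorf(DemoVectorf* v)
{
  return demo_cpu_mode ? v->host : v->device;
}
```

The design keeps call sites identical across backends: code asks for "the data" and the flag decides which copy that means, at the cost of requiring callers to set the mode correctly before each access.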

Dynamic library loading: CUDA libraries are loaded at runtime with dlopen() in cudss_setup() rather than linked at build time. This allows the binary to run on systems without a GPU (returning a graceful error) and avoids mandatory CUDA toolkit installation for users of the CPU backend. The libraries loaded are:

  • libcudss.so — NVIDIA cuDSS sparse direct solver

  • libcusparse.so — Sparse matrix operations

  • libcublas.so — Dense linear algebra
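Runtime loading with a graceful failure path might look roughly like the sketch below. It uses only the POSIX dlopen, dlsym, and dlerror calls; the helper names are hypothetical, and how cudss_setup() actually reports errors is not shown in this guide. On older glibc versions the program must be linked with -ldl.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Tries to load a shared library by name. Returns the handle, or NULL after
   printing the loader's diagnostic, so the caller can fall back or abort
   cleanly instead of failing at program start. */
void* demo_load_library(const char* soname)
{
  void* handle = dlopen(soname, RTLD_LAZY | RTLD_LOCAL);
  if (!handle) {
    const char* err = dlerror();
    fprintf(stderr, "could not load %s: %s\n", soname,
            err ? err : "unknown error");
  }
  return handle;
}

/* Looks up a symbol in a loaded library. Returns NULL if it is missing. */
void* demo_load_symbol(void* handle, const char* name)
{
  dlerror(); /* clear any stale error state */
  void* sym = dlsym(handle, name);
  const char* err = dlerror();
  if (err) {
    fprintf(stderr, "could not resolve %s: %s\n", name, err);
    return NULL;
  }
  return sym;
}
```

A caller would typically load each library in turn and return an error code from setup if any handle comes back NULL, which is what lets the same binary start on a machine without the CUDA libraries installed.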

Matrix format: The core solver uses CSC throughout. The CUDA backend converts to CSR for cuDSS (which requires CSR) during setup and stores the result on the device. Problem matrices A and G are stored in both formats.
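For reference, a CSC-to-CSR conversion can be written as a counting pass followed by a scatter. This is the standard textbook algorithm, shown as a sketch with a hypothetical function name and raw arrays instead of QOCOCscMatrix; the CUDA backend's own conversion routine may differ.

```c
/* Converts an nrows x ncols matrix from CSC (csc_p, csc_i, csc_x) to CSR.
   Output arrays must be preallocated: csr_p has nrows + 1 entries, csr_j and
   csr_x have nnz = csc_p[ncols] entries each. */
void demo_csc_to_csr(int nrows, int ncols, const int* csc_p, const int* csc_i,
                     const double* csc_x, int* csr_p, int* csr_j,
                     double* csr_x)
{
  int nnz = csc_p[ncols];

  /* Count the entries in each row (stored one slot ahead). */
  for (int r = 0; r <= nrows; ++r) csr_p[r] = 0;
  for (int k = 0; k < nnz; ++k) csr_p[csc_i[k] + 1]++;

  /* Prefix sum: csr_p[r] becomes the start offset of row r. */
  for (int r = 0; r < nrows; ++r) csr_p[r + 1] += csr_p[r];

  /* Scatter column by column, using csr_p[r] as a moving write cursor.
     Visiting columns in order keeps column indices sorted within each row. */
  for (int c = 0; c < ncols; ++c) {
    for (int k = csc_p[c]; k < csc_p[c + 1]; ++k) {
      int r = csc_i[k];
      int dst = csr_p[r]++;
      csr_j[dst] = c;
      csr_x[dst] = csc_x[k];
    }
  }

  /* Each cursor now points at the end of its row; shift back to row starts. */
  for (int r = nrows; r > 0; --r) csr_p[r] = csr_p[r - 1];
  csr_p[0] = 0;
}
```

For a symmetric matrix stored in full, the CSC and CSR arrays coincide, but for triangular storage or for rectangular matrices such as A and G an explicit conversion like this is needed.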

Linear system: linsys_setup constructs the KKT matrix on the CPU via the shared construct_kkt function, converts it to CSR, uploads to device, and initialises a cuDSS solver handle. linsys_factor and linsys_solve call into cuDSS. The solve result is left on device; sync_vector_to_host is called explicitly when the CPU needs to read the result.

Cone Implementation

Cone operations (products, divisions, NT scaling, linesearch) are in src/cone.c for the builtin backend and src/cone.cu for the CUDA backend. The file is selected at build time — only one is ever compiled. The CUDA version implements the same logic as CUDA kernels dispatched via the same function signatures.

Closed-Form SOC Step Length

The linesearch for the second-order cone (soc_step_length in src/cone.c) computes, in closed form rather than by bisection, the maximum step length \(\alpha \ge 0\) such that \(x + \alpha \, dx\) remains in the second-order cone

\[\mathcal{Q}^n = \{ (x_0, x_1) \in \mathbb{R} \times \mathbb{R}^{n-1} : x_0 \ge \|x_1\| \}.\]

Derivation. The membership condition for \(x + \alpha \, dx\) is

\[(x_0 + \alpha \, dx_0)^2 \ge \|x_1 + \alpha \, dx_1\|^2.\]

Expanding and collecting by powers of \(\alpha\):

\[\underbrace{(dx_0^2 - \|dx_1\|^2)}_{a} \, \alpha^2 + \underbrace{2(x_0 \, dx_0 - x_1^\top dx_1)}_{b} \, \alpha + \underbrace{(x_0^2 - \|x_1\|^2)}_{c} \ge 0.\]

Because \(x\) is already in the cone, \(c = \det(x) = x_0^2 - \|x_1\|^2 \ge 0\), so \(\alpha = 0\) is always feasible. The maximum feasible \(\alpha\) is therefore the smallest positive real root of the quadratic \(a \alpha^2 + b \alpha + c = 0\), when such a root exists; if there is none, this cone places no restriction on the step.
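As a small worked instance of the quadratic above (the numbers are chosen here for illustration and do not come from the solver's tests), take \(x = (2, 1)\) and \(dx = (-1, 2)\) in \(\mathcal{Q}^2\). Then

\[a = (-1)^2 - 2^2 = -3, \qquad b = 2\,\bigl(2 \cdot (-1) - 1 \cdot 2\bigr) = -8, \qquad c = 2^2 - 1^2 = 3,\]

so the membership condition reads

\[-3\alpha^2 - 8\alpha + 3 \ge 0 \quad\Longleftrightarrow\quad (3\alpha - 1)(\alpha + 3) \le 0.\]

The roots are \(\alpha = 1/3\) and \(\alpha = -3\), and the smallest positive one gives the maximum step \(\alpha = 1/3\). At that step \(x + \alpha \, dx = (5/3, 5/3)\), which lies exactly on the cone boundary, as expected.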

Case analysis. Before solving the quadratic the code handles four degenerate cases:

  1. Scalar safeguard. If \(dx_0 < 0\), the first component could go negative. An independent upper bound \(-x_0 / dx_0\) is applied first.

  2. No positive root (\(a > 0\) and \(b > 0\), or discriminant \(d = b^2 - 4ac < 0\)). In the first case all three coefficients are non-negative, so the quadratic cannot become negative for \(\alpha \ge 0\). In the second case there are no real roots and, since \(c \ge 0\), the quadratic is non-negative everywhere. Either way the current bound is returned unchanged.

  3. Linear case (\(|a| < 10^{-14}\)). The leading term vanishes; the constraint is linear in \(\alpha\). With \(c \ge 0\) and the sign structure this imposes no additional restriction, so the bound is returned unchanged.

  4. Boundary case (\(c = 0\), i.e. \(x\) is on the cone boundary). If \(a \ge 0\) there is no positive root; otherwise \(\alpha = 0\) is the only feasible point.

Numerically stable root computation. When none of the degenerate cases applies, the citardauq formula is used to avoid catastrophic cancellation. Let \(\sqrt{d} = \sqrt{b^2 - 4ac}\). Define

\[\begin{split}t = \begin{cases} -b - \sqrt{d} & \text{if } b \ge 0 \\ -b + \sqrt{d} & \text{if } b < 0 \end{cases}\end{split}\]

Then the two roots are computed as

\[r_1 = \frac{2c}{t}, \qquad r_2 = \frac{t}{2a}.\]

Because \(-b\) and the chosen \(\mp\sqrt{d}\) always have the same sign, \(t\) is formed without subtracting nearly equal quantities, which avoids the large relative error the textbook formula incurs when \(b^2 \gg |4ac|\). Negative roots are discarded (replaced by \(+\infty\)), and the smaller of \(r_1\), \(r_2\) is taken as the step-length restriction for this cone.

Building

Prerequisites

  • CMake ≥ 3.18

  • C compiler: clang or gcc (Linux/macOS), MSVC (Windows)

  • Python 3.11+ with cvxpy (for test data generation)

  • CUDA toolkit ≥ 13.0 (GPU backend only)

CPU backend (default)

cmake -B build -DQOCO_BUILD_TYPE=Release -DENABLE_TESTING=True
cmake --build build

GPU backend

cmake -B build \
  -DQOCO_ALGEBRA_BACKEND=cuda \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \
  -DQOCO_BUILD_TYPE=Release \
  -DENABLE_TESTING=True
cmake --build build -j$(nproc)

CMake options

| Option | Default | Description |
| --- | --- | --- |
| QOCO_ALGEBRA_BACKEND | builtin | Backend: builtin or cuda |
| QOCO_BUILD_TYPE | Release | Debug (adds -g, ASAN/UBSAN on Unix) or Release (-O3) |
| QOCO_SINGLE_PRECISION | OFF | Use float instead of double |
| ENABLE_TESTING | OFF | Build and register test suite |
| BUILD_QOCO_DEMO | OFF | Build examples/qoco_demo |
| BUILD_QOCO_BENCHMARK_RUNNER | OFF | Build benchmark runner |

Unit Tests

Tests use Google Test and are run with ctest. All test executables link against qocostatic.

Test categories

| Directory | Executable(s) | What it covers |
| --- | --- | --- |
| tests/unit_tests/ | linalg_test, cone_test, input_validation_test | Individual components |
| tests/simple_tests/ | missing_constraints_test | End-to-end with missing constraint types (LP-only, SOC-only) |
| tests/ocp/ | lcvx_test, lcvx_bad_scaling_test, pdg_test | Optimal control problems |
| tests/portfolio/ | markowitz_test | Portfolio optimization (Markowitz) |

Unit test details

linalg_test — covers the linalg layer:

  • CSC matrix creation and copying

  • Array copy / negate / scale

  • Dot products, sparse matrix-vector products

cone_test — covers src/cone.c:

  • Cone products and divisions for LP and SOC cones

  • Mixed LP + SOC problems

input_validation_test — covers src/input_validation.c:

  • Rejects invalid settings (tolerances, iteration counts, etc.)

Integration tests

The OCP and portfolio tests load pre-generated problem data from header files (e.g. lcvx_data.h, markowitz_data.h) and call the full solve pipeline. They assert that the optimal objective matches a reference value within 0.01% relative error.
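A 0.01% relative tolerance corresponds to a factor of 1e-4 on the reference value. The helper below is a hypothetical sketch of such a check in plain C; the actual tests use Google Test, and the exact assertion they use is not shown in this guide.

```c
/* Returns 1 if actual is within rel_tol of expected in the relative sense
   |actual - expected| <= rel_tol * |expected|, and 0 otherwise. */
int demo_within_rel_tol(double actual, double expected, double rel_tol)
{
  double diff = actual - expected;
  if (diff < 0.0) diff = -diff;
  double scale = (expected < 0.0) ? -expected : expected;
  return diff <= rel_tol * scale;
}
```

With a reference objective of 100.0 and rel_tol = 1e-4, a computed value of 100.005 passes, while 100.02 does not. A purely relative check like this becomes very strict when the reference value is close to zero, which is worth keeping in mind when adding new test problems.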

Problem data is generated by the Python scripts in each test directory (generate_problem_data.py), which use cvxpy to solve the reference problem. The generated .h files are committed to the repository, so cvxpy is only needed if you regenerate them.

Running tests locally

# Run all tests
ctest --test-dir build --verbose

# Run a specific test
ctest --test-dir build -R lcvx_test --verbose

# Re-run only the tests that failed in the previous run, printing their output on failure
ctest --test-dir build --rerun-failed --output-on-failure

CI Workflows

All workflows are in .github/workflows/.

unit_tests.yml — primary test suite

Triggers on every push and pull request.

Runs the full test matrix in parallel (fail-fast: false):

| OS | Compiler | Build types |
| --- | --- | --- |
| ubuntu-latest | clang | Debug, Release |
| macos-latest | clang | Debug, Release |
| windows-latest | MSVC | Debug, Release |

The Debug build enables -fsanitize=address,undefined on Linux and macOS, so memory errors and undefined behaviour are caught automatically.

To reproduce a CI failure locally (e.g. ubuntu clang Debug):

cmake -B build -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
      -DQOCO_BUILD_TYPE=Debug -DENABLE_TESTING=True -S .
cmake --build build
ctest --test-dir build --verbose --rerun-failed --output-on-failure

clang_tidy.yml — static analysis

Triggers on every push and pull request.

Builds with CMAKE_EXPORT_COMPILE_COMMANDS=ON then runs clang-tidy on all src/*.c files (excluding OS-specific timers). Config is in .clang-tidy.

Enabled check families: bugprone-*, clang-analyzer-*, misc-unused-parameters. Disabled: bugprone-easily-swappable-parameters, clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling. All warnings are treated as errors.

To reproduce locally:

devtools/run_clang_tidy.sh

clang_format.yml — formatting enforcement

Triggers on every push and pull request.

Runs clang-format --dry-run --Werror on all .c and .h files under src/, include/, and algebra/builtin/. Fails if any file is not formatted according to .clang-format.

To reproduce locally:

devtools/run_clang_format.sh --check   # check only (same as CI)
devtools/run_clang_format.sh           # fix in place

benchmark_regression.yml — performance regression

Triggers only on pull requests targeting main.

Builds both main and the PR branch, runs the benchmark suite against both, and posts a comparison report as a PR comment. Uses configs in benchmarks/configs/main.yml and benchmarks/configs/branch.yml.

docs.yml — documentation deployment

Triggers manually (workflow_dispatch only).

Builds the Sphinx docs from docs/ and deploys to the gh-pages branch. Also deploys to the root of gh-pages if the version being built is the latest released version.

Developer Tools

All scripts in devtools/ are intended to be run from the repository root.

run_tests.sh

Builds in Release mode with testing, demo, and benchmark runner enabled, then runs the full test suite.

devtools/run_tests.sh

run_tests_gpu.sh

Same as above but configures the CUDA backend with CUDA 13.0.

devtools/run_tests_gpu.sh

run_clang_tidy.sh

Generates a temporary build directory with compile commands, runs clang-tidy on all src/*.c files, then removes the build directory.

devtools/run_clang_tidy.sh

run_clang_format.sh

Checks or fixes formatting for src/, include/, and algebra/builtin/.

devtools/run_clang_format.sh           # fix in place
devtools/run_clang_format.sh --check   # report violations and exit non-zero

profile.sh

Profiles a benchmark run on CPU using Valgrind’s callgrind tool and opens the result in KCachegrind.

devtools/profile.sh path/to/benchmark/data

profile_gpu.sh

Profiles a benchmark run on GPU using NVIDIA Nsight Systems.

devtools/profile_gpu.sh path/to/benchmark/data