Developer Guide =============== .. contents:: Table of Contents :local: :depth: 2 Repository Layout ----------------- .. code-block:: text qoco/ ├── src/ # Core solver (backend-agnostic) │ ├── qoco_api.c # Public API: setup, solve, cleanup │ ├── kkt.c # KKT matrix construction and RHS assembly │ ├── cone.c / cone.cu # Cone operations (CPU or GPU, selected at build time) │ ├── equilibration.c # Ruiz scaling │ ├── common_linalg.c # Linalg helpers that don't depend on backend │ └── qoco_utils.c # Printing, stopping criteria, solution copy ├── include/ # Public and internal headers │ ├── structs.h # All struct definitions including LinSysBackend │ └── qoco_linalg.h # Backend-agnostic linalg interface (types + ops) ├── algebra/ │ ├── CMakeLists.txt # Selects backend, sets compile definitions │ ├── builtin/ # CPU backend (QDLDL + AMD) │ └── cuda/ # GPU backend (cuDSS + cuSPARSE + cuBLAS) ├── tests/ │ ├── unit_tests/ # Component-level tests │ ├── simple_tests/ # Small end-to-end problems │ ├── ocp/ # Optimal control problem tests │ └── portfolio/ # Portfolio optimization tests ├── devtools/ # Local developer scripts ├── benchmarks/ # Benchmark runner and configs └── .github/workflows/ # CI definitions Backend Architecture -------------------- The solver core in ``src/`` is completely backend-agnostic. It interacts with the linear algebra layer only through two abstractions: - **Opaque types** — ``QOCOMatrix``, ``QOCOVectorf``, ``QOCOVectori`` are forward-declared in ``include/qoco_linalg.h`` and defined differently by each backend. - **Function pointer table** — ``LinSysBackend`` in ``include/structs.h`` holds pointers to the backend's setup, factor, solve, and cleanup functions. The solver calls these through ``solver->linsys->...``. Backend Interface ~~~~~~~~~~~~~~~~~ ``LinSysBackend`` is defined in ``include/structs.h``: .. code-block:: c typedef struct { const char* (*linsys_name)(); LinSysData* (*linsys_setup)(QOCOProblemData*, QOCOSettings*, QOCOInt Wnnz); void (*linsys_set_nt_identity)(LinSysData*, QOCOInt m); void (*linsys_update_nt)(LinSysData*, QOCOVectorf* WtW_vec, QOCOFloat kkt_static_reg, QOCOInt m); void (*linsys_update_data)(LinSysData*, QOCOProblemData*); void (*linsys_factor)(LinSysData*, QOCOInt n, QOCOFloat kkt_dynamic_reg); void (*linsys_solve)(LinSysData*, QOCOWorkspace*, QOCOVectorf* b, QOCOVectorf* x, QOCOInt iter_ref_iters); void (*linsys_cleanup)(LinSysData*); } LinSysBackend; Each backend exports a ``LinSysBackend backend`` global that is linked into the final binary. The solver calls ``linsys_setup`` at startup and thereafter calls ``linsys_factor`` / ``linsys_solve`` each iteration to solve the KKT system. **Backend selection** happens at configure time via the CMake variable ``QOCO_ALGEBRA_BACKEND`` (default: ``builtin``). ``algebra/CMakeLists.txt`` validates the choice, adds the corresponding directory to the include path, defines either ``QOCO_ALGEBRA_BACKEND_BUILTIN`` or ``QOCO_ALGEBRA_BACKEND_CUDA``, and calls ``add_subdirectory`` on the backend folder. The root ``CMakeLists.txt`` then picks ``src/cone.c`` (builtin) or ``src/cone.cu`` (CUDA) accordingly. CPU (Builtin) Backend ~~~~~~~~~~~~~~~~~~~~~ **Location:** ``algebra/builtin/`` .. list-table:: :header-rows: 1 * - File - Purpose * - ``builtin_types.h`` - Concrete struct definitions for ``QOCOMatrix``, ``QOCOVectorf``, ``QOCOVectori`` * - ``builtin_linalg.c`` - All linalg operations: SpMv, norms, element-wise ops, etc. * - ``qdldl_backend.c`` - ``LinSysBackend`` implementation: setup, factor, solve, cleanup **Type layout** (``builtin_types.h``): .. code-block:: c struct QOCOVectorf_ { QOCOFloat* data; QOCOInt len; }; struct QOCOVectori_ { QOCOInt* data; QOCOInt len; }; struct QOCOMatrix_ { QOCOCscMatrix* csc; }; Everything lives on the CPU. ``get_data_vectorf(v)`` returns ``v->data`` directly. **Linear system** (``qdldl_backend.c``): ``linsys_setup`` builds the KKT matrix from P, A, G using ``construct_kkt`` (``src/kkt.c``), computes an AMD reordering for fill reduction, and permutes the matrix to ``PKPt``. Index mappings (``PregtoKKT``, ``AttoKKT``, ``GttoKKT``, ``nt2kkt``, ``ntdiag2kkt``) are stored so that subsequent NT scaling updates can write directly into the correct entries of ``PKPt`` without rebuilding it from scratch. ``linsys_factor`` calls ``QDLDL_factor`` on the permuted KKT matrix. ``linsys_solve`` calls ``QDLDL_solve`` then runs up to ``iter_ref_iters`` steps of iterative refinement to improve accuracy. **Dependencies:** QDLDL (``lib/qdldl/``), AMD (``lib/amd/``), both built as part of the CMake project. GPU (CUDA) Backend ~~~~~~~~~~~~~~~~~~ **Location:** ``algebra/cuda/`` .. list-table:: :header-rows: 1 * - File - Purpose * - ``cuda_types.h`` - Concrete struct definitions — each type holds both host and device pointers * - ``cuda_linalg.cu`` - CUDA kernels for SpMv, norms, element-wise ops, etc. * - ``cudss_backend.cu`` - ``LinSysBackend`` implementation using cuDSS **Type layout** (``cuda_types.h``): .. code-block:: c struct QOCOVectorf_ { QOCOFloat* host; QOCOFloat* device; QOCOInt len; }; struct QOCOVectori_ { QOCOInt* host; QOCOInt* device; QOCOInt len; }; struct QOCOMatrix_ { QOCOCscMatrix* csc_host; // CSC on host CusparseMatrix* csr_device; // CSR on device (data) CusparseMatrix* csr_meta; // CSR on device (metadata/structure) }; **CPU mode flag:** A thread-local ``cpu_mode`` flag controls which pointer ``get_data_vectorf()`` returns. Core solver code that runs on the CPU calls ``set_cpu_mode(1)`` before accessing data, ensuring it gets the host pointer. GPU kernel launches use ``set_cpu_mode(0)``. **Dynamic library loading:** CUDA libraries are loaded at runtime with ``dlopen()`` in ``cudss_setup()`` rather than linked at build time. This allows the binary to run on systems without a GPU (returning a graceful error) and avoids mandatory CUDA toolkit installation for users of the CPU backend. The libraries loaded are: - ``libcudss.so`` — NVIDIA cuDSS sparse direct solver - ``libcusparse.so`` — Sparse matrix operations - ``libcublas.so`` — Dense linear algebra **Matrix format:** The core solver uses CSC throughout. The CUDA backend converts to CSR for cuDSS (which requires CSR) during setup and stores the result on the device. Problem matrices A and G are stored in both formats. **Linear system:** ``linsys_setup`` constructs the KKT matrix on the CPU via the shared ``construct_kkt`` function, converts it to CSR, uploads to device, and initialises a cuDSS solver handle. ``linsys_factor`` and ``linsys_solve`` call into cuDSS. The solve result is left on device; ``sync_vector_to_host`` is called explicitly when the CPU needs to read the result. Cone Implementation ~~~~~~~~~~~~~~~~~~~ Cone operations (products, divisions, NT scaling, linesearch) are in ``src/cone.c`` for the builtin backend and ``src/cone.cu`` for the CUDA backend. The file is selected at build time — only one is ever compiled. The CUDA version implements the same logic as CUDA kernels dispatched via the same function signatures. Closed-Form SOC Step Length ^^^^^^^^^^^^^^^^^^^^^^^^^^^ The linesearch for the second-order cone (``soc_step_length`` in ``src/cone.c``) computes the maximum step length :math:`\alpha \ge 0` such that :math:`x + \alpha \, dx` remains in the second-order cone .. math:: \mathcal{Q}^n = \{ (x_0, x_1) \in \mathbb{R} \times \mathbb{R}^{n-1} : x_0 \ge \|x_1\| \} rather than performing a bisection search. **Derivation.** The membership condition for :math:`x + \alpha \, dx` is .. math:: (x_0 + \alpha \, dx_0)^2 \ge \|x_1 + \alpha \, dx_1\|^2. Expanding and collecting by powers of :math:`\alpha`: .. math:: \underbrace{(dx_0^2 - \|dx_1\|^2)}_{a} \, \alpha^2 + \underbrace{2(x_0 \, dx_0 - x_1^\top dx_1)}_{b} \, \alpha + \underbrace{(x_0^2 - \|x_1\|^2)}_{c} \ge 0. Because :math:`x` is already in the cone, :math:`c = \det(x) = x_0^2 - \|x_1\|^2 \ge 0`, so :math:`\alpha = 0` is always feasible. The maximum feasible :math:`\alpha` is therefore the smallest positive real root of the quadratic :math:`a \alpha^2 + b \alpha + c = 0`. **Case analysis.** Before solving the quadratic the code handles four degenerate cases: 1. **Scalar safeguard.** If :math:`dx_0 < 0`, the first component could go negative. An independent upper bound :math:`-x_0 / dx_0` is applied first. 2. **No positive root** (:math:`a > 0` and :math:`b > 0`, or discriminant :math:`d = b^2 - 4ac < 0`). The parabola either opens upward with a positive vertex shift or has no real roots. Either way the quadratic stays non-negative for all :math:`\alpha \ge 0`, so the current bound is returned unchanged. 3. **Linear case** (:math:`|a| < 10^{-14}`). The leading term vanishes; the constraint is linear in :math:`\alpha`. With :math:`c \ge 0` and the sign structure this imposes no additional restriction, so the bound is returned unchanged. 4. **Boundary case** (:math:`c = 0`, i.e. :math:`x` is on the cone boundary). If :math:`a \ge 0` there is no positive root; otherwise :math:`\alpha = 0` is the only feasible point. **Numerically stable root computation.** When none of the degenerate cases applies, the citardauq formula is used to avoid catastrophic cancellation. Let :math:`\sqrt{d} = \sqrt{b^2 - 4ac}`. Define .. math:: t = \begin{cases} -b - \sqrt{d} & \text{if } b \ge 0 \\ -b + \sqrt{d} & \text{if } b < 0 \end{cases} Then the two roots are computed as .. math:: r_1 = \frac{2c}{t}, \qquad r_2 = \frac{t}{2a}. This form ensures that both :math:`r_1` and :math:`r_2` are computed by dividing two numbers of the same sign, avoiding the large relative error that arises when subtracting nearly equal quantities. Negative roots are discarded (replaced by :math:`+\infty`), and the smaller of :math:`r_1`, :math:`r_2` is taken as the step-length restriction for this cone. Building -------- Prerequisites ~~~~~~~~~~~~~ - CMake ≥ 3.18 - C compiler: clang or gcc (Linux/macOS), MSVC (Windows) - Python 3.11+ with ``cvxpy`` (for test data generation) - CUDA toolkit ≥ 13.0 (GPU backend only) CPU backend (default) ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash cmake -B build -DQOCO_BUILD_TYPE=Release -DENABLE_TESTING=True cmake --build build GPU backend ~~~~~~~~~~~ .. code-block:: bash cmake -B build \ -DQOCO_ALGEBRA_BACKEND=cuda \ -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \ -DQOCO_BUILD_TYPE=Release \ -DENABLE_TESTING=True cmake --build build -j$(nproc) CMake options ~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 * - Option - Default - Description * - ``QOCO_ALGEBRA_BACKEND`` - ``builtin`` - Backend: ``builtin`` or ``cuda`` * - ``QOCO_BUILD_TYPE`` - ``Release`` - ``Debug`` (adds ``-g``, ASAN/UBSAN on Unix) or ``Release`` (``-O3``) * - ``QOCO_SINGLE_PRECISION`` - ``OFF`` - Use ``float`` instead of ``double`` * - ``ENABLE_TESTING`` - ``OFF`` - Build and register test suite * - ``BUILD_QOCO_DEMO`` - ``OFF`` - Build ``examples/qoco_demo`` * - ``BUILD_QOCO_BENCHMARK_RUNNER`` - ``OFF`` - Build benchmark runner Unit Tests ---------- Tests use Google Test and are run with ``ctest``. All test executables link against ``qocostatic``. Test categories ~~~~~~~~~~~~~~~ .. list-table:: :header-rows: 1 * - Directory - Executable(s) - What it covers * - ``tests/unit_tests/`` - ``linalg_test``, ``cone_test``, ``input_validation_test`` - Individual components * - ``tests/simple_tests/`` - ``missing_constraints_test`` - End-to-end with missing constraint types (LP-only, SOC-only) * - ``tests/ocp/`` - ``lcvx_test``, ``lcvx_bad_scaling_test``, ``pdg_test`` - Optimal control problems * - ``tests/portfolio/`` - ``markowitz_test`` - Portfolio optimization (Markowitz) Unit test details ~~~~~~~~~~~~~~~~~ **linalg_test** — covers the linalg layer: - CSC matrix creation and copying - Array copy / negate / scale - Dot products, sparse matrix-vector products **cone_test** — covers ``src/cone.c``: - Cone products and divisions for LP and SOC cones - Mixed LP + SOC problems **input_validation_test** — covers ``src/input_validation.c``: - Rejects invalid settings (tolerances, iteration counts, etc.) Integration tests ~~~~~~~~~~~~~~~~~ The OCP and portfolio tests load pre-generated problem data from header files (e.g. ``lcvx_data.h``, ``markowitz_data.h``) and call the full solve pipeline. They assert that the optimal objective matches a reference value within 0.01% relative error. Problem data is generated by the Python scripts in each test directory (``generate_problem_data.py``), which use ``cvxpy`` to solve the reference problem. The generated ``.h`` files are committed to the repository, so ``cvxpy`` is only needed if you regenerate them. Running tests locally ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # Run all tests ctest --test-dir build --verbose # Run a specific test ctest --test-dir build -R lcvx_test --verbose # Run with output on failure and retry ctest --test-dir build --rerun-failed --output-on-failure CI Workflows ------------ All workflows are in ``.github/workflows/``. unit_tests.yml — primary test suite ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Triggers on every push and pull request. Runs the full test matrix in parallel (``fail-fast: false``): .. list-table:: :header-rows: 1 * - OS - Compiler - Build types * - ubuntu-latest - clang - Debug, Release * - macos-latest - clang - Debug, Release * - windows-latest - MSVC - Debug, Release The Debug build enables ``-fsanitize=address,undefined`` on Linux and macOS, so memory errors and undefined behaviour are caught automatically. **To reproduce a CI failure locally (e.g. ubuntu clang Debug):** .. code-block:: bash cmake -B build -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \ -DQOCO_BUILD_TYPE=Debug -DENABLE_TESTING=True -S . cmake --build build ctest --test-dir build --verbose --rerun-failed --output-on-failure clang_tidy.yml — static analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Triggers on every push and pull request. Builds with ``CMAKE_EXPORT_COMPILE_COMMANDS=ON`` then runs ``clang-tidy`` on all ``src/*.c`` files (excluding OS-specific timers). Config is in ``.clang-tidy``. Enabled check families: ``bugprone-*``, ``clang-analyzer-*``, ``misc-unused-parameters``. Disabled: ``bugprone-easily-swappable-parameters``, ``clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling``. All warnings are treated as errors. **To reproduce locally:** .. code-block:: bash devtools/run_clang_tidy.sh clang_format.yml — formatting enforcement ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Triggers on every push and pull request. Runs ``clang-format --dry-run --Werror`` on all ``.c`` and ``.h`` files under ``src/``, ``include/``, and ``algebra/builtin/``. Fails if any file is not formatted according to ``.clang-format``. **To reproduce locally:** .. code-block:: bash devtools/run_clang_format.sh --check # check only (same as CI) devtools/run_clang_format.sh # fix in place benchmark_regression.yml — performance regression ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Triggers only on pull requests targeting ``main``. Builds both ``main`` and the PR branch, runs the benchmark suite against both, and posts a comparison report as a PR comment. Uses configs in ``benchmarks/configs/main.yml`` and ``benchmarks/configs/branch.yml``. docs.yml — documentation deployment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Triggers manually (``workflow_dispatch`` only). Builds the Sphinx docs from ``docs/`` and deploys to the ``gh-pages`` branch. Also deploys to the root of ``gh-pages`` if the version being built is the latest released version. Developer Tools --------------- All scripts in ``devtools/`` are intended to be run from the repository root. run_tests.sh ~~~~~~~~~~~~ Builds in Release mode with testing, demo, and benchmark runner enabled, then runs the full test suite. .. code-block:: bash devtools/run_tests.sh run_tests_gpu.sh ~~~~~~~~~~~~~~~~ Same as above but configures the CUDA backend with CUDA 13.0. .. code-block:: bash devtools/run_tests_gpu.sh run_clang_tidy.sh ~~~~~~~~~~~~~~~~~ Generates a temporary build directory with compile commands, runs ``clang-tidy`` on all ``src/*.c`` files, then removes the build directory. .. code-block:: bash devtools/run_clang_tidy.sh run_clang_format.sh ~~~~~~~~~~~~~~~~~~~ Checks or fixes formatting for ``src/``, ``include/``, and ``algebra/builtin/``. .. code-block:: bash devtools/run_clang_format.sh # fix in place devtools/run_clang_format.sh --check # report violations and exit non-zero profile.sh ~~~~~~~~~~ Profiles a benchmark run on CPU using Valgrind's callgrind tool and opens the result in KCachegrind. .. code-block:: bash devtools/profile.sh path/to/benchmark/data profile_gpu.sh ~~~~~~~~~~~~~~ Profiles a benchmark run on GPU using NVIDIA Nsight Systems. .. code-block:: bash devtools/profile_gpu.sh path/to/benchmark/data