Developer Guide

Repository Layout 

qoco/
├── src/                  # Core solver (backend-agnostic)
│   ├── qoco_api.c        # Public API: setup, solve, cleanup
│   ├── kkt.c             # KKT matrix construction and RHS assembly
│   ├── cone.c / cone.cu  # Cone operations (CPU or GPU, selected at build time)
│   ├── equilibration.c   # Ruiz scaling
│   ├── common_linalg.c   # Linalg helpers that don't depend on backend
│   └── qoco_utils.c      # Printing, stopping criteria, solution copy
├── include/              # Public and internal headers
│   ├── structs.h         # All struct definitions including LinSysBackend
│   └── qoco_linalg.h     # Backend-agnostic linalg interface (types + ops)
├── algebra/
│   ├── CMakeLists.txt    # Selects backend, sets compile definitions
│   ├── builtin/          # CPU backend (QDLDL + AMD)
│   └── cuda/             # GPU backend (cuDSS + cuSPARSE + cuBLAS)
├── tests/
│   ├── unit_tests/       # Component-level tests
│   ├── simple_tests/     # Small end-to-end problems
│   ├── ocp/              # Optimal control problem tests
│   └── portfolio/        # Portfolio optimization tests
├── devtools/             # Local developer scripts
├── benchmarks/           # Benchmark runner and configs
└── .github/workflows/    # CI definitions

Backend Architecture 

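Apart from the cone kernels, which are compiled from either src/cone.c or src/cone.cu (see Cone Implementation), the solver core in src/ is backend-agnostic. It interacts with the linear algebra layer only through two abstractions: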

Opaque types — QOCOMatrix, QOCOVectorf, QOCOVectori are forward-declared in include/qoco_linalg.h and defined differently by each backend.
Function pointer table — LinSysBackend in include/structs.h holds pointers to the backend’s setup, factor, solve, and cleanup functions. The solver calls these through solver->linsys->....

Backend Interface 

LinSysBackend is defined in include/structs.h:

typedef struct {
  const char* (*linsys_name)();
  LinSysData* (*linsys_setup)(QOCOProblemData*, QOCOSettings*, QOCOInt Wnnz);
  void (*linsys_set_nt_identity)(LinSysData*, QOCOInt m);
  void (*linsys_update_nt)(LinSysData*, QOCOVectorf* WtW_vec,
                           QOCOFloat kkt_static_reg_G, QOCOInt m);
  void (*linsys_update_data)(LinSysData*, QOCOProblemData*);
  void (*linsys_factor)(LinSysData*, QOCOInt n, QOCOFloat kkt_dynamic_reg);
  void (*linsys_solve)(LinSysData*, QOCOWorkspace*, QOCOVectorf* b,
                       QOCOVectorf* x, QOCOFloat ir_tol,
                       QOCOInt max_ir_iters);
  void (*linsys_cleanup)(LinSysData*);
} LinSysBackend;

Each backend exports a LinSysBackend backend global that is linked into the final binary. The solver calls linsys_setup at startup and thereafter calls linsys_factor / linsys_solve each iteration to solve the KKT system.
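As a rough illustration of the function-pointer-table pattern described above, the sketch below wires a toy diagonal "solver" behind a three-entry table. ToyBackend, toy_backend, and core_step are hypothetical names invented for this example; the real LinSysBackend has more hooks and different signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for LinSysBackend: only three hooks. */
typedef struct {
  const char* (*name)(void);
  void (*factor)(double* diag, int n, double reg);
  void (*solve)(const double* diag, const double* b, double* x, int n);
} ToyBackend;

static const char* toy_name(void) { return "toy-diagonal"; }

/* "Factorization" of a diagonal system: raise tiny pivots up to reg. */
static void toy_factor(double* diag, int n, double reg) {
  assert(diag != NULL);
  for (int i = 0; i < n; ++i) {
    if (diag[i] < reg) diag[i] = reg;
  }
}

static void toy_solve(const double* diag, const double* b, double* x, int n) {
  assert(diag != NULL && b != NULL && x != NULL);
  for (int i = 0; i < n; ++i) x[i] = b[i] / diag[i];
}

/* The single table a backend would export. */
static const ToyBackend toy_backend = {toy_name, toy_factor, toy_solve};

/* Core code sees only the table, never the concrete functions. */
void core_step(const ToyBackend* be, double* diag, const double* b,
               double* x, int n) {
  be->factor(diag, n, 1e-8);
  be->solve(diag, b, x, n);
}
```

Swapping backends then amounts to linking a different table; core_step itself does not change.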

Backend selection happens at configure time via the CMake variable QOCO_ALGEBRA_BACKEND (default: builtin). algebra/CMakeLists.txt validates the choice, adds the corresponding directory to the include path, defines either QOCO_ALGEBRA_BACKEND_BUILTIN or QOCO_ALGEBRA_BACKEND_CUDA, and calls add_subdirectory on the backend folder. The root CMakeLists.txt then picks src/cone.c (builtin) or src/cone.cu (CUDA) accordingly.

CPU (Builtin) Backend 

Location: algebra/builtin/

File	Purpose
`builtin_types.h`	Concrete struct definitions for `QOCOMatrix`, `QOCOVectorf`, `QOCOVectori`
`builtin_linalg.c`	All linalg operations: SpMv, norms, element-wise ops, etc.
`qdldl_backend.c`	`LinSysBackend` implementation: setup, factor, solve, cleanup

Type layout (builtin_types.h):

struct QOCOVectorf_ { QOCOFloat* data; QOCOInt len; };
struct QOCOVectori_ { QOCOInt*   data; QOCOInt len; };
struct QOCOMatrix_  { QOCOCscMatrix* csc; };

Everything lives on the CPU. get_data_vectorf(v) returns v->data directly.

Linear system (qdldl_backend.c):

linsys_setup builds the KKT matrix from P, A, G using construct_kkt (src/kkt.c), computes an AMD reordering for fill reduction, and permutes the matrix to PKPt. Index mappings (PregtoKKT, AttoKKT, GttoKKT, nt2kkt, ntdiag2kkt) are stored so that subsequent NT scaling updates can write directly into the correct entries of PKPt without rebuilding it from scratch.
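The value of a precomputed index mapping can be seen in a small sketch: once the position of each diagonal entry in the stored value array is known, an update is a plain scatter with no searching or rebuilding. The function name, the argument layout, and the exact meaning of the map are assumptions made for illustration, not the backend's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Writes -(w[i] + reg) into the stored KKT values at positions map[i].
 * `map` plays the role of a precomputed mapping such as ntdiag2kkt;
 * the semantics of the real mappings are an assumption here. */
void update_diag_block(double* kkt_vals, const int* map, const double* w,
                       int count, double reg) {
  assert(kkt_vals != NULL && map != NULL && w != NULL);
  for (int i = 0; i < count; ++i) {
    kkt_vals[map[i]] = -(w[i] + reg);
  }
}
```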

linsys_factor calls QDLDL_factor on the permuted KKT matrix.

linsys_solve calls QDLDL_solve then runs adaptive iterative refinement: it repeats up to max_ir_iters times, stopping early when the KKT residual \(\|Kx - b\|_\infty\) falls below ir_tol. A best-solution checkpoint is maintained in permuted space; if a refinement step worsens the residual the best solution is restored and refinement stops immediately. The number of refinement iterations taken is accumulated in work->ir_iters and printed in the IR column of the iteration log.
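The refinement loop with a best-solution checkpoint can be sketched on a small dense system. This is a simplified illustration under stated assumptions: the matrix is dense and row-major, the "factorization" is a crude diagonal solve (demo_diag_solve), and the names refine, approx_solve_fn, and IR_MAX_N are hypothetical. The real backend works on the permuted sparse KKT matrix with QDLDL.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

#define IR_MAX_N 8 /* sketch-only cap so scratch buffers fit on the stack */

/* Hypothetical callback: approximately solves K * x = rhs. */
typedef void (*approx_solve_fn)(const double* rhs, double* x, int n);

/* Demo system K = [[4, 1], [1, 3]] with a deliberately inexact solver
 * that inverts only the diagonal. */
static const double DEMO_K[4] = {4.0, 1.0, 1.0, 3.0};
void demo_diag_solve(const double* rhs, double* x, int n) {
  for (int i = 0; i < n; ++i) x[i] = rhs[i] / DEMO_K[i * n + i];
}

/* Writes r = b - K*x and returns its infinity norm. */
static double residual_inf(const double* K, const double* x, const double* b,
                           double* r, int n) {
  double nrm = 0.0;
  for (int i = 0; i < n; ++i) {
    double kx = 0.0;
    for (int j = 0; j < n; ++j) kx += K[i * n + j] * x[j];
    r[i] = b[i] - kx;
    if (fabs(r[i]) > nrm) nrm = fabs(r[i]);
  }
  return nrm;
}

/* Solves then refines x in place; returns the number of refinement steps. */
int refine(const double* K, const double* b, double* x, int n,
           approx_solve_fn solve, double ir_tol, int max_ir_iters) {
  assert(n > 0 && n <= IR_MAX_N);
  double r[IR_MAX_N], dx[IR_MAX_N], best[IR_MAX_N];
  solve(b, x, n); /* initial solve */
  double best_res = residual_inf(K, x, b, r, n);
  memcpy(best, x, (size_t)n * sizeof(double));
  int iters = 0;
  while (iters < max_ir_iters && best_res >= ir_tol) {
    solve(r, dx, n); /* correction from the current residual */
    for (int i = 0; i < n; ++i) x[i] += dx[i];
    ++iters;
    double res = residual_inf(K, x, b, r, n);
    if (res > best_res) { /* step made things worse: roll back and stop */
      memcpy(x, best, (size_t)n * sizeof(double));
      break;
    }
    best_res = res;
    memcpy(best, x, (size_t)n * sizeof(double));
  }
  return iters;
}
```

For the demo system with b = (1, 2), the exact solution is (1/11, 7/11), and the diagonal solver alone is far from it; the refinement steps close the gap.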

GPU (CUDA) Backend 

Location: algebra/cuda/

File	Purpose
`cuda_types.h`	Concrete struct definitions — each type holds both host and device pointers
`cuda_linalg.cu`	CUDA kernels for SpMv, norms, element-wise ops, etc.
`cudss_backend.cu`	`LinSysBackend` implementation using cuDSS

Type layout (cuda_types.h):

struct QOCOVectorf_ { QOCOFloat* host; QOCOFloat* device; QOCOInt len; };
struct QOCOVectori_ { QOCOInt*   host; QOCOInt*   device; QOCOInt len; };
struct QOCOMatrix_  {
  QOCOCscMatrix* csc_host;     // CSC on host
  CusparseMatrix* csr_device;  // CSR on device (data)
  CusparseMatrix* csr_meta;    // CSR on device (metadata/structure)
};

CPU mode flag: A thread-local cpu_mode flag controls which pointer get_data_vectorf() returns. Core solver code that runs on the CPU calls set_cpu_mode(1) before accessing data, ensuring it gets the host pointer. GPU kernel launches use set_cpu_mode(0).
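A minimal sketch of the idea, assuming a C11 compiler: a thread-local flag decides which of two pointers an accessor hands back. DualVec, toy_set_cpu_mode, and toy_get_data are hypothetical stand-ins for the CUDA types and accessors, and the "device" pointer is ordinary host memory here so the example runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of a dual-pointer vector type. */
typedef struct {
  double* host;
  double* device; /* plain memory in this sketch */
  int len;
} DualVec;

/* One flag per thread, so threads cannot flip each other's view. */
static _Thread_local int toy_cpu_mode = 0;

void toy_set_cpu_mode(int on) { toy_cpu_mode = on; }

/* Returns the pointer matching the calling thread's current mode. */
double* toy_get_data(const DualVec* v) {
  assert(v != NULL);
  return toy_cpu_mode ? v->host : v->device;
}
```

The accessor signature stays identical in both modes, which is what lets backend-agnostic code call it without knowing where the data lives.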

Dynamic library loading: CUDA libraries are loaded at runtime with dlopen() in cudss_setup() rather than linked at build time. This allows the binary to run on systems without a GPU (returning a graceful error) and avoids mandatory CUDA toolkit installation for users of the CPU backend. The libraries loaded are:

libcudss.so — NVIDIA cuDSS sparse direct solver
libcusparse.so — Sparse matrix operations
libcublas.so — Dense linear algebra

Matrix format: The core solver uses CSC throughout. The CUDA backend converts to CSR for cuDSS (which requires CSR) during setup and stores the result on the device. Problem matrices A and G are stored in both formats.
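For readers unfamiliar with the conversion, here is a host-side sketch of the standard two-pass CSC-to-CSR algorithm (count, prefix sum, scatter). It uses plain int and double arrays rather than QOCOCscMatrix, and the function name csc_to_csr is made up for this example; how the CUDA backend actually performs the conversion is not shown here.

```c
#include <assert.h>
#include <stdlib.h>

/* Converts an m-by-n CSC matrix (col_ptr, row_idx, val) to CSR.
 * Caller provides row_ptr[m + 1], col_idx[nnz], out[nnz].
 * Returns 0 on success, -1 if scratch allocation fails. */
int csc_to_csr(int m, int n, const int* col_ptr, const int* row_idx,
               const double* val, int* row_ptr, int* col_idx, double* out) {
  assert(m >= 0 && n >= 0);
  int nnz = col_ptr[n];
  int* next = calloc((size_t)m + 1, sizeof(int));
  if (next == NULL) return -1;

  /* Pass 1: count entries per row. */
  for (int k = 0; k < nnz; ++k) next[row_idx[k]]++;

  /* Prefix sum gives row starts; next[i] becomes the write cursor of row i. */
  int sum = 0;
  for (int i = 0; i < m; ++i) {
    int cnt = next[i];
    row_ptr[i] = sum;
    next[i] = sum;
    sum += cnt;
  }
  row_ptr[m] = nnz;

  /* Pass 2: scatter. Visiting columns in order keeps rows sorted by column. */
  for (int j = 0; j < n; ++j) {
    for (int k = col_ptr[j]; k < col_ptr[j + 1]; ++k) {
      int dst = next[row_idx[k]]++;
      col_idx[dst] = j;
      out[dst] = val[k];
    }
  }
  free(next);
  return 0;
}
```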

Linear system: linsys_setup constructs the KKT matrix on the CPU via the shared construct_kkt function, converts it to CSR, uploads to device, and initialises a cuDSS solver handle. linsys_factor and linsys_solve call into cuDSS. The solve result is left on device; sync_vector_to_host is called explicitly when the CPU needs to read the result.

Cone Implementation 

Cone operations (products, divisions, NT scaling, linesearch) are in src/cone.c for the builtin backend and src/cone.cu for the CUDA backend. The file is selected at build time — only one is ever compiled. The CUDA version implements the same logic as CUDA kernels dispatched via the same function signatures.

Implementation Details 

Closed-Form SOC Step Length 

The linesearch for the second-order cone (soc_step_length in src/cone.c) computes, in closed form, the maximum step length \(\alpha \ge 0\) such that \(x + \alpha \, dx\) remains in the second-order cone

\[\mathcal{Q}^n = \{ (x_0, x_1) \in \mathbb{R} \times \mathbb{R}^{n-1} : x_0 \ge \|x_1\| \}\]

rather than performing a bisection search.

Derivation. The membership condition for \(x + \alpha \, dx\) is

\[(x_0 + \alpha \, dx_0)^2 \ge \|x_1 + \alpha \, dx_1\|^2.\]

Expanding and collecting by powers of \(\alpha\):

\[\underbrace{(dx_0^2 - \|dx_1\|^2)}_{a} \, \alpha^2 + \underbrace{2(x_0 \, dx_0 - x_1^\top dx_1)}_{b} \, \alpha + \underbrace{(x_0^2 - \|x_1\|^2)}_{c} \ge 0.\]

Because \(x\) is already in the cone, \(c = \det(x) = x_0^2 - \|x_1\|^2 \ge 0\), so \(\alpha = 0\) is always feasible. The maximum feasible \(\alpha\) is therefore the smallest positive real root of the quadratic \(a \alpha^2 + b \alpha + c = 0\).

Case analysis. Before solving the quadratic the code handles four special cases:

Scalar safeguard. If \(dx_0 < 0\), the first component could go negative. An independent upper bound \(-x_0 / dx_0\) is applied first.
No positive root (\(a > 0\) and \(b > 0\), or discriminant \(d = b^2 - 4ac < 0\)). Either the parabola opens upward with its vertex at \(-b / (2a) < 0\), so it is increasing on \(\alpha \ge 0\) and never drops below \(c \ge 0\), or it has no real roots at all. Either way the quadratic stays non-negative for all \(\alpha \ge 0\), so the current bound is returned unchanged.
Linear case (\(|a| < 10^{-14}\)). The leading term vanishes; the constraint is linear in \(\alpha\). With \(c \ge 0\) and the sign structure this imposes no additional restriction, so the bound is returned unchanged.
Boundary case (\(c = 0\), i.e. \(x\) is on the cone boundary). If \(a \ge 0\) there is no positive root; otherwise \(\alpha = 0\) is the only feasible point.

Numerically stable root computation. When none of the degenerate cases applies, the citardauq formula is used to avoid catastrophic cancellation. Let \(\sqrt{d} = \sqrt{b^2 - 4ac}\). Define

\[\begin{split}t = \begin{cases} -b - \sqrt{d} & \text{if } b \ge 0 \\ -b + \sqrt{d} & \text{if } b < 0 \end{cases}\end{split}\]

Then the two roots are computed as

\[r_1 = \frac{2c}{t}, \qquad r_2 = \frac{t}{2a}.\]

This form ensures that \(t\) is the sum of two terms of the same sign (\(-b\) and \(\mp\sqrt{d}\)), so neither \(r_1\) nor \(r_2\) involves subtracting nearly equal quantities, which is the source of the large relative error in the textbook quadratic formula. Negative roots are discarded (replaced by \(+\infty\)), and the smaller of \(r_1\), \(r_2\) is taken as the step-length restriction for this cone.

Static Regularization 

The KKT system solved at each IPM iteration is a symmetric indefinite linear system of the form

\[\begin{split}\begin{bmatrix} P & A^\top & G^\top \\ A & 0 & 0 \\ G & 0 & -W^\top W \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \\ \Delta z \end{bmatrix} = \begin{bmatrix} r_x \\ r_y \\ r_z \end{bmatrix}\end{split}\]

To keep the system nonsingular and to give each diagonal block a well-defined sign for the factorization, a diagonal perturbation is added before every factorization, yielding the regularized system

\[\begin{split}\begin{bmatrix} P + \varepsilon_P I & A^\top & G^\top \\ A & -\varepsilon_A I & 0 \\ G & 0 & -W^\top W - \varepsilon_G I \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta y \\ \Delta z \end{bmatrix} = \begin{bmatrix} r_x \\ r_y \\ r_z \end{bmatrix}\end{split}\]

The three parameters are kept separate because the blocks have different signs and may require different magnitudes:

Setting	Block	Rationale
`kkt_static_reg_P` (\(\varepsilon_P\))	(1,1) — \(P\)	Ensures the (1,1) block is positive definite even when \(P\) is only positive semidefinite.
`kkt_static_reg_A` (\(\varepsilon_A\))	(2,2) — equality constraints	Gives the zero (2,2) block a definite (negative) sign, preventing near-zero pivots on problems with redundant equality constraints.
`kkt_static_reg_G` (\(\varepsilon_G\))	(3,3) — NT scaling \(W^\top W\)	Guards against near-zero pivots when the NT scaling matrix is ill-conditioned near the cone boundary.

Implementation. kkt_static_reg_P is applied once during setup by regularize_P (or construct_identity when \(P = 0\)) in src/qoco_api.c, and is corrected for in compute_objective and compute_kkt_residual so that reported objectives and residuals reflect the original unregularized problem. kkt_static_reg_A is baked into the KKT matrix structure by construct_kkt (src/kkt.c) at setup time and does not change across iterations. kkt_static_reg_G is re-applied every iteration by linsys_update_nt after the NT scaling block is refreshed.
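To make the sign pattern concrete, the sketch below applies the three perturbations to a dense KKT matrix in one pass. Dense row-major storage and the function name add_static_reg are simplifications assumed for this example; as described above, the solver itself applies the three terms at different times and on sparse data.

```c
#include <assert.h>
#include <stddef.h>

/* Adds the block-wise diagonal perturbation to a dense row-major KKT
 * matrix of size N = n + p + m, ordered as [x; y; z]:
 *   +eps_P on the first n diagonal entries,
 *   -eps_A on the next p, and -eps_G on the last m. */
void add_static_reg(double* K, int n, int p, int m,
                    double eps_P, double eps_A, double eps_G) {
  assert(K != NULL && n >= 0 && p >= 0 && m >= 0);
  int N = n + p + m;
  for (int i = 0; i < N; ++i) {
    double shift = (i < n) ? eps_P : (i < n + p) ? -eps_A : -eps_G;
    K[(size_t)i * (size_t)N + (size_t)i] += shift;
  }
}
```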

Adaptive Dynamic Regularization 

QOCO uses two coupled mechanisms to handle near-singular KKT systems.

Per-Pivot Dynamic Regularization (QDLDL)

Inside QDLDL_factor (lib/qdldl/src/qdldl.c), each diagonal pivot D[k] must have the correct sign for the quasi-definite structure: positive for the \(P\) block, negative for the equality constraint and \(-W^\top W\) blocks. pos_diags is set to \(n\) (the number of primal variables) and encodes the boundary between positive- and negative-expected pivots under the AMD permutation: a pivot whose original row index perm[k] < n is expected to be positive; one with perm[k] >= n is expected to be negative.

If a pivot has the wrong sign or is smaller in magnitude than 1e-11, it is replaced with ±kkt_dynamic_reg:

// Positive-expected pivot (perm[k] < pos_diags):
if (D[k] < 1e-11)   D[k] = dyn_reg;

// Negative-expected pivot:
if (D[k] > -1e-11)  D[k] = -dyn_reg;

The threshold (1e-11) and the replacement value (kkt_dynamic_reg) are deliberately decoupled. The threshold is fixed: it answers “is this pivot numerically bad?” — a property of the problem, not of how many times the solver has stalled. The replacement value escalates with the adaptive outer loop, ensuring bad pivots are substituted with a value large enough to dominate downstream numerical error. Coupling the two (using kkt_dynamic_reg for both) would cause the threshold to grow alongside the replacement, perturbing pivots that are small but valid — making the factorization less accurate than necessary.

Step-Stall and NaN Detection (Outer Loop)

After each predictor-corrector step, check_stopping (src/qoco_utils.c) examines work->a. Two situations drive work->a to zero or close to it and trigger this path:

The computed step size is genuinely tiny — the iterates have stalled.
The KKT solution contains NaN values, which signals a likely factorization failure. predictor_corrector detects NaNs via check_nan, sets work->a = 0.0, and returns early before updating the iterates.

When work->a \(< 10^{-8}\), the solver multiplies kkt_dynamic_reg by 10 and returns 0 from check_stopping. This is not a retry — the outer loop in qoco_solve advances to the next IPM iteration, which will re-factorize with the larger kkt_dynamic_reg. If kkt_dynamic_reg has grown past \(10^{-6}\), the solver instead applies the inaccurate tolerances (abstol_inacc, reltol_inacc): it exits with QOCO_SOLVED_INACCURATE if the looser check passes, or QOCO_NUMERICAL_ERROR otherwise.
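The escalation logic can be summarized in a few lines. The sketch below is an interpretation of the paragraph above, not the code in check_stopping: the enum, the function name, and in particular the order of the two checks (cap first, then multiply) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CONTINUE_IPM, CHECK_INACCURATE } StallAction; /* hypothetical */

#define STALL_STEP_TOL 1e-8 /* step threshold quoted above */
#define DYN_REG_CAP 1e-6    /* escalation limit quoted above */

/* Decides what to do after a step of length a; may escalate *dyn_reg. */
StallAction handle_step(double a, double* dyn_reg) {
  assert(dyn_reg != NULL && *dyn_reg > 0.0);
  if (a >= STALL_STEP_TOL) return CONTINUE_IPM;        /* healthy step */
  if (*dyn_reg > DYN_REG_CAP) return CHECK_INACCURATE; /* escalation exhausted */
  *dyn_reg *= 10.0; /* the next iteration re-factorizes with this value */
  return CONTINUE_IPM;
}
```

A caller receiving CHECK_INACCURATE would then evaluate the looser tolerances and pick between the inaccurate-solved and numerical-error statuses.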

Stopping Criteria 

At the start of each IPM iteration check_stopping (src/qoco_utils.c) tests three residuals in the original (unscaled) problem space:

Primal residual \(r_p = \max(\|Ax - b\|_\infty,\ \|Gx + s - h\|_\infty)\)
Dual residual \(r_d = \|Px + c + A^\top y + G^\top z\|_\infty\)
Duality gap \(g = s^\top z\) (note: distinct from \(\mu = s^\top z / m\), the per-component complementarity used by the predictor-corrector)

The solver declares QOCO_SOLVED when all three satisfy an absolute-plus-relative threshold simultaneously:

\[\begin{split}r_p &< \varepsilon_{\text{abs}} + \varepsilon_{\text{rel}} \cdot \max(\|Ax\|_\infty,\, \|b\|_\infty,\, \|Gx\|_\infty,\, \|h\|_\infty,\, \|s\|_\infty) \\ r_d &< \varepsilon_{\text{abs}} + \varepsilon_{\text{rel}} \cdot \max(\|Px\|_\infty,\, \|A^\top y\|_\infty,\, \|G^\top z\|_\infty,\, \|c\|_\infty) \\ g &< \varepsilon_{\text{abs}} + \varepsilon_{\text{rel}} \cdot \max(1,\, |p_{\text{obj}}|,\, |d_{\text{obj}}|)\end{split}\]

where \(\varepsilon_{\text{abs}}\) = abstol and \(\varepsilon_{\text{rel}}\) = reltol. Because QOCO equilibrates the problem internally, the Ruiz scaling factors are unwound before computing each norm so that the residuals reflect the original problem data.

Best-Iterate Restoration 

The interior-point sequence is not monotone in \((r_p, r_d, g)\) — a late iteration can degrade after the iterates have already passed close to the solution, and a NaN-producing factorization can blow up an otherwise good iterate. To avoid returning a worse point than the solver actually found, QOCO maintains a checkpoint of the best iterate seen so far and falls back to it on non-success exits.

The “best” iterate is defined by a composite progress metric, computed in check_stopping (src/qoco_utils.c) on the same unscaled residuals used by the regular stopping check:

\[M = \max\!\left( \frac{r_p}{\varepsilon^{\text{inacc}}_{\text{abs}} + \varepsilon^{\text{inacc}}_{\text{rel}} \cdot s_p},\ \frac{r_d}{\varepsilon^{\text{inacc}}_{\text{abs}} + \varepsilon^{\text{inacc}}_{\text{rel}} \cdot s_d},\ \frac{g }{\varepsilon^{\text{inacc}}_{\text{abs}} + \varepsilon^{\text{inacc}}_{\text{rel}} \cdot s_g} \right)\]

where \(s_p, s_d, s_g\) are the same scale factors used for the absolute / relative stopping check and \(\varepsilon^{\text{inacc}}_{\text{abs}}\) = abstol_inacc, \(\varepsilon^{\text{inacc}}_{\text{rel}}\) = reltol_inacc. \(M\) is in inaccurate-tolerance units: \(M \le 1\) exactly when the current iterate already satisfies the inaccurate stopping criterion on all three residuals simultaneously, and smaller is always better. Combining all three residuals into one scalar avoids the ambiguity of multi-objective comparisons (e.g. lower \(r_p\) but higher \(g\)).

Each iteration with a finite \(M\) strictly smaller than the previous best overwrites the saved checkpoint. The checkpoint stores x, s, y, z in scaled space along with \(r_p\), \(r_d\), \(g\), \(p_{\text{obj}}\), the metric value, and the iteration index, in the best_* fields of QOCOWorkspace.

restore_best_iterate (src/qoco_utils.c) is invoked from qoco_solve on two exit paths:

Numerical error — when check_stopping returns QOCO_NUMERICAL_ERROR (dynamic regularization has saturated and the inaccurate check on the current iterate failed).
Max-iter — when the IPM loop exits without converging.

If the restored iterate satisfies \(M \le 1\), the status is upgraded from QOCO_NUMERICAL_ERROR / QOCO_MAX_ITER to QOCO_SOLVED_INACCURATE. Restoration runs before unscale_variables and copy_solution, so the returned QOCOSolution always corresponds to the best iterate the solver visited, not the diverged final state.

Restoration is a no-op when no best iterate has been recorded (best_valid is zero), which protects the case where the very first check_stopping call produces a non-finite metric.

Building 

Prerequisites 

CMake ≥ 3.18
C compiler: clang or gcc (Linux/macOS), MSVC (Windows)
Python 3.11+ with cvxpy (for test data generation)
CUDA toolkit ≥ 13.0 (GPU backend only)

CPU backend (default)

cmake -B build -DQOCO_BUILD_TYPE=Release -DENABLE_TESTING=True
cmake --build build

GPU backend 

cmake -B build \
  -DQOCO_ALGEBRA_BACKEND=cuda \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \
  -DQOCO_BUILD_TYPE=Release \
  -DENABLE_TESTING=True
cmake --build build -j$(nproc)

Floating point precision 

QOCOFloat is selected at configure time in the root CMakeLists.txt and defined in include/definitions.h. Double precision is the default. Developers can build the CPU solver in single precision or long double precision with:

cmake -B build-float -DQOCO_SINGLE_PRECISION=ON
cmake -B build-long-double -DQOCO_LONG_DOUBLE_PRECISION=ON

The precision options are mutually exclusive. CMake stops with a configuration error if both are enabled.

The selected precision is propagated in two places:

configure/qoco_config.h.in defines QOCO_SINGLE_PRECISION or QOCO_LONG_DOUBLE_PRECISION for QOCO.
QDLDL_FLOAT or QDLDL_LONG_DOUBLE is forced in the QDLDL subproject so QDLDL_float has the same size as QOCOFloat.

Long double precision has two additional constraints. It is supported only by the builtin CPU backend because the CUDA backend and cuDSS path do not expose a matching long double solve path. It also requires a platform where long double is wider than double; CMake checks LDBL_MANT_DIG against DBL_MANT_DIG and fails early when the types have equivalent precision.

When adding precision-sensitive code, use the precision helpers from include/definitions.h instead of hard-coding double-specific functions or formats:

qoco_sqrt dispatches to sqrtf, sqrt, or sqrtl.
QOCOFloat_MAX comes from FLT_MAX, DBL_MAX, or LDBL_MAX.
QOCOFloat_PRINT_FORMAT and QOCOFloat_PRINT_ARG keep formatted output correct for float, double, and long double.

Use QOCOFloat for all user problem data copies, workspace values, residuals, and backend numeric storage. Avoid temporary casts through double unless the loss of precision is intentional and documented.

CMake options 

Option	Default	Description
`QOCO_ALGEBRA_BACKEND`	`builtin`	Backend: `builtin` or `cuda`
`QOCO_BUILD_TYPE`	`Release`	`Debug` (adds `-g`, ASAN/UBSAN on Unix) or `Release` (`-O3`)
`QOCO_SINGLE_PRECISION`	`OFF`	Use `float` instead of `double`
`QOCO_LONG_DOUBLE_PRECISION`	`OFF`	Use `long double` instead of `double` for the builtin CPU backend
`ENABLE_TESTING`	`OFF`	Build and register test suite
`BUILD_QOCO_DEMO`	`OFF`	Build `examples/qoco_demo`
`BUILD_QOCO_BENCHMARK_RUNNER`	`OFF`	Build benchmark runner

Unit Tests 

Tests use Google Test and are run with ctest. All test executables link against qocostatic.

Test categories 

Directory	Executable(s)	What it covers
`tests/unit_tests/`	`linalg_test`, `cone_test`, `input_validation_test`, `precision_test`	Individual components
`tests/simple_tests/`	`missing_constraints_test`	End-to-end with missing constraint types (LP-only, SOC-only)
`tests/ocp/`	`lcvx_test`, `lcvx_bad_scaling_test`, `pdg_test`	Optimal control problems
`tests/portfolio/`	`markowitz_test`	Portfolio optimization (Markowitz)

Unit test details 

linalg_test — covers the linalg layer:

CSC matrix creation and copying
Array copy / negate / scale
Dot products, sparse matrix-vector products

cone_test — covers src/cone.c:

Cone products and divisions for LP and SOC cones
Mixed LP + SOC problems

input_validation_test — covers src/input_validation.c:

Rejects invalid settings (tolerances, iteration counts, etc.)

precision_test — covers configured scalar precision:

Verifies QOCOFloat and QDLDL_float use the same storage size
Checks that long double builds preserve input values that would be rounded by a cast through double
Confirms setup and solve do not mutate user-owned problem data

Integration tests 

The OCP and portfolio tests load pre-generated problem data from header files (e.g. lcvx_data.h, markowitz_data.h) and call the full solve pipeline. They assert that the optimal objective matches a reference value within 0.01% relative error.

Problem data is generated by the Python scripts in each test directory (generate_problem_data.py), which use cvxpy to solve the reference problem. The generated .h files are committed to the repository, so cvxpy is only needed if you regenerate them.

Running tests locally 

# Run all tests
ctest --test-dir build --verbose

# Run a specific test
ctest --test-dir build -R lcvx_test --verbose

# Re-run only the tests that failed previously, showing their output
ctest --test-dir build --rerun-failed --output-on-failure

CI Workflows 

All workflows are in .github/workflows/.

unit_tests.yml — primary test suite 

Triggers on every push and pull request.

Runs the full test matrix in parallel (fail-fast: false):

OS	Compiler	Build types
ubuntu-latest	clang	Debug, Release
macos-latest	clang	Debug, Release
windows-latest	MSVC	Debug, Release

The Debug build enables -fsanitize=address,undefined on Linux and macOS, so memory errors and undefined behaviour are caught automatically.

To reproduce a CI failure locally (e.g. ubuntu clang Debug):

cmake -B build -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
      -DQOCO_BUILD_TYPE=Debug -DENABLE_TESTING=True -S .
cmake --build build
ctest --test-dir build --verbose --rerun-failed --output-on-failure

clang_tidy.yml — static analysis 

Triggers on every push and pull request.

Builds with CMAKE_EXPORT_COMPILE_COMMANDS=ON then runs clang-tidy on all src/*.c files (excluding OS-specific timers). Config is in .clang-tidy.

Enabled check families: bugprone-*, clang-analyzer-*, misc-unused-parameters. Disabled: bugprone-easily-swappable-parameters, clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling. All warnings are treated as errors.

To reproduce locally:

devtools/run_clang_tidy.sh

clang_format.yml — formatting enforcement 

Triggers on every push and pull request.

Runs clang-format --dry-run --Werror on all .c and .h files under src/, include/, and algebra/builtin/. Fails if any file is not formatted according to .clang-format.

To reproduce locally:

devtools/run_clang_format.sh --check   # check only (same as CI)
devtools/run_clang_format.sh           # fix in place

benchmark_regression.yml — performance regression 

Triggers only on pull requests targeting main.

Builds both main and the PR branch, runs the benchmark suite against both, and posts a comparison report as a PR comment. Uses configs in benchmarks/configs/main.yml and benchmarks/configs/branch.yml.

docs.yml — documentation deployment 

Triggers manually (workflow_dispatch only).

Builds the Sphinx docs from docs/ and deploys to the gh-pages branch. Also deploys to the root of gh-pages if the version being built is the latest released version.

Developer Tools 

All scripts in devtools/ are intended to be run from the repository root.

run_tests.sh 

Builds in Release mode with testing, demo, and benchmark runner enabled, then runs the full test suite.

devtools/run_tests.sh

run_tests_gpu.sh 

Same as above but configures the CUDA backend with CUDA 13.0.

devtools/run_tests_gpu.sh

run_clang_tidy.sh 

Generates a temporary build directory with compile commands, runs clang-tidy on all src/*.c files, then removes the build directory.

devtools/run_clang_tidy.sh

run_clang_format.sh 

Checks or fixes formatting for src/, include/, and algebra/builtin/.

devtools/run_clang_format.sh           # fix in place
devtools/run_clang_format.sh --check   # report violations and exit non-zero

profile.sh 

Profiles a benchmark run on CPU using Valgrind’s callgrind tool and opens the result in KCachegrind.

devtools/profile.sh path/to/benchmark/data

profile_gpu.sh 

Profiles a benchmark run on GPU using NVIDIA Nsight Systems.

devtools/profile_gpu.sh path/to/benchmark/data
