Research

Carson Rodrigues

Independent researcher working at the intersection of production AI systems and their measurement — voice-AI latency, LLMOps, the Model Context Protocol, multi-agent reliability, human–AI trust, and clinical ML. 14 papers — 1 accepted at ICANN 2026 (Springer LNCS), with 8 more under peer review and 5 live preprints. Most studies are built on systems I ship in production, so the methods are evaluated under real workloads rather than in isolation.

ORCID iD: 0009-0001-7195-6742 · carson@celabe.com · Affiliation: Celabe, Research Division

Voice AI systems & latency · LLMOps · Model Context Protocol (MCP) · Multi-agent reliability · Human–AI trust · Agentic safety · Clinical ML

Papers & preprints

Updated July 2026

Accepted · ICANN 2026 · Conference · Springer LNCS

Latency Optimization for a Production Voice AI Platform

Carson Rodrigues (Celabe), Oysturn Vas (University of Waterloo)

A systems-level latency study of a production voice-AI platform (Anthropic Claude intent detection + ElevenLabs TTS over a NestJS WebSocket pipeline). The central finding is that running intent detection and TTS concurrently — rather than shaving any single stage — is the highest-leverage optimization, cutting median end-to-end latency from 3,277 ms to 1,909 ms.
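The effect of overlapping two stages can be illustrated with a minimal sketch. It uses Python's `asyncio` rather than the platform's NestJS pipeline, and the stage functions, their names, and their durations are hypothetical stand-ins, not the paper's code or measurements. How the real system lets TTS begin before intent detection finishes is not described here; the sketch only shows that overlapped stages cost roughly the slower stage rather than the sum of both.

```python
import asyncio
import time

# Hypothetical stage durations in seconds (assumptions, not measured values).
INTENT_SECONDS = 0.05
TTS_SECONDS = 0.08


async def detect_intent(text: str) -> str:
    """Stand-in for an LLM intent-detection call."""
    await asyncio.sleep(INTENT_SECONDS)
    return "example_intent"


async def synthesize_speech(text: str) -> bytes:
    """Stand-in for a text-to-speech call."""
    await asyncio.sleep(TTS_SECONDS)
    return text.encode("utf-8")


async def run_sequential(text: str) -> float:
    """Run the two stages one after the other and return elapsed seconds."""
    start = time.perf_counter()
    await detect_intent(text)
    await synthesize_speech(text)
    return time.perf_counter() - start


async def run_concurrent(text: str) -> float:
    """Run the two stages at the same time and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(detect_intent(text), synthesize_speech(text))
    return time.perf_counter() - start


sequential = asyncio.run(run_sequential("hello"))
concurrent = asyncio.run(run_concurrent("hello"))
print(f"sequential: {sequential * 1000:.0f} ms, concurrent: {concurrent * 1000:.0f} ms")
```

With these stand-in durations the sequential path takes about the sum of the two sleeps and the concurrent path about the longer one, which is why overlapping stages can matter more than trimming any single stage.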

−41.8% p50 latency (3,277 → 1,909 ms)

Voice AI · Latency · LLM systems

Accepted at ICANN 2026 (peer-reviewed; Springer LNCS proceedings). Registration confirmed; camera-ready June 2026.

Submitted · IEEE Software · Journal · magazine article

MCP Server Architecture Patterns

Carson Rodrigues (Celabe), Oysturn Vas (University of Waterloo)

A pattern catalogue for production Model Context Protocol (MCP) servers, covering a Gamma-format taxonomy, anti-patterns, and cross-cutting concerns, and validated with an inter-rater reliability study (kappa = 0.76). It proposes a scoped Proxy-Aggregator pattern for servers whose tool count exceeds a practical threshold.
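For readers unfamiliar with the agreement statistic, here is a minimal sketch of Cohen's kappa for two raters. Whether the study used Cohen's kappa, a multi-rater variant such as Fleiss' kappa, or a weighted form is not stated on this page, so the choice of statistic is an assumption; the labels and ratings below are made up for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pattern labels assigned to eight servers by two raters.
rater_a = ["A", "A", "B", "B", "C", "C", "A", "B"]
rater_b = ["A", "A", "B", "C", "C", "C", "A", "B"]
kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 7 of 8 items, and chance agreement is 21/64, so kappa is 35/43, about 0.81.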

Inter-rater kappa = 0.76

MCP · Software architecture · Agents

arXiv preprint doi:10.48550/arXiv.2606.30317

Condensed magazine version submitted to IEEE Software on 29 June 2026; the extended 9-page version is openly available on arXiv (cs.AI primary, cs.SE cross-list) under the fuller title “MCP Server Architecture Patterns for LLM-Integrated Applications”.

Submitted · AAAI 2027 · Conference · double-blind

When Do LLMs Replace Fine-Tuned NLU? A Decision Framework for Intent Detection

Carson Rodrigues (Celabe), Oysturn Vas (University of Waterloo)

A decision framework for choosing between LLM classifiers and fine-tuned NLU on noisy production transcripts. Shows that full-data TF-IDF still reaches 95.2% on ATIS, and maps the regimes where an LLM-based intent classifier is — and is not — worth its cost and latency.

95.2% on ATIS (full-data TF-IDF baseline)

NLU · Intent detection · LLM evaluation

Submitted to AAAI-27 on 1 July 2026 (double-blind review; Montréal, February 2027).

Submitted · ACML 2026 · Conference · double-blind

When Is an LLM Worth It for Hyperparameter Optimization? A Budget-Matched Study

Carson Rodrigues (Celabe), Oysturn Vas (University of Waterloo), Isaiah Abner DCosta (University of Queensland), Nithish Kumar Prabhakaran (University of Queensland)

A budget-matched, multi-seed study of whether an LLM advisor actually helps hyperparameter optimization on tabular data. The deflationary finding: the advisor's strong first guess is not an LLM output but a fixed default configuration evaluated before any model call. Once classical search (random, Optuna-TPE, Bayesian optimization, successive halving) is seeded with that same default, the apparent lead collapses within a handful of evaluations and the LLM adds no measurable generalization benefit.
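The seeding idea can be sketched with a toy random search. This is not the released harness: the objective, the default configuration, and the search space below are hypothetical, chosen only to show what "seeding classical search with the same default" means when the warm start counts against the same evaluation budget.

```python
import random

# Hypothetical default configuration (an assumption, not the paper's default).
DEFAULT_CONFIG = {"learning_rate": 0.1, "max_depth": 6}


def validation_loss(config):
    """Toy objective standing in for a real train-and-validate run (lower is better)."""
    return (config["learning_rate"] - 0.08) ** 2 + 0.01 * (config["max_depth"] - 5) ** 2


def sample_config(rng):
    """Draw one random configuration from a hypothetical search space."""
    return {"learning_rate": rng.uniform(0.001, 1.0), "max_depth": rng.randint(1, 12)}


def random_search(budget, seed, warm_start=None):
    """Return the best-so-far loss after each of `budget` evaluations.

    If `warm_start` is given, it is evaluated first and counts as one
    evaluation, so seeded and unseeded runs spend the same budget.
    """
    rng = random.Random(seed)
    best = float("inf")
    trace = []
    for step in range(budget):
        use_default = step == 0 and warm_start is not None
        config = warm_start if use_default else sample_config(rng)
        best = min(best, validation_loss(config))
        trace.append(best)
    return trace


plain = random_search(budget=20, seed=0)
seeded = random_search(budget=20, seed=0, warm_start=DEFAULT_CONFIG)
print(f"after 1 evaluation: plain={plain[0]:.4f}, seeded={seeded[0]:.4f}")
print(f"after 20 evaluations: plain={plain[-1]:.4f}, seeded={seeded[-1]:.4f}")
```

Comparing best-so-far traces at equal evaluation counts, across several seeds, is what makes the comparison budget-matched: any method that starts from the same default gets the same first point on the curve.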

LLM warm-start = a default config, not the model

AutoML · Hyperparameter optimization · LLMs

arXiv preprint doi:10.48550/arXiv.2606.21641

Submitted to ACML 2026 on 26 June 2026 (Conference Track, double-blind; notification 15 September 2026). Preprint on arXiv (cs.LG · cs.AI); the harness and a script that reproduces every statistic are released.

Retargeting · preprint live · HCI journal (retargeting) · Journal · review article

Personality Traits and Trust in Large Language Models: A Scoping Review

Carson Rodrigues (Celabe), Simran Marian Rebello (Celabe)

A scoping review of how personality traits shape user trust in large language models, extended toward agentic (action-taking) systems with a set of testable propositions for future empirical work.

Scoping review + agentic propositions

Human–AI trust · Personality · Scoping review

PsyArXiv preprint doi:10.31234/osf.io/2mdu8

Preprint openly available on PsyArXiv; currently being retargeted, with a publisher transfer offer under consideration as of July 2026. Novelty: Big Five traits as a moderator of LLM trust calibration.

Submitted · NeurIPS 2026 · Conference · double-blind

Hallucination as Context Drift: Synchronization Protocols for Multi-Agent LLM Systems

Carson Rodrigues (Celabe)

Reframes a class of multi-agent LLM hallucination as context drift: divergence between agents' internal world-states. Introduces the Context Divergence Score (CDS) and the Shared State Verification Protocol (SSVP). Across two domains on Claude Haiku, naive full-broadcast synchronization backfires by propagating one agent's erroneous state (hallucination rate 0.658 vs. 0.492 for no-sync), while SSVP avoids that failure mode and beats full-broadcast using 58% fewer API calls.

Full-broadcast backfires; SSVP at −58% API calls

Multi-agent · Reliability · Evaluation

arXiv preprint doi:10.48550/arXiv.2606.21666

Under double-blind review; author notifications expected late September 2026. Preprint on arXiv.

In peer review · preprint live · JMIRx Med · Journal · overlay

DentaCoPilot: Dental Procedure Prediction from Patient Records

Carson Rodrigues (Celabe), Steffie Dione Rebello (KLE; co-PI)

A machine-learning approach to predicting dental procedures from patient records, evaluated on a synthetic pipeline benchmark plus a real public-data benchmark (AHRQ MEPS 2023 dental visits), with calibrated abstention for out-of-distribution charts.
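Abstention on uncertain inputs is often implemented as a confidence threshold over calibrated class probabilities. The sketch below shows that generic pattern only; how DentaCoPilot calibrates its probabilities or detects out-of-distribution charts is not described here, and the procedure labels and the 0.7 threshold are hypothetical.

```python
from typing import Dict, Optional


def predict_or_abstain(probabilities: Dict[str, float], threshold: float = 0.7) -> Optional[str]:
    """Return the most probable label, or None (abstain) if confidence is too low.

    Assumes `probabilities` are already calibrated, so that a score of 0.7
    means the prediction is right about 70% of the time.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    return label if confidence >= threshold else None


# Hypothetical model outputs for two charts.
confident_chart = {"filling": 0.85, "extraction": 0.10, "cleaning": 0.05}
uncertain_chart = {"filling": 0.40, "extraction": 0.35, "cleaning": 0.25}

print(predict_or_abstain(confident_chart))  # filling
print(predict_or_abstain(uncertain_chart))  # None
```

Returning `None` instead of a low-confidence guess lets a downstream workflow route the chart to a human reviewer.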

Clinical-AI preprint · real public-data benchmark (MEPS)

Clinical ML · Healthcare AI · Prediction

medRxiv preprint doi:10.64898/2026.05.07.26352635

Carson and Steffie share first authorship (equal contribution). Preprint live on medRxiv (v3); submitted for peer review at JMIRx Med — the PubMed-indexed overlay journal for medRxiv — on 16 July 2026.

Under review · IC-SIT 2026 (Silicon University, India) · Conference · IEEE Xplore

A Structural-Similarity (SSIM)-Based Framework

Carson Rodrigues (Celabe) and collaborators (7 authors)

A structural-similarity (SSIM)-based framework developed as a seven-author collaboration, targeting IEEE Xplore conference proceedings.

Collaborative (Carson a co-author)

SSIM · Collaboration

Acceptance notification pending.

Stage-1 Registered Report · PCI Registered Reports → Peer Community Journal · Registered Report

Personality and Over-Delegation to Agentic LLMs

Carson Rodrigues (Celabe)

An empirical follow-up to the trust scoping review (Paper 05), testing its propositions on over-delegation to agentic, action-taking LLMs. Stage-1 manuscript complete, with the jsPsych experiment harness built and the design powered at 0.80 for N = 320.
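To give a feel for what "powered at 0.80 for N = 320" can imply, the sketch below computes the smallest standardized effect detectable under one assumed design. The registered design, test, alpha level, and target effect size are not stated on this page, so everything here is an assumption: two equal groups, a two-sided test at alpha = 0.05, and the normal approximation to the two-sample t-test.

```python
import math
from statistics import NormalDist


def min_detectable_d(total_n: int, power: float = 0.80, alpha: float = 0.05) -> float:
    """Smallest Cohen's d detectable with the given power.

    Assumes two equal groups, a two-sided test, and the normal
    approximation: d = (z_{1-alpha/2} + z_{power}) * sqrt(2 / n_per_group).
    """
    per_group = total_n / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * math.sqrt(2 / per_group)


d = min_detectable_d(320)
print(f"minimum detectable d ≈ {d:.2f}")  # roughly 0.31 under these assumptions
```

Under a different design (more conditions, within-subject factors, a one-sided test), the same N and power would correspond to a different detectable effect.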

Stage-1 complete · powered at 0.80 for N = 320

Agentic safety · Human–AI trust · Over-delegation

Stage-1 Registered Report on the PCI-RR route (Peer Community Journal). PCI-RR is closed to new submissions between 1 July and 1 September 2026, so submission is scheduled for September.

arXiv submitted · TMLR next · TMLR (target) · Journal

Done Is Not Correct: Measuring Silent Failures and Self-Verification Calibration When LLM Agents Take CAD Actions

Carson Rodrigues (Celabe), Clive Rodrigues, Aravind Reddy G

A reproducible benchmark for agentic CAD. When an LLM agent writes parametric CAD code from a natural-language spec, syntactic success (the code runs, the geometry renders) decouples from semantic correctness (right dimensions, valid constraints, manufacturable features). The harness measures that silent-failure gap directly and tests whether claim-conditioned self-verification closes it.
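The two quantities at stake, a silent-failure rate and a calibration measure such as expected calibration error (ECE), can be sketched as follows. This is not the benchmark harness: the record fields `ran` and `correct`, the choice of all runs as the denominator, and the ten equal-width bins are assumptions made for illustration.

```python
def silent_failure_rate(runs):
    """Share of runs whose code executed but whose output was semantically wrong.

    Each run is a dict with hypothetical boolean fields `ran` and `correct`.
    The denominator is all runs, which is an assumption.
    """
    silent = sum(1 for run in runs if run["ran"] and not run["correct"])
    return silent / len(runs)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: size-weighted mean gap between stated confidence and accuracy per bin."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for confidence, ok in zip(confidences, outcomes):
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((confidence, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(mean_confidence - accuracy)
    return ece


# Hypothetical runs: one of four executes cleanly but is wrong.
runs = [
    {"ran": True, "correct": True},
    {"ran": True, "correct": False},
    {"ran": False, "correct": False},
    {"ran": True, "correct": True},
]
rate = silent_failure_rate(runs)

# Hypothetical self-reported confidences: 90% claimed, 50% actually correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(f"silent-failure rate = {rate:.2f}, ECE = {ece:.2f}")
```

In the toy data the agent claims 90% confidence but is right half the time, giving an ECE of 0.40; a well-calibrated agent would score near zero.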

468 runs · single-shot 14.1% silent failures (ECE 0.32) → 3.2% with enforced self-checks · expert–grader agreement κ = 0.64 (0.87 excluding one ambiguous task)

Agentic safety · Reliability · Evaluation

Benchmark complete (4 models × 3 conditions); independent CAD-engineer validation done (blind pass/fail scoring of 81 parts). Submitted to arXiv on 12 July 2026 (ID pending), with TMLR as the review venue. AAAI-27 was dropped on 9 July 2026 because it mandates in-person presentation in Montréal and this paper has no Canada-based author.

Reframed · app live · JMIRx Med (target) · Journal · overlay

Oravira: Design, Privacy-by-Design Architecture, and App-Quality Audit of a Mobile Oral-Health App for People Living with HIV

Carson Rodrigues (Celabe), Steffie Dione Rebello (KLE)

Oravira is a privacy-by-design mobile app that helps people living with HIV self-manage their oral health. The paper documents the system architecture, the privacy model that keeps sensitive status data off any server, and a MARS-based quality audit of the app-store landscape it sits in — a design-and-architecture contribution that stands on its own, without patient data.

App live on iOS + Android · MARS app-store audit

Clinical ML · Healthcare AI · Privacy

Reframed in July 2026: the clinical evaluation route through the KLE ethics committee and CTRI trial registration is no longer part of this paper, so it needs no ethics approval; a KLE-run study is referenced as future work only.

Submitted · REALM @ EMNLP 2026 workshop · Workshop · non-archival

DevTwin: Benchmarking Passively-Formed Identity Memory for Developer Agents

Carson Rodrigues (Celabe)

Personal-memory systems for language agents are usually scored on retrieval: whether a relevant chunk can be found in a stored history. DevTwin argues this misses what a personal memory is for, which is to know the person rather than the transcript, and benchmarks whether a memory built passively from a developer's own public artifacts (git history, papers, project documents) can answer identity questions about them.

Identity-memory benchmark (passive, artifact-grounded)

Agent memory · Evaluation · Personalization

Submitted to REALM @ EMNLP 2026 (non-archival; notification August 2026). The central result currently rests on a single-developer anchor (n=1) pending consented human ratings.

Manuscript final · arXiv imminent · TMLR (target) · Journal

It Depends on the Dataset: When a Brain-Encoding Model's Predicted Responses Beat Their Visual Backbone for Video Memorability

Carson Rodrigues (Celabe)

Do a brain-encoding foundation model's predicted fMRI responses forecast video memorability better than the visual backbone they are built on? Using the TRIBE encoder over its V-JEPA2 backbone, the answer flips with the dataset: a clean double dissociation rather than a single winner. The predicted-brain projection carries a small but real, vision-orthogonal signal that helps on one dataset and hurts on another, which cautions against treating a brain-encoding readout as a free upgrade over the backbone.

Dataset-dependent double dissociation · Memento10k: backbone 0.594 > brain 0.544 · VideoMem: brain 0.415 > backbone 0.368 (+0.047, p = 0.006); cross-dataset transfer inherits the flip

Neuro-AI · Representation learning · Evaluation

Analysis complete over Memento10k and VideoMem (VideoMem access granted). TMLR manuscript final and the arXiv bundle built; results were shared with the dataset authors first, and the VideoMem author replied positively and raised a possible benchmark collaboration for September 2026.

Draft complete · pre-arXiv · arXiv (cs.AI) → perspective venue TBD · Perspective · position paper

Dreams and Hallucinations: Generative Memory Reconstruction in Biological and Artificial Intelligence

Carson Rodrigues (Celabe)

A perspective paper arguing that dreaming and LLM hallucination are the same operation seen in two substrates: generative reconstruction from distributed memory under incomplete constraints. The single axis separating reliable output from a dream or a hallucination is the strength of constraint and verification placed on the generator. It proposes a four-stage framework — encoding, latent, generative reconstruction, reality verification — recasts mitigation as restoring verification rather than suppressing generation, and reads lucid dreaming as controlled generation mid-stream.

Four-stage framework · 5 falsifiable predictions

Memory · Hallucination · Neuro-AI

Sole-authored perspective paper with no experiments; draft complete as of 18 July 2026, with all 74 references independently verified against Crossref/arXiv. Heading to arXiv (cs.AI) ahead of a perspective venue decision. Builds on Papers 02 and 06.

Several papers are under double-blind review, so author lists and venues may be anonymized in the submitted copies. DOIs are linked where a preprint is publicly posted; full manuscripts, data, and code for any work in progress are available on request — carson@celabe.com. ORCID: 0009-0001-7195-6742.
