1. Scalability bottlenecks
Symptoms
- Response times spike under predictable load (Monday 9am, payroll day, sale launch)
- One slow query takes the database — and the rest of the app — down
- Adding more EC2 instances doesn't help, or costs more than it saves
- Background jobs and user requests share the same connection pool
Root causes
- The system was designed for a user base it never reached, or for one it has long since outgrown
- 3–5 hot endpoints carry 80%+ of load, but everything is scaled together
- Caching, queueing, and read replicas added late and inconsistently
Short-term fix (this week)
- Identify the top 5 endpoints by latency × volume (first sketch below); optimise queries on those only
- Add Redis (or an in-process cache) for the 3 most-read entities (cache-aside sketch below)
- Move expensive reports to a background queue with cached results (queue sketch below)
- Add CloudFront in front of any GET-heavy public API
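A minimal sketch of the "latency × volume" ranking, assuming an access log with one `METHOD path status duration_ms` entry per line; the file name `access.log` and the field layout are placeholders to adapt to your own logging format:

```python
from collections import defaultdict

def top_endpoints(log_lines, n=5):
    """Rank endpoints by total time spent serving them (mean latency x volume)."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for line in log_lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip malformed lines
        method, path, _status, duration_ms = parts[:4]
        try:
            ms = float(duration_ms)
        except ValueError:
            continue
        key = f"{method} {path}"
        totals[key]["count"] += 1
        totals[key]["total_ms"] += ms
    # Endpoints that consume the most total server time rise to the top,
    # whether they are slow-and-rare or fast-and-constant.
    return sorted(totals.items(), key=lambda kv: kv[1]["total_ms"], reverse=True)[:n]

if __name__ == "__main__":
    with open("access.log") as f:
        for endpoint, stats in top_endpoints(f):
            print(f"{endpoint}: {stats['count']} reqs, "
                  f"{stats['total_ms'] / stats['count']:.0f} ms avg")
```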
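A cache-aside sketch for one hot entity using redis-py; `load_product_from_db`, the key format, and the 5-minute TTL are illustrative stand-ins for whatever your existing read path looks like:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load_product_from_db(product_id: int) -> dict:
    # Placeholder: replace with your existing DB read
    return {"id": product_id}

def get_product(product_id: int, ttl_seconds: int = 300) -> dict:
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no DB round trip
    product = load_product_from_db(product_id)
    r.set(key, json.dumps(product), ex=ttl_seconds)  # expire so stale data ages out
    return product
```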
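One way to push report generation off the request path, sketched with Celery plus a Redis result cache; the broker URL, `build_report`, and the polling response are assumptions, not prescriptions:

```python
import json
import redis
from celery import Celery

app = Celery("reports", broker="redis://localhost:6379/0")
cache = redis.Redis(host="localhost", port=6379)

@app.task
def generate_report(report_id: int) -> None:
    data = build_report(report_id)          # your existing (slow) report code
    cache.set(f"report:{report_id}", json.dumps(data), ex=3600)

def report_view(report_id: int) -> dict:
    cached = cache.get(f"report:{report_id}")
    if cached is not None:
        return json.loads(cached)           # serve the last rendered report
    generate_report.delay(report_id)        # rebuild in the background
    return {"status": "generating"}         # respond immediately; poll or notify later

def build_report(report_id: int) -> dict:
    raise NotImplementedError("replace with the existing report query")
```

The request never waits on the report; at worst a user sees "generating" once, and every later view hits the cached result.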
Long-term solution
- Profile real usage; right-size to actual users, not hypothetical 10×
- Identify which 3–5 services need horizontal scale; keep the rest as a tight monolith
- Read replica + connection-pool segregation (jobs vs requests; sketched below)
- Auto-scaling with explicit ceilings to prevent runaway costs (sketched below)
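A sketch of connection-pool segregation with SQLAlchemy: one engine (and pool) for interactive requests on the primary, and a separately sized engine for background jobs pointed at the read replica. The URLs and pool sizes are placeholders to tune against your actual workload:

```python
from sqlalchemy import create_engine

# Primary, sized for interactive request traffic
request_engine = create_engine(
    "postgresql://app@db-primary/app",
    pool_size=20,
    max_overflow=5,
)

# Read replica, sized separately so a burst of jobs can't starve user requests
job_engine = create_engine(
    "postgresql://app@db-replica/app",
    pool_size=10,
    max_overflow=0,
)
```

The point is isolation: a flood of jobs can only exhaust its own pool, so user requests keep their connections.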
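And a sketch of capping scale-out with boto3, assuming an existing Auto Scaling group; the group name, sizes, and CPU target are illustrative:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# MaxSize is the explicit ceiling: the group scales between 2 and 12 instances,
# never beyond, regardless of what the scaling policy asks for.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-api",
    MinSize=2,
    MaxSize=12,
    DesiredCapacity=4,
)

# Target-tracking policy: add instances when average CPU exceeds ~60%
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-api",
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```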
→ See: Inara — 40 microservices collapsed into 1 modular monolith