Problems I Solve

Six recurring problems organisations hire me to fix.

Each one has the same shape: symptoms you'll recognise, root causes the team usually hasn't named, a short-term fix that buys time without making it worse, and a long-term architecture solution.

1. Scalability bottlenecks

Symptoms

  • Response times spike under predictable load (Monday 9am, payroll day, sale launch)
  • One slow query takes the database — and the rest of the app — down
  • Adding more EC2 instances doesn't help, or costs more than it saves
  • Background jobs and user requests share the same connection pool

Root causes

  • The system was designed for a user base it never reached, or one it has long since outgrown
  • 3–5 hot endpoints carry 80%+ of load, but everything is scaled together
  • Caching, queueing, and read replicas added late and inconsistently

Short-term fix (this week)

  • Identify the top 5 endpoints by latency × volume; optimise queries on those only
  • Add Redis (or an in-process cache) for the 3 most-read entities (see the sketch after this list)
  • Move expensive reports to background queue with cached results
  • Add CloudFront in front of any GET-heavy public API
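
A minimal read-through cache for the Redis bullet above: a sketch assuming redis-py, where fetch_account stands in for your existing database query and the key shape and TTL are illustrative.

```python
import json

import redis  # redis-py

r = redis.Redis()

def get_account(account_id: str) -> dict:
    """Read-through cache: serve from Redis, fall back to the database."""
    key = f"account:{account_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    account = fetch_account(account_id)  # hypothetical: your existing DB query
    r.set(key, json.dumps(account), ex=300)  # 5-minute TTL; tune per entity
    return account
```

The same pattern works with an in-process dict and a TTL if you'd rather not add infrastructure this week.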

Long-term solution

  • Profile real usage; right-size to actual users, not hypothetical 10×
  • Identify which 3–5 services need horizontal scale; keep the rest as a tight monolith
  • Read replica + connection-pool segregation (jobs vs requests)
  • Auto-scaling with explicit ceilings to prevent runaway costs
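
To make the last point concrete, a sketch of an explicit scaling ceiling on an ECS service via boto3; the cluster and service names are hypothetical.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/prod-cluster/api",  # hypothetical cluster/service
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=12,  # the ceiling: scale-out stops here regardless of load
)
```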

→ See: Inara — 40 microservices collapsed into 1 modular monolith


2. Legacy system mess

Symptoms

  • Nobody fully understands the business logic — it's spread across SQL, stored procs, and one developer's head
  • Every release breaks something seemingly unrelated
  • Hiring is hard because the stack is old (VB.NET, Java 6, classic ASP, COBOL)
  • A "small change" takes weeks because of unknown dependencies

Root causes

  • 15+ years of accumulated decisions, with no documentation and no clear owners
  • Database schema not in version control; every release adds undocumented changes
  • Fear of touching anything → frozen architecture → mounting tech debt

Short-term fix

  • AI-assisted documentation: feed legacy code through an LLM; produce one-page-per-module docs in days
  • Schema diff tooling so every release auto-records DB changes
  • Strangler façade in front of the legacy: new endpoints route to new code, old endpoints unchanged
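
A minimal façade sketch, assuming Flask and requests; the hosts and the migrated prefix are placeholders for your own services.

```python
from flask import Flask, Response, request
import requests

app = Flask(__name__)

LEGACY = "http://legacy.internal:8080"  # assumption: old system's address
MODERN = "http://modern.internal:8081"  # assumption: new code's address
MIGRATED = ("/api/v2/orders",)          # prefixes already rebuilt in new code

@app.route("/<path:path>", methods=["GET", "POST", "PUT", "PATCH", "DELETE"])
def proxy(path):
    # Route migrated prefixes to the new code; everything else stays legacy
    target = MODERN if request.path.startswith(MIGRATED) else LEGACY
    upstream = requests.request(
        request.method,
        f"{target}{request.full_path}",
        headers={k: v for k, v in request.headers if k.lower() != "host"},
        data=request.get_data(),
        timeout=30,
    )
    # Upstream headers are dropped for brevity; forward the ones you need
    return Response(upstream.content, upstream.status_code)
```

Each migrated module adds a prefix; when everything routes to the new code, the legacy system can be switched off.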

Long-term solution

  • Module-by-module migration behind feature flags; both systems run in parallel until parity
  • One team owns the migration playbook; another keeps the old system running
  • End-state is rarely a full rewrite — usually a modular monolith with clean boundaries

→ See: Jobscope — VB.NET → .NET 8 with AI-assisted code migration


3. Cloud cost explosion

Symptoms

  • AWS bill grows faster than user count or revenue
  • Finance asks why and engineering can't explain
  • The most-used services aren't the most expensive ones
  • "It must be the new feature" — but the spend started climbing months ago

Root causes

  • Microservices added without cost analysis — each service has fixed pod overhead
  • Managed services bought for convenience, not measured usage (Auth0, Elasticsearch cluster, Redis cluster, Kafka)
  • No lifecycle policies on S3, no retention on logs, no auto-shutdown on dev environments
  • Idle pods running 24/7 because nobody removed them after a feature was deprecated

Short-term fix

  • AWS Cost Explorer audit by service and tag
  • Kill orphan resources: unattached EBS, idle ELBs, unused NAT gateways
  • S3 lifecycle rules (Glacier after 90 days; delete logs after 30; sketch after this list)
  • Reserved Instances or Savings Plans on everything with stable usage
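
A sketch of those lifecycle rules via boto3; the bucket name and log prefix are illustrative.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="acme-prod-data",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-after-90d",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # whole bucket
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-logs-after-30d",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Expiration": {"Days": 30},
            },
        ]
    },
)
```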

Long-term solution

  • Collapse over-fragmented microservices that don't have independent scale needs
  • Replace heavy managed components with in-process equivalents where the trade-off makes sense
  • Cost-per-user and cost-per-request as engineering metrics, reviewed quarterly
  • Auto-scale max ceilings to prevent runaway costs

→ See: Inara — $3.5–5.5K/mo to $340–545/mo (85–90% reduction)


4. Security & compliance gaps

Symptoms

  • Auth code spread across 4 services; nobody can answer "what does our session model look like"
  • Cookies bloat as more permissions are added → request size errors under load
  • HIPAA / SOC 2 / GDPR audit highlights gaps you can't deny
  • Audit logs exist, but in different places and different formats

Root causes

  • Auth bolted onto auth — OAuth2, then JWT, then cookies, then OIDC, none retired
  • Multi-tenancy enforced in application code only (one missed query = data leak)
  • PII / PHI in plaintext columns because encryption was "for later"

Short-term fix

  • Document the auth flow as it really is, not as it's drawn
  • Move oversized cookies to JWT + server-side permission lookup (sketch after this list)
  • Centralise audit logging via one library
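
A sketch of the slim-token approach, assuming PyJWT; load_permissions is a hypothetical cached lookup against your permission store.

```python
import jwt  # PyJWT

SECRET = "change-me"  # assumption: loaded from a secret manager in practice

def issue_token(user_id: str, tenant_id: str) -> str:
    # Identity only: permissions stay server-side, so the token never bloats
    return jwt.encode({"sub": user_id, "tid": tenant_id}, SECRET, algorithm="HS256")

def authorise(token: str, action: str) -> bool:
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    permissions = load_permissions(claims["sub"], claims["tid"])  # hypothetical
    return action in permissions
```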

Long-term solution

  • PostgreSQL row-level security for tenant isolation (defence in depth; sketch after this list)
  • Field-level encryption via AWS KMS for PII / PHI
  • SAML / SCIM for enterprise SSO; ROPC for service-to-service
  • External penetration test before claiming SOC 2 / HIPAA readiness
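
A sketch of the row-level-security policy, assuming psycopg 3; the patients table, DSN, and app.tenant_id setting are illustrative.

```python
import psycopg  # psycopg 3

with psycopg.connect("postgresql://localhost/appdb") as conn:  # hypothetical DSN
    conn.execute("ALTER TABLE patients ENABLE ROW LEVEL SECURITY")
    conn.execute(
        """
        CREATE POLICY tenant_isolation ON patients
        USING (tenant_id = current_setting('app.tenant_id')::uuid)
        """
    )

def scope_to_tenant(conn, tenant_id: str) -> None:
    # is_local=true scopes the setting to the current transaction only
    conn.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
```

Even if application code misses a WHERE clause, the database refuses to return another tenant's rows.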

→ See: Jobscope — cookie-bloat fixed via JWT + RBAC migration


5. Integration chaos

Symptoms

  • Every new partner integration is a 4–8 week project
  • Partners share user cookies because there's no service-to-service auth
  • Webhook deliveries fail silently; no replay mechanism
  • Different partners need different data shapes — your codebase has 5 versions of "send to partner"

Root causes

  • No clean integration boundary: APIs designed for the UI, not for partners
  • Authentication wasn't split into service-to-service and human flows
  • Outbound webhooks added ad-hoc as features needed them

Short-term fix

  • One ROPC token endpoint for all service-to-service callers; no more cookie-sharing
  • Webhook signature verification (HMAC) on every outbound delivery (sketch after this list)
  • Idempotency keys on POST endpoints partners hit
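
A sketch of the outbound signing side, standard library only; the secret and header format are illustrative (timestamp plus digest, so receivers can reject replays).

```python
import hashlib
import hmac
import json
import time

SIGNING_SECRET = b"per-partner-secret"  # assumption: one secret per partner

def signed_headers(payload: dict) -> dict:
    body = json.dumps(payload, separators=(",", ":")).encode()
    ts = str(int(time.time()))
    # Sign timestamp + body so replayed deliveries can be detected
    digest = hmac.new(SIGNING_SECRET, ts.encode() + b"." + body, hashlib.sha256)
    return {
        "Content-Type": "application/json",
        "X-Signature": f"t={ts},v1={digest.hexdigest()}",
    }
```

The partner recomputes the digest over the same bytes and compares with hmac.compare_digest; anything else is rejected.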

Long-term solution

  • API gateway with rate-limiting, key management, audit logs per partner
  • Stable v1 partner API distinct from internal API; versioned with sunset dates
  • Partner-onboarding playbook: docs + sandbox + sample integrations → onboard in days, not weeks

6. AI adoption confusion

Symptoms

  • Pressure from leadership to "add AI" but no clear use case
  • Demos work great; production usage hallucinates or ruins the margin
  • Multiple LLM API keys, no cost ceiling, no eval framework
  • Engineering team uncertain whether to build with AI or just talk about it

Root causes

  • AI applied where determinism matters (billing, regulated decisions) — wrong
  • No grounding (no RAG, no structured tool calls) — model invents
  • No human-in-the-loop on high-value actions — runaway risk

Short-term fix

  • Pick 1–2 places where AI clearly pays off: lead qualification, code analysis, content drafting, support deflection, document summarisation
  • Pair every LLM call with a deterministic rule layer for high-stakes outputs (sketch after this list)
  • Cost ceiling per request and per user; alert on spikes
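
A minimal sketch of the rule layer and the ceiling together; llm_suggest is a stand-in for whichever provider call you use, and the bounds are illustrative business rules.

```python
MAX_DISCOUNT = 0.15  # business rule the model is never allowed to exceed

def propose_discount(llm_suggest, context: dict) -> float:
    """The model proposes; deterministic rules decide."""
    suggestion = float(llm_suggest(context))  # hypothetical LLM wrapper
    return min(max(suggestion, 0.0), MAX_DISCOUNT)

def within_budget(spent_today: float, daily_ceiling: float = 50.0) -> bool:
    # Check before every call; alert and degrade gracefully past the cap
    return spent_today < daily_ceiling
```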

Long-term solution

  • Multi-provider orchestration (OpenAI / Anthropic / Gemini) with fallback and cost/latency routing (sketch after this list)
  • RAG pipeline grounded in your actual data, not generic web
  • Evaluation harness so you can A/B-test prompts and measure regressions objectively
  • Agentic patterns where appropriate, with bounded tools and observability over the reasoning path
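
A minimal fallback sketch; each provider is a thin callable wrapper around its own SDK (the wrappers themselves are assumed, not shown).

```python
from typing import Callable, Sequence

def complete(prompt: str, providers: Sequence[Callable[[str], str]]) -> str:
    """Try providers in order; fall through on rate limits, timeouts, outages."""
    last_error: Exception | None = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error

# Usage, with your own hypothetical wrappers:
# complete(prompt, [call_openai, call_anthropic, call_gemini])
```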

→ See: AI Sales & Logistics Pipeline — 9 specialised agents with deterministic guardrails

Recognise more than one of these in your system?

A 1-week Architecture Review covers all six. You walk away knowing exactly what to fix first and what to leave alone.

Book an Architecture Review