We are building production-grade AI systems for capital markets, including an AI-powered investing assistant that runs on cloud-native infrastructure and integrates with regulated trading and research platforms. We are hiring a Senior AI Engineer to build, evaluate, and operate LLM-based products end to end.

This is a deeply hands-on role. You will write code, debug live systems, run evaluations, and ship to production. We are not looking for someone whose AI experience is limited to wiring up a hosted chat API — we expect you to have personally built, broken, and fixed LLM systems in production.

Experience

5–8 years in software engineering, with 2+ years on LLM/AI products in production.
Strong track record of shipping AI features used by real users at scale.

Required Skills

1. LLM Hosting & Serving

Hands-on experience hosting LLMs for testing, evaluation, and production inference.
Working knowledge of inference servers and runtimes: vLLM, TGI (Text Generation Inference), TensorRT-LLM, Ollama, llama.cpp.
Experience deploying open-weight models (Llama, Mistral, Qwen, Nemotron, gpt-oss, DeepSeek, etc.) on GPU instances, including knowledge of quantization (GPTQ, AWQ, GGUF, FP8), continuous batching, and KV-cache management (PagedAttention).
Experience with managed model hosting platforms: Amazon Bedrock, Amazon SageMaker, Azure OpenAI, Vertex AI, or equivalent.
Ability to choose between hosted APIs and self-hosted inference based on cost, latency, throughput, and data-residency constraints — and to defend that choice with numbers.

2. LLM Evaluation & Testing

Designing and running use-case-specific test suites for LLM-based applications, not just generic benchmarks.
Building eval datasets from production traffic, edge cases, and adversarial prompts. Experience curating golden datasets and maintaining them as the product evolves.
Hands-on with evaluation frameworks: Langfuse, Promptfoo, DeepEval, RAGAS, OpenAI Evals, lm-evaluation-harness, or equivalents.
LLM-as-a-judge pipelines — including knowing the failure modes (judge bias, position bias, verbosity bias) and how to mitigate them.
Regression testing for prompts, models, and tool chains. Catching silent quality drift between model versions.
Quantitative metrics: faithfulness, groundedness, answer relevance, tool-selection accuracy, hallucination rates, latency percentiles, token cost per query.

3. LLM Frameworks & Orchestration

Working knowledge of LangChain, LangGraph, LlamaIndex, Haystack, or equivalent orchestration frameworks.
Experience with agentic patterns: ReAct, ReWOO, Reflexion, Plan-and-Execute, multi-agent workflows.
MCP (Model Context Protocol) and tool-calling: building tool schemas, handling tool-selection failures, recovering from malformed tool calls.
Comfortable working outside Python ecosystems — building LLM applications in Go, Java/Kotlin, TypeScript/Node, or custom in-house frameworks. We do not assume Python is the right answer for production services.
Streaming responses (SSE, WebSockets), session management, and handling long-running agentic loops gracefully.

4. Retrieval & Context Engineering

Hands-on with embedding models, vector databases (pgvector, OpenSearch, Pinecone, Weaviate, Milvus), and hybrid search (BM25 + dense).
Chunking strategies, re-ranking (Cohere Rerank, cross-encoders), and query rewriting.
Knowledge of when RAG is the wrong answer (and what to do instead).

Good-to-Have Skills

Fine-tuning / instruction-tuning / LoRA / QLoRA / DPO on open-weight models.
RLHF or RLAIF exposure.
Prompt distillation, model routing, and cost optimization at scale.
Guardrails: PII redaction, jailbreak detection, output validation (Guardrails AI, NeMo Guardrails, Llama Guard).
Experience with multimodal models (vision, audio, ASR/TTS).
Contributions to open-source AI/ML projects.

Responsibilities

Build and ship LLM-powered features: agentic workflows, RAG pipelines, tool-using assistants, summarization and classification services.
Host, serve, and benchmark LLMs — both hosted (Bedrock, Azure OpenAI) and self-hosted (vLLM, TGI) — with measurable latency, throughput, and cost targets.
Write and maintain use-case-specific test suites; create eval datasets from real traffic; gate model and prompt changes on regression results.
Instrument production: traces, prompts, tool calls, token usage, error taxonomy. Build dashboards that tell you when quality is degrading.
Collaborate with product, design, and domain experts to translate fuzzy requirements into concrete prompts, tools, and evals.
Mentor junior engineers and review code with care.
Participate in on-call for AI services and contribute to runbooks and RCAs.