We are building production-grade AI systems for capital markets, including an AI-powered investing assistant that runs on cloud-native infrastructure and integrates with regulated trading and research platforms. We are hiring a Senior AI Engineer to build, evaluate, and operate LLM-based products end to end.
This is a deeply hands-on role. You will write code, debug live systems, run evaluations, and ship to production. We are not looking for someone whose AI experience is limited to wiring up a hosted chat API — we expect you to have personally built, broken, and fixed LLM systems in production.
Experience
- 5–8 years in software engineering, with 2+ years on LLM/AI products in production.
- Strong track record of shipping AI features that real users rely on at scale.
Required Skills
1. LLM Hosting & Serving
- Hands-on experience hosting LLMs for testing, evaluation, and production inference.
- Working knowledge of inference servers and runtimes: vLLM, TGI (Text Generation Inference), TensorRT-LLM, Ollama, llama.cpp.
- Experience deploying open-weight models (Llama, Mistral, Qwen, Nemotron, GPT-OSS, DeepSeek, etc.) on GPU instances, including quantization methods and formats (GPTQ, AWQ, FP8, GGUF), continuous batching, and KV-cache management (e.g., PagedAttention).
- Experience with managed model hosting platforms: Amazon Bedrock, Amazon SageMaker, Azure OpenAI, Vertex AI, or equivalent.
- Ability to choose between hosted APIs and self-hosted inference based on cost, latency, throughput, and data-residency constraints — and to defend that choice with numbers.
2. LLM Evaluation & Testing
- Designing and running use-case-specific test suites for LLM-based applications, not just generic benchmarks.
- Building eval datasets from production traffic, edge cases, and adversarial prompts. Experience curating golden datasets and maintaining them as the product evolves.
- Hands-on experience with evaluation frameworks: Langfuse, Promptfoo, DeepEval, RAGAS, OpenAI Evals, lm-evaluation-harness, or equivalents.
- LLM-as-a-judge pipelines, including their failure modes (self-preference bias, position bias, verbosity bias) and how to mitigate them.
- Regression testing for prompts, models, and tool chains. Catching silent quality drift between model versions.
- Quantitative metrics: faithfulness, groundedness, answer relevance, tool-selection accuracy, hallucination rates, latency percentiles, token cost per query.
3. LLM Frameworks & Orchestration
- Working knowledge of LangChain, LangGraph, LlamaIndex, Haystack, or equivalent orchestration frameworks.
- Experience with agentic patterns: ReAct, ReWOO, Reflexion, Plan-and-Execute, multi-agent workflows.
- MCP (Model Context Protocol) and tool-calling: building tool schemas, handling tool-selection failures, recovering from malformed tool calls.
- Comfortable working outside the Python ecosystem: building LLM applications in Go, Java/Kotlin, TypeScript/Node.js, or custom in-house frameworks. We do not assume Python is the right answer for production services.
- Streaming responses (SSE, WebSockets), session management, and handling long-running agentic loops gracefully.
4. Retrieval & Context Engineering
- Hands-on experience with embedding models, vector databases (pgvector, OpenSearch, Pinecone, Weaviate, Milvus), and hybrid search (BM25 + dense retrieval).
- Chunking strategies, re-ranking (Cohere Rerank, cross-encoders), and query rewriting.
- Knowledge of when RAG is the wrong answer (and what to do instead).
Good-to-Have
- Fine-tuning open-weight models: instruction tuning, LoRA/QLoRA, DPO.
- RLHF or RLAIF exposure.
- Prompt distillation, model routing, and cost optimization at scale.
- Guardrails: PII redaction, jailbreak detection, output validation (Guardrails AI, NeMo Guardrails, Llama Guard).
- Experience with multimodal models (vision, audio, ASR/TTS).
- Contributions to open-source AI/ML projects.
Responsibilities
- Build and ship LLM-powered features: agentic workflows, RAG pipelines, tool-using assistants, summarization and classification services.
- Host, serve, and benchmark LLMs on both managed platforms (Amazon Bedrock, Azure OpenAI) and self-hosted runtimes (vLLM, TGI), against measurable latency, throughput, and cost targets.
- Write and maintain use-case-specific test suites; create eval datasets from real traffic; gate model and prompt changes on regression results.
- Instrument production: traces, prompts, tool calls, token usage, error taxonomy. Build dashboards that tell you when quality is degrading.
- Collaborate with product, design, and domain experts to translate fuzzy requirements into concrete prompts, tools, and evals.
- Mentor junior engineers and review code with care.
- Participate in on-call for AI services and contribute to runbooks and RCAs.
