Advanced AI Prompting for Developers: JSON Mode, Prompt Chaining, RAG, and Automated Evals


D. Rout


April 26, 2026 · 14 min read


If you've mastered zero-shot and few-shot prompting — and if not, start with our guide Zero-Shot to Chain-of-Thought: A Developer's Guide to Writing Effective AI Prompts — it's time to go further. The gap between "impressive demo" and "production-ready LLM feature" comes down to four capabilities: enforcing structured output so your code can reliably parse model responses, chaining prompts so complex tasks compose cleanly, injecting real data through retrieval-augmented generation (RAG) so models answer from facts rather than hallucinations, and running automated evaluations so you catch regressions before your users do. This tutorial walks through all four with practical, copy-paste code samples you can drop into an Express/Node.js backend today.

Prerequisites

  • Node.js 18+ and a working Express app (or any JS/TS backend)
  • An Anthropic or OpenAI API key set as an environment variable (ANTHROPIC_API_KEY or OPENAI_API_KEY)
  • Basic familiarity with async/await and fetch/axios
  • Familiarity with zero-shot, few-shot, and chain-of-thought prompting concepts (see the previous post for a refresher)
  • Optional: a vector database (Pinecone, Weaviate, or pgvector) for the RAG section

1. JSON Mode — Enforcing Structured Output

The single biggest source of LLM integration bugs is unpredictable output format. A model that sometimes returns {"status": "ok"} and other times returns "Everything looks good!" breaks your parser intermittently and is a nightmare to debug. JSON mode solves this.

Why JSON mode matters

When you're building an API endpoint that calls an LLM and returns structured data to a frontend, you need a guarantee — not a hope — that the model returns valid JSON. Both OpenAI and Anthropic provide mechanisms for this.

1a. OpenAI — response_format: { type: "json_object" }

// lib/llm/openaiJson.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface ProductReview {
  sentiment: "positive" | "negative" | "neutral";
  score: number; // 0–10
  summary: string;
  topics: string[];
}

export async function analyzeReview(reviewText: string): Promise<ProductReview> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" }, // 🔑 forces JSON output
    messages: [
      {
        role: "system",
        content: `You are a product review analyzer. Always respond with valid JSON matching this schema:
{
  "sentiment": "positive" | "negative" | "neutral",
  "score": number between 0 and 10,
  "summary": string (one sentence),
  "topics": array of strings (key topics mentioned)
}`,
      },
      {
        role: "user",
        content: `Analyze this review: "${reviewText}"`,
      },
    ],
  });

  const raw = response.choices[0].message.content ?? "{}";
  return JSON.parse(raw) as ProductReview;
}

Important: When using response_format: { type: "json_object" }, OpenAI requires that the word "json" appear somewhere in your system prompt. The example above satisfies this requirement.

1b. Anthropic Claude — XML-fenced JSON

Claude doesn't have a dedicated JSON mode flag (as of this writing), but it's remarkably reliable when you ask it to wrap output in a tag and parse from there:

// lib/llm/claudeJson.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

export async function analyzeReviewClaude(reviewText: string) {
  const message = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: `Analyze this product review and respond ONLY with a JSON object inside <json></json> tags.

Schema:
{
  "sentiment": "positive" | "negative" | "neutral",
  "score": 0-10,
  "summary": "one sentence",
  "topics": ["array", "of", "strings"]
}

Review: "${reviewText}"`,
      },
    ],
  });

  const raw = message.content[0].type === "text" ? message.content[0].text : "";
  const match = raw.match(/<json>([\s\S]*?)<\/json>/);
  if (!match) throw new Error("Model did not return JSON in expected format");
  return JSON.parse(match[1]);
}

1c. Validate with Zod

Never trust raw model output in production. Pair JSON mode with runtime schema validation:

// lib/llm/reviewSchema.ts
import { z } from "zod";

export const ReviewSchema = z.object({
  sentiment: z.enum(["positive", "negative", "neutral"]),
  score: z.number().min(0).max(10),
  summary: z.string().min(1),
  topics: z.array(z.string()),
});

// In your route:
const parsed = ReviewSchema.safeParse(rawOutput);
if (!parsed.success) {
  console.error("Schema validation failed:", parsed.error.flatten());
  throw new Error("Invalid model output structure");
}
return parsed.data;
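
If validation fails, a single retry often recovers from a transient formatting slip. Here is a minimal sketch of a wrapper tying 1a and 1c together; the analyzeReviewValidated name and the single-retry policy are illustrative assumptions, not SDK features.

// lib/llm/analyzeReviewValidated.ts
// Sketch: combine the OpenAI call (1a) with Zod validation (1c), retrying once.
import { analyzeReview } from "./openaiJson";
import { ReviewSchema } from "./reviewSchema";
import type { z } from "zod";

export async function analyzeReviewValidated(
  reviewText: string,
  retries = 1 // assumed policy: one blind retry on invalid structure
): Promise<z.infer<typeof ReviewSchema>> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const raw = await analyzeReview(reviewText);
    const parsed = ReviewSchema.safeParse(raw);
    if (parsed.success) return parsed.data;
    console.warn(`Attempt ${attempt + 1} failed validation:`, parsed.error.flatten());
  }
  throw new Error("Model output failed schema validation after retries");
}

A further refinement is to feed parsed.error.flatten() back into the next prompt so the model can correct its own structure.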

2. Prompt Chaining — Building Composable Pipelines

A single prompt can only do so much. Prompt chaining lets you break a complex task into discrete steps where the output of one prompt becomes the input to the next. This improves accuracy, makes debugging tractable, and lets you cache or short-circuit expensive steps.

The pattern

Input → [Prompt A] → Intermediate Output → [Prompt B] → [Prompt C] → Final Output

Example: Content moderation pipeline

// lib/pipelines/moderationPipeline.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function callClaude(prompt: string, maxTokens = 256): Promise<string> {
  const msg = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: maxTokens,
    messages: [{ role: "user", content: prompt }],
  });
  return msg.content[0].type === "text" ? msg.content[0].text : "";
}

// Step 1: Classify the content type
async function classifyContent(text: string): Promise<string> {
  return callClaude(`Classify the following user-submitted content into ONE of these categories:
SAFE | SPAM | HATE_SPEECH | ADULT | VIOLENCE | OFF_TOPIC

Respond with only the category label.

Content: "${text}"`);
}

// Step 2: Based on classification, decide action
async function decideAction(classification: string, text: string): Promise<string> {
  if (classification.trim() === "SAFE") return "APPROVE";

  return callClaude(`A piece of content was classified as: ${classification}

Content: "${text}"

Based on the classification, decide the appropriate action:
- APPROVE: Allow the content through
- WARN: Flag for human review
- BLOCK: Remove immediately

Respond with only: APPROVE | WARN | BLOCK`);
}

// Step 3: Generate a user-facing explanation if blocked or warned
async function generateExplanation(action: string, classification: string): Promise<string | null> {
  if (action.trim() === "APPROVE") return null;

  return callClaude(`A piece of content was ${action} due to: ${classification}

Write a brief, polite, user-facing explanation (2 sentences max) explaining why their content was not approved. 
Do not mention internal classification labels.`);
}

// The composed pipeline
export async function runModerationPipeline(userContent: string) {
  const classification = await classifyContent(userContent);
  console.log(`[Step 1] Classification: ${classification}`);

  const action = await decideAction(classification, userContent);
  console.log(`[Step 2] Action: ${action}`);

  const explanation = await generateExplanation(action, classification);
  console.log(`[Step 3] Explanation: ${explanation ?? "N/A"}`);

  return { classification: classification.trim(), action: action.trim(), explanation };
}

Parallel chaining

When steps are independent, run them concurrently:

// Run multiple analysis steps in parallel, then synthesize
const [sentiment, entities, keyPoints] = await Promise.all([
  callClaude(`Extract sentiment from: "${doc}"`),
  callClaude(`Extract named entities from: "${doc}"`),
  callClaude(`Extract 3 key points from: "${doc}"`),
]);

const summary = await callClaude(`
Given this analysis of a document:
- Sentiment: ${sentiment}
- Entities: ${entities}
- Key points: ${keyPoints}

Write a 3-sentence executive summary.`);

3. RAG — Retrieval-Augmented Generation

RAG is how you give an LLM access to your data without fine-tuning. Instead of baking knowledge into the model weights, you retrieve relevant documents at query time and inject them into the prompt as context. The model then answers from that context rather than hallucinating.

The RAG architecture

User Query
    │
    ▼
[Embed Query] ──→ Vector DB ──→ Top-K Similar Chunks
                                        │
                                        ▼
                              [Inject as Context] ──→ LLM ──→ Grounded Answer

Step 3.1 — Chunk and embed your documents

// scripts/ingestDocs.ts
import { Pinecone } from "@pinecone-database/pinecone";
import OpenAI from "openai";
import fs from "fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(" "));
  }
  return chunks;
}

async function embedText(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

export async function ingestDocument(filePath: string, docId: string) {
  const text = fs.readFileSync(filePath, "utf-8");
  const chunks = chunkText(text);
  const index = pinecone.index("your-index-name");

  for (let i = 0; i < chunks.length; i++) {
    const embedding = await embedText(chunks[i]);
    await index.upsert([
      {
        id: `${docId}-chunk-${i}`,
        values: embedding,
        metadata: { text: chunks[i], docId, chunkIndex: i },
      },
    ]);
    console.log(`Upserted chunk ${i + 1}/${chunks.length}`);
  }
}
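
Embedding one chunk per request works, but the round trips add up on large documents. The embeddings endpoint accepts an array of inputs, so a batched variant is a straightforward optimization (a sketch; the batch size of 100 is an assumption, tune it to your payload limits):

// Batched variant of the ingest loop (sketch).
async function embedBatch(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts, // the endpoint accepts an array of strings
  });
  // Embeddings come back in input order.
  return response.data.map((d) => d.embedding);
}

export async function ingestDocumentBatched(filePath: string, docId: string) {
  const text = fs.readFileSync(filePath, "utf-8");
  const chunks = chunkText(text);
  const index = pinecone.index("your-index-name");

  for (let start = 0; start < chunks.length; start += 100) {
    const batch = chunks.slice(start, start + 100);
    const embeddings = await embedBatch(batch);
    // Pinecone's upsert already takes an array, so one call covers the batch.
    await index.upsert(
      batch.map((chunk, i) => ({
        id: `${docId}-chunk-${start + i}`,
        values: embeddings[i],
        metadata: { text: chunk, docId, chunkIndex: start + i },
      }))
    );
    console.log(`Upserted chunks ${start + 1}-${start + batch.length}/${chunks.length}`);
  }
}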

Step 3.2 — Retrieve and generate

// lib/rag/ragQuery.ts
import { Pinecone } from "@pinecone-database/pinecone";
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function retrieveContext(query: string, topK = 5): Promise<string[]> {
  const queryEmbedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  const index = pinecone.index("your-index-name");
  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK,
    includeMetadata: true,
  });

  return results.matches
    .filter((m) => m.score && m.score > 0.75) // filter low-relevance chunks
    .map((m) => (m.metadata?.text as string) ?? "");
}

export async function ragAnswer(userQuery: string): Promise<string> {
  const contextChunks = await retrieveContext(userQuery);

  if (contextChunks.length === 0) {
    return "I don't have enough relevant information to answer that confidently.";
  }

  const contextBlock = contextChunks
    .map((chunk, i) => `[Source ${i + 1}]\n${chunk}`)
    .join("\n\n");

  const message = await anthropic.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: `You are a helpful assistant. Answer the user's question using ONLY the provided context. 
If the context doesn't contain the answer, say so — do not guess or use outside knowledge.

<context>
${contextBlock}
</context>

Question: ${userQuery}`,
      },
    ],
  });

  return message.content[0].type === "text" ? message.content[0].text : "";
}

Step 3.3 — Wire it into an Express route

// routes/chat.ts
import { Router } from "express";
import { ragAnswer } from "../lib/rag/ragQuery";

const router = Router();

router.post("/chat", async (req, res) => {
  const { query } = req.body;
  if (!query || typeof query !== "string") {
    return res.status(400).json({ error: "query is required" });
  }

  try {
    const answer = await ragAnswer(query);
    res.json({ answer });
  } catch (err) {
    console.error("RAG error:", err);
    res.status(500).json({ error: "Failed to generate answer" });
  }
});

export default router;

4. Automated Evals — Testing Your Prompts Like Code

Prompts change. Models get updated. What worked last week may regress today. Automated evals are your test suite for LLM behaviour — a set of input/expected-output pairs you run before every deploy to catch regressions.

The eval framework pattern

// evals/runner.ts
interface EvalCase {
  id: string;
  input: string;
  expectedOutput?: string; // for exact or fuzzy match
  expectedSchema?: object; // for JSON output evals
  evaluator: "exact" | "contains" | "schema" | "llm-judge";
}

interface EvalResult {
  id: string;
  passed: boolean;
  actualOutput: string;
  score?: number;
  reason?: string;
}

Step 4.1 — Exact and contains evaluators

// evals/evaluators.ts
import { ZodSchema } from "zod";

export function exactMatch(actual: string, expected: string): boolean {
  return actual.trim().toLowerCase() === expected.trim().toLowerCase();
}

export function containsMatch(actual: string, expected: string): boolean {
  return actual.toLowerCase().includes(expected.toLowerCase());
}

export function schemaMatch(actual: string, schema: ZodSchema): boolean {
  try {
    // Parse the raw string, then validate its structure against the Zod schema.
    return schema.safeParse(JSON.parse(actual)).success;
  } catch {
    return false; // the output was not valid JSON at all
  }
}

Step 4.2 — LLM-as-judge evaluator

For tasks where correctness is nuanced (summaries, tone, factual accuracy), use a second LLM call to grade the output:

// evals/llmJudge.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

export async function llmJudge(
  input: string,
  actualOutput: string,
  gradingCriteria: string
): Promise<{ score: number; reason: string }> {
  const response = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `You are an AI output evaluator. Grade the following response based on the criteria.

Input: "${input}"
Response to grade: "${actualOutput}"
Grading criteria: ${gradingCriteria}

Respond with JSON only:
{
  "score": number from 0 to 10,
  "reason": "one sentence explanation"
}`,
      },
    ],
  });

  const raw = response.content[0].type === "text" ? response.content[0].text : "{}";
  return JSON.parse(raw);
}

Step 4.3 — Run the full eval suite

// evals/suite.ts
import { analyzeReview } from "../lib/llm/openaiJson";
import { ragAnswer } from "../lib/rag/ragQuery";
import { llmJudge } from "./llmJudge";
import { containsMatch } from "./evaluators";

const evalCases = [
  {
    id: "review-sentiment-positive",
    fn: () => analyzeReview("This product exceeded my expectations. Absolutely love it!"),
    check: (result: any) => result.sentiment === "positive" && result.score >= 7,
  },
  {
    id: "review-sentiment-negative",
    fn: () => analyzeReview("Broke after two days. Complete waste of money."),
    check: (result: any) => result.sentiment === "negative" && result.score <= 3,
  },
  {
    id: "rag-known-fact",
    fn: () => ragAnswer("What is our refund policy?"),
    check: async (result: string) => {
      const { score } = await llmJudge(
        "What is our refund policy?",
        result,
        "The answer should mention a refund policy and not make up details. Score 8+ if grounded, 3 or below if hallucinated."
      );
      return score >= 7;
    },
  },
];

async function runEvals() {
  let passed = 0;
  const results: { id: string; passed: boolean }[] = [];

  for (const ec of evalCases) {
    try {
      const output = await ec.fn();
      const ok = await ec.check(output);
      results.push({ id: ec.id, passed: ok });
      if (ok) passed++;
      console.log(`${ok ? "✅" : "❌"} ${ec.id}`);
    } catch (err) {
      results.push({ id: ec.id, passed: false });
      console.log(`❌ ${ec.id} — threw error: ${err}`);
    }
  }

  console.log(`\nResults: ${passed}/${evalCases.length} passed`);
  if (passed < evalCases.length) process.exit(1); // fail CI
}

runEvals();

Add this to your CI pipeline:

# .github/workflows/llm-evals.yml
name: LLM Eval Suite
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npx ts-node evals/suite.ts
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}

Reference: Technique Comparison

| Technique | Best For | Latency Impact | Cost Impact | Key Risk |
| --- | --- | --- | --- | --- |
| JSON Mode | Structured data extraction, API integrations | Minimal | Minimal | Schema drift — validate with Zod |
| Prompt Chaining (sequential) | Multi-step reasoning, conditional logic | Additive (N × latency) | Additive (N × tokens) | Error propagation between steps |
| Prompt Chaining (parallel) | Independent sub-tasks | Same as slowest step | Additive | Race conditions, context isolation |
| RAG | Knowledge-grounded Q&A, large document sets | +100–300ms for retrieval | +tokens for context | Retrieval quality, chunk size tuning |
| LLM-as-judge evals | Nuanced quality grading | N/A (offline) | Per eval run | Judge bias, inconsistent grading |
| Automated evals in CI | Regression prevention | N/A (CI gate) | Per CI run | Flaky tests, high API cost at scale |

What's Next

  1. Structured output with tool/function calling — Both OpenAI and Anthropic support function calling as an alternative to JSON mode, with typed parameter schemas that are even stricter than a prompt-defined schema. Worth exploring for complex structured extraction tasks; see the first sketch after this list.

  2. Semantic caching for RAG — Store embeddings of recent queries and return cached answers when a new query is semantically similar (cosine similarity > 0.95). This dramatically cuts latency and cost on repeated queries; see the second sketch after this list.

  3. Multi-agent orchestration — Take prompt chaining further with agent frameworks (LangGraph, CrewAI, or a custom orchestrator) where each "step" is an autonomous agent with its own tool access and memory.

  4. Eval datasets and regression tracking — Move from one-off eval scripts to a structured eval dataset (stored in your DB) with historical pass/fail tracking, so you can visualise prompt quality over time as models and prompts evolve.
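
For item 1, here is a minimal sketch of structured extraction via Anthropic's tool-use API. The record_review_analysis tool name and its schema are assumptions for illustration; check the SDK docs for the current shape of the tools parameter.

// lib/llm/extractViaTool.ts (sketch)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

export async function extractReviewViaTool(reviewText: string) {
  const message = await client.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 512,
    tools: [
      {
        name: "record_review_analysis", // hypothetical tool name
        description: "Record the structured analysis of a product review.",
        input_schema: {
          type: "object",
          properties: {
            sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
            score: { type: "number" },
            summary: { type: "string" },
            topics: { type: "array", items: { type: "string" } },
          },
          required: ["sentiment", "score", "summary", "topics"],
        },
      },
    ],
    // Force the model to "call" the tool, i.e. emit arguments matching the schema.
    tool_choice: { type: "tool", name: "record_review_analysis" },
    messages: [{ role: "user", content: `Analyze this review: "${reviewText}"` }],
  });

  const toolUse = message.content.find((block) => block.type === "tool_use");
  if (!toolUse || toolUse.type !== "tool_use") throw new Error("No tool call returned");
  return toolUse.input; // arrives as an already-parsed object, no regex needed
}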
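
And for item 2, a sketch of semantic caching with a plain in-memory store standing in for whatever you'd use in production; the 0.95 threshold matches the suggestion above, and the cache shape is an assumption.

// lib/rag/semanticCache.ts (sketch)
// Assumes embedText from scripts/ingestDocs.ts has been exported for reuse.
import { ragAnswer } from "./ragQuery";
import { embedText } from "../../scripts/ingestDocs";

interface CacheEntry {
  embedding: number[];
  answer: string;
}

// Unbounded in-memory store for brevity; use an LRU or Redis in production.
const cache: CacheEntry[] = [];

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function cachedRagAnswer(query: string): Promise<string> {
  const embedding = await embedText(query);
  const hit = cache.find((e) => cosineSimilarity(e.embedding, embedding) > 0.95);
  if (hit) return hit.answer; // a semantically similar query was answered before

  const answer = await ragAnswer(query);
  cache.push({ embedding, answer });
  return answer;
}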

