#AI #LLMs #Compliance #Data

Who Trained Your LLM? The Answer Matters More Than You Think

Cavan Page

Most people using AI tools day-to-day have no idea what their model was trained on. That’s not an accident. It’s a deliberate choice by the companies building them - and it has consequences you should understand before you trust an LLM with anything that matters.

The Training Data Problem

Every large language model learns from data. The bigger the model, the more data it needs. The training datasets behind state-of-the-art models now run to trillions of tokens - most of them scraped from the public internet.

The internet is not a curated library. It contains misinformation, outdated content, contested claims, low-quality sources and outright fabrications. When a model trains on all of it indiscriminately, it learns those patterns too. This is why LLMs hallucinate with such confidence - the model has no internal sense of what is authoritative versus what is noise.

The 2026 International AI Safety Report, authored by over 100 independent researchers across 30 countries, put it plainly: current systems “continue to generate text that includes false statements” and their “performance also tends to decline when prompted in languages other than English, which are less represented in training datasets.” Garbage in, garbage out - just with a more fluent delivery.

What Companies Actually Disclose

If you pull a model from Hugging Face, you will usually find a model card. It describes the architecture, the intended use case, known limitations and - sometimes - the datasets used during training. It’s imperfect, but it exists. The open-source community has at least established a norm of documentation.
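That documentation norm is machine-readable: Hugging Face model cards open with a YAML front-matter block whose `datasets:` field lists declared training sets. As a rough illustration (the sample card and the parser below are invented for this post, not part of any library), you can check whether a card declares its data at all:

```python
# Minimal sketch: pull declared training datasets out of a model card's
# YAML front matter. Real cards follow this front-matter convention, but
# SAMPLE_CARD itself is a made-up example.

SAMPLE_CARD = """---
license: apache-2.0
datasets:
- c4
- wikipedia
language:
- en
---
# Model description
...
"""

def declared_datasets(card_text: str) -> list[str]:
    """Return dataset names listed in the card's YAML front matter."""
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter, nothing declared
    datasets, in_datasets = [], False
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front matter
        if line.startswith("datasets:"):
            in_datasets = True
            continue
        if in_datasets:
            if line.startswith("- "):
                datasets.append(line[2:].strip())
            else:
                in_datasets = False  # next top-level field
    return datasets

print(declared_datasets(SAMPLE_CARD))  # ['c4', 'wikipedia']
```

An empty result is itself a signal: a card with no `datasets:` field tells you nothing about what the model learned from.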

The frontier labs? Different story. OpenAI, Anthropic, and Google publish almost nothing about the specifics of what went into their training runs. The 2026 International AI Safety Report notes that “companies usually limit the information they share about the datasets used to train general-purpose AI models, including how that data is acquired and processed” and that researchers have identified “limited and inconsistent transparency around training data, general-purpose AI models, evaluations, development pipelines, and safety.”

The stated reasons are: trade secrets, competitive advantage, model security. Those are real concerns. But the report is equally clear that “nondisclosure can also conceal problematic practices, including the use of copyrighted or unlicensed data for training.”

There is no industry-wide standard requiring disclosure. No audit body. No certification. When you connect to a hosted model via API - Claude, GPT-4, Gemini - you are trusting a black box.

Domain-Specific Models Are a Different Animal

Not all models are built the same way. General-purpose frontier models are trained wide and shallow on everything. But a growing class of domain-specific models is trained narrow and deep on verified, curated sources.

A model trained on PubMed, peer-reviewed clinical trials and FDA filings behaves very differently from a model trained on Reddit and Wikipedia. The same is true for models built for legal research, defense applications or financial analysis. When the training data is controlled, auditable and domain-specific, the outputs are more reliable - and the failure modes are better understood.

This distinction matters when you’re deciding what to use AI for. Using a general-purpose LLM to answer a medical question is categorically different from using a model specifically trained and validated for clinical decision support. The first is a shortcut. The second is a tool.

If your use case has real stakes - compliance, legal, medical, financial - ask what the model was trained on before you ship it. If you can’t get an answer, that’s your answer.

The IP Battle Playing Out in Court

The training data opacity issue isn’t just an abstract transparency problem. It’s actively being litigated.

The New York Times sued OpenAI for training on millions of its articles without a license. Getty Images sued Stability AI for using its photo library without permission or compensation. A wave of class-action suits from authors and artists has followed. These cases are still working through the courts, but they are establishing that training on copyrighted material without consent is at minimum a contested legal act - and potentially an expensive one.

On the technical side, creators have limited tools. The robots.txt standard was designed to tell search crawlers which pages not to crawl - compliance is voluntary, and it has no legal teeth against AI training scrapers. The European Union’s AI Act includes text-and-data mining opt-out rights, which give rights holders a formal mechanism to exclude their content from training datasets. But enforcement requires tracking, and tracking requires disclosure that most companies don’t provide.
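In practice, the opt-out most publishers reach for today is a robots.txt block aimed at AI-training user agents. A sketch of what that looks like, checked with Python’s standard-library parser (the crawler names - GPTBot, ClaudeBot, Google-Extended - are real published user agents, but the file and URL here are invented, and nothing forces a crawler to obey it):

```python
import urllib.robotparser

# Hypothetical robots.txt: opt out of common AI-training crawlers
# while leaving ordinary search indexing alone. Honoring these rules
# is entirely voluntary on the crawler's part.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# AI-training crawlers are told to stay out; search crawlers are not.
print(rp.can_fetch("GPTBot", "https://example.com/articles/"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/articles/"))  # True
```

This is exactly the gap the post describes: the file expresses the publisher’s intent precisely, but there is no mechanism to verify whether any given training run respected it.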

This is the core problem: you cannot protect what you cannot see being used.

Compliance Is Moving Faster Than Most Realize

The regulatory environment is shifting. The EU AI Act is the furthest along - it includes transparency requirements for general-purpose AI models above a certain capability threshold, including documentation of training data and compliance with copyright law. It came into force in 2024 and obligations are phasing in through 2027.

In the US, progress has been slower. Most AI safety initiatives remain voluntary. The 2026 International AI Safety Report notes that “in 2025, 12 companies published or updated Frontier AI Safety Frameworks - documents that describe how they plan to manage risks as they build more capable models. Most risk management initiatives remain voluntary, but a few jurisdictions are beginning to formalise some practices as legal requirements.”

The most interesting recent development is a memorandum of understanding signed on March 31, 2026 between Anthropic and the Australian government. Dario Amodei and Australian Prime Minister Anthony Albanese formalized a collaboration that includes joint AI safety evaluations, sharing of data on AI adoption across sectors like healthcare, agriculture and financial services, and coordination with Australia’s AI Safety Institute. It mirrors arrangements Anthropic already has with the US, UK and Japanese safety institutes.

This is notable because it’s the first government-level deal that explicitly ties AI safety to economic data tracking - not just model behavior. Governments are starting to treat training data transparency and adoption monitoring as a national interest, not just a consumer protection question. That shift will eventually produce teeth.

What You Should Actually Do

If you’re building products with AI or advising clients who are, here’s what this means in practice.

Know what model you’re using and what class it belongs to. A fine-tuned open-weight model with a documented dataset is not the same risk profile as a frontier proprietary model. Treat them differently.

Match the model to the stakes. General-purpose LLMs are good for drafting, summarizing and exploring. They are not appropriate for generating medical advice, legal conclusions or compliance determinations without a human in the loop who understands the model’s limitations.

Watch the compliance landscape. If you operate in the EU, the AI Act applies to you now. If you handle data in sectors like healthcare or finance, domain-specific model requirements are coming regardless of jurisdiction.

Ask the question your vendor doesn’t want to answer: what was this model trained on, and can you show me? If the answer is vague, build that uncertainty into how you use it.

The companies building these models know more than they’re sharing. That gap between what they know and what you know is where the risk lives.


This post is part of a series on building responsibly with AI. Up next: how publishing authoritative content in your industry can shape what AI systems know - and say - about your business.