Every AI system that learns from data raises a copyright question: does the process of using existing content — text, images, code, music, databases — to train an AI model constitute an act that requires permission from the rights holder? The answer depends on the nature of the content, the purpose of the training, and the exceptions available under applicable copyright law.
The EU’s position on this question is shaped by two intersecting pieces of legislation: the Digital Single Market (DSM) Directive (Directive 2019/790), which established the text and data mining framework, and the EU AI Act, which imposes specific transparency and copyright compliance obligations on providers of general-purpose AI models.
Understanding how these frameworks interact is essential for three categories of stakeholders: AI developers who train models on third-party content, businesses that procure AI systems and need to assess IP risks in their supply chain, and content creators and rights holders who want to control whether their works are used for AI training.
The DSM Directive provides two exceptions that are relevant to AI training.
Article 3: TDM for research. Research organisations and cultural heritage institutions may carry out text and data mining (TDM) of works to which they have lawful access, for the purposes of scientific research. This exception is mandatory and cannot be overridden by contract. It is limited to these beneficiaries and to scientific research; commercial AI developers cannot rely on it.
Article 4: TDM for any purpose. Any person may carry out TDM of works to which they have lawful access, provided that the rights holder has not reserved their rights in an appropriate manner. This is the broader exception that applies to commercial AI training. Crucially, it is subject to an opt-out: rights holders can reserve their rights, and if they do, the exception does not apply.
For AI developers, Article 4 means that training on publicly available content is permitted — unless the rights holder has opted out. For rights holders, it means that the default position permits TDM, but they can prevent it by expressly reserving their rights.
The opt-out under Article 4(3) of the DSM Directive requires that the reservation of rights be expressed in an appropriate manner. For content made available online, this includes machine-readable means such as metadata or terms and conditions of a website or service.
In practice, this means that rights holders can opt out of AI training by including a clear statement in their terms of use (for example, “Text and data mining of content on this website for the purpose of training AI models is not permitted”), implementing the robots.txt protocol with appropriate directives (such as disallowing specific AI crawlers), using metadata standards that signal the reservation of rights, and including opt-out notices in the metadata of individual files or publications.
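By way of illustration, a robots.txt-based reservation could look like the following sketch. The user-agent tokens shown (GPTBot for OpenAI's crawler, CCBot for Common Crawl) are published crawler names, but the choice of which agents to block is the site operator's; this is an illustrative configuration, not a guarantee of legal effect.

```text
# Illustrative robots.txt: disallow named AI training crawlers
# while leaving the site open to all other agents.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```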
The effectiveness of the opt-out depends on visibility and machine-readability. A reservation buried in terms and conditions that no crawler reads may be legally valid but practically ineffective. Conversely, a robots.txt directive may be technically effective but its legal status under the DSM Directive is still subject to debate.
For AI developers, the obligation is clear: before training on content, check whether the rights holder has opted out. If they have, the Article 4 exception does not apply, and using the content without permission constitutes copyright infringement.
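That check can be sketched in code. The minimal example below uses Python's standard urllib.robotparser to test whether a robots.txt file reserves a path against a given crawler; the user-agent tokens, URLs, and robots.txt content are illustrative, and a real compliance pipeline would also need to look for reservations in site terms and content metadata.

```python
# Sketch: testing a robots.txt reservation before crawling content for
# AI training. Illustrative only; robots.txt is one of several places
# a rights reservation may be expressed.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt retrieved from a target site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(user_agent: str, url: str) -> bool:
    """True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(may_crawl("GPTBot", "https://example.com/articles/1"))        # False
print(may_crawl("SomeOtherBot", "https://example.com/articles/1"))  # True
```

If the named crawler is disallowed, the Article 4 exception is unavailable for that content and the developer must either obtain a licence or exclude it from the training set.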
The AI Act reinforces and extends the copyright framework for AI training through its GPAI provisions (Chapter V). Under Article 53(1)(c), providers of general-purpose AI models must establish a policy to comply with Union copyright law, in particular to identify and comply with rights reservations expressed pursuant to Article 4(3) of the DSM Directive. This obligation goes beyond simply not infringing copyright — it requires proactive compliance measures.
Additionally, under Article 53(1)(d), GPAI providers must make publicly available a sufficiently detailed summary of the content used for training the model. This transparency requirement is designed to enable rights holders to determine whether their content was used and to assess whether their opt-out was respected.
The European AI Office is tasked with developing a template for the training data summary, and the level of detail required is a subject of ongoing discussion. The summary must be sufficiently detailed to allow rights holders to exercise their rights, but the AI Act also recognises the need to protect trade secrets and proprietary training methodologies.
A related but distinct question is whether AI-generated output is protected by copyright. Under EU copyright law, copyright protection requires that a work be the author’s own intellectual creation — reflecting the author’s creative choices. Works generated autonomously by an AI system, without human creative input in the expression of the work, are unlikely to meet this threshold.
However, the analysis is rarely that simple. In practice, most AI-generated content involves some degree of human involvement — in the selection and arrangement of prompts, in the curation and editing of outputs, in the creative decisions about how to use the AI as a tool. The extent of copyright protection depends on the degree of human creative contribution to the final work.
For businesses using AI to generate content — marketing materials, design elements, code, reports — this means that purely AI-generated output may not be protected by copyright (and therefore cannot be exclusively claimed or enforced), output that reflects substantial human creative input may be protected, and the analysis is fact-specific and depends on the degree of human involvement in the creative process.
The practical implication is that businesses should not assume that AI-generated content is automatically protected, and should maintain records of the human creative input involved in producing important works.
When procuring AI systems from third-party providers, businesses should assess the IP risks in the AI supply chain. Key questions include: Can the provider represent that its training data was lawfully obtained? Did the provider comply with the TDM opt-out obligations, and does it have the copyright compliance policy required by the AI Act? Do the provider’s terms include IP indemnification provisions — what happens if the AI system’s output infringes a third party’s intellectual property rights? Who owns the IP in outputs generated by the AI system when used by the deployer? And is the provider’s training data summary (required under the AI Act for GPAI models) available and sufficiently detailed?
These questions should be addressed in the procurement contract. Standard SaaS terms often do not adequately cover AI-specific IP issues, and deployers should negotiate provisions that allocate IP risk appropriately.
If your business creates content — publications, reports, images, software, databases — and you want to control whether that content is used to train AI models, you should take affirmative steps to exercise the opt-out under the DSM Directive.
Practical measures include updating your website’s terms of use to include an express reservation of rights regarding text and data mining for AI training purposes, implementing robots.txt directives that restrict known AI crawlers, adding machine-readable metadata to your content indicating that TDM rights are reserved, monitoring whether your content appears in AI training data summaries published by GPAI providers, and considering technological protection measures where appropriate.
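As an illustration of machine-readable metadata, one emerging convention is the TDM Reservation Protocol (TDMRep), developed by a W3C community group, which expresses the Article 4(3) reservation via an HTTP response header or HTML metadata. The signals below follow that protocol; the policy URL is hypothetical, and whether any particular signal satisfies the Directive’s “appropriate manner” test remains legally unsettled.

```text
# TDMRep-style reservation signals (illustrative):

# 1. As HTTP response headers:
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json

# 2. As HTML metadata in a page's <head>:
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://example.com/tdm-policy.json">
```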
The opt-out is not retroactive — it applies to future mining, not to content that has already been ingested. For content that has already been used in training, enforcement depends on whether the use occurred before the opt-out was expressed and whether the applicable exception was available at the time.
The interaction between AI and intellectual property is one of the most rapidly evolving areas of EU law. The European Commission is actively monitoring the effectiveness of the TDM opt-out mechanism, the adequacy of the training data transparency requirements, and the broader questions around AI-generated works and patent eligibility. Further guidance and potentially legislative action can be expected as the AI Act’s provisions take full effect and as practical experience with compliance accumulates.
If you need to assess IP risks in your AI systems or protect your content from unauthorised AI training, get in touch or schedule a meeting with our team.
