To Scrape or Not to Scrape? First Court Decision on the EU Copyright Exception for Text and Data Mining in Germany


On September 27, 2024, the Regional Court (Landgericht) of Hamburg, a court of first instance (the “Court”), dismissed a cease-and-desist claim brought by the photographer Robert Kneschke against LAION e.V. The claimant had argued that LAION infringed his copyright in his photos by scraping them from a photo stock website to create a dataset to be used for AI training.

The Court found that LAION could rely on the statutory copyright exception of Section 60d of the German Copyright Act that permits reproductions of copyrighted content for text and data mining (TDM) for non-commercial scientific research purposes without the rights holder’s consent. While the Court was not required to (and did not) decide on the application of the TDM exception for other purposes, including commercial purposes under Section 44b of the German Copyright Act, the Court provided an obiter dictum on its application. The Court expressed doubts whether LAION could have invoked this copyright exception for commercial purposes. The Court was inclined to consider the opt-out published in a natural language in the photo stock website’s terms of use an effective machine-readable TDM opt-out under Section 44b of the German Copyright Act depending on the state of technical developments at the time of the scraping of the content.

This judgment will be noted not only in Germany but also in other EU Member States, as Section 60d and Section 44b of the German Copyright Act implement Articles 3 and 4 of the EU Directive on Copyright in the Digital Single Market (“DSM Directive”). These provisions had to be implemented in all EU Member States and became effective as of June 7, 2021. Given the general relevance of this issue for both rights holders and users of TDM in the context of AI training, the claimant may appeal against this judgment. Depending on the findings of the courts of higher instance, the case may be referred to the Court of Justice of the European Union (CJEU), which has exclusive jurisdiction over the interpretation of EU law to ensure its uniform application within all EU Member States.

Despite the relevance of this case for TDM, including in the context of AI training, we note that the subject of the judgment is limited to the question of whether the reproduction of the photos in connection with the creation of the dataset was a copyright infringement. The case does not answer the question of whether and to what extent reproductions made in specific AI model training cases are shielded by the TDM exceptions.

I. The Case

LAION e.V. is a registered non-profit association seated in Germany. LAION created a dataset, the LAION-5B dataset (the “Dataset”), in the second half of 2021 based on a dataset compiled by the Common Crawl Organization. The Dataset consists of 5.85 billion filtered image-text pairs that include hyperlinks to images publicly accessible online and other information on the linked images (including a description of the images) but does not include any images. LAION used the hyperlinks from the original dataset and downloaded the linked images from various websites. LAION ran a software tool over the images to analyze whether the image descriptions in the dataset were correct, and it filtered out and deleted data pairs that did not match. Finally, LAION transferred the remaining hyperlinks and metadata (including the image descriptions) into the Dataset. LAION has made the Dataset available to the public free of charge for AI training. As part of the creation of the Dataset, LAION also downloaded and stored the claimant’s photos that were displayed for commercial licensing on www.bigstockphoto.com. LAION did not download the versions of the claimant’s photos that were available only under a license behind a paywall but downloaded only the publicly displayed versions that included the watermark of the photo stock website.
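
For readers who want a concrete picture of the workflow described above, the following is a minimal Python sketch under stated assumptions, not LAION’s actual code. The names Pair, build_dataset, download, and match_score, as well as the threshold, are hypothetical; the judgment refers only to “a software tool.” The sketch is meant to show that each image is copied temporarily for analysis while only the hyperlink and the description are retained.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    """One image-text pair: a hyperlink to an image plus its description."""
    url: str
    caption: str


def build_dataset(
    pairs: Iterable[Pair],
    download: Callable[[str], bytes],            # hypothetical: fetches the image
    match_score: Callable[[bytes, str], float],  # hypothetical: image/text match
    threshold: float,                            # hypothetical cut-off value
) -> List[Pair]:
    """Keep only the pairs whose description matches the linked image."""
    kept = []
    for pair in pairs:
        image = download(pair.url)  # temporary copy made for the analysis
        if match_score(image, pair.caption) >= threshold:
            kept.append(pair)       # only link and description are retained
        del image                   # the copy is not stored in the result
    return kept
```

Under these assumptions, the returned list contains hyperlinks and descriptions only, which mirrors the statement that the Dataset itself does not include any images.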

The terms of service of www.bigstockphoto.com displayed on the website included the following reservation of rights in the website provider’s natural language at the time when the photos were downloaded: “Restrictions: You may not … Use automated programs, applets, bots or the like to access the ... website or any content thereon for any purpose, including, by way of example only, downloading Content, indexing, scraping or caching any content on the website.”

II. The Judgment

1. Background

Under German/EU copyright law, any act that a user performs on copyrighted content that interferes with the exclusive rights assigned to the copyright owner of that content, such as reproducing or distributing the content, in principle requires the rights holder’s consent. In contrast to U.S. copyright law, there is no general doctrine of fair use. Instead, German/EU copyright law provides for several specific statutory use cases where users are permitted to use copyrighted content without requiring a license from the rights holders. Such use cases are called copyright exceptions and limitations. When deciding whether any such copyright exception applies to a specific use case, the courts must also consider whether the specific use case conflicts with a normal exploitation of the copyrighted work or unreasonably prejudices the legitimate interests of the rights holders (the so-called “three-step test”).[1]

The DSM Directive added two new copyright exceptions: (i) one for TDM for scientific research by research organizations (Art. 3 DSM Directive) and (ii) one for TDM performed for other purposes, including commercial purposes (Art. 4 DSM Directive). The EU Member States have implemented these copyright exceptions into their local laws.

2. The Court’s Decision

The Court found that, while LAION reproduced the photos when downloading them for the Dataset analysis, LAION did not require the claimant’s consent for such reproductions and did not infringe his copyrights in the photos because it could rely on the statutory copyright exception of Section 60d of the German Copyright Act that allows a reproduction of copyrighted content for TDM for non-commercial scientific research.

Exception for Temporary Reproductions (Section 44a of the Copyright Act)

First, the Court decided that downloading the photos was neither a transient nor an incidental reproduction. LAION therefore could not rely on the copyright exception of Section 44a, which permits temporary reproductions that are transient or incidental and form an integral and essential part of a technological process whose sole purpose is to enable (i) a transmission in a network between third parties by an intermediary, or (ii) a lawful use of a work, provided that the reproduction does not have independent economic significance.

TDM Exception for Commercial Purposes (Section 44b of the Copyright Act)

While the Court was not required to (and did not) make a final decision on the application of the TDM exception for commercial purposes, it extensively commented on this issue.

Section 44b Copyright Act states:

(1) ʻText and data miningʼ means the automated analysis of individual or several digital or digitized works for the purpose of gathering information, in particular regarding patterns, trends and correlations.

(2) It is permitted to reproduce lawfully accessible works in order to carry out text and data mining. Copies are to be deleted when they are no longer needed to carry out text and data mining.

(3) Uses in accordance with subsection (2) sentence 1 are permitted only if they have not been reserved by the right holder. A reservation of use in the case of works which are available online is effective only if it is made in a machine-readable format.

Reproduction for Text and Data Mining

The Court found that the reproduction of the photos by LAION qualifies as TDM. LAION downloaded the photos and used a software to analyze whether the image descriptions matched the images and to automatically filter out non-matching pairs and gather information on the correlations between the image-text pairs.

The claimant argued that reproducing and collecting data for AI training purposes is not TDM. He argued that AI training, especially the training of generative AI models, would not only extract hidden information in the data but would also use the intellectual-creative expression of the works to create comparable and competing products. Neither the European nor the German legislator, he argued, had envisaged such AI training uses when adopting the TDM exceptions.

The Court did not find these arguments convincing. In addition, it pointed out that the subject of the lawsuit brought by the claimant was only the reproductions of the photos made for creating the Dataset but not any possible, not yet clearly foreseeable, subsequent AI training activities that may be performed by LAION or third parties with the Dataset. As a result, the Court saw no need to decide whether training generative AI models with the Dataset qualifies as TDM under Section 44b. LAION’s intention was to create and make the Dataset publicly available for AI training in general. This general intention to use the Dataset for AI training would not justify denying LAION the application of Sections 44b and 60d and generally precluding AI training from the benefits of the TDM exceptions. According to the Court, both the German[2] and the current EU legislators[3] consider the creation of datasets for the training of artificial neural networks to be TDM.

In this context, the Court did not see an issue under the three-step test. The reproduction of the photos to create and make the Dataset publicly available neither impaired the normal exploitation of the photos by the claimant nor unreasonably prejudiced the claimant’s legitimate interests. Possible, but not clearly foreseeable, subsequent uses of the Dataset for training AI models that may generate similar content competing with the rights holder’s works should not be taken into account in the context of the creation of the Dataset by LAION.

Lawful Access to the Photos

The claimant’s photos were lawfully accessible on www.bigstockphoto.com. LAION copied only the “showcase” versions of these photos, which included the website’s watermark and were freely accessible to everyone online.

Rights Holder’s Opt-Out from the TDM Exception

Finally, the Court suggested that the claimant may have effectively reserved the rights to use his photos for TDM. Because the Court decided that Section 60d resolved this case, as discussed below, this reservation of rights did not affect the outcome, and the Court’s discussion of this point can be considered obiter dictum.

Express Opt-Out by a Rights Holder

In accordance with the wording of Section 44b(3), the Court confirmed that the original copyright owner, i.e., the claimant, need not himself declare an opt-out from TDM. It is sufficient if any rights holder of the reproduced works has declared an opt-out, here the photo stock agency to which the claimant had granted a non-exclusive license to market his photos. The Court regarded the stock agency as having done so in this instance.

Machine-Readable Opt-Out

The Court concluded that, given the legislative intent to enable automated data processing by web crawlers, “machine readable” should be interpreted as “machine understandable.” However, contrary to the prevailing view among legal experts[4], the Court was inclined to view an opt-out in natural language as machine understandable depending on the state of technical developments at the time of the scraping of the content.

In the case of content made available online, Section 44b(3) requires that the TDM opt-out be expressed in a machine-readable format.[5] However, the DSM Directive does not define the meaning of machine-readable formats. The Court argued that the definition of machine-readable in Recital 35[6] of Directive (EU) 2019/1024 on Open Data should not apply due to its different objectives. Instead, the Court referred to Art. 53(1)(c) of the EU AI Act that obligates the providers of general-purpose AI (GPAI) models to establish a copyright policy “to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed in Article 4(3)” of the DSM Directive.

According to the Court, AI tools that have natural language processing (NLP) capabilities and understand the semantic content of such natural language texts are state-of-the-art technology that should be applied for TDM. Please note that the Court did not collect any evidence as to the applicable state-of-the-art technology for the identification of TDM opt-outs on websites in 2021 (when LAION downloaded the photos from www.bigstockphoto.com), and a court’s view of the matter might well differ in a case in which such evidence was presented.

3. TDM for Scientific Research Purposes (Section 60d(1) of the Copyright Act)

Finally, the Court decided that the reproductions made by LAION when creating the Dataset qualify as TDM for scientific research purposes under Section 60d(1) of the Copyright Act.

Section 60d states in relevant part:

(1) It is permitted to make reproductions to carry out text and data mining (Sections 44b(1) and (2) sentence 1) for scientific research purposes in accordance with the following provisions.

(2) Research organisations are authorised to make reproductions. ʻResearch organisationsʼ means universities, research institutes and other establishments conducting scientific research if they

1. pursue non-commercial purposes,

2. reinvest all their profits in scientific research or

3. act in the public interest based on a state-approved mandate.

The authorisation under sentence 1 does not extend to research organisations cooperating with a private enterprise which exerts a certain degree of influence on the research organisation and has preferential access to the findings of its scientific research.

The Court found that the creation of the Dataset was scientific research. While it did not directly generate scientific findings, it was a necessary and essential step in the scientific research process that LAION made with the intention to make such Dataset available to other researchers in the field of artificial neural networks for their research. LAION acted for non-commercial purposes, because it made the Dataset available to the public free of charge. According to Recital 42 of the InfoSoc Directive, only the non-commercial nature of the specific activity in question is relevant; the organizational structure and funding of the organization are not decisive. While the claimant asserted that LAION had close connections with, and received financial and computing resources support from, commercial companies, the Court found that the claimant had failed to substantiate and prove that these commercial companies exerted a decisive influence on the research organization and had preferential access to the findings of its scientific research, as required by the last sentence of Section 60d(2), to carve out the creation of the Dataset from the TDM exception.

III. Takeaways

1. Should AI Training Be Carved Out From the TDM Exceptions?

The judgment clarifies that creators of datasets that use TDM and make available such datasets for AI training purposes on a general basis can be in scope for the TDM exceptions for non-commercial scientific research and commercial purposes. In addition, the judgment confirms that the creation of a dataset can qualify as non-commercial scientific research under Section 60d of the German Copyright Act. However, the judgment does not bring clarity to the dispute among legal experts[7] as to whether the specific use of copyrighted content in the training of AI models is in the scope of the TDM exceptions under Articles 3 and 4 of the DSM Directive as implemented in Sections 44b and 60d of the German Copyright Act. It seems doubtful whether the facts of this case would permit higher courts to provide more clarity on this issue in case of an appeal.

2. Can There Be an Opt-Out by Any Rights Holders?

The judgment shows that the TDM opt-out does not necessarily have to be expressed by the copyright owner of the work but can be expressed by any rights holder that holds exclusive or non-exclusive rights in the work. Copyright owners may consider obligating their licensees to include, in addition to a reservation of rights in natural language in their website terms, a machine-readable opt-out in a format specified under Section 3 below.

3. What Constitutes a Machine-Readable TDM Opt-Out?

The Court’s view that a TDM opt-out in website terms and conditions in a natural language is machine-readable does not help to create legal certainty for rights holders and TDM users as to which formats constitute an effective TDM opt-out. In addition, the Court’s view is questionable for the following reasons:

The Court did not consider that the obligation to apply state-of-the-art technology to identify and adhere to rights holders’ opt-outs from the TDM exception under Art. 53(1)(c) of the EU AI Act applies only to providers of general-purpose AI (GPAI) models. The scope of Art. 4 of the DSM Directive is broader than Art. 53(1)(c) of the EU AI Act. It applies to all users that perform web scraping for TDM whether used for their AI model training or other purposes and does not require that these users apply highly sophisticated AI tools with natural language processing (NLP) capabilities.
The EU AI Act is a product safety regulation for specific categories of AI models and AI systems. The regulatory requirement for GPAI model providers to have a copyright policy to use state-of-the-art technology to identify TDM opt-outs does not modify the EU copyright rules that the opt-outs must be made in an appropriate manner (see next point). It defines the state of technology that the GPAI providers will be expected to apply to identify and respect appropriate opt-outs to have a compliant product on the EU market.
According to the wording of Art. 4(3) of the DSM Directive, the term “machine-readable opt-out” is a subset of the requirement that the opt-out must be made “in an appropriate manner.”[8] Even if NLP-capable web crawlers trained for TDM opt-out identification with an acceptable error rate were readily available on the market for all users, these crawlers would have to search and process the entire natural language text on the website, in relation to all content assets they are intended to scrape, to ensure that no applicable opt-out is disregarded. If the natural language opt-out is included in an image or a video published on the website or is not written but expressed verbally (which would also qualify as natural language), the AI-powered crawler would have to provide for image and voice recognition features too. If digitized natural language suffices, would it also suffice to send emails to users naming the websites or content assets which the rights holders intend to opt out from the TDM exception? It seems questionable whether any of this can be deemed an appropriate format for opt-outs that balances the legitimate interests of rights holders and the users performing automated data analysis processes using software tools.
While, in the future, NLP-capable web crawlers that are trained to identify, process, and respect natural language TDM opt-outs on websites and in the metadata of specific content assets may evolve into generally accepted standards that are readily available in the market to all users performing web scraping, it is doubtful whether this is currently the case. As stated above, the Court did not collect or analyze any evidence on available technological standards in the proceedings and based its view on assumptions. Because this point was not necessary to the decision, the Court’s view on it is obiter dictum.

As the Court noted, the current prevailing view is that natural language opt-outs do not suffice but that the opt-out must be structured so that software applications can easily identify, recognize, and extract specific data (see footnote 4). However, currently, there are no generally recognized standards for a machine-readable opt-out from TDM. The establishment of standards for the use of manageable opt-outs with reasonable efforts for the various categories of rights holders and TDM users is necessary to make the opt-out scenario work for both sides. Various formats are discussed by the stakeholders:

The most common method on a website level, robots.txt, is a text file that uses a specific syntax to enable website operators to establish distinct rules for automated crawlers or scrapers, including permissions to index and display web pages and content. In 2022, the Internet Engineering Task Force (IETF) officially recognized it as the “Robots Exclusion Protocol” (REP).
The ai.txt technique, similar to robots.txt, involves placing a file at the root of a website that selectively restricts or permits access to the site’s content. Unlike robots.txt, which is read once upon website access, ai.txt is checked each time content is accessed through a specific interface for AI developers (the “Spawning API”).
A working group of the World Wide Web Consortium (W3C) proposed three communication standards for declaring opt-outs in February 2024, the so-called TDM Reservation Protocol (“TDM ReP”): Hypertext Transfer Protocol (HTTP); Hypertext Markup Language (HTML) metadata tags; and JavaScript Object Notation (JSON).
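
To illustrate the robots.txt method listed above, the following is a minimal Python sketch that uses the standard library module urllib.robotparser. The crawler name “ExampleTdmBot,” the example.com addresses, and the rules in the sample file are hypothetical and serve only to show how a crawler could read such a file before scraping.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named crawler from the whole site and
# blocks all other crawlers from the /private/ directory only.
ROBOTS_TXT = """\
User-agent: ExampleTdmBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow this crawler to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(ROBOTS_TXT, "ExampleTdmBot", "https://example.com/photos/1.jpg"))  # False
print(may_fetch(ROBOTS_TXT, "OtherBot", "https://example.com/photos/1.jpg"))       # True
```

Note that robots.txt addresses crawlers by name and does not itself distinguish TDM from other automated uses, so whether a given rule amounts to a TDM opt-out remains a legal question that the sketch does not answer.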

HTTP involves a machine-readable structured response from a server to a data request. According to this protocol, an opt-out can be declared by inserting the metadata “tdm-reservation” into the HTTP response header.
HTML is a text format readable by humans that also allows for the documentation of extensive semantic and syntactic information. The TDM ReP stipulates that HTML metadata be provided in a <meta> element with the same properties as in the HTTP response header, i.e., the name “tdm-reservation” or “tdm-policy” and the values 1 or the file location. Since multiple <meta> tags can be included in a single file, this approach allows for the differentiation of rights reservations for various paragraphs within the same text.
JSON is a widely used format for structuring data for data provisioning or exchange, which can be processed by many programming languages. According to the TDM ReP, a text file named “tdmrep.json” should be placed in the “/.well-known” directory of the website, with each piece of information represented by an attribute and a value.
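
The three TDM ReP channels described above could be checked by a crawler roughly as follows. This is a minimal Python sketch using only the standard library, not a complete implementation of the protocol. The function names are hypothetical, and the “location” attribute with simple prefix matching and first-match-wins logic is a simplifying assumption about the layout of the JSON file.

```python
import json
from html.parser import HTMLParser


def reserved_by_headers(headers: dict) -> bool:
    """True if the HTTP response headers carry tdm-reservation with value 1."""
    normalized = {name.lower(): str(value).strip() for name, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"


class _MetaCollector(HTMLParser):
    """Collects the name/content pairs of all <meta> elements in a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                self.meta[name.lower()] = (attributes.get("content") or "").strip()


def reserved_by_html(html_text: str) -> bool:
    """True if the page contains <meta name="tdm-reservation" content="1">."""
    collector = _MetaCollector()
    collector.feed(html_text)
    return collector.meta.get("tdm-reservation") == "1"


def reserved_by_json(tdmrep_text: str, path: str) -> bool:
    """True if the first rule whose location prefix matches the path reserves TDM.

    Assumption: the file is a list of objects with a "location" attribute;
    a trailing "*" is treated as a plain prefix wildcard.
    """
    for rule in json.loads(tdmrep_text):
        prefix = str(rule.get("location", "/")).rstrip("*")
        if path.startswith(prefix):
            return rule.get("tdm-reservation") == 1
    return False
```

A crawler could treat content as opted out if any of the three checks returns True. How conflicts between the channels should be resolved is left open in this sketch.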

Under the EU AI Act, the AI Office of the EU Commission is responsible for developing a GPAI Code of Practice, which will include requirements for the copyright policy that GPAI model providers need to establish, including measures to identify and adhere to TDM opt-outs. The development of the GPAI Code of Practice included an initial multi-stakeholder consultation and a kick-off plenary and will now continue in four working groups. It remains to be seen whether this GPAI Code of Practice will facilitate the development of generally accepted standards for TDM opt-outs under Art. 4 of the DSM Directive. The final version of the Code of Practice is expected to be published and presented in April 2025.

[1] The three-step test is laid down in Art. 5(5) of the EU InfoSoc Directive 2001/29/EC, which also applies to the new copyright exceptions introduced by the DSM Directive pursuant to Art. 7(2), first sentence, of the DSM Directive. This three-step test is derived from Art. 9(2) of the Berne Convention, Art. 13 of the TRIPS Agreement and Art. 10 of the WIPO Copyright Treaty.

[2] The German legislator states in its reasons to apply Sections 44b and 60d of the Copyright Act: “[These provisions] also expand the use of copyrighted works and other protected subject matter through text and data mining for scientific research and other purposes, thereby promoting innovation (Sustainable Development Principle 6, Sustainability Indicator 9.1). This is of particular importance for machine learning as a basic technology for artificial intelligence.” Drucksache 19/27426 (bundestag.de), p. 60.

[3] Art. 53(1)(c) of the EU AI Act states that providers of general-purpose AI models must “put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790.” Regulation - EU - 2024/1689 - EN - EUR-Lex (europa.eu).

[4] See Hamann, Nutzungsvorbehalte für KI Training in der Rechtsgeschäftslehre der Maschinenkommunikation (Reservation of use for AI training in the legal theory of machine communication), ZGE 16 (2024), p. 131 et seq., 146 et seq. with further references.

[5] While the language of the underlying Article 4(3) of the DSM Directive is slightly unclear on this aspect (“The exception … shall apply on condition that the use of works … has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online”), Recital 18 clarifies that “In the case of content that has been made publicly available online, it should only be appropriate to reserve those rights by the use of machine-readable means, including metadata and terms and conditions of a website or a service.” As a result, Section 44b(3) of the German Copyright Act is aligned with Art. 4(3) of the DSM Directive.

[6] “A document should be considered to be in a machine-readable format if it is in a file format that is structured in such a way that software applications can easily identify, recognise and extract specific data from it.”

[7] The apparently prevailing view is that AI training, including the training of generative AI models, should be subject to the TDM exceptions; see BeckOK UrhR/Bomhard, 43. Ed. February 15, 2024, UrhG § 44b, margin # 7-11b with further references; Hamann, loc. cit., p. 120 et seq. with further references; Konertz/Schönhof, Vervielfältigungen und die Text- und Data-Mining-Schranke beim Training von (generativer) Künstlicher Intelligenz (Reproductions and the Text and Data Mining Exception in the Training of (Generative) Artificial Intelligence), WRP 2024, p. 289 et seq. with further references; Hofmann, Zehn Thesen zu Künstlicher Intelligenz (KI) und Urheberrecht (Ten Theses on Artificial Intelligence (AI) and Copyright), WRP 2024, p. 11 et seq.; Maamar, Urheberrechtliche Fragen beim Einsatz von generativen KI-Systemen (Copyright Issues When Using Generative AI Systems), ZUM 2023, 481, 482 et seq. Some voices argue that at least generative AI models should be carved out from the application; see Dornis/Stober, Urheberrecht und Training generativer KI-Modelle (Copyright Law and Training of Generative AI Models), August 2024, expert opinion commissioned by the Initiative für Urheberrecht (Author’s Rights Initiative), p. 71 et seq., p. 103 et seq. with further references.

[8] Please note that in this respect Section 44b of the German Copyright Act has not fully implemented Art. 4 of the DSM Directive but must be interpreted in accordance with Art. 4 and Recital 18 of the DSM Directive.

Kristina Ehle
Partner
Yeşim Tüzün
Associate
