Building an AI system is about more than creating algorithms, code and data. It also comes with responsibilities: ensuring that no part of the system is built on stolen data or content copied without permission, and that the system cannot cause significant harm if it is hacked or used inappropriately.
In this 4-part series, we'd like to share some important considerations at every stage of AI development.
Data is the foundation of any AI system. It acts as the training material for machine learning models. Regardless of the AI methodology (supervised, unsupervised or deep learning), AI needs large amounts of training data. This data comes from open repositories, proprietary databases, web scraping, user-generated content, or licensed datasets. Knowing who owns that data is crucial: often it is protected by copyright, meaning someone holds legal rights over it. Owners may allow its use only under specific conditions, typically set out in a license.
License terms: stay within the agreed terms of use
When you have found the data you want to use, make sure to check the licensing terms that govern its use. Can you use the data for research? Commercially?
Some datasets, like DrugBank’s comprehensive drug database in Canada, are clearly licensed for commercial use. Choosing data with transparent licensing terms (and abiding by these terms!) is the first step to avoid issues later and supports responsible AI development.
Above: example of license terms of use for DrugBank Content
Available online does not mean it is free to copy
Just because data is on the internet doesn't mean it is free for everyone to copy and use; someone owns it. Thanks to international agreements, images, text, and more are automatically protected by copyright in nearly every country in the world. Therefore, AI models that copy publicly available data without permission could face lawsuits for unauthorized use. Unsurprisingly, more and more content creators are calling on their governments to impose stricter controls on the unauthorized use of their work.
Can you even use this data? A case of problematic data acquisition
A recent case in point is from 2023, when Stability AI used millions of Getty Images photographs, without permission, to train its model. U.S.-based Getty Images, which licenses the images from thousands of contributors, sued U.K.-based Stability AI for unauthorized use of these copyright-protected images.
Fair Use / Fair Dealing Debate
In copyright law, fair dealing (Canada) and fair use (U.S.) are exceptions that allow limited use of copyrighted material without permission. This rule aims to balance creators' rights with the public’s right to access creative works. This often applies to criticism, commentary, news, teaching, or research.
Use of data to train an AI system may not be considered fair use
In February 2025, in a landmark decision on AI and copyright, a U.S. district court ruled that an AI startup, Ross Intelligence, broke copyright rules by using thousands of Thomson Reuters’ Westlaw’s legal summaries to train its AI tool. The court said this wasn’t fair use because Ross used the content to build a competing product without permission. The court found that the copying was commercial, not transformative, and could harm the market for Westlaw’s content, setting an important precedent for AI training data.
Consider alternative approaches to achieving copyright compliance
It may be worthwhile to build a system that doesn't rely on copying copyrighted data; in fact, that can become a selling point. For example, if your AI system offers improvements to copyrighted works (music, text), you may be able to generate tailored improvements without first copying those works. This approach supports both high-quality results and legal compliance.
The Bottom Line: Build Responsibly from the Start
As AI development accelerates, so too does scrutiny around how data is acquired and used. Ignoring copyright and licensing obligations isn’t just a legal risk, it’s a reputational and ethical one. Companies that prioritize responsible data use will not only avoid costly litigation but also build systems that are trustworthy, transparent, and sustainable.
Stay tuned for Part 2: Creating and Storing Data—where we’ll explore how to safeguard the data you generate and ensure it forms a strong legal and ethical foundation for your AI system.
About the Author:
Allessia Chiappetta is a second-year JD candidate at Osgoode Hall Law School with a keen interest in intellectual property and technology law. She holds a Master of Socio-Legal Studies from York University, specializing in AI regulation.
Allessia works with Communitech’s ElevateIP initiative, advising inventors on the innovation and commercialization aspects of IP.
Allessia regularly writes on IP developments for the Ontario Bar Association and other platforms. She is trilingual, speaking English, French, and Italian.