In this insight, we look at the process of AI training, the potential pitfalls of misused data, and what measures can be taken to protect your personal and business data from being used to train AI. 

Data – For AI Training 

AI training, at its core, involves feeding large datasets to algorithms, thereby enabling them to learn and make ‘intelligent’ decisions. These datasets are often culled from user-generated content across various platforms. Understanding the source and nature of this data is crucial for recognising the implications of its use. 

Data, therefore, is the lifeblood of AI models, and the quality, quantity, and variety of data directly influence an AI model’s performance. For example, language models require vast amounts of text data to understand and generate human-like responses, while image recognition models need diverse visual data to improve accuracy. 

One of the most contentious ways that generative AI companies have allegedly gathered enough training data in recent years, resulting in many lawsuits, is the scraping (automatic collection) of online content and data. High-profile examples include: 

  • A class action lawsuit filed in the Northern District of California accused OpenAI and Microsoft of scraping personal data from internet users, alleging violations of privacy, intellectual property, and anti-hacking laws. The plaintiffs claimed that this practice violates the Computer Fraud and Abuse Act (CFAA).  
  • Google was accused in a class-action lawsuit of misusing large amounts of personal information and copyrighted material to train its AI systems, thereby raising issues about the boundaries of data use and copyright infringement in the context of AI training.  
  • A class action lawsuit against Stability AI, Midjourney, and DeviantArt claimed that the companies used copyrighted images to train their AI systems without permission.   
  • Back in February 2023, Getty Images sued Stability AI alleging that it had copied 12 million images to train its AI model without permission or compensation. 
  • In December 2023, The New York Times sued OpenAI and Microsoft, alleging that they used millions of its articles without permission or payment to help train chatbots. 

In many of these cases, the legal argument for allowing such use has rested on “fair use” and “transformative outputs.” Under US law, the “fair use” doctrine allows limited use of copyrighted material without permission or payment, especially for purposes like criticism, comment, news reporting, teaching, scholarship, or research. 

What About Your Data? Could It Be Used For AI Training … And How? 

When it comes to your personal and business data, many of the big AI companies have already scraped the web, so whatever you’ve posted is probably already in their systems. There are also several other channels through which your data could end up in AI training data. For example: 

  • Online Activity. When you browse websites, search engines, and social media, companies collect your data to personalise services and train AI to predict user behaviour. 
  • Device Usage. Smartphones, wearables and smart home devices collect data about your daily activities, locations, health statistics, and preferences, all of which is useful for training AI in areas like health monitoring, personal assistance, and device optimisation. 
  • Service Interactions. Interacting with customer service chatbots or voice assistants provides conversational data that helps train AI to understand and generate human-like responses. 
  • Content Creation. Uploading videos, writing reviews, or creating other content on platforms can provide data for AI to learn about content preferences and creation styles. 
  • Transactional Data. Purchases, financial transactions, and browsing products online give insights into consumer behaviour, which AI uses to enhance recommendation engines and advertising algorithms. 

All of these channels, which could involve your data, help AI systems learn and adapt to provide more personalised and efficient services. 

The Risks of Data Misuse 

There are, of course, risks in having your data used/misused by AI. These risks include: 

– Privacy and security concerns. The primary risk of using data in AI training is the potential for significant privacy breaches. Sensitive information, if not adequately protected, can be exposed or misused, leading to serious consequences for individuals and businesses alike. 

– Bias and ethical implications. Another critical concern is the propagation of bias through AI systems. If AI is trained on biased or unrepresentative data, it can lead to unfair or prejudiced outcomes, which is especially problematic in sectors like recruitment, law enforcement, and credit scoring. 


For some people, a particular issue is that their creative artwork or images have been used to train AI. The website, for example, is an online tool that uses clip-retrieval to search the largest public text-to-image datasets. In this way, artists can find links to images that they want to opt out of being used to train generative AI systems and have them removed. 

What Proactive Measures Can You Take To Protect Your Data? 

Bearing in mind the significant privacy risk posed by AI, there are a number of proactive measures you can take to stop your data from being used to train AI. For example: 

Opt-Out Options and User Consent 

Many of the services you use from the big tech companies provide mechanisms for users to opt out of data sharing. Familiarising yourself with these options and understanding how to activate them is essential for maintaining control over your data. Examples include: 

If you store your files in Adobe’s Creative Cloud and have a personal account, you can opt out of having them used for training by going to the Content analysis section and clicking the toggle to turn it off. 

If you’re a Google Gemini (AI) user, to prevent your conversations being used, open Gemini in a browser, click on Activity, and select the Turn Off drop-down menu. 

If you’re a ChatGPT account holder and are logged in through a web browser, select ChatGPT, Settings, Data Controls, and then turn off Chat History & Training. 

For the Squarespace website building tool, to block AI bots, open Settings (in your account), find Crawlers, and turn off Artificial Intelligence Crawlers. 

These are just a few examples and it will be a case of going through each of the main services you use and trying to find the opt-out (perhaps using Google to help as you go). However, it’s worth noting that some are either very difficult to find or simply aren’t available for certain types of account. Overall, this can be quite a time-consuming process. 
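
For a website you manage yourself, one commonly used counterpart to the crawler setting mentioned above is a robots.txt file that asks specific AI crawlers not to visit. The Python sketch below is only an illustration under stated assumptions: the crawler names in AI_CRAWLERS are examples that should be checked against each vendor's current documentation, the helper names build_robots_txt and is_allowed are hypothetical, and example.com is a placeholder. Note too that robots.txt is a request rather than an enforcement mechanism, so crawlers that do not follow it may ignore it.

```python
from urllib.robotparser import RobotFileParser

# Example AI-related crawler names (an assumption for illustration;
# check each vendor's documentation for the current tokens).
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(user_agents):
    """Return robots.txt text asking each listed crawler to avoid the whole site."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in user_agents]
    return "\n".join(blocks)


def is_allowed(robots_txt, user_agent, url):
    """Report how a crawler that follows robots.txt would treat the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_robots_txt(AI_CRAWLERS)
    print(rules)
    # A listed AI crawler is asked to stay away; other crawlers are unaffected.
    print(is_allowed(rules, "GPTBot", "https://example.com/blog/"))
    print(is_allowed(rules, "SomeOtherBot", "https://example.com/blog/"))
```

The generated text would be saved as robots.txt at the root of the site. Checking the rules with the standard library parser before publishing is a simple way to confirm that ordinary search crawlers have not been blocked by mistake.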

Enhanced Data Management Practices 

Businesses should implement strict data management policies that govern the collection, storage, and use of data. These policies can help ensure that data is handled ethically, in compliance with relevant data protection laws, and shielded from use in AI training. 

Leveraging Technology for Data Security 

Advanced technological solutions, such as encryption and secure data storage systems, may also be able to play a critical role in protecting data from unauthorised access and breaches that could lead to it finding its way into the hands of AI companies for training. 

What Does This Mean for Your Business? 

For businesses today, the pervasive use of data by AI underscores the dual imperatives of protection and vigilance. The reality is that many AI companies have likely already collected extensive swathes of public internet data, including potentially from your own business activities, which poses a distinct challenge. This means that data posted online (either deliberately or inadvertently) may already be part of training sets used to enhance AI capabilities. 

That said, businesses still hold significant power to influence future data usage and secure existing data. For example, businesses can take proactive steps by regularly reviewing the privacy policies and settings of the digital platforms they use. This includes social media, cloud storage, business software, and any platform where data is stored or shared. Although navigating these settings can be complex, finding and activating opt-out features may be necessary for maintaining control over how your data is used. 

Businesses may also wish to educate their employees about data sharing and privacy settings. Training sessions can help employees understand the importance of data privacy and the steps they can take to ensure data is not inadvertently shared or used for AI training without consent. 

Developing and enforcing robust data management policies is essential in any case. Doing so not only supports compliance with data protection regulations but also limits unnecessary data exposure that could be exploited by AI systems. These policies should govern how data is collected, stored, and shared, ensuring that data handling within the company is done ethically and responsibly. 

Deploying advanced technological solutions such as encryption, secure access management, and data loss prevention tools can also significantly reduce the risk of unauthorised data access. This is particularly relevant in preventing breaches that could see sensitive information being used to train AI without your knowledge. While it is challenging to completely control all data that may already be within AI training datasets, businesses can still exert significant influence over their current data handling and future engagements.
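
As a very rough illustration of the idea behind data loss prevention tools, the Python sketch below redacts obviously sensitive patterns from text before it is pasted into, or sent to, an external service. It is a minimal sketch rather than a real product: the two patterns (email addresses and card-like numbers), the placeholder wording, and the function name redact are all assumptions made for this example, and commercial tools detect far more than this.

```python
import re

# Illustrative patterns only (assumptions): real DLP tools use much richer detection.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact(text):
    """Replace each pattern match with a labelled placeholder.

    Returns the redacted text and a list of (label, count) pairs
    for the patterns that were found.
    """
    findings = []
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[{label} REDACTED]", text)
        if count:
            findings.append((label, count))
    return text, findings


if __name__ == "__main__":
    message = "Contact jane@example.com, card 4111 1111 1111 1111."
    clean, found = redact(message)
    print(clean)
    print(found)
```

A check like this could sit in front of any tool that forwards text outside the business, with the findings logged so that staff can see what was withheld and why.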

Finally, with ongoing AI legal battles and new regulations, staying informed about your rights and the latest developments in data privacy law could be prudent. This knowledge could help businesses advocate for their interests and respond more adeptly to changes in the legal landscape that affect how their data can be used. 
