
Image by Adisorn, from Adobe Stock

Study Reveals Growing Data Restrictions Impacting AI Training

  • Written by Kiara Fabbri, Former Tech News Writer
  • Fact-Checked by Justyn Newman, Former Lead Cybersecurity Editor

A new study led by an MIT research group reveals a growing trend of websites limiting the use of their data for AI training. The study examined 14,000 web domains and found that restrictions have been placed on 5% of all data. Additionally, over 28% of data from the highest-quality sources across three commonly used AI training datasets is restricted. This study is the first large-scale longitudinal audit of consent protocols for web domains used in AI training corpora.
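
The consent protocols in question are largely machine-readable signals, most notably robots.txt files that tell crawlers which user agents may fetch which pages. As a rough illustration of how such a restriction can be checked (this is not the study's own tooling, and the domain below is a placeholder), Python's standard library can query a site's robots.txt for well-known AI crawler agents such as OpenAI's GPTBot or Common Crawl's CCBot:

```python
from urllib.robotparser import RobotFileParser

# Illustrative only: check whether a site's robots.txt allows given
# crawlers. GPTBot and CCBot are real crawler user agents, but
# example.com is a placeholder domain.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

for agent in ("GPTBot", "CCBot", "*"):
    allowed = rp.can_fetch(agent, "https://example.com/some-article")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```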

Generative AI systems, like ChatGPT, Gemini, and Claude, rely heavily on vast amounts of data to function effectively. The quality of these AI tools’ outputs depends significantly on the quality of the data they are trained on. Historically, gathering this data was relatively straightforward, but the recent surge in generative AI has led to tensions with data owners. Many data owners are uneasy about their content being used for AI training without compensation or proper consent.

The consequences of this data squeeze are multifaceted. The restrictions will make developing AI systems more difficult, since those systems rely heavily on web data for training. They may also bias AI models by limiting them to less diverse datasets. Additionally, copyright issues could arise if AI models are trained on data that websites don’t want used for that purpose.

The restrictions are already having a noticeable impact. In just one year, a substantial portion of data from important websites has become restricted, and this trend is expected to continue.

Shayne Longpre, the study’s lead author, states: “We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities.”

This means that smaller AI companies and academic researchers who depend on freely available datasets could be disproportionately affected, as they often lack the resources to license data directly from publishers.

For example, Common Crawl, a dataset comprising billions of pages of web content and maintained by a nonprofit, has been cited in over 10,000 academic studies, illustrating its critical role in research.

The study highlights the need for new tools that give website owners more control over how their data is used. Ideally, these tools would allow them to differentiate between commercial and non-commercial uses, permitting access for research or educational purposes.
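
No widely adopted mechanism currently lets a site grant access by purpose rather than by crawler, so any such tool remains speculative. As a purely hypothetical sketch of what purpose-based gating might look like, the categories, agent mappings, and policy below are invented for illustration:

```python
# Hypothetical sketch of a consent tool that distinguishes commercial
# AI training from research use. These categories and this policy are
# illustrative, not an existing standard.
POLICY = {
    "commercial-ai-training": False,  # deny commercial model training
    "research": True,                 # allow academic/noncommercial use
    "search-indexing": True,          # allow traditional search crawling
}

AGENT_CATEGORIES = {
    "GPTBot": "commercial-ai-training",  # OpenAI's crawler
    "CCBot": "research",                 # Common Crawl's crawler
    "Googlebot": "search-indexing",
}

def may_crawl(user_agent: str) -> bool:
    """Return True if the site's (hypothetical) policy permits this agent."""
    category = AGENT_CATEGORIES.get(user_agent)
    return POLICY.get(category, False)  # default deny for unknown agents

print(may_crawl("CCBot"))   # True  - research use permitted
print(may_crawl("GPTBot"))  # False - commercial training denied
```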

The situation also serves as a reminder to big AI companies: a more sustainable approach to sourcing data is crucial for the continued development of AI.

Longpre emphasised the need for big AI companies to collaborate with data owners and offer them value in return for access. For years, these companies have treated the internet as an “all-you-can-eat data buffet” without giving much in return to data owners. However, this approach is unsustainable, and as data owners become more protective of their content, AI companies will need to find ways to work with them to ensure continued access to high-quality data.


AI Model Detects Alzheimer’s Disease Better Than Standard Clinical Markers

  • Written by Kiara Fabbri, Former Tech News Writer
  • Fact-Checked by Justyn Newman, Former Lead Cybersecurity Editor

Researchers from Cambridge University created an AI model that can predict with high accuracy whether someone with early memory issues is likely to develop Alzheimer’s disease, and how quickly it might progress. Published in eClinicalMedicine, their study shows the model outperforms current methods used to diagnose dementia.

To achieve this, the research team trained a machine learning model to predict Alzheimer’s disease from non-invasive, routinely collected data such as cognitive test scores and structural MRI scans. They then tested the model’s accuracy on real-world data from 1,500 patients across the US, UK, and Singapore.
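
The paper’s exact pipeline is not reproduced here, but the general shape of such a model is familiar: a classifier trained on tabular features like cognitive test scores and MRI-derived measurements. The following is a minimal sketch under those assumptions, using synthetic data and invented feature names rather than the study’s real inputs:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative sketch only: predict progression from mild cognitive
# impairment to Alzheimer's from routinely collected features.
# All features and data are synthetic; this is not the study's model.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(26, 3, n),     # cognitive test score (MMSE-like, invented)
    rng.normal(3.0, 0.5, n),  # hippocampal volume from MRI, cm^3 (invented)
    rng.normal(70, 8, n),     # age in years
])
# Synthetic label: lower scores/volumes loosely raise progression risk.
risk = (30 - X[:, 0]) + 4 * (3.5 - X[:, 1]) + 0.1 * (X[:, 2] - 65)
y = (risk + rng.normal(0, 2, n) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```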

The algorithm effectively distinguished between individuals with stable mild cognitive impairment and those who progressed to Alzheimer’s disease within three years. It correctly identified those who developed Alzheimer’s in 82% of cases and those who did not in 81% of cases. Notably, this makes the AI model about three times more accurate than standard clinical markers.
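
Those two figures are, in effect, the model’s sensitivity (the share of true progressors it flags) and specificity (the share of stable patients it clears). A minimal example of how the rates fall out of prediction counts follows; the confusion-matrix numbers are invented to match the reported percentages:

```python
# Hypothetical confusion-matrix counts, invented for illustration:
tp, fn = 82, 18  # progressors correctly / incorrectly classified
tn, fp = 81, 19  # stable patients correctly / incorrectly classified

sensitivity = tp / (tp + fn)  # share of true progressors detected: 0.82
specificity = tn / (tn + fp)  # share of stable patients cleared: 0.81
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```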

In their study, the researchers point out how this AI-guided approach has the potential to significantly improve patient care. Firstly, the model could reduce the need for expensive and invasive procedures, ultimately lowering healthcare costs. Furthermore, by identifying those most likely to develop Alzheimer’s disease, scarce medical resources could be targeted more effectively. Additionally, this approach could help standardize diagnoses across different memory clinics, leading to more consistent care and reducing inequalities in healthcare access.

Professor Zoe Kourtzi, senior author from Cambridge University’s Department of Psychology, stated:

“We’ve created a tool which, despite using only data from cognitive tests and MRI scans, is much more sensitive than current approaches at predicting whether someone will progress from mild symptoms to Alzheimer’s – and if so, whether this progress will be fast or slow. […] This has the potential to significantly improve patient wellbeing, showing us which people need closest care, while removing the anxiety for those patients we predict will remain stable. At a time of intense pressure on healthcare resources, this will also help remove the need for unnecessary invasive and costly diagnostic tests.”

The researchers also point out some limitations of the study that need to be considered. These include the size and diversity of the population sample and the tools used to collect data. To improve their AI model, they need more real-world patient data from various healthcare systems and countries. This will help them ensure the AI works well for different groups of people, making it more useful worldwide.

However, given their promising findings, the research team aims to extend their model to encompass other dementia forms like vascular and frontotemporal dementia. Additionally, they plan to explore incorporating blood test markers into their data analysis.

Professor Kourtzi added: “If we’re going to tackle the growing health challenge presented by dementia, we will need better tools for identifying and intervening at the earliest possible stage. Our vision is to scale up our AI tool to help clinicians assign the right person at the right time to the right diagnostic and treatment pathway. Our tool can help match the right patients to clinical trials, accelerating new drug discovery for disease modifying treatments.”

This research is especially critical because dementia cases are on the rise worldwide. According to the World Health Organization, over 55 million people currently live with dementia, and more than 60% of them live in developing countries. Alarmingly, nearly 10 million new cases are diagnosed each year. Dementia is a leading cause of disability in older adults, often leaving them dependent on caregivers for assistance.

Against this backdrop, this AI model, with its ability to predict dementia early, has the potential to significantly improve how we manage this growing global health challenge.