Artificial intelligence systems have become a ubiquitous part of our everyday lives. In particular, recent advances have reignited discussions about the benefits and risks of AI systems, from ChatGPT to deepfake images. ECDF and Elsevier have taken these recent developments as the impetus for an open discussion series on how digitalization is impacting the scientific enterprise: ECDF and Elsevier Conversations on Science in the Digital Future. After the first discussion on data protection in the digital age, the second panel took place on April 20, 2023, on the topic of “Responsible AI or Disinformation at Scale?” with Prof. Dr. Felix Biessmann, Professor of Data Science at ECDF and the Berliner Hochschule für Technik, Harry Muncey, Director of Data Science and Responsible AI at Elsevier, and Tabea Rößner, Member of the German Bundestag for Bündnis 90/Die Grünen and Chair of the Digital Committee. The event was moderated by journalist Katharina Heckendorf.
Three articles form the basis for this episode: the Dutch algorithm scandal, AI for drug discovery, and the outsourcing that was necessary for the creation of ChatGPT. In the Netherlands, the tax authorities used automated systems that falsely accused thousands of families of fraud, discriminating on the basis of nationality. In the pharmaceutical industry, AI is being used to improve drug discovery, helping to analyze large amounts of data and predict successful compounds. OpenAI, the company behind the recent ChatGPT AI system, profited from outsourcing, paying Kenyan workers as little as $2 an hour to make the algorithms less toxic. The workers had to sift through data, some of which included graphic violence and abuse. So is responsible AI even possible? What is its potential for harm? To what extent should we regulate and monitor these systems? What can we learn from the Dutch algorithm scandal? How can we ensure good-quality data sets? To what extent should we trust AI?
“It seems like the issue is a result of multiple failings in terms of responsible AI: a lack of oversight, accountability failures, transparency, and explanations with people being impacted by the decisions of the algorithms not having an opportunity to challenge the results. It is a prime example of how by using AI we can replicate existing biases in our systems,” said Harry Muncey about responsible AI and the Dutch example, pointing to the human prejudices that were likely embedded in the data used to train the algorithms of the Dutch tax authorities. Tabea Rößner comments: “We need good, open data in order to make sure that the used data sets are adequate. However, some risks cannot be minimized although the data is good.” She points to the AI Act and the risk-based approach the European Union is currently working on to protect human rights and personal data in a world of algorithms. Felix Biessmann agrees that data quality is an important issue: “We assume that the data is representative enough to train a model that can do predictions of any kind of data. Typically that is not the case. Algorithms can amplify biases and marginalize groups even further. Many of these problems are not AI-related, but data-related.” The ChatGPT example again shows the importance of data. The OpenAI program is largely trained on data from the Internet, where anyone can post content that is then digested by the AI. Hence the need for appropriate regulation to ensure that models are trained on high-quality data. When it comes to the human side of AI, Felix Biessmann highlights that “both is wrong: not trusting AI systems enough and blindly following them.” Harry Muncey agrees and perceives it as “problematic when expectations of what a system is capable of misaligns with the reality”.
When it comes to AI producing pharmaceutical products that no one has ever seen, how safe would it be to take such drugs? What role will human oversight play?
“What is inside? How would it be developed? What do the doctors say? We need human decisions in this,” says Tabea Rößner, who acknowledges the possibilities and opportunities of AI in medicine, but carefully considers whether we really want to know everything that AI can tell us. Felix Biessmann points out the importance of data quality and representative data sets: “For example, the MIMIC data set is one of the most important data sets in health care research in AI. Yet, it is tested and developed for white people. We need to work more on the diversity and heterogeneity of our data!” There is also the issue of privacy, as private information needs to be protected, especially when it comes to patient information. Felix Biessmann believes that “it is always a trade-off of privacy versus utility. Is it really good data?” Tabea Rößner argues that “we cannot force people to give up their data. Especially with rare diseases, you could identify the patient. It requires a balancing act of having good data, using the advantages, and protecting the people.” This also entails human oversight, which Harry Muncey believes will always be needed, “especially in health care which is the highest risk domain for applying AI and also one of the areas where we will see the biggest returns when it comes to AI systems. We will not be getting away from requiring a basic level of human oversight”. Felix Biessmann agrees that human oversight should remain, but argues that it should be implemented more efficiently by making use of rapidly advancing technology, which would also save time.
Regarding the dataset behind ChatGPT: do we still need humans to label the data to address the violent biases or hate speech it may contain? Could outsourcing be avoided?
“Outsourcing is a general problem! Even if the workers earn more, they would still see the traumatic content. It is our responsibility to protect these people,” says Tabea Rößner, referring to the Kenyan workers behind the OpenAI datasets. For Harry Muncey, “it is not a problem specific to AI. There is a responsibility to examine the supply chains of the products and technologies we are using just like we would for things that are not AI.” Implementing a supply chain law similar to Germany's Supply Chain Act would be one option. According to Felix Biessmann, this would be difficult to implement but is a great idea; he highlights how helpful it would be to see the data on which a model was trained. While the EU is working on an AI law, Tabea Rößner emphasizes that it is not easy to define appropriate regulations, but that there are many ongoing discussions. Harry Muncey reflects on the challenge of tackling AI with regulations: “It's a real challenge! Responsible AI for particular algorithms is a very context-specific challenge: The risks involved in an AI for drug discovery or treatment response to algorithms differ from those of the algorithms that advise what to watch next on Netflix. There are biases in the data and a certain level of human oversight and transparency is needed. Depending on the context, different levels of safeguarding are needed.” In addition, our guests discussed the extent to which the power and monopoly of a few large companies in the AI world can be broken. As a researcher, Felix Biessmann would like to see more open science and open datasets, relying less on companies and more on science. As for Harry Muncey, they think that “it seems like a natural progression. What we need to do is to incentivise academic and industry collaboration to create openness and foster information sharing”, stressing the need for more diverse voices in the AI dialogue.
So, what does this mean for science in the digital future? Moderator Katharina Heckendorf asks the three panellists to look to the future:
"To build responsible AI, we need..."
Tabea Rößner: “... public money, public openness, public code; transparency, good quality of data, authorities to control it and we need to have the oversight of the connected risks.”
Harry Muncey: “... to work together collaboratively and transparently. [...] We need to incorporate as many diverse voices as possible so that we can ensure that the systems work for everybody and not just a few.”
Felix Biessmann: “... transdisciplinary efforts for automated data quality.”