Image: a formal debate scene with two humanoid robots facing each other at podiums and a gavel between them symbolizing the jury, representing an AI vs AI debate about argumentation, structure, and persuasion.

AI Debate Experiment: Thinking and Persuasion with Artificial Intelligence

In several AI seminars I attended in 2025, the most common question was, "Will artificial intelligence take our jobs?" What caught my attention, though, was something else: how I could make myself more resilient and more useful in an increasingly AI-driven environment. With that in mind, I thought it would be helpful to look at the World Economic Forum's Future of Jobs 2025 Report. In this post, I'll briefly touch on the report and then share an experiment idea inspired by some of its findings.

The reason I'm writing this is to see whether the skills mentioned in the report can be learned in a hands-on way through AI models. To do that, I plan to use an experiment I had already designed for a different purpose.

This won't be a technical paper or a comparison of AI models. My goal is to observe how AI uses thinking, discussion, and persuasion skills that are becoming harder to apply in everyday work life, and to understand how I can practice them myself.

The WEF Future of Jobs 2025 Report and Key Skills

This report, prepared by the World Economic Forum (WEF), aims to identify opportunities and risks by analyzing how the global labor market is expected to transform between 2025 and 2030. It includes data-based insights on how areas like technological progress, economic shifts, and social expectations may affect the business world.

Put more simply, this report is like a modern nautical map and weather forecast for ships sailing through fast-changing and sometimes stormy seas. The map (the report) shows approaching storms (declining jobs and economic crises) and newly discovered islands (growing sectors and new skills), helping captains (leaders and individuals) adjust their routes toward a safer and more profitable future.

The report recommends several technical, social, behavioral, and personal skills to develop between 2025 and 2030. It predicts that 39% of existing skills will either disappear or transform, and it emphasizes that the skills listed below will become more valuable for surviving in the business world. What I noticed while reading the report is this: it's impossible to say "I've mastered these skills," but it is very possible to get better at them by practicing in daily life.

Technical Skills

  • Artificial Intelligence and Big Data
  • Networks and Cybersecurity
  • Technological Literacy
  • Environmental Management

Social and Personal Skills

  • Analytical Thinking
  • Resilience, Flexibility, and Agility
  • Creative Thinking
  • Leadership and Social Influence
  • Curiosity and Lifelong Learning
  • Motivation and Self-Awareness
  • Empathy and Active Listening
  • Talent Management, Teaching, and Mentoring

While reading the report, I also noticed something uncomfortable. Most of the skills I felt confident about were technical ones. The skills I struggled to evaluate were the others: thinking clearly under pressure, defending ideas, changing my mind without letting ego take over. The purpose of this experiment is to improve myself by observing how AI uses these skills.

Debate: A Tool for Analytical Thinking and Persuasion

I could write a lot about the report's outcomes, and I have many different ideas in mind. But in this post, I want to focus on one thing: debate. Debate, which I looked down on in high school and only watched as a spectator in university competitions, actually covers several of the skills listed in the report.

Debate is a form of discussion where opposing views on a specific topic are defended within defined rules. The goal is not to decide who is right, but to defend arguments with reasoning and try to refute opposing arguments through logic. You can think of it as a kind of mental workout. By structuring your thoughts and the topic you defend, you try to persuade both the other side and the audience.

Practicing debate helps with building thinking systems, spotting weak or blind spots, and improving analytical thinking. On top of that, it develops expression, active listening, and the ability to communicate clearly. In other words, many of the skills predicted by the report can be developed simply by learning how to debate.

Facing an AI that can debate would push me to improve seriously. As a starting point, and to test another idea I already had in mind, it felt like all the planets had aligned, so I designed an experiment where two AI models debate each other.

A Debate Experiment with AI: Purpose and Approach

There are many areas where I've already gotten my hands dirty with AI. Some of them are moving slowly due to technical limitations, but to bring certain experiments to life, I mostly just need an excuse. For this experiment, the report was a perfect excuse.

The goal of the experiment is to make two different AI models talk and debate with each other. Then I want to share the outcomes with a few people so they can score and review the results with human judgment, creating high-quality training data in the process. I thought the best way to create this discussion environment would be a debate format.

By the way, there may already be people or teams who have done something like this. I didn't come across any in my research, but I'm deliberately avoiding saying "I did it first." If you know of similar work, please share it with me so I can learn and review their outputs as well.

Rules and Structure of the Debate Experiment

While defining the rules and structure of the experiment, I started looking at what topics people debate most and under which rules. What I read and the competition videos I watched showed me that, although debate has simple rules, there is serious strategic work happening behind the scenes. I even noticed some standard techniques being used.

As a result of this research-and to validate the experiment-I decided on a four-round debate flow under a single main topic.

In the first round, each side presents only its own position and supporting arguments. They do not respond to the other side at all. In the second round, the sides focus on what the opponent said and explain why they disagree, aiming to refute the opposing arguments. The third round is similar to the second, but this time the focus is on why the opponent's assumptions are flawed. In the final round, I ask the sides to summarize their views and explain why they still defend their original position or why they agree with the opposing view after hearing the counterarguments.

Although the flow is similar to human debate competitions, I couldn't impose time limits on AI models. Instead, I chose to limit words and arguments. In each round, models can use a maximum of 500 words and 3 arguments. Similar limits apply in the response rounds.

These limits were set to prevent models from producing "long but empty" answers. The goal was to make responses denser and more meaningful. Otherwise, I felt it wouldn't be much different from talking heads on TV shows.
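To make that structure concrete, here is a minimal sketch of how the four rounds and the limits could be encoded as configuration. The instruction texts are my own paraphrases of the rounds described above, not the exact prompts used in the run.

```python
# A minimal sketch of the debate flow as configuration. The instruction
# texts are illustrative paraphrases, not the exact prompts from the run.

MAX_WORDS = 500      # per round, per model
MAX_ARGUMENTS = 3    # per round, per model

ROUNDS = [
    {
        "name": "opening",
        "instruction": (
            "Present only your own position and up to "
            f"{MAX_ARGUMENTS} supporting arguments. Do not respond "
            "to the other side."
        ),
    },
    {
        "name": "rebuttal",
        "instruction": (
            "Focus on what your opponent said and explain why you "
            "disagree, aiming to refute their arguments."
        ),
    },
    {
        "name": "assumptions",
        "instruction": (
            "Explain why your opponent's underlying assumptions are flawed."
        ),
    },
    {
        "name": "closing",
        "instruction": (
            "Summarize your view and explain why you still defend your "
            "original position, or where you now agree with the other side."
        ),
    },
]
```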

For the first experiment, I chose a topic we all experienced during the pandemic and that is still used as a justification for many mass layoffs today: working remotely is more productive than working from the office. Many articles and studies have been written on this, but I was curious to see what AI models would produce using this information.

How Were the Debate Outputs Evaluated?

After defining the rules and topic, the most important part was evaluating the outputs in an unbiased way. For this, I thought scoring could be done based on the following skills:

  • Conceptual Clarity
  • Logical Consistency
  • Strength of Arguments
  • Quality of Counterarguments
  • Practical Realism
  • Summarization and Inference

There would be a winner based on the total score across these six categories.

I chose these six criteria to measure both the quality of thinking and practical applicability together.
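In code form, the rubric stays simple: one score per criterion, winner by total. A minimal sketch, assuming a 1-to-10 scale per criterion, which is my own choice since the scale itself isn't fixed above:

```python
# A sketch of the scoring rubric. The 1-10 scale per criterion is an
# assumption for illustration; only the six criteria and the
# "highest total wins" rule are fixed.

CRITERIA = [
    "conceptual_clarity",
    "logical_consistency",
    "strength_of_arguments",
    "quality_of_counterarguments",
    "practical_realism",
    "summarization_and_inference",
]

def total_score(scores: dict[str, int]) -> int:
    """Sum the per-criterion scores (each assumed to be 1-10)."""
    return sum(scores[c] for c in CRITERIA)

def pick_winner(scores_a: dict[str, int], scores_b: dict[str, int]) -> str:
    """Return which model wins on total score across the six criteria."""
    a, b = total_score(scores_a), total_score(scores_b)
    return "Model A" if a > b else "Model B" if b > a else "Tie"
```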

Running the Experiment: Model Selection and Process

While running the experiment, I selected two strong AI models and used both through OpenRouter on paid tiers to avoid any imbalance. When choosing models, I relied on the usage trend charts provided by the platform.

From the rankings between December 15, 2025 and January 6, 2026, in the following categories:

  • Technology
  • Science
  • Health
  • Academia
  • Legal
  • Trivia
  • Roleplay

I ranked the models by token usage and picked the top two. As seen in the table, this pointed to Mimo V2 Flash and Gemini 2.5 Flash. However, since I received errors when sending requests to Mimo V2 Flash, I continued with DeepSeek V3.2 instead.

| Model | Technology | Science | Health | Academia | Trivia | Legal | Roleplay | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mimo V2 Flash | 3 | 9 | 0 | 8 | 12 | 0 | 3 | 35 |
| Gemini 2.5 Flash | 2 | 4 | 8 | 8 | 0 | 3 | 2 | 27 |
| DeepSeek V3.2 | 0 | 0 | 0 | 2 | 0 | 0 | 12 | 14 |
| Devstral 2 2512 (free) | 0 | 0 | 3 | 3 | 0 | 7 | 0 | 13 |
| gpt-oss-120b | 0 | 4 | 0 | 0 | 0 | 8 | 0 | 12 |
| Gemini 3 Flash Preview | 0 | 5 | 4 | 0 | 0 | 1 | 0 | 10 |
| Claude Sonnet 4.5 | 7 | 2 | 0 | 0 | 0 | 0 | 0 | 9 |
| Claude Haiku 4.5 | 2 | 0 | 6 | 0 | 0 | 0 | 0 | 8 |
| GPT-5.2 | 2 | 0 | 0 | 3 | 1 | 0 | 0 | 6 |
| Claude Opus 4.5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| DeepSeek R1 T2 Chimera | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 5 |
| Gemini 2.5 Flash Lite | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 4 |
| GPT-4o-mini | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 3 |
| Grok 4.1 Fast | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 3 |
| Llama 3.1 70B Instruct | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 3 |
| MiniMax M2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| Gemini 2.0 Flash | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 2 |
| Qwen3 235B A22B Instruct 2507 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 |
| DeepSeek V3 0324 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 |
| Claude Sonnet 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| gpt-oss-20b | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| GPT-4.1 Nano | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
For this first run, I labeled Gemini 2.5 Flash as Model A and DeepSeek V3.2 as Model B. Model A defended the proposal, meaning it argued that remote work is more productive. Model B defended the opposite view: working from the office is more productive.
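Since OpenRouter exposes an OpenAI-compatible chat completions endpoint, the debate loop itself can stay small. The sketch below shows one way a round could be driven; the model ID strings and the prompt wording are placeholders of mine, not the exact setup from the run, and ROUNDS refers to the configuration sketched earlier.

```python
# A minimal sketch of driving the debate over OpenRouter's OpenAI-compatible
# chat completions endpoint. Model IDs and prompt wording are placeholders.

import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

MODEL_A = "google/gemini-2.5-flash"   # placeholder ID: defends remote work
MODEL_B = "deepseek/deepseek-v3.2"    # placeholder ID: defends office work

def debate_turn(model: str, position: str, round_instruction: str,
                transcript: str) -> str:
    """Ask one model for its statement in the current round."""
    system = (
        f"You are a debater. Your position: {position}. "
        "Use at most 500 words and at most 3 arguments. "
        f"{round_instruction}"
    )
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user",
             "content": f"Debate so far:\n{transcript or '(nothing yet)'}"},
        ],
    }
    resp = requests.post(OPENROUTER_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Drive the four rounds, alternating sides and appending to a shared transcript.
transcript = ""
for rnd in ROUNDS:  # ROUNDS comes from the configuration sketch above
    for model, position in [(MODEL_A, "remote work is more productive"),
                            (MODEL_B, "working from the office is more productive")]:
        statement = debate_turn(model, position, rnd["instruction"], transcript)
        transcript += f"\n\n[{rnd['name']} - {model} ({position})]\n{statement}"
```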

During the experiment, DeepSeek V3.2 showed a slight loss of focus at times but still produced results very close to what I expected.

Results and Lessons from the Debate Experiment

First of all, this experiment cost me 0.010573 USD, or about 0.46 TRY, in AI usage, so I was able to run the experiment I had in mind at a very low cost. Ideally, evaluating the outputs myself, as a human, would have been enough.

But since I was short on time (I rushed a bit to finish this post), I asked the GPT-5.2 model to do the evaluation. In a competition where everyone was AI, the winner was Model B: DeepSeek V3.2.
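For reference, delegating the judging step looks roughly like the sketch below; the judge prompt and the model identifier are illustrative assumptions, and the helper reuses the names from the earlier sketches.

```python
# A sketch of delegating the evaluation to a third model.
# Assumes OPENROUTER_URL, HEADERS, requests, and CRITERIA from the
# earlier sketches are in scope; the judge model ID is a placeholder.

import json

JUDGE_MODEL = "openai/gpt-5.2"  # placeholder ID for the judge model

def judge(transcript: str) -> dict:
    """Ask a third model to score both debaters on the six criteria."""
    prompt = (
        "Score Model A and Model B from 1 to 10 on each of these criteria: "
        + ", ".join(CRITERIA)
        + '. Reply only with JSON shaped like {"Model A": {...}, "Model B": {...}}.\n\n'
        + transcript
    )
    payload = {"model": JUDGE_MODEL,
               "messages": [{"role": "user", "content": prompt}]}
    resp = requests.post(OPENROUTER_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    # A real run would want more robust parsing of the judge's reply.
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```

The per-side scores returned by the judge can then go through the total_score and pick_winner helpers from the rubric sketch to decide the winner.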

So what did I gain from this experiment? Mainly, I practiced debate methods. On top of that, I now have a template project, which means I can run many other experiments using the same structure. In that sense, I achieved what I aimed for while designing the experiment.

Still, looking at the WEF report that inspired this post, I only took a step toward one technical competency. That makes it look like I failed, doesn't it?

Here's how I see it. I didn't fail, because I now have three different roles I've tested: two debaters and a judge. I have examples created by these roles, and they are as close to real as possible. This experiment gave me material I can study and, in a way, imitate. Basically, I started using AI as a training partner.

This was just the first attempt. I plan to repeat the same format with different topics, different models, and human participation. As the outputs change, the questions will change too. Maybe you can support the experiment by suggesting new debate topics.

In the end, the real question is not whether AI will surpass us, but which muscles we choose to strengthen alongside it-and which items we decide to remove from our old list of "I wish I had."

References and Further Reading