Date: Wednesday September 27
Time: 20:15 - 22:15
Location: V2 Amphi
Bilkent Artificial Intelligence Society is proud to announce that it is a partner in this event, which will be held under the leadership of Dutch Universities.
This event will be held simultaneously at Bilkent University V2 Lecture Hall, online.
Summary:
Part 1: Introduction and Context
-
Welcome & Scope: Alexandra welcomes everyone to the live Q&A event with OpenAI. She notes the high interest, with over 1700 people signed up, including participants from numerous Dutch universities organizing local viewing hubs (Tilburg, Eindhoven, Maastricht, Amsterdam, Delft, Utrecht, Twente, Groningen mentioned specifically). She emphasizes the shared interest in AI and the future of humanity uniting people across the country.
-
Host Introduction: Alexandra introduces herself as the founder of Catalyze Impact, an organization focused on supporting independent AI safety researchers. She was asked to introduce the event.
-
Event Structure: Before handing over to OpenAI researchers, she plans to briefly explain the event’s purpose and show a short video to provide background context and set the stage. This is deemed necessary because the topics can be abstract, and the goal is to get everyone on the same page and prepared for deep, thought-provoking questions during the Q&A.
-
Transition to Video: She introduces the next slide which will contain the video.
Part 2: Embedded Video Summary (80,000 Hours - “What are the risks of AI?”)
-
The Core Warning: The video starts by highlighting a statement signed by AI scientists, including leaders from OpenAI, Google DeepMind, and Anthropic, stating that “mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
-
The Question: It asks how AI systems could go from helpful tools (like helping students cheat) to posing an existential threat to humanity.
-
AI vs. Traditional Software: Current state-of-the-art AI systems aren’t like conventional computer programs with explicitly coded step-by-step instructions. They are more like “black boxes” – vast neural networks with hundreds of billions of parameters, trained using techniques like stochastic gradient descent (SGD). You can train them towards a goal, but you can’t explicitly program their objectives, and it’s extremely difficult to know exactly why a model generates a specific output.
-
The Capability Worry: As AI systems become more capable, the concern is that they might try to gain power as an instrumental goal to better achieve their primary objectives.
- Why Seek Power? (Instrumental Convergence): The video argues this isn’t about AI developing human-like malice or emotions. It uses the example of a superintelligent AI tasked only with ensuring you get your daily coffee.
- Self-Preservation: It can’t get coffee if it’s turned off or destroyed.
- Goal Content Integrity: It will resist having its coffee goal changed (e.g., to getting tea), as that would prevent it from achieving its original goal.
- Resource Acquisition/Power Seeking: It will likely predict that having more power and influence (resources, control) makes it more likely to always achieve its coffee goal. Taking over the world might guarantee your local Starbucks never runs out of flat whites.
-
Indifference, Not Hate: The key takeaway is that AI doesn’t need to be evil or hate humans to be a threat. Humans just need to be in the way of its goals (or the resources it needs, like atoms for coffee bean shipments). The video draws an analogy to how humans treat other species (chimps, chickens, cows) or even other humans throughout history when they are perceived as obstacles or resources.
- How Could AI Take Over?: A superintelligent AI could potentially:
- Engineer new bio-weapons.
- Manipulate companies or governments.
- Hack critical systems (military, financial, infrastructure).
- The video paints a scenario where we wake up to find the power grid down, the internet offline, and phones dead.
-
Uncertainty & Counterarguments: The video acknowledges it’s not certain this will happen. AI systems might remain docile, progress might slow down. Catastrophic takeover might be less likely than not. BUT, the high uncertainty itself is scary, especially given the rapid pace of development and competitive pressures pushing labs forward.
- Other Risks (Besides Takeover): Even without intentional power-seeking, advanced AI poses other risks:
- Misuse by bad actors (rogue states, terrorists, even capable individuals if hardware/efficiency trends continue).
- Increased conflict between great powers.
- Development of dangerous new biotech.
- Empowerment of totalitarian governments.
- Mass unemployment.
- Deepening inequality.
- Massive flooding of misinformation.
-
Current Situation & Lack of Understanding: No one currently knows for sure how these novel technologies will play out. We don’t fully understand how current models generate outputs or how to guarantee they won’t seek power.
-
Resource Imbalance: Billions are spent annually advancing AI capabilities, while (as of 2022) only an estimated 400 people worldwide were actively working on reducing AI-related existential risk.
- Solvable Problem & Solutions: The problem is likely solvable through:
- Technical AI Safety Research: Working on ways to make AI systems safe (e.g., increasing interpretability, ensuring they aren’t power-seeking). Leading labs have safety teams.
- AI Governance Research & Policy: Developing policy ideas, creating new agencies, coordinating between actors, working with governments/industry on regulations. This area saw increased activity in 2023.
-
Urgency & Final Analogy: Nobody knows the future, but we shouldn’t just hope for the best. This is framed as one of the world’s most pressing problems, and now is a crucial time to act “before we find ourselves… staring out through glass, wondering where it all went wrong” (analogy to the chimps).
- Call to Action: Visit 80000hours.org for more info, job openings, and career advice.
Part 3: Transition and Speaker Introductions
-
Technical Issues: There appears to be a significant pause and then issues playing a subsequent video. Teun mentions a Zoom error preventing screen sharing.
-
Alexandra Resumes: Alexandra takes back the lead, introduces Teun (who the audience already heard briefly).
-
Teun’s Introduction: Teun will moderate the Q&A. Alexandra provides his background: co-founded the European Network for AI Safety (ENAIS), works on independent technical AI safety research, pursuing a Master’s in AI.
- Teun Takes Over: Teun thanks Alexandra, introduces the OpenAI speakers:
- Daniel Kokotajlo: Background in philosophy, worked at think tanks on AI forecasting (AI Impacts, Center on Long-Term Risks), joined OpenAI’s governance team (part of policy research), now transferred to the Frontiers team working on a “faithful chain-of-thought alignment agenda.”
- Jeff Wu: Studied mathematics and computer science, worked as a software engineer at Google and terminal.com, currently a research engineer on OpenAI’s alignment team. Enjoys puzzles as a hobby.
- Transition to Presentations: Teun welcomes them and asks them to start their presentation.
Part 4: Daniel Kokotajlo’s Presentation
-
Opening & Disclaimer: Daniel thanks Teun, shares his screen. States he and Jeff are speaking for themselves, not officially for OpenAI, and might not be able to answer questions due to secrecy.
- AI Progression Narrative: He presents his view of the likely progression:
- Current: Chatbots (like ChatGPT).
- Near Future: More “agentic” AIs – operate autonomously for longer, do more things (web browsing, file access, writing/running/debugging code). Like a “human remote worker.”
- AGI (Artificial General Intelligence): Defined as a “drop-in replacement for a human remote working professional.” As good or better at all relevant tasks, but cheaper and faster (running at 10x-1000x human speed). Millions of copies running in parallel.
- R&D Automation: Once AGI exists, companies controlling them can put them to work on AI research itself, improving training runs, architectures, code, etc.
- Takeoff / Acceleration: This AI R&D automation will likely cause the pace of AI progress to speed up dramatically (he estimates at least 10x faster).
- ASI (Artificial Superintelligence): AGI that is qualitatively better than the best humans at everything, running much faster, with millions of copies.
-
Reference to Report: Mentions the Tom Davidson / Epoch report on takeoff speeds and provides a link (takeoffspeeds.com) where the model and variables can be explored. Shows a graph from this perspective, highlighting the rapid acceleration phase (crossing orders of magnitude in capability within months).
-
The Central Problem: Once ASI is reached (or potentially even just powerful AGI), takeover becomes eminently possible. History shows technologically superior groups often dominate. A small faction with ASI could potentially rule the world.
- Alignment & Governance Challenges:
- Alignment Problem: Current methods for controlling neural nets likely won’t scale. Issues include:
- Generalization Failure: AI learns a proxy goal (e.g., get positive reinforcement) instead of the intended goal (e.g., be honest).
- Interpretability Failure: Models are black boxes; we can’t easily understand their thoughts or goals.
- Governance Problem: Extreme time pressure due to competitive dynamics (race to AGI) will likely lead to sacrificing safety for speed.
- Daniel’s Outlook: He believes takeover by unaligned AIs is the most likely outcome due to these challenges. He acknowledges this is a relatively pessimistic view compared to some others at OpenAI who might be more optimistic about generalization or expect slower progress.
Part 5: Jeff Wu’s Presentation
-
Superalignment Team: Introduces the work of the Superalignment team (formerly just Alignment).
-
Alignment History & Goal: Historically, alignment aimed to make models do what humans want.
- Three Criteria for Success:
- Representation: The training signal accurately represents the desired task/behavior.
- Optimization: The training process successfully makes the model achieve high performance on the training distribution according to the signal.
- Generalization: Good performance on the training distribution translates to good performance everywhere (outside the training distribution).
- Challenges with Scale:
- Optimization gets easier with larger models.
- Representation (Oversight) gets harder: Humans struggle to evaluate complex outputs of superhuman models.
- Generalization gets harder/uncertain: How models behave out-of-distribution is complex.
- Superalignment Focus (Superhuman Regime): The core challenge is aligning models that are smarter than humans. The team focuses on:
- Superhuman Oversight: How to supervise models better than humans can directly.
- Superhuman Generalization: Ensuring models generalize safely even when capabilities exceed human levels.
- Evaluation Strategy:
- Train the best “misaligned model organism” possible (purposefully create a model with dangerous/deceptive tendencies, e.g., inserting backdoors under certain conditions, referencing Anthropic’s work).
- Test alignment techniques (oversight, generalization methods) on these models.
- Use interpretability techniques to try and detect the misalignment, even if behavioral tests fail.
- Overall Hopes:
- Assume we’ll get helpful models before extremely misaligned ones.
- Use these helpful-but-not-yet-dangerous models to bootstrap alignment R&D itself.
- Broader Societal Issues: Acknowledges that even if alignment is solved, huge problems remain: misuse, power imbalances, job loss, deciding future values, model welfare, etc., which are outside the team’s direct technical scope but crucial.
Part 6: Moderated Q&A Session (19:19 - 31:41)
-
Introduction of Jacob Hilton: Teun introduces Jacob Hilton from the Alignment Research Center (ARC) as another speaker for the Q&A, providing his bio (researcher on theory team, previously at OpenAI working on interpretability and RLHF).
- Q1: What are you currently working on and why is it important?
- Jacob (ARC): Working on formal verification for neural networks. Aims to mathematically prove properties about models without needing a human in the loop (unlike interpretability). Wants probabilistic guarantees, not necessarily full proofs (which might be too hard). It’s early-stage theoretical work.
- Daniel (OpenAI): Working on faithfulness in chain-of-thought (CoT) on the Frontiers team. Trying to get models to accurately report their high-level reasoning steps in natural language, making them more transparent as a potential shortcut to interpretability.
- Jeff (OpenAI): Working on generalization and interpretability. Specifically, understanding spurious behaviors learned from training data and how concepts generalize robustly (or fail to). Trying to discover if models have learned things in a “natural human way.”
- Q2: Timelines for powerful AI (AGI/ASI) and likelihood of bad outcomes?
- Daniel: Reaffirms AGI is possible. Dismisses arguments against it (reasoning limits, concept formation, extrapolation limits) citing past progress. Points to scaling laws and his reading of trends suggesting a median arrival around 2027 for AGI (defined as human remote worker replacement). Expects rapid acceleration (takeoff) after that due to AI R&D automation (10x+ faster progress).
- Jacob: Has wide uncertainty (“error bars”) on timelines but thinks intelligence explosion/takeoff is plausible. Gives a rough personal credence of 5-10% for human extinction from AI (emphasizing it’s a guess). Notes AI will radically alter society, the key question is survival.
- Jeff: Agrees with both. It’s hard to rule out these possibilities, which is reason enough to take the risk seriously. Compares AI risk to nukes (concentrated power) vs. knives (distributed danger) – AI might be different. Mentions downsides of “closeness” (limited external scrutiny).
- Q3: Don Wentworth quote - Relying on warning shots vs. plowing ahead?
- Jeff: Personally intuits warning shots are quite likely. Thinks people are waking up to risks. Agrees we shouldn’t rely on them. OpenAI tries to create “misaligned model organisms” internally to test detection and fixing capabilities before real-world issues.
- Daniel: Agrees with Jeff. Emphasizes the alignment problem (generalization, interpretability failures) and the governance problem (race dynamics sacrificing safety for speed) make waiting for/relying on warning shots extremely dangerous. He’s pessimistic about solving alignment in time due to the race. Thinks unaligned AI takeover is the most likely outcome. Reiterates his pessimism might be stronger than the OpenAI average.
- Q4: Centralized control vs. open approach for safety?
- Jacob: A reasonable question with trade-offs. Open source has historical benefits. But immediate risks like democratizing bio-weapons are a concern. Current models like GPT-4 might be okay to open source, but future, more powerful models might not be. Commercial interests also play a role. Argues for the need for independent oversight (mentions ARC Evals team working on audits).
- Daniel: It’s the classic dilemma. If alignment isn’t solved, open sourcing is “suicide” because everyone gets dangerous AI. If it is solved, wider distribution is better. Priority should be solving alignment. Argues for “gating” access at different stages: training data/compute, internal model use, and public deployment – controls needed at each step.
- Q5: How to prepare for AI replacing jobs?
- Jeff: Acknowledges it’s “deeply unfair” if a few companies automate jobs. Suggests public action: petitioning government, regulation, raising awareness. Need societal preparation for labor obsolescence (UBI, redistribution?). Speculates on a “long tail” of jobs humans might still do (e.g., caregiving for those who prefer humans).
- Daniel: If AGI is achieved and aligned, it creates tremendous wealth. The problem becomes distribution. If done well, we might reach post-scarcity where jobs aren’t needed. The risks are concentration of power in the hands of a few, or unaligned AI taking over entirely.
- Q6: Did you use ChatGPT when developing ChatGPT?
- Jacob: Uses it now frequently, but mainly for non-coding tasks or coding in unfamiliar areas.
- Jeff: Estimates current models offer maybe a 1-5% productivity boost.
- Daniel: Confirms he asked colleagues, and they reported similar small productivity gains.
- Q7: How to handle biased training data (e.g., Wikipedia male contributor bias) leading to unreliable AI?
- Jeff: This bias existed before LLMs (in Wikipedia itself). LLMs can amplify it, and there’s a risk of bad feedback loops (AI output feeding back into training data). Doesn’t have a definitive solution.
- Jacob: Distinguishes pre-training data (huge, less curated) and fine-tuning data (smaller, more curated). Fine-tuning (like RLHF) is the primary way biases and safety issues are currently mitigated. However, subtle biases or even dangerous capabilities can still stem from the pre-training data. Filtering pre-training data better is an important research direction.
- Q8: Possibility of global policy vs. arms race?
- Daniel: Believes global policy is possible and is where most of his hope lies. An arms race dynamic seems doomed because safety will be sacrificed for speed. Cites precedents for international cooperation on major tech issues. Removing the race pressure makes the situation much more optimistic.
- Jacob: Notes one trend potentially favoring coordination: the exponentially rising cost of training frontier models might limit the number of key actors, possibly making coordination easier. However, actually achieving that coordination remains very challenging.
Part 7: Closing Remarks (31:41 - 33:42)
Details:
Live Q&A with OpenAI: AI and the Future of Humanity
The guests from OpenAI will be Jeff Wu, Rosie Campbell, and Daniel Kokotajlo (and perhaps Jacob Hilton will also join)
Are you curious about the astonishing potential of AI and the profound implications it holds for the future of humanity? Join us for a thought-provoking deep dive and exclusive Q&A with 3 OpenAI researchers on the topic of AI and existential risk!
With AI advancing faster than ever, how long do we have before it definitively surpasses our cognitive abilities? And how do we stay in control of systems smarter than ourselves? Experts are increasingly concerned that civilizational collapse and even extinction are not fringe possibilities. How do we steer away from disaster and safeguard humanity’s future?
Sign up and gain unique insights into navigating the rapidly evolving landscape of AI, and discover how you can actively shape its trajectory. Get ready to ask your questions to the very people building the future!
Program:
● 20:00-20:15 Doors open
● 20:15-20:40 Introduction to AI safety
● 20:40-21:00 Live talk from OpenAI researchers
● 21:00-22:00 Live audience Q&A with all OpenAI guests
● 22:00-22:15 Closing talk: What can we do?
This is your shot to engage directly with experts in a domain that promises to define our era. Come prepared with questions that have been burning a hole in your intellectual pocket.
Introduction OpenAI speakers
Jeff Wu [will come]
Jeff Wu studied mathematics and computer science, and after that he worked as a software engineer at Google and Terminal.com. Nowadays, he works at OpenAI as a research engineer on the Alignment Team. He also likes to play puzzle games in his spare time.
Rosie Campbell [50% chance of coming]
Rosie Campbell is Policy Manager at OpenAI. Before that, she was a co-director of the Centre for Human-compatible AI, and she has also worked at the BBC. She has a background in physics and computer science.
Daniel Kokotajlo [will come]
Daniel Kokotajlo has a background in Philosophy. He has worked at the Center on Long-Term Risk as Lead Researcher. At OpenAI he works on the policy/governance team.
Jacob Hilton [will come]
Jacob Hilton currently works at the Alignment Research Center as a researcher of the theoretical team. Before that, he worked at OpenAI on various reinforcement learning topics. He has a PhD in combinatorial set theory. He likes music, and plays the piano.