👾 Biggest Obstacles and Problems That AI Still Has To Overcome
Breakthroughs in AI tackle LLM issues like memory, hallucinations, and agent autonomy for lasting performance.

Large language models (LLMs) are remarkably versatile. They can summarize documents, generate code or even brainstorm new ideas. These capabilities are now being extended to fundamental and highly complex problems in mathematics and modern computing.
When Google DeepMind pushed ten million tokens through the new long-context window of Gemini 1.5 Pro in a live demo in February 2024, a long-held dogma seemed to fall: transformer models do not necessarily have to fail because of their own short-term memory. Nevertheless, serious technical hurdles remain: limited context windows, hallucinations, a lack of long-term memory, the short-lived autonomy of agents and the phenomenon of catastrophic forgetting during retraining. How closely these problems are linked - and which solutions are actually within reach - is the focus of the following analysis.
Context Window – From Bottleneck to Wide Angle
Transformer self-attention scales quadratically with sequence length; classic models broke down beyond roughly 8,000 tokens. Newer sparse and mixture-of-experts variants reduce the complexity to nearly linear. Gemini 1.5 Pro then demonstrated a near-perfect retrieval rate of over 99 percent for sequences of up to 10 million tokens.
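A rough back-of-the-envelope calculation illustrates the gap. The snippet below is a minimal sketch (not any particular model's implementation) comparing how many query-key pairs full self-attention scores versus a sliding-window variant with a fixed local window:

```python
def attention_cost(seq_len: int, window: int | None = None) -> int:
    """Query-key pairs scored: full attention is O(n^2), a sliding window is O(n * w)."""
    if window is None:
        return seq_len * seq_len           # every token attends to every other token
    return seq_len * min(window, seq_len)  # each token attends only to a local window

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9} tokens: full={attention_cost(n):.2e} pairs, "
          f"windowed={attention_cost(n, window=4_096):.2e} pairs")
```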
However, production systems struggle with RAM and latency costs. Research prototypes such as segment-position embeddings therefore prioritize important token blocks, while sparse token pruning discards tokens with weak attention scores on the fly. In the short term, hybrid pipelines dominate: a large off-site encoder distills additional knowledge into compact memory tokens that a smaller on-site decoder model reuses.
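The following toy pipeline illustrates that idea under strong simplifying assumptions: the "large encoder" is replaced by a deterministic stand-in embedding, and the "memory tokens" are just short vectors prepended to the prompt of the smaller decoder. All names and functions are illustrative, not a real API:

```python
import hashlib

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for the large off-site encoder: deterministic toy embedding of one chunk."""
    digest = hashlib.sha256(text.encode()).digest()
    return [round(b / 255.0, 2) for b in digest[:dim]]

def distill(document: str, chunk_size: int = 2_000) -> list[list[float]]:
    """'Off-site' step: compress each chunk of a long document into one memory vector."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return [toy_embed(chunk) for chunk in chunks]

def build_prompt(question: str, memory: list[list[float]]) -> str:
    """'On-site' step: the small decoder sees only compact memory tokens plus the question."""
    memory_block = "\n".join(f"<mem{i}: {vec}>" for i, vec in enumerate(memory))
    return f"{memory_block}\n\nQuestion: {question}"

print(build_prompt("What changed in Q3?", distill("some very long report text " * 400)))
```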
More recent work combines RAG with meta-cognitive self-testing: the model first estimates its own uncertainty and only retrieves evidence when the estimated risk is high. Initial attempts achieve over 85 percent correct uncertainty predictions. The remaining weak spot is dialog drift: the longer the conversation, the greater the risk of relevant RAG sources dropping out of the top-k pool. Recurrent re-scoring at every turn is intended to remedy this.
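A minimal sketch of such uncertainty-gated retrieval might look as follows; llm_answer and retrieve are hypothetical stand-ins for a model call and a vector-store lookup, and the threshold is arbitrary:

```python
RISK_THRESHOLD = 0.7  # arbitrary cut-off for illustration

def llm_answer(question: str, context: list[str] | None = None) -> tuple[str, float]:
    """Stand-in for a model call returning (answer, self-reported confidence)."""
    if context:
        return f"Answer to '{question}' grounded in {len(context)} passages.", 0.95
    return f"Unsupported draft answer to '{question}'.", 0.40

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Stand-in for a vector-store lookup."""
    return [f"passage {i} relevant to '{question}'" for i in range(top_k)]

def answer_with_gated_rag(question: str) -> str:
    draft, confidence = llm_answer(question)     # the model answers and rates itself
    if confidence >= RISK_THRESHOLD:
        return draft                             # low estimated risk: skip retrieval
    evidence = retrieve(question)                # high risk: fetch external evidence
    grounded, _ = llm_answer(question, context=evidence)
    return grounded

print(answer_with_gated_rag("How large was the demo context window?"))
```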
Long-Term Memory – From Prompt Hack to Real Memory
Simply attaching session histories to the prompt window does not scale well. M+, a memory-enhanced model based on MemoryLLM, therefore integrates a co-trained retriever and persistent vector databases; it significantly increases knowledge retention in long-context benchmarks.
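Conceptually, such a persistent memory boils down to storing text snippets with embeddings and retrieving the closest ones later. The toy class below sketches that idea with plain cosine similarity; it is not the M+ or MemoryLLM implementation:

```python
import math

class VectorMemory:
    """Toy persistent memory: store (text, embedding) pairs, retrieve by cosine similarity."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self.entries.append((text, embedding))

    def search(self, query: list[float], top_k: int = 3) -> list[str]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]

memory = VectorMemory()
memory.add("user prefers metric units", [0.9, 0.1])
memory.add("project deadline is in May", [0.1, 0.9])
print(memory.search([0.8, 0.2], top_k=1))  # -> ['user prefers metric units']
```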
RecurrentGPT dispenses with a database entirely: after each paragraph, the model generates a condensed memory line that is fed back to it as a "hidden state".
Advantage: arbitrarily long sequences; disadvantage: errors accumulate if the summaries are imprecise.
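The loop below sketches the RecurrentGPT idea under heavy simplification: generation and summarization are plain string stand-ins, and the only thing carried between iterations is a single condensed memory line:

```python
def summarize(text: str, limit: int = 120) -> str:
    """Stand-in for the model's own one-line summary of everything written so far."""
    return text[-limit:]

def write_long_text(outline: list[str]) -> list[str]:
    memory_line = ""                     # condensed 'hidden state' carried across steps
    paragraphs = []
    for beat in outline:
        prompt = f"Memory: {memory_line}\nNext beat: {beat}"
        paragraph = f"[paragraph generated from -> {prompt}]"  # stand-in for generation
        paragraphs.append(paragraph)
        memory_line = summarize(memory_line + " | " + beat)    # update the memory line
    return paragraphs

for p in write_long_text(["hero leaves home", "meets a mentor", "final confrontation"]):
    print(p)
```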
A third line, Chain-of-Agents, lets several specialized models cooperate via an external "blackboard". Preliminary results on long QA benchmarks outperform both classic RAG systems and monolithic LLMs - but the approach requires synchronization and security mechanisms.
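A minimal blackboard sketch, with worker and manager "agents" reduced to placeholder functions, could look like this; a production system would add locking, validation and access control:

```python
blackboard: list[str] = []   # shared notes visible to all agents

def worker_agent(name: str, chunk: str) -> None:
    finding = f"{name}: key points from '{chunk[:25]}...'"   # stand-in for an LLM call
    blackboard.append(finding)

def manager_agent(question: str) -> str:
    notes = "\n".join(blackboard)
    return f"Answer to '{question}', synthesized from:\n{notes}"  # stand-in for an LLM call

chunks = ["chapter one text ...", "chapter two text ...", "chapter three text ..."]
for i, chunk in enumerate(chunks):
    worker_agent(f"agent-{i}", chunk)

print(manager_agent("What happens across the chapters?"))
```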
Autonomous Agents – How Long Does Self-Control Last?
The Voyager agent showed in 2023 that an LLM in Minecraft can learn new skills over hundreds of in-game days.
In real-world benchmarks, however, even the best agents complete only around a quarter of complex tasks fully autonomously. The main bottlenecks:
Planning resolution - without explicit intermediate goals, the agent loses the thread. The Plan-and-Act framework therefore separates planning (Planner) from execution (Executor) and significantly increases the success rate on web-based long-horizon tasks (see the sketch after this list).
Context erosion - long action chains overflow the window; earlier logs slip out, and automatic iterations end in endless loops.
Error accumulation - every wrong assumption compounds. Multi-level reflection loops halve the error rate but double the runtime.
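The sketch referenced above separates a planner that emits explicit intermediate goals from an executor that works through them one by one; both model calls are replaced by simple stand-ins:

```python
def planner(task: str) -> list[str]:
    """Stand-in for a planning model that decomposes a task into explicit sub-goals."""
    return [f"step {i}: sub-goal of '{task}'" for i in range(1, 4)]

def executor(step: str) -> str:
    """Stand-in for an execution model (browser actions, tool calls, ...)."""
    return f"done -> {step}"

def run(task: str) -> list[str]:
    results = []
    for step in planner(task):           # explicit intermediate goals keep the thread
        results.append(executor(step))   # each step is executed and logged separately
    return results

print(run("book a flight and file the expense report"))
```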
Overall, agents can still only work on tasks independently for a few hours at a time, although this limit is likely to keep rising quickly.
Catastrophic Forgetting – Loss of Knowledge During Fine-Tuning
During domain-specific retraining, LLMs can lose up to 40 percent of their factual fidelity. A 2024 study shows that Sharpness-Aware Minimization reduces this knowledge loss by around a third.
In parallel, parameter-efficient tuning methods (LoRA, QLoRA) are becoming established: only a few small matrices are trained while the rest of the weights remain frozen. Combined with external memory, new domain knowledge can be added without losing global capabilities.
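The core mechanic of LoRA can be shown in a few lines of NumPy: the pre-trained weight matrix stays frozen, and only two small low-rank factors are trained, so the adapted layer starts out identical to the base layer. This is a conceptual sketch, not the LoRA reference implementation:

```python
import numpy as np

d, rank = 512, 8
W = np.random.randn(d, d)             # frozen pre-trained weight (never updated)
A = np.random.randn(rank, d) * 0.01   # trainable low-rank factor
B = np.zeros((d, rank))               # trainable low-rank factor, zero-initialized

def forward(x: np.ndarray) -> np.ndarray:
    return x @ (W + B @ A).T          # adapted layer: base weight plus low-rank update

x = np.random.randn(1, d)
print(np.allclose(forward(x), x @ W.T))  # True: at initialization nothing has changed
```

With d = 512 and rank 8, the adapter trains roughly 2 · 512 · 8 ≈ 8,200 parameters instead of the 262,144 entries of the full matrix, which is why many domain adapters can be kept side by side without touching the base weights.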
The obstacles are intertwined: larger context windows enable long-running agents, but require sophisticated memory organization; hallucinations decrease thanks to external knowledge sources, which in turn require permanent memory management; any retraining threatens long-term knowledge if memory and agent logic are not synchronized. In the next 12 to 36 months, RAG filters, million-token contexts and planner-executor architectures are likely to reach production maturity. The step towards agents that learn autonomously for months, on the other hand, will take several years of research.
Conclusion
The most profound technical stumbling blocks of modern AI currently fall into five categories:
Context management - sparse attention, segment embeddings, token pruning.
Hallucination control - RAG, uncertainty self-disclosure, reflection.
Long-term memory - M+, RecurrentGPT, agent blackboards.
Agent runtime stability - explicit scheduling, context protection, multi-agent firewalls.
Knowledge preservation during fine-tuning - sharpness-aware training, parametric isolation.
The first three show clearly foreseeable solutions; the last two require coordinated progress in model architecture, training technology and system design. As soon as persistent, continuously learning memories and stable planners come together, the dream of the long-term agent will become tangible - a system that operates for weeks or months instead of minutes without human training wheels. Until then, every new improvement must be embedded in such a way that it does not disrupt the fragile balance of the other components.
—
Kim Isenberg: Kim studied sociology and law at a university in Germany and has been impressed by technology in general for many years. Since the breakthrough of OpenAI's ChatGPT, Kim has been trying to scientifically examine the influence of artificial intelligence on our society.