Rambling on LLM and Connectomics Breakthroughs: A Comparison
Historical comparison with LLMs and future connectomics breakthroughs
In response to Konrad Kording's post
(Long rambling post.) Not sure how relevant the historical context from the recent LLM breakthroughs is here, but I'm guessing there's a simpler approach Konrad's hinting at(?):
LLMs: Historical Context and Breakthroughs
- (Before LLMs) The common wisdom was that RNNs/LSTMs were essential for handling sequence order and long-term memory.
- When the breakthrough "Attention Is All You Need" paper (Vaswani et al., 2017) landed, many researchers were surprised by the simplicity of the attention mechanism and the transformer approach (see the sketch after this list).
- The closing line of Noam Shazeer's "GLU Variants Improve Transformer" (2020) reads, "We offer no explanation as to why these architectures seem to work; we attribute their success, as with all else, to divine benevolence." (a particularly memorable, much-ridiculed quote in the LLM community).
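For what it's worth on the "simplicity" point: the core of scaled dot-product attention really does fit in a few lines. A minimal NumPy sketch (single head, no masking or learned projections, purely illustrative, not the full transformer):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of value vectors

# Toy example: 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)     # (4, 8)
```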
Towards Biologically Plausible and Efficient Models
- Still, from the little I know of the pre-existing research around WBE (whole-brain emulation), it seems very likely that tunnelling deeper into the fruit fly vision system (or any vision system) would yield increasingly simplified models that still retain biological plausibility.
- Then measure the systems surrounding the vision system to study how the "animal" reacts to visual stimuli.
- And to be careful of the "if all I have is a hammer... everything looks like a nail" trap.
- Or to put it another way, I'm waiting for the divine eureka moment where someone cracks a simpler, more energy-efficient, faster-to-train integrate-and-fire model (see the LIF sketch after this list).
- If more work is done on such a system, I'm fairly sure it can scale up to some artificial samples of the "1, 2, 3, 4, 5, 6, 7..." computer vision example.
- How could we generate multiple virtual MN9 cells that light up for "1, 2, 3, 4" to train a more efficient, simpler connectome model (see the synthetic-trace sketch after this list)?
- Are there certain items in the field of vision that trigger certain motions?
- How does the vision system trigger the other systems, and how can biological experiments be accelerated via virtual modeling experiments?
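On the integrate-and-fire point: as a baseline for what "simpler and faster to train" would have to beat, here's a minimal sketch of the textbook leaky integrate-and-fire neuron with forward-Euler integration. The parameter values are generic illustrative defaults, not fitted to any fly neuron:

```python
import numpy as np

def simulate_lif(input_current, dt=1e-4, tau=0.02, v_rest=-0.065,
                 v_reset=-0.065, v_thresh=-0.050, r_m=1e7):
    """Textbook leaky integrate-and-fire neuron, forward-Euler integration.
    tau * dV/dt = -(V - v_rest) + R_m * I(t); spike and reset at threshold."""
    v = v_rest
    spike_times = []
    for t, i_t in enumerate(input_current):
        v += (-(v - v_rest) + r_m * i_t) * (dt / tau)
        if v >= v_thresh:              # threshold crossing -> emit spike, reset
            spike_times.append(t * dt)
            v = v_reset
    return spike_times

# Toy example: constant 2 nA drive for 100 ms
current = np.full(1000, 2e-9)
print(len(simulate_lif(current)), "spikes in 100 ms")
```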
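And a rough sketch of what I mean by "virtual MN9 cells": fabricate labeled activation traces for a few hypothetical MN9-like units, each lighting up for one stimulus in "1, 2, 3, 4", as stand-in training targets for a simplified connectome model. Everything here (cell behaviour, bump shape, noise level) is invented for illustration, not taken from any real Janelia data:

```python
import numpy as np

def synthetic_mn9_traces(stimulus_ids, n_cells=4, t_steps=50, noise=0.05, seed=0):
    """Fabricated activation traces: virtual cell k 'lights up' when stimulus k+1 is shown.
    Returns an array of shape (n_trials, n_cells, t_steps)."""
    rng = np.random.default_rng(seed)
    traces = rng.normal(0.0, noise, size=(len(stimulus_ids), n_cells, t_steps))
    for trial, stim in enumerate(stimulus_ids):
        if 1 <= stim <= n_cells:
            # crude "calcium-like" response bump for the one responsive cell
            bump = np.exp(-0.5 * ((np.arange(t_steps) - 15) / 5.0) ** 2)
            traces[trial, stim - 1] += bump
    return traces

# Stimuli "1, 2, 3, 4" -> (4 trials, 4 cells, 50 time steps) of labeled data
data = synthetic_mn9_traces([1, 2, 3, 4])
print(data.shape)
```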
Critiques and Future Directions
- And to play devil's advocate, is Janelia oversimplifying the vision system + motor system?
- Will putting Humpty Dumpty back together again yield the same fruit fly?
- To lean more positive, what can we take from the learnings of current SoTA multi-modal models and apply to the vision/motor systems of the fruit fly?