Can natural intelligence judge Artificial Intelligence?
AGI, brain-machine interfaces, 'uploads' etc. can all wait. We have a bigger problem: we don’t yet know if wetware brains should assess predictive algorithms at all!
Welcome to Presbyopia, a new section of this newsletter that offers interdisciplinary research questions. This inaugural issue concerns the performance of the new crop of Generative AI algorithms.
When I last wrote about Artificial Intelligence in June for the main newsletter, I focused mainly on the journalism part. Reporters need to tell their readers a bit more about the tests they run when they preview the software in what sounds like a controlled environment. Did the friendly engineer giving you the demo helpfully suggest that you try asking the LLM (Large Language Model) to compare two random things (as against three)? The answer makes a big difference to what the reader can surmise about the state of the technology and its implications for the world.
As a counterpoint, the rest of the post took a turn around light philosophy. Sure, the machine was ‘only’ spitting out phrases fished from a large corpus and stitched on the fly to meet the perceived context. How artificial! Except, long before computers were a household necessity, linguists postulated that much human language in fact works that way. (Not the best newspaper columns though. Nor the best newsletters and essays. 😅)
I discussed some of the neuroscience that should temper excitement about the capabilities of the algorithms. But I only mentioned in passing a concern that neither journalists nor philosophers could have addressed. We ask the machines to make something with fragments of text that we rather transparently call ‘prompts’. Then we sit and judge what we get back. But are ‘we’ a sound judge?
We know the human brain is a prediction engine. I’ve probably shared that link before because I harp on some of the same newish revelations in neuroscience constantly. I do that because, despite its universal relevance, a lot of this work still hasn’t been absorbed broadly. Regardless, right here we’re diving into a very specific application.
One of the simplest practical illustrations of this core mechanism in our brains comes from Anil Seth's TED talk, in which he plays a garbled recording that sounds like gibberish. Next, he reveals the sentence being spoken and then plays the same recording again. This time, you recognize every word. As with many optical illusions, it's hard to believe it's the exact same recording.
That indubitably demonstrates to those of us who are not neuroscientists that the brain is much better at perceiving what it expects to perceive. In fact, the word 'perception' should suddenly seem hollow. In everyday life, the brain uses the memory of the past to form its expectations. But in the case of that recording, its job is suddenly simplified because it has just been told exactly what to expect.
Here’s Anil Seth at TED in 2017. (The video skips to the 6:05 mark where that audio example begins, right after he’s discussed a more famous visual one.)
You feel me?
Now consider some MIT scientists demonstrating research on an AI algorithm capable of a “new kind of imaging”. It extracts ambient sounds from humanly imperceptible vibrations of houseplants or junk-food packaging litter as detected in an indoor video. They show off how well you, the reader, can make out the reconstructed sound. Of course, in both examples they shared publicly in a video, they first play the original sound for you!
The researchers imply applications in ‘forensics’. Picture the scene: Omar and Stringer hold a meeting to discuss a truce. McNulty has visual access from a strategically placed spy-cam but through soundproof glass. No matter. Omar carelessly tossed his empty bag of Lay’s in view just before Stringer arrived. Pryzbylewski can quickly run the MIT algorithm on the file and extract the sound. McNulty may be fluent in the street dialect by now, but Baltimore PD is only going to spend on that algorithm if he (and the jury!) can make out what was said without just having heard the original voices first!
That was 2014. The research that Seth drew upon was already mainstream by then. Certainly available to MIT staff at their fingertips, or if they spoke to people at other departments. What is worse, in 2022 people are still sharing that video on twitter to incredulous gasps from a legion of followers. And for its part, the New York Times too is propagating [*] that research! (Quoting words from Samuel Hammond. Told you this newsletter’s got its eyes on NYT.)
It seems to me that the only way to evaluate the performance of that algorithm is to play both the recordings separately to a speech-to-text algorithm and compare the output! Let the machines rate their cousins.
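One way to sketch that machine-vs-machine comparison: run both recordings through the same speech-to-text system and score each resulting transcript against the known sentence using word error rate. In the sketch below, `transcribe` is a hypothetical placeholder for whatever ASR API is at hand; only the scoring function is spelled out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical usage -- `transcribe` stands in for a real speech-to-text call:
# wer_original  = word_error_rate(spoken_text, transcribe("original.wav"))
# wer_recovered = word_error_rate(spoken_text, transcribe("recovered.wav"))
# A small gap between the two WERs would be evidence the reconstruction
# carries intelligible speech; a large gap would suggest it does not.
```

The point of the design is that nothing in the scoring pipeline has "heard" the original audio in advance, so no prediction engine gets to cheat.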
The prediction engine architecture of the brain is only half the problem. To compound it, we have the effects of language itself.
Spiders have silk, pangolins their tongues. The defining distinction of humans is a brain that evolved to cooperate with conspecifics (other humans and a great many of them at once). Through language. To the human brain, words are sovereign.
As a corollary, words prime cognition in all sorts of ways. At that link is a summary of some mechanisms reported in Nick Enfield’s book Language vs. Reality. Using words to describe a memory distorts it, and the mere presence of visible verbal cues influences how we understand and then perform a task.
All of which should make it obvious that we humans are not really equipped to judge whether a machine in its turn ‘understands’ a text prompt. That only sounds like a paradox, but given our evolutionary inheritance, it is not. Not if we were the ones who gave it the prompt. And nor if we were only made aware of the prompt someone else gave.
At the moment what is more likely going on is something like this: A reporter or tech influencer is invited to test ride the algorithm. They ask for a photo of a 'mother skiing with her two children'. The act of having chosen those words primes their brain. In a much shorter interval than Midjourney takes to respond these days, the brain has eagerly made all preparations to report that it is looking at a female human and two younger humans. Those preparations mean it is likely to make that report even for a much more ambiguous image than it would have otherwise. This brain is now less likely to bother with details. It takes extra-close scrutiny for a primed brain to notice that unlike real women, this woman has three legs. Her judgement thus compromised, the reporter selects some of the best output to share with her readers. Just doing her job. But the brains of readers are primed too, because she tells them the text prompt right next to the image!
Which is why we need controlled lab experiments to assess our assessment of generative AI models. A simple experiment design might be: Recruit two subject groups. The researcher generates output from a prompt. Group 1 is shown the prompt and then asked to rate the image. Group 2 is shown only the image and asked to A) describe what it shows, B) guess what the prompt was, and C) rate the image once told the actual prompt.
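To make the comparison between the two groups concrete, here is a minimal sketch of how their ratings could be tested for a significant difference, using a permutation test on the difference in mean ratings. The ratings below are hypothetical placeholders, not real data.

```python
import random

def permutation_test(group1, group2, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean ratings.
    Returns (observed difference, p-value)."""
    rng = random.Random(seed)
    observed = sum(group1) / len(group1) - sum(group2) / len(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the pooled ratings and re-split into two groups,
        # simulating the null hypothesis that group labels don't matter.
        rng.shuffle(pooled)
        diff = sum(pooled[:n1]) / n1 - sum(pooled[n1:]) / (len(pooled) - n1)
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical ratings on a 1-5 scale (placeholder numbers only):
primed = [5, 4, 5, 4, 5, 4, 4, 5]  # group 1: saw the prompt first
blind  = [3, 4, 2, 3, 3, 4, 3, 2]  # group 2: rated the image alone
diff, p = permutation_test(primed, blind)
# A small p-value here would suggest the prompt itself inflates ratings.
```

A real study would of course need proper power analysis, multiple models, and multiple raters per image; this only illustrates the shape of the comparison.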
The absence of a significant difference in the ratings vindicates all the reporting thus far on how good the generative models have been getting. The prognoses that readers and investors have been making as to the magnitude of imminent economy-wide changes are also valid. If the study results indicate otherwise in replications with several models in various cultures, we have two problems, one much bigger than the other.
First, we were wrong. Second, since we would have just shown that humans are not the right candidates to judge this form of AI, we don't know who to turn to!
I hate to again drizzle a tincture of philosophy right at the end, but the only solution would be to define AI radically differently.
As a subscriber to Elite Scotoma, you’re currently signed up to receive issues of this new section Presbyopia. If you’re not interested in research questions and related ideas, you can selectively unsubscribe in settings at Substack.com or the app.
You see what I did there? That link? 😉 Everything really is connected!
[*] To go needlessly meta again, Hammond linked to that MIT paper as a way of predicting what technology would be common "within a decade".