Judge Experiments With ChatGPT, And It’s Not As Crazy As It Sounds
Would you freak out if you found out a judge was asking ChatGPT a question to help decide a case? Would you think it was absurd, and a problem? Well, one appeals court judge felt the same way… until he started exploring the issue in one of the most thoughtful explorations of LLMs I’ve seen (and one of the most amusing concurrences, to boot).
I recognize that the use of generative AI tools raises plenty of controversy in lots of places, though I think the biggest complaints stem from ridiculously bad, poorly thought-out uses of the technology (usually involving over-reliance on the tech in situations where it is not at all reliable).
Back in April, I wrote about how I use LLMs at Techdirt, not to replace anyone or to do any writing, but as a brainstorming tool or a sounding board for ideas. I continue to find them useful in that manner, mainly as an additional tool (beyond my existing editors) to push me to really think through the arguments I’m making and how I’m making them.
So I found it somewhat interesting to see Judge Kevin Newsom, of the 11th Circuit, recently issue a concurrence in a case solely to explain how he used generative AI tools in thinking about the case, and how courts might want to think (carefully!) about using the tech in the future.
The case itself isn’t all that interesting. It’s a dispute over whether an insurance provider is required under its agreement to cover a trampoline injury case after the landscaper who installed the trampoline was sued. The lower court and the appeals court both say that the insurance agreement doesn’t cover this particular scenario, and therefore, the insurance company has no duty to defend the landscaper.
But Newsom’s concurrence is about his use of generative AI, which he openly admits may be controversial, and he begs people to consider his entire argument:
I concur in the Court’s judgment and join its opinion in full. I write separately (and I’ll confess this is a little unusual) simply to pull back the curtain on the process by which I thought through one of the issues in this case—and using my own experience here as backdrop, to make a modest proposal regarding courts’ interpretations of the words and phrases used in legal instruments.
Here’s the proposal, which I suspect many will reflexively condemn as heresy, but which I promise to unpack if given the chance: Those, like me, who believe that “ordinary meaning” is the foundational rule for the evaluation of legal texts should consider—consider—whether and how AI-powered large language models like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude might—might—inform the interpretive analysis. There, having thought the unthinkable, I’ve said the unsayable.
Now let me explain myself.
As Judge Newsom notes, part of the case involved determining the common understanding of the term “landscaping,” since it was not clearly defined in the contract. He also says that, due to a quirk of Alabama law, the final disposition of the case didn’t actually depend on this definitional issue, in part because of the landscaper’s insurance application, in which he denied doing any work on recreational equipment.
But that allows Newsom the chance to explore how AI might be useful here, in a case where it wasn’t necessary. And that allows him to be somewhat more informal than you might expect from a judge (though, of course, we all have our favorite examples of judges letting their hair down a bit in opinions).
Importantly, though, that off-ramp wasn’t always obviously available to us—or at least as I saw things, to me. Accordingly, I spent hours and hours (and hours) laboring over the question whether Snell’s trampoline-installation project qualified as “landscaping” as that term is ordinarily understood. And it was midway along that journey that I had the disconcerting thought that underlies this separate writing: Is it absurd to think that ChatGPT might be able to shed some light on what the term “landscaping” means? Initially, I answered my own question in the affirmative: Yes, Kevin, that is positively absurd. But the longer and more deeply I considered it, the less absurd it seemed.
I kind of appreciate the thoroughness with which he admits that there are good reasons to think he’s absurd here — he even thought it himself! — before explaining how he changed his mind.
He admits that he did “the usual” thing courts do when trying to determine the ordinary meaning of a word, which often involves… looking up what the dictionary or other such reference materials say. So he did a run-through of dictionaries and looked at their definitions of “landscaping.” But he notes that it didn’t really help all that much in determining whether the trampoline installation counted as landscaping.
Then, he also looked at the pictures associated with the case:
After languishing in definitional purgatory for a while, I decided to look at the case from a different perspective—and I do mean look. The record contains a series of photographs of Snell’s trampoline-related project. Here’s one, which shows his prep work—in particular, the empty sand pit and the below-ground retaining wall that reinforced its borders:
And another, which depicts the finished product, including both the polypropylene mat (the fun part) and the decorative wooden “cap”:
I’m not particularly proud of it, but I’ll confess that the photos affected the way I thought about the case. Nothing in them really struck me as particularly “landscaping”-y. The problem, of course, was that I couldn’t articulate why. And visceral, gut-instinct decisionmaking has always given me the willies—I definitely didn’t want to be that guy. So in a way, I felt like I was back to square one.
I swear, this is the “bloggiest” Appeals Court concurrence I’ve ever read. And it only gets more bloggy:
And that’s when things got weird. Perhaps in a fit of frustration, and most definitely on what can only be described as a lark, I said to one of my clerks, “I wonder what ChatGPT thinks about all this.” So he ran a query: “What is the ordinary meaning of ‘landscaping’?” Here’s what ChatGPT said in response:
“Landscaping” refers to the process of altering the visible features of an area of land, typically a yard, garden or outdoor space, for aesthetic or practical purposes. This can include activities such as planting trees, shrubs, flowers, or grass, as well as installing paths, fences, water features, and other elements to enhance the appearance and functionality of the outdoor space.
Interesting, for two reasons. First, ChatGPT’s explanation seemed more sensible than I had thought it might—and definitely less nutty than I had feared. Second, it squared with my own impression—informed by my own experience writing, reading, speaking, and listening to American English in the real world—that ordinary people might well use the word “landscaping” (1) to include more than just botanical and other natural improvements and (2) to cover both aesthetic and functional objectives. In fact, several of the examples that ChatGPT flagged—“paths, fences, [and] water features”—jibed with the sorts of things that had sprung to mind when I first started thinking about the case.
But, don’t worry. He didn’t just go with that because it confirmed his priors. He (rightly) recognized that’s not how any of this should work. Again, this reads like a blog post, not a judicial concurrence, but that’s what makes it fun.
Suffice it to say, my interest was piqued. But I definitely didn’t want to fall into the trap of embracing ChatGPT’s definition just because it aligned with my priors. (Bad.) So, in what might have been a mistake—more on that later—we went ahead and asked it the ultimate question: “Is installing an in-ground trampoline ‘landscaping’”? ChatGPT responded as follows:
Yes, installing an in-ground trampoline can be considered a part of landscaping. Landscaping involves altering the visible features of an outdoor area for aesthetic or practical purposes, and adding an in-ground trampoline would modify the appearance and function of the space. It’s a deliberate change to the outdoor environment, often aimed at enhancing the overall landscape and usability of the area.
For good measure, I posed the same questions to Google’s Bard (since replaced by Gemini). The precise details aren’t particularly important, but the upshot is that both models’ answers indicated that the trampoline-related work Snell had performed—the excavation of the pit, the construction of the retaining wall, the installation of the mat, and the addition of the decorative wooden cap—just might be landscaping.
Apparently, it was around this point that he realized the aforementioned “off-ramp” created by Alabama law was available, meaning the definitional question didn’t actually matter to the outcome. But he was intrigued that his experiments had moved him out of the “that’s absurd” category and into the “huh, this might be useful… somehow?” category.
So he then uses more of the concurrence to explore the pros and cons. I won’t repost all of it, but the strongest argument in favor of considering this is that, if the goal is to understand the “common” way in which a word or phrase is used, LLMs trained on a vast corpus of human writing might actually provide a better read on the common usage and understanding of such words and phrases.
The ordinary-meaning rule’s foundation in the common speech of common people matters here because LLMs are quite literally “taught” using data that aim to reflect and capture how individuals use language in their everyday lives. Specifically, the models train on a mind-bogglingly enormous amount of raw data taken from the internet—GPT-3.5 Turbo, for example, trained on between 400 and 500 billion words—and at least as I understand LLM design, those data run the gamut from the highest-minded to the lowest, from Hemmingway novels and Ph.D. dissertations to gossip rags and comment threads. Because they cast their nets so widely, LLMs can provide useful statistical predictions about how, in the main, ordinary people ordinarily use words and phrases in ordinary life. So, for instance, and as relevant here, LLMs can be expected to offer meaningful insight into the ordinary meaning of the term “landscaping” because the internet data on which they train contain so many uses of that term, from so many different sources—e.g., professional webpages, DIY sites, news stories, advertisements, government records, blog posts, and general online chatter about the topic.
He’s quick to admit that there are potential problems with this. There are questions about what data LLMs were trained on and how representative it might be. There are also questions about how usage changes over time, for example. There are plenty of reasons why these results shouldn’t be automatically relied on.
But as I noted in my own explanation of how I’m using LLMs, the key point is to use them as a way to help you think through issues, not to rely on them as some sort of godlike answer machine. And Judge Newsom seems to recognize that. At the very least, it’s possible that an LLM might give you better (or, at the very least, different) insight into “common usage” of a word or phrase than a dictionary editor.
So far as I can tell, researchers powering the AI revolution have created, and are continuing to develop, increasingly sophisticated ways to convert language (and I’m not making this up) into math that computers can “understand.”… The combination of the massive datasets used for training and this cutting-edge “mathematization” of language enables LLMs to absorb and assess the use of terminology in context and empowers them to detect language patterns at a granular level. So, for instance, modern LLMs can easily discern the difference—and distinguish—between the flying-mammal “bat” that uses echolocation and may or may not be living in your attic, on the one hand, and the wooden “bat” that Shohei Ohtani uses to hit dingers, on the other. See id. And that, as I understand it, is just the tip of the iceberg. LLM predictions about how we use words and phrases have gotten so sophisticated that they can (for better or worse) produce full-blown conversations, write essays and computer code, draft emails to co-workers, etc. And as anyone who has used them can attest, modern LLMs’ results are often sensible—so sensible, in fact, that they can border on the creepy. Now let’s be clear, LLMs aren’t perfect—and again, we’ll discuss their shortcomings in due course. But let’s be equally clear about what they are: high-octane language-prediction machines capable of probabilistically mapping, among other things, how ordinary people use words and phrases in context.
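To make that “mathematization” point a bit more concrete, here’s a minimal sketch of the idea Judge Newsom is gesturing at: a contextual language model assigns the word “bat” different vectors depending on the sentence around it, so the animal sense and the baseball sense end up far apart in “math space.” This is purely illustrative and has nothing to do with the opinion itself; it assumes you have the Hugging Face transformers library and PyTorch installed, and the sentences are just made-up examples.

```python
# Illustrative sketch (assumes `transformers` and `torch` are installed):
# a contextual model gives the word "bat" different vectors depending on context.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bat_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bat' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bat")]

animal = bat_vector("The bat used echolocation to catch insects in the attic.")
another_animal = bat_vector("A bat flew out of the cave at dusk.")
baseball = bat_vector("He swung the wooden bat and hit a home run.")

cos = torch.nn.functional.cosine_similarity
# Typically, the two animal uses land much closer together than the baseball use.
print("animal vs. animal:  ", cos(animal, another_animal, dim=0).item())
print("animal vs. baseball:", cos(animal, baseball, dim=0).item())
```

That, in very rough form, is what “detecting language patterns at a granular level” looks like under the hood: the model’s representation of a word is shaped by everything around it.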
And, he points out, dictionaries may be very good at proffering definitions, but they are still influenced by the teams that put them together:
First, although we tend to take dictionaries for granted, as if delivered by a prophet, the precise details of their construction aren’t always self-evident. Who exactly compiles them, and by what criteria do the compilers choose and order the definitions within any given entry? To be sure, we’re not totally in the dark; the online version of Merriam-Webster’s, for instance, provides a useful primer explaining “[h]ow . . . a word get[s] into” that dictionary. It describes a process by which human editors spend a couple of hours a day “reading a cross section of published material” and looking for new words, usages, and spellings, which they then mark for inclusion (along with surrounding context) in a “searchable text database” that totals “more than 70 million words drawn from a great variety of sources”—followed, as I understand things, by a step in which a “definer” consults the available evidence and exercises his or her judgment to “decide[] . . . the best course of action by reading through the citations and using the evidence in them to adjust entries or create new ones.”
Such explainers aside, Justice Scalia and Bryan Garner famously warned against “an uncritical approach to dictionaries.” Antonin Scalia & Bryan A. Garner, A Note on the Use of Dictionaries, 16 Green Bag 2d 419, 420 (2013). They highlighted as risks, for instance, that a volume could “have been hastily put together by two editors on short notice, and very much on the cheap,” and that without “consult[ing] the prefatory material” one might not be able “to understand the principles on which the dictionary [was] assembled” or the “ordering of [the] senses” of a particular term.
Judge Newsom wants you to know that he is not trying to slag the dictionaries here (nor to overly praise LLMs). He’s just pointing out some realities about both:
To be clear, I’m neither a nihilist nor a conspiracy theorist, but I do think that we textualists need to acknowledge (and guard against the fact) that dictionary definitions present a few known unknowns…. And while I certainly appreciate that we also lack perfect knowledge about the training data used by cutting-edge LLMs, many of which are proprietary in nature, see supra notes 6 & 8, I think it’s fair to say that we do know both (1) what LLMs are learning from—namely, tons and tons of internet data— and (2) one of the things that makes LLMs so useful—namely, their ability to accurately predict how normal people use language in their everyday lives.
[….]
Anyway, I don’t mean to paint either too grim a picture of our current, dictionary-centric practice—my own opinions are chock full of dictionary definitions, I hope to good effect—or too rosy a picture of the LLMs’ potentiality. My point is simply that I don’t think using LLMs entails any more opacity or involves any more discretion than is already inherent in interpretive practices that we currently take for granted—and in fact, that on both scores it might actually involve less.
And, of course, he has another long section on all the reasons to remain worried about LLMs in this context. He’s not a blind optimist, and he’s not one of those lawyers we’ve written about all too often who just ChatGPT’d their way to useful-sounding but totally fake citations. He knows they hallucinate. But, he points out, if “hallucinating” means misrepresenting things, lawyers already do that themselves:
LLMs can “hallucinate.” First, the elephant in the room: What about LLMs’ now-infamous “hallucinations”? Put simply, an LLM “hallucinates” when, in response to a user’s query, it generates facts that, well, just aren’t true—or at least not quite true. See, e.g., Arbel & Hoffman, supra, at 48–50. Remember the lawyer who got caught using ChatGPT to draft a brief when it ad-libbed case citations—which is to say cited precedents that didn’t exist? See, e.g., Benjamin Weiser, Here’s What Happens When Your Lawyer Uses ChatGPT, N.Y. Times (May 29, 2023). To me, this is among the most serious objections to using LLMs in the search for ordinary meaning. Even so, I don’t think it’s a conversation-stopper. For one thing, LLM technology is improving at breakneck speed, and there’s every reason to believe that hallucinations will become fewer and farther between. Moreover, hallucinations would seem to be most worrisome when asking a specific question that has a specific answer—less so, it seems to me, when more generally seeking the “ordinary meaning” of some word or phrase. Finally, let’s shoot straight: Flesh-and-blood lawyers hallucinate too. Sometimes, their hallucinations are good-faith mistakes. But all too often, I’m afraid, they’re quite intentional—in their zeal, attorneys sometimes shade facts, finesse (and even omit altogether) adverse authorities, etc. So at worst, the “hallucination” problem counsels against blind-faith reliance on LLM outputs—in exactly the same way that no conscientious judge would blind-faith rely on a lawyer’s representations.
He also goes deep on some other downsides, including some we already discussed regarding the data LLMs are trained on. If it’s only online speech, does that leave out speech that is common offline? Does it leave out communities who have less access to the internet? Basically, it’s part of the well-known “alignment problem” in generative AI: some level of bias is simply unavoidable. But that doesn’t mean you just shrug and accept the results unquestioned.
He even considers that lawyers might try to shop around for different AIs that agree with them the most or, worse, try to “poison” an LLM to get it to agree with a preferred understanding. But, he notes, that seems unlikely to be all that effective.
There’s also this fun bit about the dystopian threat of “robo judges,” which I especially appreciate given that we once created a game for a legal academic conference, called HAL of Justice, which turned all the participants into futuristic AI judges handling court cases.
Would the consideration of LLM outputs in interpreting legal texts inevitably put us on some dystopian path toward “robo judges” algorithmically resolving human disputes? I don’t think so. As Chief Justice Roberts recently observed, the law will always require “gray area[]” decisionmaking that entails the “application of human judgment.” Chief Justice John G. Roberts, Jr., 2023 Year-End Report on the Federal Judiciary 6 (Dec. 31, 2023). And I hope it’s clear by this point that I am not—not, not, not—suggesting that any judge should ever query an LLM concerning the ordinary meaning of some word (say, “landscaping”) and then mechanistically apply it to her facts and render judgment. My only proposal—and, again, I think it’s a pretty modest one—is that we consider whether LLMs might provide additional datapoints to be used alongside dictionaries, canons, and syntactical context in the assessment of terms’ ordinary meaning. That’s all; that’s it.
And with that, he closes with an interesting provocation. If you’ve come around to his idea that we should be considering this form of algorithmically assisted brainstorming, what are the key things we should think about? He highlights that prompt construction will matter a lot. How do you create the “right” prompt? Should you try multiple prompts? Should you use multiple LLMs? Should there be some indication of how “confident” an LLM is in any particular answer? And, as noted earlier, how do you handle words whose meanings change over time, if the standard should be the meaning at the time the contract was written?
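For what it’s worth, here’s a rough sketch of what “try multiple prompts across multiple models” could look like in practice, treating the outputs as data points for a human to review rather than as answers. It assumes the OpenAI Python client and an API key in your environment; the model names and prompt phrasings are just examples, not anything from the concurrence.

```python
# Purely illustrative: survey several prompt phrasings across several models
# and collect the answers side by side for human review, rather than trusting
# any single output. Assumes the OpenAI Python client (`pip install openai`)
# and an OPENAI_API_KEY environment variable; models and prompts are examples.
from openai import OpenAI

client = OpenAI()

PROMPTS = [
    "What is the ordinary meaning of 'landscaping'?",
    "How would an ordinary English speaker define 'landscaping'?",
    "In everyday usage, does 'landscaping' include installing an in-ground trampoline?",
]
MODELS = ["gpt-4o-mini", "gpt-4o"]  # swap in whichever models you want to compare

def survey(prompts=PROMPTS, models=MODELS):
    """Return (model, prompt, answer) tuples for later side-by-side review."""
    results = []
    for model_name in models:
        for prompt in prompts:
            response = client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
            )
            results.append((model_name, prompt, response.choices[0].message.content))
    return results

if __name__ == "__main__":
    for model_name, prompt, answer in survey():
        print(f"\n[{model_name}] {prompt}\n{answer}")
```

None of that resolves the harder questions he raises (confidence, changing usage over time, prompt sensitivity), but it does show how cheap it is to gather multiple data points instead of leaning on a single response.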
And he closes in the most blog-like fashion imaginable.
Just my two cents.
I find this whole discussion fascinating. As I highlighted in my own post about how we use LLMs for brainstorming, I recognize that some people hate the idea outright, while others are too utopian about “AI in everything” without thinking through the potential downsides. It’s nice to see someone recognize that there is a reasonable middle path: these tools have utility in certain, specific scenarios, if used properly and not relied on as a final arbiter of anything.
Also, it’s just kind of fun to read through this quite thoughtful exploration of the topic and how Judge Newsom is considering these issues (fwiw, Newsom has been the author of opinions we’ve agreed with strongly, as well as ones we’ve disagreed with strongly, so it’s not as though I feel one way or the other about this based on his jurisprudence — it’s just a really interesting discussion).
I also appreciate that, unlike so many conversations about tech like generative AI these days, he’s not taking the extremist approach of it being “all good” or “all bad,” and is actually willing to explore the tradeoffs, nuances, and open questions involved. It would be nice if the world saw more of that, just in general.