Transforming Books Into Statistics Won’t Bring Wisdom or AGI

Here’s a secret: turning books into statistics won’t bring AGI, cures for cancer, utopia, or any number of useful inventions that we are told are merely 12 to 18 months away. This activity also won’t bring their users wisdom. As a society, we are told that if we don’t let companies freely pillage the intellectual work of our past and present, we won’t get the life of leisure we are promised. But turning War and Peace into statistics won’t lead to significant breakthroughs. It won’t even bring knowledge… for you.

I believe that many of the people who work at the big AI companies know that training on a large corpus of non-domain-related works won’t lead to AGI or significant breakthroughs in areas such as cancer research, but they do know it may lead to breakthroughs in manipulation, and that has them interested.

Project Panama: Books Into Statistics

Recently, the Washington Post had an article about Anthropic’s Project Panama, a secret project to destructively scan every book on the planet. The image is shocking, and the whole situation feels dirty, which is probably why Anthropic tried to keep it a secret. Although it was found that they didn’t break any laws, they tried to keep it secret because they knew this would have a negative public perception. Mission accomplished.

In one sense, this is an attempt to obtain untainted training data. The internet is submerged in AI slop after the launch of ChatGPT. AI slop is good enough for the internet, but not so good for AI training. AI models tend to degrade when trained on their own outputs, a condition known as model collapse. So, instead of a model getting better, it gets worse. Seems models know what’s better for them than we do.

But the quest for untainted training data isn’t the whole story. If you are trying to scan every book on the planet, then you’ve made a decision to ingest and train AI on books of all kinds: inaccurate books, dated books, and even “bad” books. In short, accuracy isn’t the goal here.

Okay, so what’s a “bad” book? I mean books with universally accepted bad ideas, poor stories, poor writing, and many other issues. Trying to train on all books means you’ve made a conscious decision to also train on material such as Mein Kampf and The Turner Diaries. That’s right, you’ve “trained” on it, not assigned it to an AI model as homework for a classroom discussion. There are few things I can say with 100% certainty, although I can say this: there is nothing in books like Mein Kampf that will cure cancer.

Bad books shouldn’t be eliminated. Although bad for AI, they can be beneficial for humans. You can always stop reading a poor novel or any other book you feel isn’t providing proper value for your effort. As for books with bad ideas, when a human reads one, they can do so from a given perspective, trying to formulate a certain understanding. Such books can even be read with the intent of identifying and avoiding certain conditions in the future. Only a fool would think reading a book with bad ideas is always bad.

When an AI trains on a bad book, it incorporates the ideas and even the poor sense of style. It isn’t providing any perspective. It’s basically just nom nomming the data, creating statistical grenades. I’m certainly not claiming that training on Mein Kampf creates an AI Hitler; there are other ways that can happen. What I am saying is that the ideas and concepts contained in these books are kicking around in there somewhere, even if they are shoved way down in the statistical distribution. What this ultimately means is unclear.

I don’t mean to be disingenuous here. The definition of a “bad” book is highly subjective. This quickly devolves into a “who decides?” scenario, which could lead to unintended consequences of its own. My point is that there should be more purpose to the activity.

There were also plenty of idiotic takes on Project Panama. Never underestimate the true cluelessness of the e/acc community. The one thing they effectively accelerate is their own idiocy.

e/acc person saying dumb things about libraries

One of my favorite arguments from the e/acc community was that they weren’t destroying the books, they were preserving them. To which I joked that it was preservation through destruction. Preservation through destruction sounds like a quote that could be ripped from Orwell, just like the pages of his books for Project Panama. Turning books into statistics to monetize them doesn’t preserve them in any sense of the word. This is a silly argument that can be destroyed by one simple question. If they are preserved through this process, then where are they?

Manipulation and Imitation

So, why are AI companies foaming at the mouth to get their hands on books that seem to have nothing to do with their goals? If I had to guess, it has to do with a couple of factors.

The more of this type of data ingested for training, the more the system may be able to imitate humans under a variety of conditions. This can be used by users of the system to create a “personality” from the tool, or, more importantly, to manipulate people, fooling them into thinking the AI is actually a human. This manipulation could be applied in situations like customer service. I experienced this recently.

A broken water pipe forced me to call some local plumbing companies after hours. Quite a few of them were using a call service with what sounded like a human in a call center, complete with office background noise. This was clearly done to manipulate callers. Oddly enough, when asked if they were AI, only a couple responded that they were. This is the type of manipulation that I find unacceptable. Gary Marcus recently wrote an article about this as well.

Of course, these conditions can also be used to claim that the AI has a consciousness or a self that needs to be protected. This is pure sci-fi bullshit meant to ramp up the hype.

Another interesting and related condition is pastiche. The more of this type of data the model is trained on, the better it may get at imitating specific forms of human style. People can use these to fool themselves into thinking they are being creative.

As an example, generative AI can reliably generate books. They aren’t good books, or well-informed books, or well-written books, or accurate books. They don’t present new information or new perspectives, or have any of the other countless characteristics we associate with a good book. But words slathered onto pages… This it can do reliably.

Generative AI hasn’t cured cancer yet, but it excels at creating slop. Slop is literally the number one use case for generative AI today, arguably more than coding. There’s no doubt that AI companies want more of this behavior to keep people engaged.

But let’s get back to books.

Next-Gen Nerds

When I was growing up, being called a nerd was considered a bad thing, and nerds read a lot. Now, they are popular, wear black t-shirts and blue jeans, and claim that reading is for losers. Their perspective is warped by an “optimization-at-all-costs” mindset. But don’t take my word for it.

In November of 2022, notorious tech bro and crypto con man Sam Bankman-Fried told a writer interviewing him, “I would never read a book.” He went on to say, “I’m very skeptical of books. I don’t want to say no book is ever worth reading, but I actually do believe something pretty close to that. I think, if you wrote a book, you fucked up, and it should have been a six-paragraph blog post.”

That’s the next-gen nerd’s perspective. An entire book should be six paragraphs. But why stop there? Why not six bullet points? Seems I just out-optimized the optimizer! In fact, many of their points can be addressed by reductio ad absurdum.

This perspective does not serve people well and further devalues books. The theory is that if books can be reduced to numbers, then the ideas they contain can be made “useful” in a programmatic or more efficient way. But ideas from books can’t be reduced to numbers, just like The Hitchhiker’s Guide to the Galaxy can’t be reduced to 42.

Oddly enough, when reducing books to numbers, you remove the content from context. Context, the very thing both humans and AIs need to make sense of things.

The idea that books are nothing but bloated friction is nonsense and could only be cooked up by delusional idiots. I acknowledge that poorly written books exist, and some books are 12 chapters when they should have been 6, but applying this perspective equally across all books is just plain stupid.

The Decline of Reading

Here’s a chart that should surprise no one.

Chart showing a decline of reading among teenagers

Everyone knows you don’t become an influencer by reading. Or reflecting. Or thinking. You become an influencer by reacting. Thinking before you do something takes too much time. What should be concerning is how steeply the lines fall. Kids who don’t read turn into adults who can’t read. And we are seeing this play out.

Gen Z is showing up to college unable to read.

Article about Gen Z being unable to read.

And teachers are taking notice, labeling it a crisis, as it should be.

Article about a college teacher observing that students can't read

Wisdom Takes Work

Here’s another secret: no matter how good the AI gets, you’ll never become wise without doing the work yourself. Wisdom manifests from reading, writing, and a whole lot of reflection. All activities that are devalued today. I’ve previously covered how knowledge and understanding aren’t generated from bullet points, but let’s go a bit deeper.

In letter 27 of Seneca’s Letters on Ethics, he discusses how real joy depends on real study. In this letter, he describes Calvisius Sabinus, a wealthy man who wanted to appear learned and devised a shortcut. He spent a great deal of money on slaves: one to know Homer, another Hesiod, and nine more, one for each of the lyric poets.

After he assembled this group, he would pester his dinner guests by keeping these slaves at his feet and regularly asking them for verses to quote. Even with this assistance, it was observed that he’d often stop mid-sentence. When a man named Satellius Quadratus made fun of him, saying he should train his busboys to be literary scholars too, Sabinus responded that the slaves had cost him a hundred thousand sesterces apiece. To which Quadratus said, “You could have bought as many libraries for less.”

Sabinus’ optimization led to a mistaken assumption that the knowledge possessed by anyone in his household was his own. This perspective is not only wrong, but it also led to ridicule. Today, we have a similar situation with AI, and many would claim that the information contained in an AI tool is knowledge they themselves possess, except that it isn’t. This resembles a situation we had previously with Google search, so we shouldn’t be fooled merely by upgraded tech.

Excellence of mind cannot be borrowed or bought. -Seneca

Wisdom isn’t recall. After all, someone who merely memorizes things wouldn’t be considered wise. Wisdom is the perspective that’s gained from study and reflection across a variety of sources. It’s the ability to connect the dots between concepts and form new ideas. Wisdom manifests in someone who puts in the work, which includes reading, writing, and a healthy dose of reflection, all things labeled by the tech bros as “friction.”

Photo of the bust of Zeno of Citium by Paolo Monti.

When Zeno of Citium (the founder of Stoic philosophy) visited the Oracle of Delphi with the question of what he should do to live his best life, the god replied, “He should have intercourse with the dead.” This is recorded in Diogenes Laertius’ Lives of the Eminent Philosophers. In modern times, people have changed this to “converse with the dead,” no doubt because of how intercourse is used today, but I think intercourse has a much deeper meaning. No pun intended.

The oracle’s response didn’t mean Zeno needed to engage a psychic medium or have a seance. The only true way to converse with the dead is to read. This is the meaning Zeno inferred as well.

I’m obsessed with the works of long-dead authors such as Seneca, Aldous Huxley, Marshall McLuhan, Neil Postman, Montaigne, and many others. I can’t send them letters or call them on the phone. I can grasp their perspective through their writing, through books, letters, and other artifacts they’ve left behind. I can highlight passages, write my own notes, create my own modern perspective, and even challenge these authors in my own way, given the time that has passed from them to me.

Reading is a superpower that we seem eager to relinquish, which is a shame, because in many cases, reading is the remedy for so much of what ails us today. Reading does broaden the mind. It demonstrates that even ancient people encountered many of the same problems we have today. It creates room for reflection that extends far beyond the act of reading. The quest for wisdom and conversing with the dead create a satisfaction that can’t really be matched by other activities. It’s a feeling that isn’t possible to explain, and it’s something that needs to be experienced.

Getting Useful Information From Books

There’s a time-tested way to get the information from books that doesn’t require destroying them and turning them into a statistical distribution: you can actually read them. Dated concept, I know. I have a large piece in draft on literacy more broadly, so we’ll keep this piece focused on reading.

People might claim that the joke is on me because I only have knowledge from the books I’ve read, whereas someone with generative AI has knowledge from all of the books, despite never reading any of them. This is an idiotic statement to make. Setting aside retrieval, hallucinations, and other technical issues, the user doesn’t actually have the knowledge of any of these books. At best, they have pieces torn from context. These people are like Sabinus, without knowledge but happy to annoy dinner guests.

The great thing about books is that you don’t need to read them all. Just like an explorer doesn’t have to explore every square inch of the globe, a reader is free to explore their unique interests and forge their own path of knowledge and wisdom.

Another response could be that, since the AI has been trained on the content of someone like Seneca, you could have a Seneca bot and ask it questions. But this approach doesn’t make sense either. First of all, even if this were an effective approach, you’d have to know the right questions to ask. Since you haven’t read the source material and aren’t confronted with concepts in the writing, the “right” questions would escape you.

Second, none of the responses from the bot would stick with you, or in some cases, make any sense. The responses won’t form connections the way content does when you encounter it in the context of reading a book. The bot is going to “tell” you something, while reading will “show” you something. Reading is a true experience. Being told something is ephemeral and throwaway. Experiencing something can last a lifetime, while being told something may last only seconds.

Finally, the bot will approximate a pastiche response based on what it may have encountered during its training. It’s going to fill in the blanks in whatever statistical way makes sense. It won’t be the true response based on what the author really knew or felt. However, this response does actually fool people, and we’ve seen it time and time again in the generative AI era.

Much of this makes so much more sense when stated out loud. What do you think is the best way to get value from George Orwell: throwing 1984 into a massive statistical distribution, or reading the book? Reading a book sticks with you in a way other methods can’t match.

There’s a mismatch in the application of technical thinking here. The goal of reading a book is to change the landscape of your mind, not have immediate recall over chapter and verse. Thinking that the point of reading a book is to remember everything is a warped perspective caused by our modern technological environment.

Unable To Read

Many people find it difficult to read long-form content, things like books and longer essays, articles, maybe even this very one. This may be due to attention hijacking, inability to focus, or discomfort, for lack of a better term.

People often tell me they wish they had time to read. I tell them, “Yeah, like I have boatloads of free time.” When viewed this way, zoomed out, it appears that nobody has time to read. However, this isn’t the case.

It’s true that getting information from books requires purposeful action and commitment. However, once started, it’s not as bad as people make it out to be. In many ways, it’s like an exercise routine. The best way to get started, or to get back into reading, is to create a habit. Start with the intention and don’t beat yourself up about it when you miss the mark.

What do you typically do before you go to bed? For many, this is swiping through social media, reading news stories, or maybe watching TV. This is a prime block of time to target for a reading habit.

You don’t need to start big. Maybe try 15 or 30 minutes of uninterrupted time. You may get distracted or feel uncomfortable. It’s fine. Like any habit, it will take some adjustment. At some point, it will click.

Don’t let some sort of idealized conception of reading throw you off. There’s so much reading advice out there, and much of it is bullshit. For example, nothing is more bullshit than speed reading. If you were worried about wasting your time reading, then speed reading will confirm your worries.

The things you are told are bad habits, such as vocalizing as you read, re-reading sections, and reading slowly, are all positives that reinforce the concepts contained in the book. Your goal is to develop an understanding of the material, form new connections between concepts in the book and in the world, and even develop new ideas based on these. None of that happens during speed reading.

It’s true, people read at different paces. You may feel like you read slower than other people, but that’s probably not true. Besides, who cares? You are reading for you.

As you begin reading again, you’ll find what works for you and what doesn’t. For example, maybe you prefer physical copies of books or the convenience of an ebook reader. The format is irrelevant if it works for you. Also, maybe you need to put your devices in do-not-disturb mode or make other purposeful interventions. Be intentional, find what works, and push forward.

My Approach To Reading

Let me share my approach, because I feel it’s pretty simple. I’ve explored using book tabs and other reference techniques, but I don’t use them consistently. Some people use reference cards, but I’ve never felt the need to go this far. I read for about an hour and a half every night before I go to bed. I typically have at least one fiction and one non-fiction book I’m reading at a time.

I prefer physical copies of books because they feel more engaging to me, and I don’t have yet another “device” in my hands. However, for nonfiction books I’m really trying to dig into, I’ll buy all three formats: physical, ebook, and audiobook. The audiobook is mostly for use while running on a treadmill or driving on a road trip.

For the physical copy of the book, I have three things: a Zebra Mildliner in lemon yellow, a pen, and a notebook. As I encounter interesting content (concepts, things I’d like to quote, or anything else), I highlight it. If an entire section of a page is worth marking, I’ll highlight the first sentence and make a note with my pen in the margin, so when I revisit it, I have the context.

As I have ideas or make connections, I’ll stop reading and capture my thoughts in the notebook. I may include a piece of the book’s content there, but not always. I will always note the page number for reference, though. This way, it’s easier to revisit in the future.

After I’m done reading, I’ll take all the highlights from the physical book and apply them to the same sections in the ebook. I then sync those highlights to Readwise for both ease of reference and spaced repetition. I also have a physical commonplace book where I write the highlighted items. However, I’m not very consistent with this activity.

You may question the efficiency of my approach, as it seems I’ve added unnecessary extra work for myself. It appears that using the ebook and syncing the highlights is far more efficient. Once again, the optimization is a trap. The friction is the point.

In the act of transferring the highlights to the ebook, I’m once again confronted with all of the concepts I’ve highlighted, this time, after reading the whole book. Although I don’t remember things word-for-word, when I see the highlight, I’m reminded of the context in which the highlight was created. New ideas form, and I note those in my notebook. This activity further reinforces the content in my mind. I will sometimes reread the page or section during this activity. This activity provides much value.

Someone may argue that an AI can digest the entire book and provide relevant highlights without you having to read it. Hopefully, by now you can recognize the issue. Even if it did this accurately, you’d be confronted with highlights out of context. The meaning and important features would be unavailable to you mentally.

Conclusion

The current AI age is making us wisdom-poor and manipulation-rich. The damaging consequences of the devaluation of reading are on the horizon for an entire generation and generations to come. It’s separating us from the very skills we need to defend ourselves and keep us robust in the modern environment. It’s removing our ability to reflect as modern technology pushes us to react.

Many believe they can’t read long-form content anymore, but that’s only because they haven’t tried. By creating a habit and making some purposeful interventions, we can get back on track to finding wisdom.

Perilous Tech

The Warning Label For Emerging Technology