[Featured image: a nightmare-fuel DALL-E-generated image from the prompt "green skin, skull like face, shallow nose, thin lips"]

Comparing different language models

This entry is part 4 of 13 in the series Artificial Intelligence

Over the last few months I’ve been bouncing back and forth between ChatGPT and Claude.ai for my AI interactions. ChatGPT is what everyone is most familiar with, of course. Claude.ai is a newer kid on the block and I really only discovered them by way of their amazing docs site that was linked to me by a friend who’s also interested in and experimenting with AI.

Claude bills itself as a conversational AI, and my experiments with the two of them have shown me that they are both very good at very different things, but it was the "useful hacks" suggestions in the Claude docs that really piqued my interest.

So, one major difference is that ChatGPT now has a mobile app, whereas Claude does not. That means I can have ChatGPT sitting in my pocket, ask it questions, and have those questions show up in my chat history (or continue previous chats). I can do the same with Claude, but only through the web UI, which leaves something to be desired.

Claude is still pretty early in development, too, and so — unlike ChatGPT — it can’t just continue conversations indefinitely. It limits how many interactions you can have per day and then, when you’ve reached the “token” limit, it stops the conversation completely and does not allow you to continue (meaning anything you’ve discussed or “trained” it in that conversation is, essentially, lost).

Despite this, I continue to go back to it. Why? Because Claude does one thing that ChatGPT struggles with: sounding human. Claude is far better at generating content that feels realistic or believable, and that makes it better at brainstorming creative ideas: RPG characters, locations, backstories, names of fictional places and items. It's just better at the creative stuff. During a D&D game session last night, I was able to ask it for a statblock for a character I had prompted it into creating, after massaging the story and the interactions between that character and their surroundings, and it just…did it. The result was believable and, in terms of balance, on par with the player characters and the enemies they were fighting.

Evaluating the right tool for the job

Claude wasn't trained on the internet. Or, at least, that's what it told me. I'm sure, on some level, it was; almost all LLMs have been, since that's what makes them "large". But it doesn't remember/know that it was, so if you ask it something specific about the "real world", it's not able to answer. ChatGPT, on the other hand, is happy to tell you that it was trained on internet data and that the cutoff is September 2021. It's also happy to give you completely fabricated directions if you ask it how to get from one place to another, so, there's that.

Claude will say that it was trained to be "helpful, harmless and honest", and that is something built deep into its operation. As such, Claude was also trained to say "I don't know" when it doesn't actually know something or doesn't have information about a thing, as opposed to ChatGPT's habit of making some shit up and sounding authoritative about it. Allowing an AI bot to say "I don't know" can help reduce (if not actually eliminate) hallucinations, which is extremely beneficial. I've started telling ChatGPT to tell me if it doesn't know something rather than spewing stuff that it thinks it knows (or maybe knows, or supposes, or assumes, or whatever it is that causes it to tell me things that, when tested, just don't work).
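For what it's worth, you can bake that instruction into a conversation from the start rather than repeating it every time. Here's a minimal sketch using the OpenAI Python client's pre-1.0 ChatCompletion interface; the system-prompt wording is my own, not a magic formula:

```python
# A minimal sketch of telling the model up front to admit uncertainty,
# using the pre-1.0 OpenAI Python client (openai.ChatCompletion).
# The system-prompt wording here is my own invention; tune it to taste.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

SYSTEM_PROMPT = (
    "You are a helpful assistant. If you don't know something, or aren't "
    "confident your answer is correct, say 'I don't know' instead of guessing."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What's the bus schedule from my street to downtown?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```

A system message applies to the whole conversation, which is exactly what you want here: the "don't guess" instruction sticks around instead of fading as the chat gets longer.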

But not knowing the internet or anything relating to “the real world” can be limiting. For example, I was able to get ChatGPT to compile a playlist based on the Dungeons & Dragons setting Planescape with some slight mood tweaking and nudges into specific artists’ catalogues (e.g. Ramin Djawadi). Not only was it able to accomplish the task (mostly successfully) but the songs actually existed on Spotify and fit the mood I was going for (I created a playlist based on the interaction here). When I asked Claude to do the same thing, it balked, saying it didn’t have access to “real world” information so it had no way of doing that.

On the other hand, Claude is far better at creative and imaginative tasks, where ChatGPT either fails outright or produces something that sounds like it was written by a robot. For example, I recently came across this gist with instructions for converting D&D 2e statblocks to 5e. I fed it to both Claude and ChatGPT.

Claude was able to create realistic adaptations of the statblocks and improvise spell lists, abilities, etc. Its statblocks included reasonable stats for the ability scores that 2e statblocks don't cover and, once I gave it information about Challenge Ratings from the Dungeon Master's Guide, it was able to apply appropriate CRs (and therefore proficiency bonuses) and then scale the creatures up and down when I asked for different CR versions. On the other hand, when I asked it to format the statblock in Homebrewery-flavored markdown, it kept making mistakes (like not adding the two spaces at the end of a line that force a line break), even after I caught them and pointed them out.
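As an aside, those missing line breaks are easy to patch after the fact instead of arguing with the model about them. Here's a hypothetical little helper (the function name and approach are mine, not something Claude produced) that appends the two trailing spaces markdown needs for a hard line break:

```python
# Hypothetical post-processing helper: enforce markdown hard line breaks
# (two trailing spaces) on every non-blank line of a statblock, since the
# model kept dropping them. Not from the original post; just a sketch.
def enforce_hard_breaks(markdown: str) -> str:
    fixed_lines = []
    for line in markdown.splitlines():
        stripped = line.rstrip()
        # Blank lines separate paragraphs; leave them alone.
        if stripped:
            stripped += "  "  # two trailing spaces render as a <br>
        fixed_lines.append(stripped)
    return "\n".join(fixed_lines)

statblock = "STR 18 (+4)\nDEX 12 (+1)\nCON 16 (+3)"
print(repr(enforce_hard_breaks(statblock)))
```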

ChatGPT, on the other hand, did the task as asked but left all the attribute points for things it didn’t know at 10. When I told it to base the abilities on another creature (as noted in the gist), it was able to update it, and it had a natural ability to apply CR, but it was less able to improvise or create new things. However, when I told it to give me the statblock in Homebrewery markdown and provided an example, it was able to consistently do so with fewer mistakes.

My conclusion, therefore, is that for things that are definite and factual, like code or specific tasks, or for things where some understanding of the outside world is required, ChatGPT is the right solution. For anything creative, Claude is far superior.

Now let’s talk about GPT-4 and Bing chat.

Bing chat has the ability to actually query the internet, a little like Siri falling back to a web search when you ask it something it doesn't just know. On a recent trip to California, I needed to find the closest place to buy boogie boards. Searches, Google Maps, and Yelp gave me results, but I wasn't convinced they were the right ones. So, while we were out on the beach in Dana Point, I asked Bing chat for the closest place to find boogie boards and it told me Killer Dana on the Dana Point harbor, less than a mile from where we were at the time (it had come up in my searches, too, so I could confirm it was accurate). Then, because I wasn't entirely sure about the nuance, I asked it the difference between "boogie board" and "body board" (there is no difference; "boogie board" comes from Morey Boogie, so the distinction is like calling lip balm "chapstick", internet search "googling", or paper tissue "kleenex").

Now, I wouldn't trust Bing chat for much of anything. My experiments in actually using Bing chat for useful things gave me the impression that it's about as good as Siri at answering questions; that is to say, its knowledge is incredibly limited and it relies on the links it's able to pull up to answer your question rather than any sort of built-in logic.

I recently upgraded my ChatGPT account to gain access to GPT-4. Supposedly, it has a much higher token limit, it's faster, and it's better able to reason things out, with a higher success rate. I'm not sure about all that, but my initial tests have found it to be:

  • way more verbose. It’s far more likely to explain things in extreme detail even when it wasn’t asked to.
  • forgetful. It seems to forget things it said earlier in the conversation. Whereas GPT-3.5 (the default, free model you get from ChatGPT) can refer back to something it said earlier, I kept catching GPT-4 "forgetting" something (e.g. a piece of code and what it does) that it had given me just one prompt earlier.

On the other hand, it was able to handle more complex instructions, despite telling me that a large script I pasted into it was incomplete or missing sections (something GPT-3.5 has never said; Claude, for its part, converts large pastes into attachments the model can "read"). I'm still not sold on GPT-4, but one thing I hadn't played with until now is the ability to give it customized instructions, so I've told it to be more succinct in its answers unless prompted for more information, and to "think step by step", a way of giving LLMs permission to "think out loud" before they fully answer your question or prompt. A rough sketch of what those instructions look like through the API follows below. We'll see how it goes.
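For the curious, here's roughly what those custom instructions amount to when expressed through the API as a reusable system message. This is a sketch, again assuming the pre-1.0 OpenAI Python client, and the instruction text is my paraphrase of what I gave ChatGPT, not its exact built-in feature:

```python
# A sketch of "custom instructions" as a reusable system message, using
# the pre-1.0 OpenAI Python client. The instruction text is my paraphrase;
# ChatGPT's custom instructions work the same way conceptually, being
# prepended to every conversation.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

CUSTOM_INSTRUCTIONS = (
    "Be succinct unless I ask for more detail. "
    "For anything non-trivial, think step by step before giving your final answer."
)

def ask(question: str) -> str:
    """Send one question to GPT-4 with the custom instructions applied."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CUSTOM_INSTRUCTIONS},
            {"role": "user", "content": question},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(ask("How many 8-hour shifts fit into a 168-hour week?"))
```

The "think step by step" line is the well-known chain-of-thought nudge; the succinctness line pushes back on the verbosity I complained about above.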
