Claude Code – how much hype, how much true wizardry?

I keep seeing mentions of Claude Code, many extolling its virtues, such as (from Scientific American):

Claude Code turns plain English into working software, making advanced coding much more possible for the rest of us.

I’ve seen our own Tim Peters experimenting with it. (I sense a conflict though. Hopefully Claude Code isn’t so good that it replaces Guido, Tim, Barry, Serhiy and all the other Python core devs before they’re ready to be replaced.)

I have no real need for this sort of thing (I am retired, after all), but a few questions do come to mind when I see fawning references such as the one above from SciAm.

  1. Can it do a reasonable job of refactoring existing code?
  2. What about translating between different languages (say, just picking two completely at random, Python to Rust)?
  3. I have code, but like many people I’m both a bit lazy and not really very good at writing unit tests. Could Claude beef up my test suite (or trim to minimize test case overlap)?
  4. Does Claude hallucinate?
  5. Who checks Claude’s work?

I’m sure there are places where this has been discussed exhaustively. Pointers appreciated. If it’s not too far afield for you, an answer or two (even partial) to get my thinking oriented in the right direction would be great.

3 Likes
  1. Amazingly.
  2. Yes
  3. Yes
  4. Rarely
  5. Unit tests/Linters
3 Likes

My only opinion on agents is that they tend to do things in their own style. You can nudge them in the right direction with specs, but at the end of the day I’d rather write code than a bunch of markdown specs.

My fear is that as more and more code is written this way, those specs get lost or the overall structure of things vanishes and you get this unmaintainable mess.

I’ve definitely shot myself in the foot that way before. Especially since it can write code faster than I can check it.

8 Likes

To add to this, I don’t use any AI assistants in my day-to-day work. I have also never used Claude or any of the other newer ones, only Copilot back when it came out.

My primary experience with agents is having to re-write thousands of lines of broken code that my colleagues put out with them.

1 Like

I’d say it’s worth the $5 (the equivalent credits have lasted me a couple of months) to at least try it. I wouldn’t say it’s what it says it is but it’s closer to it than others I’ve had to use. To give some of the experiences I remember:

  • I asked it to do a function signature refactor (something like make it take this data class instead of each field as separate parameters) including propagating it around all its usages: It did get it right although it took so long thinking about it that I doubt a manual find+replace would have been much slower [1].

  • I asked it to translate a fish completion file into a bash completion file: It did a convincing job of that. It missed one thing I wanted it to do but fixed it when I pointed it out. It would have taken me hours to do it myself. It was also good at picking up where it left off weeks later, when I added something to the fish file and wanted it propagated into the bash file. One thing that did ruin it a bit, though, was that it kept wanting to go back and tweak the bash file whenever I asked it to do completely unrelated things – it never explained its tweaks, so I eventually told it it wasn’t allowed to touch that file anymore. I don’t know bash completion well enough to know how :ox::poop: the code is, but it’s at least easy to test.

  • I asked it to optimise something: It confidently walked me through 5 different optimisations for “massive performance gains!”. Half the changes did nothing; the others made it slower. I was mildly impressed, though, that it managed not to break anything. It was a fiddly piece of code, but all the tests remained green after each change[2].

  • I asked it to write a typing stubs file for one of my packages: I don’t use typing, so I can’t assess its output other than to run mypy on my tests and check they pass – which didn’t give me enough confidence to keep it. The one thing I did see it do that I didn’t want was to use overly broad unions of input types. E.g. for a function that was only intended to take strings, but which by happenstance could have been given a list without throwing a type or attribute error, the hint would be the union of str, list and some crazy mess of protocol classes.

  • I asked it to write a README for a CLI application (in the same repo as its code so it could use that to figure out what the CLI could do): I wasn’t expecting it to do well. It was a mixed bag of LLM nonsense and surprisingly good guesses at how the CLI behaved. I didn’t keep any of it but seeing what I didn’t want it to look like helped me to figure out what I did want so it was worth it anyway.

  • I said thank you to it once and it billed me 60 credits to say you’re welcome :roll_eyes:

Overall, I struggle to find tasks that are less work to explain + review + provide clarifications on + re-review + adjust to code style/naming preferences[3] than to just do it myself. That’s partly me though – I’m way too fussy and opinionated to vibe code.


  1. factoring in the time it takes to describe the task to claude and review what it produces ↩︎

  2. I was checking ↩︎

  3. which, contrary to its claims, it doesn’t pick up well from the code you give it ↩︎

9 Likes

I have seen many examples of bad AI unit tests, like:

result = func(1)
assert isinstance(result, int)  # passes for any int at all
assert result != 0              # passes for almost any implementation
# proper test:
# assert func(1) == 2

I think that if you are using AI, you want to know that you have good control of QA in other ways – lots of ways of verifying correctness (tests, static typing, etc.) – so that you can easily catch the AI doing screwy things. Letting the AI write the unit tests is something that requires heavy review: it is great at unit tests if you just want those tests to pass, but not to be trusted once you remember that the purpose of a test is to be able to fail.
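To make that failure mode concrete (both functions below are invented for illustration): the weak assertions from the example above pass just as happily for a broken implementation, while a pinned expectation does not.

```python
def func_correct(x: int) -> int:
    return x + 1

def func_broken(x: int) -> int:
    return x + 99  # a regression the weak test never notices

for func in (func_correct, func_broken):
    result = func(1)
    assert isinstance(result, int)  # passes for both implementations
    assert result != 0              # passes for both implementations

assert func_correct(1) == 2   # the pinned test passes for the correct one...
# assert func_broken(1) == 2  # ...and would fail for the broken one
```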

3 Likes

This is one of the reasons I insist on type hinting for literally every line of code in anything I plan on actually maintaining, or anything critical to any sort of production system.

Since typing is still in flux and newer to Python, there isn’t really a lot of training material for the AIs.

I’ve written a few projects from the jump with the rule that everything must be typed, and it honestly eliminated the need for a ton of basic unit testing since you get the real-time feedback from your linter/LSP.

I also can immediately track exactly what the issue is if say a function signature changes somewhere and it breaks something somewhere else without needing to do anything extra.
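As a tiny sketch of that feedback loop (the function and values are hypothetical): with full hints, a later signature change turns every stale call site into an immediate checker error rather than a runtime surprise.

```python
def apply_discount(price: float, rate: float) -> float:
    """Return price reduced by rate (a fraction between 0 and 1)."""
    return price * (1 - rate)

# Every call site is checked against the hints by mypy/pyright as you type.
# If the signature later changes (say, rate becomes keyword-only), this
# positional call is flagged in the editor before anything is even run:
result = apply_discount(100.0, 0.2)
```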

1 Like

No, you haven’t :smiley: Never used Claude. I have used Gemini and ChatGPT-5. Humans hallucinate too :wink:

They’ve made jaw-dropping progress over the last year or so. I thought of them as a joke at first; then occasionally useful; then often very useful; and today I’d rather “chat” with ChatGPT than with most humans about most tech issues. It’s become more of a collaborator than “an assistant”.

Perfect! Play with one - dig into an area you’ve been curious about and just see how it goes. Don’t come in with preconceptions about what it can and can’t do. Let it surprise you.

Some hints:

  • Push back! If anything it says seems dubious, wrong, or unclear, press it for more details, to “show its work”, or try different words. Their “understanding” of English seems excellent.

  • They’re not human, but in some good ways. They have no egos, no axes to grind, and never lose patience. They don’t judge you.

  • They’re not omniscient. While they can apply reasoning, they mostly reflect the consensus across vast amounts of training data. Given that the web is what it is, popular delusions are part of the data they’re fed. On several occasions, for example, ChatGPT fed me a version of a popular misconception about a highly technical issue. But it always “listened to reason” when I disputed it, and we always reached agreement in the end.

  • But that’s unique to me. For whatever reasons, at least ChatGPT does not feed what it learns from conversations with me back into a global knowledge base. It only “remembers” them when it’s chatting with me.

  • Which is another surprising aspect: over time, they adjust their style to better match yours. My bots give me ever fewer picky details now, because they’ve picked up that I already know some basic things :wink: Time with them gets ever more productive.

  • The biggest disappointment to me is that they have no curiosity either. When a chat ends, the process moves on to other things, and doesn’t give another cycle to my problem until I start another chat. IOW, it shows no initiative - mostly just reactive.

So I don’t count on one for “deep insights”. The heavy conceptual lifting is always on me. But sketch an approach for it, and it can write mountains of implementation code in an eyeblink.

Then again, I’m mostly working on new algorithms in focused areas. I’ve never pointed one at a massive (or even large) code base. YMMV.

8 Likes

All of the models are heavily trained on Python, C and C++. You can ask them directly how well they are trained in a particular language; e.g. they’ll tell me they’re just okay with Lisp.

Most of my work is in CAD, and all of the models struggle with heavy math, e.g. computational geometry. They’re extremely helpful nonetheless. When a model starts to collapse, managing the context window becomes super helpful. A model collapse is similar to an oscillator in Conway’s Game of Life: the model starts oscillating through the same hallucinations, and at that point it’s best to edit or clear the context window.

As a sub-contractor, some contracts don’t allow me to post code online, so I’ll fire up something locally in the 30b range. These are excellent for tail-end work. I can inject C++ headers or .pyi stubs, and the model will stamp out well-written unit tests and even documentation.

All of the models, even the minis, are amazing time savers for all kinds of coding. I use it like pair programming: I do what I’m good at and let the model do what it’s good at. It takes some time working with one to see where it fits.

4 Likes

I wouldn’t describe Claude’s output as “working software”. Well, OK, it might work. But it’s far from “production ready” (despite their fondness for that phrase). If guarantees of working software are required, it’s not replacing anyone. The hype (and the valuations) are greatly exaggerated, and unjustified.

But they are remarkable tools. The breakthrough for me was realising that the standard I had hitherto been demanding from generative AI was one I actually quite rarely achieved myself.

I’m interested that you picked 2). Porting a project that has an extensive existing test suite, especially one containing language-independent end-to-end tests, is one of the more reliable applications of GenAI. When I’ve done that, I just attribute the original author (with mistakes and omissions being mine). The only advantage (instead of simply using the original app) was that I could then review and audit Python code instead of Go code.

I strongly suspect that when generative AI does a good job generating code, the task was remarkably close to something it was trained on, on the internet or off it. And I’d prefer to go direct to the original source, especially if it’s open source. That said, when I’m in an unfamiliar domain, I’ve found they’re really useful for suggesting libraries, and for config-heavy code (it’s far harder to get config wrong unnoticed than to get code wrong).

As for 5), well you check it. E.g. you trust your tests.

6 Likes

I can tell you that for mathematics it is a pain to use. I asked it to prove some inequality without calculus, one of those that compare a logarithm to a rational function. In very eloquent terms it mentioned almost all the keywords (AM-GM, Jensen’s inequality) that can be used to prove it without calculus, then gave an argument containing a step that was wrong. It never used any of the techniques it had mentioned, and never mentioned Bernoulli’s inequality, which is another possible tool. Every time I pointed out where the argument was wrong, it modified it and convincingly gave another wrong argument. Sometimes it would use a previously proven inequality with the opposite sign; other times it would simply claim to have deduced the wanted inequality, but hadn’t.

I myself would rather have a collaborator that doesn’t know as much but doesn’t pretend to. Even among humans I avoid that type of charlatan, the ones who are very good at speaking.

It was also not able to give any useful idea on some simple instances of the Post correspondence problem, and gave me wrong answers about who wins in games of Nim.


I had forgotten the concrete example. Found it: x\ln(x)\geq -\frac{1}{e}. You could try to see if Claude has already learned how to do it properly.
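(For reference, one calculus-free route uses only the tangent-line bound e^u \geq 1+u, which can itself be obtained as the limiting form of Bernoulli’s inequality. Take u = -1 - \ln x for x > 0:

e^{-1-\ln x} \geq 1 + (-1 - \ln x) = -\ln x, \quad\text{i.e.}\quad \frac{1}{ex} \geq -\ln x.

Multiplying by x > 0 gives x\ln(x) \geq -\frac{1}{e}, with equality exactly at x = \frac{1}{e}.)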

The specific Nim: Who wins in a game of Nim with heaps of sizes 3434324243, 42343424242424224253456546664 and 55345345345345433?
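For what it’s worth, that Nim question has a mechanical answer via Bouton’s theorem: in normal-play Nim the first player wins exactly when the XOR (the “nim-sum”) of the heap sizes is nonzero. A quick check in Python:

```python
from functools import reduce
from operator import xor

heaps = [3434324243, 42343424242424224253456546664, 55345345345345433]

# Bouton's theorem: the first player wins iff the nim-sum is nonzero.
# Here the middle heap is astronomically larger than the other two,
# so its high bits cannot be cancelled and the nim-sum is nonzero.
nim_sum = reduce(xor, heaps)
print("first player wins" if nim_sum else "second player wins")
```

(which prints “first player wins” for these heaps)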

And the Post correspondence problem, that it did solve after many hints: Find a solution for the Post correspondence problem for the dominoes (abc, ab), (ca, a), and (acc, ba)
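PCP is undecidable in general, but tiny instances like this can be brute-forced, so you can check a model’s claims independently. A quick sketch (the helper name and the search bound are my own inventions): a bounded breadth-first search over domino sequences, tracking the unmatched surplus of the longer side.

```python
from collections import deque

def pcp_solve(dominoes, max_steps=12):
    """Bounded breadth-first search for a Post correspondence solution.

    dominoes: list of (top, bottom) string pairs.  Returns a list of
    indices whose top and bottom concatenations are equal, or None if
    no solution uses at most max_steps dominoes.
    """
    # A state is the index sequence so far plus the unmatched surplus;
    # at most one of top_extra / bot_extra is non-empty at any time.
    queue = deque(
        ([i], t[len(b):], b[len(t):])
        for i, (t, b) in enumerate(dominoes)
        if t.startswith(b) or b.startswith(t)
    )
    while queue:
        seq, top_extra, bot_extra = queue.popleft()
        if not top_extra and not bot_extra:
            return seq  # both strings match exactly
        if len(seq) >= max_steps:
            continue
        for i, (t, b) in enumerate(dominoes):
            nt, nb = top_extra + t, bot_extra + b
            if nt.startswith(nb) or nb.startswith(nt):
                queue.append((seq + [i], nt[len(nb):], nb[len(nt):]))
    return None

# Sanity check on a classic solvable instance from the PCP literature:
demo = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]
seq = pcp_solve(demo)
assert "".join(demo[i][0] for i in seq) == "".join(demo[i][1] for i in seq)
```

Running `pcp_solve([("abc", "ab"), ("ca", "a"), ("acc", "ba")])` then searches the instance above for any solution up to the bound.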

3 Likes

I think the best way to get a sense of what models can do is to play with them! They are powerful and general tools, very useful today and getting better quickly.

Recommend playing around with both chat (ChatGPT / Claude) and with the “agentic coding” products Codex / Claude Code where models can run code and take actions. Codex is running a promotion as of yesterday where you can use it for free.

(disclosure: have worked at OpenAI since 2020)

3 Likes

Do you have any advice for how to do this safely or can you recommend a good guide?

I would really like to try a terminal agent like claude-code but right now I don’t even trust these things enough to try them. I have just about managed to convince myself that it is okay to have GitHub Copilot or Supermaven running autocomplete in vim provided it is turned off by default and I think I trust that it is only reading the files that I think it is reading (the ones that are open in vim when it is running). In general though I have a high level of distrust not just for the LLMs but also (no offence intended) the companies that produce the commercial LLMs and the associated client software. That level of distrust goes as far as not wanting to give general read access to the filesystem in any of my normal computers.

I was thinking maybe I could do something like make a virtual machine and then share a directory (read and write) into the VM, so that I have an upper bound on the damage that can be done. That would just be the starting point though; from there I want more fine-grained control over what the agent can do inside the VM and in that directory, and I definitely don’t want it doing things like git push.

I assume that people working at companies like OpenAI have established robust methods for sandboxing these things so how exactly does that work?

1 Like

One point that hasn’t been addressed in this thread is that these things only actually work when used by someone who knows what they are doing. I’m sure that you (Skip) could use these things to do something useful but the idea that someone who doesn’t know how to program can use these tools to produce good code is demonstrably false.

You asked “Who checks Claude’s work?” and the answer is that you do – and if you don’t know how to do that, the code will be riddled with problems. Sometimes that is fine, e.g. if it is just a toy prototype, but if the code is not supposed to be riddled with problems then you have to check Claude’s work. Then you either tell Claude to fix the problems or manually edit the output (I recently saw someone call this “self-adjusting”). From talking to others I think it is quite common to hit a point where you’ve prompted again and again and then you say okay, git reset --hard, and just type the code yourself, because that is faster than checking the AI output.

Yes, if you just want a few lines of boilerplate for some common task then it can be fine but creating “working software” and doing “advanced coding” are not things that a novice can do just by using AI. I’m also not convinced that this is something that will change any time soon because always somewhere there needs to be a human in the loop who has enough understanding of what is going on in the code and what it is supposed to do in the real world.

From reviewing open source PRs I have seen many examples of people with different levels of ability/experience using different kinds of AI and most of it is bad. Novices produce garbage that might be randomly correct and more experienced people make mistakes that they would never have made if not using AI.

It doesn’t even work to say that novices can use AI to produce something that more experienced people would review and give feedback on. Honestly, reviewing many PRs now is like prompting a broken LLM. The human on the other side actually makes it worse than using an LLM directly, because they garble the review feedback when prompting the LLM.

AI spam is a growing burden on open source that actually threatens the whole model of open-to-anyone contribution, where maintainers attempt to review code submitted by anyone. There was a Register article about this yesterday:

4 Likes

I use the web interface quite frequently – not Claude specifically, but the same logic applies: the web interface only has access to what you write there, what you paste in, and the files you deliberately upload to it. I do presume that everything will be used for further AI training, quite regardless of what these companies promise. But I don’t think I really need to care (with this use pattern).

It’s the same with the fact that you should assume anything you write to or from a @gmail address (or an email managed by http://outlook.office365.com) may be read by ICE/FBI/CIA agents. (I think EU citizens specifically are probably safe from having their mail read by google/microsoft engineers, but that’s about the limit of the privacy I expect.)

It’s annoying and a bit scary if you think about it, but most of what we do isn’t actually sensitive.[1]


  1. Which doesn’t mean that I don’t try to nudge people to communicate with me via channels that are properly private ↩︎

2 Likes

That’s true and is why I think that I could let Claude see a lot of the files that I spend a lot of time working on much as GitHub Copilot is allowed to see those files.

All of my computers contain at least some data that I have a professional and legal responsibility for though and I think that just running claude on any of them without proper sandboxing would be at least a minor breach of those laws, regulations and policies.

I have seen people running claude --dangerously-skip-permissions and even running that in a loop. Honestly I’m interested to see what you can do with that but it definitely needs sandboxing.

Don’t forget CCP!

I ran it in a docker container with only the project I was working on mounted, so that – assuming it hadn’t found a way to break docker’s sandboxing – that was all it could see or touch. Claude Code is all terminal based anyway, so it didn’t hamper the usability. I’ve never trusted it enough to let it see code that wasn’t already public.
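A minimal sketch of that kind of setup, assuming the npm distribution of Claude Code (image, package name and env var are best-effort from memory – check the current docs before relying on this):

```shell
# Run the agent in a throwaway container that can only see the current
# project directory; nothing else on the host filesystem is mounted.
docker run --rm -it \
  -v "$PWD":/work -w /work \
  -e ANTHROPIC_API_KEY \
  node:22 \
  bash -c 'npm install -g @anthropic-ai/claude-code && claude'
```

Tightening it further with `--network` restrictions or a read-only root filesystem is possible too, at the cost of some convenience.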

1 Like

What do you mean by “LSP”? The only expansion I found for programming is “Language Server Protocol”, but “your LSP” is then puzzling since LSP is supposedly a fixed entity.

1 Like

Ah, I stand corrected. Thanks for the hints as well.

1 Like