Claude Code – how much hype, how much true wizardry?

I’m just always going to be wary of tools that pretend they’re anything more than tools. My linter doesn’t compliment me and my screwdriver never sings my praises. Every interaction I’ve had with one of these things leaves me with an ick. Like something has tried to take my soul just a bit.

But that might just be me. I definitely prefer my conversations to be with people and not machines.

4 Likes

Whereas I no longer think of it as “a tool”, but as a collaborator.

Mine either - but they never give me valuable insights into technical problems either :wink:

Easily fixed: tell the bot you don’t need, or want, praise. They’re very adaptable.

They’re not - but don’t trust your screwdriver :wink:

If I knew a person I could call at 3am to explore a conjecture about suffix arrays, I probably would too :smiley:

2 Likes

Given your threat model, I think I’d recommend trying out Codex Web (or the Claude equivalent). The agent will run in the cloud and will only have access to things you’ve explicitly given to it. Network access is cut off to the cloud container once the agent starts running.

See also:

If web isn’t flexible or fun enough, I would try running Codex CLI locally inside a container. This prevents read access to your file system, and you can use Codex CLI’s sandboxing to deny network access. You can use the codex sandbox subcommand to test for yourself what the sandbox restricts (the details of how the sandbox is implemented differ by OS). Since you mention distrusting the client software, if it’s any consolation, note that Codex CLI is open source at GitHub - openai/codex: Lightweight coding agent that runs in your terminal

There’s a bunch more stuff discussed in this thread, here are some quick thoughts:

Things are changing very quickly. Personally, I find there’s been a step function change in what OpenAI coding agents can do since the release of GPT-5.2-Codex in mid December: https://openai.com/index/introducing-gpt-5-2-codex/ . That’s not a lot of time!

I agree and am annoyed that there are more rubbish PRs being put up than ever before, but I certainly wouldn’t judge what today’s tools are capable of from random kids using random models to make drive-by PRs to solve issues they don’t understand.

Personally, there are certain kinds of code I have no intention of ever writing myself again, and quick hacks I spin up that I would otherwise never have done.

One thing I’ve gotten a lot of value out of is having automatic Codex code reviews at work. This is by far the best code review experience I’ve ever had in my life. Setting this up on a project you work on is a great way to get a feel for how much you can trust them. (Note there is a lot of variance in AI coding review products, e.g. I don’t think the Github one is much good)

Think of the agents as strange alien interns.

There are things you trust an intern to do, things you don’t quite yet, and you have to figure out what the appropriate level of oversight is for a task (and it isn’t always “read every line”). Be a little patient and remember that if you never give your intern feedback, it probably won’t know how to be better. But overall interns learn and grow and you hope they’ll join full time when they graduate. These interns are also alien! All of us are still discovering how best to collaborate and communicate with them and leverage their relative strengths. Giving them the right tools or the right prompt can still be the difference between something sloppy mid and something that is state of the art.

(And of course like any alien encounter, a lot of us humans are suspicious, the aliens and their inner workings are often misunderstood, we’re distrustful of the companies that provide access to these aliens, the aliens bring offerings that we’re still trying to figure out how best to use, some humans have decided to worship the aliens, etc etc)

On the environment, here’s a thing I’d written last year that still holds up: WaPo is very wrong on ChatGPT energy use

3 Likes

Thanks that does help. Really what I wanted here was just someone that I trust (e.g. you) to give me some reassurance and advice about this and so now I have tried the codex CLI. With 2000 tokens I got:

$ codex "what is in this directory"
╭─────────────────────────────────────────────╮
│ >_ OpenAI Codex (v0.98.0)                   │
│                                             │
│ model:     gpt-5.3-codex   /model to change │
│ directory: /stuff/vms/tmp                   │
╰─────────────────────────────────────────────╯

  Tip: New Try the Codex App with 2x rate limits until April 2nd. Run 'codex app' or visit https://
  chatgpt.com/codex


› what is in this directory


• I’ll list the contents of /stuff/vms/tmp now and report exactly what’s there.

• Explored
  └ List tmp

─────────────────────────────────────────────────────────────────────────────────────────────────────

• /stuff/vms/tmp currently contains only:

  - .
  - ..

  So there are no files or subdirectories in it right now.
Token usage: total=463 input=320 (+ 15,744 cached) output=143 (reasoning 54)
To continue this session, run codex resume 019c3893-fb85-7221-b769-1c8043bf4603

Okay so it can run ls. If I ask it what is in the parent directory it can run ls there as well though.

Is there a command/config that restricts access only to the current directory? Does the sandbox do that?

It seems like that should be the default behaviour but maybe most people have very different ideas from me about how they want these things to work. I guess I’m probably not going to be happy with much less than VM/container.

To be clear, I also see good PRs from people who are using AI (including Codex). What makes the PRs good is the human, though, whether they use AI or not. The problem with the bad PRs is mostly that being able to use AI makes inexperienced people over-confident.

3 Likes

The issue isn’t with amortized usage, though. The issue is that these data centers consume city-scale amounts of energy and in many cases can’t operate comfortably within existing grid budgets, which leads to some less-than-ideal decisions by operators.

Texas is planning to almost double methane generation for datacenters, and that still doesn’t cover the planned outlay of almost 250GW.

If it wasn’t energy intensive, then why are data centers using as much power as a city? And why is the energy budget they’re requesting almost 4x the existing grid budget?

3 Likes

I do wonder to what degree the energy cost is a feature of latent demand. That is, even if each inference run were as cheap as a web search (it’s not), by making it really easy to ask computers to do things we increase the number of requests dramatically.

Looking at aggregate costs (e.g., grid requirements) seems much more reliable than trying to price things per request or per token.

I agree that we shouldn’t judge the tool capabilities based on slop PRs on GitHub. However, I do hold the AI companies themselves responsible for the harms (especially the trivially predictable ones) caused by their products.
The ethical questions around this tech include all of the new modalities of spam we have to deal with.


One thing which is sort of unfortunate about the current landscape is that GitHub is providing free access to what seem to be the lowest grade versions of these tools. I tried their automated review once or twice and found it worse than useless, for example. I tried copilot in vim for a few weeks but it got in the way when writing comments something awful and was only really good at making parameter lists for tests. These sorts of experiences leave me very unenthusiastic about the tools, and give a (probably unfairly) poor impression of what they might be capable of doing.

10 Likes

Not to mention that we interact with these tools in much deeper ways than Google or Bing. A Google search for me is most often something like “python argparse documentation” or “Home Depot address”. Relative to my recent interactions with Claude, they are no more complex than home_depot["address"].

4 Likes

There isn’t a setting that restricts reads to the current directory; a blanket restriction like that would break most tool use. You’re then playing a hole-punching game (and Codex CLI should probably gain some more knobs here), but running inside a VM/container is going to be the robust thing.

providing free access to what seem to be the lowest grade versions of these tools

Yeah, this is a bit of a problem in general. I felt this way about Google “AI overview” as well.

If it wasn’t energy intensive
I do wonder to what degree the energy cost is a feature of latent demand

It will be energy intensive, which is why it’s important to have informed discussion, and not have our numbers off by several orders of magnitude.

Note that it would take >$10T to build out 250GW of compute. There isn’t that much money in AI capex today, certainly not in just Texas. What you’re seeing is that you need to secure energy capacity several years out, supply is limited, so every company is making requests to see what they can get.
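For what it’s worth, the arithmetic behind that figure can be sanity-checked in one line (a back-of-envelope sketch; the ~$40-per-watt capex number is an assumption implied by the $10T/250GW ratio, not a sourced figure):

```python
# Back-of-envelope check: at an assumed ~$40 of capex per watt of
# AI datacenter capacity, building out 250 GW costs about $10T.
capex_per_watt = 40            # dollars per watt (assumed, not sourced)
target_capacity_watts = 250e9  # 250 GW expressed in watts

total_capex = capex_per_watt * target_capacity_watts
print(f"${total_capex / 1e12:.0f}T")  # → $10T
```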

The thing that is hard to reason about regarding aggregate costs is that capacity planning today is for the models and the usage of a few years from now. The bar we should set and expect is that as a society we get more out of these larger investments. We don’t need that much energy to write emails. We need more than that to use Codex (an amount that feels commensurate even today, not totally incomparable to using my laptop for the task, and getting cheaper quickly). But the buildout is for AI medicine and AI science and AI knowledge work.

Looking at aggregate costs / by making it really easy to ask computers to do things we increase the number of requests dramatically

This is a reasonable point, but it’s hard for me to know what to do with aggregate costs. Cement is something like 8% of the world’s emissions. I don’t know what to do with that number, or whether it’s good or bad. But I can see what my per-trip carbon cost of using an aeroplane is, and I don’t travel when I feel that’s not worth it. I look at this from a cost/benefit point of view, so I need to know what the benefits are.

less than ideal decisions by operators

Agreed that running unpermitted generators is unconscionable and illegal.

4 Likes

Exactly so! The models do learn and adjust to your style too. Their failure modes are also remarkably human in my experience.

Not long ago I made a conjecture about suffix arrays. The bot said, “I’m sorry to tell you, but that’s wrong”. Really? How about a counterexample, then? It constructed one, and then had to admit the conjecture held in that case. Then another attempt, and another, all ending the same way. Finally it settled on a “meta example”, with literal ellipses where “the details don’t really matter here”. But they did matter, and I could find no way to “fill in the blanks” that worked to refute the conjecture.

It didn’t get frustrated and neither did I. It’s interesting that it failed! And then the breakthrough: it occurred to me that my conjecture really had nothing essential to do with suffix arrays, but instead - if true - would apply to any sorted list of strings.

Getting “lost in the weeds” of subtle suffix array details is something both the bot and I fell prey to.

Freed of those details, it very quickly became apparent to both of us that the conjecture not only was true, but was more of an observation than a debatable conjecture :wink:

That’s collaboration: talking back and forth, exchanging ideas, seeking deeper understanding.

Another thing I haven’t yet seen mentioned here: nobody has complained that the bots don’t understand what they’re saying. It’s very common among humans for words to be read in unintended ways; that failure is conspicuous by its absence in my bot interactions. Even when my language is sloppy, they almost always infer the true intent. If something is hopelessly ambiguous, they ask for clarification. That’s extremely impressive all on its own.

1 Like

Thanks, I think you’re right. I had to read a docker tutorial because it has been a while but I ended up with

#!/bin/bash

sudo docker run -it \
  -v ~/.codex:/home/appuser/.codex \
  -v /stuff/vms/docker_agents/work:/work \
  docker_agents:0.1.0 \
  /bin/bash

Then inside there I can run codex. The container has git but no ssh keys, so it can commit to the repos in /work, but only I can push, from outside on the host.

The one thing I’m not entirely happy about is giving the container write access to ~/.codex. I’m sharing that directory in to authenticate with OpenAI. It seems better to do that rather than putting an API key in plain text in the docker run script that will inevitably end up on GitHub. If I make ~/.codex read-only then codex fails at startup. Allowing write access feels like I’m potentially letting the AI rewrite its own configuration though.

Maybe people more experienced in docker, authentication etc have better ways to manage this kind of thing (suggestions welcome).

I’m almost at the point where I can ask codex to do something more interesting than ls

1 Like

My 2¢:

I’m a hobby programmer and I’m currently working on by far the
largest project I’ve ever tried. I’ve once (and only once) asked
ChatGPT to write me a function with which I had problems.

ChatGPT gave me quite a good function, but I felt bad about it. I
didn’t write it, not my work, not my idea, not my achievement, not
my own intellect!

I can understand that, for professional programmers, it might be a
good help to let these things do the boring boilerplate.

On TV, there are those mishap shows (yeah, I know: pretty lowbrow).
One of my favorite scenes is a man trying to cut off a large tree
branch. He leans a ladder against the branch, climbs up, and then
uses his chainsaw to cut off the very branch his ladder is leaning
against.

To saw off the branch one is sitting on!

I wonder whether all you professional programmers are just doing
that?! Using coding agents, knowing that the code you let them
generate will be used to improve their coding capabilities in
order to replace you in the near future.

What does every experienced programmer tell novices? You have to
enjoy programming and you have to write lots and lots of code!

Will your coding skills decrease and finally vanish if you use
those “think-for-me” tools long enough?

2 Likes

I can’t speak for others, but mine won’t :wink: But I never ask it to “think for me”, but rather to collaborate with me in exploring a problem domain. I believe there’s ample and repeated evidence all over this topic that nobody advocates for accepting whatever bots say.

For you, don’t miss out on the opportunities here. Experienced programmers will also tell you to read lots and lots of code, written by more experienced programmers.

A bot is a limitless, tireless, source of ideas to explore. It’s never too busy for you, never annoyed by “too simple” questions, never impatient, always willing to follow wherever you lead. Pick a task. Say, binary search over a sorted list. That’s notoriously hard for newbies to get right in all cases. Try it! Ask a bot to critique your code. If the bot thinks you erred, ask it for specific counterexamples. Ask it for better ways to do it, if it knows of any. Play it by ear, and see where it leads.
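For the curious, here is roughly what a correct answer to that exercise looks like (a minimal Python sketch, not tied to any particular bot, with the classic pitfalls called out in comments):

```python
# Binary search over a sorted list: the classic newbie trap
# (off-by-one bounds, infinite loops from not shrinking the range).
def binary_search(items, target):
    """Return the index of target in sorted list items, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:                   # <= so a one-element range is checked
        mid = (lo + hi) // 2          # no overflow concern with Python ints
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1              # mid is ruled out: must move past it
        else:
            hi = mid - 1              # likewise on the high side
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))   # → 3
print(binary_search([1, 3, 5, 7, 9], 4))   # → -1
```

Asking a bot to critique your own attempt at this, then demanding counterexamples for any claimed bug, is exactly the back-and-forth described above.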

You have 24/7 access to an infinitely patient tutor now. Use it, not to write your code for you, but to learn from it. Or you could ask a human expert instead - but they’d just brush you off [1] :wink:


  1. just a fact of life - a consequence of supply & demand ↩︎

3 Likes

I had literally never “conversed” with a chat bot before. I had no idea what a good prompt would be, so I tried just typing as if I was talking to another person. It worked pretty well.

1 Like

Something of a miracle, in fact :smile:. They’ve blown the “Turing test” out of the water, and people barely noticed. Those who did notice mostly reacted, predictably, with “well, just goes to show that the Turing test was much easier than anyone imagined - now that bots have passed it, it’s no longer relevant as an indicator of human intelligence”. Another round of the “genetic fallacy”: dismissing a thing not for what it is, but for its source.

Reminds me of the internet cartoon: “who woulda guessed? The real measure of human intelligence is whether you can select all the squares with a stoplight.” :rofl:.

2 Likes

Ironically, all that captcha data that was ostensibly meant to determine humanness is now the base training set for most of these modern models. At least on the image data side. So the new robots are likely better at solving those captchas than people at this point.

1 Like

Turing believed that machines capable of passing the test would exist by the end of the century - so it’s not that they blew the Turing test out of the water, they have simply achieved what the test was always expected to demonstrate. (And, for what it’s worth, significantly later than he estimated.) It should not be a surprise that chatbots are capable of sounding like humans - I’m more impressed at the number of humans who sound less intelligent than a bot. Yes, phone tech support drones, I’m talking about you, the ones who make me wish I could talk to an IVR system instead.

2 Likes

Depending on your definition of “passing” we’ve had chatbots that could pass it since the 80s. Though, psychological tricks feel a bit like cheating so I can understand not counting things like ELIZA.

That being said, I don’t really think there’s that much difference between something like ELIZA that uses a small hand rolled statistical model to respond and the modern ones that use the same statistical modeling but with billions more parameters.

The fact that scale alone can make it more convincing is interesting, but I don’t think that makes them something fantastically new. Especially since Markov chains have been well understood for over a century now.
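A bigram Markov chain of the kind alluded to fits in a few lines (a toy Python sketch with a made-up corpus, just to illustrate the “same statistical modeling, vastly fewer parameters” point; nothing here resembles how ELIZA or a modern LLM is actually built):

```python
import random
from collections import defaultdict

# Toy bigram model: for each word, record which words followed it.
corpus = "the cat sat on the mat the cat ate the rat".split()

transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start, n, seed=0):
    """Sample up to n next-words by following observed transitions."""
    rng = random.Random(seed)       # seeded for reproducibility
    word, out = start, [start]
    for _ in range(n):
        choices = transitions.get(word)
        if not choices:             # dead end: no observed successor
            break
        word = rng.choice(choices)
        out.append(word)
    return " ".join(out)

print(generate("the", 5))
```

Scaling the same idea up means longer contexts, learned (rather than counted) transition probabilities, and billions of parameters, but the “predict the next token from statistics” core is the same.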

3 Likes

The definition, as I understand it, was that after five minutes of conversation, you wouldn’t be able to tell whether you were talking to a computer or a human. So another way to view this would be to flip it on its head and ask “how long would it take before you know that you’re talking to a machine?”. Basically this: xkcd: Impostor

I’m not sure when exactly we crossed the five minute mark, but we’ve definitely crossed it. Helped not a little by the average person being fairly uninteresting to talk to [1], so a basic conversationalist can come across well.


  1. and remember, 50% of people are even worse ↩︎

2 Likes

That doesn’t sound like a problem to me. You can mount individual files with the same -v ~/.codex/tokenfile:/home/appuser/.codex/tokenfile syntax so, assuming the token is in its own file, maybe just mounting that one file and leaving the rest of ~/.codex in-container only would alleviate whatever you’re concerned about. It’s hard to say without knowing what you don’t want it to do.

Generally speaking, mounting is the only good way to get secrets in and out of docker containers. [1]


  1. where good is defined as doesn’t result in the token being copied all over hidden corners of /var/lib/docker/ but that may or may not bother you ↩︎

2 Likes

I started asking models to add a header comment (brief, author, and date) just so I can remember which model I used. It writes:

@brief   High-precision 2D Geometric Clipping Utilities.
@author  Gemini (AI Collaborator)
@details This implementation provides robust clipping of AcGeLineSeg2d and AcGeCircArc2d entities

‘High-precision’ and ‘robust’…a bit of AI chest pounding? :grin:

2 Likes