PROVIDERS

The following are notes and advice for the providers we have tested.

Anthropic Claude Haiku 4.5

This is the default for Anthropic unless otherwise specified.

We have tested with Anthropic Claude Haiku 4.5 and it has worked quite well for everything we have tried. All cases worked on the first attempt and did not need any adjustments. However, it is more expensive than non-Claude models.

Issues:

If given overly vague instructions, it will sometimes create things to do that have nothing to do with the problem at hand.
On rare occasions, very specific and exacting instructions will cause it to look for unrelated things to do. Rewording the instructions usually helps.

Anthropic Claude Sonnet 4.6

The original default was Anthropic Claude Sonnet 4.6. It worked very well, but it is substantially more expensive than Haiku. There may be cases where it works better than Haiku, but we have yet to find one. Nevertheless, it is fully supported.

Version 4.5 works great too.

Issues:

The same as with Claude Haiku, but more pronounced in creativity (and thus wasted money). Runaway control logic in WWWHerd will keep it from going too crazy most of the time.
On rare occasions, telling it to click a button by just saying the button label, rather than noting that it is a button with that label, can send it on a hunting expedition for anything related to the words in the label. (Admittedly, it can be quite humorous in its efforts.)

For example, this is prone to the problem:

##ACT Click help.

Whereas this is not:

##ACT Click the help button.

Qwen 2.5 VL 72B

OpenRouter model: qwen/qwen2.5-vl-72b-instruct

IMPORTANT: If using OpenRouter, be sure to turn on "Response Healing" in your settings. Qwen is particularly prone to corrupting the response data. Even with this setting on, corruption happens about 1.25% of the time, though WWWHerd's retry mechanisms often recover from it.
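As a rough illustration of why retries help with the corruption rate above, the arithmetic can be sketched as follows. This assumes that corrupted responses are independent from one attempt to the next, which we have not verified, and the attempt counts are hypothetical rather than WWWHerd's actual retry settings.

```python
def residual_failure_rate(per_attempt_failure: float, attempts: int) -> float:
    """Chance that every one of `attempts` tries returns a corrupted response.

    Assumption: each attempt fails independently with the same probability.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if not 0.0 <= per_attempt_failure <= 1.0:
        raise ValueError("per_attempt_failure must be between 0 and 1")
    return per_attempt_failure ** attempts


if __name__ == "__main__":
    # 0.0125 is the roughly 1.25% corruption rate noted above.
    # The attempt counts are hypothetical, not WWWHerd's real configuration.
    for attempts in (1, 2, 3):
        rate = residual_failure_rate(0.0125, attempts)
        print(f"{attempts} attempt(s): {rate:.6%} chance that all are corrupted")
```

If failures cluster (for example, when one endpoint provider is having a bad day), the real rate will be higher than this estimate.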

We have also run both self-hosted and OpenRouter-hosted Qwen 2.5 VL 72B. It works well and is much cheaper than Anthropic Claude. However, cases will typically need more detail in their instructions. Below is a list of known issues related to this.

Self-hosting requires significant hardware to work properly. At the bare minimum, you will need an RTX 4090 GPU and 64GB of memory. Even then, your runs will be fairly slow.

Issues:

Overly broad statements will confuse it to the point that the agent could take an hour just to respond with nonsense. For instance, with Claude you can tell it to "Login to the application" and it will work. With Qwen you will need to tell it to fill in the username and password fields and then hit the login button. Whenever your instructions fail often, try breaking them down into smaller pieces. Given that Qwen is substantially cheaper than Claude, you won't lose much by breaking them down.
Sometimes you will have to tell it to scroll down to find elements because it won't do it on its own. Instead, it will just keep saying it can't find the element until the runaway protection stops it.
It is prone to get "lost"--looping around trying to find something on a page it long ago accidentally navigated away from. Clarifying the logic or trying different wording will help.
The errors related to "No actions generated" or "Error planning actions" are often due to backend problems causing the AI to time out. There isn't anything we can do about that, but unless OpenRouter is having serious trouble, a retry should fix it. NOTE: you can deliberately trigger these errors by giving the case either no discernible logic (crazy talk) or very convoluted logic.
The endpoint providers for OpenRouter are unpredictable in their reliability. Sometimes they are rock solid and sometimes they fail constantly. If immediate retries always fail, then either wait some time or adjust your OpenRouter configuration to force or restrict some providers.
Tab UI elements can confuse it. It is best to tell it to click the tab label text rather than the tab itself.
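The retry advice in the issues above (retry right away, then wait a while if immediate retries keep failing) can be sketched roughly as follows. This is only an illustration, not how WWWHerd implements its own retries: `run_case` is a hypothetical callable standing in for however you launch a case, the use of `RuntimeError` is an assumption, and the retry counts and wait time are placeholder values.

```python
import time
from typing import Callable, Optional

# Error texts the notes above describe as usually transient.
TRANSIENT_MARKERS = ("No actions generated", "Error planning actions")


def is_transient(error: Exception) -> bool:
    """Return True if the error message matches a known transient failure."""
    return any(marker in str(error) for marker in TRANSIENT_MARKERS)


def run_with_retries(
    run_case: Callable[[], str],
    immediate_retries: int = 2,
    delayed_retries: int = 2,
    wait_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run a case, retrying transient failures immediately, then after a pause.

    `run_case` is hypothetical: it should return a result string or raise
    RuntimeError carrying the error text. All defaults are placeholders.
    """
    total_attempts = 1 + immediate_retries + delayed_retries
    last_error: Optional[Exception] = None
    for attempt in range(total_attempts):
        # Once the immediate retries are used up, wait before each new attempt.
        if attempt > immediate_retries:
            sleep(wait_seconds)
        try:
            return run_case()
        except RuntimeError as error:
            if not is_transient(error):
                raise  # Not a backend hiccup: retrying will not help.
            last_error = error
    raise RuntimeError(
        f"still failing after {total_attempts} attempts"
    ) from last_error
```

If the delayed retries also fail, the problem is more likely the case wording or the provider selection than a passing backend issue.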

This is the model we try to make sure always passes the Basic Acceptance Test (BAT). Check it out to see how we worked around model-specific issues.

Qwen 3 VL 235B

OpenRouter model: Qwen/Qwen3-VL-235B-A22B-Instruct

It is starting to work. The problem is that it doesn't handle detailed instructions very well. For instance, telling it to fill out the login form and press the login button will just confuse it, and it will keep trying to fill out the fields. But simply saying "login to the application with username x and password y" works. It is uncertain whether custom prompting inside WWWHerd will help with this. More info to come as we learn more about it.

All other models

No other models have been tested, but there are some promising signs for future models. At a bare minimum, a model must support visual precision (such as reporting exact coordinates in images) for it to have a chance of working.

Note about OpenRouter

The stability of OpenRouter is at the mercy of the endpoint providers. Some are better than others. Some are unpredictable, working great one day and terribly the next. It is very normal for me to have to run the BAT for Qwen on two separate days for it to pass--one day working great and the other not working at all. For Qwen providers, I have yet to find one that is consistently reliable. However, Anthropic models on OpenRouter are usually as stable, or nearly as stable, as Anthropic's own service (noting that Anthropic is one of the providers too).

General information

We have started adding cost visibility and management features and will continue to do so. We have already seen a positive impact, but using AI still costs money. It is important to look at the big picture, though: constantly maintaining your workflows and tests due to breakage is very expensive. I have seen organizations reserving as much as 75% of their test engineering resources to keep QA running. WWWHerd's goal is to get rid of that as much as possible, so engineers can do what they want to be doing--making cool new stuff...