
Inflated Engine Evaluations

@jomega said in #35:
> Is the point to see if Stockfish ever has a classical eval that is a pawn or more based on something other than material?
> It does.
>
> A KPPkpp endgame where SF gives Black a 'static' eval of -1.62.
>
> lichess.org/editor/8/8/8/4k2p/p6P/4K2P/8/8_w_-_-_0_1

Yes, it was. And thank you for that example. I have not read more of this post (and still need to read the rest of the thread), but I would like to examine it further to see which components add up to about two pawns' worth of material count.
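
To make that concrete, here is a minimal sketch of how I would pull the term-by-term static evaluation for that FEN from a local Stockfish binary. It assumes a classical-capable build on the PATH that accepts the informal console command `eval` (true of many versions, but the printed breakdown is not a stable interface):

```python
# Sketch: ask a local Stockfish binary for its static-eval breakdown of the
# quoted KPPkpp position. Assumes a classical-capable build on PATH that
# supports the informal "eval" console command; the format of the printed
# table varies by version.
import subprocess

FEN = "8/8/8/4k2p/p6P/4K2P/8/8 w - - 0 1"

proc = subprocess.Popen(
    ["stockfish"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate(
    "uci\n"
    "setoption name Use NNUE value false\n"  # only meaningful on builds that still ship the classical eval
    f"position fen {FEN}\n"
    "eval\n"
    "quit\n"
)
# Print whatever breakdown the engine reports (material, mobility, passed pawns, ...)
print(out)
```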

Now, this is a limit case. Depending on whether it falls under one of the few categories of endgame positions handled explicitly (with some algebraic formulation), it may not be as unsupportive of my argument as it looks. I would like earlier examples. Perhaps composing positions that complicate this one, so as to get out of the special explicit cases of the static evaluation, would give a better handle on the comparative "dynamic range" of the positional components versus the material-count components (not chess dynamics, but the engineering sense of a response surface as one varies a dependent variable, parameter, or input signal).

What is the maximal response range of the combined purely positional components versus the maximal range of the material-count part of the combined SF score, outside the special endgame cases? (How many such special cases are there in SF, and how many should there be in real chess, or even perfect chess? Are they of the same order of magnitude? That is a tangent question for now.)

Perhaps integrating all the components over many positions, classified as purely positional or as high material-count imbalance, would give some aggregate view of the weight of each type of component.
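
As a very rough first pass at that kind of aggregation, one could take a batch of FENs, compute a naive material balance on the board, get a shallow engine score for each, and look at the spread of the residual (score minus material) as a crude proxy for the combined non-material components. A sketch, assuming python-chess and a Stockfish binary on the PATH; the piece values, the depth, and the FEN list are placeholder choices of mine, not anything SF uses internally:

```python
# Sketch: crude split of engine scores into "material" and "everything else"
# over a batch of positions. Piece values, depth, and the FEN list are
# placeholders; the residual is only a proxy, not SF's actual positional term.
import chess
import chess.engine

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material_balance(board: chess.Board) -> float:
    """White-minus-Black material count in pawn units."""
    total = 0.0
    for piece_type, value in PIECE_VALUES.items():
        total += value * len(board.pieces(piece_type, chess.WHITE))
        total -= value * len(board.pieces(piece_type, chess.BLACK))
    return total

fens = [
    "8/8/8/4k2p/p6P/4K2P/8/8 w - - 0 1",  # the quoted KPPkpp example
    # ... add many more, ideally classified by phase / imbalance type
]

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
try:
    for fen in fens:
        board = chess.Board(fen)
        info = engine.analyse(board, chess.engine.Limit(depth=12))
        pawns = info["score"].white().score(mate_score=10000) / 100.0
        mat = material_balance(board)
        print(f"{fen:40s} eval={pawns:+.2f} material={mat:+.2f} residual={pawns - mat:+.2f}")
finally:
    engine.quit()
```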

Why exclude endgame cases? Because I want to focus on more complex situations in the opening or middlegame, where decisions are not clear cut within my planning horizon (mine, to be honest, but I mean anyone not satisfied with their own chess understanding beyond their limited horizon).

Also, because I am fairly sure that having replaced the King's material value (huge, in the past) with a finite catalogue of endgame classes, whose static value is heavily influenced by the naked King (something like that; I don't know how those cases manage to project, collapse, or compress endgame dynamics into a static valuation), can be made to reach arbitrarily high values for those few cases. Is this example such a case, by the way? (I assume I should read the full posts that follow; that is my job for the next few days.)

If this is an endgame case recognized explicitly by the static evaluation in classical SF, then it is not what I was talking about. I am talking about conversion, or compensation-potential arithmetic: what maximum equivalence can there be between a positional compensation for a purely material imbalance and a purely positional set of imbalances? Yes, I mean the components from that table upthread, the ones that never have any material-count tentacle in them, or groups of them if there is too much overlap, i.e. statistical interactions of the component vector over many positions, groups that vary together for example.

My point is that SF's positional evaluation could be approximated by material count alone if integrated over many positions of known classes; that the actual parameters are biased toward material-count signals of imbalance (leaving aside the near-mate currency, which is another type of conversion); and that positions with small material-count imbalance would have a tough time competing with positions of high material-count imbalance.
I am trying to untangle all my points about how the SF static score as a whole is far from treating material and positional conversion as equivalent currencies. Which is about the OP's first sentence: not there yet, I say, and we might be stuck not there.

The whole idea of counting material is that an imbalance should be predictive of a terminal outcome (terminal imbalance value). The point of positional valuation should be, in the limit, to treat any material imbalance as potentially convertible into a positional imbalance (a sacrifice, e.g.) or vice versa, with the ultimate conversion being the terminal currency of W, D, L in that order for a given color. That is another problem in itself (it is not of the same nature; there is no surrogate positional information left to convert once the game is terminated). I personally consider outcome odds, given a pair of playing-quality levels (say perfect versus ok, as usual), to be the least confusing and least phase-dependent one-dimensional score.

So basically there are three currencies to trade into each other: material, positional, and outcome odds (one can turn that W/D/L ordering into a float by linearly combining rewards, if one must). One ought to check whether SF is really there. Who cares? Anyone using the engine as a helper for learning or analysis should. Also, then we would know for sure whether SF has been shaping the playing style of the human population toward its own biases.
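
For what it is worth, the "float from the ordering" is just the usual expected score, 1·W + 0.5·D + 0·L. A small sketch below; the reward weights are standard chess scoring, and the W/D/L numbers come from python-chess's model-based conversion of a Stockfish score, which is itself a fit rather than ground truth:

```python
# Sketch: collapse outcome odds (W, D, L) into a single float by linearly
# combining rewards, i.e. the usual expected score 1*W + 0.5*D + 0*L.
import chess
import chess.engine

def expected_score(wins: int, draws: int, losses: int) -> float:
    total = wins + draws + losses
    return (1.0 * wins + 0.5 * draws + 0.0 * losses) / total

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
try:
    board = chess.Board()  # starting position, just for illustration
    info = engine.analyse(board, chess.engine.Limit(depth=15))
    wdl = info["score"].white().wdl()  # model-based (W, D, L), in permille
    print("W/D/L:", wdl.wins, wdl.draws, wdl.losses)
    print("expected score for White:", expected_score(wdl.wins, wdl.draws, wdl.losses))
finally:
    engine.quit()
```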

I claim SF is still dominated by material count at the static-evaluation level. That is not to say that, combined with exhaustive fast-forward tree search with sparse selection, or with probing with static evaluation at the leaves, this could not capture earlier positional signals; I think it does to some extent (not measurable until a competitor with a different angle enters the pool). And we have seen it improve by patching in NNUE, which can finally handle those previously ignored branches (which is itself testament to the previous bias, by the way). Even if that only postpones the bias by the moderate search depth baked into a network fitted to classical SF itself, it would still pick up some positional currency not yet converted into material at the leaves of the input SF depth.

That is one claim. Now set it aside (I am trying to untangle, but there is also some interaction as long as the first claim is not acknowledged). The next one is more difficult and technical (a statistics and non-linear optimisation concern, but also a matter of rigor, or consistency, at the level of chess theory).

I also briefly touched on the problem of separation, which is related but different, yet still shows that the positional interpretation of the SF score is limited. One would have to have access to the whole table of components to make a correct judgment, and one could clean up all those statistical interactions by doing global optimisation, using random positions covering enough of position space (without the selection bias of tree search, AB-engine style). The current "one parameter at a time since the beginning of SF history" approach optimizes only on tree-search-selected positions (selected for parsimonious use of the full static evaluation). The optimization gets most of its juice from positions that already have a high potential for response differences, and if the existing set of parameters, minus the one being updated, is already biased toward material count, I wonder what chance that leaves for the positional signal to come out of the optimisation. The material domination potentially perpetuating itself is one problem. It could be addressed by an automated tool that exposed the purely positional components (perhaps their sum) alongside the whole score. But then, within that many-component purely positional output, there is still some cleanup to do; that is the second problem. Finding the lucky components that were not made fuzzy by later-added components, or that were already well defined in chess theory AND happen to share no correlation with other well-defined factors, might be a first job.
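
In the spirit of that automated tool, here is a toy sketch of what the split could look like once the component table is available in machine-readable form. The component names and values below are entirely invented placeholders for illustration; classical SF's real trace has its own term names (and a midgame/endgame pair per term):

```python
# Toy sketch: given a (hypothetical) table of static-eval components, report
# the material part, the sum of the purely positional parts, and the total.
# All names and numbers here are invented placeholders, not real SF output.

components = {
    "material":       -0.10,
    "mobility":       -0.55,
    "passed_pawns":   -0.70,
    "king_safety":    -0.15,
    "pawn_structure": -0.12,
}

MATERIAL_TERMS = {"material"}

material_part = sum(v for k, v in components.items() if k in MATERIAL_TERMS)
positional_part = sum(v for k, v in components.items() if k not in MATERIAL_TERMS)

print(f"material part   : {material_part:+.2f}")
print(f"positional part : {positional_part:+.2f}")
print(f"total           : {material_part + positional_part:+.2f}")
```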

I think one needs to build well-characterized databases of positions in order to study the type of question I am asking. Just one example is not enough; it is something, though, a proof by existence, forcing me to be more precise about my claim while putting everyone to sleep... Now I will shut up for a few days and read this great opportunity of a thread. Thank you, OP. Sorry to impose my prose this much; this has been a long-time itch of mine (and I am not sure I am alone). Lots to read, though. Sorry if I repeated arguments already made, or have not considered them yet; that may come later.
@Sarg0n said in #1:
> Engines don't just count the sheer material but also take positional considerations into account. We all know that.
>
> Forty years ago we had something like: 1 pawn is something close to +1, depending on the positional factors. In the last couple of years these evaluations have skyrocketed. Why?
>
> Engines calculate very deeply and therefore assess the winning prospects very highly; there is hardly any connection to the material involved if there is a winning position.
>
> Maybe a total switch to winning probabilities is the better way? Like AlphaZero does? A paradigm shift.

OK, I only just now started reading the thread (having spat out most of my accumulated thoughts impulsively; calmer now).
And thank you very much for the rest of your post. I agree with the final sentence; I have the same preference hidden in my walls of text, although I tried to lay out the various components of the possible paradigms floating around. I call them currencies.

I also wonder about those high values. And you have in the past shown some virtuosity in getting SF to display such an interesting range in its scoring dynamics. It may very well be that NNUE, with its solar-flare glimpse into a moderate-depth future given by its pre-game training against classical SF itself, pushes back to earlier middlegame depths the positions with low signal that the search would previously have ignored within its tree. It can now find material conversions even further ahead, but also those endgame cases that replaced the previously single material-equivalent value of the King (if my take-home understanding of the history is right: from an infinite King value for any mate position, it has now been made more subtle). Assuming no tablebase, of course, only the endgame cases. I think there would need to be a lot of endgame cases for material-dominated scoring to dock with the complexity of chess positions (or perhaps even with position counting, although I suspect that once the paradigm shifted, it would be easier to get a correct balance among the currencies, skipping centipawns altogether for the positional components). Centipawns out the window.
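
On the "winning probabilities instead of centipawns" side, the mapping lichess documents for its accuracy metric is, as far as I know, a simple logistic squash of the centipawn score; a sketch below. The 0.00368208 coefficient is the one I believe lichess publishes, so treat it as an assumption to double-check; any similar logistic makes the same point:

```python
# Sketch: map a centipawn evaluation to a winning-probability-style number.
# The 0.00368208 coefficient is the one lichess is believed to use for its
# accuracy metric (an assumption); any similar logistic serves the argument.
import math

def win_percent(centipawns: float) -> float:
    """White's winning chances in percent, from a centipawn eval."""
    return 50.0 + 50.0 * (2.0 / (1.0 + math.exp(-0.00368208 * centipawns)) - 1.0)

for cp in (0, 100, 162, 300, 600, 1000):
    print(f"{cp:+5d} cp  ->  {win_percent(cp):5.1f}% for White")
```
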
@biscuitfiend said in #11:
> In the specific position you gave, Black has way better piece activity and in many lines, if White recaptures on e5 for example, then White is losing the Knight on e2. Does that not explain the evaluation?

Is this reflected in SF's internals by the static evaluation applied at the position for which you can pass that judgment using the characteristics you invoke? (Sorry for the weird phrasing; not trying to be confusing, just not agile.)

I am not yet caught up with the sequential reading of this thread, but your post seemed self-contained. I also don't know whether the roughly 50-category component spectrum at that position can be mapped onto the features you perceive; it would be instructive for me to watch you struggle with that (kidding a bit).
Just thinking back: I would not expect a non-material imbalance to always match some material imbalance somewhere down the line, although technically a really good static evaluation function might be able to evaluate that from the position preceding the imbalance, if static characteristics of that position could be found that predict the material imbalance. So the quality of the knowledge about the current position contained in the static evaluation function (however complex, and including the full set of parameters chosen) may have something to do with the maximum possible currency exchange and compensation balance.

But we knew that. So I think one ought to quantify the first statement of the OP, which I now think is not the main point of this thread. Actually, all my babble comes down to this: why are we scratching our heads over surrogate currencies in units of material count or otherwise, when we should focus on quantifying their value for predicting outcome odds? That would be a less confusing basis, and the discussion would start on healthier quantitative ground, closest to the basic termination rules of chess (reminder: there is no material scale of piece values in the rule-set, only mobility rules stated in the abstract, as if no other units were on the board besides the one in question and its class).

It seems that some people think NNUE is the reason for those new high score values that don't seem to be about material imbalance at all. I read above about offering the classical evaluation in parallel; however, to be sure, we would also need to control the full PV depth associated with each score, down to the position actually evaluated by either NNUE or the classical eval. If the tip of that PV has been evaluated, and lichess were to display it (with an optional user-created variation dump), then one could call another classical (non-NNUE) search on that tip as a new root, with some moderate go-depth parameter (or lichess's own default), and find out whether that explains the NNUE eval at the tip. That would in turn provide a new PV from the tip of the previous PV, and a new tip with an actual static evaluation from classical SF. We would then have a better understanding of the balance between the material and positional (non-material) components, with everything expressed in terms of the classical SF static (full) evaluation.
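
That re-rooting experiment does not actually need lichess to change anything; it can be scripted locally. A sketch, assuming python-chess and a Stockfish build that still exposes the "Use NNUE" UCI option (newer releases dropped the classical eval, in which case the second pass would need an older binary):

```python
# Sketch: analyse a position with NNUE, walk to the tip of its PV, then
# re-analyse that tip with the classical evaluation as a new root, to see
# whether the classical score at the tip accounts for the NNUE eval.
# Assumes a Stockfish build that still offers the "Use NNUE" option.
import chess
import chess.engine

START_FEN = chess.STARTING_FEN  # replace with the position under discussion
DEPTH = 18

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
try:
    # Pass 1: NNUE search from the original root.
    engine.configure({"Use NNUE": True})
    board = chess.Board(START_FEN)
    info_nnue = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
    pv = info_nnue.get("pv", [])
    print("NNUE eval at root (cp):",
          info_nnue["score"].white().score(mate_score=100000))

    # Walk to the tip of the PV.
    tip = board.copy()
    for move in pv:
        tip.push(move)

    # Pass 2: classical search with the PV tip as the new root.
    engine.configure({"Use NNUE": False})
    info_classical = engine.analyse(tip, chess.engine.Limit(depth=DEPTH))
    print("Classical eval at PV tip (cp):",
          info_classical["score"].white().score(mate_score=100000))
finally:
    engine.quit()
```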

And if I am wrong, and SF has behind the scenes changed which target scoring oracle the master network upstream of the current NNUE nets is trained to approximate, and is no longer using classical SF at moderate depth as the supervisor, that would show too.

Lichess can actually help SF, and help us curious users who want to fully interpret SF as an analytical tool: just allow more control over what SF already provides. Full PV, cloud on or off per user request, and NNUE on or off. Some of those are already there.

Then the call for a more stable currency for a FEN scoring system would have all the data it needs at population level.
@dboing To be honest, I don't know why everybody is getting so hung up on static evaluations. People should know better. Up until neural nets came along, engines have always worked by evaluating positions several moves ahead, i.e., at some nonzero depth. Evaluating the position on the board is useless, because one side might have checkmate in 8. For a long time that was basically the whole point of using an engine.

My comment takes into account one tactical aspect of the position, which is that white is either losing their knight or getting totally dominated in the centre (and probably losing just as much material anyway). In that respect, I still think it's the best explanation that's been provided thus far, though I have no idea why a veteran like Sarg0n didn't come up with it himself.
@biscuitfiend said in #46:
> @dboing To be honest, I don't know why everybody is getting so hung up on static evaluations. People should know better. Up until neural nets came along, engines have always worked by evaluating positions several moves ahead, i.e., at some nonzero depth. Evaluating the position on the board is useless, because one side might have checkmate in 8. For a long time that was basically the whole point of using an engine.
>
> My comment takes into account one tactical aspect of the position, which is that white is either losing their knight or getting totally dominated in the centre (and probably losing just as much material anyway). In that respect, I still think it's the best explanation that's been provided thus far, though I have no idea why a veteran like Sarg0n didn't come up with it himself.

The problem might be that the only static evaluations being done (for the PV finally chosen) are far outside human ability to see, deep in the human fog.
Humans are curious. I know I am, and part of my curiosity is to know how chess works, not just to receive from above that there is a material conversion to be had as an objective, outside my ability to see from my current position through a bush of branches all the way to where the fast-forwarding engine finds its holy grail of "hey, I found a mostly material conversion resulting from a decision at your current position". Or, if you are near mate (near in the fast-forward sense, plus the solar-flare glimpse), something luckily falls into an explicitly covered case that signals back up the minimax that it matters more than all the silly material grabs in the same bush.

I guess you are right. If one only wants a black-box answer (which is not the monopoly of NN parameters), and no critical thinking or interpretability in human terms, then what are we even talking about? Who cares about the score amplitude; we only care about the ranking, and those values are there so that we get the correct ranking. Centipawns or anything else does not matter; any arbitrary scale that produces the same rank order would do.

Well, maybe that rank order is biased, and the numbers used to build the ranking are just symptoms that humans can detect through their random exploration of chess space here on lichess (which I love, and which should continue to make its database public in all forms; this is a one-of-a-kind online opportunity for chess science, excuse my grandiloquence).

But I have not read your earlier comment yet, sorry. I tend to go from the general to the more specific. I will reread your post more carefully and retrace my backlog on this thread; maybe by looking at your sub-thread discussion I could illustrate or adjust my answer.

I am not sure in which direction you mean, because the expectation that engine help means scouting ahead (mate in 8) might come from the times when the tips of the PV were not as far away as they are today. I don't know enough about that history.
@dboing To be completely honest I have no idea what you're talking about. My point is just that if you ask an engine for help then you shouldn't complain when it calculates some number of moves ahead. That's literally the only way to determine which side has an advantage: calculate to try to figure out what should happen next. Chess is not static.
@biscuitfiend said in #48:
> @dboing To be completely honest I have no idea what you're talking about. My point is just that if you ask an engine for help then you shouldn't complain when it calculates some number of moves ahead. That's literally the only way to determine which side has an advantage: calculate to try to figure out what should happen next. Chess is not static.

Our impulsive intuition thinks otherwise. I would ask a lot of GMs whether they really need dynamic depth to generate the few candidate decisions to ponder. Offline calculation (not in-game) might help, or calculations from previous games; this nuance is crucial for understanding the difference between an online-only engine and one that accumulates experience itself (not via its programmers' experience, but by itself; not a main point here, and a deliberate dichotomy, since AB engines have also been trying to improve, beyond human recognition, the amount of offline knowledge contained in the static eval, which is what I was criticizing about the OP's opening statement). And as humans we are hardwired to keep letting that knowledge influence our calculation (not well without much experience, guided or not, even having had many full years to spend on it).

AB engines are (or at least initially were) betting on calculation alone to figure out remote solutions (with lots of width initially), probably content with a poor-quality static evaluation being compensated by the sheer exhaustiveness of the cast net.

We as humans always have some subconscious (along with conscious) evaluation of the position in front of us going on. For learning how chess works, a helper that gets close to such a thinking model would be more appreciated than a divination engine.

I would like to learn from the helper engine, to the maximum extent possible, where its divination comes from. I am curious that way. And if it relies on static evaluations far beyond my fog, even with the claim that the search was exhaustive, my take-home experience from having used such a limited but oracular helper would not carry me very far in my progress toward understanding chess. I would like to be able to interpret. And since there has been an attempt to include, one at a time, features from human chess theory, I assume I am not the only one who wants some interpretability.

I don't have any other thoughts. I may just be used to the engine being part of the scenery, and as with anything taken for granted, my distractibility ends up wondering where it comes from. Natural curiosity. Questioning common sense is a habit, as I have met many common senses in need of fresh air along my life trajectory (or I imagined it).
