Peer Reviews of Peer Reviews: A Randomized Controlled Trial and Other Experiments – Machine Learning Blog | ML@CMU


Alexander Goldberg, Ivan Stelmakh, Kyunghyun Cho, Alice Oh, Alekh Agarwal, Danielle Belgrave, and Nihar Shah

Is it possible to reliably evaluate the quality of peer reviews? We study peer reviewing of peer reviews, driven by two primary motivations:

(i) Incentivizing reviewers to provide high-quality reviews is an important open problem. The ability to reliably assess the quality of reviews can help design such incentive mechanisms.

(ii) Many experiments in the peer-review processes of various scientific fields use evaluations of reviews as a "gold standard" for investigating policies and interventions. The reliability of such experiments depends on the accuracy of these review evaluations.

We conducted a large-scale study at the NeurIPS 2022 conference in which we invited participants to evaluate reviews given to submitted papers. The evaluators of any review comprised the other reviewers for that paper, the meta reviewer, the authors of the paper, and reviewers with relevant expertise who were not assigned to review that paper. Each evaluator was provided the complete review along with the associated paper. The evaluation of any review was based on four specified criteria (understanding, coverage, substantiation, and constructiveness) using a 5-point Likert scale, accompanied by an overall score on a 7-point scale, where a higher score indicates higher quality.
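For concreteness, a single evaluation of a review in this setup can be thought of as a small record of criteria scores plus an overall score. The sketch below is purely illustrative; the field names and role labels are our own shorthand, not the study's actual data schema.

```python
# Illustrative record for one evaluation of one review
# (field names are hypothetical, not the study's actual schema).
from dataclasses import dataclass

@dataclass
class ReviewEvaluation:
    evaluator_role: str      # e.g., "co-reviewer", "meta-reviewer", "author", "outside reviewer"
    understanding: int       # 1-5 Likert
    coverage: int            # 1-5 Likert
    substantiation: int      # 1-5 Likert
    constructiveness: int    # 1-5 Likert
    overall: int             # 1-7, higher indicates higher quality

example = ReviewEvaluation("author", 4, 3, 4, 4, 5)
print(example)
```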

(1) Uselessly elongated review bias

We examined potential biases due to the length of reviews. We generated uselessly elongated versions of reviews by adding substantial amounts of non-informative content. Elongated because we made the reviews 2.5x–3x as long. Useless because the elongation did not provide any useful information: we added filler text, replicated the summary in another part of the review, replicated the abstract in the summary, and replicated the drop-down menus in the review text.

We conducted a randomized controlled trial, in which each evaluator was shown either the original review or the uselessly elongated version at random, along with the associated paper. The evaluators comprised reviewers in the research area of the paper who were not originally assigned the paper. In the results shown below, we employ the Mann-Whitney U test, and the test statistic can be interpreted as the probability that a randomly chosen elongated review is rated higher than a randomly chosen original review. The test reveals significant evidence of bias in favor of longer reviews.

Criterion          Test statistic   95% CI         P-value    Difference in mean scores
Overall score      0.64             [0.60, 0.69]   < 0.0001   0.56
Understanding      0.57             [0.53, 0.62]   0.04       0.25
Coverage           0.71             [0.66, 0.76]   < 0.0001   0.83
Substantiation     0.59             [0.54, 0.64]   0.001      0.31
Constructiveness   0.60             [0.55, 0.64]   0.001      0.37
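As an illustration of this analysis, the following sketch runs a Mann-Whitney U test on hypothetical arrays of overall scores and normalizes the U statistic into the probability that a randomly chosen elongated review is rated higher than a randomly chosen original one. The data are placeholders, and scipy's tie handling may differ in detail from the study's computation.

```python
# Minimal sketch of the Mann-Whitney U analysis on hypothetical score data.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Placeholder 7-point overall scores for the two arms of the trial.
original_scores = rng.integers(1, 8, size=200)
elongated_scores = rng.integers(1, 8, size=200)

u_stat, p_value = mannwhitneyu(elongated_scores, original_scores,
                               alternative="two-sided")
# U / (n1 * n2) is the probability that a random elongated review
# outranks a random original review (ties counted as one half).
prob_elongated_higher = u_stat / (len(elongated_scores) * len(original_scores))
print(f"P(elongated rated higher) ~= {prob_elongated_higher:.2f}, p = {p_value:.4f}")
```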

(2) Author-outcome bias

The graphs below depict the review score given to a paper by a reviewer on the x axis, plotted against the evaluation score for that review by evaluators on the y axis.

We see that authors' evaluations of reviews are much more positive towards reviews recommending acceptance of their own papers, and negative towards reviews recommending rejection. In contrast, evaluations of reviews by other evaluators show little dependence on the score given by the review to the paper. We formally test for this bias of authors' evaluations of reviews on the scores their papers received. Our analysis compares authors' evaluations of reviews that recommended acceptance versus rejection of their paper, controlling for the review length, the quality of the review (as measured by others' evaluations), and different numbers of accepted/rejected papers per author. The test reveals significant evidence of this bias.

Criterion          Test statistic   95% CI         P-value    Difference in mean scores
Overall score      0.82             [0.79, 0.85]   < 0.0001   1.41
Understanding      0.78             [0.75, 0.81]   < 0.0001   1.12
Coverage           0.76             [0.72, 0.79]   < 0.0001   0.97
Substantiation     0.80             [0.76, 0.83]   < 0.0001   1.28
Constructiveness   0.77             [0.74, 0.80]   < 0.0001   1.15
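A minimal sketch of one way to test such a difference is a permutation test on the authors' overall evaluation scores, shown below on hypothetical data. The study's actual analysis additionally controls for review length, review quality as measured by others' evaluations, and the number of accepted/rejected papers per author, which this sketch omits.

```python
# Sketch: permutation test for author-outcome bias on hypothetical data
# (illustrative only; covariate controls from the actual analysis are omitted).
import numpy as np

def permutation_test(accept_evals, reject_evals, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in mean evaluation scores."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([accept_evals, reject_evals]).astype(float)
    observed = accept_evals.mean() - reject_evals.mean()
    n_accept = len(accept_evals)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_diff = pooled[:n_accept].mean() - pooled[n_accept:].mean()
        count += abs(perm_diff) >= abs(observed)
    return observed, count / n_perm

# Hypothetical author evaluations (7-point overall scores).
accept_evals = np.array([6, 7, 6, 5, 7, 6])
reject_evals = np.array([4, 3, 5, 4, 4, 3])
diff, p = permutation_test(accept_evals, reject_evals)
print(f"difference in means = {diff:.2f}, permutation p-value = {p:.3f}")
```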

(3) Inter-evaluator (dis)agreement

We measure the disagreement rates between multiple evaluations of the same review as follows. Take any pair of evaluators and any pair of reviews that receives an evaluation from both evaluators. We say the pair of evaluators agrees on this pair of reviews if both score the same review higher than the other; we say that this pair disagrees if the review scored higher by one evaluator is scored lower by the other. Ties are discarded.
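The sketch below implements this counting procedure, assuming evaluations are stored in a dictionary keyed by (evaluator, review); the data layout and example values are hypothetical.

```python
# Sketch of the disagreement-rate computation described above.
from itertools import combinations

def disagreement_rate(evals):
    """evals: dict mapping (evaluator_id, review_id) -> overall score.
    Returns the fraction of disagreements over all (evaluator pair, review pair)
    comparisons; comparisons with a tie for either evaluator are discarded."""
    evaluators = sorted({e for e, _ in evals})
    reviews = sorted({r for _, r in evals})
    agree = disagree = 0
    for e1, e2 in combinations(evaluators, 2):
        shared = [r for r in reviews if (e1, r) in evals and (e2, r) in evals]
        for r1, r2 in combinations(shared, 2):
            d1 = evals[(e1, r1)] - evals[(e1, r2)]
            d2 = evals[(e2, r1)] - evals[(e2, r2)]
            if d1 == 0 or d2 == 0:
                continue  # tie for at least one evaluator: discard
            if (d1 > 0) == (d2 > 0):
                agree += 1
            else:
                disagree += 1
    total = agree + disagree
    return disagree / total if total else float("nan")

# Tiny hypothetical example: two evaluators scoring the same two reviews oppositely.
example = {("A", "r1"): 6, ("A", "r2"): 4, ("B", "r1"): 3, ("B", "r2"): 5}
print(disagreement_rate(example))  # 1.0
```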

Interestingly, the rate of disagreement between reviews of papers measured in NeurIPS 2016 was in a similar range of 0.25 to 0.3.

(4) Miscalibration

Miscalibration refers to the phenomenon that reviewers have different strictness or leniency standards. We assess the amount of miscalibration of evaluators of reviews following the miscalibration analysis procedure for NeurIPS 2014 paper review data. This analysis uses a linear model of quality scores, assumes a Gaussian prior on the miscalibration of each reviewer, and the estimated variance of this prior then represents the magnitude of miscalibration. The analysis finds that the amount of miscalibration in evaluations of reviews (in NeurIPS 2022) is higher than the reported amount of miscalibration in reviews of papers in NeurIPS 2014.
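As a rough illustration, a model of this kind can be estimated with a random-intercept regression: each review has a fixed quality effect and each evaluator a Gaussian random bias, whose fitted variance plays the role of the miscalibration magnitude. The sketch below uses statsmodels on simulated data and only approximates the NeurIPS 2014 procedure.

```python
# Sketch: estimating evaluator miscalibration as a random-intercept variance
# on simulated data (an approximation of the linear-model analysis described above).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_reviews, n_evaluators = 15, 10
quality = rng.normal(4.0, 1.0, n_reviews)     # latent quality of each review
bias = rng.normal(0.0, 0.7, n_evaluators)     # per-evaluator leniency/strictness

rows = []
for r in range(n_reviews):
    for e in rng.choice(n_evaluators, size=4, replace=False):  # 4 evaluations per review
        rows.append({"review": r, "evaluator": e,
                     "score": quality[r] + bias[e] + rng.normal(0, 0.5)})
df = pd.DataFrame(rows)

# Fixed effect per review, Gaussian random intercept per evaluator.
model = smf.mixedlm("score ~ C(review)", df, groups=df["evaluator"])
result = model.fit()
print("estimated miscalibration variance:", float(result.cov_re.iloc[0, 0]))
```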

(5) Subjectivity

We evaluate a key source of subjectivity in reviews, known as commensuration bias, where different evaluators map individual criteria to overall scores differently. Our approach is to first learn a mapping from criteria scores to overall scores that best fits the collection of all reviews. We then compute the amount of subjectivity as the average difference between the overall scores given in the reviews and the respective overall scores determined by the learned mapping. Following previously derived theory, we use the L(1,1) norm as the loss. We find that the amount of subjectivity in the evaluations of reviews at NeurIPS 2022 is higher than that in the reviews of papers at NeurIPS 2022.
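A simplified sketch of this computation appears below: it fits a linear mapping from criteria scores to overall scores by median (L1) regression on hypothetical data and reports the mean absolute gap between given and mapped overall scores. The study itself learns a more general mapping under the L(1,1) norm, so this is only a stand-in.

```python
# Simplified sketch of the subjectivity measure on hypothetical review data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
criteria = rng.integers(1, 6, size=(n, 4))                               # 5-point criteria scores
overall = np.clip(criteria.mean(axis=1) + rng.normal(0, 0.8, n), 1, 7)   # 7-point overall scores

X = sm.add_constant(criteria.astype(float))
fit = sm.QuantReg(overall, X).fit(q=0.5)      # L1-optimal linear mapping to overall scores
predicted = fit.predict(X)
subjectivity = float(np.mean(np.abs(overall - predicted)))
print(f"average |overall - mapped(criteria)| = {subjectivity:.2f}")
```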

Conclusions

Our findings indicate that the issues commonly encountered in peer reviews of papers, such as inconsistency, bias, miscalibration, and subjectivity, are also prevalent in peer reviews of peer reviews. Although assessing reviews can help in creating improved incentives for high-quality peer review and in evaluating the impact of policy decisions in this area, it is essential to exercise caution when interpreting peer reviews of peer reviews as indicators of the underlying review quality.

More details: https://arxiv.org/pdf/2311.09497.pdf

Acknowledgements: We sincerely thank everyone involved in the NeurIPS 2022 review process who agreed to take part in this experiment. Your participation has been invaluable in shedding light on the important topic of evaluating reviews, towards improving the peer-review process.
