
Generative AI -- 7. How to do Comparison Ranking (Likert Scale)

coding art 2024. 4. 30. 21:35


The ranking task for two Responses is laid out as in the figure below. This scoring format is common in questionnaire surveys and is called a Likert scale.

Note that Same is almost never used on an actual scoring sheet.

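Whenever possible, choose better or much better. Occasionally, though, the two responses score identically on all 6 rated items and are six of one, half a dozen of the other. A ranking decision is still required in that case, so you may have to choose slightly better, and you must state in the Judgement that the choice reflects personal opinion or personal preference.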

Understanding What Impacts the Final Rating and How 

Imagine you are starting from ground zero, where both responses are rated the same. You then consider the ratings across the individual dimensions and move the needle to slightly better, better, or much better, based on the weight (impact) of each dimension on the final rating.

NOTE: The information here is for guidance and directional purposes only. We acknowledge and understand that there can be edge cases and we trust you to keep an open mind while attempting these tasks.

Starting Point

 

 

Overall Quality

Impact on the Final Comparison Scores - HIGHEST

  • If the Overall Quality of a response is 1 point higher than the other, then that response is slightly better - Possible outcomes:
  • A 1-point difference in Overall Quality means the gap in level is marginal, so it should be rated slightly better.

 

  • If the Overall Quality of a response is 2 points higher than the other, then that response is better - Possible outcomes:
  • A 2-point difference in Overall Quality amounts to roughly one level of difference, so it should be rated better.

  • If the Overall Quality of a response is 3 points higher than the other, then that response is much better - Possible outcomes:
  • A 3-point difference in Overall Quality amounts to roughly two levels of difference, so it should be rated much better.

  • If Overall Quality is the same, then both responses are likely the same, or we need to look at the next dimensions - See the next slide!
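
The point-difference rules above can be sketched as a small lookup. This is only an illustrative sketch, not part of the official guidance: the function name `compare_overall_quality` and the integer score encoding are assumptions, and treating a gap larger than 3 as much better is also an assumption, since the text only covers gaps of 0 to 3.

```python
def compare_overall_quality(score_a: int, score_b: int) -> str:
    """Map the Overall Quality gap between responses A and B to a preference.

    Scores are assumed to be integers on the same scale (hypothetical encoding).
    """
    diff = score_a - score_b
    if diff == 0:
        # Equal Overall Quality: the guidance says to look at other dimensions.
        return "same overall quality: check the next dimensions"

    winner = "A" if diff > 0 else "B"
    gap = abs(diff)
    if gap == 1:
        strength = "slightly better"
    elif gap == 2:
        strength = "better"
    else:
        # Gap of 3 is "much better" per the text; larger gaps are an assumption.
        strength = "much better"
    return f"{winner} is {strength}"


print(compare_overall_quality(4, 3))
print(compare_overall_quality(2, 5))
```

As the NOTE above says, this is directional only; edge cases still need human judgement.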

Harmlessness, Truthfulness, and Instructions Following

Impact on the Overall Quality of the Response and Final Comparison Scores - HIGH

  • If major issues are marked in any of these dimensions for one response
  • Overall quality should be Pretty Bad or Horrible
  • The other response can be better/much better than this one

 

  • If minor issues are marked in any of these dimensions for one response
  • Overall quality should be Pretty Bad, Okay, or Pretty Good, depending on the frequency of errors
  • The other response can be the same/slightly better/better than this one, depending on the frequency of errors

 

  • One Response cannot be “much better” or “better” than the other if both responses have the same ratings for these dimensions. Possible outcomes:
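
The constraint in the last bullet can be read as a cap on the final preference. The sketch below is one possible encoding under stated assumptions: the names `STRENGTHS`, `HIGH_IMPACT`, and `cap_preference`, the dictionary keys, and the rating strings are all hypothetical and do not come from the guidance itself.

```python
# Preference strengths ordered from weakest to strongest.
STRENGTHS = ["same", "slightly better", "better", "much better"]

# The three high-impact dimensions named in this section (hypothetical keys).
HIGH_IMPACT = ("harmlessness", "truthfulness", "instructions_following")


def cap_preference(proposed: str, ratings_a: dict, ratings_b: dict) -> str:
    """Limit the preference to 'slightly better' when both responses have
    identical ratings on every high-impact dimension."""
    same_high = all(ratings_a[d] == ratings_b[d] for d in HIGH_IMPACT)
    too_strong = STRENGTHS.index(proposed) > STRENGTHS.index("slightly better")
    if same_high and too_strong:
        return "slightly better"
    return proposed


a = {"harmlessness": "no issues", "truthfulness": "no issues",
     "instructions_following": "minor issues"}
b = dict(a)  # identical ratings on all three dimensions
print(cap_preference("much better", a, b))
```

When the high-impact ratings differ, the function returns the proposed preference unchanged.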

 

Writing Style and Verbosity

Impact on Overall Quality and Final Comparison Scores - Medium

  • These dimensions do not make one response “much better” than the other if they are the ONLY differentiating factor; they contribute to the responses being the same or one being slightly better. Better is also possible, depending on the severity/frequency of the errors in these dimensions, but only in a few scenarios. Possible outcomes:

 

  • The criteria of Instructions Following - Completeness and Depth take a higher priority in impacting the end preference over writing style, verbosity, and formatting

Formatting

Impact on Overall Quality and Final Comparison Scores - Low

  • This should not be the main differentiating factor between one response being better than the other, unless (a) the formatting is part of the Instructions Following, i.e. the Prompt Constraint, or (b) it changes the Response Quality drastically (rare, but possible)
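
To summarize the impact levels described across these sections (HIGHEST, HIGH, Medium, Low), here is a minimal sketch that finds the strongest tier in which two responses actually differ. The tier table mirrors the headings above; the function name, dictionary keys, and rating values are hypothetical and chosen only for illustration.

```python
# Impact tiers as stated in the guidance, ordered from strongest to weakest.
IMPACT_TIERS = [
    ("HIGHEST", ["overall_quality"]),
    ("HIGH", ["harmlessness", "truthfulness", "instructions_following"]),
    ("MEDIUM", ["writing_style", "verbosity"]),
    ("LOW", ["formatting"]),
]


def strongest_differing_tier(ratings_a: dict, ratings_b: dict):
    """Return (tier, dimensions) for the highest-impact tier where the two
    responses have different ratings, or (None, []) if they never differ."""
    for tier, dimensions in IMPACT_TIERS:
        differing = [d for d in dimensions if ratings_a[d] != ratings_b[d]]
        if differing:
            return tier, differing
    return None, []


a = {"overall_quality": 4, "harmlessness": "no issues",
     "truthfulness": "no issues", "instructions_following": "no issues",
     "writing_style": "good", "verbosity": "just right", "formatting": "good"}
b = dict(a, verbosity="too verbose", formatting="poor")
print(strongest_differing_tier(a, b))
```

In this example the responses differ on verbosity and formatting, so the Medium tier decides, which per the guidance usually supports at most slightly better.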