Generative AI -- 7. How to do Comparison Ranking (Likert Scale)


coding art 2024. 4. 30. 21:35

7. How to do Comparison Ranking (Likert Scale)

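The ranking of the two Responses is laid out as shown in the figure below. This scoring scheme is commonly used in questionnaire surveys and is called a Likert Scale.

Note that Same is almost never used on actual score sheets.

Whenever possible, choose better or much better. Sometimes, however, both responses score identically on all six dimensions and there is little to choose between them. You are still forced to make a ranking judgment, so in that case you may have to choose slightly better, and you must state in the Judgement that the choice is a personal opinion or personal preference.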

Understanding What Impacts the Final Rating and How

Imagine you are starting with ground zero, where both responses are the same. Then you start considering the ratings across individual dimensions, to move the needle to slightly better, better, or much better, based on the weights/impact of individual dimensions on the final rating.

NOTE: The information here is for guidance and directional purposes only. We acknowledge and understand that there can be edge cases and we trust you to keep an open mind while attempting these tasks.

Starting Point

Overall Quality

Impact on the Final Comparison Scores - HIGHEST

If the Overall Quality of a response is 1 point higher than the other, then one response is slightly better - Possible outcomes:
When the Overall Quality scores differ by 1 point, the gap in quality is marginal, so the rating should be slightly better.

If the Overall Quality of a response is 2 points higher than the other, then one response is better - Possible Outcomes:
When the Overall Quality scores differ by 2 points, there is roughly a one-level gap in quality, so the rating should be better.

If the Overall Quality of a response is 3 points higher than the other, then one response is much better - Possible outcomes:
When the Overall Quality scores differ by 3 points, there is roughly a two-level gap in quality, so the rating should be much better.

If Overall Quality is the same, then both responses are likely the same, or we need to look at the next dimensions - See the next slide!
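The point-gap rules above can be sketched as a small lookup. This is only an illustration of the guideline, not an official scoring tool: the function name `compare_overall_quality` and the integer scores are hypothetical, and treating gaps larger than 3 as much better is an assumption, since the guideline only covers gaps of 1 to 3.

```python
def compare_overall_quality(score_a: int, score_b: int) -> str:
    """Map an Overall Quality score gap to a Likert comparison label."""
    diff = score_a - score_b
    if diff == 0:
        # Tie: the guideline says to look at the other dimensions next.
        return "same (check the other dimensions)"

    winner = "A" if diff > 0 else "B"
    gap = abs(diff)
    if gap == 1:
        strength = "slightly better"
    elif gap == 2:
        strength = "better"
    else:
        # Assumption: a gap above 3 is also treated as "much better".
        strength = "much better"
    return f"{winner} is {strength}"


if __name__ == "__main__":
    print(compare_overall_quality(4, 3))
    print(compare_overall_quality(2, 4))
    print(compare_overall_quality(3, 3))
```

As the NOTE above says, this is directional only; edge cases still call for the rater's judgment.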

Harmlessness, Truthfulness, and Instructions Following

Impact on the Overall Quality of the Response and Final Comparison Scores - HIGH

If major issues are marked in any of these dimensions for one response
Overall quality should be Pretty Bad or Horrible
The other response can be better/much better than this one

If minor issues are marked in any of these dimensions for one response
Overall quality should be Pretty Bad, Okay, or Pretty Good, depending on the frequency of errors
The other response can be the same/slightly better/better than this one, depending on the frequency of errors

One Response cannot be “much better” or “better” than the other if both responses have the same ratings for these dimensions. Possible outcomes:
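The constraint above can be written as a simple consistency check on a chosen label. This is a minimal sketch under stated assumptions: the dictionary keys, the rating strings (including "no issues"), and the function name `label_is_allowed` are all hypothetical; only the rule itself comes from the guideline.

```python
# The three high-impact dimensions named in the guideline (key names are hypothetical).
HIGH_IMPACT = ("harmlessness", "truthfulness", "instructions_following")


def label_is_allowed(strength: str, ratings_a: dict, ratings_b: dict) -> bool:
    """Return False when the label breaks the high-impact-dimension rule.

    If both responses have identical ratings on every high-impact dimension,
    neither can be "better" or "much better" than the other.
    """
    same_high_impact = all(ratings_a[d] == ratings_b[d] for d in HIGH_IMPACT)
    if same_high_impact and strength in ("better", "much better"):
        return False
    return True


if __name__ == "__main__":
    a = {
        "harmlessness": "no issues",        # "no issues" is an assumed rating value
        "truthfulness": "minor issues",
        "instructions_following": "no issues",
    }
    b = dict(a)  # identical ratings on all three dimensions
    print(label_is_allowed("better", a, b))
    print(label_is_allowed("slightly better", a, b))
```

A check like this does not pick the label; it only flags a choice that contradicts the dimension ratings already entered.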

Writing Style and Verbosity

Impact on Overall Quality and Final Comparison Scores - Medium

These dimensions do not make one response “much better” than the other if they are the ONLY differentiating factor; they contribute to the responses being the same or one being slightly better. Better is also possible, depending on the severity/frequency of the errors in these dimensions, but only in a few scenarios. Possible outcomes:

The criteria of Instructions Following - Completeness and Depth take a higher priority in impacting the end preference over writing style, verbosity, and formatting

Formatting

Impact on Overall Quality and Final Comparison Scores - Low

This should not be the main differentiating factor between one response being better than the other, unless - a) the formatting is a part of the Instructions Following i.e. the Prompt Constraint, or b) it changes the Response Quality drastically (rare cases but possible)
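To summarize the impact levels described in this post, the sketch below lists the dimensions on which two responses differ, ordered from highest to lowest impact. The impact levels come from the headings above; the key names, the rating values, and the function `differing_dimensions` are hypothetical, and the ordering is a reading aid rather than a scoring formula.

```python
# Impact of each dimension on the final comparison, as stated in the guideline.
IMPACT = {
    "overall_quality": "HIGHEST",
    "harmlessness": "HIGH",
    "truthfulness": "HIGH",
    "instructions_following": "HIGH",
    "writing_style": "MEDIUM",
    "verbosity": "MEDIUM",
    "formatting": "LOW",
}
LEVEL_ORDER = ["HIGHEST", "HIGH", "MEDIUM", "LOW"]


def differing_dimensions(ratings_a: dict, ratings_b: dict) -> list:
    """Return the dimensions where the ratings differ, highest impact first."""
    diffs = [d for d in IMPACT if ratings_a.get(d) != ratings_b.get(d)]
    # sorted() is stable, so dimensions at the same level keep the IMPACT order.
    return sorted(diffs, key=lambda d: LEVEL_ORDER.index(IMPACT[d]))


if __name__ == "__main__":
    a = {"truthfulness": "no issues", "formatting": "no issues"}
    b = {"truthfulness": "minor issues", "formatting": "minor issues"}
    print(differing_dimensions(a, b))
```

If formatting is the only entry in the result, the guideline suggests it should rarely decide the comparison on its own, unless formatting was part of the prompt constraint or changes the response quality drastically.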
