
Evaluating Response Text Produced by Generative AI: 2. Verbosity and 3. Instruction Following (Updated)

coding art 2024. 4. 30. 14:28

The rubric (a collection of tips and rules) is provided as a Google Doc. Bookmark it in Chrome so that it is available when needed, and always open it before starting a task.

2. Verbosity

 

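Verbosity judges how long-winded the AI-generated Response is relative to what the Prompt asks for.

The typical case is repeating a particular word or phrase several times; it also covers tacking on content that was never requested.

The evaluation looks at repetition and at whether unnecessary content has been added, but because a verbose Response is a longer Response, in practice it comes down to judging the length of the Response. Read through the examples below, then move on to the very important Instruction Following section, which starts around the middle of this post.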

 

The material below is provided as a Google Doc, so be sure to save it in your bookmarks and open it whenever you need it.

 

How to Rate Verbosity

Verbosity in AI communication involves using more words than necessary, often leading to overly complex and lengthy responses. We'll cover three key aspects of Verbosity:

  1. Repetition
  2. Length
  3. Supporting Content

1. Understanding Repetition 🔄

What is Repetition?

  • Repetition is the unnecessary reiteration of the same ideas or phrases. It can clutter the response and distract from the main message.

How to evaluate:

  • A well-crafted response will communicate the necessary information without redundancy. It will be direct and to the point.
  • Ask if you found yourself annoyed or fatigued by the reiteration of ideas without clear added value.

2. Understanding Length 📏

What is Length?

  • Length concerns how much detail is used to explain or answer a query. It involves balancing thoroughness with conciseness, ensuring the response is neither too brief nor excessively long.

How to evaluate:

  • Too Short: These responses often fail to cover all aspects of the query or prove unhelpful to the user, leaving out important information or explanations necessary for good understanding.
  • Too Long: Characterized by unnecessary elaborations, fluff, or tangents, making the response less focused and harder to follow.
  • Just Right: The response is comprehensive, providing all the necessary information, yet concise enough to maintain clarity and focus.

3. Understanding Supporting Content 🗂️

What is Supporting Content?

  • Supporting content refers to additional information that enhances the main message. It includes relevant examples, explanations, and details that support the central theme or answer.

How to evaluate:

  • Good Supporting Content will directly relate to and enrich the topic in question, providing depth and clarity.
  • Tangential or Unrelated Content strays from the main topic, offering information that doesn’t contribute to the user's understanding of the specific query.

Note:

Minor verbosity issues alone are usually not grounds to heavily prefer one response over another.

 

Verbosity Rating Examples

1. Repetition

Prompt: "Explain the importance of hydration."

Bad Example: "Hydration is important. It's crucial to stay hydrated. Staying hydrated should be a priority." ❌

  • This response repeats the same idea about hydration multiple times without adding new information.
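
As a rough aid for spotting this kind of repetition, the sketch below counts crude word stems that recur in a response. The function name `repeated_stems`, the 4-character prefix used as a stand-in for stemming, and the threshold of two occurrences are all assumptions made for illustration; they are not part of the rubric, and the rating itself remains a human judgment.

```python
import re
from collections import Counter


def repeated_stems(text: str, prefix_len: int = 4, min_count: int = 2) -> dict[str, int]:
    """Return crude word stems that occur at least `min_count` times."""
    words = re.findall(r"[a-z]+", text.lower())
    # Keep only longer words and truncate them so that "hydration" and
    # "hydrated" land in the same bucket (a rough substitute for stemming).
    stems = [w[:prefix_len] for w in words if len(w) >= prefix_len]
    counts = Counter(stems)
    return {stem: n for stem, n in counts.items() if n >= min_count}


bad_example = (
    "Hydration is important. It's crucial to stay hydrated. "
    "Staying hydrated should be a priority."
)
print(repeated_stems(bad_example))  # {'hydr': 3, 'stay': 2}
```

Three sentences that all revolve around the same two stems are a hint that the response restates one idea instead of adding information.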

 

2. Length

Prompt: "Is Mexico considered part of North America"

Bad Example: "Certainly! The question of whether Mexico is part of North America is a complex one that has been the subject of much debate and discussion over the years. While some people argue that Mexico is part of Central America or even South America, the majority of experts agree that it is indeed part of North America.

 

According to the United Nations, North America is defined as a continent that includes three countries: Canada, the United States, and Mexico. This definition is widely accepted by most geographers and scholars, who point to a variety of factors that support the inclusion of Mexico in North America. For example, Mexico shares a long border with the United States, which is often considered to be the defining feature of North America . Additionally, Mexico is home to a number of important cultural and historical sites that are closely tied to the history of North America as a whole, including the ancient ruins of Teotihuacan and the colonial city of San Miguel de Allende

Despite these arguments, there are still some people who believe that Mexico should be considered part of Central America or South America instead of North America. Some of these people point to the fact that Mexico is often grouped together with other Latin American countries, which are typically considered to be part of Central or South America. Others argue that Mexico’s unique cultural and linguistic heritage sets it apart from the rest of North America, and that it should be considered a separate region altogether.

 

In conclusion, while there is some debate over whether Mexico should be considered part of North America, the majority of experts agree that it is indeed part of the continent. Whether you’re interested in the geography, history, or culture of Mexico and North America, there’s no denying that this is a fascinating and complex topic that is worth exploring in more detail." ❌

  • The response is overly long and unnecessarily includes details on the debate of whether Mexico should be considered part of North America.

 

3. Supporting Content

Prompt: "What are the benefits of renewable energy?"

Bad Example: "Renewable energy includes solar, wind, geothermal, and other sources of energy that are not expendable. Solar panels convert sunlight to electricity. Wind turbines use wind to generate power. Renewable energy is sustainable and reduces carbon emissions. It's becoming more popular globally. Solar energy is used in homes and businesses." ❌

  • The response diverges into the types of renewable energy and specific details about solar and wind energy, which, while related, do not directly address the benefits of renewable energy as a whole.
 
 

 

 

Let's boil down the verbosity concepts explained through the examples above into simple rubric-style practice, as follows.

 

 

Please rate Response A based on the Verbosity rubric.

 

Prompt:

Where did karaoke originate?

 

Response A:

Karaoke, an immensely popular form of entertainment enjoyed by countless individuals around the globe, traces its origins back to Japan. This fascinating and engaging activity, which allows people to sing along to their favorite songs with the lyrics displayed on a screen, was developed in the late 1960s. It's a unique and delightful experience, blending music and performance in a way that brings joy and amusement to participants and audiences alike. It's popularity is enjoyed all around the world.

 

The term "karaoke" itself is an intriguing blend of two Japanese words. 'Kara' comes from 'karappo,' meaning empty, and 'oke' is an abbreviation of 'okesutura,' which translates to orchestra. Daisuke Inoue, a musician hailing from Kobe, Japan, is credited with inventing this novel concept. Since its inception in Japan, karaoke has seen a meteoric rise in popularity, evolving into a beloved global phenomenon. It has become a staple at social gatherings, parties, and entertainment venues, offering a fun and interactive way for people to express themselves through song. It's no surprise that Karaoke is such a popular pastime.

 

Drag the mouse over the black bar below to reveal the hidden text.

 

Karaoke has become so popular that there are now karaoke competitions held in various countries, with participants vying for the title of karaoke champion. It’s amazing to see how this form of entertainment has evolved and grown over the years, bringing joy and amusement to countless individuals around the globe.

 

Did you know that Karaoke is not only a popular form of entertainment, but it can also be beneficial to your health? Singing has been shown to reduce stress, improve mood, and boost immunity. A study conducted by the University of Frankfurt found that singing in a choir can even increase pain tolerance. Karaoke can also be a great way to improve your language skills. Singing along to songs in a foreign language can help you learn new words and phrases, improve your pronunciation, and boost your confidence when speaking with native speakers. In fact, many language schools and programs use karaoke as a fun and engaging way to teach students new languages. So, the next time you’re at a karaoke bar, don’t be afraid to belt out your favorite tunes!

 

 

Response B:

Karaoke originated in Japan in the late 1960s.

 

It was invented by Daisuke Inoue, a musician from Kobe, Japan. The term "karaoke" is a portmanteau of two Japanese words: "kara," derived from "karappo," meaning empty, and "oke," short for "okesutura," meaning orchestra. Karaoke started as a form of interactive entertainment where individuals sing along to recorded music using a microphone and public address system, with the lyrics displayed on a screen. It quickly gained popularity in Japan and then spread globally, becoming a popular social activity in many countries.

 

 

 

3. Instruction Following

 

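AI can be very clever and cunning, so for Instruction Following you need to watch the AI-generated Response closely to see whether or not it does what the Prompt asks. Whether the Response is factual and accurate is handled separately under Truthfulness, so in Instruction Following you do not need to worry about factuality or accuracy. The key is simply to check, mechanically like a robot, how well the requirements of the Prompt are reflected.

An Instruction that often appears in Prompts is, for example, "write it in 300 characters or fewer." The character count is easy to check by copying the Response text and pasting it into the Hangul word processor. Sometimes a word-count limit is imposed instead, in which case you can use WORD.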
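
If a word processor is not at hand, the same counts can be obtained with a few lines of Python. This is a minimal sketch: the helper names `count_characters` and `count_words` are hypothetical, `len()` counts Unicode code points, and words are split on whitespace, so the numbers may differ slightly from what a particular word processor reports.

```python
def count_characters(text: str, include_whitespace: bool = True) -> int:
    """Count characters, optionally ignoring spaces, tabs, and newlines."""
    if include_whitespace:
        return len(text)
    return sum(1 for ch in text if not ch.isspace())


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


response = "Karaoke originated in Japan in the late 1960s."
print(count_characters(response))                            # with spaces
print(count_characters(response, include_whitespace=False))  # without spaces
print(count_words(response))
print(count_characters(response) <= 300)  # does it respect a 300-character limit?
```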

 

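Let's look at two issues related to Instruction Following that can be confusing.

For example, what should you do with a Prompt such as "I like clouds."? Even the most capable AI cannot tell what it is being asked to do. A Prompt of this kind is not wrong as a statement, but it can hardly be considered an Instruction, that is, a request to do something. As a result no meaningful Response can be generated and it cannot be evaluated, so it falls under "CANNOT ASSESS".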

 

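A Response may also say something along the lines of "I don't know enough to answer, so after I have learned more later ...". Such a response is called a "punt". Its content cannot be called wrong, but it cannot be evaluated either, so it falls under "CANNOT ASSESS".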

 

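Then there is hallucination: when the AI generates a Response that amounts to an illusion, misconception, or delusion convincing enough to fool you completely, it falls under "CANNOT ASSESS". However good the algorithm in the AI code may be, most of the input data is copied straight from the internet, so its truthfulness, quality, and purity are never 100%. There is therefore always a risk of giving users wrong information, and keep in mind that, given how AI is built, this problem ultimately cannot be eliminated.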

 

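Meanwhile, a Prompt input such as "@@#$?%^_^!", the kind of thing you might see in a webtoon speech bubble, is Gibberish, that is, absurd nonsense, and the AI cannot generate a Response for it either. This case is called not applicable, "NOT APPLICABLE", or "N/A" for short. Naturally it ends up UNRATABLE.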

 

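In Instruction Following evaluation, N/A is UNRATABLE outright, so no rating work is possible, while CANNOT ASSESS is RATABLE but receives the lowest score. You can therefore assume that such tasks almost never come up in practice. The one case that does is Hallucination, and based on the analysis you should give it very poor ratings, as follows.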

Writing Quality does not judge content, so it will be minor issues or no issues.

Since the Response contains nothing useful, Verbosity will be Too Short.

Truthfulness is completely False, so it will be the worst rating, major issues.

Giving users wrong information can cause serious trouble, so Harmfulness is also severe and will be major issues.

Overall Quality is naturally Bad.

 

How to Rate Instruction Following

Assessing Instruction Following is a critical part of these tasks. Ultimately, we want to ensure that model responses follow the user's requests or directions. We think about Instruction Following in two ways:

  1. Prompt Request Coverage
  2. Relevance

1. Understanding Prompt Coverage 🧐

What is Prompt Request Coverage? (“Coverage”) (To what extent does the Response accommodate the requirements of the Prompt?)

  • Coverage is simply an assessment of whether the generated response did everything the prompt asked it to do, even if it's implicit.
  • When asking for a list of edible foods starting with “A,” does the response match those requirements? Does it list any foods that start with a different letter?
  • When asked for a 6-week exercise routine, does the response adhere to the timeframe? Does it give a much shorter program?
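
For a constraint as mechanical as "foods starting with A", the coverage check can be sketched in code. The function name `items_violating_letter` and the sample list are hypothetical, made up only to illustrate the idea; whether each item is actually an edible food still has to be judged by the rater.

```python
def items_violating_letter(items: list[str], letter: str) -> list[str]:
    """Return the list entries that do not start with the required letter."""
    required = letter.lower()
    return [item for item in items if not item.strip().lower().startswith(required)]


# Hypothetical list taken from an imagined model response.
foods = ["Apple", "Avocado", "Banana", "Asparagus"]
print(items_violating_letter(foods, "A"))  # ['Banana']
```

An empty result means the letter constraint was met; any returned entry is a candidate Coverage issue.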

 

How to Evaluate Coverage

Think about all the things the prompt asks for, and think about what the most important requests or constraints are that the prompt makes:

  • Hierarchy of Requests: Imagine a prompt requests a 500-word short-story about flying fish. Let's say one response offers a 400-word short-story about flying fish and another response offers a 500-word short story about fish that don't fly. Which response would a user probably prefer? Likely the former 🙂. We might consider the first response to have a Minor Issue, but the second response to have a Major Issue.
  • Going Above and Beyond: Oftentimes you will see model responses that provide a lot more information than was specifically requested. That additional information may or may not be useful for the user (and could make one response slightly more helpful than the other), but it's important to consider whether the user's explicit requests were addressed. If all the requests were addressed, then we likely cannot say that a response has issues with Instruction Following.

2. Understanding Relevance 🎯

What is Relevance? (How on-point the Response is: whether it contains much content that is far removed from the core of the request)

Relevance is how we measure whether aspects of the model response do not relate at all to what was asked. Sometimes there are responses where the instructions may have been followed, but parts of the response are not relevant to addressing the tasks at hand.

 

How to Evaluate Relevance

  • Is a response almost entirely on point but also includes a tidbit that does not seem related or helpful to know? Minor Issues
  • Does the response include a bunch of irrelevant and unhelpful information? Major Issues

Important Note: Instruction Following carries far more weight than Writing Quality and Verbosity.

A mistake in Instruction Following is WORSE for our purposes than problems with Writing Quality and Verbosity. In general, you should weigh Instruction Following heavily when ranking two responses since these mistakes are critical failures of the model and unhelpful to the user.

 

 

Instruction Following Rating Examples

1. Prompt Coverage

Prompt: "Create a travel guide for Rome, covering its history, key landmarks, local cuisine, and travel tips. The response should be approximately 300 words."

Bad Example: "Rome, the capital of Italy, is a historic city with many ancient sites. The Colosseum and the Roman Forum are popular. Italian cuisine includes pasta and pizza. It's a beautiful city with a rich history." ❌

  • This response does not comprehensively cover all the requested aspects of the prompt and is not really a travel guide. It also falls well short of the 300-word guideline, which indicates the level of depth the user was looking for.

 

2. Relevance

Prompt: "Give tips for efficient water use in gardening."

Bad Example: "Gardens are spaces where you can grow flowers and vegetables. Water is an important resource to conserve in our daily activities, especially in water intensive activities such as washing dishes and doing laundry. Filling swimming pools is a fun use of water. Try timing your watering schedule for cooler parts of the day to conserve water" ❌

  • The response only includes one sentence that is a tip for efficient water use in gardening, but the majority of the response is irrelevant to the user's specific request.

 

 

Instruction Following Quiz

Please read the following responses carefully and select the best choice for the question below.

 

Please rate Response A based on the criteria in the instruction following rubric.

 

Prompt:

Give me a list of relaxing cities to visit in California! Make sure to start the entries with both the city name and ZIP code. Keep the list to 3 entries.

 

 

Response A:

Sure! Here's a list of three relaxing cities to visit in California, each with their respective ZIP code:

 

  • Carmel-by-the-Sea, 93921: Known for its artistic history and picturesque scenery, Carmel-by-the-Sea offers a tranquil escape with beautiful beaches, charming boutiques, and quaint cafes.
  • Ojai: Nestled in a valley in the Topatopa Mountains, Ojai is famed for its spiritual retreats, organic agriculture, and a laid-back atmosphere perfect for relaxation and rejuvenation.
  • Solvang, 93463: This unique town offers a taste of Denmark in California. With its Danish-style architecture, windmills, and peaceful vineyards, Solvang is a delightful and calming destination.

 

Each of these cities provides a distinct and serene experience, perfect for a relaxing getaway in California.

 

Response B:

California is filled with relaxing cities, each offering its unique charm and tranquil atmosphere. Here are three relaxing cities in California, listed with their respective ZIP codes:

 

  1.  Carmel-by-the-Sea, 93921: Famous for its picturesque beaches, artistic history, and quaint village-like center, Carmel-by-the-Sea is a serene retreat. The town's fairytale cottages and scenic ocean views make it a peaceful getaway.
  2. Ojai, 93023: Nestled in a valley in the Topatopa Mountains, Ojai is known for its boutique hotels, art galleries, and a relaxed pace of life. Its beautiful natural surroundings, including orange and avocado groves, contribute to its serene atmosphere.
  3. Solvang, 93463: This unique city offers a taste of Denmark in California. With its Danish-style architecture, windmills, and bakeries, Solvang provides a tranquil European escape. The city is also surrounded by beautiful vineyards and rolling hills.
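
The two mechanical constraints in this Prompt (three entries, each starting with the city name and ZIP code) can be checked with a small script. This is only a sketch under stated assumptions: the pattern expects entries of the form "City, 12345:" with a five-digit ZIP, the helper name `check_entries` is hypothetical, and the entry texts from Response A are abbreviated here with "...".

```python
import re

# Assumed entry format: "<city>, <5-digit ZIP>: <description>"
ZIP_ENTRY = re.compile(r"^(?P<city>[^,:]+),\s*(?P<zip>\d{5})\s*:")


def check_entries(entries: list[str], expected_count: int = 3) -> dict:
    """Check the entry count and collect entries that lack a leading ZIP code."""
    missing_zip = [
        entry.split(":", 1)[0].strip()
        for entry in entries
        if not ZIP_ENTRY.match(entry.strip())
    ]
    return {"count_ok": len(entries) == expected_count, "missing_zip": missing_zip}


response_a = [
    "Carmel-by-the-Sea, 93921: Known for its artistic history...",
    "Ojai: Nestled in a valley in the Topatopa Mountains...",
    "Solvang, 93463: This unique town offers a taste of Denmark...",
]
print(check_entries(response_a))  # {'count_ok': True, 'missing_zip': ['Ojai']}
```

The script only surfaces the missing ZIP code in the Ojai entry of Response A; deciding how heavily that weighs in the rating is still up to the rubric.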

 

 

 

Next, let's move on to 4. Truthfulness.

https://ejleep1.tistory.com/1576