Handle Google Gemini Exp Thinking thought tokens #4192

Open
xl0 opened this issue Dec 23, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

xl0 commented Dec 23, 2024

Feature Description

The thinking model API is admittedly a bit of a mess. In sync mode, it returns a response candidate with 2 parts: the first for the thought, the second for the actual response:

Count from 1 to 10

{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "My thinking process for responding to \"Count from 1 to 10\" is straightforward:\n\n1. **Identify the core request:** The user wants a numerical sequence starting at 1 and ending at 10.\n\n2. **Recall basic counting:**  I have access to fundamental knowledge of numbers and their order.\n\n3. **Generate the sequence:** I produce the numbers in the specified order: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.\n\n4. **Determine the appropriate formatting:**  A simple list separated by commas is the most natural and readable way to present the count.\n\n5. **Construct the response:** I combine the generated sequence into a coherent sentence."
          },
          {
            "text": "1, 2, 3, 4, 5, 6, 7, 8, 9, 10\n"
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP",
     [...]
    }
  ],
  [...]
}
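
Splitting the two on the user side is trivial once the parts array is exposed. A minimal sketch in plain TypeScript over the parsed JSON above (the response shape is inferred from that example, and it assumes the thought is always the first part):

interface Part { text: string }

interface GenerateContentResponse {
  candidates: { content: { parts: Part[]; role: string }; finishReason?: string }[];
}

// Assumption: for the thinking model, parts[0] is the thought and the
// remaining part(s) make up the actual answer.
function splitThoughtAndAnswer(response: GenerateContentResponse) {
  const [thought, ...answer] = response.candidates[0].content.parts;
  return {
    thought: thought?.text ?? "",
    answer: answer.map((p) => p.text).join(""),
  };
}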

In streaming mode, it:

  • Generates candidates with 1 part while thinking.
  • Generates 1 candidate with 2 parts: one for the end of the thought, one for the beginning of the response. I believe it is always exactly 2 parts.
  • Generates candidates with 1 part for the rest of the response (see the sketch after the example below):

Count from 1 to 100:

[... Previous thinking with 1 part...]
Parts from response:
[
  {
    text: " on a new line.\n\nEssentially, for such a basic request, there isn't much complex processing involved. It's direct application of the definition of counting. More complex requests would involve deeper analysis of constraints, potential edge cases, and more sophisticated algorithms. But for this, it's a simple retrieval",
  }
]
Parts from response:
[
  {
    text: " and presentation of a known sequence.",
  }, {
    text: "0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n",
  }
]
Parts from response:
[
  {
    text: "22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n32\n33\n34\n35\n36\n37\n38\n39\n40\n41\n42\n4",
  }
]
[... the rest of the response ...]
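
So a user-side consumer needs a small state machine: every part is thought text until the chunk carrying 2 parts arrives; within that chunk the first part still belongs to the thought, and everything from the second part onward is the answer. A rough sketch, assuming the raw per-chunk parts arrays shown above were available:

interface Part { text: string }

// Accumulates streamed parts into separate thought and answer buffers,
// assuming (per the observations above) that the boundary is marked by
// the single chunk containing exactly 2 parts.
class ThinkingStreamSplitter {
  thought = "";
  answer = "";
  private thinking = true;

  push(parts: Part[]) {
    parts.forEach((part, i) => {
      // In the 2-part chunk, the first part ends the thought and the
      // second begins the answer; all later chunks are answer-only.
      if (this.thinking && parts.length === 2 && i === 1) this.thinking = false;
      if (this.thinking) this.thought += part.text;
      else this.answer += part.text;
    });
  }
}

Feeding it the chunks from the example above leaves the trailing thought text in thought and the counted numbers in answer.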

This is of course a terrible API, and I hope they improve it in the future, but in the meantime, would it be possible to handle this?

Since this is just 1 model for now, I think it's best not to implement the logic in the SDK, but to expose enough data for the user to implement it themselves.
At the moment, onChunk (of streamText) receives a text-delta event with all parts of the response candidate concatenated together, so it's impossible to tell where the thought ends and the response begins.

Maybe it's possible to add the original candidate to the event that is passed to onChunk? The same goes for non-streaming generation: I don't use it, but as far as I can tell we don't get access to any form of the response that contains the candidate parts as an array, only the text of both parts concatenated together.
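
For illustration only, something like the following would be enough from the caller's side. The rawParts field is purely hypothetical (it does not exist on the streamText chunk event today), and the model id is an assumption:

import { streamText } from "ai";
import { google } from "@ai-sdk/google";

const result = streamText({
  model: google("gemini-2.0-flash-thinking-exp"), // assumed model id
  prompt: "Count from 1 to 100",
  onChunk({ chunk }) {
    if (chunk.type === "text-delta") {
      // Hypothetical: expose the provider's raw candidate parts on the
      // event so the user can separate thought parts from answer parts.
      // const parts = chunk.rawParts;
    }
  },
});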

Use Cases

https://chat.congusto.ai, loved by some AI researchers, and thus requiring all the cool new features.

Additional context

No response

@xl0 xl0 added the enhancement New feature or request label Dec 23, 2024