Handle Google Gemini Exp Thinking thought tokens #4192

Open
xl0 opened this issue Dec 23, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

xl0 commented Dec 23, 2024

Feature Description

The thinking model API is admittedly a bit of a mess. In sync mode, it returns a response candidate with 2 parts: the first for the thought, the second for the actual response:

Count from 1 to 10

{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "My thinking process for responding to \"Count from 1 to 10\" is straightforward:\n\n1. **Identify the core request:** The user wants a numerical sequence starting at 1 and ending at 10.\n\n2. **Recall basic counting:**  I have access to fundamental knowledge of numbers and their order.\n\n3. **Generate the sequence:** I produce the numbers in the specified order: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.\n\n4. **Determine the appropriate formatting:**  A simple list separated by commas is the most natural and readable way to present the count.\n\n5. **Construct the response:** I combine the generated sequence into a coherent sentence."
          },
          {
            "text": "1, 2, 3, 4, 5, 6, 7, 8, 9, 10\n"
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP",
     [...]
    }
  ],
  [...]
}
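
Splitting the two on the user side is trivial once the parts array is exposed. A minimal sketch in plain TypeScript over the parsed JSON above (the response shape is inferred from that example, and it assumes the thought is always the first part):

interface Part { text: string }

interface GenerateContentResponse {
  candidates: { content: { parts: Part[]; role: string }; finishReason?: string }[];
}

// Assumption: for the thinking model, parts[0] is the thought and the
// remaining part(s) make up the actual answer.
function splitThoughtAndAnswer(response: GenerateContentResponse) {
  const [thought, ...answer] = response.candidates[0].content.parts;
  return {
    thought: thought?.text ?? "",
    answer: answer.map((p) => p.text).join(""),
  };
}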

In streaming mode, it:

  • Generates candidates with 1 part while thinking.
  • Generates 1 candidate with 2 parts: one for the end of the thought, one for the beginning of the response. I believe it is always exactly 2 parts.
  • Generates candidates with 1 part for the rest of the response (see the sketch after the example below):

Count from 1 to 100:

[... Previous thinking with 1 part...]
Parts from response:
[
  {
    text: " on a new line.\n\nEssentially, for such a basic request, there isn't much complex processing involved. It's direct application of the definition of counting. More complex requests would involve deeper analysis of constraints, potential edge cases, and more sophisticated algorithms. But for this, it's a simple retrieval",
  }
]
Parts from response:
[
  {
    text: " and presentation of a known sequence.",
  }, {
    text: "0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n",
  }
]
Parts from response:
[
  {
    text: "22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n32\n33\n34\n35\n36\n37\n38\n39\n40\n41\n42\n4",
  }
]
[... the rest of the response ...]
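
So a user-side consumer needs a small state machine: every part is thought text until the chunk carrying 2 parts arrives; within that chunk the first part still belongs to the thought, and everything from the second part onward is the answer. A rough sketch, assuming the raw per-chunk parts arrays shown above were available:

interface Part { text: string }

// Accumulates streamed parts into separate thought and answer buffers,
// assuming (per the observations above) that the boundary is marked by
// the single chunk containing exactly 2 parts.
class ThinkingStreamSplitter {
  thought = "";
  answer = "";
  private thinking = true;

  push(parts: Part[]) {
    parts.forEach((part, i) => {
      // In the 2-part chunk, the first part ends the thought and the
      // second begins the answer; all later chunks are answer-only.
      if (this.thinking && parts.length === 2 && i === 1) this.thinking = false;
      if (this.thinking) this.thought += part.text;
      else this.answer += part.text;
    });
  }
}

Feeding it the chunks from the example above leaves the trailing thought text in thought and the counted numbers in answer.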

This is of course a terrible API, and I hope they improve it in the future, but in the meantime, would it be possible to handle this?

Since this is just 1 model for now, I think it's best not to implement the logic in the SDK, but to expose enough data for the user to implement it themselves.
At the moment, onChunk (of streamText) receives a text-delta event with all parts of the response candidate concatenated together, so it's impossible to tell where the thought ends and the response begins.

Maybe it's possible to add the original candidate to the event that is passed to onChunk? The same goes for non-streaming generation: I don't use it, but as far as I can tell we don't get access to any form of the response that contains the candidate parts as an array, only the text of both parts concatenated together.
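
For illustration only, something like the following would be enough from the caller's side. The rawParts field is purely hypothetical (it does not exist on the streamText chunk event today), and the model id is an assumption:

import { streamText } from "ai";
import { google } from "@ai-sdk/google";

const result = streamText({
  model: google("gemini-2.0-flash-thinking-exp"), // assumed model id
  prompt: "Count from 1 to 100",
  onChunk({ chunk }) {
    if (chunk.type === "text-delta") {
      // Hypothetical: expose the provider's raw candidate parts on the
      // event so the user can separate thought parts from answer parts.
      // const parts = chunk.rawParts;
    }
  },
});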

Use Cases

https://chat.congusto.ai, loved by some AI researchers, and thus requiring all the cool new features.

Additional context

No response

@xl0 xl0 added the enhancement New feature or request label Dec 23, 2024