[BUG] generator.iterate() returns corrupted result objects in some cases #689
Comments
Not every iteration will emit text data. Every iteration processes one token per active job, but one token doesn't always result in text output. This can happen if there's a partial match to a banned string or stop string, which will hold back any text until the condition is resolved. There are also plenty of characters that are encoded as multiple tokens, in which case you may only get text on every 2nd or 4th iteration or whatever. I would simply use `result.get("text", "")` instead of assuming the field is present.
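In code, that suggestion amounts to reading the field defensively; a minimal sketch, assuming a `generator` set up as elsewhere in this thread:

```python
output = ""
for result in generator.iterate():
    # "text" may be absent on iterations that decoded nothing new for this job,
    # so read it defensively instead of indexing result["text"]
    output += result.get("text", "")
```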
But then why is the result object emitted at all? AFAICT, such objects contain no actionable information. It's not even possible to tell which of the conditions you describe has occurred. What is the user supposed to do with such a "result"?
It's a design choice. The generator performs one iteration per call to `iterate()`. It's technically possible to return less than one character (as a byte array containing incomplete UTF-8 codes), but I don't think the Tokenizers library allows for it in a clean way, and it would complicate client code a lot to have to deal with that. The other option is for an iteration to simply not return a result for a job that doesn't produce output, as opposed to returning an empty result. To me these seem roughly equivalent, but I went with the latter because it at least gives the client a way to confirm that a job is running and making progress, even when that progress hasn't yet produced a decodable string.
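For concreteness, the two behaviours being compared would look roughly like this from the client's side (illustrative values only, not literal library output):

```python
# Option A: a job that produced no text this iteration simply yields nothing
results_a = []

# Option B (current behaviour): the job yields a result with no text fields
results_b = [{"identifier": 7, "eos": False}]
```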
That's indeed the behavior I would have expected. I agree that the other two options are bad, but the current approach is a bit as if an HTTP server framework were to emit empty request objects to the handler while request chunks arrive over TCP, and then emit the populated object once the complete request has been parsed. In general, I believe that objects should make as many guarantees about their structure as possible to simplify downstream code. The reasonable assumption that a result for a still-running job carries token data does not hold with the current implementation, and that forces the user either to complicate their state machine or to weaken lookups like you described. IMO it would be preferable to just not receive such incomplete results in the first place. Alternatively, they could contain a field that specifies why there is no token content, but I see very little value in that from a user perspective.
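Concretely, the kind of lookup that breaks under the current behaviour looks something like this (a sketch; `outputs` and `job_ids` are hypothetical names, not part of the library):

```python
outputs = {job_id: "" for job_id in job_ids}  # per-job accumulators

for result in generator.iterate():
    # Relies on every result for an unfinished job carrying text, which is
    # exactly the guarantee that does not hold, so this can raise KeyError:
    outputs[result["identifier"]] += result["text"]
```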
It's in theory possible for the generator to not produce any decodable text for a significant amount of time (say, in the scenario where it keeps producing banned strings and then having to rewind). Would it not be better to at least yield some response so that the client knows that it's still making progress?
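One way a client could use those otherwise-empty results as a liveness signal; a sketch, where the timing logic and variable names are assumptions rather than anything the library provides:

```python
import time

last_seen = {}  # job identifier -> time of its most recent result

while generator.num_remaining_jobs():
    for result in generator.iterate():
        # Even a result with no "text" shows the job advanced by one token,
        # so a stalled job is one whose timestamp stops moving.
        last_seen[result["identifier"]] = time.monotonic()
```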
I don't really agree with this. It works when the stop condition is specifically an EOS token, but for stop strings and for reaching the token limit, you'd have to run dummy iterations for jobs that have already reached an end state in their previous iteration. As it is, I don't think it's really hard to deal with. EOS also isn't a condition you necessarily have to react to, since the job is automatically removed from the queue as it ends. So all the client needs to do is:

```python
while generator.num_remaining_jobs():
    results = generator.iterate()
    for result in results:
        identifier = result["identifier"]
        outputs[identifier] += result.get("text", "")
        # Optionally, if you want to react precisely when each job finishes
        if result["eos"]:
            print(outputs[identifier])
# generator is now finished with all jobs enqueued prior to the loop
```

There's also the async generator that allows you to create jobs as individual async generators. This still runs the same batched `iterate()` loop internally:

```python
my_job = ExLlamaV2DynamicJobAsync(...)
async for result in my_job:
    print(result.get("text", ""), end = "")
# generator is now finished with my_job but may have other active or pending jobs from other coroutines
```
I completely agree that there shouldn't be "dummy iterations" in the background. But that doesn't mean the generator should emit result objects for jobs for which there aren't any actual results. This isn't how most other libraries operate either. An HTTP server doesn't push incomplete requests to the handler just to inform it that it is still listening on the socket. If something goes wrong during generation, I expect the library to raise an exception, rather than telling me intermittently "yes, I'm still making progress, although I don't have anything to show yet". The way the generator operates is fine; I just think `iterate()` shouldn't return results for jobs that have nothing to report.
But then it comes down to whether you get an empty list of results or a list of empty results. Either way, I don't see a big difference in terms of how the client has to react, but the latter conveys a bit more information (specifically, which jobs are currently active, though more metadata might be added in the future) that could be of use to someone, and which you could filter out easily if it's not useful:

```python
results = [r for r in generator.iterate() if "text" in r]
```
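A slightly fuller version of that filter, for a client that also wants to see the result marking the end of each job; a sketch, and the helper name is made up here:

```python
def iterate_nonempty(generator):
    """Yield only results that carry new text or signal end-of-sequence."""
    for result in generator.iterate():
        if "text" in result or result.get("eos", False):
            yield result
```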
OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.5.0
Model
No response
Describe the bug
This is an actual result object I received from `generator.iterate()`:

As you can see, `eos` is `False`, but the fields `text`, `token_ids`, etc. are all missing. Thus this "result" object contains no information at all. I have not been able to determine yet whether there are tokens missing from the output because of this, or whether this is a non-event that gets emitted for some reason.

This is a VERY rare occurrence. I have to start hundreds of jobs and generate between 10,000 and 50,000 tokens in order for this to happen. Perhaps a race condition somewhere in the threading logic?
Reproduction steps
Any use of the dynamic generator appears to trigger this eventually, provided enough jobs are started and enough tokens are generated. Just place a check like the one sketched below in the loop and eventually you will get an error.
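The exact check from the report was not captured in this copy of the issue; given the expected behavior stated below, it was presumably something along these lines (an assumption, not the reporter's literal code):

```python
for result in generator.iterate():
    # A result for a job that has not reached EOS should carry token data;
    # if this fires, an "empty" result slipped through.
    assert result["eos"] or "text" in result, f"empty result: {result}"
```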
Expected behavior
If EOS has not been reached, results should contain token data.
Logs
No response
Additional context
No response