[REQUEST] Synthetic Data generation features #669
Comments
I think you may be overthinking it. There's an example here for using the dynamic generator to do bulk inference on many sequences while taking advantage of deduplication and batching automatically, up to the limits imposed by whatever cache size you can fit in VRAM. There is no need for padding this way.

Cache fragmentation shouldn't be an issue, though I wasn't entirely sure about this, so I added a defragmenter that automatically limits how much of an impact it might have, if it is an issue.

I'm not sure at what point it would make sense to start offloading the model to system RAM. Perhaps at some extreme batch size (10k or whatever?), but generally the overhead of offloading layers is huge. There's about a 100x bandwidth difference between PCIe and VRAM. What's more, the benefits of batching only extend to the point where the memory bus is no longer saturated. After that point, a pass at bsz 1000 (or whatever) has twice as much latency as a pass at bsz 500.
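For reference, a minimal sketch of what that bulk-inference pattern looks like, loosely based on the exllamav2 dynamic-generator examples. The model path, cache size and prompts are placeholders, and exact argument names may differ slightly between versions:

```python
# Rough sketch of bulk generation with the dynamic generator.
# Model path, cache size and prompts below are placeholders.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")                     # placeholder path
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = 32768, lazy = True)  # cache size bounds the effective batch
model.load_autosplit(cache, progress = True)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
)

# Hand the generator every prompt at once; it batches, pages and deduplicates
# internally, up to whatever fits in the cache. No padding is involved.
prompts = [f"Summarize document {i}:" for i in range(1000)]      # placeholder prompts
outputs = generator.generate(prompt = prompts, max_new_tokens = 256)
```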
Oh, I see. I guess I'm just wondering if it'd be feasible to increase the batch size enough that the time it takes for a layer to finish running is roughly equal to the time it takes to transfer a layer and its associated KV cache. Then you could perform inference and data transfer simultaneously, and worry neither about a PCIe bottleneck nor about model size (in terms of the number of layers), as long as those layers fit in CPU memory, which is comparatively cheap and abundant.
It's certainly possible to do inference layer by layer on a huge batch size. In fact this script does it already, to measure the difference in hidden states between a quantized model and the unquantized version loaded layer by layer. There isn't currently a mechanism for doing so with a cache, though. And for efficiency I guess you'd need a triple-buffered approach where you have one layer of keys/values being swapped to system RAM, one being worked on by the GPU, and then a third being loaded for the next layer. And weights would need to be double-buffered.

Bulk inference with the dynamic generator is already kind of efficient, especially if you have some shared prefix for multiple sequences in a batch, or sequences of dissimilar length, but I guess this could be worth trying out. Not sure how much of a priority I could make it at the moment, though.
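To make the buffering idea concrete, here is a hypothetical sketch in plain PyTorch (not exllamav2 code): `apply_layer` is an assumed per-layer forward function, `layers_cpu` and `kv_cpu` are stand-in containers of pinned CPU tensors, and the KV write-back is left on the compute stream for simplicity, whereas a full triple-buffered pipeline would overlap it with the next layer's forward as described above.

```python
# Hypothetical sketch: overlap H2D transfer of the next layer with compute on
# the current one. NOT exllamav2 code; all names below are stand-ins.
import torch

transfer_stream = torch.cuda.Stream()   # host <-> device copies
compute_stream  = torch.cuda.Stream()   # per-layer forward passes

def fetch(layer_cpu, kv_layer_cpu):
    """Queue async H2D copies of one layer's weights and its KV buffer."""
    with torch.cuda.stream(transfer_stream):
        w  = {k: v.to("cuda", non_blocking=True) for k, v in layer_cpu.items()}
        kv = kv_layer_cpu.to("cuda", non_blocking=True)
    return w, kv

def run_layerwise(layers_cpu, kv_cpu, hidden, apply_layer):
    next_w, next_kv = fetch(layers_cpu[0], kv_cpu[0])      # prime the pipeline

    for i in range(len(layers_cpu)):
        compute_stream.wait_stream(transfer_stream)        # layer i's weights/KV are ready
        w, kv = next_w, next_kv

        if i + 1 < len(layers_cpu):                        # prefetch layer i+1 while i runs
            next_w, next_kv = fetch(layers_cpu[i + 1], kv_cpu[i + 1])

        with torch.cuda.stream(compute_stream):
            hidden = apply_layer(w, kv, hidden)            # forward for the whole huge batch
            kv_cpu[i].copy_(kv, non_blocking=True)         # swap updated keys/values back to RAM

    torch.cuda.synchronize()
    return hidden
```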
Problem
I've been working on generating completions for the LLaVA-Instruct dataset. Setting aside the need for multimodal support (which I'm jankily hacking together on my end), it got me wondering whether there are alternate decoding strategies that could take advantage of all the requests already being aggregated.
Solution
Consider implementing functionality that supports the following:
Alternatives
I know that exllamav2 already supports things like paged attention and dynamic batching. There's a good chance I'm totally overthinking this problem, and the aforementioned features address the concerns better. I just don't know if cache fragmentation is more detrimental to performance than a few tokens of padding.
Explanation
It was my understanding that using larger batch sizes generally improves throughput, but at the cost of memory usage. For ordinary chat use, layer offloading is just too slow to make sense. But for generating synthetic data, TTFT doesn't matter, so you could theoretically make the batch size significantly larger. The time per batch is higher, but the larger batch size (ideally) more than makes up for it.
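As a toy illustration of that trade-off (all numbers invented, only the shape of the curve matters): while a decoding step is dominated by streaming the weights, its latency barely changes with batch size, so throughput grows roughly linearly; once per-sequence work dominates, latency grows in proportion to batch size and throughput plateaus, which is the saturation point mentioned in the comments above.

```python
# Toy roofline-style model of decode throughput vs. batch size.
# All constants are made up; the point is only the shape of the curve.
def tokens_per_second(bsz,
                      weight_time_ms = 20.0,       # time to stream the weights once per step
                      per_seq_compute_ms = 0.05):  # incremental cost of each extra sequence
    step_ms = max(weight_time_ms, bsz * per_seq_compute_ms)
    return bsz / (step_ms / 1000)

for bsz in (1, 8, 64, 400, 500, 1000, 2000):
    print(f"bsz {bsz:5d}: ~{tokens_per_second(bsz):9.0f} tok/s")
# Throughput scales almost linearly until ~bsz 400, then flattens:
# a pass at bsz 1000 takes twice as long as one at bsz 500 for the same tok/s.
```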
Examples
No response
Additional context
No response
Acknowledgements