[WIP] Field-aware factorization machines #604
base: master
Conversation
Force-pushed from a25d455 to 50dc67b.
src/gpu/ffm/trainer.cu (outdated)
if(update) {
  expnyts[rowIdx % MAX_BLOCK_THREADS] = std::exp(-labels[rowIdx] * losses[rowIdx]);
  kappas[rowIdx % MAX_BLOCK_THREADS] = -labels[rowIdx] * expnyts[rowIdx % MAX_BLOCK_THREADS] / (1 + expnyts[rowIdx % MAX_BLOCK_THREADS]);
Is this line correct? There is a slightly different equation in the paper, but this does match the C code in libffm.
@henrygouk True, and I'm not sure - I need to double check. For now I went with the original C implementation, but I need to experiment.
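For what it's worth, the two forms look algebraically equivalent for y in {-1, +1}: multiplying the numerator and denominator of -y * exp(-y*t) / (1 + exp(-y*t)) by exp(y*t) gives -y / (1 + exp(y*t)), which is the expression in the paper. A minimal host-side sanity check (plain C++, not part of this PR, names made up):

#include <cassert>
#include <cmath>
#include <initializer_list>

int main() {
  for (double y : {-1.0, 1.0}) {
    for (double t = -5.0; t <= 5.0; t += 0.25) {
      const double expnyt = std::exp(-y * t);
      const double kappaLibffm = -y * expnyt / (1 + expnyt);  // form used in this kernel
      const double kappaPaper  = -y / (1 + std::exp(y * t));  // form from the paper
      assert(std::abs(kappaLibffm - kappaPaper) < 1e-12);
    }
  }
  return 0;
}

So the difference should only be notation (and possibly numerical behaviour for large |t|), not the value itself.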
src/gpu/ffm/trainer.cu (outdated)
const T w1gdup = (weightsPtr + idx1)[d+1] + g1 * g1;
const T w2gdup = (weightsPtr + idx2)[d+1] + g2 * g2;

(weightsPtr + idx1)[d] -= cLearningRate[0] / std::sqrt(w1gdup) * g1;
This is the AdaGrad update rule. Could be useful to look at using other update rules in future. Something like AMSGrad could potentially converge faster.
Also, this technique is using HogWild! optimisation, which does not guarantee repeatability across different runs. Is this a problem?
@henrygouk yes to both :-)
- definitely different updaters would be great; preferably abstracting this part and encapsulating it in an Updater class would be best. Can make an issue with potential candidates (rough sketch of what that could look like below)
- for the initial release I think it should be OK, but in the long run I'm pretty sure nondeterministic results will be a no-go, especially for our own use in DAI. Any suggestions how this could be fixed? Serializing it would incur a huge performance hit. Stochastic gradient descent? Not sure how viable that method would be. Anything else?
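For reference, a rough sketch of what that Updater abstraction could look like (all names hypothetical, not part of this PR): AdaGrad as currently used, plus an AMSGrad variant. The kernel would then call updater.step(...) instead of hard-coding the update rule, and the per-weight state layout would have to match whatever the kernel actually stores.

// Hypothetical sketch, not the PR's actual code.
template <typename T>
struct AdaGradUpdater {
  // gradAccum is the running sum of squared gradients stored next to the weight.
  __device__ void step(T *weight, T *gradAccum, T grad, T learningRate) const {
    *gradAccum += grad * grad;
    *weight -= learningRate / sqrt(*gradAccum) * grad;
  }
};

template <typename T>
struct AMSGradUpdater {
  T beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
  // m and v are the usual first/second moment estimates; vMax keeps the
  // denominator non-decreasing, which is what distinguishes AMSGrad from Adam.
  __device__ void step(T *weight, T *m, T *v, T *vMax, T grad, T learningRate) const {
    *m = beta1 * (*m) + (1 - beta1) * grad;
    *v = beta2 * (*v) + (1 - beta2) * grad * grad;
    *vMax = fmax(*vMax, *v);
    *weight -= learningRate * (*m) / (sqrt(*vMax) + eps);
  }
};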
src/gpu/ffm/trainer.cu (outdated)
const T v = vals[threadIdx.x] * (threadIdx.x + i < MAX_BLOCK_THREADS ? vals[threadIdx.x + i] : values[n1 + i]) * scales[rowIdx % MAX_BLOCK_THREADS];

if (update) {
In future, it might be a bit cleaner to separate the training/prediction code into different kernels.
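Just to sketch the shape of such a split (heavily simplified, hypothetical names, a plain linear score standing in for the real FFM pairwise terms): a shared __device__ helper computes the raw score, and two thin kernels reuse it instead of one kernel branching on if (update).

// Structural sketch only - not the actual FFM computation.
template <typename T>
__device__ T rowScore(const T *weights, const int *featIdx, int nNodes) {
  T t = 0;
  for (int i = 0; i < nNodes; i++) t += weights[featIdx[i]];
  return t;
}

template <typename T>
__global__ void predictKernel(const T *weights, const int *featIdx,
                              const int *rowPtr, int nRows, T *out) {
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= nRows) return;
  out[row] = rowScore(weights, featIdx + rowPtr[row], rowPtr[row + 1] - rowPtr[row]);
}

template <typename T>
__global__ void trainKernel(T *weights, const int *featIdx, const int *rowPtr,
                            const T *labels, int nRows, T learningRate) {
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= nRows) return;
  const T t = rowScore(weights, featIdx + rowPtr[row], rowPtr[row + 1] - rowPtr[row]);
  const T expnyt = exp(-labels[row] * t);
  const T kappa = -labels[row] * expnyt / (1 + expnyt);
  for (int i = rowPtr[row]; i < rowPtr[row + 1]; i++)
    atomicAdd(&weights[featIdx[i]], -learningRate * kappa);  // HogWild-style update
}

A side benefit is that the prediction path stays read-only, so the weights can be passed as const there.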
src/gpu/ffm/trainer.cu (outdated)
__syncthreads();

T loss = 0.0;
Is it accurate to call this a loss? I think this is more like the prediction, but I may be misinterpreting things.
In the original I think they call it t, not sure why - can rename. From my understanding, it is used as the loss and also used to calculate predictions.
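For reference, with the usual libffm conventions (label y in {-1, +1}, and t the raw pairwise score being accumulated here), the related quantities would be:

t    = phi_FFM(w, x)             raw model output
p    = 1 / (1 + exp(-t))         prediction
loss = log(1 + exp(-y * t))      logistic loss

so t itself is neither the loss nor the prediction, but both are one step away from it - which may be why the original just calls it t.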
src/gpu/ffm/trainer.cu (outdated)
T loss = 0.0;

for(int i = 1; n1 + i < rowSizes[rowIdx + 1] - cBatchOffset[0]; i++) {
How difficult would it be to parallelise this loop and add a reduction step afterwards? I'm guessing nontrivial, due to the different row sizes, but this is the main way I can think of to get some more parallelism out of this.
Not sure. Instead of spinning up a thread for each field:feature:value tuple (let's call it a node) and running this loop for each consecutive node in that row, we could spin up a thread for each node pair. I didn't try that approach yet as I was afraid of thread overhead.
Currently the main slowdown I'm noticing is because of:
- the number of registers used by the wTx kernel, which limits the number of concurrent blocks being run
- reads/writes from global memory when using the weights array, since that access isn't very well coalesced within blocks. An interesting experiment is to move lines 157/158:
(weightsPtr + idx1)[d+1] = w1gdup;
(weightsPtr + idx2)[d+1] = w2gdup;
Right after:
const T w1gdup = (weightsPtr + idx1)[d+1] + g1 * g1;
const T w2gdup = (weightsPtr + idx2)[d+1] + g2 * g2;
There's a visible (~10%) slowdown on relatively large data (400k rows, 39 nodes in each row). I'm assuming this is due to how CUDA loads/stores data.
Basically the major slowdown is coming from the if (update) { branch. Putting both weights and gradients into the same array helped quite a bit, but even on a 1080Ti this is only ~2-3x faster than the CPU implementation.
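For the record, a very rough sketch of the pair-per-thread idea with a block-level reduction (one block per row; pairScore() is a hypothetical stand-in for the w_{j1,f2} . w_{j2,f1} * v1 * v2 term, since the exact weights indexing depends on the layout used here):

// Hypothetical stand-in for the per-pair contribution; indexing deliberately elided.
template <typename T>
__device__ T pairScore(const T *weights, int node1, int node2);

// One block per row; BLOCK is assumed to be a power of two.
template <typename T, int BLOCK>
__global__ void wTxPairsKernel(const T *weights, const int *rowPtr, T *rowScores) {
  const int row = blockIdx.x;
  const int start = rowPtr[row];
  const int n = rowPtr[row + 1] - start;
  const int nPairs = n * (n - 1) / 2;

  __shared__ T partial[BLOCK];
  T acc = 0;

  // Each thread accumulates a strided subset of the (i, j) node pairs of this row.
  for (int p = threadIdx.x; p < nPairs; p += BLOCK) {
    // Unflatten the flat pair index p into (i, j) with i < j.
    int i = 0, rem = p;
    while (rem >= n - 1 - i) { rem -= n - 1 - i; ++i; }
    const int j = i + 1 + rem;
    acc += pairScore(weights, start + i, start + j);
  }

  // Standard shared-memory tree reduction over the block.
  partial[threadIdx.x] = acc;
  __syncthreads();
  for (int s = BLOCK / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) rowScores[row] = partial[0];
}

Whether that actually wins anything probably comes down to the same register pressure and coalescing issues as above, since the per-pair weight reads are still scattered.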
Force-pushed from 50dc67b to e053cb5.
…le builds. Build cleanup.
…older. Missing deps in runtime Docker runtime file.
…ocker make targets.
…emporarily remove OMP. Add log_loss printout.
…l slow, utilizes only ~10-15% of the GPU
* squash weights and gradients into single array for memory reads
* utilize shared memory for fields/features/values/scales as much as possible
* compute kappa once per node instead of rowSize times
So both CPU and GPU implementations are there and working. The only issue left is that GPU batch mode gives slightly different results with the same number of iterations (or converges in a much larger number of iterations) compared to GPU batch mode with batch_size=1 and the CPU mode. I'm guessing this is because we are using HOGWILD! and the order of computations during the gradient update differs (and might not be 100% correct?).
One more thing: this needs to be compared against bigger data (libffm_toy.zip) and the original C++ implementation (https://github.com/guestwalk/libffm - not the Python API). I think the GPU version was getting slightly different results, so it needs double-checking before merging.
Initial implementation of field-aware factorization machines.
Based on these 2 whitepapers:
And the following repositories:
Currently there is only the initial GPU implementation, as the CPU version will most probably just be a copy of the original implementation (without the SSE alignments for now).
No benchmarks so far as there's still something wrong (getting different results).
Things still to be done:
* wTx (in trainer.cu) - probably can be rewritten in a more GPU friendly manner

If anyone wants to take it for a spin:
The input format is a list of lists containing fieldIdx:featureIdx:value tuples and a corresponding list of labels (0 or 1) for each row.
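For illustration only (all indices made up), a single row with three active nodes and its label could look like:

row:   [0:3:1.0, 1:7:0.5, 2:42:1.0]
label: 1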