Update trainer.py #20473

TheMGGdev · 2024-11-08T17:18:51Z

Clearing the previous trainer metrics each time a model is compiled also erases the compiled state of other models that are a combination of that model. This causes problems in architectures such as GANs, as explained in the issue:

google-cla · 2024-11-08T17:18:55Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

TheMGGdev · 2024-11-08T17:23:52Z

The issue is: #20474

codecov-commenter · 2024-11-08T17:25:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.94%. Comparing base (8409e18) to head (848fdde).

❗ There is a different number of reports uploaded between BASE (8409e18) and HEAD (848fdde). Click for more details.

HEAD has 6 uploads less than BASE

Flag	BASE (8409e18)	HEAD (848fdde)
keras	4	1
keras-torch	1	0
keras-tensorflow	1	0
keras-jax	1	0

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #20473       +/-   ##
===========================================
- Coverage   82.07%   59.94%   -22.14%     
===========================================
  Files         515      515               
  Lines       47416    47415        -1     
  Branches     7439     7439               
===========================================
- Hits        38919    28421    -10498     
- Misses       6691    17246    +10555     
+ Partials     1806     1748       -58

Flag	Coverage Δ
keras	`59.94% <ø> (-22.00%)`	⬇️
keras-jax	`?`
keras-numpy	`59.94% <ø> (-0.03%)`	⬇️
keras-tensorflow	`?`
keras-torch	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.


fchollet · 2024-11-09T20:05:55Z

Thanks for the PR!

@hertschuh, @jeffcarp, do you remember why this line was inserted? I believe it was intended to fix a bug, however removing it does not seem to break any test.

If we do remove the line, then we need a test for the use case it was breaking.

hertschuh · 2024-11-10T02:59:36Z

> Thanks for the PR!
>
> @hertschuh, @jeffcarp, do you remember why this line was inserted? I believe it was intended to fix a bug, however removing it does not seem to break any test.
>
> If we do remove the line, then we need a test for the use case it was breaking.

I don't have any context on this.

This is the PR that added this line to fix some bug: #20197

cc @james77777778

james77777778 · 2024-11-10T08:16:44Z

@fchollet @hertschuh
I'm looking into it and will report back soon.

james77777778 · 2024-11-10T15:04:26Z

@fchollet
I believe the change in this PR should be fine. The drawback is that there might be some unused metrics in the sub-trainers.

Previously, that line was necessary to prevent an error caused by the nested Trainer architecture.
Here's a reproducible snippet:

import keras

train_images = keras.random.uniform(shape=(32, 784))
train_labels = keras.random.randint(shape=(32, 1), minval=0, maxval=9)

trainer1 = keras.Sequential(
    [keras.layers.Input(shape=(784,)), keras.layers.Dense(1, activation="relu")]
)
trainer1.compile(loss="mse", metrics=["mse"])  # Not calling `build` for `trainer1`'s `CompileLoss`

# Create another model.
inputs = keras.Input(shape=(784,))
x = trainer1(inputs)
outputs = keras.layers.Dense(10)(x)
trainer2 = keras.Model(inputs=inputs, outputs=outputs)
trainer2.compile(loss="binary_crossentropy", metrics=["accuracy"])
trainer2.fit(
    train_images, keras.utils.to_categorical(train_labels, 10), epochs=2
)
# `fit` might fail because `trainer1`'s `CompileLoss` is not built

However, there is logic that skips the metrics for the sub-trainer:

keras/keras/src/trainers/trainer.py, lines 264 to 267 in 8deee17:

            if isinstance(layer, Trainer):
                # All Trainer-related metrics in sublayers should be ignored
                # because a new Trainer has been instantiated.
                continue

This allows us to keep the metrics in the sub-trainer without encountering the unbuilt issue.

We might want to consider adding a test similar to the reproduction snippet above to prevent future breakages.
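To illustrate the skip logic quoted from trainer.py, here is a minimal pure-Python sketch. The `Layer` and `Trainer` classes, `flatten_layers`, and `collect_metrics` are toy stand-ins invented for this illustration, not the Keras implementation. The sketch only assumes that metric collection walks every nested sublayer and skips any sublayer that is itself a trainer.

```python
class Layer:
    """Toy layer: holds sublayers and the metric names it owns directly."""

    def __init__(self, name, sublayers=(), metrics=()):
        self.name = name
        self.sublayers = list(sublayers)
        self.metrics = list(metrics)

    def flatten_layers(self):
        """Yield every nested sublayer depth first, excluding self."""
        for layer in self.sublayers:
            yield layer
            yield from layer.flatten_layers()


class Trainer(Layer):
    """Toy trainer: a layer that also owns compile-time metrics."""

    def __init__(self, name, sublayers=(), compile_metrics=()):
        super().__init__(name, sublayers)
        self.compile_metrics = list(compile_metrics)

    def collect_metrics(self):
        collected = list(self.compile_metrics)
        for layer in self.flatten_layers():
            if isinstance(layer, Trainer):
                # Ignore metrics owned by a nested trainer's own compile state.
                continue
            collected.extend(layer.metrics)
        return collected


dense = Layer("dense", metrics=["dense_activity"])
inner = Trainer("inner", sublayers=[dense], compile_metrics=["mse"])
head = Layer("head")
outer = Trainer("outer", sublayers=[inner, head], compile_metrics=["accuracy"])

print(outer.collect_metrics())  # ['accuracy', 'dense_activity']
print(inner.collect_metrics())  # ['mse', 'dense_activity']
```

In this toy model the outer trainer never reports the inner trainer's `mse` metric, while the inner trainer still keeps it for its own use.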

fchollet · 2024-11-10T18:23:39Z

Thank you, @james77777778 and @hertschuh !

@TheMGGdev are you able to add a basic, minimal unit test for your use case, so we avoid breaking it in the future?

We should also include a test based on @james77777778's snippet above.

TheMGGdev · 2024-11-12T09:31:18Z

I don't know if this is exactly what you are asking for. Here is a simplified example of issue #20474. In Keras 3.6 it raises an error, but in Keras 3 it works. The issue explains the error in more detail, along with where I think it comes from. This is the code with the prints for debugging:

import numpy as np

from keras import __version__
from keras.models import Sequential, Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

print(__version__)

model_1 = Sequential([
            Input(shape = (100, )),
            Dense(100, activation = "sigmoid"),
        ],)
model_2 = Sequential([
            Input(shape = (100, )),
            Dense(80, activation = "sigmoid"),
        ],)

model_1.compile(loss = 'binary_crossentropy', optimizer = Adam(), metrics = ['accuracy'])

###Print for debugging/show the error 
print('---Debugging/show the error after compiled model_1 and before compiled combined---')
print('model_1.compiled ->', model_1.compiled)
print('model_1.metrics ->', model_1.metrics)
print('model_1._loss_tracker ->', model_1._loss_tracker)
###

combined = Model(Input(shape=(100,)), model_2(model_1(Input(shape=(100,)))))
combined.compile(loss = 'binary_crossentropy', optimizer = Adam())

###Print for debugging/show the error
print('---Debugging/show the error after compiled model_1 and combined---')
print('model_1.compiled ->', model_1.compiled)
print('model_1.metrics ->', model_1.metrics)
print('model_1._loss_tracker ->', model_1._loss_tracker)
###

model_1.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100)))
combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))

And this is the code without the prints for the test:

import numpy as np

from keras import __version__
from keras.models import Sequential, Model
from keras.layers import Input, Dense
from keras.optimizers import Adam

print(__version__)

model_1 = Sequential([
            Input(shape = (100, )),
            Dense(100, activation = "sigmoid"),
        ],)
model_2 = Sequential([
            Input(shape = (100, )),
            Dense(80, activation = "sigmoid"),
        ],)

model_1.compile(loss = 'binary_crossentropy', optimizer = Adam(), metrics = ['accuracy'])

combined = Model(Input(shape=(100,)), model_2(model_1(Input(shape=(100,)))))
combined.compile(loss = 'binary_crossentropy', optimizer = Adam())

model_1.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 100)))
combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))

This is the output in Keras 3.6:

TheMGGdev · 2024-11-12T09:38:22Z

I think the solution that works for both PRs is to clear only the layers of models that do not have compiled == True, that is, the ones that have not been compiled yet. This way it should work for both cases.
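As a rough illustration of that idea, the sketch below uses a hypothetical `ToyModel` class. It is not Keras code, and the `skip_compiled_submodels` flag is invented purely to compare the two behaviours. It assumes that compiling an outer model resets the loss tracker of its submodels, and shows how a guard on `compiled` would leave an already compiled submodel intact.

```python
class ToyModel:
    """Hypothetical stand-in for a model; not the Keras Trainer."""

    def __init__(self, name, submodels=()):
        self.name = name
        self.submodels = list(submodels)
        self.compiled = False
        self.loss_tracker = None

    def compile(self, skip_compiled_submodels):
        for sub in self.submodels:
            if skip_compiled_submodels and sub.compiled:
                # Proposed guard: leave already compiled submodels untouched.
                continue
            sub.loss_tracker = None  # clear stale trainer state
        self.loss_tracker = f"{self.name}_loss"
        self.compiled = True


def build(skip_compiled_submodels):
    """Compile model_1 first, then a combined model that contains it."""
    model_1 = ToyModel("model_1")
    model_1.compile(skip_compiled_submodels)
    combined = ToyModel("combined", submodels=[model_1])
    combined.compile(skip_compiled_submodels)
    return model_1, combined


unguarded, _ = build(skip_compiled_submodels=False)
guarded, _ = build(skip_compiled_submodels=True)

print(unguarded.loss_tracker)  # None
print(guarded.loss_tracker)    # model_1_loss
```

Without the guard, the toy `model_1` loses its loss tracker once `combined` is compiled, which mirrors the missing `_loss_tracker` symptom reported in the issue. With the guard it keeps its state.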

TheMGGdev · 2024-11-27T11:02:35Z

Any update on this pull request?

fchollet · 2024-11-27T17:00:55Z

Are you able to provide a unit test?

TheMGGdev · 2024-11-27T17:04:46Z

What exactly are you asking for? I guess what I already sent doesn't work, does it?

fchollet · 2024-11-27T17:15:06Z

The code snippet you posted (2nd one) can be turned into an appropriate unit test I think.

mattdangerw · 2024-11-27T20:55:34Z

@TheMGGdev Francois is just asking you to adapt the code above into a unit test: an addition to a *_test.py file that will run automatically against every future code change. That's a general expectation for us with bug fixes, so we don't break ourselves in the future. See this recent PR as an arbitrary example: #20550

Are you able to add a unit test here?

Surya2k1 · 2024-12-03T09:10:45Z

> And this is the code without the prints for the test:

It seems the code snippet breaks in 3.7.0 with the proposed change.

3.7.0
2024-12-03 14:35:41.341850: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "D:\OSS\Keras\keras\proposed_fix(test).py", line 89, in <module>
    multi_compile_test()
  File "D:\OSS\Keras\keras\proposed_fix(test).py", line 87, in multi_compile_test
    combined.train_on_batch(np.random.normal(0, 1, (64, 100)), np.random.normal(0, 1, (64, 80)))
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 598, in train_on_batch
    logs = self.train_function(data())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 224, in function
    outputs = one_step_on_data(data)
              ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\.venv\Lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 110, in one_step_on_data
    outputs = self.distribute_strategy.run(step_function, args=(data,))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\backend\tensorflow\trainer.py", line 56, in train_step
    y_pred = self(x, training=True)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\utils\traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
        ^^^^^^^^^^^
  File "D:\OSS\Keras\keras\keras\src\ops\function.py", line 179, in _run_through_graph
    output_tensors.append(tensor_dict[id(x)])
                          ~~~~~~~~~~~^^^^^^^
KeyError: 'Exception encountered when calling Functional.call().\n\n\x1b[1m2425813381776\x1b[0m\n\nArguments received by Functional.call():\n  • inputs=tf.Tensor(shape=(64, 100), dtype=float64)\n  • training=True\n  • mask=None'

Update trainer.py

848fdde

Clearing the previous trainer metrics each time a model is compiled also erases the compiled state of other models that are a combination of that model. This causes problems in architectures such as GANs, as explained in the issue:

google-ml-butler bot added the size:XS label Nov 8, 2024

google-ml-butler bot assigned gbaned Nov 8, 2024

TheMGGdev mentioned this pull request Nov 8, 2024

NO _loss_tracker on train_on_batch because compile model multiple times. Possible Bug. #20474

Closed

gbaned requested a review from fchollet November 19, 2024 06:32

google-ml-butler bot added the awaiting review label Nov 19, 2024

james77777778 mentioned this pull request Dec 6, 2024

Fix the issue when using Model.compile multiple times #20602

Merged

Surya2k1 mentioned this pull request Dec 23, 2024

Fix issue with nested model output as input #20678

Open
