
Fix memory leak by removing custom __copy__ logic #1227

Open · wants to merge 2 commits into main
Conversation

nicoloesch
Contributor

Fixes #1222.

Description

This change removes the custom __copy__ logic in torchio.Subject. As a result, calling copy.copy() on a torchio.Subject will no longer create a mixture of shallow and deep copies. Therefore, to

  • create a shallow copy, utilise copy.copy(subject)
  • create a deep copy, utilise copy.deepcopy(subject)

Removing this mix of deep and shallow copies from the shallow-copy method (__copy__) appears to benefit memory management in Python: the garbage collector can now correctly free torchio.Subject instances, and their images, once they are no longer referenced.
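A minimal sketch of the intended copy semantics after this change (a tensor-backed subject is assumed; the assertions reflect standard Python copy behaviour, which this PR restores):

import copy

import torch
import torchio as tio

# Tensor-backed subject; the shape is arbitrary.
subject = tio.Subject(image=tio.ScalarImage(tensor=torch.rand(1, 8, 8, 8)))

shallow = copy.copy(subject)   # plain shallow copy: shares the Image objects
deep = copy.deepcopy(subject)  # independent copy of images and metadata

assert shallow['image'] is subject['image']
assert deep['image'] is not subject['image']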

This required changing copy.copy to copy.deepcopy in the torchio.Transform base class so that the original subject is not modified; instead, a copy is created and the transform is applied to that copy (only if copy=True was passed when instantiating the transform). I assume this is the expected behaviour based on the provided documentation and code logic.
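For context, this is roughly the shape of the affected logic in the Transform base class (paraphrased for illustration; not the actual torchio source, and apply_transform is implemented by subclasses):

import copy

class Transform:
    def __init__(self, copy=True):
        self.copy = copy

    def __call__(self, subject):
        if self.copy:
            # Was copy.copy(subject) before this PR; a full deep copy avoids
            # the mixed shallow/deep semantics of the removed __copy__.
            subject = copy.deepcopy(subject)
        return self.apply_transform(subject)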

Checklist

  • I have read the CONTRIBUTING docs and have a developer setup (especially important are pre-commit and pytest)
  • Non-breaking change (would not break existing functionality)
  • Breaking change (would cause existing functionality to change)
  • Tests added or modified to cover the changes
  • Integration tests passed locally by running pytest
  • In-line docstrings updated
  • Documentation updated, tested running make html inside the docs/ folder
  • This pull request is ready to be reviewed

Important Notes

The absence of the memory leak has so far been verified only on my local machine, as outlined in my comment on the original issue. Reviewers should verify it independently on their machines by following the instructions in the linked comment.


codecov bot commented Oct 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.75%. Comparing base (cfc8aaf) to head (f8335ef).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1227      +/-   ##
==========================================
- Coverage   87.39%   84.75%   -2.65%     
==========================================
  Files          92       92              
  Lines        6023     6007      -16     
==========================================
- Hits         5264     5091     -173     
- Misses        759      916     +157     


@nicoloesch
Contributor Author

Update: I am able to run my models with roughly 6GB of memory usage instead of 70-100GB. It appears the change fixed the memory leak.

@romainVala
Contributor

Hi,
I confirm that using this PR does indeed remove the memory increase observed in this example.

@romainVala
Contributor

Not sure this is related to this issue, but I now wonder what the copy argument of a transform is intended for. I thought that copy=False would apply the transform directly to the input subject, but:

import torchio as tio

suj = tio.datasets.Colin27()
tr = tio.RandomAffine(copy=False)
sujt = tr(suj)

# Even though copy=False, sujt is a new instance: suj and sujt
# are different objects and hold different data.

suj.history
[]
sujt.history
[Affine(scales=(tensor(0.9252,  ...


Review thread on tests/data/test_subject.py (outdated, resolved)
@nicoloesch
Contributor Author

@romainVala I assume the copy argument exists to make a deep copy of the subject and operate only on that copy, while the initially loaded subject stays unchanged. I think one use case for copy=True could be a Subject containing Image elements that are initialised with a tensor as opposed to an image path (see the sketch below). However, I was also wondering whether we should default this argument to False instead, since we re-load the individual every time we iterate over the DataLoader if we provide a path argument - at least that is my understanding. What are your thoughts @fepegar?
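A hypothetical illustration of the tensor-backed case (RandomNoise chosen arbitrarily; this assumes the copy flag behaves as documented): the in-memory tensor is the only copy of the data, so copy=True is what keeps the original intact.

import torch
import torchio as tio

tensor = torch.rand(1, 16, 16, 16)
subject = tio.Subject(image=tio.ScalarImage(tensor=tensor))

# With copy=True (the default), the transform works on a copy, so the
# original in-memory data survives; there is no file on disk to reload.
transformed = tio.RandomNoise(copy=True)(subject)
assert torch.equal(subject.image.data, tensor)
# transformed.image.data now holds the noisy version.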

@fepegar
Owner

fepegar commented Oct 24, 2024

we would re-load the individual every time we iterate over the DataLoader if we provide a path argument

Is this not what you would expect? Otherwise, you might run out of memory if your dataset is not small.

@nicoloesch
Contributor Author

Is this not what you would expect? Otherwise, you might run out of memory if your dataset is not small.

This is exactly what I expect, but wouldn't the copy then be entirely obsolete? We are re-loading it from disk every time, so we always have the initial state available. What is the point of protecting (through copying) the initial subject if we never store it back, i.e. never change the image on disk?

@romainVala
Contributor

romainVala commented Oct 25, 2024

This was exactly my point; I do not see how the copy argument changes the behaviour.

If I test with a tensor, the behaviour is the same:

import torchio as tio, numpy as np, torch

img = tio.ScalarImage(tensor=torch.rand((1,10,10,10)), affine=np.eye(4))
suj = tio.Subject({'image1':img})
tr = tio.RandomAffine(copy=False)
sujt = tr(suj)

suj.image1.data[0,0,0,:4]
Out[16]: tensor([0.7858, 0.4274, 0.7144, 0.6861])

sujt.image1.data[0,0,0,:4]
Out[17]: tensor([0.7443, 0.5312, 0.6450, 0.3963])

So the original subject and the transformed one are still distinct despite the copy=False argument.

But wait, I found out where it comes from: RandomAffine creates another transform, Affine, but forgets to pass the copy argument, so the actual transform here runs with copy=True!

If I change the line

transform = Affine(**self.add_include_exclude(arguments))

to

transform = Affine(**self.add_include_exclude(arguments), copy=self.copy)

then

tr = tio.RandomAffine(copy=False)

now works as expected (i.e. suj and sujt are the same object with identical data!).

So this is indeed related to issue #1223.

Note that it now also works for my first example (with subject Colin), so the copy argument acts the same whether or not one uses a tensor as input. The mechanism can be reproduced outside torchio, as the sketch below shows.
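A self-contained sketch with hypothetical classes (not the torchio source) of why a wrapper transform must forward its copy flag to the transform it delegates to:

import copy

class Inner:
    # Stands in for a deterministic transform such as Affine.
    def __init__(self, copy=True):
        self.copy = copy

    def __call__(self, subject):
        if self.copy:
            subject = copy.deepcopy(subject)
        subject['value'] = subject.get('value', 0) + 1  # mutate the subject
        return subject

class Outer:
    # Stands in for a random transform such as RandomAffine.
    def __init__(self, copy=True):
        self.copy = copy

    def __call__(self, subject):
        # The bug: writing Inner() here without copy=self.copy silently
        # reinstates the inner default of copy=True.
        inner = Inner(copy=self.copy)
        return inner(subject)

suj = {'value': 0}
sujt = Outer(copy=False)(suj)
assert sujt is suj  # with the flag forwarded, the transform acts in place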

@@ -36,7 +36,7 @@
-class Slicer(_RawSubjectCopySubject):
+class Slicer(Subject):
justusschock (Contributor)


nit: this changes behaviour where it's not necessary.

Before, a copy of this class was of type Subject; now it's an instance of the subclass.

It's fine by me in general; I just remember these mechanics being required like that in the past. Thoughts here, @fepegar?

Contributor Author


@justusschock Thanks for taking a look at the PR!

The _RawSubjectCopySubject inherits from Subject and implements only the following:

class _RawSubjectCopySubject(Subject):
    def __copy__(self):
        return _subject_copy_helper(self, Subject)

I removed the entire class as

  1. It already implemented exactly what Subject did, i.e. a custom __copy__, with the exact same logic using _subject_copy_helper
  2. As discussed in the entire issue, the custom __copy__ with a mix of deep and shallow copies resulted in a memory leak

Based on the logic of the custom __copy__ method, it appears we wanted to protect the Image instances, and ultimately the mutable tensors inside them, by manually creating a deep copy just for Image while keeping all other attributes as shallow copies, regardless of whether they are mutable. I figured that mixing both inside the shallow-copy dunder method most likely caused the memory leak, as shown in #1222. To create a deep copy we can simply rely on the regular copy.deepcopy without any custom logic, particularly since both Subject and Image are dict subclasses, so each of their key-value pairs is copied.
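For reference, a paraphrase of the removed helper based on the description above (simplified; not the exact torchio source):

import copy

import torchio as tio

def _subject_copy_helper(old_subject, new_class):
    # Deep copy only the images, shallow copy everything else: the mixed
    # semantics removed by this PR.
    attributes = {}
    for key, value in old_subject.items():
        if isinstance(value, tio.Image):
            attributes[key] = copy.deepcopy(value)
        else:
            attributes[key] = copy.copy(value)
    return new_class(attributes)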

To verify this, I implemented two new tests covering shallow and deep copies; both pass as expected. In addition, everywhere we were creating a shallow copy of the subject, I changed it to a deep copy to properly protect it.

Whether we should have copy=True as the default argument to Transform might be a different discussion: we never change a tensor (an Image) on disk, so we never really change it permanently anyway, which might make the copy itself redundant.

Contributor


Whether we should have copy=True as the default argument to Transform might be a different discussion: we never change a tensor (an Image) on disk, so we never really change it permanently anyway, which might make the copy itself redundant.

I do not fully understand this argument: of course we do not want to change the file on disk. The issue is only about the Subject variable in a Python script. I advocate for keeping copy=True as the default value, because most of the time when we write
subject_transformed = transform(subject)
we expect to have two different instances of Subject (unless we explicitly ask for copy=False)
... no?

Contributor Author


Well, I figured that the copy of the Subject is in place to protect the initial state of the Subject prior to any modification. I am specifically referring to any Transform that may alter the content of the Subject, and more precisely the Image, which is usually loaded from disk (multiple nii.gz or similar medical images). However, every modification is only temporary, as we never modify the true origin of any Subject; the original image on disk is never replaced in the code unless the user actively calls Image.save. The question I had was: why protect something that we never permanently change? We always reload the ground truth from disk whenever we access any Image/Subject. So why have copy=True as a default that may incur a memory leak, as shown in #1222, without any real purpose?

Contributor

romainVala commented Oct 29, 2024


Not sure I follow you here...

The real purpose of copy=True is not to protect the file on disk but to protect the loaded subject, so that you can write

# with transform being any torchio transform
subject_transformed = transform(subject)

With copy=True as the default (in the transform), subject_transformed and subject then have two different contents.

This is the default behaviour we want to keep, and therefore I think it is important to keep copy=True by default.

Contributor Author

nicoloesch commented Oct 29, 2024


Maybe I have an issue in my thinking. Could you explain to me why it is important to copy a Subject by default if we never save the initial subject and never return it, which means it will eventually just be deleted by the garbage collector? We return only the modified/transformed Subject, forget about the initial one, and keep progressing through the code. Why not do the tensor operations in place? Why "protect" something we no longer care about?

I am also not entirely sure the discussion about the default value of copy belongs in the current issue. As long as #1223 is fixed, one can simply pass copy=False regardless. I was just wondering about the default value and why it is needed, as described above...

Contributor

romainVala commented Oct 30, 2024


OK, this is probably just a matter of the exact use case. I can imagine cases where we need to keep the initial state of a subject's images after the transform in order to access it later in the Python script (without reloading from disk), for instance if one wishes to compare the image before and after a motion artefact.
So my point was just that copy=True is not useless.

That being said, I am not sure whether it should be the default...

Anyway, we do agree that it is important to make the transform comply with the user's choice (copy=True or False), so #1223 does indeed need to be fixed.

@fepegar
Owner

fepegar commented Nov 16, 2024

@nicoloesch I'm assuming #1228 needs to be merged before this?
@romainVala are you happy with the changes in this PR?

Sorry folks, I'm a bit busy these days.

@nicoloesch
Contributor Author

@fepegar please excuse my delayed response. I think the order of merging the two PRs should not matter, as they touch different parts of the code base. The only file they share is transform.py: this PR changes how __call__ behaves in terms of making a shallow/deep copy, whereas #1228 touches the arguments of specific transforms during apply_transform when re-using an already instantiated transform.
