feat: get_tree_diff to allow 'aggregate' parameter
kayjan committed Nov 6, 2024
1 parent 41154a0 commit be4779f
Showing 5 changed files with 404 additions and 190 deletions.
9 changes: 6 additions & 3 deletions CHANGELOG.md
@@ -10,13 +10,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added:
- Tree Export: Print tree to allow alias.
- Tree Export: Mermaid diagram to include theme.
### Fixed:
- Misc: Doctest for docstrings, docstring to indicate usage prefers `node_name` to `name`.
- Tree Helper: Get tree diff to take in `aggregate` parameter to indicate differences at the top-level node.
- Misc: Documentation to include tips and tricks on working with custom classes.
### Changed:
- Misc: Docstring to indicate usage prefers `node_name` to `name`.
- Misc: Standardise testing fixtures.
### Fixed:
- Misc: Polars set up to work on laptop with M1 chip.
- Tree Export: Mermaid diagram title to add newline.
- Tree Helper: Get tree diff string replacement bug when the path change is substring of another path.
- Tree Export: Polars unit test to work with old (<=1.9.0) and new polars version.
- Tree Helper: Get tree diff string replacement bug when the path change is substring of another path.

## [0.22.1] - 2024-11-03
### Added:
160 changes: 105 additions & 55 deletions bigtree/tree/helper.py
@@ -250,6 +250,7 @@ def get_tree_diff(
other_tree: node.Node,
only_diff: bool = True,
detail: bool = False,
aggregate: bool = False,
attr_list: List[str] = [],
fallback_sep: str = "/",
) -> node.Node:
@@ -267,6 +268,9 @@ def get_tree_diff(
If `detail=True`, (added) and (moved to) will be used instead of (+), (removed) and (moved from)
will be used instead of (-).
If `aggregate=True`, differences (+)/(added)/(moved to) and (-)/(removed)/(moved from) will only be indicated at
the parent-level. This is useful when a subtree is shifted and we want the differences to be shown only at the top node.
!!! note
- tree and other_tree must have the same `sep` symbol, otherwise this will raise ValueError
@@ -276,50 +280,79 @@ def get_tree_diff(
Examples:
>>> # Create original tree
>>> from bigtree import Node, get_tree_diff, list_to_tree
>>> root = list_to_tree(["Downloads/Pictures/photo1.jpg", "Downloads/file1.doc", "Downloads/photo2.jpg"])
>>> root = list_to_tree(["Downloads/Pictures/photo1.jpg", "Downloads/file1.doc", "Downloads/Trip/photo2.jpg"])
>>> root.show()
Downloads
├── Pictures
│ └── photo1.jpg
├── file1.doc
└── photo2.jpg
└── Trip
└── photo2.jpg
>>> # Create other tree
>>> root_other = list_to_tree(["Downloads/Pictures/photo1.jpg", "Downloads/Pictures/photo2.jpg", "Downloads/file1.doc"])
>>> root_other = list_to_tree(
... ["Downloads/Pictures/photo1.jpg", "Downloads/Pictures/Trip/photo2.jpg", "Downloads/file1.doc", "Downloads/file2.doc"]
... )
>>> root_other.show()
Downloads
├── Pictures
│ ├── photo1.jpg
│ └── photo2.jpg
└── file1.doc
│ └── Trip
│ └── photo2.jpg
├── file1.doc
└── file2.doc
>>> # Get tree differences
# Get tree differences
>>> tree_diff = get_tree_diff(root, root_other)
>>> tree_diff.show()
Downloads
├── Pictures
│ └── photo2.jpg (+)
└── photo2.jpg (-)
│ └── Trip (+)
│ └── photo2.jpg (+)
├── Trip (-)
│ └── photo2.jpg (-)
└── file2.doc (+)
>>> # Get tree differences - all differences
>>> tree_diff = get_tree_diff(root, root_other, only_diff=False)
>>> tree_diff.show()
Downloads
├── Pictures
│ ├── photo1.jpg
│ └── photo2.jpg (+)
│ ├── Trip (+)
│ │ └── photo2.jpg (+)
│ └── photo1.jpg
├── Trip (-)
│ └── photo2.jpg (-)
├── file1.doc
└── photo2.jpg (-)
└── file2.doc (+)
>>> # Get tree differences - all differences with details
>>> tree_diff = get_tree_diff(root, root_other, only_diff=False, detail=True)
>>> tree_diff.show()
Downloads
├── Pictures
│ ├── photo1.jpg
│ └── photo2.jpg (moved to)
│ ├── Trip (moved to)
│ │ └── photo2.jpg (moved to)
│ └── photo1.jpg
├── Trip (moved from)
│ └── photo2.jpg (moved from)
├── file1.doc
└── photo2.jpg (moved from)
└── file2.doc (added)
Comparing tree attributes
>>> # Get tree differences - all differences with details on aggregated level
>>> tree_diff = get_tree_diff(root, root_other, only_diff=False, detail=True, aggregate=True)
>>> tree_diff.show()
Downloads
├── Pictures
│ ├── Trip (moved to)
│ │ └── photo2.jpg
│ └── photo1.jpg
├── Trip (moved from)
│ └── photo2.jpg
├── file1.doc
└── file2.doc (added)
# Comparing tree attributes
- (~) will be added to node name if there are differences in tree attributes defined in `attr_list`.
- The node's attributes will be a list of [value in `tree`, value in `other_tree`]
@@ -361,6 +394,7 @@ def get_tree_diff(
other_tree (Node): tree to be compared with
only_diff (bool): indicator to show all nodes or only nodes that are different (+/-), defaults to True
detail (bool): indicator to differentiate between different types of diff e.g., added or removed or moved
aggregate (bool): indicator to only add difference indicator to parent-level e.g., when shifting subtrees
attr_list (List[str]): tree attributes to check for difference, defaults to empty list
fallback_sep (str): sep to fall back to if tree and other_tree have sep that clashes with symbols "+" / "-" / "~".
All node names in tree and other_tree should not contain this fallback_sep, defaults to "/"
@@ -383,6 +417,7 @@ def get_tree_diff(

name_col = "name"
path_col = "PATH"
parent_col = "PARENT"
indicator_col = "Exists"
tree_sep = tree.sep

@@ -391,26 +426,34 @@ def get_tree_diff(
_tree,
name_col=name_col,
path_col=path_col,
parent_col=parent_col,
attr_dict={k: k for k in attr_list},
)
for _tree in (tree, other_tree)
)

# Check tree structure difference
data_both = data[[path_col, name_col] + attr_list].merge(
data_other[[path_col, name_col] + attr_list],
data_both = data[[path_col, name_col, parent_col] + attr_list].merge(
data_other[[path_col, name_col, parent_col] + attr_list],
how="outer",
on=[path_col, name_col],
on=[path_col, name_col, parent_col],
indicator=indicator_col,
)
if aggregate:
data_both_agg = data_both[
(data_both[indicator_col] == "left_only")
| (data_both[indicator_col] == "right_only")
].drop_duplicates(subset=[name_col, parent_col], keep=False)
else:
data_both_agg = data_both

# Handle tree structure difference
nodes_removed = list(data_both[data_both[indicator_col] == "left_only"][path_col])[
::-1
]
nodes_added = list(data_both[data_both[indicator_col] == "right_only"][path_col])[
::-1
]
nodes_removed = list(
data_both_agg[data_both_agg[indicator_col] == "left_only"][path_col]
)[::-1]
nodes_added = list(
data_both_agg[data_both_agg[indicator_col] == "right_only"][path_col]
)[::-1]

moved_from_indicator: List[bool] = [True for _ in range(len(nodes_removed))]
moved_to_indicator: List[bool] = [True for _ in range(len(nodes_added))]
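
The `aggregate` branch above keeps only the rows that exist in one tree and whose (name, parent) pair is not repeated on the other side. The commit does this with a pandas outer merge followed by `drop_duplicates(keep=False)`. The sketch below reproduces the same idea with plain sets and a `Counter`, so it is not the library's code. The row tuples, the empty-string parent for the root, and the helper `top_level` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical flattened trees: one (path, name, parent_name) row per node.
# The root's parent is written as "" here purely as a placeholder.
tree_rows = {
    ("/Downloads", "Downloads", ""),
    ("/Downloads/Pictures", "Pictures", "Downloads"),
    ("/Downloads/Trip", "Trip", "Downloads"),
    ("/Downloads/Trip/photo2.jpg", "photo2.jpg", "Trip"),
}
other_rows = {
    ("/Downloads", "Downloads", ""),
    ("/Downloads/Pictures", "Pictures", "Downloads"),
    ("/Downloads/Pictures/Trip", "Trip", "Pictures"),
    ("/Downloads/Pictures/Trip/photo2.jpg", "photo2.jpg", "Trip"),
    ("/Downloads/file2.doc", "file2.doc", "Downloads"),
}

# Equivalent of the "left_only" / "right_only" merge indicator.
left_only = tree_rows - other_rows
right_only = other_rows - tree_rows

# Count each (name, parent) pair across all differing rows. A pair seen more
# than once belongs to a node that kept its parent while an ancestor moved,
# so it is dropped entirely (the effect of keep=False).
pair_counts = Counter((name, parent) for _, name, parent in left_only | right_only)


def top_level(rows):
    """Return the paths whose (name, parent) pair is unique among the differences."""
    return sorted(path for path, name, parent in rows if pair_counts[(name, parent)] == 1)


nodes_removed = top_level(left_only)
nodes_added = top_level(right_only)
print(nodes_removed)  # ['/Downloads/Trip']
print(nodes_added)    # ['/Downloads/Pictures/Trip', '/Downloads/file2.doc']
```

Only the top node of the shifted `Trip` subtree survives on each side, which matches the aggregated docstring output. Because the comparison uses the parent's name rather than its full path, two unrelated nodes that share both a name and a parent name would also be dropped; whether that matters depends on the trees being compared.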
@@ -432,8 +475,8 @@ def get_tree_diff(

def add_suffix_to_path(
_data: pd.DataFrame, _condition: pd.Series, _original_name: str, _suffix: str
) -> pd.DataFrame:
"""Add suffix to path string
) -> None:
"""Add suffix to path string, in-place
Args:
_data (pd.DataFrame): original data with path column
@@ -446,35 +489,42 @@ def add_suffix_to_path(
"""
_data.iloc[_condition.values, _data.columns.get_loc(path_col)] = _data.iloc[
_condition.values, _data.columns.get_loc(path_col)
].str.replace(_original_name, f"{_original_name} ({suffix})", regex=True)
return _data

for node_removed, move_indicator in zip(nodes_removed, moved_from_indicator):
if not detail:
suffix = "-"
elif move_indicator:
suffix = "moved from"
else:
suffix = "removed"
condition_node_removed = data_both[path_col].str.endswith(
node_removed
) | data_both[path_col].str.contains(node_removed + tree_sep)
data_both = add_suffix_to_path(
data_both, condition_node_removed, node_removed, suffix
)
for node_added, move_indicator in zip(nodes_added, moved_to_indicator):
if not detail:
suffix = "+"
elif move_indicator:
suffix = "moved to"
else:
suffix = "added"
condition_node_added = data_both[path_col].str.endswith(node_added) | data_both[
path_col
].str.contains(node_added + tree_sep)
data_both = add_suffix_to_path(
data_both, condition_node_added, node_added, suffix
)
].str.replace(_original_name, f"{_original_name} ({_suffix})", regex=True)

def add_suffix_to_data(
_data: pd.DataFrame,
nodes_diff: List[str],
move_indicator: List[bool],
suffix_general: str,
suffix_move: str,
suffix_not_moved: str,
) -> None:
"""Add suffix to data, in-place
Args:
_data (pd.DataFrame): original data with path column
nodes_diff (List[str]): list of paths that were modified (e.g., added/removed)
move_indicator (List[bool]): move indicator to indicate path was moved instead of added/removed
suffix_general (str): path suffix for general case
suffix_move (str): path suffix if path was moved
suffix_not_moved (str): path suffix if path is not moved (e.g., added/removed)
"""
for _node_diff, _move_indicator in zip(nodes_diff, move_indicator):
if not detail:
suffix = suffix_general
else:
suffix = suffix_move if _move_indicator else suffix_not_moved
condition_node_modified = data_both[path_col].str.endswith(
_node_diff
) | data_both[path_col].str.contains(_node_diff + tree_sep)
add_suffix_to_path(data_both, condition_node_modified, _node_diff, suffix)

add_suffix_to_data(
data_both, nodes_removed, moved_from_indicator, "-", "moved from", "removed"
)
add_suffix_to_data(
data_both, nodes_added, moved_to_indicator, "+", "moved to", "added"
)
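
The suffixing step marks a changed node and every path beneath it, while leaving sibling paths that merely share a prefix untouched. The following is a minimal sketch of that matching rule on plain strings. The function `add_suffix` and the sample paths are hypothetical, and the sketch matches literally, whereas the committed code goes through pandas `.str.contains` and `.str.replace(..., regex=True)`.

```python
from typing import List


def add_suffix(paths: List[str], node_path: str, suffix: str, sep: str = "/") -> List[str]:
    """Append " (suffix)" to node_path in each path that is the node or lies under it."""
    marked = []
    for path in paths:
        is_node = path.endswith(node_path)           # the changed node itself
        is_descendant = (node_path + sep) in path    # any path below the changed node
        if is_node or is_descendant:
            path = path.replace(node_path, f"{node_path} ({suffix})")
        marked.append(path)
    return marked


paths = ["/a/b", "/a/b/c", "/a/bc"]
result = add_suffix(paths, "/a/b", "-")
print(result)  # ['/a/b (-)', '/a/b (-)/c', '/a/bc']
```

Here `/a/bc` contains `/a/b` as a substring but is neither the node nor a descendant, so it is left alone. This appears to be the case the changelog entry about path changes that are substrings of other paths refers to.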

# Check tree attribute difference
path_changes_list_of_dict: List[Dict[str, Dict[str, Any]]] = []
52 changes: 48 additions & 4 deletions docs/gettingstarted/demo/tree.md
@@ -965,7 +965,11 @@ To compare tree attributes:
- `(~)`: Node has different attributes, only available when comparing attributes

For more details, `(moved from)`, `(moved to)`, `(added)`, and `(removed)` can
be indicated instead if `(+)` and `(-)`.
be indicated instead of `(+)` and `(-)` by passing `detail=True`.

For aggregating the differences at the parent-level instead of having `(+)` and
`(-)` at every child node, pass in `aggregate=True`. This is useful if
subtrees are shifted, and if you want to view the shifting at the parent-level.

=== "Only differences"
```python hl_lines="20"
@@ -1029,13 +1033,14 @@ be indicated instead if `(+)` and `(-)`.
# └── g (+)
```
=== "With details"
```python hl_lines="21"
```python hl_lines="23"
from bigtree import str_to_tree, get_tree_diff

root = str_to_tree("""
a
├── b
│ ├── d
│ │ └── g
│ └── e
└── c
└── f
@@ -1044,9 +1049,10 @@ be indicated instead if `(+)` and `(-)`.
root_other = str_to_tree("""
a
├── b
│ └── g
│ └── h
└── c
├── d
│ └── g
└── f
""")

@@ -1055,10 +1061,48 @@ be indicated instead if `(+)` and `(-)`.
# a
# ├── b
# │ ├── d (moved from)
# │ │ └── g (moved from)
# │ ├── e (removed)
# │ └── h (added)
# └── c
# └── d (moved to)
# └── g (moved to)
```
=== "With aggregated differences"
```python hl_lines="23"
from bigtree import str_to_tree, get_tree_diff

root = str_to_tree("""
a
├── b
│ ├── d
│ │ └── g
│ └── e
└── c
└── f
""")

root_other = str_to_tree("""
a
├── b
│ └── h
└── c
├── d
│ └── g
└── f
""")

tree_diff = get_tree_diff(root, root_other, detail=True, aggregate=True)
tree_diff.show()
# a
# ├── b
# │ ├── d (moved from)
# │ │ └── g
# │ ├── e (removed)
# │ └── g (added)
# │ └── h (added)
# └── c
# └── d (moved to)
# └── g
```
=== "Attribute difference"
```python hl_lines="25"