Version 3.27.0b2
mborsetti committed Nov 24, 2024
1 parent f01b39e commit 7b56852
Showing 11 changed files with 120 additions and 44 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/ci-cd.yaml
@@ -66,13 +66,13 @@ jobs:
 # Python versions at https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json
 # RCs need to be specified fully, e.g. '3.13.0-rc.3'
 python-version: ['3.13', '3.12', '3.11', '3.10']
 # python-version: ['3.12']
 os: [ubuntu-latest]
 disable-gil: [false]
-# 29-oct-24 lxml does not build for free-threaded even with libxml2 and libxslt1 development packages
-include:
-# Free-threaded from https://github.com/actions/setup-python/issues/771#issuecomment-2439954031
-- { os: ubuntu-latest, python-version: '3.13', disable-gil: true }
+# 23-nov-24 cryptography doesn't build for free-threaded
+# include:
+# # Free-threaded from https://github.com/actions/setup-python/issues/771#issuecomment-2439954031
+# - { os: ubuntu-latest, python-version: '3.13', disable-gil: true }

 # Set up Redis per https://docs.github.com/en/actions/guides/creating-redis-service-containers
 # If you are using GitHub-hosted runners, you must use an Ubuntu runner
10 changes: 9 additions & 1 deletion CHANGELOG.rst
@@ -33,7 +33,7 @@ can check out the `wish list <https://github.com/mborsetti/webchanges/blob/main/
 Internals, for changes that don't affect users. [triggers a minor patch]
-Version 3.27.0b1
+Version 3.27.0b2
 ==================
 Unreleased

@@ -69,6 +69,14 @@ Added
 such as ``cryptography``, ``msgpack``, ``lxml``, and the optional ``jq``.
 * New Sub-directive in ``pypdf`` Filter: Added ``extraction_mode`` sub-directive.
 * Now storing error information in snapshot database.
+* Added ``-l``/``--log-file`` command line argument to write log to a file. Suggested by `yubiuser
+<https://github.com/yubiuser>`__ in `issue #88 <https://github.com/mborsetti/webchanges/issues/88>`__.
+
+Fixed
+-----
+* The command line argument ``--error`` was yielding different results than when actually running *webchanges*.
+Reported by `yubiuser <https://github.com/yubiuser>`__ in `issue #88
+<https://github.com/mborsetti/webchanges/issues/88>`__.

 Internals
 ---------
2 changes: 1 addition & 1 deletion README.rst
@@ -125,7 +125,7 @@ This project is based on code from `urlwatch 2.21
 <https://github.com/thp/urlwatch/tree/346b25914b0418342ffe2fb0529bed702fddc01f>`__ dated 30 July 2020.

 You can easily upgrade to **webchanges** from the current version of **urlwatch** using the same job and
-configuration files (see `here <https://webchanges.readthedocs.io/en/stable/migration.html>`__) and benefit from many
+configuration files (see `here <https://webchanges.readthedocs.io/en/stable/upgrading.html>`__) and benefit from many
 improvements, including:

 * :underline:`AI-Powered Summaries`: Summary of changes in plain text using generative AI, useful for long documents
16 changes: 10 additions & 6 deletions docs/configuration.rst
@@ -46,17 +46,21 @@ the configuration:
 display:
 new: true
 error: true
+repeated_error: false
 unchanged: false
-empty-diff: true
+empty-diff: true # deprecated
+
+If you set ``repeated_error`` to ``true``, :program:`webchanges` will send repeated notifications for the same error;
+otherwise notifications for failed jobs are sent once when an error is first encountered, and additional notifications
+will not be sent unless the error resolves or a different error occurs.

 If you set ``unchanged`` to ``true``, :program:`webchanges` will always report all pages that are checked but have not
 changed.

-While the ``empty-diff`` setting is included for backwards-compatibility, :program:`webchanges` uses the easier job
-directive :ref:`additions_only` to obtain similar results, which you should use. This deprecated setting controls
-what happens if a page is ``changed``, but due to e.g. a ``diff_filter`` the diff is reduced to the empty string. If set
-to ``true``, :program:`webchanges`: will report an (empty) change. If set to ``false``, the change will not be included
-in the report.
+``empty-diff`` is deprecated, and controls what happens if a page is ``changed`` but the notification is reduced to
+an empty string e.g. by a ``diff_filter``. If set to ``true``, :program:`webchanges`: will report an (empty) change.
+If set to ``false``, the change will not be included in the report. Use the job directive :ref:`additions_only`
+instead for similar results.


 .. _reports-and-reporters:
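Note on ``repeated_error``: the rule described in the documentation change above (report a failure once, then stay quiet until the error resolves or a different error occurs) can be sketched as a small predicate. This is a hypothetical illustration and not code from this commit: the function name should_notify_error and the comparison of plain error strings are assumptions made for the example.

```python
from __future__ import annotations


def should_notify_error(
    new_error: str,
    previous_error: str | None,
    repeated_error: bool = False,
) -> bool:
    """Decide whether a failed job should be reported (hypothetical helper).

    previous_error is the error recorded on the last run, or None if the last
    run succeeded (that is, any earlier error has resolved).
    """
    if repeated_error:
        # repeated_error: true -> report every failure, even an identical one
        return True
    # repeated_error: false -> report only a first or a different error
    return new_error != previous_error


print(should_notify_error('Timeout', None))  # first occurrence
print(should_notify_error('Timeout', 'Timeout'))  # same error again
print(should_notify_error('HTTP 500', 'Timeout'))  # a different error
print(should_notify_error('Timeout', 'Timeout', repeated_error=True))
```

Because a successful run resets previous_error to None in this sketch, the same error is reported again if it comes back after having resolved, which matches the wording of the documentation.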
1 change: 1 addition & 0 deletions tests/data/config.yaml
@@ -1,6 +1,7 @@
 display:
 new: true
 error: true
+repeated_error: false
 unchanged: false
 empty-diff: false
 report:
2 changes: 1 addition & 1 deletion webchanges/__init__.py
@@ -22,7 +22,7 @@
 # * MINOR version when you add functionality in a backwards compatible manner, and
 # * MICRO or PATCH version when you make backwards compatible bug fixes. We no longer use '0'
 # If unsure on increments, use pkg_resources.parse_version to parse
-__version__ = '3.27.0b1'
+__version__ = '3.27.0b2'
 __description__ = (
 'Check web (or command output) for changes since last run and notify.\n'
 '\n'
15 changes: 12 additions & 3 deletions webchanges/cli.py
@@ -78,7 +78,7 @@ def migrate_from_legacy(
 logger.warning(f"You can safely delete '{old_file}'.")


-def setup_logger(verbose: int | None = None) -> None:
+def setup_logger(verbose: int | None = None, log_file: Path | None = None) -> None:
 """Set up the logger.
 :param verbose: the verbosity level (1 = INFO, 2 = ERROR).
@@ -99,7 +99,16 @@ def setup_logger(verbose: int | None = None) -> None:
 if not verbose:
 sys.tracebacklimit = 0

-logging.basicConfig(format='%(asctime)s %(module)s[%(thread)s] %(levelname)s: %(message)s', level=log_level)
+if log_file:
+handlers: tuple[logging.Handler, ...] | None = (logging.FileHandler(log_file),)
+else:
+handlers = None
+
+logging.basicConfig(
+format='%(asctime)s %(module)s[%(thread)s] %(levelname)s: %(message)s',
+level=log_level,
+handlers=handlers,
+)
 logger.info(f'{__project_name__}: {__version__} {__copyright__}')
 logger.info(
 f'{platform.python_implementation()}: {platform.python_version()} '
@@ -358,7 +367,7 @@ def main() -> None: # pragma: no cover
 )

 # Set up the logger to verbose if needed
-setup_logger(command_config.verbose)
+setup_logger(command_config.verbose, command_config.log_file)

 # For speed, run these here
 handle_unitialized_actions(command_config)
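Note on the logging change in webchanges/cli.py: the pattern is to hand logging.basicConfig a FileHandler when a log file is requested, and handlers=None otherwise so that the default stderr handler is used. A minimal standalone sketch follows. The function name configure_logging is hypothetical, and force=True is an assumption added so the sketch can be re-run in a process that already has logging configured; the commit itself does not pass it.

```python
from __future__ import annotations

import logging
from pathlib import Path


def configure_logging(log_level: int = logging.INFO, log_file: Path | None = None) -> None:
    """Route log records to log_file if given, else to stderr (hypothetical helper)."""
    # basicConfig treats handlers=None as "create the default StreamHandler"
    handlers = [logging.FileHandler(log_file)] if log_file else None
    logging.basicConfig(
        format='%(asctime)s %(module)s[%(thread)s] %(levelname)s: %(message)s',
        level=log_level,
        handlers=handlers,
        force=True,  # assumption for this sketch only: replace existing root handlers
    )
```

Calling configure_logging(logging.DEBUG, Path('run.log')) and then logging through any logger appends formatted records to run.log instead of printing them to the terminal.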
55 changes: 50 additions & 5 deletions webchanges/command.py
@@ -4,9 +4,9 @@

 from __future__ import annotations

-import contextlib
 import difflib
 import email.utils
+import gc
 import importlib.metadata
 import logging
 import os
@@ -19,6 +19,7 @@
 import time
 import traceback
 from concurrent.futures import ThreadPoolExecutor
+from contextlib import ExitStack
 from datetime import datetime
 from pathlib import Path
 from typing import Iterable, Iterator, TYPE_CHECKING
@@ -29,7 +30,7 @@
 from webchanges.differs import DifferBase
 from webchanges.filters import FilterBase
 from webchanges.handler import JobState, Report
-from webchanges.jobs import BrowserJob, JobBase, NotModifiedError, UrlJob
+from webchanges.jobs import JobBase, NotModifiedError, UrlJob
 from webchanges.mailer import smtp_have_password, smtp_set_password, SMTPMailer
 from webchanges.reporters import ReporterBase, xmpp_have_password, xmpp_set_password
 from webchanges.util import dur_text, edit_file, import_module_from_source
@@ -620,9 +621,20 @@ def error_jobs_lines(jobs: Iterable[JobBase]) -> Iterator[str]:
 stored data if the website reports no changes in the data since the last time it downloaded it -- see
 https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests).
 """
-with contextlib.ExitStack() as stack:
-max_workers = min(32, os.cpu_count() or 1) if any(isinstance(job, BrowserJob) for job in jobs) else None
-logger.debug(f'Max_workers set to {max_workers}')
+
+def job_runner(
+stack: ExitStack,
+jobs: Iterable[JobBase],
+max_workers: int | None = None,
+) -> Iterator[str]:
+"""
+Modified worker.job_runner that yields error text for jobs who fail with an exception or yield no data.
+:param stack: The context manager.
+:param jobs: The jobs to run.
+:param max_workers: The number of maximum workers for ThreadPoolExecutor.
+:return: error text for jobs who fail with an exception or yield no data.
+"""
 executor = ThreadPoolExecutor(max_workers=max_workers)

 for job_state in executor.map(
@@ -655,6 +667,39 @@ def error_jobs_lines(jobs: Iterable[JobBase]) -> Iterator[str]:
 else:
 yield f'{job_state.job.index_number:3}: Error "{job_state.exception}": {pretty_name})'

+with ExitStack() as stack:
+# This code is from worker.run_jobs, modified to yield from job_runner.
+from webchanges.worker import get_virt_mem # avoid circular imports
+
+# run non-BrowserJob jobs first
+jobs_to_run = [job for job in jobs if not job.__is_browser__]
+if jobs_to_run:
+logger.debug(
+"Running jobs that do not require Chrome (without 'use_browser: true') in parallel with "
+"Python's default max_workers."
+)
+yield from job_runner(stack, jobs_to_run, self.urlwatch_config.max_workers)
+else:
+logger.debug("Found no jobs that do not require Chrome (i.e. without 'use_browser: true').")
+
+# run BrowserJob jobs after
+jobs_to_run = [job for job in jobs if job.__is_browser__]
+if jobs_to_run:
+gc.collect()
+virt_mem = get_virt_mem()
+if self.urlwatch_config.max_workers:
+max_workers = self.urlwatch_config.max_workers
+else:
+max_workers = max(int(virt_mem / 200e6), 1)
+max_workers = min(max_workers, os.cpu_count() or 1)
+logger.debug(
+f"Running jobs that require Chrome (i.e. with 'use_browser: true') in parallel with "
+f'{max_workers} max_workers.'
+)
+yield from job_runner(stack, jobs_to_run, max_workers)
+else:
+logger.debug("Found no jobs that require Chrome (i.e. with 'use_browser: true').")
+
 start = time.perf_counter()
 if len(self.urlwatch_config.jobs_files) == 1:
 jobs_files = [f'in jobs file {self.urlwatch_config.jobs_files[0]}:']
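Note on the worker count for browser jobs in webchanges/command.py: the added code budgets roughly 200 MB of available memory per worker, uses at least one worker, and caps the result at the CPU count, unless max_workers was set explicitly. The arithmetic can be sketched in isolation as below. The function name browser_max_workers and its cpu_count parameter are hypothetical, and because indentation is not visible in the diff as captured here, the sketch assumes the CPU cap applies only when no explicit value is configured.

```python
from __future__ import annotations

import os


def browser_max_workers(
    virt_mem: int,
    configured: int | None = None,
    cpu_count: int | None = None,
) -> int:
    """Pick the number of parallel browser jobs (hypothetical helper).

    virt_mem is the available memory in bytes; configured is an explicit
    max_workers setting, if any; cpu_count overrides os.cpu_count() for testing.
    """
    if configured:
        # An explicit setting wins (assumption: it is not capped further)
        return configured
    cpus = cpu_count or os.cpu_count() or 1
    # About 200 MB per worker, but never fewer than one worker
    by_memory = max(int(virt_mem / 200e6), 1)
    return min(by_memory, cpus)


print(browser_max_workers(1_000_000_000, cpu_count=8))  # memory is the limit
print(browser_max_workers(16_000_000_000, cpu_count=4))  # CPU count is the limit
print(browser_max_workers(50_000_000, cpu_count=8))  # floor of one worker
```

With 1 GB available and 8 CPUs, memory allows 5 workers; with 16 GB and 4 CPUs, the CPU count limits the pool to 4.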
8 changes: 8 additions & 0 deletions webchanges/config.py
@@ -53,6 +53,7 @@ class CommandConfig(BaseConfig):
 joblist: list[str]
 jobs_files: list[Path]
 list_jobs: bool | str | None
+log_file: Path
 max_snapshots: int | None
 max_workers: int | None
 no_headless: bool
@@ -130,6 +131,13 @@ def parse_args(self, cmdline_args: list[str]) -> argparse.ArgumentParser:
 parser.add_argument(
 '-v', '--verbose', action='count', help='show logging output; use -vv for maximum verbosity'
 )
+parser.add_argument(
+'-l',
+'--log-file',
+type=Path,
+help='send log to FILE',
+metavar='FILE',
+)

 group = parser.add_argument_group('override file defaults')
 group.add_argument(
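Note on the new command line argument in webchanges/config.py: argparse converts the value of -l/--log-file with type=Path and stores it under the attribute log_file. A reduced sketch with only the two arguments shown in this hunk follows; the function name build_parser and the prog value are hypothetical, and the real parser defines many more options.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a cut-down parser with only the two arguments shown above."""
    parser = argparse.ArgumentParser(prog='webchanges-sketch')  # hypothetical prog name
    parser.add_argument(
        '-v', '--verbose', action='count', help='show logging output; use -vv for maximum verbosity'
    )
    parser.add_argument(
        '-l',
        '--log-file',
        type=Path,  # argparse calls Path(value) on the raw string
        help='send log to FILE',
        metavar='FILE',
    )
    return parser


args = build_parser().parse_args(['-vv', '--log-file', 'run.log'])
print(args.verbose, args.log_file)
```

When the flag is omitted, argparse leaves log_file as None, which is why setup_logger in cli.py accepts Path | None.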
4 changes: 2 additions & 2 deletions webchanges/handler.py
@@ -213,7 +213,7 @@ def process(self, headless: bool = True) -> JobState:
 logger.info(f'{self.job.get_indexed_location()} started processing ({type(self.job).__name__})')
 logger.debug(f'Job {self.job.index_number}: {self.job}')

-if self.exception:
+if self.exception and not isinstance(self.exception, NotModifiedError):
 self.new_timestamp = time.time()
 self.new_error_data = {
 'type': type(self.exception).__name__,
@@ -247,7 +247,7 @@ def process(self, headless: bool = True) -> JobState:

 except Exception as e:
 # Job has a chance to format and ignore its error
-if self.job.https_proxy and e.args[0].startswith('[SSL:'):
+if self.job.https_proxy and str(e.args[0]).startswith('[SSL:'):
 args_list = list(e.args)
 args_list[0] += f' (Check proxy {self.job.https_proxy})'
 e.args = tuple(args_list)
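Note on the str() wrapper in webchanges/handler.py: the first element of an exception's args is not always a string (for example, OSError(111, 'Connection refused') carries an int first), so calling .startswith on it directly can raise AttributeError. The sketch below isolates the check. The function name annotate_proxy_hint is hypothetical, and the guard against empty args and the f-string concatenation are additions for the example that the commit does not include.

```python
from __future__ import annotations


def annotate_proxy_hint(e: Exception, https_proxy: str | None) -> Exception:
    """Append a proxy hint to SSL errors (hypothetical helper)."""
    # str() makes the prefix test safe when args[0] is an int or another object
    if https_proxy and e.args and str(e.args[0]).startswith('[SSL:'):
        args_list = list(e.args)
        args_list[0] = f'{args_list[0]} (Check proxy {https_proxy})'
        e.args = tuple(args_list)
    return e


ssl_err = annotate_proxy_hint(
    Exception('[SSL: CERTIFICATE_VERIFY_FAILED] bad cert'), 'http://proxy.example:8080'
)
os_err = annotate_proxy_hint(OSError(111, 'Connection refused'), 'http://proxy.example:8080')
print(ssl_err.args[0])
print(os_err.args)
```

The SSL error gains the proxy hint, while the OSError passes through unchanged instead of failing on its integer first argument.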
43 changes: 22 additions & 21 deletions webchanges/worker.py
@@ -175,31 +175,11 @@ def job_runner(
 job_state.save()
 urlwatcher.report.new(job_state)

-def get_virt_mem() -> int:
-"""Return the amount of virtual memory available, i.e. the memory that can be given instantly to processes
-without the system going into swap. Expressed in bytes."""
-if isinstance(psutil, str):
-raise ImportError(
-"Error when loading package 'psutil'; cannot use 'use_browser: true'. Please install "
-f"dependencies with 'pip install webchanges[use_browser]'.\n{psutil}"
-) from None
-try:
-virt_mem = psutil.virtual_memory().available
-logger.debug(
-f'Found {virt_mem / 1e6:,.0f} MB of available physical memory (plus '
-f'{psutil.swap_memory().free / 1e6:,.0f} MB of swap).'
-)
-except psutil.Error as e: # pragma: no cover
-virt_mem = 0
-logger.debug(f'Could not read memory information: {e}')
-
-return virt_mem
-
 jobs = set(UrlwatchCommand(urlwatcher).jobs_from_joblist())

 jobs = insert_delay(jobs)

-with ExitStack() as stack:
+with ExitStack() as stack: # This code is also present in command.list_error_jobs (change there too!)
 # run non-BrowserJob jobs first
 jobs_to_run = [job for job in jobs if not job.__is_browser__]
 if jobs_to_run:
@@ -228,3 +208,24 @@ def get_virt_mem() -> int:
 job_runner(stack, jobs_to_run, max_workers)
 else:
 logger.debug("Found no jobs that require Chrome (i.e. with 'use_browser: true').")
+
+
+def get_virt_mem() -> int:
+"""Return the amount of virtual memory available, i.e. the memory that can be given instantly to processes
+without the system going into swap. Expressed in bytes."""
+if isinstance(psutil, str):
+raise ImportError(
+"Error when loading package 'psutil'; cannot use 'use_browser: true'. Please install "
+f"dependencies with 'pip install webchanges[use_browser]'.\n{psutil}"
+) from None
+try:
+virt_mem = psutil.virtual_memory().available
+logger.debug(
+f'Found {virt_mem / 1e6:,.0f} MB of available physical memory (plus '
+f'{psutil.swap_memory().free / 1e6:,.0f} MB of swap).'
+)
+except psutil.Error as e: # pragma: no cover
+virt_mem = 0
+logger.debug(f'Could not read memory information: {e}')
+
+return virt_mem
