Use common outgoing connection Session creation code for scraping #817

Merged 3 commits into mediacloud:main on Oct 11, 2024

Conversation

philbudne
Contributor

mcmetadata now contains the code from story-indexer for making HTTP connections to source sites. This PR uses that code, along with the updated sitemap tools, in site (re)scraping, and applies connect/read timeouts while scraping to (hopefully) avoid the hangs reported in issue #791.
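
For readers unfamiliar with connect/read timeouts, here is a minimal sketch of the idea using plain `requests`. It is not the code in this PR: `make_session`, `fetch`, and the two timeout constants are hypothetical names and values chosen for illustration, and the shared session-creation code in mcmetadata is not reproduced here.

```python
import requests

# Hypothetical values for illustration only; the PR adds SCRAPE_HTTP_SECONDS
# but its actual value is not shown in this conversation.
CONNECT_SECONDS = 10.0
READ_SECONDS = 30.0


def make_session(user_agent: str) -> requests.Session:
    """Create one Session so every scrape request shares the same headers."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch(session, url, timeout=(CONNECT_SECONDS, READ_SECONDS)):
    """Return the response body, or None if the request fails.

    The (connect, read) tuple bounds how long we wait to establish the
    connection and how long we wait between bytes from the server, so a
    stalled site cannot hang the scrape indefinitely.
    """
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        # Covers timeouts, connection errors, HTTP error statuses, bad URLs.
        return None
```

Note that the read timeout in `requests` applies to each wait for data rather than to the whole download, so a server that trickles bytes slowly can still take longer than `READ_SECONDS` overall.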

@philbudne philbudne requested a review from Evan-Leon October 10, 2024 17:43
@Evan-Leon Evan-Leon merged commit 0fdf2f8 into mediacloud:main Oct 11, 2024
Evan-Leon added a commit that referenced this pull request Oct 18, 2024
* Add scrape-source and scrape-collection commands to manage.py

Both commands take an id number and an email to send results to.  Both
commands run "in process" for test/debug unless "--queue" is given, in
which case the request is queued as a task.  Make sure CTRL/C causes
immediate termination, and gives a full backtrace.

* Feature add static collections (#810)

* Add static column to collections, add front end work to reflect

* Add chip to header and alert to modify collection

* Add tooltip to static chip

* Use mc-providers v2.2.0 caching argument to allow queries to bypass cache for testing

infrastructure already in place, uncomments a line of code!

* Implement multiple task queues (#813)

* Implement multiple queues for background tasks

* fixes

* fixes

* fix get_pending_tasks docstring paste-o

* cleanup

mcweb/backend/sources/tasks.py: restore docstring
mcweb/backend/util/tasks.py: define SYSTEM_SLOW as 'system-slow'

---------

Authored-by: Phil Budne <[email protected]>

* Use common outgoing connection Session creation code for scraping (#817)

* Update scraping code to use SSL/headers used in other Media Cloud projects.

* mcweb/backend/sources/models.py: add SCRAPE_HTTP_SECONDS, never pass newline to add_line!

* backend/search/views.py: removed commented out requests.Session creation

---------

Authored-by: Phil Budne <[email protected]>

* Change static to managed for collections

* update runtime.txt to python 3.10.15 due to security issues

* Update utils.py - remove prefix wildcards

Prefix wildcards have a huge performance cost. Removing here as a precursor.

* Update utils.py - scheme-safe url-search-string

* Update version and release notes for new release

* Feature add contributor roles (#824)

* Start roles management command

* Make management command to make groups and assign users

* Add front end role contributor

* Add role permissions to directory

* Fix permissions on upload sources, test contributor role

* Fix save button on a managed collection

* Edit create-groups command name, write docs for management command

* Remove console log

---------

Co-authored-by: Phil Budne <[email protected]>
Co-authored-by: Phil Budne <[email protected]>
Co-authored-by: Paige Gulley <[email protected]>