DMED-119 - Spike: direct integration of the edu-sharing search environment into the "Lern-Store" of the SVS #47
- fix for eafCode edge-cases where OEH 'discipline' vocab keys don't line up with "eafsys.txt" eafCodes -- for context, see: openeduhub/oeh-metadata-vocabs#36 - add: additional Lisum shorthand mapping for 'discipline' value "320" (Informatik) to "C-Inf" Signed-off-by: criamos <[email protected]>
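A minimal sketch of how such a shorthand mapping can be expressed; the dictionary and function names are illustrative, not the crawler's actual code (the "900" entry reflects the '900' -> 'B-BCM' commit further below):

```python
# Hypothetical sketch: explicit OEH 'discipline' -> Lisum shorthand table that is
# consulted before the generic eafCode lookup, so vocab keys that don't line up
# with "eafsys.txt" still map to a usable shorthand.
DISCIPLINE_TO_LISUM_SHORTHAND = {
    "320": "C-Inf",   # Informatik (vocab key "320" vs. eafCode "32002")
    "900": "B-BCM",   # Basiscurriculum Medienbildung (added in a later commit)
}


def map_discipline_to_lisum(discipline_key: str) -> str | None:
    """Return the Lisum shorthand for an OEH 'discipline' vocab key, if one is known."""
    return DISCIPLINE_TO_LISUM_SHORTHAND.get(discipline_key)
```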
- bumped package versions according to Dependabot recommendations from 2023-02-14: -- 'wheel', 'lxml', 'Pillow', 'certifi' - additionally bumped the 'playwright' and 'requests' packages to newer versions Signed-off-by: criamos <[email protected]>
- the 'build-and-publish'-pipeline failed due to a dependency conflict between pyppeteer and playwright - since all crawlers which previously used pyppeteer switched to Playwright a while ago anyway, I removed the obsolete package from our requirements.txt and web_tools.py Signed-off-by: criamos <[email protected]>
2023: Lisum changes & RSS Crawlers
- replaced URLs by uuids because the pipelines expect uuids, 'prefLabel' or 'altLabel' string values Signed-off-by: criamos <[email protected]>
- fix: 'description' (+ fallbacks) - refactor: 'base.sourceId' and 'base.hash' (+ fallbacks) - code cleanup Signed-off-by: criamos <[email protected]>
- fix: 'ResponseItemLoader' call to LomBase -- since serlo_spider already (completely) relies on Playwright for text and screenshot extraction, I removed the (one) remaining use of Splash (which was called by super().mapResponse) from serlo_spider -- Serlo webpages reliably crashed the 'Splash'-container during the start of a crawl Signed-off-by: criamos <[email protected]>
- if the URL found within OERSI's '_source.id'-field differs from the URL resolved by Scrapy, both strings will be saved to 'technical.location' -- this might be necessary for future duplicate detection - this change was made due to KLISUM-212 Signed-off-by: criamos <[email protected]>
- fix 'intendedEndUserRole' mapping for "mentor" (-> "counsellor") -- the 'intendedEndUserRole'-Vocab had a misplaced description string, which was placed below "author" and was actually meant to be the description string of "counsellor" Signed-off-by: criamos <[email protected]>
- discipline '900' -> Lisum shorthand 'B-BCM' (Basiscurriculum Medienbildung) - Lisum "learningResourceType" valuespace should natively support the LRT value 'text' now, therefore the previous mapping is no longer needed
- change: 'license.internal' no longer defaults to 'copyright'
  - this change was requested as a temporary workaround until mixed Tutory licenses can be determined systematically
- fix: API pagination
  - the previous method of paginating through Tutory's API no longer works because the old 'pageSize' URL parameter now returns an HTTP Error 502 (Bad Gateway)
  - tutory_spider will attempt to crawl through the API pages in iterations of 5000. If the API returns similar HTTP errors in the future, try lowering the "api_pagesize_limit" variable.
- style: code formatting (via black)
Signed-off-by: criamos <[email protected]>
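A hedged sketch of the pagination behaviour described above; the endpoint, parameter names and response structure are assumptions, only the 5000-item chunk size and the "api_pagesize_limit" knob come from the commit message:

```python
import scrapy


class TutoryPaginationSketch(scrapy.Spider):
    """Illustrative pagination loop: request API pages in fixed-size chunks and
    stop once the API returns an empty page. Lower 'api_pagesize_limit' if the
    API starts answering with HTTP 502 again."""

    name = "tutory_pagination_sketch"
    api_pagesize_limit = 5000
    api_base_url = "https://www.tutory.de/api/"  # placeholder, not the real endpoint

    def start_requests(self):
        yield scrapy.Request(self.build_page_url(0), callback=self.parse_api_page, cb_kwargs={"page": 0})

    def build_page_url(self, page: int) -> str:
        # parameter names are assumptions; the old 'pageSize' parameter returned HTTP 502
        return f"{self.api_base_url}?limit={self.api_pagesize_limit}&offset={page * self.api_pagesize_limit}"

    def parse_api_page(self, response: scrapy.http.TextResponse, page: int):
        items = response.json().get("data", [])
        yield from items
        if items:  # keep paginating until an empty page is returned
            next_page = page + 1
            yield scrapy.Request(self.build_page_url(next_page), callback=self.parse_api_page,
                                 cb_kwargs={"page": next_page})
```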
- status: all tests passed - ToDos: cleanup / docs / more test-cases / refactoring - style: code formatting (via black)
- chore: flake8 v5.0.3 -> 6.0.0 - chore: pytest 7.1.1 -> 7.2.1
- change: replace crawler-specific RegEx parsing of license strings with the newly implemented LicenseMapper
  - this should reduce maintenance in the long run and will enable us to properly test edge-cases when they occur
- change: try to gather 'description' from json_ld first
  - if it's not available, stick to the previously used "teaser" field
- feat: 'lifecycle' authors / metadata_providers
- feat: license authors
- feat: digitallearninglab_spider overwrites the parse() method
- fix: unnecessary API calls (dupefilter warning during initial API pagination)
- fix: "new_lrt" mapping -> "Unterrichtsbaustein" (according to item_type)
- docs: ToDos for future crawler updates
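A small sketch of the 'description' fallback order; the field names are taken from the commit message, everything else is assumed:

```python
def get_description(json_ld: dict, api_item: dict) -> str | None:
    """Prefer the JSON-LD 'description'; fall back to the previously used 'teaser' field."""
    description = json_ld.get("description")
    if description and description.strip():
        return description.strip()
    return api_item.get("teaser")
```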
- add: "CC_BY_NC_ND" and "CC_BY_NC_SA" Signed-off-by: criamos <[email protected]>
- change: switch to Playwright for HTML extraction
  - after Serlo, Tutory is the next website that seems to cause the "Splash" container to crash after a while
- feat: gather 'license.author' metadata in accordance with the "publishName" flag of the Tutory API
- fix: gather "description" metadata with additional fallbacks
  - the previously used XPath could no longer be found within Tutory's DOM, and thousands of items were dropped while crawling due to missing 'description' fields
  - the crawler tries to parse the "description" metadata from the Tutory API first, then falls back to the DOM header meta fields
    - if neither of the preferred "description" fields is available, the crawler will try to grab text within the DOM itself
- style: code formatting via black
Signed-off-by: criamos <[email protected]>
- this change was necessary for oeh_spider since the "discipline"-Vocab-key for informatik ("320") does not line up with the eafCode for Informatik ("32002") Signed-off-by: criamos <[email protected]>
- feat: 4th fallback for Serlo 'title'
  - lots of Serlo items only provide a generic "... - lernen mit Serlo!" title if the user didn't specify a title for their specific exercise
    - this occurs more than 2700 times for "Mathe Aufgabe - lernen mit Serlo!" alone
  - therefore we now use the "lernen mit Serlo!" string as an indicator that the title is most probably a generic one (set by the Serlo CMS) and try to extract a fallback title from the last breadcrumb label instead
Signed-off-by: criamos <[email protected]>
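A sketch of that title fallback under the assumptions stated here; the helper name and breadcrumb handling are illustrative:

```python
GENERIC_TITLE_MARKER = "lernen mit Serlo!"


def pick_serlo_title(raw_title: str | None, breadcrumb_labels: list[str]) -> str | None:
    """Treat '... - lernen mit Serlo!' as a generic CMS title and prefer the last breadcrumb label."""
    if raw_title and GENERIC_TITLE_MARKER not in raw_title:
        return raw_title
    if breadcrumb_labels:
        return breadcrumb_labels[-1]
    return raw_title
```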
- line length 120
- change: remove hard-coded "discipline" values for OMA entries and use the new valuespaces field for the "hochschulfaechersystematik" vocab instead
- add: 13 new metadata providers
- change: use LicenseMapper for URL parsing (since each metadata provider might use slightly different values)
- add: 'affiliation' metadata for persons (the OERSI fields 'contributor' and 'author' optionally provide these additional fields)
- add: 'datePublished' for lifecycle publishers
- fix: the 'getHash' method first checks whether the 'datePublished' or 'dateCreated' fields exist at all before trying to access their values
- fix: edge-case for missing 'technical.location' values
- remove: "hcrt" mapping value "reference_work"
  - fixes a mixup between the hcrt key "index" and its prefLabel "reference work"
  - since "index" is available in both hcrt and the "old" learningResourceType, this key doesn't need to appear in our mapping table to "new_lrt"
- style: code formatting via black
- uses the corresponding edu-sharing field "ccm:oeh_taxonid_university" - since the "hochschulfaechersystematik" is currently generated as a 'scheme.json' instead of the usual 'index.json' file by SkoHub, a workaround was necessary in the meantime Signed-off-by: criamos <[email protected]>
- previously, Playwright waited until "networkidle", which resulted in some pages taking excessively long to grab a screenshot / text
  - this change might have side-effects on some websites that fire the DOMContentLoaded event slightly too early, but the majority of websites should behave nicely
  - waiting until 'networkidle' was a desperate workaround anyway, since some websites almost continuously cause traffic (e.g. due to site metrics or huge videos starting to buffer)
Signed-off-by: criamos <[email protected]>
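A minimal Playwright sketch of the changed wait condition; the helper function is illustrative, only the switch away from "networkidle" to an earlier load event is from the commit:

```python
import asyncio

from playwright.async_api import async_playwright


async def fetch_page_html(url: str) -> str:
    """Grab the page HTML without waiting for all network traffic to settle."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        # previously: wait_until="networkidle" (stalled on pages with continuous traffic)
        await page.goto(url, wait_until="domcontentloaded")
        html = await page.content()
        await browser.close()
        return html


# usage: asyncio.run(fetch_page_html("https://example.org"))
```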
- activate all OERSI metadata providers for crawling - revert to the default thumbnail behaviour -- use the provided thumbnail URL first and only take a screenshot of the website if no thumbnail was provided -- overwriting generic thumbnails with Playwright screenshots could be implemented in a future version, if desired Signed-off-by: criamos <[email protected]>
- fix: edge-cases observed during OERSI crawls for license URLs ("deed.DE", "deed.CA") - add: additional test-cases for 2- and 4-char variations of CC license deeds Signed-off-by: criamos <[email protected]>
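A hedged sketch of the kind of test-cases added for locale-suffixed deed URLs; the helper and its regex are illustrative, not the pipeline's actual implementation:

```python
import re

import pytest


def strip_cc_deed_suffix(url: str) -> str:
    """Remove trailing 'deed.XX' / 'deed.xx_YY' locale suffixes from CC license URLs."""
    return re.sub(r"deed\.[A-Za-z]{2}(_[A-Za-z]{2})?$", "", url)


@pytest.mark.parametrize(
    "deed_url, expected",
    [
        ("https://creativecommons.org/licenses/by/4.0/deed.DE", "https://creativecommons.org/licenses/by/4.0/"),
        ("https://creativecommons.org/licenses/by-sa/4.0/deed.CA", "https://creativecommons.org/licenses/by-sa/4.0/"),
        ("https://creativecommons.org/licenses/by-nc/4.0/deed.de_DE", "https://creativecommons.org/licenses/by-nc/4.0/"),
    ],
)
def test_strip_cc_deed_suffix(deed_url, expected):
    assert strip_cc_deed_suffix(deed_url) == expected
```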
- improvement: additional metadata fields are considered for 'lifecycle' metadata_provider - (temporarily) deactivate "BC Campus" metadata provider -- reason: the website appears to detect webcrawlers? needs further investigation Signed-off-by: criamos <[email protected]>
- improve: 'general.identifier' takes the "_source.id" value (a URL) if available
  - both the unresolved and resolved URLs will be saved to 'technical.location' anyway for future duplicate detection routines
- add: OERSI "audience" to "intendedEndUserRole" mapping
  - the "audience" field only occurs for the "Finnish Library of Open Educational Resources"
- add: hard-coded value for "educationalContext"
- workaround: temporarily deactivate crawling of the "Finnish Library of Open Educational Resources"
  - this specific provider serves malformed URLs which contain a URI fragment ("#" in the middle of the URL string) that cannot be resolved by Scrapy
  - URLs containing URI fragments get cut off at the "#", which makes Scrapy shorten the Request and identify each URL as a DuplicateRequest
Signed-off-by: criamos <[email protected]>
OERSI: feature: `hochschulfaechersystematik` (and further crawler-updates)
- while checking the results on Staging, a few more URL paths were identified that should not be crawled because they aren't learning materials:
  - bpb.de URLs that end with "/kontakt/", "/impressum/" or "/redaktion/"
    - e.g. https://www.bpb.de/themen/migration-integration/kurzdossiers/172761/impressum/ is not a desired (to-be-crawled) item in itself, but learners who stumble upon https://www.bpb.de/themen/migration-integration/kurzdossiers/ will still be able to reach that information (if they need to) by pressing the "Inhalt" button
- since the recent workaround for Drupal's BigPipe "no-JS"-cookie seems to have been successful, we can try to increase the crawling throughput again - change / code cleanup: remove "sitemap_rules"-variable (since it is only used in SitemapSpiders)
- both during startup and closing of the crawler, counters will be displayed for the number of:
  - unique URLs that were parsed from the sitemaps and are expected to be passed into the "parse()" method
  - unique URLs that are expected to be filtered / dropped according to our deny_list, hash check etc.
- this should make it clearer during later crawls how many URLs we expect to crawl and how many of those are filtered out
- implemented an additional URL check that catches URLs ending with known "Impressum"-like substrings -- while the deny_list looks at URL paths that could appear anywhere in the URL, this additional check explicitly only looks for specific substrings at the end of a URL (that would not be picked up by the previous deny_list)
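A sketch of the end-of-URL check; the suffix list is taken from the bpb.de examples in this PR, the names are assumptions:

```python
IMPRESSUM_LIKE_SUFFIXES = ("/kontakt/", "/impressum/", "/redaktion/")


def is_denied_by_url_ending(url: str) -> bool:
    """True if the URL ends with a known 'Impressum'-like path segment."""
    return url.lower().endswith(IMPRESSUM_LIKE_SUFFIXES)
```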
- while debugging bpb_spider, several license URL edge-cases turned up that weren't handled properly by the license pipeline yet
  - while CC 1.0 licenses shouldn't be used anymore and URLs pointing to those deeds are considered legacy URLs (see: https://creativecommons.org/licenses/), the license pipeline should recognize these URLs anyway and save them accordingly
- tests: added two test-cases for license URLs from bpb.de
- fix: handling for CC0 edge-cases where the string would not get picked up by the crawler-specific RegEx -- if the crawler-specific RegEx fails to parse/detect a CC pattern, we'll use the (less precise) fallback method of LicenseMapper for string detection
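A hedged sketch of that fallback order; the crawler-specific pattern and the LicenseMapper method name are assumptions about the code, not verified API:

```python
import re

# illustrative crawler-specific pattern (the real crawlers use their own RegEx)
CRAWLER_CC_PATTERN = re.compile(r"CC[ _-]?(BY|0)[A-Z0-9 ._-]*", re.IGNORECASE)


def detect_license_key(raw_license_string: str, license_mapper) -> str | None:
    """Try the precise crawler-specific RegEx first, then the broader LicenseMapper."""
    match = CRAWLER_CC_PATTERN.search(raw_license_string)
    if match:
        return match.group(0).strip()
    # fallback: the (less precise) shared string detection
    # (method name is an assumption about LicenseMapper's interface)
    return license_mapper.get_license_internal_key(raw_license_string)
```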
- after a short consultation with Torsten, added the missing DocStrings for the ResponseItem properties (especially information related to 'full text' extraction)
  - at the moment, the only field that's actively used / stored within edu-sharing is 'ResponseItem.text' (which should be used for 'full text' extraction)
  - the other fields ('cookies', 'headers', 'har', 'html', 'status', 'url') have never been connected / mapped to individual edu-sharing properties and are therefore not (yet) in use or might be obsolete
- change: update the 'discipline' mapping for the following DiLerTube categories:
  - "Gesundheit und Soziales (GuS)"
  - "Informatik & Medienbildung"
  - "Technik"
- feat: use keywords (see: "tags" from https://www.dilertube.de/component/tags/) to:
  - match "grundschule" items (-> 'educationalContext')
  - match "methoden und erklärvideos" (-> 'new_lrt')
- perf: slightly increase Scrapy's AutoThrottle "target concurrency" setting
- fix: "erklärvideo" mapping now looks for the value within the lowercase keyword string (instead of checking for string equality) - code cleanup / docs
- fix: use a more precise XPath selector for license URLs to retrieve the article license
  - this fixes the edge-cases where there were multiple license URLs within an article (e.g. PDFs or images with their own license)
- feat: title fallback for ambiguous titles / headlines
  - during the "Rohdatenprüfung" (raw-data review) with Anja, we observed articles whose titles wouldn't be helpful to users
    - e.g. "Literatur" or "Weiterführende Links"
  - if we encounter such unhelpful titles, we'll try to use the breadcrumbs navigation bar and build a more precise title from those elements
    - example: https://www.bpb.de/themen/medien-journalismus/krieg-in-den-medien/130755/weiterfuehrende-links/
      - the ambiguous title "Weiterführende Links" would become "Themen > Politik > Medien & Digitales > Medien & Journalismus > Krieg in den Medien > Weiterführende Links" instead
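A sketch of the breadcrumb-based title fallback; the set of ambiguous titles and the separator are assumptions based on the examples in this PR:

```python
AMBIGUOUS_TITLES = {"Literatur", "Weiterführende Links", "Glossar"}


def build_fallback_title(title: str, breadcrumb_labels: list[str]) -> str:
    """Replace an ambiguous headline with a title assembled from the breadcrumbs navigation bar."""
    if title in AMBIGUOUS_TITLES and breadcrumb_labels:
        return " > ".join(breadcrumb_labels)
    return title
```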
…eature) - docs: lay out necessary steps to be able to handle YouTube captions - style: fix 9 weak warnings by code formatting via black and refactoring method names to be more pythonic
- added channels: "Sehen & Verstehen - Experimente und meeehr", "MathemaTrick", "Christian Spannagel"
- fix: change the custom_url of the YT channel "Sehen & Verstehen" to its "YouTube Handle" URL
  - the custom URL format "https://www.youtube.com/c/sehenverstehenexperimenteundmeeehr/" is no longer supported by our YouTube crawler
    - by clicking on "Home" / "Videos" once within a browser, YouTube redirects to the new "YouTube Handle" URL: https://www.youtube.com/@Unkauf_MC
- feat: reworked the "request_row()" method to enable parsing of the "YouTube Handle" URL format
  - see: https://support.google.com/youtube/answer/6180214?hl=en&sjid=8649083492401077263-EU and https://support.google.com/youtube/answer/11585688?hl=en&sjid=1154139518236355177-EU
- change/remove: the previous "parse_custom_url()" method relied on an HTTP response body that is no longer (reliably) available, causing crawls to fail silently
  - observing youtube_spider in the debugger showed that YouTube redirected our HTTP requests for custom URLs to a data protection / cookie consent pre-page, which does not contain the necessary channel_id information (which was REQUIRED for subsequent requests)
  - before adding custom URLs to csv/youtube.csv, always make sure that a "YouTube Handle" URL is used instead! (The crawler will throw a warning if a custom URL is detected that couldn't be handled)
- style: fix whitespace in a logging message
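A hedged sketch of the URL handling described above; the function is illustrative and not the crawler's actual request_row() implementation:

```python
import logging
from urllib.parse import urlparse


def classify_youtube_url(url: str) -> str:
    """Distinguish 'YouTube Handle' URLs from channel-ID and legacy custom URLs."""
    path = urlparse(url).path.strip("/")
    if path.startswith("@"):
        return "handle"  # e.g. https://www.youtube.com/@Unkauf_MC
    if path.startswith("channel/"):
        return "channel_id"
    if path.startswith("c/") or path.startswith("user/"):
        logging.warning("Custom URL detected (%s) - please use a 'YouTube Handle' URL in csv/youtube.csv instead.", url)
        return "custom"
    return "unknown"
```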
- this fixes a UnicodeDecodeError thrown by Scrapy's "robots.txt"-parser when trying to download the robots.txt file from YouTube's image host ("i.ytimg.com") at the start of a crawl process
ITSJOINTLY-1323 - add new channels and support "YouTube Handle" URLs
- fix: the breadcrumbs title fallback omitted the last word of the breadcrumbs list
  - title strings assembled from the breadcrumbs list were missing the last word ("Glossar", "Links" etc.) because the last breadcrumbs item uses a different CSS class than the rest of the strings
- decrease the log level of the "getId()" method from 'warning' to 'debug'
  - lots of items do not provide a stable ID -> throwing a warning for each of them is too spammy in the Kubernetes logs
…ates Crawler Updates (Q1 2024) - KMap, DiLerTube, BpB, Tutory, YouTube
Merge changes between 2024-01 and 2024-04-10 into master
Description
Links to Tickets or other pull requests
https://ticketsystem.dbildungscloud.de/browse/DMED-119
hpi-schul-cloud/dof_app_deploy#713
hpi-schul-cloud/schulcloud-server#4692
hpi-schul-cloud/schulcloud-client#3387
hpi-schul-cloud/nuxt-client#2933
Links to deployments:
https://dmed-119-integration-of-search-environment.dbc.dbildungscloud.dev/
https://dmed-119-integration-of-search-environment.nbc.dbildungscloud.dev/
https://dmed-119-integration-of-search-environment.brb.dbildungscloud.dev/
Changes
Datasecurity
Deployment
New Repos, NPM packages or vendor scripts
Screenshots of UI changes
Approval for review