DMED-119 - Spike: direct integration of the edu-sharing search environment into the "Lern-Store" of the SVS #47
- fix for eafCode edge-cases where OEH 'discipline' vocab keys don't line up with "eafsys.txt" eafCodes -- for context, see: openeduhub/oeh-metadata-vocabs#36 - add: additional Lisum shorthand mapping for 'discipline' value "320" (Informatik) to "C-Inf" Signed-off-by: criamos <[email protected]>
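A minimal sketch of how such a shorthand mapping can be expressed; the dictionary and function names are illustrative, not the crawler's actual code (the "900" entry reflects the '900' -> 'B-BCM' commit further below):

```python
# Hypothetical sketch: explicit OEH 'discipline' -> Lisum shorthand table that is
# consulted before the generic eafCode lookup, so vocab keys that don't line up
# with "eafsys.txt" still map to a usable shorthand.
DISCIPLINE_TO_LISUM_SHORTHAND = {
    "320": "C-Inf",   # Informatik (vocab key "320" vs. eafCode "32002")
    "900": "B-BCM",   # Basiscurriculum Medienbildung (added in a later commit)
}


def map_discipline_to_lisum(discipline_key: str) -> str | None:
    """Return the Lisum shorthand for an OEH 'discipline' vocab key, if one is known."""
    return DISCIPLINE_TO_LISUM_SHORTHAND.get(discipline_key)
```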
- bumped package versions according to Dependabot recommendations from 2023-02-14: -- 'wheel', 'lxml', 'Pillow', 'certifi' - additionally bumped the 'playwright' and 'requests' packages to newer versions Signed-off-by: criamos <[email protected]>
- the 'build-and-publish'-pipeline failed due to a dependency conflict between pyppeteer and playwright - since all crawlers which previously used pyppeteer switched to Playwright a while ago anyway, I removed the obsolete package from our requirements.txt and web_tools.py Signed-off-by: criamos <[email protected]>
2023: Lisum changes & RSS Crawlers
- replaced URLs by uuids because the pipelines expect uuids, 'prefLabel' or 'altLabel' string values Signed-off-by: criamos <[email protected]>
- fix: 'description' (+ fallbacks) - refactor: 'base.sourceId' and 'base.hash' (+ fallbacks) - code cleanup Signed-off-by: criamos <[email protected]>
- fix: 'ResponseItemLoader' call to LomBase -- since serlo_spider already (completely) relies on Playwright for text and screenshot extraction, I removed the (one) remaining use of Splash (which was called by super().mapResponse) from serlo_spider -- Serlo webpages reliably crashed the 'Splash'-container during the start of a crawl Signed-off-by: criamos <[email protected]>
- if the URL found within OERSI's '_source.id'-field differs from the URL resolved by Scrapy, both strings will be saved to 'technical.location' -- this might be necessary for future duplicate detection - this change was made due to KLISUM-212 Signed-off-by: criamos <[email protected]>
- fix 'intendedEndUserRole' mapping for "mentor" (-> "counsellor") -- the 'intendedEndUserRole'-Vocab had a misplaced description string, which was placed below "author" and was actually meant to be the description string of "counsellor" Signed-off-by: criamos <[email protected]>
- discipline '900' -> Lisum shorthand 'B-BCM' (Basiscurriculum Medienbildung) - Lisum "learningResourceType" valuespace should natively support the LRT value 'text' now, therefore the previous mapping is no longer needed
- change: 'license.internal' no longer defaults to 'copyright'
  - this change was requested as a temporary workaround until mixed Tutory licenses can be determined systematically
- fix: API pagination
  - the previous method of paginating through Tutory's API no longer works because the old 'pageSize' URL parameter now returns an HTTP Error 502 (Bad Gateway)
  - tutory_spider will attempt to crawl through the API pages in iterations of 5000. If the API returns similar HTTP errors in the future, try lowering the "api_pagesize_limit" variable.
- style: code formatting (via black)
Signed-off-by: criamos <[email protected]>
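A hedged sketch of the pagination behaviour described above; the endpoint, parameter names and response structure are assumptions, only the 5000-item chunk size and the "api_pagesize_limit" knob come from the commit message:

```python
import scrapy


class TutoryPaginationSketch(scrapy.Spider):
    """Illustrative pagination loop: request API pages in fixed-size chunks and
    stop once the API returns an empty page. Lower 'api_pagesize_limit' if the
    API starts answering with HTTP 502 again."""

    name = "tutory_pagination_sketch"
    api_pagesize_limit = 5000
    api_base_url = "https://www.tutory.de/api/"  # placeholder, not the real endpoint

    def start_requests(self):
        yield scrapy.Request(self.build_page_url(0), callback=self.parse_api_page, cb_kwargs={"page": 0})

    def build_page_url(self, page: int) -> str:
        # parameter names are assumptions; the old 'pageSize' parameter returned HTTP 502
        return f"{self.api_base_url}?limit={self.api_pagesize_limit}&offset={page * self.api_pagesize_limit}"

    def parse_api_page(self, response: scrapy.http.TextResponse, page: int):
        items = response.json().get("data", [])
        yield from items
        if items:  # keep paginating until an empty page is returned
            next_page = page + 1
            yield scrapy.Request(self.build_page_url(next_page), callback=self.parse_api_page,
                                 cb_kwargs={"page": next_page})
```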
- status: all tests passed - ToDos: cleanup / docs / more test-cases / refactoring - style: code formatting (via black)
- chore: flake8 v5.0.3 -> 6.0.0 - chore: pytest 7.1.1 -> 7.2.1
- change: replace crawler-specific RegEx parsing of license strings with the newly implemented LicenseMapper
  - this should reduce maintenance in the long run and will enable us to properly test edge-cases when they occur
- change: try to gather 'description' from json_ld first
  - if it's not available, stick to the previously used "teaser" field
- feat: 'lifecycle' authors / metadata_providers
- feat: license authors
- feat: digitallearninglab_spider overwrites the parse() method
- fix: unnecessary API calls (dupefilter warning during initial API pagination)
- fix: "new_lrt" mapping -> "Unterrichtsbaustein" (according to item_type)
- docs: ToDos for future crawler updates
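A small sketch of the 'description' fallback order; the field names are taken from the commit message, everything else is assumed:

```python
def get_description(json_ld: dict, api_item: dict) -> str | None:
    """Prefer the JSON-LD 'description'; fall back to the previously used 'teaser' field."""
    description = json_ld.get("description")
    if description and description.strip():
        return description.strip()
    return api_item.get("teaser")
```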
- add: "CC_BY_NC_ND" and "CC_BY_NC_SA" Signed-off-by: criamos <[email protected]>
- change: switch to Playwright for HTML extraction
  - after Serlo, Tutory is the next website that seems to cause the "Splash" container to crash after a while
- feat: gather 'license.author' metadata in accordance with the "publishName" flag of the Tutory API
- fix: gather "description" metadata with additional fallbacks
  - the previously used XPath could no longer be found within Tutory's DOM, and thousands of items were dropped while crawling due to missing 'description' fields
  - the crawler tries to parse the "description" metadata from the Tutory API first, then falls back to the DOM header meta fields
    - if neither of the preferred "description" fields is available, the crawler will try to grab text within the DOM itself
- style: code formatting via black
Signed-off-by: criamos <[email protected]>
- this change was necessary for oeh_spider since the "discipline"-Vocab-key for informatik ("320") does not line up with the eafCode for Informatik ("32002") Signed-off-by: criamos <[email protected]>
- feat: 4th fallback for Serlo 'title'
  - lots of Serlo items only provide a generic "... - lernen mit Serlo!" title if the user didn't specify a title for their specific exercise
    - this occurs more than 2700 times for "Mathe Aufgabe - lernen mit Serlo!" alone
  - therefore we now use the "lernen mit Serlo!" string as an indicator that the title is most probably a generic one (set by the Serlo CMS) and try to extract a fallback title from the last breadcrumb label instead
Signed-off-by: criamos <[email protected]>
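A sketch of that title fallback under the assumptions stated here; the helper name and breadcrumb handling are illustrative:

```python
GENERIC_TITLE_MARKER = "lernen mit Serlo!"


def pick_serlo_title(raw_title: str | None, breadcrumb_labels: list[str]) -> str | None:
    """Treat '... - lernen mit Serlo!' as a generic CMS title and prefer the last breadcrumb label."""
    if raw_title and GENERIC_TITLE_MARKER not in raw_title:
        return raw_title
    if breadcrumb_labels:
        return breadcrumb_labels[-1]
    return raw_title
```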
- line length 120
- change: remove hard-coded "discipline" values for OMA entries and use the new valuespaces field for the "hochschulfaechersystematik" vocab instead
- add: 13 new metadata providers
- change: use LicenseMapper for URL parsing (since each metadata provider might use slightly different values)
- add: 'affiliation' metadata for persons (the OERSI fields 'contributor' and 'author' optionally provide these additional fields)
- add: 'datePublished' for lifecycle publishers
- fix: the 'getHash' method first checks whether the 'datePublished' or 'dateCreated' fields exist at all before trying to access their values
- fix: edge-case for missing 'technical.location' values
- remove: "hcrt" mapping value "reference_work"
  - fixes a mixup between the hcrt key "index" and its prefLabel "reference work"
  - since "index" is available in both hcrt and the "old" learningResourceType, this key doesn't need to appear in our mapping table to "new_lrt"
- style: code formatting via black
- uses the corresponding edu-sharing field "ccm:oeh_taxonid_university" - since the "hochschulfaechersystematik" is currently generated as a 'scheme.json' instead of the usual 'index.json' file by SkoHub, a workaround was necessary in the meantime Signed-off-by: criamos <[email protected]>
- previously, Playwright waited until "networkidle", which resulted in some pages taking excessively long to grab a screenshot / text
  - this change might have side-effects on some websites that fire the DOMContentLoaded event slightly too early, but the majority of websites should behave nicely
  - waiting until 'networkidle' was a desperate workaround anyway, since some websites almost continuously cause traffic (e.g. due to site metrics or huge videos starting to buffer)
Signed-off-by: criamos <[email protected]>
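A minimal Playwright sketch of the changed wait condition; the helper function is illustrative, only the switch away from "networkidle" to an earlier load event is from the commit:

```python
import asyncio

from playwright.async_api import async_playwright


async def fetch_page_html(url: str) -> str:
    """Grab the page HTML without waiting for all network traffic to settle."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        # previously: wait_until="networkidle" (stalled on pages with continuous traffic)
        await page.goto(url, wait_until="domcontentloaded")
        html = await page.content()
        await browser.close()
        return html


# usage: asyncio.run(fetch_page_html("https://example.org"))
```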
- activate all OERSI metadata providers for crawling - revert to the default thumbnail behaviour -- use the provided thumbnail URL first and only take a screenshot of the website if no thumbnail was provided -- overwriting generic thumbnails with Playwright screenshots could be implemented in a future version, if desired Signed-off-by: criamos <[email protected]>
- fix: edge-cases observed during OERSI crawls for license URLs ("deed.DE", "deed.CA") - add: additional test-cases for 2- and 4-char variations of CC license deeds Signed-off-by: criamos <[email protected]>
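A hedged sketch of the kind of test-cases added for locale-suffixed deed URLs; the helper and its regex are illustrative, not the pipeline's actual implementation:

```python
import re

import pytest


def strip_cc_deed_suffix(url: str) -> str:
    """Remove trailing 'deed.XX' / 'deed.xx_YY' locale suffixes from CC license URLs."""
    return re.sub(r"deed\.[A-Za-z]{2}(_[A-Za-z]{2})?$", "", url)


@pytest.mark.parametrize(
    "deed_url, expected",
    [
        ("https://creativecommons.org/licenses/by/4.0/deed.DE", "https://creativecommons.org/licenses/by/4.0/"),
        ("https://creativecommons.org/licenses/by-sa/4.0/deed.CA", "https://creativecommons.org/licenses/by-sa/4.0/"),
        ("https://creativecommons.org/licenses/by-nc/4.0/deed.de_DE", "https://creativecommons.org/licenses/by-nc/4.0/"),
    ],
)
def test_strip_cc_deed_suffix(deed_url, expected):
    assert strip_cc_deed_suffix(deed_url) == expected
```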
- improvement: additional metadata fields are considered for 'lifecycle' metadata_provider - (temporarily) deactivate "BC Campus" metadata provider -- reason: the website appears to detect webcrawlers? needs further investigation Signed-off-by: criamos <[email protected]>
- improve: 'general.identifier' takes the "_source.id" value (a URL) if available
  - both the unresolved and resolved URLs will be saved to 'technical.location' anyway for future duplicate detection routines
- add: OERSI "audience" to "intendedEndUserRole" mapping
  - the "audience" field only occurs for the "Finnish Library of Open Educational Resources"
- add: hard-coded value for "educationalContext"
- workaround: temporarily deactivate crawling of the "Finnish Library of Open Educational Resources"
  - this specific provider serves malformed URLs which contain a URI fragment ("#" in the middle of the URL string) that cannot be resolved by Scrapy
  - URLs containing URI fragments get cut off at the "#", which makes Scrapy shorten the Request and identify each URL as a DuplicateRequest
Signed-off-by: criamos <[email protected]>
OERSI: feature: `hochschulfaechersystematik` (and further crawler-updates)
- while checking the results on Staging, a few more URL paths were identified that should not be crawled because they aren't learning materials:
  - bpb.de URLs that end with "/kontakt/", "/impressum/" or "/redaktion/"
    - e.g. https://www.bpb.de/themen/migration-integration/kurzdossiers/172761/impressum/ is not a desired (to-be-crawled) item in itself, but learners who stumble upon https://www.bpb.de/themen/migration-integration/kurzdossiers/ will still be able to reach that information (if they need to) by pressing the "Inhalt" button
- since the recent workaround for Drupal's BigPipe "no-JS"-cookie seems to have been successful, we can try to increase the crawling throughput again - change / code cleanup: remove "sitemap_rules"-variable (since it is only used in SitemapSpiders)
- both during startup and closing of the crawler, counters will be displayed for the number of:
  - unique URLs that were parsed from the sitemaps and are expected to be passed into the "parse()" method
  - unique URLs that are expected to be filtered / dropped according to our deny_list, hash check etc.
- this should make it clearer during later crawls how many URLs we expect to crawl and how many of those are filtered out
- implemented an additional URL check that catches URLs ending with known "Impressum"-like substrings -- while the deny_list looks at URL paths that could appear anywhere in the URL, this additional check explicitly only looks for specific substrings at the end of a URL (that would not be picked up by the previous deny_list)
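A sketch of the end-of-URL check; the suffix list is taken from the bpb.de examples in this PR, the names are assumptions:

```python
IMPRESSUM_LIKE_SUFFIXES = ("/kontakt/", "/impressum/", "/redaktion/")


def is_denied_by_url_ending(url: str) -> bool:
    """True if the URL ends with a known 'Impressum'-like path segment."""
    return url.lower().endswith(IMPRESSUM_LIKE_SUFFIXES)
```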
- while debugging bpb_spider, several license URL edge-cases turned up that weren't handled properly by the license pipeline yet
  - while CC 1.0 licenses shouldn't be used anymore and URLs pointing to those deeds are considered legacy URLs (see: https://creativecommons.org/licenses/), the license pipeline should recognize these URLs anyway and save them accordingly
- tests: added two test-cases for license URLs from bpb.de
- fix: handling for CC0 edge-cases where the string would not get picked up by the crawler-specific RegEx -- if the crawler-specific RegEx fails to parse/detect a CC pattern, we'll use the (less precise) fallback method of LicenseMapper for string detection
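A hedged sketch of that fallback order; the crawler-specific pattern and the LicenseMapper method name are assumptions about the code, not verified API:

```python
import re

# illustrative crawler-specific pattern (the real crawlers use their own RegEx)
CRAWLER_CC_PATTERN = re.compile(r"CC[ _-]?(BY|0)[A-Z0-9 ._-]*", re.IGNORECASE)


def detect_license_key(raw_license_string: str, license_mapper) -> str | None:
    """Try the precise crawler-specific RegEx first, then the broader LicenseMapper."""
    match = CRAWLER_CC_PATTERN.search(raw_license_string)
    if match:
        return match.group(0).strip()
    # fallback: the (less precise) shared string detection
    # (method name is an assumption about LicenseMapper's interface)
    return license_mapper.get_license_internal_key(raw_license_string)
```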
- after a short consultation with Torsten, added the missing DocStrings for the ResponseItem properties (especially information related to 'full text' extraction)
  - at the moment, the only field that's actively used / stored within edu-sharing is 'ResponseItem.text' (which should be used for 'full text' extraction)
  - the other fields ('cookies', 'headers', 'har', 'html', 'status', 'url') have never been connected / mapped to individual edu-sharing properties and are therefore not (yet) in use or might be obsolete
- change: update the 'discipline' mapping for the following DiLerTube categories:
  - "Gesundheit und Soziales (GuS)"
  - "Informatik & Medienbildung"
  - "Technik"
- feat: use keywords (see: "tags" from https://www.dilertube.de/component/tags/) to:
  - match "grundschule" items (-> 'educationalContext')
  - match "methoden und erklärvideos" (-> 'new_lrt')
- perf: slightly increase Scrapy's AutoThrottle "target concurrency" setting
- fix: "erklärvideo" mapping now looks for the value within the lowercase keyword string (instead of checking for string equality) - code cleanup / docs
- fix: use a more precise XPath selector for license URLs to retrieve the article license
  - this fixes the edge-cases where there were multiple license URLs within an article (e.g. PDFs or images with their own license)
- feat: title fallback for ambiguous titles / headlines
  - during the "Rohdatenprüfung" (raw-data review) with Anja, we observed articles whose titles wouldn't be helpful to users
    - e.g. "Literatur" or "Weiterführende Links"
  - if we encounter such unhelpful titles, we'll try to use the breadcrumbs navigation bar and build a more precise title from those elements
    - example: https://www.bpb.de/themen/medien-journalismus/krieg-in-den-medien/130755/weiterfuehrende-links/
      - the ambiguous title "Weiterführende Links" would become "Themen > Politik > Medien & Digitales > Medien & Journalismus > Krieg in den Medien > Weiterführende Links" instead
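A sketch of the breadcrumb-based title fallback; the set of ambiguous titles and the separator are assumptions based on the examples in this PR:

```python
AMBIGUOUS_TITLES = {"Literatur", "Weiterführende Links", "Glossar"}


def build_fallback_title(title: str, breadcrumb_labels: list[str]) -> str:
    """Replace an ambiguous headline with a title assembled from the breadcrumbs navigation bar."""
    if title in AMBIGUOUS_TITLES and breadcrumb_labels:
        return " > ".join(breadcrumb_labels)
    return title
```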
…eature) - docs: lay out necessary steps to be able to handle YouTube captions - style: fix 9 weak warnings by code formatting via black and refactoring method names to be more pythonic
- added channels: "Sehen & Verstehen - Experimente und meeehr", "MathemaTrick", "Christian Spannagel"
- fix: change the custom_url of the YT channel "Sehen & Verstehen" to its "YouTube Handle" URL
  - the custom URL format "https://www.youtube.com/c/sehenverstehenexperimenteundmeeehr/" is no longer supported by our YouTube crawler
    - by clicking on "Home" / "Videos" once within a browser, YouTube redirects to the new "YouTube Handle" URL: https://www.youtube.com/@Unkauf_MC
- feat: reworked the "request_row()" method to enable parsing of the "YouTube Handle" URL format
  - see: https://support.google.com/youtube/answer/6180214?hl=en&sjid=8649083492401077263-EU and https://support.google.com/youtube/answer/11585688?hl=en&sjid=1154139518236355177-EU
- change/remove: the previous "parse_custom_url()" method relied on an HTTP response body that is no longer (reliably) available, causing crawls to fail silently
  - observing youtube_spider in the debugger showed that YouTube redirected our HTTP requests for custom URLs to a data protection / cookie consent pre-page, which does not contain the necessary channel_id information (which was REQUIRED for subsequent requests)
  - before adding custom URLs to csv/youtube.csv, always make sure that a "YouTube Handle" URL is used instead! (The crawler will throw a warning if a custom URL is detected that couldn't be handled)
- style: fix whitespace in a logging message
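A hedged sketch of the URL handling described above; the function is illustrative and not the crawler's actual request_row() implementation:

```python
import logging
from urllib.parse import urlparse


def classify_youtube_url(url: str) -> str:
    """Distinguish 'YouTube Handle' URLs from channel-ID and legacy custom URLs."""
    path = urlparse(url).path.strip("/")
    if path.startswith("@"):
        return "handle"  # e.g. https://www.youtube.com/@Unkauf_MC
    if path.startswith("channel/"):
        return "channel_id"
    if path.startswith("c/") or path.startswith("user/"):
        logging.warning("Custom URL detected (%s) - please use a 'YouTube Handle' URL in csv/youtube.csv instead.", url)
        return "custom"
    return "unknown"
```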
- this fixes a UnicodeDecodeError thrown by Scrapy's "robots.txt"-parser when trying to download the robots.txt file from YouTube's image host ("i.ytimg.com") at the start of a crawl process
ITSJOINTLY-1323 - add new channels and support "YouTube Handle" URLs
- fix: the breadcrumbs title fallback omitted the last word of the breadcrumbs list
  - title strings assembled from the breadcrumbs list were missing the last word ("Glossar", "Links" etc.) because the last breadcrumbs item uses a different CSS class than the rest of the strings
- decrease the log level of the "getId()" method from 'warning' to 'debug'
  - lots of items do not provide a stable ID -> throwing a warning for each of them is too spammy in the Kubernetes logs
…ates Crawler Updates (Q1 2024) - KMap, DiLerTube, BpB, Tutory, YouTube
Merge changes between 2024-01 and 2024-04-10 into master
Description
Links to Tickets or other pull requests
https://ticketsystem.dbildungscloud.de/browse/DMED-119
hpi-schul-cloud/dof_app_deploy#713
hpi-schul-cloud/schulcloud-server#4692
hpi-schul-cloud/schulcloud-client#3387
hpi-schul-cloud/nuxt-client#2933
Links to deployments:
https://dmed-119-integration-of-search-environment.dbc.dbildungscloud.dev/
https://dmed-119-integration-of-search-environment.nbc.dbildungscloud.dev/
https://dmed-119-integration-of-search-environment.brb.dbildungscloud.dev/
Changes
Datasecurity
Deployment
New Repos, NPM packages or vendor scripts
Screenshots of UI changes
Approval for review