getting upstream changes #1

ptorru · 2024-01-29T19:55:43Z

No description provided.

**Summary** Pagination of HTML documents is currently unused. The `Page` class and concept were deeply embedded in the legacy organization of HTML partitioning code due to the legacy `Document` (= pages of elements) domain model. Remove this concept from the code such that elements are available directly from the partitioner. **Additional Context** - Pagination can be re-added later if we decide we want it again. A re-implementation would be much simpler, have much lower impact on the structure of the code, and introduce much less additional complexity, similar to the approach we take in `partition_docx()`.

### Description Add tqdm support to show a progress bar with the status of each job while it runs. Supported in each mode (serial, async, multiprocess). Also adds a small timing wrapper around jobs that prints how long they took in total.

Update to the evaluation script to handle correct HTML syntax for tables. See Unstructured-IO/unstructured-inference#355 for details. This change: - modifies transforming HTML tables to evaluation internal `cells` format - fixes the indexing of the output (internal format cells) when HTML cells use spans

### Description The main goal of this was to reduce the duplicate code that was being written for each ingest pipeline step to support both async and non-async functionality. Additional bugs found and fixed: * Each logger for ingest wasn't being instantiated correctly. This was fixed so that loggers are instantiated at the beginning of a pipeline run, as soon as the verbosity level can be determined. * The `requires_dependencies` wrapper wasn't wrapping async functions correctly. This was fixed so that `asyncio.iscoroutinefunction()` is triggered correctly.
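
To illustrate the `requires_dependencies` point above, here is a minimal sketch of a dependency-checking decorator that keeps an async function recognizable to `asyncio.iscoroutinefunction()`. It is not the library's implementation; `requires_dependencies_sketch`, `fetch`, and `compute` are hypothetical names used only for this example.

```python
import asyncio
import functools
import importlib.util
from typing import Any, Callable


def requires_dependencies_sketch(*modules: str) -> Callable:
    """Hypothetical stand-in: verify top-level modules exist before calling."""

    def decorator(func: Callable) -> Callable:
        def check() -> None:
            missing = [m for m in modules if importlib.util.find_spec(m) is None]
            if missing:
                raise ImportError(f"Missing dependencies: {', '.join(missing)}")

        if asyncio.iscoroutinefunction(func):
            # Wrap with an `async def` so the result is still a coroutine function.
            @functools.wraps(func)
            async def async_wrapper(*args: Any, **kwargs: Any) -> Any:
                check()
                return await func(*args, **kwargs)

            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args: Any, **kwargs: Any) -> Any:
            check()
            return func(*args, **kwargs)

        return sync_wrapper

    return decorator


@requires_dependencies_sketch("json")
async def fetch() -> str:
    return "ok"


@requires_dependencies_sketch("json")
def compute() -> int:
    return 1
```

A plain synchronous wrapper around an `async def` would return a coroutine when called but would itself no longer be detected as a coroutine function, which is the kind of mismatch the fix describes.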

### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: christinestraub <[email protected]>

### Summary Updates the `wolfi` image to pull from the upstream `wolfi-base` base image to avoid maintaining the base layers in both locations. Closes #3105 by pulling in the fix from upstream. ### Testing `test_dockerfile` should continue to pass with the changes.

**Summary** Remove HTML-specific element types and return "regular" elements like `Title` and `NarrativeText` from `partition_html()`. **Additional Context** - An aspect of the legacy HTML partitioner was the use of HTML-specific element types used to track metadata during partitioning. - That role is no longer necessary or desirable. - HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were returned from partitioning HTML but also from the seven other file formats that broker partitioning to HTML (convert-to-HTML and partition_html()). This does not cause immediate breakage because these are still `Text` element subtypes, but it produces a confusing developer experience. - Remove the prior metadata roles from HTML-specific elements and remove those element types entirely.

### Summary Version bumps for the week of 2024-06-17. There is now a pin on `numpy` due to a breaking change in the latest version that we'll need to investigate and remove in a subsequent PR.

### Summary Updates handling of tempfiles so that they work on Windows systems. --------- Co-authored-by: Matt Robinson <[email protected]> Co-authored-by: Matt Robinson <[email protected]>

### Description
Marking a connector as async must be done carefully: when a connector is set to use async, the pipeline does not fan out its inputs via multiprocessing but is limited to a single process, on the assumption that it benefits more from async due to heavy network traffic. Code that is blocking and not optimized for async will therefore make the pipeline perform worse than if the connector had never been marked async, since the pipeline would otherwise fan that work out using multiprocessing. All connectors and processes in the pipeline were revisited to make sure this criterion was met and were updated accordingly:

* The unstructured client does not currently support making requests async, so this was moved over to use multiprocessing.
* The fsspec connector was updated to use the async client from the fsspec library. This also required that the client be a `@property` fetched on demand; otherwise the client would break the multiprocessing pool, since it maintains a thread lock that can't be pickled, in the case where the fsspec connector doesn't support async.
* elasticsearch was also updated to use the async client.
* weaviate only recently came out with async support in their SDK, at a version higher than we can use in the open source repo, so a TODO was left and the connector was otherwise moved to use multiprocessing.
* None of the underlying embedders use async, so the embedder step must use multiprocessing for now. A TODO was left to update the underlying embedder classes to optionally support async.
* Chunking parameters were not being passed through accurately from the CLI to the chunker params; this was fixed.

---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>

**Summary** Remove `unstructured.partition.html.convert_and_partition_html()`. Move file-type conversion (to HTML) responsibility to each brokering partitioner that uses that strategy and let them call `partition_html()` for themselves with the result. **Additional Context** Rationale: - `partition_html()` does not want or need to know which partitioners might broker partitioning to it. - Different brokering partitioners have their own methods to convert their format to HTML and quirks that may be involved for their format. Avoid coupling them so they can evolve independently. - The core of the conversion work is already encapsulated in `unstructured.partition.common.convert_file_to_html_text_using_pandoc()`. - `convert_and_partition_html()` represents an additional brokering layer with the entailed complexities of an additional site for default parameter values to be (mis-)applied and/or dropped and is an additional location for new parameters to be added.

This PR aims to remove the download packages step since all of that gets installed in the base images. This PR also updates the base `wolfi` image because the original base image can not be found anymore: https://github.com/Unstructured-IO/unstructured/actions/runs/9555654898/job/26339587945

### Description Migrate elasticsearch destination connector to new v2 ingest framework

This PR aims to fix a docker image publishing error caused by user changes when pulling the `amd64` image from the `unstructured` `wolfi-base` image. (#3213). --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: christinestraub <[email protected]>

This PR exposes functions in the evaluation module for easy conversion between tables in Deckerd and HTML formats, which are useful in evaluation experiments.

…r` (#3246)

The Issue: When extracting images from PDFs, we use the metadata page number to index into a list of the images. However, the metadata page number can now be changed via `starting_page_number`. To get the true page index, we need to subtract this value.

Testing: Run this snippet in a Python shell. Before the fix, this throws an IndexError. On this branch, it will return the elements.

```
from unstructured.partition.auto import partition

filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```

---------
Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: christinestraub <[email protected]>
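
The index arithmetic described in this fix can be sketched as follows. This is only an illustration of the subtraction, not the code from the PR; `page_index` is a hypothetical helper, and the default of `1` for `starting_page_number` is an assumption.

```python
def page_index(metadata_page_number: int, starting_page_number: int = 1) -> int:
    """Hypothetical helper: map a reported page number to a 0-based list index."""
    # The reported number is offset by starting_page_number, so subtract it back out.
    return metadata_page_number - starting_page_number


# Three pages of extracted images, stored in a plain 0-based list.
page_images = ["images-page-0", "images-page-1", "images-page-2"]

# With starting_page_number=20 the pages are reported as 20, 21, 22.
second_page = page_images[page_index(21, starting_page_number=20)]
```

Indexing with the raw page number (`page_images[21]`) is what would raise the `IndexError` mentioned above.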

### Summary Release PR for `0.14.7`.

This PR aims to fix a `test_dockerfile` job [failure](https://github.com/Unstructured-IO/unstructured/actions/runs/9613636416/job/26517074221?pr=3234) in CI after `base-images` repo update.

### Summary Sets to latest hash in quay.

**Summary** Extract as much mechanical refactoring from the HTML parser change-over into the PR as possible. This leaves the next PR focused on installing the new parser and the ingest-test impact. **Reviewers:** Commits are well groomed and reviewing commit-by-commit is probably easier. **Additional Context** This PR introduces the rewritten HTML parser. Its general design is recursive, consistent with the recursive structure of HTML (tree of elements). It also adds the unit tests for that parser but it does not _install_ the parser. So the behavior of `partition_html()` is unchanged by this PR. The next PR in this series will do that and handle the ingest and other unit test changes required to reflect the dozen or so bug-fixes the new parser provides.

**Summary** Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_pptx()` or `partition_docx()`.

Following the same pattern as #3273, pass down the `strategy` parameter to `partition_ppt` as well.

### Summary Updates the `arm64` build to use the same `Dockerfile` as `amd64`, since there are now upstream base images for `wolfi-base` for both architectures. The legacy `rockylinux-9.4` is now stashed in a subdirectory of the `docker` directory and is no longer built in CI, but is available if users would like to build it themselves. Additionally, this PR includes a fix to symlink `python3` to `python3.11`, which had caused a CI failure [here](https://github.com/Unstructured-IO/unstructured/actions/runs/9619486931/job/26535697755). BREAKING CHANGE: the `arm64` image no longer supports `.doc`, `.pptx`, or `.xls` because we do not yet have a `libreoffice` `apk` built for `wolfi-base`. We intend to address that as a follow-on. All other filetypes work. ### Testing Successfully docker builds, tests, and smoke tests for [amd64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610735?pr=3268) and [arm64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610341?pr=3268) on the feature branch (with publish disabled).

@tullytim

Thanks to @tullytim we have a new Kafka source and destination connector. It also works with hosted Kafka via Confluent. Documentation will be added to the Docs repo.

Moving Chroma destination to the V2 version of connectors.

A few embedder types are missing sensitive field annotations.

### Summary - bump unstructured-inference to `0.7.35` which fixed `ValueError` when converting cells to HTML in the table processing subpipeline - cut a release for `0.14.8` --------- Co-authored-by: Matt Robinson <[email protected]> Co-authored-by: Matt Robinson <[email protected]>

### Description In use cases where an external system (such as code being run in a Jupyter notebook) already has a running event loop, run the async code in a dedicated thread pool so it does not conflict with the existing event loop. This also includes a variety of fixes that were found when putting together a demo leveraging the elasticsearch destination connector.
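
One way to picture the approach described here is the sketch below, which runs a coroutine directly when no event loop is active and otherwise hands it to a worker thread that owns its own loop. It is an illustration under stated assumptions, not the code from this change; `run_coroutine_sketch` and `work` are hypothetical names.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Awaitable, Callable


def run_coroutine_sketch(make_coro: Callable[[], Awaitable[Any]]) -> Any:
    """Hypothetical helper: run async code whether or not a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread, so asyncio.run() is safe to call directly.
        return asyncio.run(make_coro())

    # A loop is already running (e.g. in a notebook). asyncio.run() would fail
    # here, so start a fresh loop in a dedicated worker thread and wait for it.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(make_coro())).result()


async def work() -> int:
    await asyncio.sleep(0)
    return 42
```

Note that waiting on `.result()` blocks the calling thread, so in this sketch the outer loop is paused until the inner work finishes.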

### Description Migrate the google drive source connector over to the new v2 ingest framework and include a variety of improvements as part of the refactor: * The ID is no longer limited to a drive id but can also be the id of a subfolder within a drive or a file directly and each case is handled appropriately * More metadata is pulled in from google drive to enrich the partitioned elements downstream and now the modified date is being set to not reprocess if the ingest pipeline already has the file cached * timing information is set on the file created when downloaded based on the last modified data retrieved from google drive --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: rbiseck3 <[email protected]>

### Description Move astradb destination connector over to the new v2 ingest framework

Script to render HTML from unstructured elements. NOTE: This script is not intended to be used as a module. NOTE: This script is only intended to be used with outputs that have non-empty `metadata.text_as_html`. TODO: It was noted that the `unstructured_elements_to_ontology` function always returns a single page, so this script uses helper functions to handle multiple pages. I am not sure whether this is intended or a bug; if it is a bug, it would require somewhat longer debugging, so to make the script usable quickly I used workarounds. Usage: test with any outputs that have non-empty `metadata.text_as_html`. Example files attached. `[Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json)` [Breast_Cancer1-5.pdf.json](https://github.com/user-attachments/files/17922899/Breast_Cancer1-5.pdf.json)

- Per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551), there is a bug in the `unstructured` lib under metrics/evaluate.py that incorrectly retrieves the file extension before the conversion to a cct file from paths like '*.pdf.txt' (see the screenshot below). - The current status is shown in the top example of the screenshot; the bottom example shows the correct version. ![image](https://github.com/user-attachments/assets/6d82de85-3b54-4e77-a637-28a27fcb279d) - In addition, I observed that the doctypes returned are not aligned: some are returned as '.*' and some are returned without the dot. - Therefore, I aligned them so they are all output in the same form, which is '.*'.
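
As a rough illustration of the extension handling discussed in this description, the sketch below recovers the source document type from an output path such as `report.pdf.txt` and always returns it with the leading dot. It is not the code in metrics/evaluate.py; `source_doctype` is a hypothetical helper and the fallback for single-suffix paths is an assumption.

```python
from pathlib import Path


def source_doctype(output_path: str) -> str:
    """Hypothetical helper: return the source extension in the dotted '.*' form."""
    suffixes = Path(output_path).suffixes  # 'report.pdf.txt' -> ['.pdf', '.txt']
    if len(suffixes) >= 2:
        # The last suffix is the output format; the one before it is the source type.
        return suffixes[-2].lower()
    # Assumption: with a single suffix, treat it as the document type itself.
    return suffixes[-1].lower() if suffixes else ""
```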

release 0.16.10 so that competitor-eval can install and take advantage of the latest change in the metric calculation

**Summary** Relax table-segregation rule applied during chunking such that a `Table` and `Text`-subtype elements can be combined into a single chunk when the chunking window allows. **Additional Context** Until now, `Table` elements have always been segregated during chunking, i.e. a chunk that contained a table would never contain any other element. In certain scenarios, especially when a large chunking window of say 2000 characters is used, this behavior can reduce retrieval effectiveness by isolating the table from surrounding context. --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: scanny <[email protected]>

I noticed the ipv4 regex is wrong (it only captures one- or two-digit octets, e.g. `n.nn.n.nn`). Here's a correction and a bumped test for it. If you wish, I can break out the ipv4 test into its own case so we don't interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction test. Side note: The comment at `unstructured/nlp/patterns.py#95` includes a bad ipv4 address example (the last octet is wrongly left-padded with a zero). I left it as it is because I'm not sure if the intention is to include "non-conventional" ipv4 addresses, like octal or hexadecimal octets.
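
For reference, a pattern that accepts one- to three-digit octets in the range 0 to 255 might look like the sketch below. This is not the pattern submitted in this PR; `OCTET` and `IPV4_SKETCH` are hypothetical names, and the lookarounds are an added assumption to avoid matching inside longer digit runs.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 (no leading zero on two digits).
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Four octets separated by dots, not glued to neighbouring digits or dots.
IPV4_SKETCH = re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?![\d.])")

found = IPV4_SKETCH.findall("Received: from 192.168.1.254 via 10.0.0.1")
```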

Release only, no code changes.

**Summary** Prepare auto-partitioning for pluggable partitioners. Move toward a uniform partitioner call signature in `auto/partition()` such that a custom or override partitioner can be registered without requiring code changes. **Additional Context** The central job of `auto/partition()` is to detect the file-type of the given file and use that to dispatch partitioning to the corresponding partitioner function e.g. `partition_pdf()` or `partition_docx()`. In the existing code, each partitioner function is called with parameters "hand-picked" from the available parameters passed to the `partition()` function. This is unnecessary and couples those partitioners tightly with the dispatch function. The desired state is that all available arguments are passed as `kwargs` and the partitioner function "self-selects" the arguments it will be sensitive to, applies its own appropriate default values when the argument is omitted, and simply ignores any arguments it doesn't use. Note that achieving this requires no changes to partitioner functions because they already do precisely this. So the job is to pass all arguments (other than `filename` and `file`) to the partitioner as `kwargs`. This will allow additional or alternate partitioners to be registered at runtime and dispatched to, because as long as they have the signature `partition_x(filename, file, kwargs) -> list[Element]` then they can be dispatched to without customization.
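
The dispatch idea described above can be sketched in a few lines. Everything here is hypothetical (the registry, `register_partitioner`, the toy partitioners, and dispatch by filename extension rather than real file-type detection); it only illustrates passing all arguments through as keyword arguments and letting each partitioner pick what it needs.

```python
import os
from typing import Any, Callable, Dict, List, Optional

Partitioner = Callable[..., List[str]]
_PARTITIONERS: Dict[str, Partitioner] = {}


def register_partitioner(extension: str, partitioner: Partitioner) -> None:
    """Register (or override) the partitioner used for a file extension."""
    _PARTITIONERS[extension] = partitioner


def partition_txt_sketch(filename: Optional[str] = None, file: Any = None,
                         *, encoding: str = "utf-8", **kwargs: Any) -> List[str]:
    # Uses `encoding`, applies its own default, ignores everything else.
    return [f"txt:{filename}:{encoding}"]


def partition_pdf_sketch(filename: Optional[str] = None, file: Any = None,
                         *, strategy: str = "auto", **kwargs: Any) -> List[str]:
    # Uses `strategy`, applies its own default, ignores everything else.
    return [f"pdf:{filename}:{strategy}"]


def partition_sketch(filename: str, file: Any = None, **kwargs: Any) -> List[str]:
    """Dispatch without hand-picking arguments for each partitioner."""
    extension = os.path.splitext(filename)[1].lower()
    return _PARTITIONERS[extension](filename=filename, file=file, **kwargs)


register_partitioner(".txt", partition_txt_sketch)
register_partitioner(".pdf", partition_pdf_sketch)
```

Because the dispatcher never names individual arguments, a new partitioner can be registered at runtime with `register_partitioner(".md", my_partitioner)` and used without touching `partition_sketch`.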

**Summary** CVE-2024-11053 https://curl.se/docs/CVE-2024-11053.html (severity Low) was published on Dec 11, 2024 and began failing CI builds on open-core on Dec 13, 2024 when it appeared in `grype` apparently misclassified as a critical vulnerability. The severity reported on the CVE is "Low" so it should not fail builds. Add a `.grype.yaml` file to ignore this CVE until grype is updated.

**Summary** Remove pin on `ruff` linter and fix the handful of lint errors a newer version catches.

**Summary** Fixes a bug where a CSV file with asserted content-type `application/vnd.ms-excel` was incorrectly identified as an XLS file and failed partitioning.

**Additional Context** The `content_type` argument to partitioning is often authored by the client system (e.g. Unstructured SDK) and is both unreliable and outside the control of the user. In this case the `.csv -> XLS` mapping is correct for certain purposes (Excel is often used to load and edit CSV files) but not for partitioning, and the user has no readily available way to override the mapping. XLS files as well as seven other common binary file types can be efficiently detected 100% of the time (at least 99.999%) using code we already have in the file detector.

- Promote this direct-inspection strategy to be tried first.
- When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use that file-type.
- When one of those types is NOT detected, clear the asserted `content_type` when it matches any of those types. This prevents the problem seen in the bug where the asserted content type was used to determine the file-type.
- The remaining content_type, guess MIME-type, and filename-extension mapping strategies are tried, in that order, only when direct inspection fails. This is largely the same as it was before.
- Fix #3781 while we were in the neighborhood.
- Fix #3596 as well, essentially an earlier report of #3781.
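
The ordering of strategies described above can be sketched as below. This is not the file detector's code: the function, the two small lookup tables, and the idea of passing the direct-inspection result in as an argument are all assumptions made to keep the example self-contained.

```python
from typing import Optional

# The eight binary types that direct inspection is said to detect reliably.
BINARY_TYPES = {"DOC", "DOCX", "EPUB", "ODT", "PPT", "PPTX", "XLS", "XLSX"}

# Hypothetical, heavily abbreviated lookup tables.
MIME_TO_TYPE = {"application/vnd.ms-excel": "XLS", "text/csv": "CSV"}
EXTENSION_TO_TYPE = {".csv": "CSV", ".xls": "XLS"}


def detect_file_type_sketch(inspected: Optional[str], content_type: Optional[str],
                            guessed_mime: Optional[str],
                            extension: Optional[str]) -> Optional[str]:
    # 1. Direct inspection is tried first and wins when it finds a binary type.
    if inspected in BINARY_TYPES:
        return inspected
    # 2. Inspection ruled those types out, so an asserted content type that
    #    claims one of them is unreliable and is cleared.
    if MIME_TO_TYPE.get(content_type) in BINARY_TYPES:
        content_type = None
    # 3. Fall back to content type, guessed MIME type, then filename extension.
    candidates = (MIME_TO_TYPE.get(content_type), MIME_TO_TYPE.get(guessed_mime),
                  EXTENSION_TO_TYPE.get(extension))
    return next((c for c in candidates if c), None)
```

With this ordering, a CSV file asserted as `application/vnd.ms-excel` falls through to the guessed MIME type or the `.csv` extension instead of being treated as XLS.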

Added `CONTRIBUTING.md` from the archived repo as mentioned in the issue: #3540 Co-authored-by: John <[email protected]>

**Summary** Improve element-type mapping for Chinese text. Fixes bug where Chinese text would produce large numbers of false-positive `Title` elements. Fixes #3084 --------- Co-authored-by: scanny <[email protected]> Co-authored-by: ryannikolaidis <[email protected]>

Fixes #3666 --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: scanny <[email protected]>

### Description Add ndjson file type support and treat it the same as json files.

scanny and others added 30 commits June 13, 2024 18:19

feat/tqdm ingest support (#3199)

dadc9c6

### Description Add tqdm support to show a progress bar with the status of each job while it runs. Supported in each mode (serial, async, multiprocess). Also adds a small timing wrapper around jobs that prints how long they took in total.

build(deps): version bumps for 2024-06-17 (#3220)

2815226

### Summary Version bumps for the week of 2024-06-17. There is now a pin on `numpy` due to a breaking change in the latest version that we'll need to investigate and remove in a subsequent PR.

enhancement: make tempfiles windows friendly (#3108)

6220633

### Summary Updates handling of tempfiles so that they work on Windows systems. --------- Co-authored-by: Matt Robinson <[email protected]> Co-authored-by: Matt Robinson <[email protected]>

Roman/migrate es dest (#3224)

fd98cf9

### Description Migrate elasticsearch destination connector to new v2 ingest framework

feat: expose converters deckerd -> html and back (#3233)

c3af03d

This PR exposes functions in the evaluation module for easy conversion between tables in Deckerd and HTML formats, which are useful in evaluation experiments.

build: version bump for release 0.14.7 (#3259)

80abbcd

### Summary Release PR for `0.14.7`.

fix: update base image SHA for amd64 wolfi (#3270)

14f149d

This PR aims to fix a `test_dockerfile` job [failure](https://github.com/Unstructured-IO/unstructured/actions/runs/9613636416/job/26517074221?pr=3234) in CI after `base-images` repo update.

build: fix amd64 image hash (#3272)

e1b7553

### Summary Sets to latest hash in quay.

fix(auto): partition() passes strategy to PPTX,DOCX (#3273)

16df694

**Summary** Remedy gap where `strategy` argument passed to `partition()` was not forwarded to `partition_pptx()` or `partition_docx()`.

Feat/pass down strategy to partition ppt as well (#3274)

edddf9f

Following the same pattern as #3273, pass down the `strategy` parameter to `partition_ppt` as well.

feat: Kafka source and destination connector (#3176)

8610bd3

Thanks to @tullytim we have a new Kafka source and destination connector. It also works with hosted Kafka via Confluent. Documentation will be added to the Docs repo.

rfctr: chroma destination migrated to V2 (#3214)

88b08a7

Moving Chroma destination to the V2 version of connectors.

Fix missing sensitive fields for embedders (#3263)

ce591e2

A few embedder types are missing sensitive field annotations.

feat/migrate astra db (#3294)

a7a53f6

### Description Move astradb destination connector over to the new v2 ingest framework

christinestraub and others added 30 commits December 4, 2024 11:33

Feat: enhance quote standardization with comprehensive Unicode covera…

4e0f7cd

…ge and update tests

test: update string tests for consistent quote handling

c821f12

fix: ensure newline at end of file in standardize_quotes function

9038b88

test: enhance quote standardization tests with additional Unicode sce…

c0c3fd6

…narios

test: enhance quote standardization tests with additional Unicode sce…

bb26603

…narios

test:fix lint errors

5e60942

fix: confict error

00a9991

fix:confict error

f44862a

fix: modify changelog.md

a4be1d6

Merge branch 'main' into ML-593/quote-standardization

7d06c12

feat: enhance quote standardization tests with additional Unicode sce…

ef1c85e

…narios

feat: update standardize_quote()

3bca724

release 0.16.10 (#3816)

59e6cff

release 0.16.10 so that competitor-eval can install and take advantage of the latest change in the metric calculation

fix: resolve mergeing conflict error

9076d56

test: fix lint error

2f06d5a

Update CHANGELOG.md (#3818)

18d6c81

release: prepare release 0.16.11 (#3819)

b981d71

Release only, no code changes.

build: remove ruff version upper bound (#3829)

10f0d54

**Summary** Remove pin on `ruff` linter and fix the handful of lint errors a newer version catches.

Added contributing from archived repo (#3616)

9a9bf4c

Added `CONTRIBUTING.md` from the archived repo as mentioned in the issue: #3540 Co-authored-by: John <[email protected]>

fix: html incorrectly categorizing text (#3841)

b3a2dd4

Fixes #3666 --------- Co-authored-by: ryannikolaidis <[email protected]> Co-authored-by: scanny <[email protected]>

feat: add ndjson support (#3845)

50ea6fe

### Description Add ndjson file type support and treat it the same as json files.

minor comment to trigger new container workflow (#3848)

0245661
