getting upstream changes #1

Open
wants to merge 507 commits into
base: octoai
from

Conversation

ptorru

@ptorru ptorru commented Jan 29, 2024

No description provided.

scanny and others added 30 commits June 13, 2024 18:19
**Summary**
Pagination of HTML documents is currently unused. The `Page` class and
concept were deeply embedded in the legacy organization of HTML
partitioning code due to the legacy `Document` (= pages of elements)
domain model. Remove this concept from the code such that elements are
available directly from the partitioner.

**Additional Context**
- Pagination can be re-added later if we decide we want it again. A
re-implementation would be much simpler and much lower impact to the
structure of the code and introduce much less additional complexity,
similar to the approach we take in `partition_docx()`.
### Description
Add tqdm support to show a progress bar with the status of each job as
it runs. Supported for each mode (serial, async, multiprocess). Also add
a small timing wrapper around jobs to print how long they took in total.
Update to the evaluation script to handle correct HTML syntax for
tables.
See Unstructured-IO/unstructured-inference#355
for details.

This change:
- modifies transforming HTML tables to evaluation internal `cells`
format
- fixes the indexing of the output (internal format cells) when HTML
cells use spans
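
The span-indexing problem described above can be illustrated with a minimal sketch. It is not the evaluation module's code: `html_table_to_cells()` and the simplified cell dict (`row_index`, `col_index`, `content`) are hypothetical, and the real internal `cells` format may differ. The point is only that a cell's column index must skip grid positions already occupied by an earlier `rowspan`/`colspan`.

```python
from html.parser import HTMLParser


class _CellCollector(HTMLParser):
    """Collect rowspan, colspan, and text for each <td>/<th>, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            a = dict(attrs)
            self.rows[-1].append(
                {
                    "rowspan": int(a.get("rowspan") or 1),
                    "colspan": int(a.get("colspan") or 1),
                    "text": "",
                }
            )
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1]["text"] += data.strip()


def html_table_to_cells(html):
    """Hypothetical helper: flatten an HTML table into grid-indexed cells."""
    parser = _CellCollector()
    parser.feed(html)
    occupied = set()  # (row, col) positions already covered by some cell
    cells = []
    for r, row in enumerate(parser.rows):
        c = 0
        for cell in row:
            # Skip positions covered by spans from earlier rows or cells.
            while (r, c) in occupied:
                c += 1
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    occupied.add((r + dr, c + dc))
            cells.append({"row_index": r, "col_index": c, "content": cell["text"]})
            c += cell["colspan"]
    return cells
```

In this sketch, a cell in the second row that follows a `rowspan="2"` cell from the first row lands in column 1 rather than column 0.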
### Description
The main goal of this was to reduce the duplicate code that was being
written for each ingest pipeline step to support both async and
non-async functionality.

Additional bug fixes found and fixed:
* each logger for ingest wasn't being instantiated correctly. This was
fixed to instantiate in the beginning of a pipeline run as soon as the
verbosity level can be determined.
* The `requires_dependencies` wrapper wasn't wrapping async functions
correctly. This was fixed so that `asyncio.iscoroutinefunction()` gets
triggered correctly.
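
As a rough illustration of the wrapper fix above, the sketch below shows one way a dependency-checking decorator can keep coroutine functions recognizable as coroutine functions. It is a simplified stand-in, not the library's `requires_dependencies` implementation: the signature and the `importlib.util.find_spec()` check on top-level module names are assumptions made for the example.

```python
import asyncio
import functools
import importlib.util


def requires_dependencies(*modules):
    """Simplified, hypothetical decorator: fail at call time if a module is missing."""

    def _check():
        # find_spec() returns None for a top-level module that is not installed.
        missing = [m for m in modules if importlib.util.find_spec(m) is None]
        if missing:
            raise ImportError(f"Missing dependencies: {', '.join(missing)}")

    def decorator(func):
        if asyncio.iscoroutinefunction(func):
            # Wrap with an `async def` so the result is still a coroutine function.
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                _check()
                return await func(*args, **kwargs)

            return async_wrapper

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            _check()
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping an `async def` in a plain `def` would return a regular function, so callers that branch on `asyncio.iscoroutinefunction()` would take the synchronous path by mistake.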
### Summary
- bump unstructured-inference to `0.7.35` which fixed syntax for
generated HTML tables
- update unit tests and ingest test fixtures to reflect changes in the
generated HTML tables
- cut a release for `0.14.6`

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
### Summary

Updates the `wolfi` image to pull from the upstream `wolfi-base` base
image to avoid maintaining the base layers in both locations. Closes
#3105 by pulling in the fix from upstream.

### Testing

`test_dockerfile` should continue to pass with the changes.
**Summary**
Remove HTML-specific element types and return "regular" elements like
`Title` and `NarrativeText` from `partition_html()`.

**Additional Context**
- An aspect of the legacy HTML partitioner was the use of HTML-specific
element types used to track metadata during partitioning.
- That role is no longer necessary or desirable.
- HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were
returned not only from partitioning HTML but also from the seven other
file formats that broker partitioning to HTML (convert-to-HTML and
partition_html()).
This does not cause immediate breakage because these are still `Text`
element subtypes, but it produces a confusing developer experience.
- Remove the prior metadata roles from HTML-specific elements and remove
those element types entirely.
### Summary

Version bumps for the week of 2024-06-17. There is now a pin on
`numpy` due to a breaking change in the latest version that we'll need
to investigate and remove in a subsequent PR.
### Summary

Updates handling of tempfiles so that they work on Windows systems.

---------

Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: Matt Robinson <[email protected]>
### Description
The choice to use async must be made carefully: if a connector is set
to use async, the pipeline will not fan out the inputs via
multiprocessing but will instead run in a single process, under the
assumption that it benefits more from async due to heavy network
traffic. This means that code that is not optimized for async and blocks
will make the pipeline perform worse than if the connector were never
marked as async, since the pipeline would otherwise fan that work out
using multiprocessing.

All connectors and processes in the pipeline were revisited to make sure
this criterion was met and were updated accordingly:
* Currently the unstructured client does not support making requests
async, so this was moved over to use multiprocessing
* fsspec connector was updated to use the async client from the fsspec
library. This also required that the client be a `@property` fetched on
demand, otherwise the client would break the multiprocessing pool since
it maintains a thread lock and that can't be pickled when the fsspec
connector doesn't support async.
* elasticsearch was also updated to use the async client
* weaviate only recently came out with async support in their SDK at a
version that is higher than we can use in the open source repo, so a
TODO was left but otherwise moved to use multiprocessing
* none of the underlying embedders use async, so the embedder step must
use multiprocessing for now. A TODO was left to update the underlying
embedder classes to optionally support async.
* Chunking parameters were not being passed through accurately from the
CLI to the chunker params; this was fixed
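
The `@property` point in the fsspec item above can be sketched as follows. The classes are hypothetical stand-ins (not the connector code): `FakeClient` simply holds a `threading.Lock` to mimic a client that cannot be pickled, which is what breaks a multiprocessing pool when the client is stored on the connector instance.

```python
import pickle
import threading
from dataclasses import dataclass


class FakeClient:
    """Stand-in for a real client; it holds a lock, so it cannot be pickled."""

    def __init__(self, protocol):
        self.protocol = protocol
        self._lock = threading.Lock()


@dataclass
class LazyClientConnector:
    protocol: str

    @property
    def client(self):
        # Built on demand in whichever process needs it; never stored on self.
        return FakeClient(self.protocol)


@dataclass
class EagerClientConnector:
    protocol: str

    def __post_init__(self):
        # Storing the client makes the instance state unpicklable.
        self.client = FakeClient(self.protocol)


lazy_state = vars(LazyClientConnector("s3"))    # {'protocol': 's3'}
eager_state = vars(EagerClientConnector("s3"))  # also contains the client
pickle.dumps(lazy_state)  # fine: only plain data crosses the process boundary
```

Pickling `eager_state` fails because of the lock inside the client, whereas the lazy variant only ever sends its configuration to worker processes.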

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
**Summary**
Remove `unstructured.partition.html.convert_and_partition_html()`. Move
file-type conversion (to HTML) responsibility to each brokering
partitioner that uses that strategy and let them call `partition_html()`
for themselves with the result.

**Additional Context**

Rationale:
- `partition_html()` does not want or need to know which partitioners
might broker partitioning to it.
- Different brokering partitioners have their own methods to convert
their format to HTML and quirks that may be involved for their format.
Avoid coupling them so they can evolve independently.
- The core of the conversion work is already encapsulated in
`unstructured.partition.common.convert_file_to_html_text_using_pandoc()`.
- `convert_and_partition_html()` represents an additional brokering
layer with the entailed complexities of an additional site for default
parameter values to be (mis-)applied and/or dropped and is an additional
location for new parameters to be added.
This PR aims to remove the download packages step since all of that gets
installed in the base images. This PR also updates the base `wolfi`
image because the original base image can not be found anymore:
https://github.com/Unstructured-IO/unstructured/actions/runs/9555654898/job/26339587945
### Description
Migrate elasticsearch destination connector to new v2 ingest framework
This PR aims to fix a docker image publishing error caused by user
changes when pulling the `amd64` image from the `unstructured`
`wolfi-base` image.
(#3213).

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
This PR exposes functions in evaluation module for easy conversion
between tables in Deckerd and HTML formats, which are useful in
evaluation experiments.
…r` (#3246)

The Issue:

When extracting images from pdfs, we use the metadata page number to
index into a list of the images. However, the metadata page number can
now be changed via `starting_page_number`. To get the true page index,
we need to subtract this value.

Testing:

Run this snippet in a python shell. Before the fix, this throws an
IndexError. On this branch, it will return the elements.
```
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper-with-table.pdf"
partition(filename, strategy="hi_res", extract_image_block_types=["Image", "Table"], starting_page_number=20)
```
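
The index correction described under "The Issue" amounts to a single subtraction. The sketch below uses a hypothetical `page_index()` helper and a made-up list of per-page images; it assumes the reported page numbers begin at `starting_page_number`.

```python
def page_index(page_number, starting_page_number=1):
    """Map a reported metadata page number back to a 0-based list index."""
    return page_number - starting_page_number


# Hypothetical per-page image list for a three-page document.
page_images = ["page-1.png", "page-2.png", "page-3.png"]

# With starting_page_number=20 the first page is reported as page 20.
# Indexing with a 1-based assumption (20 - 1) would raise IndexError here.
first = page_images[page_index(20, starting_page_number=20)]
```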

---------

Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: christinestraub <[email protected]>
### Summary

Release PR for `0.14.7`.
### Summary

Sets to latest hash in quay.
**Summary**
Extract as much mechanical refactoring from the HTML parser change-over
into the PR as possible. This leaves the next PR focused on installing
the new parser and the ingest-test impact.

**Reviewers:** Commits are well groomed and reviewing commit-by-commit
is probably easier.

**Additional Context**
This PR introduces the rewritten HTML parser. Its general design is
recursive, consistent with the recursive structure of HTML (tree of
elements). It also adds the unit tests for that parser but it does not
_install_ the parser. So the behavior of `partition_html()` is unchanged
by this PR. The next PR in this series will do that and handle the
ingest and other unit test changes required to reflect the dozen or so
bug-fixes the new parser provides.
**Summary**
Remedy gap where `strategy` argument passed to `partition()` was not
forwarded to `partition_pptx()` or `partition_docx()`.
Following the same pattern of
#3273 and pass down
`strategy` parameter to `partition_ppt` as well.
### Summary

Updates the `arm64` build to use the same `Dockerfile` as `amd64`, since
there are now upstream base images for `wolfi-base` for both
architectures. The legacy `rockylinux-9.4` is now stashed in a
subdirectory of the `docker` directory and is no longer built in CI, but
is available if users would like to build it themselves.

Additionally, this PR includes a fix to symlink `python3` to
`python3.11`, which had caused a CI failure
[here](https://github.com/Unstructured-IO/unstructured/actions/runs/9619486931/job/26535697755).

BREAKING CHANGE: the `arm64` image no longer supports `.doc`, `.pptx`,
or `.xls` because we do not yet have a `libreoffice` `apk` built for
`wolfi-base`. We intend to address that as a follow on. All other
filetypes work.

### Testing

Successfully docker builds, tests, and smoke tests for
[amd64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610735?pr=3268)
and
[arm64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610341?pr=3268)
on the feature branch (with publish disabled).
Thanks to @tullytim we have a new Kafka source and destination
connector. It also works with hosted Kafka via Confluent.

Documentation will be added to the Docs repo.
Moving Chroma destination to the V2 version of connectors.
A few embedder types are missing sensitive field annotations.
### Summary
- bump unstructured-inference to `0.7.35` which fixed `ValueError` when
converting cells to HTML in the table processing subpipeline
- cut a release for `0.14.8`

---------

Co-authored-by: Matt Robinson <[email protected]>
Co-authored-by: Matt Robinson <[email protected]>
### Description
In use cases where an external system (such as code being run in a
jupyter notebook) already has a running event loop, run the async code
in a dedicated thread pool to not conflict with the existing event loop.
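
A minimal sketch of this approach, under the assumption that the caller is synchronous code that may or may not be running inside an event loop: `run_coroutine()` is a hypothetical helper, not the pipeline's actual function.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_coroutine(coro_fn, *args):
    """Run `coro_fn(*args)` to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro_fn(*args))
    # A loop is already running (e.g. in a notebook): run the coroutine on a
    # fresh event loop in a worker thread so the two loops do not conflict.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro_fn(*args)).result()
```

Note that `.result()` blocks the calling thread until the coroutine finishes, which is acceptable for a pipeline step but would stall an interactive loop during long calls.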

This also has a variety of fixes that were found when putting together a
demo leveraging the elasticsearch destination connector
### Description
Migrate the google drive source connector over to the new v2 ingest
framework and include a variety of improvements as part of the refactor:
* The ID is no longer limited to a drive id but can also be the id of a
subfolder within a drive or a file directly and each case is handled
appropriately
* More metadata is pulled in from google drive to enrich the partitioned
elements downstream, and the modified date is now set so that a file is
not reprocessed if the ingest pipeline already has it cached
* timing information is set on the file created when downloaded, based
on the last modified date retrieved from google drive

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
### Description
Move astradb destination connector over to the new v2 ingest framework
christinestraub and others added 30 commits December 4, 2024 11:33
Script to render HTML from unstructured elements.

NOTE: This script is not intended to be used as a module.
NOTE: This script is only intended to be used with outputs with
non-empty `metadata.text_as_html`.

TODO: It was noted that the `unstructured_elements_to_ontology` function
always returns a single page. This script uses helper functions to
handle multiple pages. I am not sure whether that behavior is intended
or a bug; if it is a bug, it would require somewhat longer debugging, so
I used workarounds to make the script usable quickly.

Usage: test with any outputs with non-empty `metadata.text_as_html`.
Example files attached.

`[Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json)`


[Breast_Cancer1-5.pdf.json](https://github.com/user-attachments/files/17922899/Breast_Cancer1-5.pdf.json)
- Per the [ticket](https://unstructured-ai.atlassian.net/browse/ML-551),
there is a bug in the `unstructured` lib under metrics/evaluate.py that
incorrectly retrieves the file extension before the conversion to cct
file from paths like '*.pdf.txt' (see the screenshot below).
    - The current status is shown in the top example.
- We should have the correct version shown in the bottom example of the
screenshot.
   

![image](https://github.com/user-attachments/assets/6d82de85-3b54-4e77-a637-28a27fcb279d)

- In addition, I observed that the doctypes returned are not aligned:
some are returned as '.*' and some are returned without the dot.
- Therefore, I aligned them so that they are all output in the same
form, which is '.*'.
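
One reading of the extension fix above, sketched with `pathlib`: for an output such as `report.pdf.txt`, the doctype of interest is the source extension `.pdf`, returned with its leading dot. The `source_doctype()` helper is hypothetical and may not match what `metrics/evaluate.py` actually does.

```python
from pathlib import Path


def source_doctype(output_path):
    """Return the source document's extension (with dot) for 'name.pdf.txt'-style paths."""
    suffixes = Path(output_path).suffixes
    # The last suffix belongs to the converted output file; the one before it
    # is the original document type. Return "" when there is no such suffix.
    return suffixes[-2] if len(suffixes) >= 2 else ""
```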
release 0.16.10 so that competitor-eval can install and take advantage
of the latest change in the metric calculation
**Summary**
Relax table-segregation rule applied during chunking such that a `Table`
and `Text`-subtype elements can be combined into a single chunk when the
chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of say 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from surrounding context.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
I noticed the ipv4 regex is wrong (it only captures one- or two-digit
octets, e.g. `n.nn.n.nn`). Here's a correction and a bumped test for it.

If you wish I can break out the ipv4 test to its own case, so we don't
interfere with the existing `EMAIL_META_DATA_INPUT` ipv6 extraction
test.

Side note: The comment at `unstructured/nlp/patterns.py#95` includes a
bad ipv4 address example (last octet is wrongfully left-padded with a
zero). I left it as it is because I'm not sure if the intention is to
include "non-conventional" ipv4 addresses, like octal or hexadecimal
octets.
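
To illustrate the kind of defect described above, the sketch below contrasts a pattern limited to one- or two-digit octets with one that allows up to three digits. Both patterns are illustrative assumptions; they are not the expressions in `unstructured/nlp/patterns.py`, and neither range-checks octets against 0-255.

```python
import re

# Illustrative only: octets limited to one or two digits.
NARROW_IPV4 = re.compile(r"\b(?:\d{1,2}\.){3}\d{1,2}\b")

# Illustrative correction: octets of one to three digits.
WIDE_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

header = "Received: from 192.168.1.10 by 10.0.0.5"

narrow_hits = NARROW_IPV4.findall(header)  # misses the address with 3-digit octets
wide_hits = WIDE_IPV4.findall(header)      # finds both addresses
```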
Release only, no code changes.
**Summary**
Prepare auto-partitioning for pluggable partitioners.

Move toward a uniform partitioner call signature in `auto/partition()`
such that a custom or override partitioner can be registered without
requiring code changes.

**Additional Context**
The central job of `auto/partition()` is to detect the file-type of the
given file and use that to dispatch partitioning to the corresponding
partitioner function e.g. `partition_pdf()` or `partition_docx()`.

In the existing code, each partitioner function is called with
parameters "hand-picked" from the available parameters passed to the
`partition()` function. This is unnecessary and couples those
partitioners tightly with the dispatch function. The desired state is
that all available arguments are passed as `kwargs` and the partitioner
function "self-selects" the arguments it will be sensitive to, applies
its own appropriate default values when the argument is omitted, and
simply ignores any arguments it doesn't use. Note that achieving this
requires no changes to partitioner functions because they already do
precisely this.

So the job is to pass all arguments (other than `filename` and `file`)
to the partitioner as `kwargs`. This will allow additional or alternate
partitioners to be registered at runtime and dispatched to, because as
long as they have the signature `partition_x(filename, file, kwargs) ->
list[Element]` then they can be dispatched to without customization.
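
The dispatch idea above can be sketched in a few lines. Everything here is hypothetical: the `PARTITIONERS` registry, `register_partitioner()`, the explicit `file_type` argument (the real `partition()` detects the file type itself), and plain strings standing in for `Element` objects.

```python
from typing import Callable

# Hypothetical registry mapping a file type to its partitioner function.
PARTITIONERS: dict[str, Callable[..., list]] = {}


def register_partitioner(file_type, fn):
    PARTITIONERS[file_type] = fn


def partition(filename=None, file=None, *, file_type, **kwargs):
    """Dispatch to the registered partitioner, forwarding every other argument."""
    partitioner = PARTITIONERS[file_type]
    return partitioner(filename=filename, file=file, **kwargs)


def partition_demo(filename=None, file=None, *, languages=("eng",), **kwargs):
    # "Self-selects" `languages`, applies its own default when it is omitted,
    # and ignores any other keyword argument it receives.
    return [f"{filename}:{','.join(languages)}"]


register_partitioner("demo", partition_demo)
elements = partition("a.demo", file_type="demo", strategy="hi_res", languages=["deu"])
```

Because the dispatcher never names individual parameters, registering a new partitioner needs no change to `partition()` itself.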
**Summary**
CVE-2024-11053 https://curl.se/docs/CVE-2024-11053.html (severity Low)
was published on Dec 11, 2024 and began failing CI builds on open-core
on Dec 13, 2024 when it appeared in `grype` apparently misclassified as
a critical vulnerability.

The severity reported on the CVE is "Low" so it should not fail builds.
Add a `.grype.yaml` file to ignore this CVE until grype is updated.
**Summary**
Remove pin on `ruff` linter and fix the handful of lint errors a newer
version catches.
**Summary**
Fixes a bug where a CSV file with asserted content-type
`application/vnd.ms-excel` was incorrectly identified as an XLS file and
failed partitioning.

**Additional Context**
The `content_type` argument to partitioning is often authored by the
client system (e.g. Unstructured SDK) and is both unreliable and outside
the control of the user. In this case the `.csv -> XLS` mapping is
correct for certain purposes (Excel is often used to load and edit CSV
files) but not for partitioning, and the user has no readily available
way to override the mapping.

XLS files as well as seven other common binary file types can be
efficiently detected 100% of the time (at least 99.999%) using code we
already have in the file detector.

- Promote this direct-inspection strategy to be tried first.
- When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use
that file-type.
- When one of those types is NOT detected, clear the asserted
`content_type` when it matches any of those types. This prevents the
problem seen in the bug where the asserted content type was used to
determine the file-type.
- The remaining content_type, guess MIME-type, and filename-extension
mapping strategies are tried, in that order, only when direct inspection
fails. This is largely the same as it was before.
- Fix #3781 while we were in the neighborhood.
- Fix #3596 as well, essentially an earlier report of #3781.
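
The strategy order listed above might be sketched like this. The function, its parameters, and the two small mapping tables are assumptions for illustration; in particular, the result of the direct-inspection step is passed in as `inspected_type` rather than computed from file bytes.

```python
BINARY_TYPES = {"DOC", "DOCX", "EPUB", "ODT", "PPT", "PPTX", "XLS", "XLSX"}

# Illustrative subsets only; the real mappings are much larger.
MIME_TO_TYPE = {"application/vnd.ms-excel": "XLS", "text/csv": "CSV"}
EXT_TO_TYPE = {".csv": "CSV", ".xls": "XLS"}


def detect_file_type(inspected_type, content_type, guessed_mime, extension):
    """Hypothetical sketch of the detection order described above."""
    # 1. Direct inspection is tried first and trusted for the binary types.
    if inspected_type in BINARY_TYPES:
        return inspected_type
    # 2. Inspection did not find a binary type, so an asserted content type
    #    that maps to one of those types cannot be right: clear it.
    if MIME_TO_TYPE.get(content_type) in BINARY_TYPES:
        content_type = None
    # 3. Remaining strategies, in order: asserted content type, guessed
    #    MIME type, filename extension.
    candidates = (
        MIME_TO_TYPE.get(content_type),
        MIME_TO_TYPE.get(guessed_mime),
        EXT_TO_TYPE.get(extension),
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return None
```

With these tables, a CSV file asserted as `application/vnd.ms-excel` resolves to `CSV` through the guessed MIME type or the extension instead of being treated as XLS.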
Added `CONTRIBUTING.md` from the archived repo as mentioned in the
issue: #3540

Co-authored-by: John <[email protected]>
**Summary**
Improve element-type mapping for Chinese text. Fixes bug where Chinese
text would produce large numbers of false-positive `Title` elements.

Fixes #3084

---------

Co-authored-by: scanny <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Fixes #3666

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
### Description
Add ndjson file type support and treat it the same as json files.