fix: refine filetype detection (Unstructured-IO#3828)
**Summary**
Fixes a bug where a CSV file with asserted content-type
`application/vnd.ms-excel` was incorrectly identified as an XLS file and
failed partitioning.

**Additional Context**
The `content_type` argument to partitioning is often authored by the
client system (e.g. Unstructured SDK) and is both unreliable and outside
the control of the user. In this case the `.csv -> XLS` mapping is
correct for certain purposes (Excel is often used to load and edit CSV
files) but not for partitioning, and the user has no readily available
way to override the mapping.

XLS files, as well as seven other common binary file types, can be
detected efficiently and reliably (at least 99.999% of the time) using
code we already have in the file detector.
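
As a rough illustration of why these types lend themselves to direct inspection: DOC, PPT, and XLS are stored in the OLE compound-file container, while DOCX, EPUB, ODT, PPTX, and XLSX are zip archives, and both containers begin with a fixed signature. The sketch below only checks those leading bytes. `sniff_container` is a hypothetical helper written for this example, not the detector's actual code, and whether the detector works this way is an assumption.

```python
from __future__ import annotations

# Header signature of the OLE compound-file container.
OLE_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")
# Local-file-header signature that opens a non-empty zip archive.
ZIP_SIGNATURE = b"PK\x03\x04"


def sniff_container(head: bytes) -> str | None:
    """Classify the container format from the first bytes of a file.

    Hypothetical helper for illustration only.
    """
    if head.startswith(OLE_SIGNATURE):
        return "ole"  # container used by DOC, PPT, and XLS
    if head.startswith(ZIP_SIGNATURE):
        return "zip"  # container used by DOCX, EPUB, ODT, PPTX, and XLSX
    return None  # e.g. plain text such as CSV
```

A container match only narrows the candidates. Telling XLS from DOC, or XLSX from DOCX, additionally requires looking at the streams or archive members inside the container.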

- Promote this direct-inspection strategy to be tried first.
- When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use
that file-type.
- When one of those types is NOT detected, clear the asserted
`content_type` when it matches any of those types. This prevents the
problem seen in the bug where the asserted content type was used to
determine the file-type.
- The remaining strategies (asserted `content_type`, guessed MIME-type,
and filename-extension mapping) are tried, in that order, only when
direct inspection fails. This is largely the same as it was before.
- Fix Unstructured-IO#3781 while we were in the neighborhood.
- Fix Unstructured-IO#3596 as well, essentially an earlier report of Unstructured-IO#3781.
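
The strategy order described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this commit: `detect`, `inspect_directly`, and the lookup tables are hypothetical names, direct inspection covers only three zip-based types to keep it short, and the MIME-guessing step is left as a comment.

```python
from __future__ import annotations

import io
import os
import zipfile

# Hypothetical tables for this sketch; the real mappings are larger.
BINARY_TYPES = {"doc", "docx", "epub", "odt", "ppt", "pptx", "xls", "xlsx"}
MIME_TO_TYPE = {
    "application/vnd.ms-excel": "xls",
    "text/csv": "csv",
    "text/plain": "txt",
}
EXT_TO_TYPE = {".csv": "csv", ".txt": "txt"}
# Archive member that marks each Office Open XML type.
ZIP_MARKERS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def inspect_directly(data: bytes) -> str | None:
    """Return a binary file type found by examining the bytes, else None."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
    for marker, file_type in ZIP_MARKERS.items():
        if marker in names:
            return file_type
    return None


def detect(
    data: bytes, content_type: str | None = None, filename: str | None = None
) -> str | None:
    # 1. Direct inspection is tried first and wins when it finds something.
    detected = inspect_directly(data)
    if detected is not None:
        return detected
    # 2. Inspection found none of the binary types, so an asserted
    #    content type naming one of them cannot be right: clear it.
    if MIME_TO_TYPE.get(content_type or "") in BINARY_TYPES:
        content_type = None
    # 3. Fall back to the asserted content type, if any survives.
    if content_type in MIME_TO_TYPE:
        return MIME_TO_TYPE[content_type]
    # 4. (Guessing the MIME type from content would go here; omitted.)
    # 5. Last resort: map the filename extension.
    if filename:
        return EXT_TO_TYPE.get(os.path.splitext(filename)[1].lower())
    return None
```

With this ordering, `detect(b"a,b\n1,2\n", content_type="application/vnd.ms-excel", filename="data.csv")` returns `"csv"`: inspection finds no binary type, the asserted XLS content type is cleared, and the extension decides.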
scanny authored Dec 17, 2024
1 parent 10f0d54 commit b5ff79d
Showing 4 changed files with 224 additions and 479 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.16.12-dev2
+## 0.16.12-dev3

### Enhancements

@@ -9,6 +9,7 @@
### Fixes

- **Upgrade ruff to latest.** Previously the ruff version was pinned to <0.5. Remove that pin and fix the handful of lint items that resulted.
- **CSV with asserted XLS content-type is correctly identified as CSV.** Resolves a bug where a CSV file with an asserted content-type of `application/vnd.ms-excel` was incorrectly identified as an XLS file.

## 0.16.11
