fix: refine filetype detection (Unstructured-IO#3828)
**Summary**

Fixes a bug where a CSV file with asserted content-type `application/vnd.ms-excel` was incorrectly identified as an XLS file and failed partitioning.

**Additional Context**

The `content_type` argument to partitioning is often authored by the client system (e.g. Unstructured SDK) and is both unreliable and outside the control of the user. In this case the `.csv -> XLS` mapping is correct for certain purposes (Excel is often used to load and edit CSV files) but not for partitioning, and the user has no readily available way to override the mapping.

XLS files, as well as seven other common binary file types, can be detected efficiently and with near-perfect reliability (at least 99.999% of the time) using code we already have in the file detector.

- Promote this direct-inspection strategy to be tried first.
- When DOC, DOCX, EPUB, ODT, PPT, PPTX, XLS, or XLSX is detected, use that file-type.
- When one of those types is NOT detected, clear the asserted `content_type` when it matches any of those types. This prevents the problem seen in the bug, where the asserted content type was used to determine the file-type.
- The remaining strategies (asserted `content_type`, guessed MIME-type, and filename-extension mapping) are tried, in that order, only when direct inspection fails. This is largely the same as it was before.
- Fix Unstructured-IO#3781 while we were in the neighborhood.
- Fix Unstructured-IO#3596 as well, essentially an earlier report of Unstructured-IO#3781.
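The strategy ordering described above can be sketched roughly as follows. This is an illustrative outline only, not the code in this commit: `detect_file_type`, the `inspect` and `guess_mime` callables, and the small `MIME_TO_TYPE` and `EXT_TO_TYPE` tables are hypothetical names introduced for the example.

```python
import os
from typing import Callable, Optional

# The eight types the commit message says direct inspection can detect.
DIRECTLY_DETECTABLE = {"DOC", "DOCX", "EPUB", "ODT", "PPT", "PPTX", "XLS", "XLSX"}

# Hypothetical, deliberately tiny lookup tables for illustration.
MIME_TO_TYPE = {
    "application/vnd.ms-excel": "XLS",
    "application/msword": "DOC",
    "text/csv": "CSV",
    "text/plain": "TXT",
}
EXT_TO_TYPE = {".csv": "CSV", ".xls": "XLS", ".doc": "DOC", ".txt": "TXT"}


def detect_file_type(
    head: bytes,
    content_type: Optional[str],
    filename: str,
    inspect: Callable[[bytes], Optional[str]],
    guess_mime: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Return a file-type label, or None when no strategy succeeds."""
    # 1. Direct inspection is tried first and wins when it finds one of the types.
    detected = inspect(head)
    if detected in DIRECTLY_DETECTABLE:
        return detected

    # 2. Inspection found none of those types, so an asserted content type
    #    that names one of them cannot be trusted and is cleared.
    if MIME_TO_TYPE.get(content_type) in DIRECTLY_DETECTABLE:
        content_type = None

    # 3. Fall back, in order: asserted content type, guessed MIME type, extension.
    if content_type in MIME_TO_TYPE:
        return MIME_TO_TYPE[content_type]
    guessed = guess_mime(head)
    if guessed in MIME_TO_TYPE:
        return MIME_TO_TYPE[guessed]
    extension = os.path.splitext(filename)[1].lower()
    return EXT_TO_TYPE.get(extension)
```

With a stub inspector that finds nothing, a CSV payload asserted as `application/vnd.ms-excel` falls through to the filename extension (or to the guessed MIME type, if one is available) instead of being treated as XLS, which is the behavior the fix is after.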