fix: when convert doc to docx, UnicodeDecodeError may be raised #3830

Open: wants to merge 1 commit into base: main

YooshiJay

Below is part of the error report:

File d:\programs\anaconda3\envs\ragv3-env\lib\site-packages\unstructured\partition\doc.py:74, in partition_doc(filename, file, metadata_filename, metadata_last_modified, libre_office_filter, **kwargs)
     70         f.write(file.read())
     72 # -- convert the .doc file to .docx. The resulting file takes the same base-name as the
     73 # -- source file and is written to `target_dir`.
---> 74 convert_office_doc(
     75     source_file_path,
     76     target_dir,
     77     target_format="docx",
     78     target_filter=libre_office_filter,
     79 )
     81 # -- compute the path of the resulting .docx document --
     82 _, filename_no_path = os.path.split(os.path.abspath(source_file_path))

File d:\programs\anaconda3\envs\ragv3-env\lib\site-packages\unstructured\partition\common\common.py:299, in convert_office_doc(input_filename, output_directory, target_format, target_filter, wait_for_soffice_ready_time_out)
    297 sleep_time = 0.1
    298 output = subprocess.run(command, capture_output=True)
--> 299 message = output.stdout.decode().strip()
    300 # we can't rely on returncode unfortunately because on macOS it would return 0 even when the
    301 # command failed to run; instead we have to rely on the stdout being empty as a sign of the
    302 # process failed
    303 while (wait_time < wait_for_soffice_ready_time_out) and (message == ""):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 60: invalid continuation byte

The root cause is that my document is written in Chinese, so the bytes being decoded are not UTF-8. When I change the failing line to `message = output.stdout.decode("gbk").strip()`, it works!
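For readers who want to see the failure in isolation, here is a minimal sketch. The bytes are illustrative only: they are the GBK encoding of a short Chinese string, standing in for whatever `soffice` actually printed, which is not shown in this report.

```python
# Illustrative stand-in for GBK-encoded process output (not the real soffice text).
raw = "中文".encode("gbk")  # first byte is 0xd6, as in the traceback

try:
    raw.decode()  # bytes.decode() defaults to UTF-8
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # same error class as in the report

# Decoding with the encoding the bytes were actually written in succeeds.
decoded_gbk = raw.decode("gbk")
```

This only shows that a bare `.decode()` assumes UTF-8; it does not establish which encoding `soffice` uses on any given system.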

So I added a step that checks the encoding before decoding. Hope it helps! :)
(This is my first PR, so I hope I didn't do anything wrong.)
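The diff itself is not shown in this conversation, so the following is not the code from this PR. It is a rough sketch of one alternative approach: trying a list of candidate encodings in order instead of detecting the encoding. The helper name `decode_process_output` and its `encodings` parameter are hypothetical and do not exist in `unstructured`.

```python
import locale
from typing import Optional, Sequence


def decode_process_output(raw: bytes, encodings: Optional[Sequence[str]] = None) -> str:
    """Hypothetical helper: decode subprocess output without raising.

    Tries each candidate encoding in order and returns the first clean decode.
    """
    if encodings is None:
        # Assumption: UTF-8 first, then the system's preferred encoding
        # (often GBK/cp936 on Chinese-language Windows installs).
        encodings = ("utf-8", locale.getpreferredencoding(False))
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown codec, try the next candidate
    # Last resort: keep going with replacement characters instead of crashing.
    return raw.decode("utf-8", errors="replace")
```

With a helper like this, the failing line could become `message = decode_process_output(output.stdout).strip()`. Since the surrounding loop only checks whether `message` is empty, a lossy last-resort decode should be acceptable there, though that is an assumption about how `message` is used later.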

@cragwolfe
Contributor

@YooshiJay, thanks for contributing this PR. Is there any chance you have a one-page .doc file that reproduces the issue and could be used in a unit test?
