fix: when convert doc to docx, UnicodeDecodeError may be raised #3830

YooshiJay · 2024-12-14T09:10:57Z

Below is part of the error report:

File d:\programs\anaconda3\envs\ragv3-env\lib\site-packages\unstructured\partition\doc.py:74, in partition_doc(filename, file, metadata_filename, metadata_last_modified, libre_office_filter, **kwargs)
     70         f.write(file.read())
     72 # -- convert the .doc file to .docx. The resulting file takes the same base-name as the
     73 # -- source file and is written to `target_dir`.
---> 74 convert_office_doc(
     75     source_file_path,
     76     target_dir,
     77     target_format="docx",
     78     target_filter=libre_office_filter,
     79 )
     81 # -- compute the path of the resulting .docx document --
     82 _, filename_no_path = os.path.split(os.path.abspath(source_file_path))

File d:\programs\anaconda3\envs\ragv3-env\lib\site-packages\unstructured\partition\common\common.py:299, in convert_office_doc(input_filename, output_directory, target_format, target_filter, wait_for_soffice_ready_time_out)
    297 sleep_time = 0.1
    298 output = subprocess.run(command, capture_output=True)
--> 299 message = output.stdout.decode().strip()
    300 # we can't rely on returncode unfortunately because on macOS it would return 0 even when the
    301 # command failed to run; instead we have to rely on the stdout being empty as a sign of the
    302 # process failed
    303 while (wait_time < wait_for_soffice_ready_time_out) and (message == ""):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 60: invalid continuation byte

The root cause is that my document is written in Chinese, so its encoding is not UTF-8. When I changed the failing line to `message = output.stdout.decode("gbk").strip()`, it worked!
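For reference, the same error can be reproduced without LibreOffice. This is only a minimal sketch: the sample string is an assumption chosen because its GBK encoding starts with the byte 0xd6 from the traceback, and it is not taken from the actual document.

```python
# Hypothetical sample text; "中文" encodes to b'\xd6\xd0\xce\xc4' in GBK.
raw = "中文".encode("gbk")

try:
    # bytes.decode() defaults to UTF-8, as in the failing line of convert_office_doc.
    raw.decode()
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
print(raw.decode("gbk"))
```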

So I added a step that detects the encoding before decoding. Hope it helps! :)
(This is my first PR, so I hope I didn't do anything wrong.)
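One possible shape for such a step, sketched here under assumptions and not necessarily what this PR's commit does: try UTF-8 first, then the locale's preferred encoding, and only then fall back to a lossy decode. The helper name `decode_process_output` and its `encodings` parameter are hypothetical and do not exist in `unstructured`.

```python
import locale
from typing import Optional, Sequence


def decode_process_output(raw: bytes, encodings: Optional[Sequence[str]] = None) -> str:
    """Decode bytes captured from a subprocess, trying several encodings in order."""
    if encodings is None:
        # Assumption: UTF-8 first, then whatever encoding the OS locale prefers
        # (for example a GBK code page on a Chinese-language Windows install).
        encodings = ("utf-8", locale.getpreferredencoding(False))
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding; try the next candidate.
            continue
    # Last resort: never raise, replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

With a helper like this, the failing line could become `message = decode_process_output(output.stdout).strip()`. Because the final fallback never raises, the existing check for an empty `message` would still behave the same way.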

cragwolfe · 2024-12-14T18:33:48Z

@YooshiJay, thanks for contributing this PR. Is there any chance you have a 1-page .doc that has the issue and could be used in a unit test?

detect encoding method

c99b84c
