Not able to parse files: HTTP ERROR 403 #199
Comments
Here is the basic code:
from megaparse import MegaParse
from megaparse.parser.unstructured_parser import UnstructuredParser
parser = UnstructuredParser()
megaparse = MegaParse(parser)
response = megaparse.load("./ylh-20240277-dch.pdf")  # this is the PDF I want to read/parse
I got exactly the same issue on macOS Sonoma with Python 3.11, using the sample from the README.
Same here, with the Dockerfile.gpu build and the /v1/file endpoint. I tried both .pdf and .docx files.
Hmmm, something spooky is going on! I discovered that it worked fine with some PDF files, but not with others.
I had the same issue. As far as I can tell, this is an issue with unstructured and the way it handles downloading the NLTK data packages, so I think updating the unstructured version should resolve it.
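Before and after upgrading, it can help to confirm which versions are actually installed in the environment you are running from. The snippet below is only a minimal sketch using the standard library; the helper name `installed_version` is made up for illustration, and it assumes the distributions are published under the names `megaparse`, `unstructured` and `nltk`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the packages that appear in the traceback.
for name in ("megaparse", "unstructured", "nltk"):
    print(name, installed_version(name) or "not installed")
```

Run it with the same interpreter that runs your parsing script, otherwise it may report a different environment.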
I tried pip install unstructured==0.16.11, but that did not solve the issue.
C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\pydantic\_internal\_fields.py:132: UserWarning: Field "model_name" in UploadFileConfig has conflict with protected namespace "model_". You may be able to resolve this warning by setting
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
(And yes, I do have gpt-4o access, since the code runs with megaparse==0.0.49.)
I did a pip install megaparse (tried on both Windows and Linux) in a clean Python 3.11 environment. I then tried to run the basic examples from the home page to parse a PDF. The errors listed below occurred.
I believe there might be two separate issues here:
The first is the warning about "...conflict with protected namespace "model_"."
The second is that I get "...ERROR 403: Forbidden".
python .\parse_shiptool.py
C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\pydantic\_internal\_fields.py:132: UserWarning: Field "model_name" in UploadFileConfig has conflict with protected namespace "model_".
You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
warnings.warn(
Traceback (most recent call last):
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 89, in aload
parsed_document = await parser.convert(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\parser\unstructured_parser.py", line 110, in convert
elements = partition(
^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\auto.py", line 438, in partition
elements = _partition_pdf(
^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\documents\elements.py", line 593, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\file_utils\filetype.py", line 429, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\file_utils\filetype.py", line 385, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 208, in partition_pdf
return partition_pdf_or_image(
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 355, in partition_pdf_or_image
out_elements = _process_uncategorized_text_elements(elements)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\pdf.py", line 930, in _process_uncategorized_text_elements
new_el = element_from_text(cast(Text, el).text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text.py", line 295, in element_from_text
elif is_possible_narrative_text(text):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 80, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 276, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\partition\text_type.py", line 225, in sentence_count
sentences = sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 136, in sent_tokenize
_download_nltk_packages_if_not_present()
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 130, in _download_nltk_packages_if_not_present
download_nltk_packages()
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\unstructured\nlp\tokenize.py", line 88, in download_nltk_packages
urllib.request.urlretrieve(NLTK_DATA_URL, tgz_file_path)
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 241, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 525, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 634, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 563, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\urllib\request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\telboth\OneDrive - Shearwater Geoservices Norway AS\Documents\Python\chatbot\MegaParse\parse_shiptool.py", line 31, in <module>
response = megaparse.load("./ylh-20240277-dch.pdf")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 109, in load
return loop.run_until_complete(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\asyncio\base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\telboth\Anaconda3\envs\parse_test\Lib\site-packages\megaparse\megaparse.py", line 98, in aload
raise ParsingException(
megaparse.exceptions.base.ParsingException: Error while parsing file ./ylh-20240277-dch.pdf, file_extension: FileExtension.PDF: HTTP Error 403: Forbidden
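The traceback above is really two chained exceptions: the ParsingException at the bottom was raised while the HTTP 403 error was being handled, so the HTTP error is the underlying cause. As a rough sketch of how to get at that underlying error programmatically, the helper below walks Python's standard `__cause__` / `__context__` attributes. The function names `exception_chain` and `root_cause` are hypothetical, and the demo uses plain `OSError` and `RuntimeError` as stand-ins rather than the real urllib and megaparse exception classes.

```python
def exception_chain(exc):
    """Return [exc, the exception it was raised from or during, ...], outermost first."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against reference cycles
        chain.append(exc)
        # __cause__ is set by "raise ... from ...", __context__ by raising inside "except".
        exc = exc.__cause__ or exc.__context__
    return chain


def root_cause(exc):
    """Return the innermost exception in the chain."""
    return exception_chain(exc)[-1]


# Demo with stand-in exceptions (assumption: these mimic the shape of the traceback above).
try:
    try:
        raise OSError("HTTP Error 403: Forbidden")      # stand-in for the urllib error
    except OSError:
        raise RuntimeError("Error while parsing file")  # stand-in for ParsingException
except RuntimeError as outer:
    cause = root_cause(outer)
    print(type(cause).__name__ + ":", cause)
```

In a real script you would wrap the `megaparse.load(...)` call in a `try`/`except` and pass the caught exception to `root_cause` to see whether the failure comes from the download step or from the parsing itself.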