feat: custom auto (#131)
* add: custom auto

* fix: add file extension

* fix: switch to pdfminer.six

* fix: gitignore

* add : dist

* feat: refacto megaparse for service (#132)

* full refacto wip

* moved schema to sdk and client

* fix ruff

* sssl in config

* wip

* fix config param

* fix CI + docker

---------

Co-authored-by: aminediro <[email protected]>

* chore(main): release megaparse 0.0.46 (#133)

* feat: release plz (#134)

* full refacto wip

* moved schema to sdk and client

* fix ruff

* sssl in config

* wip

* fix config param

* fix CI + docker

* version megaparse

* fix settings env

* docker fix

* fix manifest

---------

Co-authored-by: aminediro <[email protected]>

* feat: release plz (#136)

* full refacto wip

* moved schema to sdk and client

* fix ruff

* sssl in config

* wip

* fix config param

* fix CI + docker

* version megaparse

* fix settings env

* docker fix

* fix manifest

* release confgi

* release confg

---------

Co-authored-by: aminediro <[email protected]>

* feat: release plz (#138)

* full refacto wip

* moved schema to sdk and client

* fix ruff

* sssl in config

* wip

* fix config param

* fix CI + docker

* version megaparse

* fix settings env

* docker fix

* fix manifest

* release confgi

* release confg

* release plzzzz

---------

Co-authored-by: aminediro <[email protected]>

* feat: broker megaparse (#139)

* full refacto wip

* moved schema to sdk and client

* fix ruff

* sssl in config

* wip

* fix config param

* fix CI + docker

* version megaparse

* fix settings env

* docker fix

* fix manifest

* release confgi

* release confg

* release plzzzz

* release plz

---------

Co-authored-by: aminediro <[email protected]>

* feat: broker megaparse (#140)

* full refacto wip

* moved schema to sdk and client

* fix ruff

* sssl in config

* wip

* fix config param

* fix CI + docker

* version megaparse

* fix settings env

* docker fix

* fix manifest

* release confgi

* release confg

* release plzzzz

* release plz

* release plz

---------

Co-authored-by: aminediro <[email protected]>

* feat: broker megaparse (#141)

* full refacto wip

* moved schema to sdk and client

* fix ruff

* sssl in config

* wip

* fix config param

* fix CI + docker

* version megaparse

* fix settings env

* docker fix

* fix manifest

* release confgi

* release confg

* release plzzzz

* release plz

* release plz

* release plz

---------

Co-authored-by: aminediro <[email protected]>

* release plz (#142)

Co-authored-by: aminediro <[email protected]>

* feat: release plz1 (#143)

* release plz

* release plz

* release plz

---------

Co-authored-by: aminediro <[email protected]>

* feat: release plz1 (#144)

* release plz

* release plz

* release plz

* release plz

---------

Co-authored-by: aminediro <[email protected]>

* chore(main): release megaparse 0.0.47 (#145)

* chore(main): release megaparse-sdk 0.1.5 (#146)

* chore(main): release megaparse-sdk 0.1.5

* up

* feat: megaparse sdk tests (#148)

* working parsing

* tests timeouts

* tests errors

* test CI

* any worker

* runs on

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* ci

* CI timeout tests

* added sudo

* ci

* remove tls CI

---------

Co-authored-by: aminediro <[email protected]>

* chore(main): release megaparse-sdk 0.1.6 (#149)

* fix megaparse_sdk no ssl (#150)

Co-authored-by: aminediro <[email protected]>

* fix megaparse_sdk no ssl (#151)

Co-authored-by: aminediro <[email protected]>

* readme (#152)

Co-authored-by: aminediro <[email protected]>

* sdk file_name (#153)

Co-authored-by: aminediro <[email protected]>

* fix: Update README.md (#154)

* chore(main): release megaparse-sdk 0.1.7 (#155)

* fix: auto strategy

* fix: process time import

* fix: auto comments

---------

Co-authored-by: AmineDiro <[email protected]>
Co-authored-by: aminediro <[email protected]>
Co-authored-by: Stan Girard <[email protected]>
4 people authored Dec 10, 2024
1 parent 546fd34 commit 3cb5be4
Showing 14 changed files with 207 additions and 73 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -16,3 +16,6 @@ venv
 *.DS_Store
 .tool-versions
 megaparse/sdk/examples/only_pdfs/*
+benchmark/hi_res/*
+benchmark/auto/*
+
52 changes: 0 additions & 52 deletions benchmark/process_time.py

This file was deleted.

69 changes: 69 additions & 0 deletions benchmark/test_quality_sim.py
@@ -0,0 +1,69 @@
+import os
+import difflib
+from pathlib import Path
+
+auto_dir = Path("benchmark/auto")
+hi_res_dir = Path("benchmark/hi_res")
+
+
+def jaccard_similarity(str1, str2):
+    if len(str1) == 0 and len(str2) == 0:
+        return 1
+    # Tokenize the strings into sets of words
+    words1 = set(str1.split())
+    words2 = set(str2.split())
+
+    # Find intersection and union of the word sets
+    intersection = words1.intersection(words2)
+    union = words1.union(words2)
+
+    # Compute Jaccard similarity
+    return len(intersection) / len(union) if len(union) != 0 else 0
+
+
+def compare_files(file_name):
+    file_path_auto = auto_dir / f"{file_name}.md"
+    file_path_hi_res = hi_res_dir / f"{file_name}.md"
+
+    with open(file_path_auto, "r") as f:
+        auto_content = f.read()
+
+    with open(file_path_hi_res, "r") as f:
+        hi_res_content = f.read()
+
+    if len(auto_content) == 0 and len(hi_res_content) == 0:
+        return 1
+
+    similarity = difflib.SequenceMatcher(None, auto_content, hi_res_content).ratio()
+    # similarity = jaccard_similarity(auto_content, hi_res_content)
+
+    return similarity
+
+
+def main():
+    files = os.listdir(hi_res_dir)
+    print(f"Comparing {len(files)} files...")
+    similarity_dict = {}
+    for file in files:
+        file_name = Path(file).stem
+        similarity = compare_files(file_name)
+        similarity_dict[file_name] = similarity
+
+    avg_similarity = sum(similarity_dict.values()) / len(similarity_dict)
+    print(f"\nAverage similarity: {avg_similarity}\n")
+
+    pass_rate = sum(
+        [similarity > 0.9 for similarity in similarity_dict.values()]
+    ) / len(similarity_dict)
+
+    print(f"Pass rate: {pass_rate}\n")
+
+    print("Under 0.9 similarity documents:")
+    print("-------------------------------")
+    for file_name, similarity in similarity_dict.items():
+        if similarity < 0.9:
+            print(f"{file_name}: {similarity}")
+
+
+if __name__ == "__main__":
+    main()
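The benchmark script above scores each document with `difflib.SequenceMatcher` and keeps a word-level Jaccard score as a commented-out alternative. The following is a minimal sketch of how the two metrics differ, assuming short in-memory strings instead of the benchmark's Markdown files; `sequence_similarity` and the sample strings are illustrative names, not part of the commit.

```python
import difflib


def jaccard_similarity(a: str, b: str) -> float:
    """Word-set overlap |A ∩ B| / |A ∪ B|; ignores word order and repeats."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def sequence_similarity(a: str, b: str) -> float:
    """Character-level ratio from difflib; sensitive to ordering."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Same words in a different order: Jaccard sees identical sets,
# while the sequence ratio drops because the characters no longer line up.
reordered = ("alpha beta gamma", "gamma beta alpha")
print("jaccard :", jaccard_similarity(*reordered))
print("sequence:", sequence_similarity(*reordered))
```

For parser output, where reading order matters, the sequence ratio is the stricter of the two, which may be why the script uses it by default.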
2 changes: 2 additions & 0 deletions libs/megaparse/pyproject.toml
@@ -29,6 +29,8 @@ dependencies = [
     "langchain-core>=0.2.38",
     "llama-parse>=0.4.0",
     "pydantic-settings>=2.6.1",
+    "pypdfium2>=4.30.0",
+
 ]

 [project.optional-dependencies]
16 changes: 16 additions & 0 deletions libs/megaparse/src/megaparse/examples/parse_file.py
@@ -0,0 +1,16 @@
+from megaparse import MegaParse
+from megaparse.parser.unstructured_parser import UnstructuredParser
+
+
+def main():
+    parser = UnstructuredParser()
+    megaparse = MegaParse(parser=parser)
+
+    file_path = "somewhere/only_pdfs/4 The Language of Medicine 2024.07.21.pdf"
+    parsed_file = megaparse.load(file_path)
+    print(f"\n----- File Response : {file_path} -----\n")
+    print(parsed_file)
+
+
+if __name__ == "__main__":
+    main()
47 changes: 35 additions & 12 deletions libs/megaparse/src/megaparse/megaparse.py
@@ -60,22 +60,44 @@ async def aload(

         try:
             parsed_document: str = await self.parser.convert(
-                file_path=file_path, file=file
+                file_path=file_path, file=file, file_extensions=str(file_extension)
             )
             # @chloe FIXME: format_checker needs unstructured Elements as input which is to change
             # if self.format_checker:
             #     parsed_document: str = await self.format_checker.check(parsed_document)

         except Exception as e:
-            raise ParsingException(f"Error while parsing {file_path}: {e}")
+            raise ParsingException(f"Error while parsing {file_extension}: {e}")

         self.last_parsed_document = parsed_document
         return parsed_document

-    def load(self, file_path: Path | str) -> str:
-        if isinstance(file_path, str):
-            file_path = Path(file_path)
-        file_extension: str = file_path.suffix
+    def load(
+        self,
+        file_path: Path | str | None = None,
+        file: IO[bytes] | None = None,
+        file_extension: str | None = "",
+    ) -> str:
+        if not (file_path or file):
+            raise ValueError("Either file_path or file should be provided")
+        if file_path and file:
+            raise ValueError("Only one of file_path or file should be provided")
+
+        if file_path:
+            if isinstance(file_path, str):
+                file_path = Path(file_path)
+            file_extension = file_path.suffix
+        elif file:
+            if not file_extension:
+                raise ValueError(
+                    "file_extension should be provided when given file argument"
+                )
+            file.seek(0)
+
+        try:
+            FileExtension(file_extension)
+        except ValueError:
+            raise ValueError(f"Unsupported file extension: {file_extension}")

         if file_extension != ".pdf":
             if self.format_checker:
@@ -84,22 +106,23 @@ def load(self, file_path: Path | str) -> str:
                 )
             if not isinstance(self.parser, UnstructuredParser):
                 raise ValueError(
-                    f"Parser {self.parser}: Unsupported file extension: {file_extension}"
+                    f" Unsupported file extension : Parser {self.parser} do not support {file_extension}"
                 )

         try:
             loop = asyncio.get_event_loop()
             parsed_document: str = loop.run_until_complete(
-                self.parser.convert(file_path)
+                self.parser.convert(
+                    file_path=file_path, file=file, file_extensions=str(file_extension)
+                )
             )
+
             # @chloe FIXME: format_checker needs unstructured Elements as input which is to change
             # if self.format_checker:
-            #     parsed_document: str = loop.run_until_complete(
-            #         self.format_checker.check(parsed_document)
-            #     )
+            #     parsed_document: str = await self.format_checker.check(parsed_document)

         except Exception as e:
-            raise ValueError(f"Error while parsing {file_path}: {e}")
+            raise ParsingException(f"Error while parsing {file_extension}: {e}")

         self.last_parsed_document = parsed_document
         return parsed_document
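The reworked `load()` accepts either a path or a binary stream, requires an explicit extension for streams, and rejects extensions that are not members of `FileExtension`. A minimal standalone sketch of that validation follows. The `FileExtension` enum here is a hypothetical two-member stand-in (the real one lives in `megaparse_sdk.schema.extensions` and is not shown in this commit), and `resolve_extension` is an illustrative helper name, not a function from the library.

```python
from enum import Enum
from io import BytesIO
from pathlib import Path
from typing import IO, Optional, Union


class FileExtension(str, Enum):
    # Hypothetical stand-in for megaparse_sdk.schema.extensions.FileExtension
    PDF = ".pdf"
    DOCX = ".docx"


def resolve_extension(
    file_path: Optional[Union[Path, str]] = None,
    file: Optional[IO[bytes]] = None,
    file_extension: Optional[str] = "",
) -> FileExtension:
    """Apply the same input checks as load() and return the validated extension."""
    if not (file_path or file):
        raise ValueError("Either file_path or file should be provided")
    if file_path and file:
        raise ValueError("Only one of file_path or file should be provided")

    if file_path:
        # A path carries its own extension, so any explicit value is overridden.
        file_extension = Path(file_path).suffix
    elif not file_extension:
        # A raw stream has no name, so the caller must say what it contains.
        raise ValueError("file_extension should be provided when given file argument")

    try:
        return FileExtension(file_extension)
    except ValueError:
        raise ValueError(f"Unsupported file extension: {file_extension}") from None


print(resolve_extension(file_path="report.pdf"))
print(resolve_extension(file=BytesIO(b"%PDF-"), file_extension=".pdf"))
```

Validating against the enum before any parsing starts means an unsupported extension fails fast with a `ValueError` rather than surfacing later as a `ParsingException`.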
3 changes: 3 additions & 0 deletions libs/megaparse/src/megaparse/parser/base.py
@@ -2,6 +2,8 @@
 from pathlib import Path
 from typing import IO

+from megaparse_sdk.schema.extensions import FileExtension
+

 class BaseParser(ABC):
     """Mother Class for all the parsers [Unstructured, LlamaParse, MegaParseVision]"""
@@ -11,6 +13,7 @@ async def convert(
         self,
         file_path: str | Path | None = None,
         file: IO[bytes] | None = None,
+        file_extensions: str | FileExtension = "",
         **kwargs,
     ) -> str:
         """
2 changes: 2 additions & 0 deletions libs/megaparse/src/megaparse/parser/llama.py
@@ -6,6 +6,7 @@
 from llama_parse.utils import Language, ResultType

 from megaparse.parser import BaseParser
+from megaparse_sdk.schema.extensions import FileExtension


 class LlamaParser(BaseParser):
@@ -30,6 +31,7 @@ async def convert(
         self,
         file_path: str | Path | None = None,
         file: IO[bytes] | None = None,
+        file_extensions: str | FileExtension = "",
         **kwargs,
     ) -> str:
         if not file_path:
1 change: 1 addition & 0 deletions libs/megaparse/src/megaparse/parser/megaparse_vision.py
@@ -118,6 +118,7 @@ async def convert(
         self,
         file_path: str | Path | None = None,
         file: IO[bytes] | None = None,
+        file_extensions: str = "",
         batch_size: int = 3,
         **kwargs,
     ) -> str:
62 changes: 61 additions & 1 deletion libs/megaparse/src/megaparse/parser/unstructured_parser.py
@@ -2,10 +2,14 @@
 from pathlib import Path
 from typing import IO

+import numpy as np
+import pypdfium2 as pdfium
 from dotenv import load_dotenv
 from langchain_core.language_models.chat_models import BaseChatModel
 from langchain_core.prompts import ChatPromptTemplate
 from megaparse_sdk.schema.parser_config import StrategyEnum
+from pypdfium2._helpers.page import PdfPage
+from pypdfium2._helpers.pageobjects import PdfImage
 from unstructured.partition.auto import partition

 from megaparse.parser import BaseParser
@@ -98,18 +102,74 @@ def get_markdown_line(self, el: dict):

         return markdown_line

+    def get_strategy(
+        self,
+        page: PdfPage,
+        threshold=0.5,
+    ) -> StrategyEnum:
+        if self.strategy != StrategyEnum.AUTO:
+            raise ValueError("Strategy must be AUTO to use get_strategy")
+
+        # Get the dim of the page
+        total_page_area = page.get_width() * page.get_height()
+        total_image_area = 0
+        images_coords = []
+        # Get all the images in the page
+        for obj in page.get_objects():
+            if isinstance(obj, PdfImage):
+                images_coords.append(obj.get_pos())
+
+        canva = np.zeros((int(page.get_height()), int(page.get_width())))
+        for coords in images_coords:
+            p_width, p_height = int(page.get_width()), int(page.get_height())
+            x1 = max(0, min(p_width, int(coords[0])))
+            y1 = max(0, min(p_height, int(coords[1])))
+            x2 = max(0, min(p_width, int(coords[2])))
+            y2 = max(0, min(p_height, int(coords[3])))
+            canva[y1:y2, x1:x2] = 1
+        # Get the total area of the images
+        total_image_area = np.sum(canva)
+
+        if total_image_area / total_page_area > threshold:
+            return StrategyEnum.HI_RES
+        return StrategyEnum.FAST
+
     async def convert(
         self,
         file_path: str | Path | None = None,
         file: IO[bytes] | None = None,
+        file_extensions: str = "",
         **kwargs,
     ) -> str:
+        strategies = {}
+        if file_extensions == ".pdf" and self.strategy == StrategyEnum.AUTO:
+            print("Determining strategy...")
+            document = (
+                pdfium.PdfDocument(file_path) if file_path else pdfium.PdfDocument(file)
+            )
+            for i, page in enumerate(document):
+                strategies[i] = self.get_strategy(page)
+
+            # count number of pages needing HI_RES
+            num_hi_res = len(
+                [
+                    strategies[i]
+                    for i in strategies
+                    if strategies[i] == StrategyEnum.HI_RES
+                ]
+            )
+            if num_hi_res / len(strategies) > 0.2:
+                print("Using HI_RES strategy")
+                self.strategy = StrategyEnum.HI_RES
+            else:
+                print("Using FAST strategy")
+                self.strategy = StrategyEnum.FAST
+
         # Partition the PDF
         elements = partition(
             filename=str(file_path) if file_path else None,
             file=file,
             strategy=self.strategy,
-            skip_infer_table_types=[],
         )
         elements_dict = [el.to_dict() for el in elements]
         markdown_content = self.convert_to_markdown(elements_dict)
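The AUTO strategy above works in two steps: `get_strategy()` paints every image box onto a page-sized canvas and flags the page as `HI_RES` when more than 50% of it is covered, then `convert()` switches the whole document to `HI_RES` when more than 20% of its pages are flagged. The sketch below reproduces that arithmetic without pypdfium2, so it can be run on plain numbers. It is an illustration under assumptions: `image_coverage` and `choose_strategy` are hypothetical names, boxes are assumed to be `(x1, y1, x2, y2)` tuples as the `coords[0]`..`coords[3]` indexing suggests, a plain `bytearray` grid replaces the NumPy canvas so that the example has no dependencies, and the zero-area and empty-document guards are additions that the diff does not contain.

```python
def image_coverage(page_width, page_height, image_boxes):
    """Return the fraction of the page covered by image boxes.

    Boxes are (x1, y1, x2, y2) in page units; overlapping boxes count once.
    """
    width, height = int(page_width), int(page_height)
    if width <= 0 or height <= 0:
        return 0.0  # added guard: the diff divides without this check

    # One bytearray per row plays the role of the NumPy canvas in the diff.
    canvas = [bytearray(width) for _ in range(height)]
    for box in image_boxes:
        # Clamp each coordinate to the page; sorting is an extra safety step
        # (not in the diff) so that reversed boxes still cover an area.
        x1, x2 = sorted(max(0, min(width, int(v))) for v in (box[0], box[2]))
        y1, y2 = sorted(max(0, min(height, int(v))) for v in (box[1], box[3]))
        for row in canvas[y1:y2]:
            row[x1:x2] = b"\x01" * (x2 - x1)

    covered = sum(sum(row) for row in canvas)
    return covered / (width * height)


def choose_strategy(page_coverages, page_threshold=0.5, doc_threshold=0.2):
    """Pick "hi_res" when more than doc_threshold of the pages are image-heavy."""
    if not page_coverages:
        return "fast"  # added guard for a document with no pages
    hi_res_pages = sum(c > page_threshold for c in page_coverages)
    if hi_res_pages / len(page_coverages) > doc_threshold:
        return "hi_res"
    return "fast"


# Two overlapping images on a 100x100 page, followed by three text-only pages.
first_page = image_coverage(100, 100, [(0, 0, 50, 100), (25, 0, 75, 100)])
print(first_page)
print(choose_strategy([first_page, 0.0, 0.0, 0.0]))
```

Painting onto a canvas rather than summing box areas is what keeps overlapping images from being counted twice. Note also that `convert()` in the diff overwrites `self.strategy` with the chosen value, so a parser instance created with AUTO appears to keep that choice for later documents.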
17 changes: 10 additions & 7 deletions libs/megaparse_sdk/examples/usage_example.py
@@ -8,19 +8,22 @@ async def main():
     api_key = str(os.getenv("MEGAPARSE_API_KEY"))
     megaparse = MegaParseSDK(api_key)

-    url = "https://www.quivr.com"
+    # url = "https://www.quivr.com"

-    # Upload a URL
-    url_response = await megaparse.url.upload(url)
-    print(f"\n----- URL Response : {url} -----\n")
-    print(url_response)
+    # # Upload a URL
+    # url_response = await megaparse.url.upload(url)
+    # print(f"\n----- URL Response : {url} -----\n")
+    # print(url_response)

-    file_path = "megaparse/sdk/pdf/MegaFake_report.pdf"
+    # file_path = "megaparse/sdk/pdf/MegaFake_report.pdf"
+    file_path = (
+        "megaparse/sdk/examples/only_pdfs/4 The Language of Medicine 2024.07.21.pdf"
+    )
     # Upload a file
     response = await megaparse.file.upload(
         file_path=file_path,
         method="unstructured", # type: ignore # unstructured, llama_parser, megaparse_vision
-        strategy="fast",
+        strategy="auto", # type: ignore # fast, auto, hi_res
     )
     print(f"\n----- File Response : {file_path} -----\n")
     print(response)