feat: custom auto (#131)

* add: custom auto * fix: add file extension * fix: switch to pdfminer.six * fix: gitignore * add : dist * feat: refacto megaparse for service (#132) * full refacto wip * moved schema to sdk and client * fix ruff * sssl in config * wip * fix config param * fix CI + docker --------- Co-authored-by: aminediro <[email protected]> * chore(main): release megaparse 0.0.46 (#133) * feat: release plz (#134) * full refacto wip * moved schema to sdk and client * fix ruff * sssl in config * wip * fix config param * fix CI + docker * version megaparse * fix settings env * docker fix * fix manifest --------- Co-authored-by: aminediro <[email protected]> * feat: release plz (#136) * full refacto wip * moved schema to sdk and client * fix ruff * sssl in config * wip * fix config param * fix CI + docker * version megaparse * fix settings env * docker fix * fix manifest * release confgi * release confg --------- Co-authored-by: aminediro <[email protected]> * feat: release plz (#138) * full refacto wip * moved schema to sdk and client * fix ruff * sssl in config * wip * fix config param * fix CI + docker * version megaparse * fix settings env * docker fix * fix manifest * release confgi * release confg * release plzzzz --------- Co-authored-by: aminediro <[email protected]> * feat: broker megaparse (#139) * full refacto wip * moved schema to sdk and client * fix ruff * sssl in config * wip * fix config param * fix CI + docker * version megaparse * fix settings env * docker fix * fix manifest * release confgi * release confg * release plzzzz * release plz --------- Co-authored-by: aminediro <[email protected]> * feat: broker megaparse (#140) * full refacto wip * moved schema to sdk and client * fix ruff * sssl in config * wip * fix config param * fix CI + docker * version megaparse * fix settings env * docker fix * fix manifest * release confgi * release confg * release plzzzz * release plz * release plz --------- Co-authored-by: aminediro <[email protected]> * feat: broker 
megaparse (#141) * full refacto wip * moved schema to sdk and client * fix ruff * sssl in config * wip * fix config param * fix CI + docker * version megaparse * fix settings env * docker fix * fix manifest * release confgi * release confg * release plzzzz * release plz * release plz * release plz --------- Co-authored-by: aminediro <[email protected]> * release plz (#142) Co-authored-by: aminediro <[email protected]> * feat: release plz1 (#143) * release plz * release plz * release plz --------- Co-authored-by: aminediro <[email protected]> * feat: release plz1 (#144) * release plz * release plz * release plz * release plz --------- Co-authored-by: aminediro <[email protected]> * chore(main): release megaparse 0.0.47 (#145) * chore(main): release megaparse-sdk 0.1.5 (#146) * chore(main): release megaparse-sdk 0.1.5 * up * feat: megaparse sdk tests (#148) * working parsing * tests timeouts * tests errors * test CI * any worker * runs on * ci * ci * ci * ci * ci * ci * ci * ci * ci * ci * ci * ci * ci * ci * ci * CI timeout tests * added sudo * ci * remove tls CI --------- Co-authored-by: aminediro <[email protected]> * chore(main): release megaparse-sdk 0.1.6 (#149) * fix megaparse_sdk no ssl (#150) Co-authored-by: aminediro <[email protected]> * fix megaparse_sdk no ssl (#151) Co-authored-by: aminediro <[email protected]> * readme (#152) Co-authored-by: aminediro <[email protected]> * sdk file_name (#153) Co-authored-by: aminediro <[email protected]> * fix: Update README.md (#154) * chore(main): release megaparse-sdk 0.1.7 (#155) * fix: auto strategy * fix: process time import * fix: auto comments --------- Co-authored-by: AmineDiro <[email protected]> Co-authored-by: aminediro <[email protected]> Co-authored-by: Stan Girard <[email protected]>
QuivrHQ · Dec 10, 2024 · 3cb5be4 · 3cb5be4
1 parent 546fd34
commit 3cb5be4
Showing 14 changed files with 207 additions and 73 deletions.
diff --git a/.gitignore b/.gitignore
@@ -16,3 +16,6 @@ venv
 *.DS_Store
 .tool-versions
 megaparse/sdk/examples/only_pdfs/*
+benchmark/hi_res/*
+benchmark/auto/*
+
diff --git a/benchmark/process_time.py b/benchmark/process_time.py
diff --git a/benchmark/test_quality_sim.py b/benchmark/test_quality_sim.py
@@ -0,0 +1,69 @@
+import os
+import difflib
+from pathlib import Path
+
+auto_dir = Path("benchmark/auto")
+hi_res_dir = Path("benchmark/hi_res")
+
+
+def jaccard_similarity(str1, str2):
+    if len(str1) == 0 and len(str2) == 0:
+        return 1
+    # Tokenize the strings into sets of words
+    words1 = set(str1.split())
+    words2 = set(str2.split())
+
+    # Find intersection and union of the word sets
+    intersection = words1.intersection(words2)
+    union = words1.union(words2)
+
+    # Compute Jaccard similarity
+    return len(intersection) / len(union) if len(union) != 0 else 0
+
+
+def compare_files(file_name):
+    file_path_auto = auto_dir / f"{file_name}.md"
+    file_path_hi_res = hi_res_dir / f"{file_name}.md"
+
+    with open(file_path_auto, "r") as f:
+        auto_content = f.read()
+
+    with open(file_path_hi_res, "r") as f:
+        hi_res_content = f.read()
+
+    if len(auto_content) == 0 and len(hi_res_content) == 0:
+        return 1
+
+    similarity = difflib.SequenceMatcher(None, auto_content, hi_res_content).ratio()
+    # similarity = jaccard_similarity(auto_content, hi_res_content)
+
+    return similarity
+
+
+def main():
+    files = os.listdir(hi_res_dir)
+    print(f"Comparing {len(files)} files...")
+    similarity_dict = {}
+    for file in files:
+        file_name = Path(file).stem
+        similarity = compare_files(file_name)
+        similarity_dict[file_name] = similarity
+
+    avg_similarity = sum(similarity_dict.values()) / len(similarity_dict)
+    print(f"\nAverage similarity: {avg_similarity}\n")
+
+    pass_rate = sum(
+        [similarity > 0.9 for similarity in similarity_dict.values()]
+    ) / len(similarity_dict)
+
+    print(f"Pass rate: {pass_rate}\n")
+
+    print("Under 0.9 similarity documents:")
+    print("-------------------------------")
+    for file_name, similarity in similarity_dict.items():
+        if similarity <= 0.9:  # complement of the pass-rate check (> 0.9)
+            print(f"{file_name}: {similarity}")
+
+
+if __name__ == "__main__":
+    main()
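
The benchmark above scores each `auto` output against its `hi_res` counterpart with `difflib.SequenceMatcher` and keeps a word-level Jaccard score as a commented-out alternative. A minimal sketch of how the two measures differ, using made-up strings rather than real parser output (the sample texts and the `pass_rate` helper are illustrative assumptions, not part of the commit):

```python
import difflib


def jaccard_similarity(str1: str, str2: str) -> float:
    """Word-set overlap: ignores word order and repeated words."""
    if not str1 and not str2:
        return 1.0
    words1, words2 = set(str1.split()), set(str2.split())
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 0.0


def sequence_similarity(str1: str, str2: str) -> float:
    """Character-level ratio from difflib: sensitive to ordering."""
    if not str1 and not str2:
        return 1.0
    return difflib.SequenceMatcher(None, str1, str2).ratio()


def pass_rate(scores: dict, threshold: float = 0.9) -> float:
    """Share of documents whose score is strictly above the threshold."""
    if not scores:  # avoid dividing by zero on an empty benchmark folder
        return 0.0
    return sum(score > threshold for score in scores.values()) / len(scores)


# Hypothetical parser outputs, for illustration only
hi_res_text = "alpha beta gamma delta"
auto_text = "delta gamma beta alpha"  # same words, reversed order

print(jaccard_similarity(auto_text, hi_res_text))   # identical word sets
print(sequence_similarity(auto_text, hi_res_text))  # penalised: order differs
```

Because Jaccard only compares word sets, it cannot tell a reordered document from an identical one, which may be why the sequence-based ratio is the active metric in the script.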
diff --git a/libs/megaparse/pyproject.toml b/libs/megaparse/pyproject.toml
@@ -29,6 +29,8 @@ dependencies = [
     "langchain-core>=0.2.38",
     "llama-parse>=0.4.0",
     "pydantic-settings>=2.6.1",
+    "pypdfium2>=4.30.0",
+
 ]
 
 [project.optional-dependencies]

diff --git a/libs/megaparse/src/megaparse/examples/parse_file.py b/libs/megaparse/src/megaparse/examples/parse_file.py
@@ -0,0 +1,16 @@
+from megaparse import MegaParse
+from megaparse.parser.unstructured_parser import UnstructuredParser
+
+
+def main():
+    parser = UnstructuredParser()
+    megaparse = MegaParse(parser=parser)
+
+    file_path = "somewhere/only_pdfs/4 The Language of Medicine  2024.07.21.pdf"
+    parsed_file = megaparse.load(file_path)
+    print(f"\n----- File Response : {file_path} -----\n")
+    print(parsed_file)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/libs/megaparse/src/megaparse/megaparse.py b/libs/megaparse/src/megaparse/megaparse.py
@@ -60,22 +60,44 @@ async def aload(
 
         try:
             parsed_document: str = await self.parser.convert(
-                file_path=file_path, file=file
+                file_path=file_path, file=file, file_extensions=str(file_extension)
             )
             # @chloe FIXME: format_checker needs unstructured Elements as input which is to change
             # if self.format_checker:
             #     parsed_document: str = await self.format_checker.check(parsed_document)
 
         except Exception as e:
-            raise ParsingException(f"Error while parsing {file_path}: {e}")
+            raise ParsingException(f"Error while parsing {file_extension}: {e}")
 
         self.last_parsed_document = parsed_document
         return parsed_document
 
-    def load(self, file_path: Path | str) -> str:
-        if isinstance(file_path, str):
-            file_path = Path(file_path)
-        file_extension: str = file_path.suffix
+    def load(
+        self,
+        file_path: Path | str | None = None,
+        file: IO[bytes] | None = None,
+        file_extension: str | None = "",
+    ) -> str:
+        if not (file_path or file):
+            raise ValueError("Either file_path or file should be provided")
+        if file_path and file:
+            raise ValueError("Only one of file_path or file should be provided")
+
+        if file_path:
+            if isinstance(file_path, str):
+                file_path = Path(file_path)
+            file_extension = file_path.suffix
+        elif file:
+            if not file_extension:
+                raise ValueError(
+                    "file_extension should be provided when given file argument"
+                )
+            file.seek(0)
+
+        try:
+            FileExtension(file_extension)
+        except ValueError:
+            raise ValueError(f"Unsupported file extension: {file_extension}") from None
 
         if file_extension != ".pdf":
             if self.format_checker:
@@ -84,22 +106,23 @@ def load(self, file_path: Path | str) -> str:
                 )
             if not isinstance(self.parser, UnstructuredParser):
                 raise ValueError(
-                    f"Parser {self.parser}: Unsupported file extension: {file_extension}"
+                    f"Unsupported file extension: parser {self.parser} does not support {file_extension}"
                 )
 
         try:
             loop = asyncio.get_event_loop()
             parsed_document: str = loop.run_until_complete(
-                self.parser.convert(file_path)
+                self.parser.convert(
+                    file_path=file_path, file=file, file_extensions=str(file_extension)
+                )
             )
+
             # @chloe FIXME: format_checker needs unstructured Elements as input which is to change
             # if self.format_checker:
-            #     parsed_document: str = loop.run_until_complete(
-            #         self.format_checker.check(parsed_document)
-            #     )
+            #     parsed_document: str = await self.format_checker.check(parsed_document)
 
         except Exception as e:
-            raise ValueError(f"Error while parsing {file_path}: {e}")
+            raise ParsingException(f"Error while parsing {file_extension}: {e}")
 
         self.last_parsed_document = parsed_document
         return parsed_document
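
The reworked `load` accepts either a path or an in-memory file, and requires an explicit extension in the second case. The checks can be sketched as a standalone function. `resolve_extension` and the `SUPPORTED` set are hypothetical stand-ins: the real code validates against `FileExtension` from `megaparse_sdk`, whose members are not shown in this diff.

```python
from __future__ import annotations

import io
from pathlib import Path
from typing import IO

# Hypothetical subset; the real list lives in megaparse_sdk's FileExtension
SUPPORTED = {".pdf", ".docx", ".txt"}


def resolve_extension(
    file_path: Path | str | None = None,
    file: IO[bytes] | None = None,
    file_extension: str | None = "",
) -> str:
    """Return the extension to parse with, following the checks in load()."""
    if not (file_path or file):
        raise ValueError("Either file_path or file should be provided")
    if file_path and file:
        raise ValueError("Only one of file_path or file should be provided")

    if file_path:
        # A path carries its own extension
        file_extension = Path(file_path).suffix
    else:
        # A raw stream has no name, so the caller must say what it holds
        if not file_extension:
            raise ValueError(
                "file_extension should be provided when given file argument"
            )
        file.seek(0)  # rewind in case the stream was already read

    if file_extension not in SUPPORTED:
        raise ValueError(f"Unsupported file extension: {file_extension}")
    return file_extension


print(resolve_extension(file_path="report.pdf"))
print(resolve_extension(file=io.BytesIO(b"%PDF-"), file_extension=".pdf"))
```

Rejecting the "both" and "neither" cases up front means the rest of the function only ever handles one input source.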

diff --git a/libs/megaparse/src/megaparse/parser/base.py b/libs/megaparse/src/megaparse/parser/base.py
@@ -2,6 +2,8 @@
 from pathlib import Path
 from typing import IO
 
+from megaparse_sdk.schema.extensions import FileExtension
+
 
 class BaseParser(ABC):
     """Mother Class for all the parsers [Unstructured, LlamaParse, MegaParseVision]"""
@@ -11,6 +13,7 @@ async def convert(
         self,
         file_path: str | Path | None = None,
         file: IO[bytes] | None = None,
+        file_extensions: str | FileExtension = "",
         **kwargs,
     ) -> str:
         """

diff --git a/libs/megaparse/src/megaparse/parser/llama.py b/libs/megaparse/src/megaparse/parser/llama.py
@@ -6,6 +6,7 @@
 from llama_parse.utils import Language, ResultType
 
 from megaparse.parser import BaseParser
+from megaparse_sdk.schema.extensions import FileExtension
 
 
 class LlamaParser(BaseParser):
@@ -30,6 +31,7 @@ async def convert(
         self,
         file_path: str | Path | None = None,
         file: IO[bytes] | None = None,
+        file_extensions: str | FileExtension = "",
         **kwargs,
     ) -> str:
         if not file_path:

diff --git a/libs/megaparse/src/megaparse/parser/megaparse_vision.py b/libs/megaparse/src/megaparse/parser/megaparse_vision.py
@@ -118,6 +118,7 @@ async def convert(
         self,
         file_path: str | Path | None = None,
         file: IO[bytes] | None = None,
+        file_extensions: str = "",
         batch_size: int = 3,
         **kwargs,
     ) -> str:

diff --git a/libs/megaparse/src/megaparse/parser/unstructured_parser.py b/libs/megaparse/src/megaparse/parser/unstructured_parser.py
@@ -2,10 +2,14 @@
 from pathlib import Path
 from typing import IO
 
+import numpy as np
+import pypdfium2 as pdfium
 from dotenv import load_dotenv
 from langchain_core.language_models.chat_models import BaseChatModel
 from langchain_core.prompts import ChatPromptTemplate
 from megaparse_sdk.schema.parser_config import StrategyEnum
+from pypdfium2._helpers.page import PdfPage
+from pypdfium2._helpers.pageobjects import PdfImage
 from unstructured.partition.auto import partition
 
 from megaparse.parser import BaseParser
@@ -98,18 +102,74 @@ def get_markdown_line(self, el: dict):
 
         return markdown_line
 
+    def get_strategy(
+        self,
+        page: PdfPage,
+        threshold=0.5,
+    ) -> StrategyEnum:
+        if self.strategy != StrategyEnum.AUTO:
+            raise ValueError("Strategy must be AUTO to use get_strategy")
+
+        # Page area from its width and height
+        total_page_area = page.get_width() * page.get_height()
+        total_image_area = 0
+        images_coords = []
+        # Collect the bounding box of every image object on the page
+        for obj in page.get_objects():
+            if isinstance(obj, PdfImage):
+                images_coords.append(obj.get_pos())
+
+        canvas = np.zeros((int(page.get_height()), int(page.get_width())))
+        for coords in images_coords:
+            p_width, p_height = int(page.get_width()), int(page.get_height())
+            x1 = max(0, min(p_width, int(coords[0])))
+            y1 = max(0, min(p_height, int(coords[1])))
+            x2 = max(0, min(p_width, int(coords[2])))
+            y2 = max(0, min(p_height, int(coords[3])))
+            canvas[y1:y2, x1:x2] = 1
+        # Summing the mask counts overlapping images only once
+        total_image_area = np.sum(canvas)
+
+        if total_image_area / total_page_area > threshold:
+            return StrategyEnum.HI_RES
+        return StrategyEnum.FAST
+
     async def convert(
         self,
         file_path: str | Path | None = None,
         file: IO[bytes] | None = None,
+        file_extensions: str = "",
         **kwargs,
     ) -> str:
+        strategies = {}
+        if file_extensions == ".pdf" and self.strategy == StrategyEnum.AUTO:
+            print("Determining strategy...")
+            document = (
+                pdfium.PdfDocument(file_path) if file_path else pdfium.PdfDocument(file)
+            )
+            for i, page in enumerate(document):
+                strategies[i] = self.get_strategy(page)
+
+            # count number of pages needing HI_RES
+            num_hi_res = len(
+                [
+                    strategies[i]
+                    for i in strategies
+                    if strategies[i] == StrategyEnum.HI_RES
+                ]
+            )
+            if num_hi_res / len(strategies) > 0.2:
+                print("Using HI_RES strategy")
+                self.strategy = StrategyEnum.HI_RES
+            else:
+                print("Using FAST strategy")
+                self.strategy = StrategyEnum.FAST
+
         # Partition the PDF
         elements = partition(
             filename=str(file_path) if file_path else None,
             file=file,
             strategy=self.strategy,
-            skip_infer_table_types=[],
         )
         elements_dict = [el.to_dict() for el in elements]
         markdown_content = self.convert_to_markdown(elements_dict)
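
The `AUTO` strategy above works in two steps. Per page, it paints image bounding boxes onto a page-sized mask and compares the covered share of the page to a 0.5 threshold. Per document, it switches to `HI_RES` when more than 20% of pages are image-heavy. A dependency-free sketch of that logic follows. It takes plain `(x1, y1, x2, y2)` tuples instead of `pypdfium2` page objects, uses a pure-Python grid in place of the NumPy array, and returns its choice rather than assigning it to `self.strategy`. The `Strategy` enum, the function names, and the sample boxes are assumptions for illustration.

```python
from enum import Enum


class Strategy(str, Enum):  # stand-in for megaparse_sdk's StrategyEnum
    FAST = "fast"
    HI_RES = "hi_res"


def image_coverage(width: float, height: float, boxes: list) -> float:
    """Fraction of the page covered by at least one (x1, y1, x2, y2) box."""
    w, h = int(width), int(height)
    if w <= 0 or h <= 0:
        return 0.0
    mask = [bytearray(w) for _ in range(h)]  # one byte per unit cell
    for x1, y1, x2, y2 in boxes:
        # Clamp each corner to the page, as the committed code does
        cx1, cx2 = (max(0, min(w, int(v))) for v in (x1, x2))
        cy1, cy2 = (max(0, min(h, int(v))) for v in (y1, y2))
        for row in mask[cy1:cy2]:
            row[cx1:cx2] = b"\x01" * max(0, cx2 - cx1)
    # Overlapping boxes mark the same cells, so they are counted once
    covered = sum(sum(row) for row in mask)
    return covered / (width * height)


def page_strategy(coverage: float, threshold: float = 0.5) -> Strategy:
    """HI_RES for image-heavy pages, FAST otherwise."""
    return Strategy.HI_RES if coverage > threshold else Strategy.FAST


def document_strategy(pages: list, hi_res_share: float = 0.2) -> Strategy:
    """HI_RES when more than `hi_res_share` of the pages need it."""
    if not pages:  # guard the division for an empty document
        return Strategy.FAST
    num_hi_res = sum(p == Strategy.HI_RES for p in pages)
    if num_hi_res / len(pages) > hi_res_share:
        return Strategy.HI_RES
    return Strategy.FAST


# Two overlapping 60x60 images on a hypothetical 100x100 page
coverage = image_coverage(100, 100, [(0, 0, 60, 60), (30, 30, 90, 90)])
print(coverage, page_strategy(coverage))

# A hypothetical 10-page document with 3 image-heavy pages
votes = [Strategy.HI_RES] * 3 + [Strategy.FAST] * 7
print(document_strategy(votes))
```

In the diff, `convert` assigns the chosen strategy to `self.strategy`, so a second call on the same parser no longer sees `AUTO` and skips detection. Returning the value, as sketched here, is one way to keep a parser instance reusable across documents.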

diff --git a/libs/megaparse_sdk/examples/usage_example.py b/libs/megaparse_sdk/examples/usage_example.py
@@ -8,19 +8,22 @@ async def main():
     api_key = str(os.getenv("MEGAPARSE_API_KEY"))
     megaparse = MegaParseSDK(api_key)
 
-    url = "https://www.quivr.com"
+    # url = "https://www.quivr.com"
 
-    # Upload a URL
-    url_response = await megaparse.url.upload(url)
-    print(f"\n----- URL Response : {url} -----\n")
-    print(url_response)
+    # # Upload a URL
+    # url_response = await megaparse.url.upload(url)
+    # print(f"\n----- URL Response : {url} -----\n")
+    # print(url_response)
 
-    file_path = "megaparse/sdk/pdf/MegaFake_report.pdf"
+    # file_path = "megaparse/sdk/pdf/MegaFake_report.pdf"
+    file_path = (
+        "megaparse/sdk/examples/only_pdfs/4 The Language of Medicine  2024.07.21.pdf"
+    )
     # Upload a file
     response = await megaparse.file.upload(
         file_path=file_path,
         method="unstructured",  # type: ignore  # unstructured, llama_parser, megaparse_vision
-        strategy="fast",
+        strategy="auto",  # type: ignore  # fast, auto, hi_res
     )
     print(f"\n----- File Response : {file_path} -----\n")
     print(response)