Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* add: custom auto
* fix: add file extension
* fix: switch to pdfminer.six
* fix: gitignore
* add : dist
* feat: refacto megaparse for service (#132)
  * full refacto wip
  * moved schema to sdk and client
  * fix ruff
  * sssl in config
  * wip
  * fix config param
  * fix CI + docker

  Co-authored-by: aminediro <[email protected]>
* chore(main): release megaparse 0.0.46 (#133)
* feat: release plz (#134)
  * full refacto wip
  * moved schema to sdk and client
  * fix ruff
  * sssl in config
  * wip
  * fix config param
  * fix CI + docker
  * version megaparse
  * fix settings env
  * docker fix
  * fix manifest

  Co-authored-by: aminediro <[email protected]>
* feat: release plz (#136)
  * full refacto wip
  * moved schema to sdk and client
  * fix ruff
  * sssl in config
  * wip
  * fix config param
  * fix CI + docker
  * version megaparse
  * fix settings env
  * docker fix
  * fix manifest
  * release confgi
  * release confg

  Co-authored-by: aminediro <[email protected]>
* feat: release plz (#138)
  * full refacto wip
  * moved schema to sdk and client
  * fix ruff
  * sssl in config
  * wip
  * fix config param
  * fix CI + docker
  * version megaparse
  * fix settings env
  * docker fix
  * fix manifest
  * release confgi
  * release confg
  * release plzzzz

  Co-authored-by: aminediro <[email protected]>
* feat: broker megaparse (#139)
  * full refacto wip
  * moved schema to sdk and client
  * fix ruff
  * sssl in config
  * wip
  * fix config param
  * fix CI + docker
  * version megaparse
  * fix settings env
  * docker fix
  * fix manifest
  * release confgi
  * release confg
  * release plzzzz
  * release plz

  Co-authored-by: aminediro <[email protected]>
* feat: broker megaparse (#140)
  * full refacto wip
  * moved schema to sdk and client
  * fix ruff
  * sssl in config
  * wip
  * fix config param
  * fix CI + docker
  * version megaparse
  * fix settings env
  * docker fix
  * fix manifest
  * release confgi
  * release confg
  * release plzzzz
  * release plz
  * release plz

  Co-authored-by: aminediro <[email protected]>
* feat: broker megaparse (#141)
  * full refacto wip
  * moved schema to sdk and client
  * fix ruff
  * sssl in config
  * wip
  * fix config param
  * fix CI + docker
  * version megaparse
  * fix settings env
  * docker fix
  * fix manifest
  * release confgi
  * release confg
  * release plzzzz
  * release plz
  * release plz
  * release plz

  Co-authored-by: aminediro <[email protected]>
* release plz (#142)

  Co-authored-by: aminediro <[email protected]>
* feat: release plz1 (#143)
  * release plz
  * release plz
  * release plz

  Co-authored-by: aminediro <[email protected]>
* feat: release plz1 (#144)
  * release plz
  * release plz
  * release plz
  * release plz

  Co-authored-by: aminediro <[email protected]>
* chore(main): release megaparse 0.0.47 (#145)
* chore(main): release megaparse-sdk 0.1.5 (#146)
  * chore(main): release megaparse-sdk 0.1.5
  * up
* feat: megaparse sdk tests (#148)
  * working parsing
  * tests timeouts
  * tests errors
  * test CI
  * any worker
  * runs on
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * ci
  * CI timeout tests
  * added sudo
  * ci
  * remove tls CI

  Co-authored-by: aminediro <[email protected]>
* chore(main): release megaparse-sdk 0.1.6 (#149)
* fix megaparse_sdk no ssl (#150)

  Co-authored-by: aminediro <[email protected]>
* fix megaparse_sdk no ssl (#151)

  Co-authored-by: aminediro <[email protected]>
* readme (#152)

  Co-authored-by: aminediro <[email protected]>
* sdk file_name (#153)

  Co-authored-by: aminediro <[email protected]>
* fix: Update README.md (#154)
* chore(main): release megaparse-sdk 0.1.7 (#155)
* fix: auto strategy
* fix: process time import
* fix: auto comments

Co-authored-by: AmineDiro <[email protected]>
Co-authored-by: aminediro <[email protected]>
Co-authored-by: Stan Girard <[email protected]>
1 parent 546fd34, commit 3cb5be4

Showing 14 changed files with 207 additions and 73 deletions.
@@ -16,3 +16,6 @@ venv

    *.DS_Store
    .tool-versions
    megaparse/sdk/examples/only_pdfs/*
    benchmark/hi_res/*
    benchmark/auto/*

This file was deleted.
@@ -0,0 +1,69 @@

    import os
    import difflib
    from pathlib import Path

    auto_dir = Path("benchmark/auto")
    hi_res_dir = Path("benchmark/hi_res")


    def jaccard_similarity(str1, str2):
        if len(str1) == 0 and len(str2) == 0:
            return 1
        # Tokenize the strings into sets of words
        words1 = set(str1.split())
        words2 = set(str2.split())

        # Find intersection and union of the word sets
        intersection = words1.intersection(words2)
        union = words1.union(words2)

        # Compute Jaccard similarity
        return len(intersection) / len(union) if len(union) != 0 else 0


    def compare_files(file_name):
        file_path_auto = auto_dir / f"{file_name}.md"
        file_path_hi_res = hi_res_dir / f"{file_name}.md"

        with open(file_path_auto, "r") as f:
            auto_content = f.read()

        with open(file_path_hi_res, "r") as f:
            hi_res_content = f.read()

        if len(auto_content) == 0 and len(hi_res_content) == 0:
            return 1

        similarity = difflib.SequenceMatcher(None, auto_content, hi_res_content).ratio()
        # similarity = jaccard_similarity(auto_content, hi_res_content)

        return similarity


    def main():
        files = os.listdir(hi_res_dir)
        print(f"Comparing {len(files)} files...")
        similarity_dict = {}
        for file in files:
            file_name = Path(file).stem
            similarity = compare_files(file_name)
            similarity_dict[file_name] = similarity

        avg_similarity = sum(similarity_dict.values()) / len(similarity_dict)
        print(f"\nAverage similarity: {avg_similarity}\n")

        pass_rate = sum(
            [similarity > 0.9 for similarity in similarity_dict.values()]
        ) / len(similarity_dict)

        print(f"Pass rate: {pass_rate}\n")

        print("Under 0.9 similarity documents:")
        print("-------------------------------")
        for file_name, similarity in similarity_dict.items():
            if similarity < 0.9:
                print(f"{file_name}: {similarity}")


    if __name__ == "__main__":
        main()

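The script above scores each document pair with `difflib.SequenceMatcher(...).ratio()` and leaves a word-level Jaccard alternative commented out. As a rough illustration of how the two metrics can diverge, the sketch below runs both on a pair of toy strings. The strings and the `sequence_similarity` helper name are made up for this example and are not part of the commit.

```python
import difflib


def jaccard_similarity(str1: str, str2: str) -> float:
    """Word-set overlap |A & B| / |A | B|; word order and repeats are ignored."""
    if not str1 and not str2:
        return 1.0
    words1, words2 = set(str1.split()), set(str2.split())
    union = words1 | words2
    return len(words1 & words2) / len(union) if union else 0.0


def sequence_similarity(str1: str, str2: str) -> float:
    """Character-level ratio from difflib; sensitive to ordering."""
    return difflib.SequenceMatcher(None, str1, str2).ratio()


# Hypothetical parser outputs: the same words in a different order.
auto_text = "alpha beta gamma delta"
hi_res_text = "delta gamma beta alpha"

print(f"jaccard:  {jaccard_similarity(auto_text, hi_res_text)}")
print(f"sequence: {sequence_similarity(auto_text, hi_res_text)}")
```

Both toy strings contain the same word set, so the Jaccard score is 1.0, while the sequence ratio comes out below 1.0 because the ordering differs. That difference may be one reason to prefer the sequence-based score when reading order matters in parsed output.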
@@ -0,0 +1,16 @@

    from megaparse import MegaParse
    from megaparse.parser.unstructured_parser import UnstructuredParser


    def main():
        parser = UnstructuredParser()
        megaparse = MegaParse(parser=parser)

        file_path = "somewhere/only_pdfs/4 The Language of Medicine 2024.07.21.pdf"
        parsed_file = megaparse.load(file_path)
        print(f"\n----- File Response : {file_path} -----\n")
        print(parsed_file)


    if __name__ == "__main__":
        main()

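The example above parses one file and prints the result. A comparison script that reads `<name>.md` files from a directory would need each parse result written out under the source document's stem. The sketch below shows one way that step might look. It assumes only that the parser object exposes `load(path)` and that its result can be converted to text, as in the example; `Loader`, `dump_parsed`, and the directory arguments are hypothetical names, and the diff does not show how the commit itself produced the benchmark directories.

```python
from pathlib import Path
from typing import Protocol


class Loader(Protocol):
    """Anything with a load(path) method, such as the MegaParse object above."""

    def load(self, file_path: str) -> str: ...


def dump_parsed(loader: Loader, pdf_dir: Path, out_dir: Path) -> list[Path]:
    """Parse every *.pdf in pdf_dir and write <stem>.md into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Assumption: load() returns the parsed document as text.
        text = loader.load(str(pdf_path))
        out_path = out_dir / f"{pdf_path.stem}.md"
        out_path.write_text(str(text), encoding="utf-8")
        written.append(out_path)
    return written
```

Running it once per parsing strategy, with a different `out_dir` each time, would give two directories of identically named `.md` files that can be compared pairwise.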