Refactor the Linker, Ranker, Recogniser and Pipeline #282

Open
wants to merge 101 commits into base: dev
00a3607
Ignore .DS_Store files
thobson88 Sep 10, 2024
a3c8c7a
Refactor linking methods as Linker subclasses
thobson88 Sep 11, 2024
a114f2c
Add RelDisamb Linker subclass & fix tests
thobson88 Oct 30, 2024
769488e
Move rel_params to the RelDisambLinker class
thobson88 Oct 30, 2024
5422243
Remove superfluous return value
thobson88 Oct 30, 2024
e8282fb
Refactor Ranker into subclasses
thobson88 Oct 31, 2024
7a1e7a5
Modify Ranker class hierarchy; fix unit tests
thobson88 Nov 1, 2024
b42f6e5
Fix method_name in DeezyMatchRanker
thobson88 Nov 1, 2024
3bf6fa6
Fix integration tests
thobson88 Nov 1, 2024
936cc21
Move train method to DeezyMatchRanker
thobson88 Nov 5, 2024
5219598
Fix docstrings
thobson88 Nov 5, 2024
c94c166
Move deezy-specific code to Ranker subclass
thobson88 Nov 5, 2024
c3cf63b
Update docstrings
thobson88 Nov 5, 2024
daf97ad
Refactor the Recogniser class
thobson88 Nov 5, 2024
b31d07e
Simplify variable names & rename module to avoid clash
thobson88 Nov 6, 2024
2e7d53d
Simplify Pipeline constructor
thobson88 Nov 6, 2024
dc97af8
Remove superfluous return value
thobson88 Nov 6, 2024
ee02a65
Add Pipeline constructor unit test
thobson88 Nov 6, 2024
e13743f
Remove inconsistency in treating place_wqid
thobson88 Nov 6, 2024
d28f5ad
Remove superfluous return value
thobson88 Nov 6, 2024
7c39f6d
Add dataclasses to represent ranking candidates
thobson88 Nov 7, 2024
e42192d
Integrate dataclasses into the Ranker
thobson88 Nov 8, 2024
2088beb
Integrate dataclasses into Linker & Pipeline
thobson88 Nov 12, 2024
29de527
Add method return types
thobson88 Nov 12, 2024
1394718
Move dataclasses to own module
thobson88 Nov 12, 2024
6d8fa19
Add new dataclasses for Ranker & Linker output
thobson88 Nov 13, 2024
34260f0
Update dataclasses and integrate into Linker & Pipeline
thobson88 Nov 17, 2024
004bd53
Reorder dataclasses module
thobson88 Nov 18, 2024
d54392a
Rename Ranker cache & load method; Edit docstrings
thobson88 Nov 18, 2024
ffe8a59
Replace Ranker method_name() with class attribute
thobson88 Nov 18, 2024
e837037
Replace Linker method_name() with class attribute
thobson88 Nov 18, 2024
9cc84f9
Rename Ranker string matching method
thobson88 Nov 18, 2024
66a5a96
Rename ranker run method argument
thobson88 Nov 18, 2024
93e8276
Remove obsolete Linker method
thobson88 Nov 18, 2024
d9f6558
Improve Linker run method signature
thobson88 Nov 20, 2024
8ea624d
Add geo coords & Wikidata class fields in WikidataLink
thobson88 Nov 21, 2024
1a6c468
Add Recogniser dataclasses & run method
thobson88 Nov 21, 2024
6e1392d
Refactor pipeline to remove linking logic
thobson88 Dec 3, 2024
a30d4f6
Move REL predict method call to Linker
thobson88 Dec 4, 2024
9a0cc56
Handle empty candidates case
thobson88 Dec 4, 2024
2a87c97
Handle mentions with empty candidates
thobson88 Dec 4, 2024
0f55cff
Tidy up
thobson88 Dec 4, 2024
8293bf2
Add modular/stepwise pipeline methods
thobson88 Dec 5, 2024
7956f7b
Adapt REL model training to new dataclasses
thobson88 Dec 9, 2024
ada5cb7
Adapt experiments to new dataclasses
thobson88 Dec 9, 2024
7d3b3ff
Add best_disambiguation_score method
thobson88 Dec 9, 2024
4e584dd
Add text() method in TextCandidates dataclass
thobson88 Dec 9, 2024
43cda06
Rename Linker & Recogniser methods to load()
thobson88 Dec 9, 2024
98ff3cc
Move dataclasses module to utils folder
thobson88 Dec 9, 2024
7742e88
Add methods to Predictions dataclass
thobson88 Dec 10, 2024
1c21032
Add static constructor in Ranker class
thobson88 Dec 10, 2024
ec82ab3
Add static constructor in Linker class
thobson88 Dec 10, 2024
70e0fb4
Fix bug in Linker __str__ method
thobson88 Dec 10, 2024
a651960
Use refactored Ranker & Linker in experiments
thobson88 Dec 10, 2024
edf6ced
Rename ner module
thobson88 Dec 11, 2024
7c6557f
Update pipeline run_disambiguation signature
thobson88 Dec 11, 2024
f8d049f
Fix import
thobson88 Dec 11, 2024
10b9afe
More robust guard clause in Linker training
thobson88 Dec 11, 2024
6bdfc89
Distinguish empty predictions from empty candidates
thobson88 Dec 11, 2024
3db22d4
Fix tests in test_experiments.py
thobson88 Dec 12, 2024
3805762
Rename dataclasses
thobson88 Dec 12, 2024
098a240
Add pretty print method for Predictions
thobson88 Dec 12, 2024
8536d61
Add pretty print methods for SentenceMentions
thobson88 Dec 12, 2024
478bf48
Move dataclass tests to test_dataclasses.py
thobson88 Dec 12, 2024
bcf7c59
Add pretty print methods for CandidateLinks
thobson88 Dec 12, 2024
ca75801
Add pretty print method for Candidates
thobson88 Dec 13, 2024
dbcf6e6
Tweak pretty printing
thobson88 Dec 13, 2024
40086f1
Update example notebook: basic pipeline
thobson88 Dec 13, 2024
f56a212
Update example notebook: Deezy mostpopular
thobson88 Dec 13, 2024
8be92c0
Update example notebooks: Deezy REL
thobson88 Dec 13, 2024
6454d0a
Update example notebook: Pipeline modular
thobson88 Dec 13, 2024
dff4d2d
Update example notebook: Perfect mostpopular
thobson88 Dec 13, 2024
6a35cbd
Add guard clauses
thobson88 Dec 13, 2024
bdad46c
Update example notebook: Load & use NER model
thobson88 Dec 13, 2024
0a39158
Remove obsolete comment
thobson88 Dec 13, 2024
aa183e3
Update app config
thobson88 Dec 13, 2024
e60481f
Update app: run_ner endpoint
thobson88 Dec 14, 2024
efe9d5d
Update app: run_candidate_selection
thobson88 Dec 15, 2024
65469dd
Update app: run_disambiguation & pipeline
thobson88 Dec 15, 2024
6092dcc
Remove obsolete code
thobson88 Dec 15, 2024
676d823
Update ci.yml
thobson88 Dec 17, 2024
6530d2e
Update ci.yml
thobson88 Dec 17, 2024
771fcae
Granular test parameters
thobson88 Dec 17, 2024
45446db
Register test markers in pyproject.toml
thobson88 Dec 17, 2024
99a51ab
Move gazetteer to ByDistanceLinker subclass
thobson88 Dec 17, 2024
a437048
Remove ranker argument
thobson88 Dec 17, 2024
a87f89a
Simplify logic
thobson88 Dec 17, 2024
f9469cf
Simplify logic
thobson88 Dec 17, 2024
791280d
Refactor Linker run method
thobson88 Dec 18, 2024
0ff4746
Remove superfluous field from MentionCandidates
thobson88 Dec 18, 2024
58b39c3
Simplify logic
thobson88 Dec 18, 2024
e84bd55
Simplify logic in reldisamb score calculation
thobson88 Dec 18, 2024
2a41a29
Simplify logic in PartialMatchRanker
thobson88 Dec 18, 2024
db6c7aa
Align test assertions with canonical resources
thobson88 Dec 18, 2024
cea01f5
Reorder Pipeline run method args
thobson88 Dec 19, 2024
6d7d48c
Use test resources in pipeline test
thobson88 Dec 19, 2024
0fe401d
Handle empty candidates in Predictions dataclass
thobson88 Dec 20, 2024
ff3d2d8
Update comments
thobson88 Dec 20, 2024
589303b
Add method to access interim (prior) predictions
thobson88 Dec 20, 2024
946533c
Rename APIQuery class
thobson88 Dec 20, 2024
95714c5
Update API app_template.py to new pipeline
thobson88 Dec 23, 2024
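The commits above describe splitting the Ranker into subclasses, replacing `method_name()` with a class attribute, and adding a static constructor. A minimal sketch of that general pattern follows. It is not the T-Res code: the `new` constructor name, the `PerfectMatchRanker` class, the `known_names` argument and the matching logic are all assumptions made for illustration. Only the subclass names `PartialMatchRanker` and `DeezyMatchRanker` and the `method_name` attribute appear in this PR.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Ranker(ABC):
    """Hypothetical base class; not the actual T-Res implementation."""

    # Class attribute standing in for the former method_name() method.
    method_name: str = ""

    @abstractmethod
    def run(self, mentions: List[str]) -> Dict[str, List[str]]:
        """Return the candidate names found for each mention string."""

    @staticmethod
    def new(method_name: str, **kwargs) -> "Ranker":
        """Static constructor (assumed name): pick the direct subclass by method_name."""
        for subclass in Ranker.__subclasses__():
            if subclass.method_name == method_name:
                return subclass(**kwargs)
        raise ValueError(f"Unknown ranking method: {method_name}")


class PerfectMatchRanker(Ranker):
    method_name = "perfectmatch"

    def __init__(self, known_names: List[str]):
        self.known_names = set(known_names)

    def run(self, mentions: List[str]) -> Dict[str, List[str]]:
        # Exact string match only.
        return {m: [m] if m in self.known_names else [] for m in mentions}


class PartialMatchRanker(Ranker):
    method_name = "partialmatch"

    def __init__(self, known_names: List[str]):
        self.known_names = list(known_names)

    def run(self, mentions: List[str]) -> Dict[str, List[str]]:
        # Case-insensitive substring match, in the order the names were given.
        return {
            m: [n for n in self.known_names if m.lower() in n.lower()]
            for m in mentions
        }


ranker = Ranker.new("partialmatch", known_names=["London", "New London"])
matches = ranker.run(["london"])
```

With a class attribute, callers can read `PartialMatchRanker.method_name` without instantiating the class, which is what lets a static constructor choose a subclass by name.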
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -53,4 +53,4 @@ jobs:

- name: Test with pytest
run: |
poetry run pytest
poetry run pytest tests
5 changes: 4 additions & 1 deletion .gitignore
@@ -55,6 +55,9 @@ coverage.xml
*.mo
*.pot

# OS stuff:
.DS_Store

# Django stuff:
*.log
local_settings.py
@@ -150,4 +153,4 @@ preprocessing/toponymmatching/experiments/

# Docs
_build
test.ipynb
test.ipynb
123 changes: 60 additions & 63 deletions app/app_template.py
100644 → 100755
@@ -1,35 +1,45 @@
import importlib
import os
import sys
import time
from pathlib import Path
from typing import Union, Optional, List
from typing import List, Optional, Union

import uvicorn
from fastapi import FastAPI, Request
from pydantic import BaseModel

from config import CONFIG as pipeline_config

from t_res.geoparser import pipeline
from t_res.utils.dataclasses import SentenceMentions, Candidates

os.environ["APP_CONFIG_NAME"] = "t-res_deezy_reldisamb-wpubl-wmtops"

config_mod = importlib.import_module(
".t-res_deezy_reldisamb-wpubl-wmtops", "app.configs"
)
pipeline_config = config_mod.CONFIG

geoparser = pipeline.Pipeline(**pipeline_config)


class APIQuery(BaseModel):
class RecognitionAPIQuery(BaseModel):
text: str
place: Optional[Union[str, None]] = None
place_wqid: Optional[Union[str, None]] = None


class CandidatesAPIQuery(BaseModel):
toponyms: List[dict]
sentence_mentions: List[dict]
place_of_pub_wqid: Optional[str] = None
place_of_pub: Optional[str] = None


class DisambiguationAPIQuery(BaseModel):
dataset: List[dict]
wk_cands: dict
place: Optional[Union[str, None]] = None
place_wqid: Optional[Union[str, None]] = None
candidates: dict


class PipelineAPIQuery(BaseModel):
text: str
place_of_pub_wqid: Optional[str] = None
place_of_pub: Optional[str] = None


app_config_name = os.environ["APP_CONFIG_NAME"]
@@ -38,74 +48,61 @@ class DisambiguationAPIQuery(BaseModel):

@app.get("/")
async def read_root(request: Request):
return {"Welcome to T-Res!": request.app.title}


@app.get("/test")
async def test_pipeline():
resolved = geoparser.run_sentence(
"Harvey, from London;Thomas and Elizabeth, Barnett.",
place="Manchester",
place_wqid="Q18125",
)

return resolved


@app.get("/resolve_sentence")
async def run_sentence(api_query: APIQuery, request_id: Union[str, None] = None):
place = "" if api_query.place is None else api_query.place
place_wqid = "" if api_query.place_wqid is None else api_query.place_wqid
resolved = geoparser.run_sentence(
api_query.text, place=place, place_wqid=place_wqid
)

return resolved


@app.get("/resolve_full_text")
async def run_text(api_query: APIQuery):

place = "" if api_query.place is None else api_query.place
place_wqid = "" if api_query.place_wqid is None else api_query.place_wqid
resolved = geoparser.run_text(api_query.text, place=place, place_wqid=place_wqid)

return resolved

return {
"Title": request.app.title,
"request.url": request.url,
"request.query_params": request.query_params,
"root_path": request.scope.get("root_path"),
"request.client": request.client,
"hostname": os.uname()[1],
"worker_id": os.getpid(),
}

@app.get("/run_ner")
async def run_ner(api_query: APIQuery):

place = "" if api_query.place is None else api_query.place
place_wqid = "" if api_query.place_wqid is None else api_query.place_wqid
async def run_ner(api_query: RecognitionAPIQuery):
ner_output = geoparser.run_text_recognition(
api_query.text, place=place, place_wqid=place_wqid
api_query.text
)

return ner_output


@app.get("/run_candidate_selection")
async def run_candidate_selection(cand_api_query: CandidatesAPIQuery):

wk_cands = geoparser.run_candidate_selection(cand_api_query.toponyms)
return wk_cands

sentence_mentions = SentenceMentions.from_json(cand_api_query.sentence_mentions)
candidates = geoparser.run_candidate_selection(
sentence_mentions,
place_of_pub_wqid=cand_api_query.place_of_pub_wqid,
place_of_pub=cand_api_query.place_of_pub,
)
return candidates

@app.get("/run_disambiguation")
async def run_disambiguation(api_query: DisambiguationAPIQuery):
place = "" if api_query.place is None else api_query.place
place_wqid = "" if api_query.place_wqid is None else api_query.place_wqid
disamb_output = geoparser.run_disambiguation(
api_query.dataset, api_query.wk_cands, place, place_wqid
candidates = Candidates.from_dict(api_query.candidates)
predictions = geoparser.run_disambiguation(candidates)
return predictions

@app.get("/run_pipeline")
async def run_pipeline(api_query: PipelineAPIQuery):
predictions = geoparser.run(
text=api_query.text,
place_of_pub_wqid=api_query.place_of_pub_wqid,
place_of_pub=api_query.place_of_pub,
)
return disamb_output
return predictions

@app.get("/test")
async def test_pipeline():
predictions = geoparser.run(
"Harvey, from London;Thomas and Elizabeth, Barnett.",
place_of_pub_wqid="Q18125",
place_of_pub="Manchester",
)
return predictions

@app.get("/health")
async def healthcheck():
return {"status": "ok"}


if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
# poetry run uvicorn app.run_local_app:app --port 8123
uvicorn.run(app, host="0.0.0.0", port=8123)
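The updated endpoints rebuild dataclass instances from request payloads (`SentenceMentions.from_json(...)`, `Candidates.from_dict(...)`). A minimal sketch of that round-trip pattern using only the standard library is given below. The field names (`sentence`, `mentions`, `text`, `start`, `end`), the `Mention` class and the `to_dict` method are assumptions for illustration; the real classes in `t_res.utils.dataclasses` may be shaped differently.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Mention:
    # Hypothetical fields; not taken from the T-Res source.
    text: str
    start: int
    end: int


@dataclass
class SentenceMentions:
    # Hypothetical stand-in for the class of the same name in the PR.
    sentence: str
    mentions: List[Mention] = field(default_factory=list)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, including those in lists.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "SentenceMentions":
        # Rebuild nested Mention objects from plain dictionaries.
        return cls(
            sentence=data["sentence"],
            mentions=[Mention(**m) for m in data.get("mentions", [])],
        )


payload = {
    "sentence": "Harvey, from London.",
    "mentions": [{"text": "London", "start": 13, "end": 19}],
}
restored = SentenceMentions.from_dict(payload)
span = payload["sentence"][restored.mentions[0].start:restored.mentions[0].end]
```

Keeping the dictionary form symmetrical with the dataclass means the output of one endpoint can be posted to the next step unchanged, which appears to be the intent of the modular pipeline methods.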
111 changes: 111 additions & 0 deletions app/app_template_old.py
@@ -0,0 +1,111 @@
import os
import sys
import time
from pathlib import Path
from typing import Union, Optional, List

import uvicorn
from fastapi import FastAPI, Request
from pydantic import BaseModel

from config import CONFIG as pipeline_config

from t_res.geoparser import pipeline

geoparser = pipeline.Pipeline(**pipeline_config)


class APIQuery(BaseModel):
text: str
place: Optional[Union[str, None]] = None
place_wqid: Optional[Union[str, None]] = None


class CandidatesAPIQuery(BaseModel):
toponyms: List[dict]


class DisambiguationAPIQuery(BaseModel):
dataset: List[dict]
wk_cands: dict
place: Optional[Union[str, None]] = None
place_wqid: Optional[Union[str, None]] = None


app_config_name = os.environ["APP_CONFIG_NAME"]
app = FastAPI(title=f"Toponym Resolution Pipeline API ({app_config_name})")


@app.get("/")
async def read_root(request: Request):
return {"Welcome to T-Res!": request.app.title}


@app.get("/test")
async def test_pipeline():
resolved = geoparser.run_sentence_deprecated(
"Harvey, from London;Thomas and Elizabeth, Barnett.",
place="Manchester",
place_wqid="Q18125",
)

return resolved


@app.get("/resolve_sentence")
async def run_sentence(api_query: APIQuery, request_id: Union[str, None] = None):
place = "" if api_query.place is None else api_query.place
place_wqid = "" if api_query.place_wqid is None else api_query.place_wqid
resolved = geoparser.run_sentence_deprecated(
api_query.text, place=place, place_wqid=place_wqid
)

return resolved


@app.get("/resolve_full_text")
async def run_text(api_query: APIQuery):

place = "" if api_query.place is None else api_query.place
place_wqid = "" if api_query.place_wqid is None else api_query.place_wqid
resolved = geoparser.run_text_deprecated(api_query.text, place=place, place_wqid=place_wqid)

return resolved


@app.get("/run_ner")
async def run_ner(api_query: APIQuery):

place = "" if api_query.place is None else api_query.place
place_wqid = "" if api_query.place_wqid is None else api_query.place_wqid
ner_output = geoparser.run_text_recognition_deprecated(
api_query.text, place=place, place_wqid=place_wqid
)

return ner_output


@app.get("/run_candidate_selection")
async def run_candidate_selection(cand_api_query: CandidatesAPIQuery):

wk_cands = geoparser.run_candidate_selection_deprecated(cand_api_query.toponyms)
return wk_cands


@app.get("/run_disambiguation")
async def run_disambiguation(api_query: DisambiguationAPIQuery):
place = "" if api_query.place is None else api_query.place
place_wqid = "" if api_query.place_wqid is None else api_query.place_wqid
disamb_output = geoparser.run_disambiguation_deprecated(
api_query.dataset, api_query.wk_cands, place, place_wqid
)
return disamb_output


@app.get("/health")
async def healthcheck():
return {"status": "ok"}


if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
11 changes: 5 additions & 6 deletions app/configs/t-res_deezy_reldisamb-wpubl-wmtops.py
@@ -5,8 +5,7 @@

# --------------------------------------
# Instantiate the ranker:
myranker = ranking.Ranker(
method="deezymatch",
ranker = ranking.DeezyMatchRanker(
resources_path="./resources/",
strvar_parameters={
# Parameters to create the string pair dataset:
@@ -37,9 +36,9 @@

with sqlite3.connect("./resources/rel_db/embeddings_database.db") as conn:
cursor = conn.cursor()
mylinker = linking.Linker(
method="reldisamb",
linker = linking.RelDisambLinker(
resources_path="./resources/",
ranker=ranker,
experiments_path="./experiments/",
linking_resources=dict(),
rel_params={
@@ -56,5 +55,5 @@
overwrite_training=False,
)

# geoparser = pipeline.Pipeline(myranker=myranker, mylinker=mylinker)
CONFIG = {"myranker": myranker, "mylinker": mylinker}
# geoparser = pipeline.Pipeline(ranker=ranker, linker=linker)
CONFIG = {"ranker": ranker, "linker": linker}
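Because the app builds the pipeline with `pipeline.Pipeline(**pipeline_config)`, the `CONFIG` keys must match the constructor's keyword names, which is why they change from `myranker`/`mylinker` to `ranker`/`linker` here. The sketch below illustrates this with a hypothetical stand-in `Pipeline` class; only the keyword names `ranker` and `linker` are taken from the diff, and the string values are placeholders for the real objects.

```python
class Pipeline:
    """Hypothetical stand-in; the real constructor takes more arguments."""

    def __init__(self, ranker=None, linker=None):
        self.ranker = ranker
        self.linker = linker


# Placeholder strings stand in for the real ranker and linker objects.
new_config = {"ranker": "ranker-object", "linker": "linker-object"}
old_config = {"myranker": "ranker-object", "mylinker": "linker-object"}

# Keys that match the keyword names unpack cleanly.
geoparser = Pipeline(**new_config)

# Stale keys fail immediately rather than being silently ignored.
try:
    Pipeline(**old_config)
    stale_key_error = ""
except TypeError as err:
    stale_key_error = str(err)
```

Any config module still using the old keys would therefore fail at import time in the app, so the remaining files under `app/configs` may need the same rename.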