mysql://sshcyber:<password>@mysql.science.uoit.ca:3306/sshcyber
All documents were pre-processed and catalogued in the document table. The paths for the raw documents in document.path are located on the vialab compute server. The document.id column is used across the entire system to reference a specific document. Many of the tables in the database consist of cleaned and extracted data, but are unused in the actual functioning of the system.
- journal
- document
- meta
- titre
Documents with the document.cleanpath column filled are the ones used within the system; the column indicates which document aggregate file the document was saved in. Actual text processing was done using several scripts within topic_model.py, and the stop word set used is labeled as adam2 in the stopword table.
We used sklearn and treetagger to vectorize and POS tag documents respectively to generate our TFIDF model. The Python model was saved and gzipped in /model/tm.gzip, which continues to be used for modelling user-uploaded documents. The TFIDF model and the transformed word vectors for each document are also saved in the database; these are the main components used for performing document search queries.
- tfidf
- term
- doctfidf
- (user)dochash
- userdoctfidf
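As a rough illustration of how a fitted model can be written to and read back from a gzipped file such as /model/tm.gzip, here is a minimal sketch. It assumes plain pickle inside gzip; the helper names save_model and load_model and the toy dictionary are hypothetical, and the actual serialization in topic_model.py may differ.

```python
import gzip
import os
import pickle
import tempfile


def save_model(model, path):
    """Pickle `model` and write it gzip-compressed to `path`."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(model, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Read a gzip-compressed pickle back into memory."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in for the fitted TFIDF model (illustrative values only).
toy_model = {"vocabulary": ["archive", "corpus"], "idf": [1.4, 2.1]}

# Written to a temporary directory here rather than to /model/.
model_path = os.path.join(tempfile.mkdtemp(), "tm.gzip")
save_model(toy_model, model_path)
restored = load_model(model_path)
```

Because unpickling can execute arbitrary code, a file like this should only ever be loaded from a trusted location.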
The Oxford Historical Thesaurus (OHT) is a very complicated hierarchical tree of the English language. Words and phrases are classified and tiered in such a way that concepts can be grouped together by meaning. More information can be found at OED.
Our OHT data originally came as CSV and was saved directly to the oht table. For real-time use, we normalized the data into several other tables and aggregated the tier columns into a single column.
Within the OHT hierarchy, a single node in the tree is called a heading and can be found as such in a table. Headings are further classified using thematicheading, where themes are not bounded by a heading's position in the tree (for this reason, we are not using thematic heading). Each heading consists of 7 tiers found in heading.t1, heading.t2, heading.t3, ... heading.t7. These columns always have a value, and are filled with NA if the tier is not applicable for the heading.
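The tier aggregation mentioned above can be sketched as follows. This is only an illustration: the function name aggregate_tiers, the "." separator and the sample tier values are assumptions, and it further assumes that NA values only ever trail (once a tier is NA, all deeper tiers are NA too).

```python
NOT_APPLICABLE = "NA"


def aggregate_tiers(tiers, sep="."):
    """Join the tier values t1..t7 into one string, stopping at the first NA."""
    parts = []
    for value in tiers:
        if value == NOT_APPLICABLE:
            break  # assumption: every deeper tier is NA as well
        parts.append(str(value))
    return sep.join(parts)


# A heading three tiers deep; t4..t7 are not applicable.
row = ("01", "02", "05", "NA", "NA", "NA", "NA")
tier_path = aggregate_tiers(row)  # "01.02.05"
```

A single aggregated column like this makes prefix lookups possible, so all descendants of a heading share the heading's value as a prefix.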
Within the tree, the main headings for each tier are always nouns; headings for other parts of speech are only held adjacent to the noun and do not affect the structure of the tree. Each heading also has a bag of words associated with it.
- oht
- thematicheading
- heading
- word
- wordsize
This is not meant to be an explanation of the code. However, if all else fails, this is a general overview of how the code is supposed to work.
- Home
  - Keyword Search
  - Recent Searches
  - Upload / Recent Uploads
- Document Search
  - Search Results
  - Search Terms
  - Journal Visualization
  - OHT Visualization
- Journal Search
  - Upload / Recent Uploads
Page templates were created with jinja2 and can be found in the /templates/ folder. It is important to keep these templates up to date for the initial rendering of a page; however, most of our pages have JavaScript functions that re-render new data asynchronously. For example, running sequential searches will not reload the page, but rather rebuild the results using JavaScript.
In early iterations of the system, a lot of data was being saved in the session of an individual user. For this reason, we decided to create a Python pickle-based interface for Flask to save session data on disk. This has the benefit of fitting much more data in the session, but is ultimately much slower. As we are no longer holding as much data in the session, this can be revisited if it poses any security or performance disadvantages.
static/py/pickle_session.py
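To give a sense of what a pickle-based, on-disk session store involves, here is a standalone sketch. It deliberately leaves out the Flask integration, and the class PickleSessionStore and its methods are hypothetical names, not the contents of pickle_session.py. It assumes session IDs are generated server-side and validated before being used as file names.

```python
import os
import pickle
import tempfile
import uuid


class PickleSessionStore:
    """Keeps one pickle file per session ID inside `directory`."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, session_id):
        # Assumes `session_id` is a server-generated hex string.
        return os.path.join(self.directory, session_id + ".pkl")

    def new_id(self):
        return uuid.uuid4().hex

    def save(self, session_id, data):
        # Write to a temporary file, then rename, so a concurrent reader
        # never sees a half-written session.
        tmp_path = self._path(session_id) + ".tmp"
        with open(tmp_path, "wb") as fh:
            pickle.dump(data, fh)
        os.replace(tmp_path, self._path(session_id))

    def load(self, session_id):
        try:
            with open(self._path(session_id), "rb") as fh:
                return pickle.load(fh)
        except FileNotFoundError:
            return {}  # unknown session: start empty


store = PickleSessionStore(tempfile.mkdtemp())
sid = store.new_id()
store.save(sid, {"search_terms": ["archive", "memory"]})
loaded = store.load(sid)
missing = store.load(store.new_id())
```

Every request pays for a disk read and a disk write, which is the slowdown noted above; the trade-off is that the session is no longer limited by cookie size.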
The home page is where a search query begins. A user can enter a keyword to search directly in OHT and start building the search query immediately. Alternatively, a user can upload a document by dragging and dropping a document into the dropper area provided.
Searches and documents are saved in the database in series, and are available to the user on the home page where applicable. Uploading a document redirects the user to the search page, where the document's top search terms are used as the base of the search query.
templates/index.html
file_upload.py
static/py/topic_model.py (document processing)
The main components initialized on page load are the search terms, extracted from either a document or a prior search. Once the page has loaded, the main search function is called asynchronously and automatically. After the search function has completed, the journal document distribution widget is loaded and generated using d3.
Clicking on a search term or adding a search term will open the OHT visualization. The data used in the OHT visualization is a splice of the entire hierarchy with a specific heading set as the root. From the root, two levels further down are shown.
Single clicking on a heading will asynchronously load the adjacent POS headings available at that tier/heading, as well as the bag of words associated with the selected heading. These words can then be added to the search query. Only the words that exist in the corpus can be selected.
Double clicking on a heading will asynchronously reload the visualization with the selected heading set as the new root. If the selected heading is already the root, the hierarchy is re-rooted at its immediate parent.
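The splice and the double-click behaviour can be sketched roughly as below. The toy hierarchy, the in-memory dictionaries and the function names splice and next_root are all assumptions for illustration; the real data comes from the heading table and is served asynchronously.

```python
# Toy hierarchy: heading id -> child heading ids (illustrative values).
CHILDREN = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "a1": ["a1x"],
    "a1x": ["deep"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def splice(root, max_depth=2):
    """Return the subtree under `root`, cut off `max_depth` levels below it."""
    node = {"id": root, "children": []}
    if max_depth > 0:
        for child in CHILDREN.get(root, []):
            node["children"].append(splice(child, max_depth - 1))
    return node


def next_root(current_root, clicked):
    """Double-click rule: re-root at the clicked heading, or at the parent
    when the clicked heading is already the root."""
    if clicked == current_root:
        # The top of the tree has no parent, so it stays where it is.
        return PARENT.get(current_root, current_root)
    return clicked


tree = splice("a")  # "a", its children, and its grandchildren only
```

Here "deep" is three levels below "a", so it is left out of the splice until "a1" or "a1x" becomes the root.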
Document relevance ranking is based entirely on the sum of each document's TFIDF scores for the search terms provided (logically an 'OR' operation). This can heavily skew results toward documents that are overly weighted in a single search term. As a slight workaround to the simplicity of the algorithm, we also allow a search term to be forced for inclusion (logically an 'AND' operation).
In code, we emulate this by cascading SQL queries where forced search terms are evaluated first.
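A minimal sketch of this ranking, using an in-memory SQLite database as a stand-in for MySQL. The doctfidf column names (doc_id, term, score), the sample rows and the rank function are assumptions, and the forced terms are handled with a subquery here rather than with the system's actual cascade of queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doctfidf (doc_id INTEGER, term TEXT, score REAL)")
conn.executemany(
    "INSERT INTO doctfidf VALUES (?, ?, ?)",
    [
        (1, "archive", 0.9),
        (2, "archive", 0.2), (2, "memory", 0.3),
        (3, "memory", 0.4),
    ],
)


def rank(conn, terms, forced=()):
    """Rank documents by summed TFIDF over `terms` (OR); every term in
    `forced` must also occur in the document (AND)."""
    all_terms = list(dict.fromkeys(list(forced) + list(terms)))
    marks = ",".join("?" * len(all_terms))
    sql = (
        "SELECT doc_id, SUM(score) AS total FROM doctfidf "
        f"WHERE term IN ({marks}) "
    )
    params = list(all_terms)
    if forced:
        forced_marks = ",".join("?" * len(forced))
        # Evaluated first: keep only documents that contain every forced term.
        sql += (
            "AND doc_id IN (SELECT doc_id FROM doctfidf "
            f"WHERE term IN ({forced_marks}) "
            "GROUP BY doc_id HAVING COUNT(DISTINCT term) = ?) "
        )
        params += list(forced) + [len(set(forced))]
    sql += "GROUP BY doc_id ORDER BY total DESC, doc_id"
    return conn.execute(sql, params).fetchall()


or_ranking = rank(conn, ["archive", "memory"])
and_ranking = rank(conn, ["archive", "memory"], forced=["memory"])
```

In the plain OR ranking, document 1 comes first purely on the strength of one term; forcing "memory" removes it from the results, which is the skew the workaround addresses.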
templates/analyzer.html
static/js/events.js
static/js/hooks.js
static/js/analyzer.js
static/js/query.js
static/js/vis.js
static/js/widget.js
static/py/erudit_corpus.py
static/py/oht.py
Journal search can currently only be done using a document. When a document is uploaded or a recently uploaded document is selected, we perform the same SQL query used in document matching, aggregated by journal. The TFIDF scores are then summed over the key search terms found in that document.
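The journal-level aggregation can be sketched the same way, again with SQLite standing in for MySQL. The schema shown (a journal column on document, and doc_id, term, score on doctfidf), the sample rows and the rank_journals function are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (id INTEGER PRIMARY KEY, journal TEXT);
CREATE TABLE doctfidf (doc_id INTEGER, term TEXT, score REAL);
INSERT INTO document VALUES (1, 'Journal A'), (2, 'Journal A'), (3, 'Journal B');
INSERT INTO doctfidf VALUES
    (1, 'archive', 0.25), (2, 'memory', 0.5),
    (3, 'archive', 0.5), (3, 'unrelated', 4.0);
""")


def rank_journals(conn, terms):
    """Sum TFIDF scores for `terms` per journal instead of per document."""
    marks = ",".join("?" * len(terms))
    sql = (
        "SELECT d.journal, SUM(t.score) AS total "
        "FROM doctfidf AS t JOIN document AS d ON d.id = t.doc_id "
        f"WHERE t.term IN ({marks}) "
        "GROUP BY d.journal ORDER BY total DESC"
    )
    return conn.execute(sql, list(terms)).fetchall()


# The terms would be the key search terms extracted from the uploaded document.
journal_ranking = rank_journals(conn, ["archive", "memory"])
```

Scores for terms outside the query ("unrelated" above) do not count, so Journal A (0.75) ranks above Journal B (0.5).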
templates/journal.html
static/js/journal.js