PackageHfst
...
List of contents of package hfst
Item | Description |
---|---|
class AttReader | A class for reading input in AT&T text format and converting it into transducer(s). |
class PrologReader | A class for reading input in prolog text format and converting it into transducer(s). |
class HfstIterableTransducer | A simple transducer class with tropical weights. |
class HfstTransition | A transition class that consists of a target state, input and output symbols and a tropical weight. |
class HfstTransducer | A synchronous finite-state transducer. |
class HfstInputStream | A stream for reading HFST binary transducers. |
class HfstOutputStream | A stream for writing HFST binary transducers. |
class MultiCharSymbolTrie | ??? |
class HfstTokenizer | A tokenizer for creating transducers from UTF-8 strings. |
class LexcCompiler | A compiler holding information contained in lexc style lexicons. |
class XreCompiler | A regular expression compiler. |
class PmatchContainer | A class for performing pattern matching. |
class ImplementationType | Back-end implementations. |
set_default_fst_type | Set default transducer implementation type. |
get_default_fst_type | Get default transducer implementation type. |
fst_type_to_string | Get a string representation of transducer implementation type. |
EPSILON | The string for epsilon symbol. |
UNKNOWN | The string for unknown symbol. |
IDENTITY | The string for identity symbol. |
fst | Get a transducer that recognizes one or more paths. |
fst_to_fsa | Get an automaton representation of a transducer. |
fsa_to_fst | Get a transducer representation of an automaton. |
tokenized_fst | Get a transducer that recognizes the concatenation of symbols or symbol pairs. |
empty_fst | Get an empty transducer. |
epsilon_fst | Get an epsilon transducer. |
regex | Get a transducer as defined by regular expression. |
compile_sfst_file | Compile sfst file into a transducer. |
compile_lexc_file | Compile lexc file into a transducer. |
compile_xfst_file | Compile (is 'run' a better term?) xfst file. |
compile_pmatch_file | Compile pmatch expressions as defined in file and return a tuple of transducers. |
compile_twolc_file | Compile twolc file and store the result to file. |
compile_pmatch_expression | Compile a pmatch expression into a tuple of transducers. |
start_xfst | Start interactive xfst compiler. |
read_att_input | Read AT&T input from the user and return a transducer. |
read_att_string | Read a multiline AT&T string and return a transducer. |
read_att_transducer | Read next transducer from file in AT&T format. |
read_prolog_transducer | Read next transducer from file in prolog format. |
read_prolog_transducer | Read next transducer from file in prolog format keeping track of lines. |
concatenate | Return a concatenation of transducers. |
disjunct | Return a union of transducers. |
intersect | Return an intersection of transducers. |
compose | Return a composition of transducers. |
cross_product | Return a cross product of transducers. |
is_diacritic | Whether a symbol is flag diacritic. |
Set default transducer implementation type.
Set the implementation type (SFST_TYPE, TROPICAL_OPENFST_TYPE, FOMA_TYPE) that is used by default by all operations that create transducers. The default value is TROPICAL_OPENFST_TYPE.
- impl: An hfst.ImplementationType.
Get default transducer implementation type.
If the default type is not set, it defaults to TROPICAL_OPENFST_TYPE.
Get a string representation of transducer implementation type type.
- type: An hfst.ImplementationType.
The string for epsilon symbol.
An example:
fsm = hfst.HfstIterableTransducer()
fsm.add_state(1)
fsm.set_final_weight(1, 2.0)
fsm.add_transition(0, 1, "foo", hfst.EPSILON)
if not hfst.HfstTransducer(fsm).compare(hfst.regex('foo:0::2.0')):
    raise RuntimeError('')
Note: In regular expressions, "0" is used for the epsilon. See also: Symbols
The string for unknown symbol.
An example:
fsm = hfst.HfstIterableTransducer()
fsm.add_state(1)
fsm.set_final_weight(1, -0.5)
fsm.add_transition(0, 1, "foo", hfst.UNKNOWN)
fsm.add_transition(0, 1, "foo", "foo")
if not hfst.HfstTransducer(fsm).compare(hfst.regex('foo:?::-0.5')):
    raise RuntimeError('')
Note: In regular expressions, "?" on either or both sides of a transition is used for the unknown symbol. See also: Symbols
The string for identity symbol.
An example:
fsm = hfst.HfstIterableTransducer()
fsm.add_state(1)
fsm.set_final_weight(1, 1.5)
fsm.add_transition(0, 1, hfst.IDENTITY, hfst.IDENTITY)
if not hfst.HfstTransducer(fsm).compare(hfst.regex('?::1.5')):
    raise RuntimeError('')
Note: In regular expressions, a single "?" is used for the identity symbol. See also: Symbols
Get a transducer that recognizes one or more paths.
- arg: See example below.
Possible inputs:
One unweighted identity path:
'foo' -> [f o o]
Weighted path: a tuple of string and number, e.g.
('foo',1.4)
('bar',-3)
('baz',0)
Several paths: a list or a tuple of paths and/or weighted paths, e.g.
['foo', 'bar']
('foo', ('bar',5.0))
('foo', ('bar',5.0), 'baz', 'Foo', ('Bar',2.4))
[('foo',-1), ('bar',0), ('baz',3.5)]
A dictionary mapping strings to any of the above cases:
{'foo':'foo', 'bar':('foo',1.4), 'baz':(('foo',-1),'BAZ')}
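The accepted argument shapes may be easier to follow when flattened into explicit paths. The following is a minimal pure-Python sketch, not the library's implementation: `expand_paths` is a hypothetical helper, and it assumes that an unweighted path has weight 0.0 and that a dictionary key is the input string of every path listed under it.

```python
def expand_paths(arg):
    """Flatten an argument in one of the listed shapes into a list of
    (input, output, weight) triples. Hypothetical helper, for illustration."""
    if isinstance(arg, str):
        # One unweighted identity path (weight 0.0 is an assumption).
        return [(arg, arg, 0.0)]
    if isinstance(arg, dict):
        # Each key is an input string; its value lists the output paths.
        return [(key, out, weight)
                for key, value in arg.items()
                for _, out, weight in expand_paths(value)]
    if (isinstance(arg, tuple) and len(arg) == 2
            and isinstance(arg[0], str)
            and isinstance(arg[1], (int, float))):
        # A weighted path: a tuple of string and number.
        return [(arg[0], arg[0], float(arg[1]))]
    # Otherwise: a list or tuple of paths and/or weighted paths.
    paths = []
    for item in arg:
        paths.extend(expand_paths(item))
    return paths


example = {'foo': 'foo', 'bar': ('foo', 1.4), 'baz': (('foo', -1), 'BAZ')}
for path in expand_paths(example):
    print(path)
```

Under these assumptions the dictionary expands to four paths: foo:foo, bar:foo with weight 1.4, baz:foo with weight -1 and baz:BAZ.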
Get a transducer (automaton) where each transition symbol pair isymbol:osymbol of fst is replaced with a transition isymbolosymbol:isymbolosymbol, adding separator between isymbol and osymbol.
- fst: The transducer.
- separator: The separator symbol inserted between input and output symbols.
Examples
import hfst
foo2bar = hfst.fst({'foo':'bar'})
foobar = hfst.fst_to_fsa(foo2bar)
foobar = hfst.fst_to_fsa(foo2bar, '^')
See also: hfst.fsa_to_fst
Get a transducer where each transition isymbolSosymbol:isymbolSosymbol of fsa is replaced with a transition isymbol:osymbol, if separator is S.
- fsa: The transducer. Must be an automaton, i.e. for each transition, the input and output symbols must be the same. Otherwise, a TransducerIsNotAutomatonException is thrown.
- separator: The symbol separating the input and output symbol parts in fsa. If it is the empty string, the length of each symbol in fsa (excluding special symbols of the form "@...@") must be exactly 2. Otherwise, a RuntimeError is thrown.
Examples:
import hfst
foo2bar = hfst.fst({'foo':'bar'}) # creates transducer [f:b o:a o:r]
foobar = hfst.fst_to_fsa(foo2bar, '^')
foo2bar = hfst.fsa_to_fst(foobar, '^')
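The symbol-level effect of the two conversions can be sketched in plain Python without the library. The helpers `pair_to_symbol` and `symbol_to_pair` are hypothetical names used only for illustration; the sketch ignores special symbols of the form "@...@", and the empty-string default separator and the error raised for a missing separator are assumptions.

```python
def pair_to_symbol(isymbol, osymbol, separator=''):
    """Merge an input:output pair into one automaton symbol."""
    return isymbol + separator + osymbol


def symbol_to_pair(symbol, separator=''):
    """Split a merged automaton symbol back into an input:output pair."""
    if separator == '':
        # Without a separator the split point is only known if the
        # symbol consists of exactly two characters.
        if len(symbol) != 2:
            raise RuntimeError('symbol length must be exactly 2: %r' % symbol)
        return symbol[0], symbol[1]
    isymbol, found, osymbol = symbol.partition(separator)
    if not found:
        # Assumption: a symbol without the separator is treated as an error.
        raise RuntimeError('separator not found in %r' % symbol)
    return isymbol, osymbol


pairs = [('f', 'b'), ('o', 'a'), ('o', 'r')]   # the path [f:b o:a o:r]
merged = [pair_to_symbol(i, o, '^') for i, o in pairs]
restored = [symbol_to_pair(s, '^') for s in merged]
print(merged)
print(restored == pairs)
```

A non-empty separator such as '^' keeps the round trip unambiguous when symbols are longer than one character.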
Get a transducer that recognizes the concatenation of symbols or symbol pairs in arg.
- arg: The symbols or symbol pairs that form the path to be recognized.
Example
import hfst
tok = hfst.HfstTokenizer()
tok.add_multichar_symbol('foo')
tok.add_multichar_symbol('bar')
tr = hfst.tokenized_fst(tok.tokenize('foobar', 'foobaz'))
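To make the shape of the tokenized pairs concrete, here is a pure-Python sketch of what a two-string tokenization could produce. It is not the HfstTokenizer implementation: the greedy longest-match strategy, the padding of the shorter side with epsilons, and the value of the `EPSILON` constant are all assumptions made for illustration.

```python
from itertools import zip_longest

EPSILON = '@_EPSILON_SYMBOL_@'  # assumed to match the value of hfst.EPSILON


def tokenize(text, multichar_symbols):
    """Split text into symbols, preferring the longest multichar symbol."""
    ordered = sorted((s for s in multichar_symbols if s), key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        for symbol in ordered:
            if text.startswith(symbol, pos):
                tokens.append(symbol)
                pos += len(symbol)
                break
        else:
            # No multichar symbol starts here: take a single character.
            tokens.append(text[pos])
            pos += 1
    return tokens


def tokenize_pair(input_text, output_text, multichar_symbols):
    """Pair up input and output tokens, padding the shorter side with epsilon."""
    return list(zip_longest(tokenize(input_text, multichar_symbols),
                            tokenize(output_text, multichar_symbols),
                            fillvalue=EPSILON))


for pair in tokenize_pair('foobar', 'foobaz', ['foo', 'bar']):
    print(pair)
```

With 'foo' and 'bar' as multichar symbols, 'foobar' splits into two tokens while 'foobaz' splits into four, so the sketch pads the input side with two epsilons.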
Get an empty transducer.
An empty transducer has one state that is not final, i.e. it does not recognize any string.
Get an epsilon transducer.
An epsilon transducer has one state that is final (with final weight weight), i.e. it recognizes the empty string.
- weight: The weight of the final state.
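The difference between the empty and the epsilon transducer comes down to whether the single state is final. A minimal pure-Python sketch of that idea, using a hypothetical `accepts` helper and a toy deterministic acceptor rather than the library's classes:

```python
def accepts(final_weights, transitions, string):
    """Return the final weight if the string is accepted, else None.

    final_weights maps final states to their weights; transitions maps
    (state, symbol) to a target state. State 0 is the start state.
    """
    state = 0
    for symbol in string:
        if (state, symbol) not in transitions:
            return None
        state = transitions[(state, symbol)]
    return final_weights.get(state)


# Empty transducer: one state, not final, no transitions.
print(accepts({}, {}, ''))
# Epsilon transducer: one state, final with weight 1.5, no transitions.
print(accepts({0: 1.5}, {}, ''))
print(accepts({0: 1.5}, {}, 'a'))
```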
Get a transducer as defined by regular expression regexp.
- regexp: The regular expression defined with Xerox transducer notation.
- kwargs: Arguments recognized are: error.
- error: Where warnings and errors are printed. Possible values are sys.stdout, sys.stderr (the default), a StringIO or None, indicating a quiet mode.
Regular expression operators:
~ complement
\ term complement
& intersection
- minus
$. contains once
$? contains optionally
$ contains once or more
( ) optionality
+ Kleene plus
* Kleene star
./. ignore internally (not yet implemented)
/ ignoring
| union
<> shuffle
< before
> after
.o. composition
.O. lenient composition
.m>. merge right
.<m. merge left
.x. cross product
.P. input priority union
.p. output priority union
.-u. input minus
.-l. output minus
`[ ] substitute
^n,k catenate from n to k times, inclusive
^>n catenate more than n times
^<n catenate less than n times
^n catenate n times
.r reverse
.i invert
.u input side
.l output side
\\\ left quotient
Two-level rules:
\<= left restriction
<=> left and right arrow
<= left arrow
=> right arrow
Replace rules:
-> replace right
(->) optionally replace right
<- replace left
(<-) optionally replace left
<-> replace left and right
(<->) optionally replace left and right
@-> left-to-right longest match
@> left-to-right shortest match
->@ right-to-left longest match
>@ right-to-left shortest match
Rule contexts, markers and separators:
|| match contexts on input sides
// match left context on output side and right context on input side
\\ match left context on input side and right context on output side
\/ match contexts on output sides
_ center marker
... markup marker
,, rule separator in parallel rules
, context separator
[. .] match epsilons only once
Read from file:
@bin" " read binary transducer
@txt" " read transducer in att text format
@stxt" " read spaced text
@pl" " read transducer in prolog text format
@re" " read regular expression
Symbols:
.#. word boundary symbol in replacements, restrictions
0 the epsilon
? any token
% escape character
{ } concatenate symbols
" " quote symbol
: pair separator
:: weight
; end of expression
! starts a comment until end of line
# starts a comment until end of line
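For readers more familiar with Python's `re` module, the bounded catenation operators have rough counterparts there. The sketch below is only an analogy for a single symbol `a`, not a translation of the Xerox notation; in particular, reading "less than n times" (written `^<n`) as including zero repetitions is an assumption.

```python
import re

# Xerox-style expression -> roughly equivalent Python regular expression
equivalents = {
    'a^2,4': 'a{2,4}',   # from 2 to 4 times, inclusive
    'a^>2': 'a{3,}',     # more than 2 times
    'a^<2': 'a{0,1}',    # less than 2 times (assumed to include zero)
    'a^2': 'a{2}',       # exactly 2 times
}


def accepted_lengths(pattern, max_len=6):
    """List the repetition counts of 'a' (up to max_len) that the pattern accepts."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, 'a' * n)]


for xerox, python_re in equivalents.items():
    print(xerox, accepted_lengths(python_re))
```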
Compile sfst file filename into a transducer.
- filename: The name of the sfst file.
- kwargs: Arguments recognized are: verbose, output.
- verbose: Whether sfst file is processed in verbose mode, defaults to False.
- output: TODO: Where output is printed. Possible values are sys.stdout, sys.stderr, a StringIO, sys.stderr being the default.
Return: On success the resulting transducer, else None.
Compile lexc file filename into a transducer.
- filename: The name of the lexc file.
- kwargs: Arguments recognized are: verbosity, with_flags, output.
- verbosity: The verbosity of the compiler, defaults to 0 (silent). Possible values are: 0, 1, 2.
- with_flags: Whether lexc flags are used when compiling, defaults to False.
- output: Where output is printed. Possible values are sys.stdout, sys.stderr, a StringIO, sys.stderr being the default?
Return: On success the resulting transducer, else None.
Examples of input (given in file filename):
The following example recognizes the words "cat", "dog", "cats" and "dogs" with respective weights 2, 3, 5 and 6:
Multichar_Symbols +Sg +Pl
LEXICON Root
cat Num "weight: 1" ;
dog Num "weight: 2" ;
LEXICON Num
+Sg: # "weight: 1" ;
+Pl:s # "weight: 4" ;
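The weights quoted for this lexicon follow from adding the weight of the Root entry to the weight of the Num continuation, as in the tropical semiring. A small pure-Python check of that arithmetic (the dictionaries simply restate the lexicon's surface forms and are not produced by the compiler):

```python
# Surface stems from LEXICON Root with their weights.
root = {'cat': 1, 'dog': 2}
# Surface suffixes from LEXICON Num: '+Sg:' yields nothing, '+Pl:s' yields 's'.
num = {'': 1, 's': 4}

# Along a path the weights are added.
weights = {stem + suffix: stem_weight + suffix_weight
           for stem, stem_weight in root.items()
           for suffix, suffix_weight in num.items()}

for word in ('cat', 'dog', 'cats', 'dogs'):
    print(word, weights[word])
```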
The following example recognizes any number of consecutive cats with a weight equal to the number of cats, i.e. "cat" with weight 1, "catcat" with weight 2, etc:
LEXICON Root
<[cat::1]+> # ;
Using weights has an effect only if FORMAT is weighted, i.e. one of { openfst-tropical, openfst-log, optimized-lookup-weighted }.
For more information on using weights in regular expressions, see Weights.
Compile (is 'run' a better term?) xfst file filename.
- filename: The name of the xfst file.
- kwargs: Arguments recognized are: verbosity, quit_on_fail, output, type.
- verbosity: The verbosity of the compiler, defaults to 0 (silent). Possible values are: 0, 1, 2.
- quit_on_fail: Whether the script is exited on any error, defaults to True.
- output: Where output is printed. Possible values are sys.stdout, sys.stderr, a StringIO, sys.stderr being the default?
- type: Implementation type of the compiler, defaults to hfst.get_default_fst_type().
Return: On success 0, else an integer greater than 0.
Compile pmatch expressions as defined in filename and return a tuple of transducers.
An example:
If we have a file named streets.txt that contains:
define CapWord UppercaseAlpha Alpha* ;
define StreetWordFr [{avenue} | {boulevard} | {rue}] ;
define DeFr [ [{de} | {du} | {des} | {de la}] Whitespace ] | [{d'} | {l'}] ;
define StreetFr StreetWordFr (Whitespace DeFr) CapWord+ ;
regex StreetFr EndTag(FrenchStreetName) ;
we can run:
defs = hfst.compile_pmatch_file('streets.txt')
cont = hfst.PmatchContainer(defs)
assert cont.match("Je marche seul dans l'avenue des Ternes.") == "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."
See also: hfst.PmatchContainer.match, hfst.PmatchContainer.__init__
Compile twolc file inputfilename and store the result to file outputfilename.
- inputfilename: The name of the twolc input file.
- outputfilename: The name of the transducer output file.
- kwargs: Arguments recognized are: silent, verbose, resolve_right_conflicts, resolve_left_conflicts, type.
- silent: Whether compilation is performed in silent mode, defaults to False.
- verbose: Whether compilation is performed in verbose mode, defaults to False.
- resolve_right_conflicts: Whether right arrow conflicts are resolved, defaults to True.
- resolve_left_conflicts: Whether left arrow conflicts are resolved, defaults to False.
- type: Implementation type of the compiler, defaults to hfst.get_default_fst_type().
Return: On success zero, else an integer other than zero.
Compile a pmatch expression into a tuple of transducers.
- expr: A string defining how pmatch is done.
See also: hfst.compile_pmatch_file
Start interactive xfst compiler. The compiler runs with a read-eval-print loop (REPL): each command given is executed, its output is printed, and a new prompt is displayed. The compiler can be exited with the command exit.
- kwargs: Arguments recognized are: type, quit_on_fail.
- quit_on_fail: Whether the compiler exits on any error, defaults to False.
- type: Implementation type of the compiler, defaults to hfst.get_default_fst_type().
For information about xfst commands, see wiki page of command line tool hfst-xfst.
Read AT&T input from the user and return a transducer.
Return: An HfstTransducer whose type is hfst.get_default_fst_type().
Read one AT&T line at a time from standard input and finally return an equivalent transducer. An empty line signals the end of input.
Read a multiline string att and return a transducer.
- att: A string in AT&T format that defines the transducer.
Return: An HfstTransducer whose type is hfst.get_default_fst_type().
Read att and create a transducer as defined in it.
Read next transducer from AT&T file pointed by f. epsilonstr defines the symbol used for epsilon in the file.
- f: A python file.
- epsilonstr: How epsilon is represented in the file. By default, "@_EPSILON_SYMBOL_@" and "@0@" are both recognized.
If the file contains several transducers, they must be separated by "--" lines. In AT&T format, the transition lines are of the form:
[0-9]+[\w]+[0-9]+[\w]+[^\w]+[\w]+[^\w]([\w]+(-)[0-9]+(\.[0-9]+))
and final state lines:
[0-9]+[\w]+([\w]+(-)[0-9]+(\.[0-9]+))
If the weight
([\w]+(-)[0-9]+(\.[0-9]+))
is missing, the transition or final state is given a zero weight.
NOTE: If transition symbols contain spaces, they must be escaped as '@_SPACE_@' because spaces are used as field separators. Both '@0@' and '@_EPSILON_SYMBOL_@' are always interpreted as epsilons.
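The line layout described above can be illustrated with a small stand-alone reader. This is a sketch only, not how hfst.read_att_transducer is implemented: `split_att_blocks` is a hypothetical helper, it splits fields on any whitespace, does no symbol unescaping, and assumes that a missing weight means zero.

```python
def split_att_blocks(text):
    """Split AT&T text into transducer blocks at '--' lines and
    classify every other line as a transition or a final state."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip() == '--':
            # Separator between two transducers.
            blocks.append(current)
            current = []
            continue
        fields = line.split()
        if len(fields) in (4, 5):
            # source state, target state, input, output, optional weight
            weight = float(fields[4]) if len(fields) == 5 else 0.0
            current.append(('transition', int(fields[0]), int(fields[1]),
                            fields[2], fields[3], weight))
        elif len(fields) in (1, 2):
            # final state, optional weight
            weight = float(fields[1]) if len(fields) == 2 else 0.0
            current.append(('final', int(fields[0]), weight))
        elif fields:
            raise ValueError('not a valid AT&T line: %r' % line)
    blocks.append(current)
    return blocks


sample = "0 1 foo bar 0.3\n1 0.5\n--\n0 0.0\n--\n--\n0 0.0\n0 0 a <eps> 0.2\n"
for number, block in enumerate(split_att_blocks(sample)):
    print(number, block)
```

In this sketch an empty block between two consecutive "--" lines stands for an empty transducer.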
An example:
0 1 foo bar 0.3
1 0.5
--
0 0.0
--
--
0 0.0
0 0 a <eps> 0.2
The example lists four transducers in AT&T format: one transducer accepting the string pair <'foo','bar'>, one epsilon transducer, one empty transducer and one transducer that accepts any number of 'a's and produces an empty string in all cases. The transducers can be read with the following commands (from a file named 'testfile.att'):
import hfst

transducers = []
ifile = open('testfile.att', 'r')
try:
    # Keep reading until the end of the stream is signalled by an exception.
    while True:
        t = hfst.read_att_transducer(ifile, '<eps>')
        transducers.append(t)
        print("read one transducer")
except hfst.exceptions.NotValidAttFormatException as e:
    print("Error reading transducer: not valid AT&T format.")
except hfst.exceptions.EndOfStreamException as e:
    pass  # all transducers have been read
ifile.close()
print("Read %i transducers in total" % len(transducers))
Epsilon will be represented as hfst.EPSILON in the resulting transducer. The argument epsilonstr only denotes how epsilons are represented in ifile.
Known bugs: Empty transducers are in theory represented as empty strings in AT&T format. However, this sometimes results in them getting interpreted as end-of-file. To avoid this, use an empty line instead, i.e. a single newline character.
Throws:
- hfst.exceptions.NotValidAttFormatException
- hfst.exceptions.StreamNotReadableException
- hfst.exceptions.StreamIsClosedException
- hfst.exceptions.EndOfStreamException
See also: #write_att
Read next transducer from prolog file pointed by f.
- f A python file.
If the file contains several transducers, they must be separated by empty lines.
Create a transducer as defined in prolog format in file f. linecount keeps track of the current line in the file.
Return a concatenation of transducers.
- transducers: An iterable object of transducers.
Return a union of transducers.
- transducers: An iterable object of transducers.
Return an intersection of transducers.
- transducers: An iterable object of transducers.
Return a composition of transducers.
- transducers: An iterable object of transducers.
Return a cross product of transducers.
- transducers: An iterable object of transducers.
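Each of these functions takes an iterable and combines its members into one result, which can be thought of as a left fold of the binary operation. The sketch below illustrates that idea for three of the operations, with finite sets of strings standing in for transducers; the helper names are hypothetical, and how the library handles an empty iterable is not covered here.

```python
from functools import reduce


def concatenate_languages(languages):
    """Fold concatenation over an iterable of string sets."""
    return reduce(lambda a, b: {x + y for x in a for y in b}, languages)


def disjunct_languages(languages):
    """Fold union over an iterable of string sets."""
    return reduce(lambda a, b: a | b, languages)


def intersect_languages(languages):
    """Fold intersection over an iterable of string sets."""
    return reduce(lambda a, b: a & b, languages)


languages = [{'a', 'b'}, {'b', 'c'}]
print(sorted(concatenate_languages(languages)))
print(sorted(disjunct_languages(languages)))
print(sorted(intersect_languages(languages)))
```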
Whether symbol symbol is a flag diacritic.
Flag diacritics are of the form
@[PNDRCU][.][A-Z]+([.][A-Z]+)?@
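The pattern above can be tried out directly with Python's `re` module. The helper `looks_like_flag_diacritic` is a hypothetical name and only mirrors the pattern as written here; the library's own check may accept a wider range of feature and value names.

```python
import re

# The flag diacritic pattern exactly as given above.
FLAG_DIACRITIC = re.compile(r'@[PNDRCU][.][A-Z]+([.][A-Z]+)?@')


def looks_like_flag_diacritic(symbol):
    """Whether the whole symbol matches the pattern."""
    return FLAG_DIACRITIC.fullmatch(symbol) is not None


for symbol in ('@U.FEATURE.VALUE@', '@D.FEATURE@', '@X.FEATURE@', 'foo'):
    print(symbol, looks_like_flag_diacritic(symbol))
```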