@chapter Bag Of Words Library README
@c set the vars BOW_VERSION
@include version.texi
@samp{libbow}, version @value{BOWVERSION}.
@include libbow-desc.texi
@section Rainbow
@samp{Rainbow} is a standalone program that does document
classification. Here are some examples:
@itemize @bullet
@item
@example
rainbow -i ./training/positive ./training/negative
@end example
Using the text files found under the directories
@file{./training/positive} and @file{./training/negative},
tokenize, build word vectors, and write the resulting data structures
to disk.
@item
@example
rainbow --query=./testing/254
@end example
Tokenize the text document @file{./testing/254}, and classify it,
producing output like:
@example
/home/mccallum/training/positive 0.72
/home/mccallum/training/negative 0.28
@end example
@item
@example
rainbow --test-set=0.5 -t 5
@end example
Perform 5 trials. Each trial makes a new random test/train split,
holding out half of the documents for testing, and outputs the
classification of each test document.
@end itemize
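The classification output shown in the second example above, one class
directory and one score per line, is easy to post-process. The
following Python sketch assumes that every output line has exactly that
shape; the real output of @samp{rainbow} may carry more fields. The
function names @code{parse_scores} and @code{best_class} are
hypothetical and not part of @samp{libbow}.

```python
def parse_scores(text):
    """Parse lines of the form '<class path> <score>' into (path, score) pairs."""
    scores = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace so a path with spaces survives.
        path, score = line.rsplit(None, 1)
        scores.append((path, float(score)))
    return scores


def best_class(scores):
    """Return the (path, score) pair with the highest score."""
    return max(scores, key=lambda pair: pair[1])


# Sample text copied from the example output above.
sample = (
    "/home/mccallum/training/positive 0.72\n"
    "/home/mccallum/training/negative 0.28\n"
)

scores = parse_scores(sample)
print(best_class(scores))
```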
Typing @samp{rainbow --help} lists all rainbow options.
After you have compiled @samp{libbow} and @samp{rainbow}, you can run
the shell script @file{./demo/script} to see an annotated demonstration
of the classifier in action.
More information and documentation are available at
@uref{http://www.cs.cmu.edu/~mccallum/bow}
@format
Rainbow improvements coming eventually:
Better documentation.
Incremental model training.
@end format
@section Arrow
@samp{Arrow} is a standalone program that does document retrieval by
TFIDF.
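For readers unfamiliar with TFIDF retrieval, the following Python sketch
shows one common variant: raw term frequency times
log(N / document frequency), with documents ranked by cosine similarity
to the query. It is an illustration only. The exact weighting that
@samp{arrow} uses is not described here, and all function names below
are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / number of documents containing it)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return dict((term, math.log(n / count)) for term, count in df.items())


def weigh(tokens, idf):
    """Turn a token list into a term -> TF-IDF weight mapping."""
    tf = Counter(tokens)
    # Terms never seen in the indexed documents are ignored.
    return dict((t, tf[t] * idf[t]) for t in tf if t in idf)


def cosine(a, b):
    """Cosine similarity of two sparse weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_tokens, docs):
    """Return document indices, best match first."""
    idf = idf_table(docs)
    query = weigh(query_tokens, idf)
    scored = [(cosine(query, weigh(d, idf)), i) for i, d in enumerate(docs)]
    return [i for score, i in sorted(scored, key=lambda pair: -pair[0])]
```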
Index all the documents in directory @file{foo} by typing
@example
arrow --index foo
@end example
Make a single query by typing
@example
arrow --query
@end example
then typing your query, and pressing Control-D.
If you want to make many queries, it is more efficient to run arrow as
a server and query it repeatedly through a socket, without restarting
it each time. Type, for example,
@example
arrow --query-server=9876
@end example
Then connect to it on port 9876. For example:
@example
telnet localhost 9876
@end example
In this mode there is no need to press Control-D to end a query. Simply
type your query on one line, and press return.
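The same exchange can be scripted instead of typed into
@samp{telnet}. The following Python sketch sends one query line to a
server assumed to be listening on port 9876 of the local machine.
The format and framing of the reply are not documented here, so the
sketch simply reads until the server closes the connection or stays
silent for a short timeout; that behavior is an assumption. The names
@code{encode_query} and @code{query_arrow} are hypothetical.

```python
import socket


def encode_query(query):
    """Encode one query as a single newline-terminated line."""
    if "\n" in query or "\r" in query:
        raise ValueError("a query must fit on one line")
    return (query + "\n").encode("utf-8")


def query_arrow(query, host="localhost", port=9876, timeout=2.0):
    """Send one query line and return whatever text the server sends back.

    Assumption: the reply framing is unknown, so reading stops when the
    server closes the connection or sends nothing for `timeout` seconds.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_query(query))
        try:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            # The server went quiet; treat what was received as the reply.
            pass
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With a server started as shown above, calling
@code{query_arrow("machine learning")} would return the raw reply text.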
@section Crossbow
@samp{Crossbow} is a standalone program that does document clustering.
Sorry, there is no documentation yet.
@section Archer
@samp{Archer} is a standalone program that does document retrieval with
AltaVista-type queries, using @samp{+}, @samp{-}, @samp{""}, etc. The
commands in the @samp{arrow} examples above also work for @samp{archer}.
See @samp{archer --help} for more information.