Skip to content

Latest commit

 

History

History
446 lines (334 loc) · 13.9 KB

README.md

File metadata and controls

446 lines (334 loc) · 13.9 KB

Tests Status Gem Version

CompSci

Provided are some toy implementations for some basic computer science problems.

Node classes

Node

A Node provides a tree structure by assigning other Nodes to @children.

  • Node

    • @value
    • @children
    • #[] - child node at index
    • #[]= - set child node at index
    • display - display full tree with all descendents
  • examples/tree.rb

  • test/node.rb

KeyNode

A KeyNode adds @key and allows a comparison based search on the key. Binary search trees are supported, and the @duplicated flag determines whether duplicate keys are allowed to be inserted. Any duplicates will not be returned from #search. A Ternary search tree is also supported, and it inherently allows duplicates keys. #search returns an array of KeyNode, possibly empty.

ChildNode

A ChildNode adds reference to its @parent.

CompleteTree classes

Efficient Array implementation of a complete tree uses arithmetic to determine parent/child relationships.

Heap class

CompleteTree implementation. Both minheaps and maxheaps are supported. Any number of children may be provided via child_slots. The primary operations are Heap#push and Heap#pop. My basic Vagrant VM gets over 500k pushes per second, constant up past 1M pushes.

BitSet class

Basics

A bitmap, or binary value, used as a data structure to hold a large number of bits. These bits can track the on/off status for various items. If we have 1000 students, then a 1024-bit bitmap can track whether each student attended school on a particular day, so long as we can assign a student to each bit somehow.

Implementation

An array of integers, typically 32-bit or 64-bit integers. The fundamental operations revolve around providing a bit index to change or query a bit's status. Some light arithmentic is needed to address different bits across different array slots.

Example

require 'compsci/bit_set'

include CompSci

# 512 bits  /  64 bytes  /  8 integers (64 bit)
b = BitSet.new(512)

# turn on the first 100 bits
100.times { |i| b.add i }

b.include?(400) # => false
b.include?(0)   # => true
b.include?(95)  # => true

b.ratio         # => (25/128)
b.storage       # => [18446744073709551615, 68719476735, 0, 0, 0, 0, 0, 0]
b.to_s          # => "19.5% positive (64.0 B)"

Basics

A Bloom filter is a probabilistic data structure which always yields true negatives and has a variable false positive rate. It is like a mathematical set, in that it tracks whether items have been encountered previously. Occasionally, it will report that it has seen an item when truly it has confused the queried item with a different item that it had seen previously.

Operation

The fundamental operation of a Bloom filter is to convert a string into a short series of numbers: bit indices. For example, "asdf" might convert to (1,2,3,4) or even (79, 954, 22). Hashing algorithms are used in this conversion process. These bit indices indicate which bits in a bitmap (a long string of 1s and 0s) to "turn on". The bitmap starts out empty, or zeroed out. Maybe it has 1024 bits. If "asdf" converts to (79, 954, 22), then we will turn on those corresponding bits.

Now, we want to check if we have seen the string "qwerty". "qwerty" converts, hypothetically, to (82, 105, 25). We check if those bits are "on" in the bitmap and respond to the query accordingly.

False Positives

There are two distinct sources of false positives:

  1. Two different strings map to the exact same bit indices
  2. So many bits have been flipped on, from other additions, that the queried bits have already been flipped on

As the filter "fills up", the false positive rate goes up until all bits are 1 and every query results in a positive, false or not.

Performance

Performance has 2 main aspects:

  1. Logical performance: does it meet or exceed requirements?
  2. Runtime performance: at what cost in storage / time / cycles?

A Bloom filter that is too small will fill up quickly and stop meeting false postive rate (FPR) goals. If there are too many hashing rounds, then extra work is being done, and the filter fills up quicker with more bits being set.

Some rules of thumb:

  • 1% is a reasonable FPR
  • 7 hashes are required to meet 1% FPR for the fewest bits
  • N items require N bytes (N * 2^3 bits) of storage around 1% FPR
  • 5 or 6 hashes might be preferable at the cost of additional storage

When choosing Bloom filter parameters, it is best to start with requirements, typically formulated in terms of:

  • item count
  • acceptable false positive rate
  • hashing limits (e.g. 5 rounds instead of 7)
  • storage limits

Hashing

When hashing a string, we ultimately want an integer value which, modulo num_bits, will be a bit index into an array of bits. The ideal hashing algorithm will approximate the uniform distribution of possible hash values while prioritizing speed and minimizing storage and cycles. A single bit flip, or byte increment, in the input string should result in a large change in the output hash value.

By default, this library uses the CRC32 algorithm, which is not actually a hashing algorithm but instead a checksum algorithm. It has less uniformity guarantees than most hashing algorithms, but it still may yield a wide enough distribution for our purposes.

We cannot use the entire fidelity of the hash value, which may be something like 64 bytes. If the bitmap is only 16 bytes, then modulo is already shrinking our domain of values. CRC32 yields a 32-bit value, which is 4 bytes.

In general, this library uses the last (least significant) 32 bits of any given hash value as the integer to be modulo'd. As long as your Bloom filter is under half a gigabyte, the modulo arithmetic dominates the domain reduction.

Usage

Included Scripts

Basics

  • Zlib.crc32 is used by default
    • not true hashing but very fast and close enough
    • able to feed the hash state forward for cyclic hashing
    • Ruby's stdlib
  • Digest:MD5 and friends are available
    • cryptographic hashes are not ideal for Bloom filters
    • they beat CRC32 on uniformity but lose on performance
    • Ruby's stdlib
    • require 'compsci/bloom_filter/digest'
    • CompSci::BloomFilter::Digest.new
  • OpenSSL::Digest is available
    • more algos and supposedly more (thread)safe than Digest
    • slightly slower than Digest
    • Ruby's stdlib
    • require 'compsci/bloom_filter/openssl'
    • CompSci::BloomFilter::OpenSSL.new

Organizing the library this way keeps the default path very small and fast:

require 'compsci/bloom_filter'
# or, require 'compsci/bloom_filter/digest'   # for digest
# or, require 'compsci/bloom_filter/openssl'  # for both digest and openssl

bf = CompSci::BloomFilter.new
# or, bf = CompSci::BloomFilter::Digest.new  # or, BloomFilter::OpenSSL.new

# add strings: "1" up to "999"
999.times { |i|
  bf.add(i.to_s)
}

bf.include?('asdf')
#=> false
bf
#=> 65536 bits (8.0 kB, 5 hashes) 7% full; FPR: 0.000%
bf.fpr
#=> 1.97087379660843e-06
bf.include?('123')
#=> true
bf['123']
#=> 0.9999980291262034
bf['asdf']
#=> 0

Fibonacci module

  • Fibonacci.classic(n) - naive, recursive

  • Fibonacci.cache_recursive(n) - as above, caching already computed results

  • Fibonacci.cache_iterative(n) - as above but iterative

  • Fibonacci.dynamic(n) - as above but without a cache structure

  • Fibonacci.matrix(n) - matrix is magic; beats dynamic around n=500

  • test/fibonacci.rb

  • test/bench/fibonacci.rb

Timer module

require 'compsci/timer'

include CompSci

overall_start = Timer.now

start = Timer.now
print "running sleep 0.01 (50x): "
_answer, each_et = Timer.loop_avg(count: 50) {
  print '.'
  sleep 0.01
}
puts
puts "each: %0.3f" % each_et
puts "elapsed: %0.3f" % Timer.since(start)
puts "cumulative: %0.3f" % Timer.since(overall_start)
puts


start = Timer.now
print "running sleep 0.02 (0.3 s): "
_answer, each_et = Timer.loop_avg(seconds: 0.3) {
  print '.'
  sleep 0.02
}
puts
puts "each: %0.3f" % each_et
puts "elapsed: %0.3f" % Timer.since(start)
puts "cumulative: %0.3f" % Timer.since(overall_start)
puts
running sleep 0.01 (50x): ..................................................
each: 0.010
elapsed: 0.524
cumulative: 0.524

running sleep 0.02 (0.3 s): ...............
each: 0.020
elapsed: 0.304
cumulative: 0.828

Fit module

  • Fit.sigma - sums the result of a block applied to array values

  • Fit.error - returns a generic r^2 value, the coefficient of determination

  • Fit.constant - fits y = a + 0x; returns the mean and variance

  • Fit.logarithmic - fits y = a + b*ln(x); returns a, b, r^2

  • Fit.linear - fits y = a + bx; returns a, b, r^2

  • Fit.exponential - fits y = ae^(bx); returns a, b, r^2

  • Fit.power - fits y = ax^b; returns a, b, r^2

  • Fit.best - applies known fits; returns the fit with highest r^2

  • test/bench/complete_tree.rb

  • test/fit.rb

Names module

This helps map a range of small integers to friendly names, typically in alphabetical order.

  • UPPER LOWER SYMBOLS CHAR_MAP LATIN_SYMBOLS SYMBOLS26
  • Names::Greek.upper
  • Names::Greek.lower
  • Names::Greek.sym
  • Names::Pokemon.array
  • Names::Pokemon.hash
  • Names::Pokemon.grep
  • Names::Pokemon.sample

Simplex class

WORK IN PROGRESS; DO NOT USE

The Simplex algorithm is a technique for Linear programming. Typically the problem is to maximize some linear expression of variables given some constraints on those variables in terms of linear inequalities.

  • Parse.tokenize - convert a string to an array of tokens

  • Parse.term - parse certain tokens into [coefficient, varname]

  • Parse.expression - parse a string representing a sum of terms

  • Parse.inequality - parse a string like "#{expression} <= #{const}"

  • test/simplex_parse.rb

With Simplex::Parse, one can obtain solutions via:

  • Simplex.maximize - takes an expression to maximize followed by a variable number of constraints / inequalities; returns a solution
  • Simplex.problem - a more general form of Simplex.maximize; returns a Simplex object
require 'compsci/simplex/parse'

include CompSci

Simplex.maximize('x + y',
                 '2x + y <= 4',
		 'x + 2y <= 3')

# => [1.6666666666666667, 0.6666666666666666]



s = Simplex.problem(maximize: 'x + y',
                    constraints: ['2x + y <= 4',
                                  'x + 2y <= 3'])

# => #<CompSci::Simplex:0x0055b2deadbeef
#      @max_pivots=10000,
#      @num_non_slack_vars=2,
#      @num_constraints=2,
#      @num_vars=4,
#      @c=[-1.0, -1.0, 0, 0],
#      @a=[[2.0, 1.0, 1, 0], [1.0, 2.0, 0, 1]],
#      @b=[4.0, 3.0],
#      @basic_vars=[2, 3],
#      @x=[0, 0, 4.0, 3.0]>

s.solution

# => [1.6666666666666667, 0.6666666666666666]