The aim of this project is to provide a fully reproducible and usable library for various matrix sketching tasks. The initial problem is to investigate how different random transforms can be used as preconditioners for Convex Constrained Least Squares problems (CCLSQ), particularly with a view to exploiting sparse embeddings as fast-to-compute preconditioners.
Add plots that vary the sketch size (as a function of `d`) and measure the error for a fixed matrix `A`.

Note that even if a sketch is only guaranteed to be a subspace embedding above a certain size, a smaller one can potentially be used in the IHS, which will make for an interesting comparison.

Do this for real datasets as well as synthetic data, and on synthetic data use different distributions to alter the leverage-score distribution; a minimal sketch of such an experiment is given below.
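A rough sketch of such an experiment is shown below, under the assumption of a synthetic `A` with heavy-tailed rows (to skew the leverage scores) and a plain Gaussian sketch for simplicity; any of the sketches in this repo could be substituted for `S`, and none of the names here are part of the library API.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 50

# heavy-tailed rows give a much less uniform leverage-score profile than Gaussian rows
A = rng.standard_t(df=2, size=(n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)

# sweep the sketch size m (as a multiple of d) and record the error of the
# sketch-and-solve solution against the exact least-squares solution
for m in [d, 2 * d, 5 * d, 10 * d]:
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # plain Gaussian sketch for simplicity
    x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    err = np.linalg.norm(x_sk - x_opt) / np.linalg.norm(x_opt)
    print(f"m = {m:4d}  relative error = {err:.3e}")
```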
- `git clone` the repo and `cd matrix_sketching`
- `pip install -r requirements.txt`
- `cd matrix_sketching/lib`
- `git clone https://bitbucket.org/vegarant/fastwht.git`
- Then run the install in there: `cd python`, `python setup.py`, `python test.py`
- Get the directory path for `fastwht/python`, which should be `your_path_name = */matrix_sketching/lib/fastwht/python`.
- Find `.bash_profile` (or equivalent) and add `export PYTHONPATH=$PYTHONPATH:your_path_name` as the final line of the `bash_profile`, then save and `source .bash_profile`.
- Open ipython, do `import sys` then `sys.path`, and check that `your_path_name` is displayed (a scripted version of this check is shown below).
- Go back to the `matrix_sketching` directory and run the tests.
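A quick way to script the same `sys.path` check (the path shown is hypothetical; substitute your own):

```python
import sys

# hypothetical absolute path -- substitute the actual location on your machine
your_path_name = "/home/user/matrix_sketching/lib/fastwht/python"

# True once .bash_profile has been sourced and the export has taken effect
print(your_path_name in sys.path)
```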
- Get IHS LASSO solver working. DONE [26/07/2018]
  - ~~Now incorporate into class-based method.~~ DONE [Early Aug]
  - Now include experiments with real data.
- Get IHS SVM solver working.
- Add timing functionality to the sketching objects. DONE [Early Aug]
- Include `sparseJLT` and `countGauss` methods (a rough sketch of the CountGauss transform follows this list).
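For reference, here is a hedged sketch of the CountGauss transform (a CountSketch stage followed by a small dense Gaussian projection); this is a generic illustration, not the repo's `countGauss` implementation, and the function name and `oversample` parameter are assumptions made for the example.

```python
import numpy as np

def count_gauss(A, m, oversample=4, rng=None):
    """Illustrative CountGauss: CountSketch A (assumed dense) down to an
    intermediate dimension, then apply a dense Gaussian projection to m rows."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    m_mid = oversample * m                     # intermediate CountSketch dimension

    # CountSketch stage: hash rows into buckets with random signs
    buckets = rng.integers(0, m_mid, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    CA = np.zeros((m_mid, d))
    np.add.at(CA, buckets, signs[:, None] * A)

    # Gaussian stage: dense m x m_mid projection of the (much smaller) CA
    G = rng.standard_normal((m, m_mid)) / np.sqrt(m)
    return G @ CA
```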
- Very few iterations are needed to achieve error < `10^-10` for the IHS LASSO.
- The scaling for the constrained QP is very severe as `d` increases (no surprise given that QPs are `O(d^3)` in the worst case), and in this large-`d` regime it may not be favourable to use the sketching technique.
- However, if instead one can accept a heuristic solution, then the penalised form can be used.
- How does varying the sketch size affect the convergence of the problem? So far it seems that very small sketches work fine.
- At what point, if any, do we need to switch from the fast sparse Count Sketch implementation to the dense one?
- I tested my SRHT against the PyRLA implementation (hence the extra commented code in the class definition), but mine seemed to be faster, so I used it throughout.
- Using the `_countsketch_fast` function is faster than the dense function despite having to iterate over more values (a generic sparse CountSketch is sketched below for context).
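For context, a minimal, generic CountSketch that streams over the nonzeros of a sparse input is shown below; it is not the repo's `_countsketch_fast` implementation, and the function name is illustrative.

```python
import numpy as np
from scipy import sparse

def countsketch_sparse(A, m, rng=None):
    """Illustrative CountSketch: stream over the nonzeros of a sparse A
    (shape n x d) and produce a dense m x d sketch."""
    rng = np.random.default_rng() if rng is None else rng
    A = sparse.coo_matrix(A)
    n, d = A.shape
    buckets = rng.integers(0, m, size=n)       # hash each row to a bucket
    signs = rng.choice([-1.0, 1.0], size=n)    # random sign per row
    SA = np.zeros((m, d))
    # single pass over the nonzero entries of A
    np.add.at(SA, (buckets[A.row], A.col), signs[A.row] * A.data)
    return SA
```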
- `verify_ihs_paper.py` -- script to show how Count Sketch fits into the IHS sketching scheme (a toy IHS iteration is sketched after this list).
  - [x] Unconstrained regression as `n` grows, with `d` and the sketch dimension fixed. Plotting done.
  - [x] Unconstrained regression as the number of iterations grows, with `n` and `d` fixed; the sketch types and the sketching dimension are varied. Plotting done.
  - [x] Unconstrained regression with all sketches: vary `d` and fix a corresponding `n`; sketch size and number of iterations are fixed. Plotting done.
  - [ ] Tidy up axes, labels, etc.
- `sketching_lasso_synthetic.py` -- the sketched LASSO problem on various synthetic datasets.
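To make the IHS scheme referenced above concrete, here is a hedged toy version of the unconstrained iterative Hessian sketch update (in the style of Pilanci and Wainwright), using a fresh Gaussian sketch each iteration; it is not the library's solver and all names are illustrative.

```python
import numpy as np

def ihs_least_squares(A, b, m, iters=10, rng=None):
    """Toy IHS for unconstrained least squares: each step takes a Newton-like
    step whose Hessian A^T S^T S A is built from a fresh sketch S."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        S = rng.standard_normal((m, n)) / np.sqrt(m)   # fresh sketch each iteration
        SA = S @ A
        grad = A.T @ (b - A @ x)                       # full (unsketched) gradient
        x = x + np.linalg.solve(SA.T @ SA, grad)       # sketched-Hessian Newton step
    return x

# quick check against the exact least-squares solution
rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 40))
b = A @ rng.standard_normal(40) + 0.01 * rng.standard_normal(5000)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
x_ihs = ihs_least_squares(A, b, m=400, iters=8, rng=rng)
print(np.linalg.norm(x_ihs - x_exact) / np.linalg.norm(x_exact))
```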
Run the download script to get all datasets.

- `YearPredictionMSD` -- taken from the UCI ML repo. Download and usage via `get_data.py` and the shell script in the data repo.
- `Susy`
- `KDDCup` -- nb. this is saved as an object array; it needs to be redone and saved with `dtype=np.float`.
- `rail2586` -- taken from the Florida Sparse Matrix Repo. Download the MATLAB file and save `X = (Problem.A)'`.
- `census.mat` -- downloaded from https://github.com/chocjy/randomized-quantile-regression-solvers/blob/master/matlab/data/census_data.mat
- `Rucci.mat` -- downloaded from https://sparse.tamu.edu/Rucci/Rucci1
- The `.mat` files are kept in case the data needs to be read in again.
I have used code from https://bitbucket.org/vegarant/fastwht to construct the SRHT and used https://github.com/wangshusen/PyRLA as inspiration, although more features and test suites have been added.