A set of Vagrant VirtualBox provisioning scripts for a lightweight Lubuntu-based VM targeted primarily at Azure Databricks data development using Python, R, Scala and SQL. Optionally includes a curated set of additional development tools and utilities.
The code has been tested and confirmed working with Vagrant 2.2.7 and VirtualBox 6.1.10 on a Windows 10 machine.
Skip the step where you need to become a Linux System Administrator before you can write your first line of code and get started quickly with all the key tools already pre-installed. This is also a safe, self-contained and throwaway environment for experimenting without affecting your host system.
-
Make sure you have a recent version of Vagrant and VirtualBox installed and working on your system.
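A quick way to confirm both tools are reachable from your command line is sketched below. The check_tool helper is a hypothetical name introduced for this example, and the sketch assumes a POSIX-style shell (for example Git Bash on Windows). On some hosts VBoxManage is installed but not on the PATH, so a "missing" result for it does not necessarily mean VirtualBox is absent.

```shell
# Report whether a command is available on PATH.
# check_tool is a helper defined for this sketch only.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Print the installed version of each prerequisite that was found.
for tool in vagrant VBoxManage; do
    if check_tool "$tool"; then
        "$tool" --version
    fi
done
```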
-
Clone or download this repository.
git clone https://github.com/artislismanis/lubuntu-datadev-vm.git
-
While the default configuration will produce a fully working development environment, it is good practice to review the Vagrantfile and adjust VM settings and provisioning options to suit your preferences. Your check-list could look like this:
- Set VM name and configure CPU, RAM and graphics settings to suit your host hardware.
- Specify the Databricks Runtime environment you are targeting; it defaults to the current LTS version.
- Review available provisioning steps and uncomment / comment out any features as required. Note that R packages are compiled from source and the process can take considerable time. If you specify additional packages, you need to ensure you also install any system dependencies.
- Review the provided user customisations script, which gives some examples of how it could be used to personalise the VM.
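When working through a check-list like this, it can help to keep a pristine copy of the Vagrantfile and let Vagrant check your edits before the first run. The following is a minimal sketch, assuming it is run from the project root. The backup_once helper and the .orig suffix are arbitrary choices made for this example and are not part of the repository.

```shell
# Copy a file to <file>.orig unless that backup already exists.
# backup_once and the .orig suffix are choices made for this sketch only.
backup_once() {
    [ -f "$1.orig" ] || cp "$1" "$1.orig"
}

if [ -f Vagrantfile ]; then
    backup_once Vagrantfile
    # ... edit Vagrantfile in your editor, then review what changed ...
    diff -u Vagrantfile.orig Vagrantfile || true
    # Let Vagrant check the edited file for configuration errors.
    if command -v vagrant >/dev/null 2>&1; then
        vagrant validate
    fi
fi
```

Because the backup is only taken once, running the snippet again after further edits still compares against the original file.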
-
Open a command line in the root of the project folder and run:
vagrant up
You will be prompted to install the vagrant-vbguest Vagrant plugin if it hasn't been installed already. This is a useful plugin that keeps VirtualBox Guest Additions up to date and in line with your version of VirtualBox. You will need to run vagrant up again after the plugin has been installed.
-
Wait for the provisioning to finish. End-to-end provisioning with the provided defaults takes around 20 minutes; provisioning all features takes around 1.5 hours. Once the system has been provisioned, use the default Vagrant log-in details (vagrant:vagrant) to access and use the system.
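Once provisioning has finished, the standard Vagrant commands can be used from the host to inspect and manage the VM. The sketch below assumes it is run from the project root and that the box exposes SSH in the usual Vagrant way. The in_project_root helper is a hypothetical name used only in this example. Commands that change the VM state are left commented out.

```shell
# True when the given directory (default: current) contains a Vagrantfile.
# in_project_root is a helper defined for this sketch only.
in_project_root() {
    [ -f "${1:-.}/Vagrantfile" ]
}

if in_project_root && command -v vagrant >/dev/null 2>&1; then
    vagrant status        # reports whether the VM is running, powered off or not created
    # vagrant ssh         # open a shell in the VM as the vagrant user
    # vagrant halt        # shut the VM down but keep its disk
    # vagrant destroy     # delete the VM; a later 'vagrant up' rebuilds it
fi
```

Since the environment is meant to be throwaway, destroying and re-creating the VM is a reasonable way to get back to a clean state after experimenting.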
The main concept behind this VM is to provide a system environment as similar as possible to a specific Databricks Runtime (see release notes). This is achieved using development environment management tools such as Miniconda, SDKMan!, NVM and RVM, which can also be used to adapt this VM to your more general software development needs.
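To see what each of these managers currently provides, something like the following can be run in a terminal inside the VM. This is a sketch and does not describe the exact provisioned setup: it assumes an interactive shell where the managers have been initialised (sdk and nvm are shell functions and are not available otherwise), and have is a hypothetical helper name.

```shell
# True if the named command or shell function is available in this shell.
# have is a helper defined for this sketch only.
have() {
    command -v "$1" >/dev/null 2>&1
}

if have conda; then conda env list; fi   # Python environments managed by Miniconda
if have sdk;   then sdk current;    fi   # JVM tool versions currently selected by SDKMan!
if have nvm;   then nvm ls;         fi   # Node.js versions installed through NVM
if have rvm;   then rvm list;       fi   # Ruby versions installed through RVM
```

Managers that were not selected during provisioning are skipped without an error.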
Tools like Databricks Connect and Databricks CLI are pre-installed in a Python environment that targets the Databricks Runtime specified during provisioning. This can be accessed by running the following on the command line:
conda activate databricks
Read the relevant documentation to understand how to configure these tools to work with your Databricks cluster.
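As a rough first check inside the VM, the sketch below activates the environment only if it exists and prints the CLI version. It assumes an interactive terminal where conda activate works, and that the installed tools are the pip-installed Databricks CLI and Databricks Connect packages; command names may differ in other versions. The has_conda_env helper is a hypothetical name. The interactive configuration commands are left commented out because they prompt for your workspace details.

```shell
# True if a conda environment with the given name exists.
# has_conda_env is a helper defined for this sketch only.
has_conda_env() {
    command -v conda >/dev/null 2>&1 &&
        conda env list | awk '{print $1}' | grep -qx "$1"
}

if has_conda_env databricks; then
    conda activate databricks
    databricks --version              # confirms the Databricks CLI is on the PATH
    # databricks configure --token    # prompts for workspace URL and access token
    # databricks-connect configure    # prompts for host, token and cluster details
    # databricks-connect test         # checks connectivity once configured
fi
```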
More detail on what's included and step-by-step getting started guides will be provided over time in the project wiki. In the meantime, check out the 'Useful Resources' section below if you get stuck.