Carolina Higuera*, Akash Sharma*, Chaithanya Krishna Bodduluri, Taosha Fan, Patrick Lancaster, Mrinal Kalakrishnan, Michael Kaess, Byron Boots, Mike Lambeta, Tingfan Wu, Mustafa Mukadam
*Equal contribution
AI at Meta, FAIR; The Robotics Institute, CMU; University of Washington
Sparsh is a family of general touch representations trained via self-supervision algorithms such as MAE, DINO, and JEPA. Sparsh is able to generate useful representations for DIGIT, GelSight'17, and GelSight Mini. It outperforms end-to-end models on the downstream tasks proposed in TacBench by a large margin, and it enables data-efficient training for new downstream tasks. This repository contains the PyTorch implementation, pre-trained models, and datasets released with Sparsh.
Clone this repository:
git clone https://github.com/facebookresearch/sparsh.git
cd sparsh
and create a conda environment with dependencies:
mamba env create -f environment.yml
mamba activate tactile_ssl
Pretrained model weights are available for download from our Hugging Face: facebook/sparsh
| model | small | base |
|---|---|---|
| Sparsh (MAE) | backbone | backbone only |
| Sparsh (DINO) | backbone | backbone |
| Sparsh (DINOv2) | — | backbone |
| Sparsh (IJEPA) | backbone | backbone |
| Sparsh (VJEPA) | backbone | backbone |
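If you prefer to fetch a checkpoint programmatically, the standard `huggingface_hub` client works. The sketch below is a minimal example; the filename is a placeholder, so take the actual file path from the listing on the facebook/sparsh model page.

```python
# Minimal sketch: download a Sparsh checkpoint from the Hugging Face Hub.
# The filename below is a placeholder; check facebook/sparsh for the actual paths.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="facebook/sparsh",
    filename="sparsh_dino_base.ckpt",  # placeholder: use the file listed on the model page
)
print(f"Checkpoint downloaded to {ckpt_path}")
```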
For pretraining, we curate datasets from multiple sources containing unlabeled data from DIGIT and GelSight sensors.
For DIGIT, the dataset is a mixture of the YCB-Slide dataset and in-house collected data: Touch-Slide. It contains approximately 360k tactile images, together with a diverse set of no-contact (background) images.
For GelSight, we use open-source datasets available online: Touch and Go and ObjectFolder-Real.
To download the dataset, please edit `path_dataset` in the bash script `scripts/download_digitv1_dataset.sh`. This will download and extract the data in `path_dataset` for both the YCB-Slide and Touch-Slide datasets.
The structure of the dataset is:
digitv1/Object-Slide
├── object_0            # e.g., 004_sugar_box
│   ├── dataset_0.pkl
│   ├── dataset_1.pkl
│   ├── dataset_2.pkl
│   ├── dataset_3.pkl
│   └── dataset_4.pkl
├── object_1            # e.g., bread
├── ...
└── bgs
    ├── bg_0.jpg
    ├── ...
    └── bg_18.jpg
The `bgs/` folder contains no-contact (background) images from different DIGIT sensors, which are required for pre-processing the data. If you are adding new tactile data, please also add background images from your own sensor to this folder.
To load this dataset, use `tactile_ssl/data/vision_tactile.py`.
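For training you should rely on the dataloader above; if you just want to peek at the raw files, a plain `pickle` load is enough. The sketch below is a minimal example and makes no assumptions about the internal layout of each `dataset_*.pkl`; it only prints the top-level structure.

```python
# Minimal sketch: inspect a raw pretraining pickle without the tactile_ssl pipeline.
# The internal layout of each dataset_*.pkl may vary; this only reports what is stored.
import pickle

path = "digitv1/Object-Slide/004_sugar_box/dataset_0.pkl"  # adjust to your path_dataset
with open(path, "rb") as f:
    data = pickle.load(f)

print(type(data))
if isinstance(data, dict):
    for key, value in data.items():
        print(key, type(value))
elif isinstance(data, (list, tuple)):
    print(f"{len(data)} entries, first entry type: {type(data[0])}")
```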
We use Touch and Go to pretrain on GelSight'17 (with markers). The dataset consists of short video clips of the sensor making contact with in-the-wild objects. We use all frames from these clips, including no-contact frames. We do not perform any preprocessing, since the markers contain relevant static shear information.
We also use sequences from the ObjectFolder-Real dataset for pre-training. We preprocess the data by extracting only the tactile images (GelSight Mini), as we do not use the other modalities.
We provide a script to download a preprocessed version of these datasets that is compatible with our pipeline. To do so, edit `path_dataset` in the bash script `scripts/download_gelsight_dataset.sh` and run it. This will download and extract the data.
The structure of the dataset is:
gelsight/touch_go
├── 20220410_031604.pkl
├── 20220410_031843.pkl
├── ...
└── 20220607_133934.pkl

gelsight/object_folder
├── 001.pkl
├── 002.pkl
├── ...
└── 051.pkl
If you would like to download the data directly from Touch and Go and ObjectFolder-Real, run the bash scripts `scripts/download_datasets_scratch/download_gelsight_object_folder.sh` and `scripts/download_datasets_scratch/download_gelsight_touchgo.sh`. Then make the data compatible with our pipeline by running the Python scripts `scripts/download_datasets_scratch/compress_object_folder.py` and `scripts/download_datasets_scratch/compress_touch_go.py`. Please update the corresponding paths in all scripts accordingly.
To load this dataset, use `tactile_ssl/data/vision_tactile.py`.
We open-source the data that we collected in-house for the force estimation, slip detection, and pose estimation downstream tasks. The datasets can be downloaded from the Sparsh collection on Hugging Face:
- Force estimation and Slip detection: DIGIT, GelSight Mini
- Pose estimation: DIGIT
Please place these datasets in a directory designated for hosting all downstream task datasets.
This dataset contains paired tactile and force data, intended for use in predicting the 3-axis normal and shear forces applied to the sensor's elastomer. We used three different indenter shapes to collect force-labeled data: hemisphere, sharp, and flat. To measure ground-truth forces, we employed an ATI Nano17 force/torque sensor. The protocol consisted of applying a random normal load followed by a shear load, achieved by sliding the probe 2 mm on the sensor's elastomer.
The dataset consists of a collection of normal/shear load trajectories for each probe. The structure is as follows (example for the DIGIT dataset):
T1_force/digit/sphere
├── batch_1
│   ├── dataset_digit_00.pkl
│   ├── ...
│   ├── dataset_digit_03.pkl
│   └── dataset_slip_forces.pkl
├── batch_2
│   └── ...

T1_force/digit/flat
├── batch_1
│   ├── dataset_digit_00.pkl
│   ├── ...
│   ├── dataset_digit_03.pkl
│   └── dataset_slip_forces.pkl
├── ...

T1_force/digit/sharp
└── ...
For each batch:
- `dataset_digit_xy.pkl`: contains the binarized tactile images only.
- `dataset_slip_forces.pkl`: a dictionary where each key represents a sliding trajectory. Each trajectory has the corresponding force and slip labels.
To load this dataset (DIGIT and GelSight Mini), use `tactile_ssl/data/vision_based_forces_slip_probes.py`.
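As a quick sanity check on the labels, `dataset_slip_forces.pkl` can be opened directly. The sketch below is a hedged example that relies only on the structure described above (a dictionary keyed by sliding trajectory); the exact field names inside each trajectory are not assumed, so it just prints whatever is stored.

```python
# Minimal sketch: list the sliding trajectories and their label fields in one batch.
# Field names inside each trajectory are not assumed; we only print what is there.
import pickle

path = "T1_force/digit/sphere/batch_1/dataset_slip_forces.pkl"  # adjust to your dataset root
with open(path, "rb") as f:
    trajectories = pickle.load(f)

print(f"{len(trajectories)} sliding trajectories")
first_key = next(iter(trajectories))
sample = trajectories[first_key]
print("example trajectory:", first_key)
print("label fields:", list(sample.keys()) if isinstance(sample, dict) else type(sample))
```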
This dataset contains time-synchronized pairs of DIGIT images and SE(3) object poses. In our setup, the robot hand is stationary with its palm facing downwards and pressing against the object on a table. The robot hand has DIGIT sensors mounted on the index, middle, and ring fingertips, all of which are in contact with the object. A human manually perturbs the object's pose by translating and rotating it in SE(2). We use tag tracking to obtain the object's pose. We collect data using two objects: a Pringles can and the YCB sugar box, both of which have a tag fixed to their top surfaces.
The dataset is a collection of such sequences. Each sequence corresponds to a pickle file containing the following labeled data:
- DIGIT tactile images for index, middle and ring fingers
- Object pose tracked from tag in format (x, y, z, qw, qx, qy, qz)
- Robot hand joint positions
- `object_index_rel_pose_n5`: the pose change within the last 5 samples as a transformation matrix, with the object pose expressed with respect to the index finger.
- `object_middle_rel_pose_n5`: the pose change within the last 5 samples as a transformation matrix, with the object pose expressed with respect to the middle finger.
- `object_ring_rel_pose_n5`: the pose change within the last 5 samples as a transformation matrix, with the object pose expressed with respect to the ring finger.

We also provide reference (no-contact) images for each of the DIGITs to facilitate pre-processing such as background subtraction.
T3_pose/digit/train
├── pringles
│   ├── bag_00.pkl
│   ├── ...
│   ├── bag_37.pkl
│   └── bag_38.pkl
└── sugar
    └── ...

T3_pose/digit/test
├── pringles
│   ├── bag_00.pkl
│   ├── ...
│   ├── bag_05.pkl
│   └── bag_06.pkl
└── sugar
    └── ...

T3_pose/digit/bgs
├── digit_index.png
├── digit_middle.png
└── digit_ring.png
To load this dataset, use `tactile_ssl/data/vision_based_pose_probes.py`.
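The relative-pose labels are transformation matrices. If you want to reproduce or verify them from the raw (x, y, z, qw, qx, qy, qz) poses, the sketch below shows the standard computation. It is a hedged example under the assumption that the relative pose is T_rel = T_{t-5}^{-1} T_t; the dataloader in `tactile_ssl/data/vision_based_pose_probes.py` remains the reference implementation.

```python
# Minimal sketch: build a 4x4 transform from (x, y, z, qw, qx, qy, qz) and compute
# the pose change over the last 5 samples. Assumes T_rel = inv(T_{t-5}) @ T_t;
# tactile_ssl/data/vision_based_pose_probes.py is the reference implementation.
import numpy as np
from scipy.spatial.transform import Rotation as R


def pose_to_matrix(pose):
    x, y, z, qw, qx, qy, qz = pose
    T = np.eye(4)
    T[:3, :3] = R.from_quat([qx, qy, qz, qw]).as_matrix()  # scipy expects (x, y, z, w)
    T[:3, 3] = [x, y, z]
    return T


def relative_pose(pose_prev, pose_curr):
    return np.linalg.inv(pose_to_matrix(pose_prev)) @ pose_to_matrix(pose_curr)


# Example: pose change between sample t-5 and sample t of a sequence.
poses = np.random.rand(10, 7)  # stand-in for the tag-tracked poses in a bag_*.pkl
poses[:, 3:] /= np.linalg.norm(poses[:, 3:], axis=1, keepdims=True)  # normalize quaternions
T_rel = relative_pose(poses[-6], poses[-1])
print(T_rel)
```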
We use the Feeling of Success dataset. It contains approximately 9k grasp trials on over 100 objects, collected with GelSight'17 sensors mounted on a parallel gripper.
You can download the data directly from the dataset webpage, or run the bash script `scripts/download_datasets_scratch/download_gelsight_feeling_success.sh` to download the data and then the Python script `scripts/download_datasets_scratch/compress_feeling_success.py` to preprocess it into a format compatible with our pipeline. Please update the paths in the scripts accordingly.
Please download the Clothing dataset. The dataset consists of 4467 short video clips (10-25 frames) of a robot with a GelSight'17 sensor grasping several types of textiles (20 classes), such as leather, cotton, and polyester.
We use Hydra for configuration management in this repository. The configuration files are located in `config/`.
The config folder is organized as follows:
├── config
│   ├── data          # contains dataset configs
│   ├── experiment    # contains the full config for a specific experiment (e.g., Sparsh (DINO) or a downstream task)
│   ├── model         # contains configs for each SSL algorithm
│   ├── paths         # add your config here with paths to datasets / checkpoints / outputs / etc.
│   ├── task          # contains downstream task configs for each downstream task in TacBench
│   ├── wandb         # add your wandb config here for experiment tracking
│   └── default.yaml  # the default SSL training config, overridden by experiment/dino_vit.yaml and the like
Following are the instructions to train a Sparsh model:
- Set up the pretraining datasets according to the instructions above.
- Add a `paths/${YOUR_PATHS}.yaml` config similar to the existing examples and point it to your data root; similarly, add a `wandb/${YOUR_CONFIG}.yaml` config.
- Choose an experiment, for example `dino_vit.yaml`, and use the following script.
You may need to adjust batch size according to your GPU. All training experiments were done with 8 A100-80GB GPUs.
python train.py +experiment=${YOUR_EXP_NAME} paths=${YOUR_PATHS} wandb=${YOUR_CONFIG}
For training downstream tasks, in our paper we largely follow frozen evaluation: we freeze the weights of the Sparsh encoder and train only a lightweight decoder for each downstream task.
Training downstream tasks is quite similar to the instructions above, but it additionally requires a pre-trained model checkpoint, which can be specified by updating the `task.checkpoint_encoder` field in the config. Downstream tasks also need a labeled dataset for the corresponding task.
Use the following script to train downstream tasks:
python train_task.py --config-name=experiment/downstream_task/${EXPERIMENT} paths=${YOUR_PATH_CONFIG} wandb=${YOUR_WANDB_CONFIG}
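For intuition on the frozen-evaluation setup described above, here is a minimal PyTorch sketch using assumed, generic module names (`encoder`, `decoder`); the actual models and attach points are defined by the task configs and `train_task.py`.

```python
# Minimal sketch of frozen evaluation: the Sparsh encoder stays fixed and only a
# lightweight task decoder is optimized. Module names and sizes are placeholders.
import torch

encoder = torch.nn.Linear(768, 768)   # stands in for a pre-trained Sparsh backbone
decoder = torch.nn.Linear(768, 3)     # stands in for a lightweight task head (e.g., 3-axis force)

encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False           # freeze the encoder weights

optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

tactile_features = torch.randn(8, 768)   # placeholder batch of inputs
targets = torch.randn(8, 3)              # placeholder labels
with torch.no_grad():
    z = encoder(tactile_features)        # no gradients flow into the encoder
optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(decoder(z), targets)
loss.backward()
optimizer.step()
```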
Once you have trained the decoders for each downstream task, you can test the checkpoints using the `test_task.py` script, which follows the same format as above. For convenience, we also provide a `submit_task.sh` bash script to train and test downstream tasks if you are using a SLURM-based cluster.
Finally, we also provide `tacbench_report.ipynb`, which computes metrics for all downstream tasks once the results have been generated by the `test_task.py` script.
For testing Sparsh (DINO) + the force field decoder live, you only need one DIGIT or GelSight Mini sensor. Follow these steps to run the demo:
- Create a folder for downloading the task checkpoints, for example `${YOUR_PATH}/outputs_sparsh/checkpoints`.
- Download the decoder checkpoints from Hugging Face for DIGIT and GelSight Mini.
- Connect the sensor to your PC. For DIGIT, please make sure you have digit-interface installed (a minimal connection check is sketched after this list).
- Make sure the device is recognized by the OS (on Linux, you can use Cheese to view the video that the sensor is streaming).
- Run the demo for DIGIT:

python demo_forcefield.py +experiment=digit/downstream_task/forcefield/digit_dino paths=${YOUR_PATH_CONFIG} paths.output_dir=${YOUR_PATH}/outputs_sparsh/checkpoints/ test.demo.digit_serial=${YOUR_DIGIT_SERIAL}

The DIGIT serial number is printed on the back of the sensor and has the format `DXXXXX`.
- Run the demo for GelSight Mini:

python demo_forcefield.py +experiment=digit/downstream_task/forcefield/gelsight_dino paths=${YOUR_PATH_CONFIG} paths.output_dir=${YOUR_PATH}/outputs_sparsh/checkpoints/ test.demo.gelsight_device_id=${YOUR_GELSIGHT_VIDEO_ID}

The GelSight Mini is recognized as a webcam. You can get the video ID by running `ls -l /dev/video*` in a terminal.
- Take the sensor and slide it across the edge of a table, or across objects with interesting textures! Look at the normal field to localize where you're making contact on the sensor's surface. Look at the shear field to gather an intuition about the direction of the shear force that you applied while sliding the sensor. For example, slide the sensor over an edge up and down to get translational shear or rotate the sensor in place to see torsional slip!
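To confirm the DIGIT stream before launching the demo (the connection check referenced in the steps above), a short test with the digit-interface package is often useful. The sketch below is a minimal example that assumes the standard digit-interface API (`Digit`, `connect`, `get_frame`) and uses the placeholder serial D00001; substitute your own.

```python
# Minimal sketch: grab a single frame from a DIGIT to confirm the connection.
# Requires the digit-interface package; replace D00001 with your sensor's serial.
from digit_interface import Digit

digit = Digit("D00001")
digit.connect()
frame = digit.get_frame()  # image array from the sensor stream
print("frame shape:", frame.shape)
digit.disconnect()
```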
This project is licensed under the terms in the LICENSE file.
If you find this repository useful, please consider giving a star ⭐ and a citation:
@inproceedings{
higuera2024sparsh,
title={Sparsh: Self-supervised touch representations for vision-based tactile sensing},
author={Carolina Higuera and Akash Sharma and Chaithanya Krishna Bodduluri and Taosha Fan and Patrick Lancaster and Mrinal Kalakrishnan and Michael Kaess and Byron Boots and Mike Lambeta and Tingfan Wu and Mustafa Mukadam},
booktitle={8th Annual Conference on Robot Learning},
year={2024},
url={https://openreview.net/forum?id=xYJn2e1uu8}
}
We thank Ishan Misra and Mahmoud Assran for insightful discussions on SSL for vision that informed this work, and Changhao Wang, Dhruv Batra, Jitendra Malik, Luis Pineda, and Tess Hellebrekers for helpful discussions on the research.
We also thank the teams behind the YCB-Slide, Touch and Go, ObjectFolder-Real, Feeling of Success, and Clothing datasets for contributing to the research community by open-sourcing their tactile data.