Read the paper here. (now online!)
Explore our results interactively here.
The automatic classification of X-ray detections is a necessary step in extracting astrophysical information from compiled catalogs of astrophysical sources. Classification is useful for the study of individual objects, statistics for population studies, as well as for anomaly detection, i.e., the identification of new unexplored phenomena, including transients and spectrally extreme sources. Despite the importance of this task, classification remains challenging in X-ray astronomy due to the lack of optical counterparts and representative training sets. We develop an alternative methodology that employs an unsupervised machine learning approach to provide probabilistic classes to Chandra Source Catalog sources with a limited number of labeled sources, and without ancillary information from optical and infrared catalogs. We provide a catalog of probabilistic classes for 8,756 sources, comprising a total of 14,507 detections, and demonstrate the success of the method at identifying emission from young stellar objects, as well as distinguishing between small-scale and large-scale compact accretors with a significant level of confidence. We investigate the consistency between the distribution of features among classified objects and well-established astrophysical hypotheses such as the unified AGN model. This provides interpretability to the probabilistic classifier. Code and tables are available publicly through GitHub. We provide a web playground for readers to explore our final classification at playground.
Authors:
- Víctor Samuel Pérez-Díaz1,2*
  E-mail: [email protected]
- Juan Rafael Martínez-Galarza1
- Alexander Caicedo3,4
- Raffaele D'Abrusco1
Affiliations:
1. Center for Astrophysics | Harvard & Smithsonian, 60 Garden Street, Cambridge, MA 02138, USA
2. School of Engineering, Science and Technology, Universidad del Rosario, Cll. 12C No. 6-25, Bogotá, Colombia
3. Department of Electronics Engineering, Pontificia Universidad Javeriana, Cra. 7 No. 40-62, Bogotá, Colombia
4. Ressolve, Cra. 42 # 5 Sur - 145, Medellín, Colombia
Note: Because the SIMBAD database changes over time, the exact results presented in the paper may not be reproducible from a fresh crossmatch. Our crossmatch was performed in August 2022. We provide the original cluster_csc_simbad.csv file. With this, you can reproduce the same result starting from step 3.
For exact reproducibility, we suggest installing the package versions listed in the environment.txt file.
Follow the steps below to execute the research pipeline:
1. Execute the clustering script to generate clusters.
   python3 clustering.py
2. Crossmatch the generated cluster_csc.csv file with the SIMBAD database within a 1" radius, choosing the Best match. Retain all rows in the cluster_csc.csv set. Use TOPCAT for a convenient crossmatch process.
3. Execute the classification script.
   python3 classification.py
4. Execute the master classification script.
   python3 master_classification.py
5. Inspect the resulting tables in the /out_data directory:
   - detection_level_classification.csv
   - uniquely_classified.csv
   - ambiguous_classification.csv
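For a quick look at the three output tables, a minimal sketch along the following lines may help. It only counts rows and lists column names, since the column layout is not described here. The relative path `out_data` and the helper name `summarize_tables` are assumptions made for illustration, not part of the pipeline.

```python
import csv
from pathlib import Path

# Output tables named in the pipeline description.
TABLES = (
    "detection_level_classification.csv",
    "uniquely_classified.csv",
    "ambiguous_classification.csv",
)


def summarize_tables(out_dir="out_data"):
    """Return {file name: (row count, column names)} for each table found.

    Hypothetical helper: `out_dir` is assumed to be relative to the
    repository root; tables that do not exist are skipped.
    """
    summary = {}
    for name in TABLES:
        path = Path(out_dir) / name
        if not path.is_file():
            continue
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # first line holds column names
            n_rows = sum(1 for _ in reader)    # remaining lines are data rows
        summary[name] = (n_rows, header)
    return summary


for table, (n_rows, columns) in summarize_tables().items():
    print(f"{table}: {n_rows} rows, {len(columns)} columns")
```

If the tables live elsewhere, pass that directory explicitly, for example `summarize_tables("path/to/out_data")`.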
By following these steps, you should be able to closely replicate the pipeline presented in the paper.
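The crossmatch in step 2 is done in TOPCAT, but the join logic it describes (nearest counterpart within 1", all catalog rows retained) can be sketched in plain Python. This is an illustrative simplification that picks the nearest reference for each source independently; it is not TOPCAT's implementation, and the keys `ra`, `dec`, and `main_type` as well as the function names are hypothetical.

```python
import math

ARCSEC_PER_DEG = 3600.0


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * ARCSEC_PER_DEG


def best_match_left_join(sources, references, radius_arcsec=1.0):
    """Attach the nearest reference within the radius to every source.

    Every source row is kept; unmatched sources get None in the added
    fields. Brute force (O(N*M)), so only suitable for small tables.
    """
    joined = []
    for src in sources:
        best, best_sep = None, radius_arcsec
        for ref in references:
            sep = angular_sep_arcsec(src["ra"], src["dec"], ref["ra"], ref["dec"])
            if sep <= best_sep:
                best, best_sep = ref, sep
        row = dict(src)
        row["main_type"] = best["main_type"] if best else None
        row["sep_arcsec"] = best_sep if best else None
        joined.append(row)
    return joined


# Made-up coordinates, purely to exercise the function.
example_sources = [{"ra": 10.0, "dec": 20.0}, {"ra": 50.0, "dec": -30.0}]
example_refs = [
    {"ra": 10.0, "dec": 20.0 + 0.9 / 3600.0, "main_type": "Star"},
    {"ra": 10.0, "dec": 20.0 + 0.5 / 3600.0, "main_type": "YSO"},
]
for row in best_match_left_join(example_sources, example_refs):
    print(row)
```

In this toy input, the first source has two candidates within 1" and keeps the closer one, while the second source has none and is retained with empty match fields, mirroring the "retain all rows" requirement.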
Check the notebook classify_your_source.ipynb for instructions on how to classify your own X-ray source(s) with our pipeline.