A tool that generates a new dataset containing redundant attributes (defined by mathematical expressions) from an existing dataset. It can be used to evaluate dimensionality reduction algorithms and how the curse of dimensionality affects their results.
This tool uses Apache Maven for build and dependency automation.
To compile this package, install Apache Maven and run the following command:
mvn clean compile assembly:single
Otherwise, you'll need to set up the JSON In Java dependency manually.
To generate new redundant attributes, write the corresponding expressions in a configuration file in JSON format, together with the path of the original dataset, the path of the new dataset (with redundant attributes), and the separator character (for example: "," for .csv files).
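As a rough illustration of what a redundant attribute means for a single row, the sketch below splits a line on the separator, appends derived values, and joins the row back together. The class `RedundantRowSketch`, the method `extend`, and the two lambdas are hypothetical names chosen for this example; they are not part of the tool, and the sketch assumes a header-less file whose fields are all numeric.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the tool's actual implementation. */
public class RedundantRowSketch {

    /** Splits a line on the separator, appends one derived value per function, and rejoins it. */
    public static String extend(String line, String separator,
                                List<ToDoubleFunction<double[]>> derived) {
        // Pattern.quote keeps separators such as "|" or "." from being read as regex syntax.
        String[] fields = line.split(Pattern.quote(separator));
        double[] values = Arrays.stream(fields).mapToDouble(Double::parseDouble).toArray();

        // Original fields are kept as-is; derived values are appended at the end.
        List<String> out = new ArrayList<>(Arrays.asList(fields));
        for (ToDoubleFunction<double[]> f : derived) {
            out.add(Double.toString(f.applyAsDouble(values)));
        }
        return String.join(separator, out);
    }

    public static void main(String[] args) {
        // Two hypothetical redundant attributes built from columns 0 and 1.
        List<ToDoubleFunction<double[]>> derived = List.of(
                row -> 2 * row[0] + 1,
                row -> row[0] * row[1]);
        System.out.println(extend("3,4", ",", derived)); // prints 3,4,7.0,12.0
    }
}
```

Each appended column is fully determined by existing columns, which is why a dimensionality reduction algorithm should ideally be able to discard it.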
Here's a basic example of the configuration file formatting:
{
"origin_dataset": "dataset.csv",
"target_dataset": "new_dataset.csv",
"separator": " ",
"redundant_attributes":[
"{2}*[0]+{1.34}",
"{2}*[0]+{1.34}",
"{2}*([0]+[1])+{1.34}"
]
}
- Constant syntax = {Constant}
- Attribute reference = [Attribute index]
- Supported operators = { +, -, *, /, ^, rand, l (log), s (sin) }
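To make the expression syntax more concrete, here is a minimal sketch of how such an expression could be evaluated against one row of the dataset. It is not the tool's actual parser: the class `ExpressionSketch` and the method `evaluate` are hypothetical names, the usual operator precedence (with right-associative `^`) is an assumption, and `rand`, `l`, and `s` are left out because their exact syntax is not described here.

```java
/** Illustrative recursive-descent evaluator; not the tool's actual parser. */
public class ExpressionSketch {
    private final String src;
    private final double[] row;
    private int pos = 0;

    private ExpressionSketch(String src, double[] row) {
        this.src = src.replace(" ", "");
        this.row = row;
    }

    /** Evaluates an expression where {c} is a constant and [i] is the attribute at index i. */
    public static double evaluate(String expression, double[] row) {
        ExpressionSketch parser = new ExpressionSketch(expression, row);
        double value = parser.expr();
        if (parser.pos != parser.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + parser.pos);
        }
        return value;
    }

    // Consumes the next character if it matches c.
    private boolean accept(char c) {
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double v = term();
        while (true) {
            if (accept('+')) v += term();
            else if (accept('-')) v -= term();
            else return v;
        }
    }

    // term := power (('*' | '/') power)*
    private double term() {
        double v = power();
        while (true) {
            if (accept('*')) v *= power();
            else if (accept('/')) v /= power();
            else return v;
        }
    }

    // power := primary ('^' power)?   (right-associative, an assumption)
    private double power() {
        double base = primary();
        return accept('^') ? Math.pow(base, power()) : base;
    }

    // primary := '(' expr ')' | '{' constant '}' | '[' index ']'
    private double primary() {
        if (accept('(')) {
            double v = expr();
            if (!accept(')')) {
                throw new IllegalArgumentException("Missing ')' at position " + pos);
            }
            return v;
        }
        if (accept('{')) return Double.parseDouble(readUntil('}'));
        if (accept('[')) return row[Integer.parseInt(readUntil(']'))];
        throw new IllegalArgumentException("Unexpected input at position " + pos);
    }

    // Returns the text up to the closing delimiter and moves past it.
    private String readUntil(char close) {
        int end = src.indexOf(close, pos);
        if (end < 0) {
            throw new IllegalArgumentException("Missing '" + close + "'");
        }
        String text = src.substring(pos, end);
        pos = end + 1;
        return text;
    }

    public static void main(String[] args) {
        double[] row = {3.0, 4.0};
        System.out.println(evaluate("{2}*[0]+{1.34}", row));
        System.out.println(evaluate("{2}*([0]+[1])+{1.34}", row));
    }
}
```

For the row (3.0, 4.0), the first expression works out to 2*3 + 1.34 = 7.34 and the second to 2*(3+4) + 1.34 = 15.34, up to floating-point rounding.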
After entering all the required settings, run the following command:
java -jar target/GenerateSyntheticDataset-1.0-SNAPSHOT-jar-with-dependencies.jar <path-of-config-file>.json
Alternatively, run the script.sh file:
./script.sh
This tool includes a set of unit tests written with the JUnit library.
To run the existing unit tests, make sure Apache Maven is installed correctly and run the following command:
mvn test