Skip to content

This project demonstrates an external sorting algorithm in Go, designed to efficiently handle large data sets that exceed available memory. The program reads a random string data set from a file, sorts it using an external merge sort technique, and writes the sorted output to a new file.

License

Notifications You must be signed in to change notification settings

fajarnugraha37/go_external_sort

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

External Sorting in Go

This project implements an external sorting algorithm in Go, capable of handling large datasets that exceed available memory. The program reads a dataset of random strings from a file, sorts them, and writes the sorted output to a new file. It includes a dataset generator that creates random strings for testing purposes.

Table of Contents

Features

  • Generates a dataset of random strings.
  • Implements external merge sort to handle large datasets.
  • Uses a temporary directory for storing intermediate sorted chunks.
  • Supports sorting of strings.

Project Structure

go_external_sort/
│
├── cmd/main.go                
│   ├── dataset/
│        └── main.go        # Entry point of the program to generate dataset
│   └── sort
│        └── main.go        # Entry point of the program to sort dataset
├── sorter/
│   ├── sorter.go          # External sorting logic
│   └── merge.go           # Merging sorted chunks
└── utils/
    ├── file.go            # File handling utilities
    └── logger.go          # Logging utilities

Requirements

  • Go 1.22.3 or later

Installation

  1. Clone the repository:
git clone https://github.com/fajarnugraha37/go_external_sort.git 
cd go_external_sort
  1. Ensure you have Go installed on your machine. You can download it from golang.org.

Build

Simple Build

To build the project for development, you can use the following command:

go build -o ./bin/dataset.exe ./cmd/dataset/main.go
go build -o ./bin/sort.exe ./cmd/sort/main.go

This command compiles the code and creates executables in the bin directory.

Optimized Build

For a optimized-optimized build, you can use the following command:

go build -o ./bin/dataset.exe -ldflags="-s -w" ./cmd/dataset/main.go
go build -o ./bin/sort.exe -ldflags="-s -w" ./cmd/sort/main.go
  • -ldflags="-s -w": This flag reduces the size of the binary by omitting the symbol table and debug

Cross-Compilation

If you are on a different operating system and want to cross-compile for another OS, you can set the GOOS and GOARCH environment variables accordingly.

Usage

Generating a Dataset

To generate a dataset of random strings, use the following command:

go run ./cmd/dataset/main.go -output dataset_input.txt -count 10000000 -length 16

or

./bin/dataset.exe -output dataset_input.txt -count 10000000 -length 16
  • -output: Path to the output file where the dataset will be saved.
  • -count: Number of random strings to generate (default is 100,000).
  • -length: Length of each random string (default is 10).

Sorting the Dataset

To sort the generated dataset, use the following command:

go run ./cmd/sort/main.go -input dataset_input.txt -output dataset_output.txt

or

./bin/sort.exe -input dataset_input.txt -output dataset_output.txt
  • -input: Path to the input file containing the dataset to be sorted.
  • -output: Path to the output file where the sorted dataset will be saved.

How It Works

  1. Dataset Generation: The ./cmd/dataset/main.go file generates a specified number of random strings and writes them to a file.
  2. External Sorting: The ./cmd/sort/main.go file orchestrates the sorting process:
    • It reads the dataset from the input file.
    • It splits the dataset into manageable chunks, sorts each chunk, and writes them to temporary files.
    • It merges the sorted chunks into a final sorted output file.

Sequence Diagram

sequenceDiagram
    participant User
    participant Generator
    participant Sorter
    participant FileSystem as FS
    participant TempDirectory as TempDir

    User->>Generator: Run dataset generation command
    Generator->>FS: Create dataset file
    FS-->>Generator: Confirm file creation
    Generator->>FS: Write random strings to dataset file
    FS-->>Generator: Confirm write completion
    Generator->>User: Dataset generated successfully

    User->>Sorter: Run sorting command
    Sorter->>FS: Open dataset file
    FS-->>Sorter: Provide dataset file
    Sorter->>TempDir: Create temp directory (if not exists)
    Sorter->>FS: Read dataset file
    FS-->>Sorter: Provide dataset content
    Sorter->>Sorter: Split dataset into chunks
    Sorter->>Sorter: Sort each chunk
    Sorter->>TempDir: Write sorted chunks to temp files
    TempDir-->>Sorter: Confirm chunk write completion
    Sorter->>TempDir: Merge sorted chunks
    Sorter->>FS: Write sorted output file
    FS-->>Sorter: Confirm output file creation
    Sorter->>User: Sorting completed successfully
Loading
  • Participants: The diagram includes five participants: User, Generator, Sorter, FileSystem, and TempDirectory.
  • Interactions: The arrows represent the interactions between these participants, showing the flow of commands and responses during the dataset generation and sorting processes.

Memory Management

The program is designed to use minimal memory by:

  • Limiting the size of each chunk to 10 MB.
  • Using buffered I/O for efficient file operations.
  • Storing temporary files in a dedicated temp directory within the working directory.

Contributing

Contributions are welcome! If you have suggestions or improvements, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

This project demonstrates an external sorting algorithm in Go, designed to efficiently handle large data sets that exceed available memory. The program reads a random string data set from a file, sorts it using an external merge sort technique, and writes the sorted output to a new file.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages