doc changes
splendidbug committed Aug 18, 2024
1 parent 708a02f commit d187d54
Showing 3 changed files with 74 additions and 3 deletions.
5 changes: 2 additions & 3 deletions docs/make.jl
@@ -13,9 +13,8 @@ makedocs(;
canonical = "https://splendidbug.github.io/DocsScraper.jl",
edit_link = "main",
assets = String[]),
- pages = [
- "API Index" => "index.md"
- ]
+ pages = ["Home" => "home.md",
+ "API Reference" => "index.md"]
)

deploydocs(;
71 changes: 71 additions & 0 deletions docs/src/home.md
@@ -0,0 +1,71 @@

## DocsScraper: "A document scraping and parsing tool used to create a custom RAG database for AIHelpMe.jl"
[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://splendidbug.github.io/DocsScraper.jl/dev/) [![Build Status](https://github.com/splendidbug/DocsScraper.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/splendidbug/DocsScraper.jl/actions/workflows/CI.yml?query=branch%3Amain) [![Aqua](https://raw.githubusercontent.com/JuliaTesting/Aqua.jl/master/badge.svg)](https://github.com/JuliaTesting/Aqua.jl)


DocsScraper is a package that creates a vector database from input URLs. It scrapes and parses the pages at those URLs and, with the help of PromptingTools.jl, builds a vector store that can be used in RAG applications. It integrates with AIHelpMe.jl and PromptingTools.jl so that retrieval is efficient and relevant, and so that generated responses stay specific to the content in the created database.

## Features

- **URL Scraping and Parsing**: Automatically scrapes and parses input URLs to extract relevant information, paying particular attention to code snippets and code blocks. Chunk sizes are customizable.
- **URL Crawling**: Optionally crawls the input URLs to find additional pages in the same domain.
- **Vector Database Creation**: Uses PromptingTools.jl to create embeddings with a customizable embedding model, size, and element type (Bool or Float32).
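
To make the embedding size and type options above more concrete, here is a minimal sketch of what shortening an embedding and storing it as Bool can look like. It is an illustration only: `truncate_embeddings` and `binarize` are hypothetical helpers, not functions from DocsScraper or PromptingTools.jl, and the actual implementation may differ.

```julia
using LinearAlgebra  # standard library, provides `normalize`

# Hypothetical helper (assumption, not package API): keep the first `n`
# dimensions of each embedding (one embedding per column) and rescale
# every column back to unit length.
function truncate_embeddings(emb::AbstractMatrix{<:Real}, n::Integer)
    cut = Float32.(emb[1:n, :])
    return mapslices(normalize, cut; dims = 1)
end

# Hypothetical helper (assumption, not package API): one Bool per
# dimension, `true` where the value is positive.
binarize(emb::AbstractMatrix{<:Real}) = emb .> 0

# Toy data: 4 dimensions × 2 chunks.
emb = Float32[0.6 -0.2; -0.8 0.5; 0.1 0.9; -0.3 -0.4]

small = truncate_embeddings(emb, 2)  # 2 × 2 matrix of Float32
bits = binarize(small)               # 2 × 2 BitMatrix
```

A `BitMatrix` stores one bit per entry instead of the four bytes of a `Float32`, which is the usual motivation for a Bool embedding type: a much smaller index at some cost in retrieval precision.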

## Installation

To install DocsScraper, use the Julia package manager and the package name:

```julia
using Pkg
Pkg.add("DocsScraper")
```


**Prerequisites:**

- Julia (version 1.10 or later).
- Internet connection for API access.
- OpenAI API keys with available credits. See [How to Obtain API Keys](#how-to-obtain-api-keys).


## Usage
```julia
using DocsScraper

index = make_knowledge_packs(; single_urls=["https://docs.sciml.ai/Overview/stable/"], index_name="sciml", embedding_size=1024)
```
```
[ Info: robots.txt unavailable for https://docs.sciml.ai:/Overview/stable/
[ Info: Processing https://docs.sciml.ai/Overview/stable/...
. . .
[ Info: Parsing URL: https://docs.sciml.ai/Overview/stable/
[ Info: Scraping done: 69 chunks
[ Info: Removed 0 short chunks
[ Info: Removed 0 duplicate chunks
[ Info: Created embeddings for sciml. Cost: $0.001
a sciml__v20240817__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: ARTIFACT: sciml__v20240817__textembedding3large-1024-Bool__v1.0.tar.gz
┌ Info: sha256:
└ bytes2hex(open(sha256, fn_output)) = "58bec6dd9877d1b926c96fceb6aacfe5ef6395e57174d9043ccf18560d7b49bb"
┌ Info: git-tree-sha1:
└ Tar.tree_hash(IOBuffer(inflate_gzip(fn_output))) = "031c3f51fd283e89f294b3ce9255561cc866b71a"
```
`make_knowledge_packs` is the entry point to the package. This function takes in the URLs to parse and returns the index. This index can be passed to AIHelpMe.jl to answer queries on the built knowledge packs.
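
The log above prints a `sha256` digest for the generated `.tar.gz` archive, computed as `bytes2hex(open(sha256, fn_output))`. If the archive is shared, the same expression can be used to check that a copy is intact. The sketch below assumes only the SHA standard library; `verify_sha256` is a hypothetical helper, not part of DocsScraper.

```julia
using SHA  # standard library

# Hypothetical helper (assumption, not package API): compare a file's
# SHA-256 digest with an expected hex string.
function verify_sha256(path::AbstractString, expected::AbstractString)
    actual = bytes2hex(open(sha256, path))
    return actual == lowercase(expected)
end

# Demo on a temporary file so the snippet runs without a real archive.
path, io = mktemp()
write(io, "hello")
close(io)

expected = bytes2hex(sha256("hello"))  # digest of the same content
ok = verify_sha256(path, expected)
```

For the archive from the log, the call would be `verify_sha256("sciml__v20240817__textembedding3large-1024-Bool__v1.0.tar.gz", "58bec6dd9877d1b926c96fceb6aacfe5ef6395e57174d9043ccf18560d7b49bb")`, assuming the file is in the current working directory.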

**Using the created index:**
```julia
using AIHelpMe
sciml_index = AIHelpMe.load_index!(index)
aihelp(sciml_index, "what is Sciml")
```
```
[ Info: Updated RAG pipeline to `:bronze` (Configuration key: "textembedding3large-1024-Bool").
[ Info: Loaded index from packs: julia into MAIN_INDEX
[ Info: Loading index from sciml__v20240817__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: Loaded index a file sciml__v20240817__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
[ Info: Done with RAG. Total cost: $0.01
--------------------
AI Message
--------------------
SciML, or Scientific Machine Learning, is an ecosystem developed in the Julia programming language, aimed at solving equations and modeling systems while integrating the capabilities of
scientific computing and machine learning. It provides a range of tools with unified APIs, enabling features like differentiability, sensitivity analysis, high performance, and parallel
implementations. The SciML organization supports these tools and promotes their coherent use for various scientific applications.
```
1 change: 1 addition & 0 deletions docs/src/working.md
@@ -0,0 +1 @@
## Parser
