Code updates (#8)
* changed kwargs; added target_path kwarg

* mae code more readable

* added examples scripts, code changes

* doc updates
splendidbug authored Aug 23, 2024
1 parent 2999ea5 commit 945a93a
Showing 14 changed files with 499 additions and 280 deletions.
4 changes: 4 additions & 0 deletions Project.toml
@@ -5,6 +5,8 @@ version = "0.1.0"

[deps]
AbstractTrees = "1520ce14-60c1-5f80-bbc7-55ef81b5835c"
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
EzXML = "8f5d6c58-4d21-5cfd-889c-e3ad7ee6a615"
Gumbo = "708ec375-b3d6-5a57-a7ce-8257bf98657a"
@@ -24,6 +26,8 @@ Unicode = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
[compat]
AbstractTrees = "0.4"
Aqua = "0.8"
CSV = "0.10"
DataFrames = "1.6"
Dates = "1"
EzXML = "1.2"
Gumbo = "0.8"
63 changes: 36 additions & 27 deletions README.md
@@ -1,6 +1,6 @@

## DocsScraper: "A document scraping and parsing tool used to create a custom RAG database for AIHelpMe.jl"
[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://splendidbug.github.io/DocsScraper.jl/dev/) [![Build Status](https://github.com/splendidbug/DocsScraper.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/splendidbug/DocsScraper.jl/actions/workflows/CI.yml?query=branch%3Amain) [![Coverage](https://codecov.io/gh/JuliaGenAI/DocsScraper.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/JuliaGenAI/DocsScraper.jl) [![Aqua](https://raw.githubusercontent.com/JuliaTesting/Aqua.jl/master/badge.svg)](https://github.com/JuliaTesting/Aqua.jl)
[![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://juliagenai.github.io/DocsScraper.jl/dev/) [![Build Status](https://github.com/JuliaGenAI/DocsScraper.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/JuliaGenAI/DocsScraper.jl/actions/workflows/CI.yml?query=branch%3Amain) [![Coverage](https://codecov.io/gh/JuliaGenAI/DocsScraper.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/JuliaGenAI/DocsScraper.jl) [![Aqua](https://raw.githubusercontent.com/JuliaTesting/Aqua.jl/master/badge.svg)](https://github.com/JuliaTesting/Aqua.jl)


DocsScraper is a package designed to create "knowledge packs" from online documentation sites for the Julia language.
@@ -32,29 +32,37 @@ Pkg.add("DocsScraper")

## Building the Index
```julia
index = make_knowledge_packs(["https://docs.sciml.ai/Overview/stable/"]; index_name="sciml", embedding_size=1024, bool_embeddings=true)
```
```julia
crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev/home/"]

index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true,
    target_path = joinpath(pwd(), "knowledge_packs"))
```
[ Info: robots.txt unavailable for https://docs.sciml.ai:/Overview/stable/
[ Info: Processing https://docs.sciml.ai/Overview/stable/...
```julia
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev
. . .
[ Info: Parsing URL: https://docs.sciml.ai/Overview/stable/
[ Info: Scraping done: 69 chunks
[ Info: Processing https://juliagenai.github.io:/DocsScraper.jl/dev...
[ Info: Parsing URL: https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping done: 44 chunks
[ Info: Removed 0 short chunks
[ Info: Removed 0 duplicate chunks
[ Info: Created embeddings for sciml. Cost: $0.001
a sciml__v20240817__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: ARTIFACT: sciml__v20240817__textembedding3large-1024-Bool__v1.0.tar.gz
┌ Info: sha256:
└ bytes2hex(open(sha256, fn_output)) = "58bec6dd9877d1b926c96fceb6aacfe5ef6395e57174d9043ccf18560d7b49bb"
┌ Info: git-tree-sha1:
└ Tar.tree_hash(IOBuffer(inflate_gzip(fn_output))) = "031c3f51fd283e89f294b3ce9255561cc866b71a"
[ Info: Removed 1 duplicate chunks
[ Info: Created embeddings for docsscraper. Cost: $0.001
a docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: ARTIFACT: docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.tar.gz
┌ Info: sha256:
└ sha = "977c2b9d9fe30bebea3b6db124b733d29b7762a8f82c9bd642751f37ad27ee2e"
┌ Info: git-tree-sha1:
└ git_tree_sha = "eca409c0a32ed506fbd8125887b96987e9fb91d2"
[ Info: Saving source URLS in Julia\knowledge_packs\docsscraper\docsscraper_URL_mapping.csv
"Julia\\knowledge_packs\\docsscraper\\Index\\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5"
```
`make_knowledge_packs` is the entry point to the package. This function takes in the URLs to parse and returns the path to the created index. This index can be passed to AIHelpMe.jl to answer queries on the built knowledge packs.

**Default `make_knowledge_packs` Parameters** (a usage sketch follows this list):
- Default embedding type is Float32. Change to boolean by the optional parameter: `bool_embeddings = true`.
- Default embedding size is 3072. Change to custom size by the optional parameter: `embedding_size = custom_dimension`.
- Default embedding type is Float32. Change to boolean by the optional parameter: `embedding_bool = true`.
- Default embedding size is 3072. Change to custom size by the optional parameter: `embedding_dimension = custom_dimension`.
- Default model being used is OpenAI's text-embedding-3-large.
- Default max chunk size is 384 and min chunk size is 40. Change by the optional parameters: `max_chunk_size = custom_max_size` and `min_chunk_size = custom_min_size`.
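
A minimal sketch of overriding these defaults; the URL, `index_name`, and `target_path` below are illustrative placeholders, and the keyword names follow the list above:

```julia
using DocsScraper

# Placeholder URL; substitute the documentation site you want to index.
urls = ["https://juliagenai.github.io/DocsScraper.jl/dev/"]

index_path = make_knowledge_packs(urls;
    index_name = "mydocs",            # placeholder name for the generated artifact
    embedding_dimension = 1024,       # default is 3072
    embedding_bool = true,            # default embedding type is Float32
    max_chunk_size = 384,             # defaults shown explicitly for illustration
    min_chunk_size = 40,
    target_path = joinpath(pwd(), "knowledge_packs"))
```
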
@@ -67,24 +75,25 @@ a sciml__v20240817__textembedding3large-1024-Bool__v1.0.hdf5
using AIHelpMe

# Either use the index explicitly
aihelp(index, "what is Sciml")
aihelp(index_path, "what is DocsScraper.jl?")

# or set it as the "default" index, then it will be automatically used for every question
AIHelpMe.load_index!(index)
aihelp("what is Sciml")
```
AIHelpMe.load_index!(index_path)

pprint(aihelp("what is DocsScraper.jl?"))
```
```julia
[ Info: Updated RAG pipeline to `:bronze` (Configuration key: "textembedding3large-1024-Bool").
[ Info: Loaded index from packs: julia into MAIN_INDEX
[ Info: Loading index from sciml__v20240817__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: Loaded index a file sciml__v20240817__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
[ Info: Done with RAG. Total cost: $0.01
[ Info: Loading index from Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: Loaded index a file Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
[ Info: Done with RAG. Total cost: $0.009
--------------------
AI Message
--------------------
SciML, or Scientific Machine Learning, is an ecosystem developed in the Julia programming language, aimed at solving equations and modeling systems while integrating the capabilities of
scientific computing and machine learning. It provides a range of tools with unified APIs, enabling features like differentiability, sensitivity analysis, high performance, and parallel
implementations. The SciML organization supports these tools and promotes their coherent use for various scientific applications.
DocsScraper.jl is a Julia package designed to create a vector database from input URLs. It scrapes and parses the URLs and, with the assistance of
PromptingTools.jl, creates a vector store that can be utilized in RAG (Retrieval-Augmented Generation) applications. DocsScraper.jl integrates with
AIHelpMe.jl and PromptingTools.jl to provide efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.
```
Tip: Use `pprint` for nicer outputs with sources
4 changes: 2 additions & 2 deletions docs/make.jl
@@ -13,8 +13,8 @@ makedocs(;
canonical = "https://splendidbug.github.io/DocsScraper.jl",
edit_link = "main",
assets = String[]),
pages = ["Home" => "home.md",
"API Reference" => "index.md"]
pages = ["Home" => "index.md",
"API Reference" => "api.md"]
)

deploydocs(;
8 changes: 8 additions & 0 deletions docs/src/api.md
@@ -0,0 +1,8 @@
# Reference

```@index
```

```@autodocs
Modules = [DocsScraper]
```
71 changes: 0 additions & 71 deletions docs/src/home.md

This file was deleted.

100 changes: 96 additions & 4 deletions docs/src/index.md
@@ -1,8 +1,100 @@
# Reference

```@index
```

## DocsScraper: "A document scraping and parsing tool used to create a custom RAG database for AIHelpMe.jl"
DocsScraper is a package designed to create "knowledge packs" from online documentation sites for the Julia language.

It scrapes and parses the URLs and with the help of PromptingTools.jl, creates an index of chunks and their embeddings that can be used in RAG applications. It integrates with AIHelpMe.jl and PromptingTools.jl to offer highly efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.

## Features

- **URL Scraping and Parsing**: Automatically scrapes and parses input URLs to extract relevant information, paying particular attention to code snippets and code blocks. Gives an option to customize the chunk sizes.
- **URL Crawling**: Optionally crawls the input URLs to look for multiple pages in the same domain (see the sketch after this list).
- **Knowledge Index Creation**: Leverages PromptingTools.jl to create embeddings with a customizable embedding model, size, and type (Bool and Float32).
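
A sketch of the crawl vs. single-page distinction, using the `single_urls` keyword from this commit's example script (the URLs and `index_name` are illustrative):

```julia
using DocsScraper

index_path = make_knowledge_packs(
    ["https://dataframes.juliadata.org/stable/"];   # crawled for more pages on the same host
    single_urls = ["https://docs.julialang.org/en/v1/manual/missing/"],  # scraped as-is, not crawled
    index_name = "demo")                            # illustrative name
```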

## Installation

To install DocsScraper, use the Julia package manager and the package name:

```julia
using Pkg
Pkg.add("DocsScraper")
```


**Prerequisites:**

- Julia (version 1.10 or later).
- Internet connection for API access.
- OpenAI API keys with available credits. See [How to Obtain API Keys](#how-to-obtain-api-keys).
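
One way to supply the key for the current session (a sketch; this assumes the key is read from the standard `OPENAI_API_KEY` environment variable):

```julia
# Set the OpenAI API key for this Julia session only; "sk-..." is a placeholder.
ENV["OPENAI_API_KEY"] = "sk-..."
```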


## Building the Index
```julia
crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev/home/"]

index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true,
    target_path = joinpath(pwd(), "knowledge_packs"))
```
```julia
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev
. . .
[ Info: Processing https://juliagenai.github.io:/DocsScraper.jl/dev...
[ Info: Parsing URL: https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping done: 44 chunks
[ Info: Removed 0 short chunks
[ Info: Removed 1 duplicate chunks
[ Info: Created embeddings for docsscraper. Cost: $0.001
a docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: ARTIFACT: docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.tar.gz
┌ Info: sha256:
└ sha = "977c2b9d9fe30bebea3b6db124b733d29b7762a8f82c9bd642751f37ad27ee2e"
┌ Info: git-tree-sha1:
└ git_tree_sha = "eca409c0a32ed506fbd8125887b96987e9fb91d2"
[ Info: Saving source URLS in Julia\knowledge_packs\docsscraper\docsscraper_URL_mapping.csv
"Julia\\knowledge_packs\\docsscraper\\Index\\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5"
```
`make_knowledge_packs` is the entry point to the package. This function takes in the URLs to parse and returns the path to the created index. This index can be passed to AIHelpMe.jl to answer queries on the built knowledge packs.

**Default `make_knowledge_packs` Parameters:**
- Default embedding type is Float32. Change to boolean by the optional parameter: `embedding_bool = true`.
- Default embedding size is 3072. Change to custom size by the optional parameter: `embedding_dimension = custom_dimension`.
- Default model being used is OpenAI's text-embedding-3-large.
- Default max chunk size is 384 and min chunk size is 40. Change by the optional parameters: `max_chunk_size = custom_max_size` and `min_chunk_size = custom_min_size`.
**Note:** For everyday use, embedding size = 1024 and embedding type = Bool is sufficient. This is compatible with AIHelpMe's `:bronze` and `:silver` pipelines (`update_pipeline(:bronze)`). For better results, use embedding size = 3072 and embedding type = Float32, which requires the `:gold` pipeline (see `?RAG_CONFIGURATIONS` for more).
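
A sketch of matching the pipeline to the embedding configuration, using the `update_pipeline` call mentioned in the note above:

```julia
using AIHelpMe

# For a 1024-dimensional Bool index (as built above):
AIHelpMe.update_pipeline(:bronze)

# For a 3072-dimensional Float32 index, the note above calls for:
# AIHelpMe.update_pipeline(:gold)
```
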
## Using the Index for Questions
```julia
using AIHelpMe

# Either use the index explicitly
aihelp(index_path, "what is DocsScraper.jl?")

# or set it as the "default" index, then it will be automatically used for every question
AIHelpMe.load_index!(index_path)

pprint(aihelp("what is DocsScraper.jl?"))
```
```julia
[ Info: Updated RAG pipeline to `:bronze` (Configuration key: "textembedding3large-1024-Bool").
[ Info: Loaded index from packs: julia into MAIN_INDEX
[ Info: Loading index from Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: Loaded index a file Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
[ Info: Done with RAG. Total cost: $0.009
--------------------
AI Message
--------------------
DocsScraper.jl is a Julia package designed to create a vector database from input URLs. It scrapes and parses the URLs and, with the assistance of
PromptingTools.jl, creates a vector store that can be utilized in RAG (Retrieval-Augmented Generation) applications. DocsScraper.jl integrates with
AIHelpMe.jl and PromptingTools.jl to provide efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.
```
```@autodocs
Modules = [DocsScraper]
```

Tip: Use `pprint` for nicer outputs with sources
```julia
using AIHelpMe: pprint, last_result
print(last_result)
```
34 changes: 34 additions & 0 deletions examples/scripts/generate_knowledge_pack.jl
@@ -0,0 +1,34 @@
# The example below demonstrates the creation of a JuliaData knowledge pack

using Pkg
Pkg.activate(temp = true)
Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
using DocsScraper

# The crawler will run on these URLs to look for more URLs with the same hostname
crawlable_urls = ["https://juliadatascience.io/dataframes",
"https://juliadatascience.io/dataframesmeta", "https://csv.juliadata.org/stable/",
"https://tutorials.pumas.ai/html/DataWranglingInJulia/04-read_data.html#csv-files-with-csv.jl",
"https://dataframes.juliadata.org/stable/man/getting_started/", "https://dataframes.juliadata.org/stable/",
"https://juliadata.org/DataFramesMeta.jl/stable/",
"https://juliadata.org/DataFramesMeta.jl/dev/", "https://juliadb.juliadata.org/latest/", "https://tables.juliadata.org/dev/",
"https://typedtables.juliadata.org/stable/",
"https://docs.juliahub.com/General/SplitApplyCombine/stable/", "https://categoricalarrays.juliadata.org/dev/",
"https://docs.juliahub.com/General/IndexedTables/stable/",
"https://felipenoris.github.io/XLSX.jl/dev/"]

# The crawler will not look for more URLs on these pages
single_page_urls = ["https://docs.julialang.org/en/v1/manual/missing/",
"https://arrow.apache.org/julia/stable/",
"https://arrow.apache.org/julia/stable/manual/",
"https://arrow.apache.org/julia/stable/reference/"]

index_path = make_knowledge_packs(crawlable_urls; single_urls = single_page_urls,
embedding_dimension = 1024, embedding_bool = true,
target_path = joinpath(pwd(), "knowledge_to_delete"), index_name = "juliadata", custom_metadata = "JuliaData ecosystem")

# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.

# The above example creates the output directory. It contains the sub-directories "Scraped" and "Index".
# "Scraped" contains .jls files of chunks and sources of the scraped URLs. "Index" contains the created index along with a .txt file
# containing the artifact info. The output directory also contains the URL mapping CSV.
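
# For instance, the URL-mapping CSV that the run reports saving can be inspected with the
# CSV and DataFrames dependencies added in this commit. A sketch only: the exact file
# path and column layout are assumptions; adjust them to your own output directory.
using CSV, DataFrames
mapping_path = joinpath(pwd(), "knowledge_to_delete", "juliadata", "juliadata_URL_mapping.csv")
mapping = CSV.read(mapping_path, DataFrame)
first(mapping, 5)   # peek at the first few URL-mapping rows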