[gitindex] Indexing repositories with malformed documents / missing blobs #73

Open
r10r opened this issue Jan 7, 2019 · 6 comments

@r10r

r10r commented Jan 7, 2019

When running zoekt-git-index on all of our Git repositories, I noticed that a few repositories were missing from the search. After digging into it, I discovered that the indexer aborts at the first indexing error. Since a repository may occasionally contain a malformed document (e.g. with invalid UTF-8 sequences), the indexer should be able to ignore these errors.
I've added a flag, ContinueOnError, to allow indexing of repositories with missing blobs and malformed documents. These repositories should be fixed anyway, but in the meantime only the broken files are left out of the index rather than the whole repository.

@hanwen
Contributor

hanwen commented Jan 7, 2019

For invalid UTF-8 sequences, we should just insert a placeholder and continue. I quickly looked at the code, and I think it's already doing that. Can you verify whether it really aborted on invalid UTF-8?

@r10r
Author

r10r commented Jan 8, 2019

This is the error I get. I only suspect it to be an encoding / UTF-8 error; honestly, I did not search for the root cause.

2019/01/08 11:37:48 failed to add document rootfs/usr/share/aptitude/README.cs : no rune for section boundary at byte 514

README.cs.zip

@hanwen
Contributor

hanwen commented Jan 8, 2019

aha. Could you share the file with me? Or maybe make a smaller reproducer? You probably need to cut off runes from the start in multiples of 100.
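
One way to produce such a reduced file is sketched below: drop a given number of runes from the start of the content, then retry indexing with cuts of 100, 200, 300, and so on. The helper name `dropLeadingRunes` and the sample input are hypothetical, and the thread does not say why the step should be a multiple of 100, so the sketch simply follows that suggestion.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// dropLeadingRunes returns content with the first n runes removed. An invalid
// byte counts as one rune of width 1, which is how utf8.DecodeRune reports it.
func dropLeadingRunes(content []byte, n int) []byte {
	for ; n > 0 && len(content) > 0; n-- {
		_, size := utf8.DecodeRune(content)
		content = content[size:]
	}
	return content
}

func main() {
	// Hypothetical sample standing in for the problematic file.
	content := []byte(strings.Repeat("é", 250) + "tail")
	for cut := 100; cut <= 300; cut += 100 {
		rest := dropLeadingRunes(content, cut)
		fmt.Printf("cut %d runes: %d bytes left\n", cut, len(rest))
	}
}
```

In practice the content would be read from the failing file with `os.ReadFile`, and each reduced version written out and indexed again until the error no longer appears.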

@r10r
Author

r10r commented Jan 8, 2019

The file is already attached in the previous comment. I've zipped it because otherwise GitHub does not let me upload it.

@r10r
Author

r10r commented Jan 8, 2019

I've refactored the patch and uploaded it to Gerrit (a new commit with a new change ID). Honestly, I'm a little confused by the Gerrit workflow; I've never worked with it before. Please tell me if I have to change something. Thanks.

@hanwen
Contributor

hanwen commented Jan 8, 2019

can't repro. Which version were you using?

hanwen@han-wen:~/go/src/github.com/google/zoekt$ git log HEAD |head -1
commit 43635377d1e262e9a40da6d865ba8f8d2157b88f
hanwen@han-wen:~/go/src/github.com/google/zoekt$ go install github.com/google/zoekt/cmd/zoekt-index && zoekt-index --file_limit 1000000 t/
2019/01/08 20:41:06 finished /usr/local/google/home/hanwen/.zoekt/t_v15.00000.zoekt: 1076357 index bytes (overhead 3.1)
hanwen@han-wen:~/go/src/github.com/google/zoekt$ ls -l t/
total 340
-rw-r--r-- 1 hanwen primarygroup 347653 Jan  8 14:20 README.cs

2 participants