diff --git a/README.md b/README.md
index 7c9f34c..e91f77f 100644
--- a/README.md
+++ b/README.md
@@ -56,7 +56,7 @@ A CIFF export can be ingested into a number of different search systems.
 + [OldDog](https://github.com/chriskamphuis/olddog) by [creating csv files through CIFF](https://github.com/Chriskamphuis/olddog/blob/master/src/main/java/nl/ru/convert/CiffToCsv.java)
 + [Terrier](http://terrier.org) via the [Terrier-CIFF plugin](https://github.com/terrierteam/terrier-ciff)
 
-## Tips for writing your own CIFF Importer
+## Tips for writing your own CIFF Importer / Exporter
 
 The systems above all provide concrete examples of taking an existing CIFF structure and converting it into a different (internal) index format.
 Most of the data/structures within the CIFF are quite straightforward and self-documenting. However, there are a few important details which
@@ -64,4 +64,6 @@ should be noted.
 
 1. The default CIFF exports come from Anserini. Those exports are engineered to encode document identifiers *as deltas (d-gaps).* Hence, when decoding a CIFF structure, care needs to be taken to recover the original identifiers by computing a prefix sum across each postings list. See the discussion [here](https://github.com/osirrc/ciff/issues/19).
 
-2. Since Anserini is based on Lucene, it is important to note that document lengths are encoded in a lossy manner. This means that the document lengths recorded in the `DocRecord` structure are *approximate* - See the discussion [here](https://github.com/osirrc/ciff/issues/21).
+2. Since Anserini is based on Lucene, it is important to note that document lengths are encoded in a lossy manner. This means that the document lengths recorded in the `DocRecord` structure are *approximate* - see the discussion [here](https://github.com/osirrc/ciff/issues/21).
+
+3. Multiple records are stored in a single file using Java protobuf's parseDelimitedFrom() and writeDelimitedTo() methods.
Unfortunately, these methods are not available in the bindings for other languages. They can be trivially reimplemented by reading/writing the byte size of each record as a varint before the record itself - see the discussion [here](https://github.com/osirrc/ciff/issues/27).
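
As an illustration of tips 1 and 3 above, here is a minimal Python sketch of varint-length-prefixed record framing and of d-gap decoding. It is a sketch under stated assumptions, not the CIFF reference implementation: all function names (`write_varint`, `read_varint`, `write_delimited`, `read_delimited`, `decode_dgaps`) are hypothetical helpers, and records are treated as opaque bytes. With generated protobuf classes, those bytes would typically come from `SerializeToString()` and be parsed back with `ParseFromString()`.

```python
import io
from itertools import accumulate


def write_varint(stream, value):
    """Write a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("varint value must be non-negative")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            # More bytes follow: set the continuation bit.
            stream.write(bytes([byte | 0x80]))
        else:
            stream.write(bytes([byte]))
            return


def read_varint(stream):
    """Read a base-128 varint; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None  # no more records
            raise EOFError("stream ended inside a varint")
        byte = raw[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_delimited(stream, payload):
    """Write one record: its byte size as a varint, then the bytes."""
    write_varint(stream, len(payload))
    stream.write(payload)


def read_delimited(stream):
    """Yield each length-prefixed record until the stream is exhausted."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a record")
        yield payload


def decode_dgaps(gaps):
    """Recover document identifiers from d-gaps with a prefix sum."""
    return list(accumulate(gaps))


if __name__ == "__main__":
    buffer = io.BytesIO()
    for record in (b"first", b"second"):
        write_delimited(buffer, record)
    buffer.seek(0)
    print(list(read_delimited(buffer)))
    print(decode_dgaps([3, 2, 5]))
```

Treating records as raw bytes keeps the framing independent of any particular protobuf binding. The same two helpers can wrap whichever serialization calls the target language provides.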