Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance, possible improvements #53

Closed
jrochkind opened this issue Feb 28, 2018 · 22 comments
Closed

performance, possible improvements #53

jrochkind opened this issue Feb 28, 2018 · 22 comments

Comments

@jrochkind
Copy link
Contributor

jrochkind commented Feb 28, 2018

I'm evaluating using this in a server side (web) application. For my use case, I just actually need to take a single citation (which I already have in a csl-data-json hash), and output it in one of a handful of styles. Following the directions in the README, I get:

csl_hash # a csl-data-json style hash, assume exists
cp = CiteProc::Processor.new style: 'chicago-note-bibliography', format: 'html'
cp.import [csl]
cp.render :bibliography, id: csl.first["id"]

This works, but it's slow. On my macbook, 100-200ms. Too slow for me.

However, good news, profiling reveals that it's slow mostly only because each time it loads and parses the style and locale files. The slow is in CSL::Loader.load, which is called twice, once with chicago-note-bibliography and once with the (default) locale en-US.

If just to play around, I modify CSL::Loader.load to cache in memory, it saves an order of magnitude of time, down to 10-20ms to process.

One probably wouldn't want to actually do that, because then any program using would cache whether it wanted to or not, eventually holding all styles and locales ever used in memory. Plus there could be concurrency issues.

But it would be good if there were some way for the caller to load the style/locale manually, and pass it in. Perhaps something like:

chicago_note_bibliography = CSL::Loader.load("chicago-note-bibliography")
locale = CSL::Loader.load("en-US")

# now I can re-use them to my hearts content over and over,
# without paying the loading cost...

 cp = CiteProc::Processor.new style: chicago_note_bibliography, locale: locale,  format: 'html'
 cp.import [csl]
 cp.render etc

  # do it again without reload

 cp = CiteProc::Processor.new style: chicago_note_bibliography, locale: locale,  format: 'html'
 cp.import [another_csl]
 cp.render etc

That's not quite right, CSL::Loader is actually a module, I got to figure out where it is used, plus of course this just doesn't work at present. It could probably be made to work, I started looking at the code, and got a bit lost in all the objects across three gems involved, but perhaps it could be done fairly straightforwardly. I'm still investigating, toward a potential PR (in one or more of the involved gems).

Or maybe there's other existing API not mentioned in the README that would be better to use for this, either as-is, or with some modification.

Do you have any thoughts? Thanks!

@jrochkind
Copy link
Contributor Author

Hmm, something close to this may almost work already. It's actually:

chicago_note_bibliography = CSL::Style.load("chicago-note-bibliography")
locale = CSL::Locale.load("en-US")

If I pass those in to cp.render, the style is not loaded again, it uses the one I already loaded. (not totally sure if this is actually thread-safe though. Also not sure if it just works accidentally).

the locale is still loaded again though, somewhere it turns the locale I pass in back to a string and then it's loaded again using that string as input.

Hmm.

@inukshuk
Copy link
Owner

It helps to distinguish between what I call the citation 'renderer' and 'processor' part. The renderer implements CSL and is fairly complete; the processor has many other concerns that will vary a great deal based on your application: parses input, processes output, keeps track of which items have been cited before, and their order (for stuff like 'ibid.'), decides if disambiguation is necessary etc. You'll notice that the renderer, in theory, can be almost stateless, whereas the processor is inherently stateful. At the moment, the processor part is very limited in citeproc-ruby, because my own use cases have always been limited to generating bibliographies in one go as opposed to managing all the citations in a document and the corresponding bibliography, based on dynamic input.

That said, you're absolutely right: parsing styles and locales takes a long time and it obviously makes sense to keep instances around. In the context of the library, this would happen in the processor: you create a processor instance with style and locales and use for all your processing needs. If your use case is inherently a stateless one, like you say, you're in the position I usually find myself in as well: you just need to render single (or sometimes multiple) citations and then forget about them (i.e., you're not concerned about disambiguation, ibidems and all that stuff). In this case, typically you don't need a processor at all: this is basically the conclusion you arrived at, I just wanted to give you my rationale for it. So you'd load the styles and locales you need, create a renderer, and use those instances.

If you do not modify the styles or locales they should be thread safe; the renderer needs to track some state while it renders, so I would definitely put use a thread local instance for it (to be on the safe side, it would probably be best to keep style and locale in thread local too, especially if you don't do any dynamic modifications, it's safer and only a little memory overhead).

With the caveat that I'll have to familiarize myself with the code again, I would say that there is no reason for the renderer to keep loading the same locale again, but that's something we can figure out.

@inukshuk
Copy link
Owner

As an example, here is how we create the renderer in jekyll-scholar and then use it to render bibliographies.

@jrochkind
Copy link
Contributor Author

jrochkind commented Feb 28, 2018

I found the code where it insists on reloading the locale, which I can change to be analagous to the code for loading the style, which does not. I'll make a PR.

As far as your example from jekyll-scholar (which is helpful, thanks! And I think would be a helpful example in the README for one of the involved gems maybe?), my concern is that a CiteProc::Ruby::Renderer may not be safe to use concurrently between multiple threads. That might not matter in jekyll, but does matter in Rails or other web served apps, unless you do some non-trivial and error-prone things to try and make it not matter.

The solution "cache it in Thread.local, not globally" doens't really work, because depending on server environment, you can't neccesarily depend on threads being re-used at all. Caching things in Thread.local can end up being equivalent to re-creating it for every request, if every request ends up getting a brand new thread. It's not a super reliable solution.

@inukshuk
Copy link
Owner

The renderer is definitely not thread-safe, because it keeps some state during the rendition: to render an item, the renderer basically walks down the CSL nodes; some state is stored in the citation item as the renderer visits each node; this is probably thread safe even for the more complicated CSL features like groups, because you can make sure not to share citation items between threads. But there is also some state kept in the renderer instance. To make rendering thread safe we'd have to pass down the state to every node which may not be easy to do conceptually, because order is really important: e.g., if you have a style that replaces an author's name in subsequent references with something like ----. So I think we'd have to use a mutex for the render method to make it thread safe.

Looking at some of the render methods, you see the state being stored and then at the end cleared again. It's this state, I think which is a conceptual problem so those methods would have to be protected by a mutex.

@jrochkind
Copy link
Contributor Author

OK, thanks. Presumably that also means a CiteProc::Processor is not thread safe, since it uses a renderer. But no good way to share that anyway, since a given processor already has a certain list of known records in it's list.

By making sure to re-use Locale and Style (which we think are probably thread-safe, so long as you aren't intentionally mutating them), creating a new CiteProc::Processor and using it gets me down to reasonable if not awesome performance -- 10-20ms instead of the 100-200ms it took including loading the Style and Locale. So I think that's good enough for now!

I will PR a change to citeproc-ruby to allow you to re-use an already loaded locale with a new renderer/CiteProc::Processor. It is simple.

@jrochkind
Copy link
Contributor Author

@inukshuk hmm, in your example of using the Renderer directly -- what data structure has to be passed in to the renderer, in your code what sort of object is citation_item_for(entry, index). When I try passing in an ordinary ruby hash (that is in csl-data.json format), I get NoMethodError: undefined method data' for #Hash:0x007faa68d418d0`.

Ah, okay, I see in your code I need to create a CiteProc::CitationItem.new. Which looks maybe straightforward to do from a csl-data hash. I'll try that next!

I wonder if it would be convenient to have a simpler more convenient api for "render a single citation from csl-data hash".

jrochkind added a commit to sciencehistory/chf-sufia that referenced this issue Feb 28, 2018
@jrochkind
Copy link
Contributor Author

got it working. hmm, making it work for "re-use already loaded locale/style" is actually harder at this lower level using a renderer directly, than it was using the higher level CiteProc::Processor. But I'm gonna figure it out.

This code gets a bit... tangled.

jrochkind added a commit to sciencehistory/chf-sufia that referenced this issue Feb 28, 2018
@inukshuk
Copy link
Owner

While I blush a bit at the 20ms, I'll say that the library has not been optimized for performance yet at all -- there is much room for improvement (but with CSL/CiteProc there are also many rabbit holes, in my experience!).

I do agree that it would be nice to have a simple API to 'just render this item' -- it is the use case that I think a lot of people want. PRs much appreciated, if you want to take a stab at it.

@jrochkind
Copy link
Contributor Author

Actually, using the lower-level Renderer API and using a patch to let it use already-loaded locale, I'm getting down to 7-9ms. So much better than the 100-200ms I started out with! Trying to figure out the best place to actually PR the patch.

Definitely either the README needs an example of how to do this in a performant way, and/or it needs a simpler API. Maybe both. I'll see what I can do once I get the first patch in.

jrochkind added a commit to jrochkind/citeproc-ruby that referenced this issue Feb 28, 2018
Ref inukshuk#53

CiteProc::Ruby::Engine class already does it here for styles:
    https://github.com/inukshuk/citeproc-ruby/blob/c6b81ea04868ad2afc704d9a38148df053cf4898/lib/citeproc/ruby/engine.rb#L117-L121

And CiteProc::Ruby::Format does it here for formats:
    https://github.com/inukshuk/citeproc-ruby/blob/c6b81ea04868ad2afc704d9a38148df053cf4898/lib/citeproc/ruby/format.rb#L31

Loading (or not) of all three types of objects is done in kind of different non-parallel ways, but that was kind of there in the code when I found it.

With this change for locale, I can avoid loading both locales AND styles (as well as formats while we’re at it although it probably doesn’t make as much of a difference) with EITHER of these APIs:

```ruby
csl_style # assume exists a CSL::Style, prob previously loaded with CSL::Style.load
csl_locale # assume exists a CSL::Style, prob previously loaded with CSL::Locale.load

    cp = CiteProc::Processor.new style: csl_style,
       locale: csl_locale,
       format: CiteProc::Ruby::Formats::Html.new
    cp.import csl
    cp.render :bibliography, id: whatever

   csl_hash # assume exists, a single hash (not array) that’s csl-data.json format
   citation_item = CiteProc::CitationItem.new(id: csl_hash[“id"]) do |c|
      c.data = CiteProc::Item.new(csl)
    end

      renderer = CiteProc::Ruby::Renderer.new :format => CiteProc::Ruby::Formats::Html.new,
         :style => csl_style, :locale => csl_locale
      renderer.render citation_item, csl_style.bibliography
```

The latter is giving me pretty good performance, 7-9ms on my macbook, with already loaded style/locale.

The former still gives better performance than it did before an already loaded locale could be used.

There’s other code here:
https://github.com/inukshuk/citeproc-ruby/blob/c6b81ea04868ad2afc704d9a38148df053cf4898/lib/citeproc/ruby/renderer/locale.rb#L11

…that at first I was looking at to avoid the load if it already has a locale…. I’m not totally sure when that code is used vs the code I _did_ patch, but the thing I’m PR’ing here is the thing that seemed to work for both CiteProc::Processor use and CiteProc::Renderer use.  I think.

None of these approaches were obvious from the docs. Perhaps a README PR giving examples of how to do this now with current code?  Better convenience API for this use case might also be nice, but I’m not totally sure how to do it sensibly.  Maybe just a new method on Renderer that lets you just pass in a csl_hash, without having to make the CitationItem yourself, and lets you pass in :bibliography if you want. (And why do I have to tell CitationItem the `id`, when that’s already in the hash?)

Or maybe the convenience API should actually be in terms of `Engine`, create an engine and just pass that in to render each time. Which maybe is possible now. I gotta figure out how to work with Engine directly. Is `Engine` thread-safe do you think?  If not, then nevermind on this last paragraph.
jrochkind added a commit to jrochkind/citeproc-ruby that referenced this issue Feb 28, 2018
Ref inukshuk#53

CiteProc::Ruby::Engine class already does it here for styles:
    https://github.com/inukshuk/citeproc-ruby/blob/c6b81ea04868ad2afc704d9a38148df053cf4898/lib/citeproc/ruby/engine.rb#L117-L121

And CiteProc::Ruby::Format does it here for formats:
    https://github.com/inukshuk/citeproc-ruby/blob/c6b81ea04868ad2afc704d9a38148df053cf4898/lib/citeproc/ruby/format.rb#L31

Loading (or not) of all three types of objects is done in kind of different non-parallel ways, but that was kind of there in the code when I found it.

With this change for locale, I can avoid loading both locales AND styles (as well as formats while we’re at it although it probably doesn’t make as much of a difference) with EITHER of these APIs:

```ruby
csl_style # assume exists a CSL::Style, prob previously loaded with CSL::Style.load
csl_locale # assume exists a CSL::Style, prob previously loaded with CSL::Locale.load

    cp = CiteProc::Processor.new style: csl_style,
       locale: csl_locale,
       format: CiteProc::Ruby::Formats::Html.new
    cp.import csl
    cp.render :bibliography, id: whatever

   csl_hash # assume exists, a single hash (not array) that’s csl-data.json format
   citation_item = CiteProc::CitationItem.new(id: csl_hash[“id"]) do |c|
      c.data = CiteProc::Item.new(csl)
    end

      renderer = CiteProc::Ruby::Renderer.new :format => CiteProc::Ruby::Formats::Html.new,
         :style => csl_style, :locale => csl_locale
      renderer.render citation_item, csl_style.bibliography
```

The latter is giving me pretty good performance, 7-9ms on my macbook, with already loaded style/locale.

The former still gives better performance than it did before an already loaded locale could be used.

There’s other code here:
https://github.com/inukshuk/citeproc-ruby/blob/c6b81ea04868ad2afc704d9a38148df053cf4898/lib/citeproc/ruby/renderer/locale.rb#L11

…that at first I was looking at to avoid the load if it already has a locale…. I’m not totally sure when that code is used vs the code I _did_ patch, but the thing I’m PR’ing here is the thing that seemed to work for both CiteProc::Processor use and CiteProc::Renderer use.  I think.

None of these approaches were obvious from the docs. Perhaps a README PR giving examples of how to do this now with current code?  Better convenience API for this use case might also be nice, but I’m not totally sure how to do it sensibly.  Maybe just a new method on Renderer that lets you just pass in a csl_hash, without having to make the CitationItem yourself, and lets you pass in :bibliography if you want. (And why do I have to tell CitationItem the `id`, when that’s already in the hash?)

Or maybe the convenience API should actually be in terms of `Engine`, create an engine and just pass that in to render each time. Which maybe is possible now. I gotta figure out how to work with Engine directly. Is `Engine` thread-safe do you think?  If not, then nevermind on this last paragraph.
@jrochkind
Copy link
Contributor Author

jrochkind commented Mar 5, 2018

@inukshuk in the example you showed me before, you pass a style: argument to Renderer.new:

@renderer = CiteProc::Ruby::Renderer.new :format => 'html', :style => style, :locale => config['locale']

But as far as I can tell, that style argument is completely ignored, it's not actually saved anywhere.

Instead, when you actually call render on the renderer, you need to pass in styles(style).bibliography as the second arg. I'm not sure what you can that style#bibliography thing ; is that a style too? A substyle? A "mode"? And it can apparently be different on every call to render, it does not need to be set on Renderer.new (and in fact cannot be even as a default, that style arg is simply ignored).

Does this seem like I have it right? Thank you!

@inukshuk
Copy link
Owner

inukshuk commented Mar 5, 2018

Oh, yes, that sounds right -- you actually don't pass a default style to the renderer. In CSL there are two root rendering nodes: citation and bibliography (for citations and references which go into a bibliography); you pass one or the other to the render function based on what you want to render.

@mattluce
Copy link

mattluce commented Mar 20, 2018

So I'm not clear from this thread, is there a way to pass in the locale/style with the existing code? Any performance gains would be welcome.

@jrochkind
Copy link
Contributor Author

jrochkind commented Mar 20, 2018

@mattluce style, yes. locale, no.

Pretty much anywhere in existing code you pass in a style as a local file path or URL, you can also pass in a CSL::Style (that you loaded previously with CSL::Style.load), and it will use it, without having to load it again.

By "pretty much anywhere", I mean the places I was using/testing, and the places I found in the code. :)

Locale, you can pass in a CSL::Locale -- but the existing code will end up trying to load it again anyway (often succeeding, if you loaded it from local path, so you won't see an error, but it'll be extra time to (re)-load). #54 fixes that.

@mattluce
Copy link

Unfortunately, I'm not seeing any performance gains when reusing a loaded style object.
cp = CiteProc::Processor.new style: $style, format: 'html'

Anyway, performance improvements would be greatly welcomed. I'm not a Ruby developer, otherwise I'd be happy to contribute.

@jrochkind
Copy link
Contributor Author

jrochkind commented Mar 21, 2018

@mattluce interesting, I definitely saw big improvements, and benchmarked it in my local test sandbox app.

(Of course, you've still got to load the style once, so if you benchmark loading the style in advance plus doing ONE processor instantiation, it will be exactly the same as one processor instantiation loading the style itself. Perf gains are only there if you load a style and then re-use it with multiple processors and/or renderers).

Let me check my test code to see what API's I was using... yep, it was the same one as you
CiteProc::Processor.new style: $style, format: 'html', as well as additionally some other API with a direct CiteProc::Ruby::Renderer.new. I suggested an API to make using a renderer directly a bit easier at #56, and some (IMO) README improvements at #57

@jrochkind
Copy link
Contributor Author

Hey @inukshuk, any chance of getting a release with the CSL::Locale re-use perf improvement in there from #54?

I was thinking it might make sense to get some of the other PR's I suggested in before a release, but I guess those are more controversial/challenging/need-thought. #54 is already merged, and I could really use it in a release, and don't want to have my production app depend on a fork! What do you think?

@inukshuk
Copy link
Owner

Yes, sorry, I'm running late here. I'll merge #54 and release it tomorrow!

@jrochkind
Copy link
Contributor Author

Thanks @inukshuk! You actually already did merge #54! it's in master, it just needs a release, heh.

@inukshuk
Copy link
Owner

Ha, indeed! I actually pushed a release with this as well -- 1.1.10 should include #54 already.

For the other changes, I want to finish date ranges first and then release those together.

@jrochkind
Copy link
Contributor Author

jrochkind commented Mar 27, 2018

Oh, okay, great, thanks! The other stuff is added bonus, but #54 is all I need to put my feature in production.

For now, I'm formatting date ranges myself and adding them to a literal field in the citeproc-json date, which works fine for my specific needs. If anyone is curious: sciencehistory/chf-sufia#1027

I'm going to close this ticket, thanks! And thanks for your advice and help @inukshuk!

jrochkind added a commit to sciencehistory/chf-sufia that referenced this issue Mar 27, 2018
@jrochkind
Copy link
Contributor Author

I wrote up technical details on our citation implementation in a sufia app, using citeproc-ruby, if anyone is interested. https://bibwild.wordpress.com/2018/04/04/another-round-of-citation-features-in-a-sufia-app/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants