
Inserting a large block is very slow #149

Open
sdh4 opened this issue Nov 14, 2017 · 4 comments

Comments

@sdh4

sdh4 commented Nov 14, 2017

Inserting a large block into a ProxyList can be very slow because of the _synchronise() method called after each insertion.

I propose an insert_multiple() method that performs multiple insertions in a single call:
insert_multiple.patch.txt

Using this patch can easily give an order-of-magnitude speedup for use cases, such as loop unrolling, that involve a large number of insert operations.
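The idea can be sketched with a toy stand-in for ProxyList (a hypothetical class, for illustration only; the real ProxyList and _synchronise() live in RedBaron's base_nodes.py and do much more work):

```python
class ToyProxyList:
    """Toy stand-in for RedBaron's ProxyList, illustration only."""

    def __init__(self):
        self.data = []
        self.sync_count = 0  # how many times the node list was "rebuilt"

    def _synchronise(self):
        # the real method rebuilds the whole node list from self.data
        self.sync_count += 1

    def insert(self, index, value):
        self.data.insert(index, [value, None])
        self._synchronise()

    def insert_multiple(self, index, values):
        # batch variant: defer the rebuild until all values are in place
        for value in values:
            self.data.insert(index, [value, None])
            index += 1
        self._synchronise()


slow = ToyProxyList()
for i, v in enumerate(range(100)):
    slow.insert(i, v)               # 100 synchronisations

fast = ToyProxyList()
fast.insert_multiple(0, range(100))  # 1 synchronisation

print(slow.sync_count, fast.sync_count)  # → 100 1
print(slow.data == fast.data)            # → True
```

Both paths produce the same final list; only the number of rebuilds differs.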

@rojaster
Contributor

Wassup @sdh4, does it take the input as a string and convert it to nodes?

P.S. Please post code in fenced code blocks, like this:

--- base_nodes.py.orig	2017-01-02 15:59:58.000000000 -0600
+++ base_nodes.py	2017-11-14 14:16:19.626943779 -0600
@@ -1375,6 +1375,17 @@
         self.data.insert(index, [value, None])
         self._synchronise()
 
+    # sdh4 11/14/17 add insert_multiple
+    def insert_multiple(self,index,values):
+        """ Insert multiple values from an iterable into the ProxyList 
+        at the specified index, without resynchronizing after each one. """
+        for value in values:
+            value = self._convert_input_to_node_object(value, parent=self.node_list, on_attribute=self.on_attribute)
+            self.data.insert(index, [value, None])
+            index+=1
+            pass
+        self._synchronise()
+    
     def append(self, value):
         self.insert(len(self), value)

What about conversion if we want to do multiple inserts from one RB tree to another?

@sdh4
Author

sdh4 commented Nov 15, 2017

Here's an improvement that can handle a broader range of sources (such as FSTs) because it uses _convert_input_to_node_object_list().

I'm not sure how best to do multiple inserts at different places with no synchronisation in between. I guess a data structure holding an array of indices and node lists could be used, but leaving the tree unsynchronised seems like it could be a bad idea.

This also seems to trigger an indentation bug, which I will report separately.

--- base_nodes.py.orig	2017-01-02 15:59:58.000000000 -0600
+++ base_nodes.py	2017-11-15 09:42:22.360961636 -0600
@@ -1375,6 +1375,17 @@
         self.data.insert(index, [value, None])
         self._synchronise()
 
+    # sdh4 11/14/17 add insert_multiple
+    def insert_multiple(self,index,values):
+        """ Insert multiple values from an iterable into the ProxyList 
+        at the specified index, without resynchronizing after each one. """
+        values = self._convert_input_to_node_object_list(values, parent=self.node_list, on_attribute=self.on_attribute)
+        for value in values:
+            self.data.insert(index, [value, None])
+            index+=1
+            pass
+        self._synchronise()
+    
     def append(self, value):
         self.insert(len(self), value)
 

@duncf
Contributor

duncf commented Nov 16, 2017

I haven't looked at the LineProxyList algorithm, but I took a pretty close look at the CommaProxyList algorithm a few months ago.

CommaProxyList was super slow for big files while computing the inter-element indentation. (node.indentation is shockingly slow.) I suspect that could be what's happening here. #146 has my changes. A similar change to LineProxyList might fix this and #150.

@sdh4
Author

sdh4 commented Feb 6, 2018

The speed problem here seems to come from the _synchronise() call completely rebuilding the inner list, and then the node list from the inner list, after every insertion. Done iteratively, the process ends up O(n^2), which blows up quickly for large n.
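The quadratic blow-up can be sanity-checked with a back-of-envelope model (an assumption for illustration, not a measurement of RedBaron: take _synchronise() to do work proportional to the current list length when it rebuilds):

```python
def cost_sync_each_insert(n):
    # rebuild after every one of n insertions: 1 + 2 + ... + n
    return sum(length for length in range(1, n + 1))

def cost_sync_once(n):
    # n raw insertions plus one final rebuild of length n
    return n + n

for n in (10, 100, 1000):
    print(n, cost_sync_each_insert(n), cost_sync_once(n))
# → 10 55 20
# → 100 5050 200
# → 1000 500500 2000
```

Per-insert synchronisation grows as n*(n+1)/2, i.e. O(n^2), while the batched variant grows as 2n, i.e. O(n).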

The patch above addresses the problem and does not seem to generate any other problems (#150 is a separate and independent problem). Shall I make a pull request?
