-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fill the table with few threads #325
Comments
|
Allow me just to put some further meat on Michael's answer... and allow me
I might get a bit off-topic when talking on history and design.. ;-)
1a. Michael is totally correct, ODFDOM was not implemented to have thread
safeness in mind, still if the threads are not editing simultaneously the
document (writing in the same place) , but work on specific parts "just
filling a distinct part of the table", then you "just need to make certain
that they do not write simultaneously on the same data".
1b. I will not dive into the problem too deeply, but if you can manage to
divide the spreadsheet in parts (parts of rows) and let the parts be
created by different threads and afterwards be able to glue these parts in
the end together - you should be safe - still uncertain if there has to be
some post work (references?)...
For instance, 300k of rows being split up into 100k and 100k and 100k. The
interesting question is how do you mark these?
Perhaps very user specific, if you know it is a large table and you know
that the action to be performed could be performed in multiple threads.
2+3. Some history...
We started decades ago to work on the ODF Toolkit and first on ODFDOM.
From the beginning we started to generate ODFDOM Java sources from the
RelaxNG XML grammar, but we could not use existing Java XML Binding (JAXB),
as JAXB works only with XSD grammars not RNG. Therefore we generated the
Java sources ourselves, first with XSLT later split the complexity within
the XSLT into
a) reading the RNG into Multi Schema Validator (MSV) and wrapping the MSV
model with our Generator (Schema2Template) module
<https://odftoolkit.org/generator/index.html>
b) to fill templates from Apache Velocity (Template Engine)
<https://velocity.apache.org/engine/2.0/user-guide.html> as well glued
together by our Generator (Schema2Template) module
<https://odftoolkit.org/generator/index.html>
The focus of our approach was to make the code developer friendly, we
created a DOM model for the ODF XML tree(s) of the ODF document.
Easy to debug and even 3rd party XSLT is able to be applied directly to
such an ODFDOM document - see XSLT Runner module
<https://odftoolkit.org/xsltrunner/ODFXSLTRunner.html>.
But a spreadsheet is in all documents format an anomaly in the document
trees, as it is a two dimensional array within a tree structure.
The spreadsheet can become arbitrary big and with some (hopefully past)
naive ODF usages of ODF application emulating the background color of a
table by the repetition of thousands of rows and setting the background
color in each cell, were each row and cell is being created as DOM objects,
would certainly kill any performance and consume far too much memory for
that kind of table background color feature, it is simply inefficient and
unnecessary.
To avoid the larger DOM creation some program designs sought to express the
loaded XML data with as little memory as possible - like the Xerces XSD
validator is using bits for boolean attributes (not cells but same
problem), instead of full Java objects. This is just one of the
optimizations Michael is likely talking about.
But "never" as fast as calc as Michael stated, is a strong word, but never
with a DOM implementation is surely correct.
Still ODF XML runtime by DOM is in the end just an implementation detail.
We can go even a step further, if you load a document in LibreOffice, the
user often does not know what kind of syntax the document was loaded from.
It could have been an ODT, docbook, doc, docx, rtf, etc. as long the
feature set fills into the document syntax - and there is an
importer/exporter -there is no difference. Quite similar to liquid that was
filled into a glass, the features can be saved and reopened in various
syntaxes of file formats . Therefore, nowadays I would even "layer
away"/hide the specific XML dialect (syntax) with a semantic user layer.
Such a semantic layer would focus on concepts as paragraphs, tables,
characters and styles, which ar all used across many file formats, just
with level of feature support various for the syntaxes (for instance, text
files are very poor with format & styles, but markdown can rescue the basic
features).
Therefore, wrapping all these syntaxes with a semantic layer would have
been a good idea from the beginning.
To embrace multithreading from the beginning, I assume that even our file
format requires some refactoring and honestly to me it does not matter if a
format is using XML, JSON and binary variations for its syntax as long as
there are open-source transformations on standardisation level available
for everyone to change from variant into the other...
Back to your question, the most important tipp and next step for you was
already given by Michael. Take a Java profiler... :-)
Good luck!
Svante
…On Wed, 11 Sept 2024 at 10:52, Michael Stahl ***@***.***> wrote:
1.
ODFDOM isn't threadsafe, so the client needs to make sure it's only
accessed by one thread at a time.
2.
ODFDOM architecture isn't really a good fit for processing "huge"
spreadsheets, it's much better to use LibreOffice Calc for big
spreadsheets, with its highly optimized C++ core... you can use it from
Java via UNO API
3.
if you want to speed up ODFDOM, start by profiling where it's spending
all the time - i bet it can be made substantially faster than it is
currently, but it will never be performance competitive with Calc
—
Reply to this email directly, view it on GitHub
<#325 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGJNWYTD3RFSSFADK25SKDZWAAEBAVCNFSM6AAAAABOAIZWR2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBTGA2DCMZYHA>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
svanteschubert thanks for this deep answer. and it does not work - file is not readable if i do it in single thread there is no any issues and file is readable so i think odf isnt threadsafe even your threads write to different coords of table |
@apb12 When you unzip the ODF document file (rename .ods to .zip and
unzip), you will realize that ODF in the content.xml - like other document
formats put the cell and its content into the rows. A column has no cell or
cell content but only attributes like column colour.
When you think of a two-dimensional array like a chess board, the chess
figure is standing equally in a certain row and column, but to avoid data
duplication the cell is only in the row. Therefore, you might consider
splitting and filling the table by rows and not by columns, as otherwise,
the writing on shared variables is more likely as the threads accessing the
same row.
You need to debug with a Java IDE - I usually use Netbeans and IntelliJ
IDEA - as you need to see what is happening and what corrupts the model in
the end?
Do you already know what is exactly corrupted? Have you taken a look at the
ODF XML?
I guess that there is concurrent write access to some static member or
shared objects, either directly in ODFDOM or indirectly in the underlying
Xerces DOM implementation.
I think you should change your scenario - threads writing into cells of
different rows (like 3 threads sharing access and the first thread gets
access to the first third of rows, the second thread to the second third of
rows, etc...
It requires some investigation. You might want to give it a try! ;-)
Good luck!
Svante
…On Wed, 11 Sept 2024 at 18:35, apb12 ***@***.***> wrote:
svanteschubert <https://github.com/svanteschubert> thanks for this deep
answer.
But im not sure for 1a. This code does not write data in same plase. All
that it does fill the column completely from top to bottom
so if you have 14 columns 14 threads will fill all of them and every
thread write only to same coumn. That means threads will never write in
same cell, its impossible because every thread gets his own column number
and works with it only.
thead 1 - get cell at position(0, 0) write data to it then going down and
get cell at position(1, 0) then (2,0)
thead 2 - get cell at position(0, 1) write data to it then going down and
get cell at position(1, 1) then (2,1)
and it does not work - file is not readable
if i do it in single thread there is no any issues and file is readable
so i think odf isnt threadsafe even your threads write to different coords
of table
may be it works for rows, that mean first thread fill first 100k rows next
one fill nex 100k and finally we will get table with 200k rows
—
Reply to this email directly, view it on GitHub
<#325 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGJNW2NQDHFHBJEVRFU2L3ZWBWOXAVCNFSM6AAAAABOAIZWR2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBUGE2DAMJSGI>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
svanteschubert i agreed with you, seems like we only can separate rows by threads not columns. I will try it out btw. Thx for answering. |
svanteschubert also want to say, that there is no any issue or exception, so debug do not show you anything helpfull on this case. Everything going good until you try to open this file xD |
@apb12
Can you unzip the result and take a look at the content.xml? Just curious!
BTW if you desire to debug with Xerces-J - the used underlying DOM
implementation, by incident I have pushed a testing-branch on GitHub for
build with Netbeans & IntelliJ-IDEA
<https://github.com/svanteschubert/xerces-j/tree/smoke-testing>, the
original Xerces uses an ANT file, which did not work for me...
…On Wed, 11 Sept 2024 at 20:47, apb12 ***@***.***> wrote:
svanteschubert <https://github.com/svanteschubert> also want to say, that
there is no any issue or exception, so debug do not show you anything
helpfull on this case. Everything going good until you try to open this
file xD
—
Reply to this email directly, view it on GitHub
<#325 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGJNW7PGXLBHDMVKDVDQ4DZWCF3PAVCNFSM6AAAAABOAIZWR2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBUGQZDKNZUGQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
@apb12
To add a debuggable Xerces-J (Java) you should better download the latest
Xerces-J as ZIP <https://xerces.apache.org/mirrors.cgi#binary> as there are
all required JARs on root level.
For instance, for the IDE Netbeans
<https://netbeans.apache.org/front/main/index.html> you do the following
1. Create a new Project with existing Java sources
2. Choose JDK 8
3. Add the root src and the sample directory as Java Sources
4. Add all JARs from the root (aside XercesImpl.jar, which you are going
to build) to your class path
5. JUnit and Hamcrest JAR dependencies were added by Netbeans, when you
fix the project (context menu on the project).
Happy debugging,
Svante
On Wed, 11 Sept 2024 at 22:31, Svante Schubert ***@***.***>
wrote:
… @apb12
Can you unzip the result and take a look at the content.xml? Just curious!
BTW if you desire to debug with Xerces-J - the used underlying DOM
implementation, by incident I have pushed a testing-branch on GitHub for
build with Netbeans & IntelliJ-IDEA
<https://github.com/svanteschubert/xerces-j/tree/smoke-testing>, the
original Xerces uses an ANT file, which did not work for me...
On Wed, 11 Sept 2024 at 20:47, apb12 ***@***.***> wrote:
> svanteschubert <https://github.com/svanteschubert> also want to say,
> that there is no any issue or exception, so debug do not show you anything
> helpfull on this case. Everything going good until you try to open this
> file xD
>
> —
> Reply to this email directly, view it on GitHub
> <#325 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AAGJNW7PGXLBHDMVKDVDQ4DZWCF3PAVCNFSM6AAAAABOAIZWR2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBUGQZDKNZUGQ>
> .
> You are receiving this because you commented.Message ID:
> ***@***.***>
>
|
this works
this not works
commom method
If i do it in single thread it works, but it take too long - 14 columns 590k rows ,it take more then 30 mins..
ive tried to do it with code above in multithreading - each column will be writing in new thread and it works rly fast but file that i get is not readable.
Is there any chance to do it faster? My app is based on your lib. Please any idea
The text was updated successfully, but these errors were encountered: