Harvester is slow #267

Open
ldesousa opened this issue May 29, 2015 · 6 comments

@ldesousa
Contributor

The harvester.py script takes 11 minutes to run, which might become an issue in the long term. Can it be optimised somehow?

@ldesousa ldesousa added the low label May 29, 2015
@uleopold
Contributor

It would be important to know what is slowing down the script.

Looking at the code, it is likely the loops that slow it down. Vectorised programming would probably improve this: avoiding loops and if/else statements where possible might improve speed substantially. You probably do not need to loop over each row in the database, as this is not an inherently sequential operation; it could be done all at once.

There are constructs such as map() and list comprehensions. See here:
https://wiki.python.org/moin/PythonSpeed/PerformanceTips
https://www.python.org/doc/essays/list2str/
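
As a minimal sketch of what those pages describe, here is the same per-row transformation written as an explicit loop, a list comprehension, and a map() call. The `rows` data and the `normalise_title` helper are hypothetical and are not taken from the harvester script; the code assumes Python 3.

```python
# Hypothetical stand-in for values read from the datasets table.
rows = [" Land Cover ", "soil MAP", "  Rainfall 2014"]


def normalise_title(title):
    """Hypothetical per-row operation: trim whitespace and lower-case."""
    return title.strip().lower()


# 1. Explicit loop
loop_result = []
for title in rows:
    loop_result.append(normalise_title(title))

# 2. List comprehension
comp_result = [normalise_title(title) for title in rows]

# 3. map(); wrapped in list() because map() returns a lazy iterator in Python 3
map_result = list(map(normalise_title, rows))

# All three forms produce the same list.
assert loop_result == comp_result == map_result
```

Whether any of these forms is measurably faster for the script in question would still have to be checked by timing it.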

@ldesousa
Contributor Author

It would be wiser to understand the cause of the slowdown before proposing random solutions. There are fewer than 200 rows in the datasets table; looping through them is certainly not the cause.

A programme without loops and ifs will not do much; the map function also produces a loop, just of a different kind. Also, keep in mind that Python is an interpreted language.
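
To put a rough number on this, a sketch along the following lines times one pass over 200 rows, once with an explicit loop and once with map(). The table contents and the per-row work are hypothetical stand-ins, not the real datasets table, and the timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the datasets table: 200 small records.
rows = [{"id": i, "name": "dataset-%d" % i} for i in range(200)]


def loop_over_rows():
    """Explicit loop with an if/else, doing trivial per-row work."""
    names = []
    for row in rows:
        if row["id"] % 2 == 0:
            names.append(row["name"].upper())
        else:
            names.append(row["name"])
    return names


def map_over_rows():
    """Same work expressed with map(); it still visits every row."""
    return list(map(
        lambda row: row["name"].upper() if row["id"] % 2 == 0 else row["name"],
        rows,
    ))


# Average the cost of a single pass over many repetitions.
repeats = 1000
loop_seconds = timeit.timeit(loop_over_rows, number=repeats) / repeats
map_seconds = timeit.timeit(map_over_rows, number=repeats) / repeats

print("loop: %.6f s per pass over 200 rows" % loop_seconds)
print("map:  %.6f s per pass over 200 rows" % map_seconds)
```

If both figures come out far below the 11-minute run time, the overhead of iterating is not the problem, and the work done inside each iteration is the place to look.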

@uleopold
Contributor

uleopold commented Jun 4, 2015

It is a discussion, not a solution. Profiling would certainly help to identify the causes. It is precisely because Python is an interpreted language that you need vectorised programming, which is why I suggested looking into it. The same holds in other languages, e.g. R. Anyway, profiling will probably tell you where the bottleneck is.
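
A minimal sketch of such a profiling run, using the standard-library cProfile and pstats modules. The `harvest` and `fetch_record` functions are hypothetical placeholders; the real script's function names and slow steps may be entirely different.

```python
import cProfile
import io
import pstats
import time


def fetch_record(i):
    """Hypothetical slow step, standing in for whatever dominates the real script."""
    time.sleep(0.001)
    return i


def harvest():
    """Hypothetical entry point; the real script's structure may differ."""
    return [fetch_record(i) for i in range(20)]


# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
records = harvest()
profiler.disable()

# Sort by cumulative time so the most expensive call chains appear first,
# and limit the report to the top 10 entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

A whole script can also be profiled without modifying it by running it through `python -m cProfile -s cumulative` followed by the script path.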

@ldesousa
Contributor Author

ldesousa commented Jun 4, 2015

I have never seen the term "vectorised programming" before; you are perhaps referring to array programming, but that is completely unrelated to this issue. It is also unrelated to the fact that Python is an interpreted language.

Profiling could help, but there are certainly easier ways to study this issue.

@uleopold
Contributor

uleopold commented Jun 4, 2015

"Vectorised" is a well-established term. It applies in particular to interpreted languages such as Matlab, R and Python, where it means operating on whole lists of strings or arrays at once.

As interpreted languages are very slow at looping, it is advisable to avoid loops where possible, in particular inner loops, by using vectorised programming. The only problem is that you need to rethink the code implementation when avoiding loops. All of the above is quite relevant, and it would be more productive to first study the literature before flagging terms as unrelated and inappropriate.
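
A small sketch of the idea in plain Python, in the spirit of the list2str essay linked earlier: the same string is built once with an interpreted loop and once by handing the whole list to str.join, which does its looping inside the interpreter's C implementation. The `words` data is hypothetical, and the timings are printed rather than assumed, since they depend on the machine and Python version.

```python
import timeit

# Hypothetical data: 10,000 short strings.
words = ["item%d" % i for i in range(10000)]


def concat_with_loop():
    """Build the string one piece at a time in an interpreted loop."""
    result = ""
    for word in words:
        result += word
    return result


def concat_with_join():
    """Hand the whole list to str.join, which loops in C."""
    return "".join(words)


# Both approaches must give the same string.
assert concat_with_loop() == concat_with_join()

loop_seconds = timeit.timeit(concat_with_loop, number=100)
join_seconds = timeit.timeit(concat_with_join, number=100)
print("loop: %.4f s, join: %.4f s" % (loop_seconds, join_seconds))
```

For numeric arrays the usual tool for this style is NumPy, which is not shown here.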

@ldesousa
Contributor Author

ldesousa commented Jun 4, 2015

I clearly lack the knowledge for this task.
