\chapter{Towards reproducible, open science through open research software}
\section{Introduction}
% FIXME: Deleted
% Science: giants, collective, grows when shared
% Web: money, open-source, tools esp. DVCS
% Computing: performance, GPU
% Changes: dynamics, way science is done, coding inferior, open science
% see http://sciencecodemanifesto.org/
% http://lorenabarba.com/gallery/reproducibility-pi-manifesto/
% https://www.numfocus.org/
%
% http://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261
% We provide examples of solutions for: Good coding practice, source control
% management, packaging, licenses documentation, testing,CI
% Fluiddyn: foster open-science, source. Framework for collaborative dev, open
% methods
To attain the research goals of this thesis, we pursued a path off the beaten
track: creating a handful of easy-to-maintain, rigorously tested and, most
importantly, open-source scientific packages which do not compromise on
performance.
%
This resulted in the creation of FluidDyn, a project to facilitate open-science
in the field of fluid mechanics.
%
The focus of the project was to develop a set of packages implementing a
framework for simulations \citep{fluidsim}, experiments
\citep{augier_fluidimage_2016,augier_fluidlab_nodate} and Fast Fourier Transforms
\citep{fluidfft}, all depending on a base package which provides shared utilities
and data structures \citep{fluiddyn}.
%
The project owes a lot to the recent developments in methods and tools for
open-source software engineering.
%
Several modern coding practices were adopted while developing these packages.
The FluidDyn packages are primarily written in the Python language. Incidentally,
the defining characteristics of Python are what make the eventual success of
such a project realizable.
%
Python is one of the most important tools in recent open-source dynamics,
particularly in science.
%
The popularity of the language in fluid mechanics is growing, but Python is yet
to reach widespread adoption as the language of choice.
We shall now discuss these topics in greater detail. The chapter is organized
into four sections. The first section focusses on the technical aspects, with
particular focus on emerging developments in programming languages, tools and
services suited for open-source software development. The subsequent section
presents Python's strengths and weaknesses as a language for scientific
computing.
%
The third section compares various software development methodologies. Here, we
will also discuss the possible contradiction between the productivity of
individuals and productivity at the community level.
%
The chapter finally ends with a summary of the motivations for the project
FluidDyn and an overview of the packages we have developed.
% The chapter ends with a note on the capabilities of FluidDyn
% packages\footnote{We use FluidDyn (with capital letters) to name the project
% and \fluiddyn for the base package.}.
%
% \section{Science, software, open-source and the computer revolution}
\section{Open-source software and open-science}
Computational sciences, and especially computational fluid dynamics (CFD) as a
discipline, have flourished in the latter half of the twentieth century.
Incidentally, the origin of this discipline can be traced back to the ideas in
\citet{richardson_weather_1922} to achieve numerical weather prediction.
Following the arrival of the first electronic general-purpose computer, ENIAC, in
1945, the first simulations were made by \citet{charney_numerical_1950} using a
simple barotropic model to make 24-hour forecasts \citep[see
also][]{lynch_richardson_2010}. In those early days when the availability of
computers was limited, such methods were neither widespread nor feasible.
Scientific investigations were dominated by theoretical and experimental
methods.
%
We have come a long way and today computational methods complement theoretical
and experimental studies, and are equally important.
%
% Thus, in a typical scientific study, there is significant effort involved in
% translating the mathematical representation to a working research software.
%
With the connectivity provided by the World Wide Web, it is possible to achieve
more and to reinvent the way we do science.
\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{open_science}
\caption{A schematic representation of various concepts and methodologies
involved in achieving reproducibility in science.}\label{fig:opensci}
\end{figure}
In recent years, there has been a move towards open science
\citep[see the report by][]{royal_society_great_britain_science_2012}.
%
Depicted in figure~\ref{fig:opensci} are some of the steps needed to ensure
reproducibility and transparency in science.
%
Traditionally, \emph{knowledge} has been disseminated in the form of courses
and books from universities, and to the wider scientific community through
research articles, conference proceedings and workshops.
%
In the last decade, knowledge has become more accessible to the public
through public-domain and Creative Commons (or similar) licensed course
materials and massive open online courses (MOOCs). Moreover, there is also an
emerging push to publish in open-access journals\footnote{In the European
Union, a proposal to implement Plan S (\url{https://www.coalition-s.org}) was
set in motion in September 2018 to fast-track the adoption of open access.
} either through systematic
reforms in academia on how researchers' productivity is assessed, or through
regulations and incentives, such as waiving the article processing
charge for publishing \citep{nosek_promoting_2015}.
A lot of work is required to condense and implement this knowledge in the form
of working applications.
%
The \emph{know-how} can be made accessible and open to scrutiny by making
research software open-source with an appropriate license\footnote{For a
detailed comparison of licenses, visit either
\url{https://choosealicense.com} or
\url{https://www.gnu.org/licenses/license-list.html}. }.
%
To make codes usable they have to be complemented by documentation which
typically includes tutorials, examples and a detailed commentary of different
components.
%
Furthermore, to inspire confidence in a code and to ensure its reliability
throughout the process of software development, it needs unit tests, and a
process through which it is continuously built, tested and deployed, known as
Continuous Integration (CI). Depending on how mature a piece of software is, CI
can be easy or cumbersome to implement, but the result is nevertheless a cleaner
and more reliable code.
Finally, the \emph{results} generated from such codes together with the
workflow (i.e.\ scripts to run the codes and to post-process the data) when made
open in the form of datasets can ensure that research is reproducible and
preserved in archives for several years to come \citep{gewin_data_2016}. Open
datasets can be shared and made citable through services such as
\href{https://zenodo.org}{Zenodo} and \href{https://figshare.com}{Figshare}.
In the FluidDyn project, we have strived to implement all the aspects of making the
know-how open through open-source software. We shall now take a look at the
tools which have made it possible.
\subsection{Methods and tools for open-source software engineering}
\subsubsection{Free and Open-Source Software}
The Free and Open-Source Software (FOSS) movement has given rise to the tools
that enable open-source software development today.
%
The term \emph{free} in FOSS is a misnomer, as it actually stands for
\emph{freedom} (to use, modify and distribute). The FOSS movement has
dramatically decreased the cost of using computers, which is evident from the
widespread use of GNU/Linux systems in desktops, computing clusters and
web-servers in academia and beyond.
%
The FOSS culture can be traced back to the success of the typesetting
standards \TeX\ (1977) and \LaTeX\ (1985) for authoring scientific publications.
%
Another founding moment for the FOSS movement was the launch of the GNU project
in 1983 by \href{https://en.wikipedia.org/wiki/Richard_Stallman}{Richard
Stallman}, to create a Unix-like computer operating system composed entirely
of free software\footnote{For completeness, see also the work done on
\href{https://www.levenez.com/unix/}{other Unix operating systems, for
example BSD}.}.
%
GNU is today known for its compiler collection (GCC) and a multitude of tools
which, combined with the Linux kernel (created by
\href{https://en.wikipedia.org/wiki/Linus_Torvalds}{Linus Torvalds} in 1991),
form the GNU/Linux operating system
% and its several distributions (Debian, Red Hat, Slackware, Arch Linux and
% Gentoo to name a few)
that we are familiar with today.
%
% FOSS movement has had huge successes in many frontiers --- Apache could be
% termed as the ``first killer-app of Linux'' and now with an ever-increasing
% suite of softwares including Firefox, LibreOffice, Gimp as solid alternatives
% to proprietary offerings.
%
Linux has become the most widely used kernel, being deployed on servers,
personal computers, embedded devices, and also smartphones (with Android). The
success of the Linux kernel project has been attributed to the \emph{bazaar} model of
software development, as described in \citet{raymond_cathedral_1999}, wherein a
code can stay structured despite frequent releases and the delegation of tasks to a
community with myriad agendas.
% remark Julien Salort: not interesting. Nothing on BSD.
% https://www.levenez.com/unix/
Over the years, FOSS development has transitioned from an organic community of
volunteers, towards an organized system with participation from industries,
non-profit organizations and government institutions.
%
Such a model results in a win-win scenario: users benefit from transparency and
the ability to tinker, while organizations profit from more contributions.
%
The FOSS movement has now entered a second era
\citep{fitzgerald_transformation_2006}.
% - 90s arrival of internet (soon mass market)
% Some huge open-source successes : Apache ("first killer app of Linux"),
% now Firefox, Open-office
% av: https://www.reddit.com/r/AskReddit/comments/7x639l/what_free_software_is_so_good_you_cant_believe/du6pw11/
% Linux kernel now widely used on servers, personal computers, embedded
% devices, smarth phone (with Android).
% https://en.wikipedia.org/wiki/Linux_kernel
% Now, new period for open-source:
% see http://www.cepis.org/upgrade/files/full-2005-III.pdf
% - "Libre Software Movement: The Next Evolution of The IT Production
% Organization?"
% - ``the composition of development teams was changing, from all-volunteer
% teams to teams with paid participants from industry, government or
% not-for-profit organizations.''
% By the way, we also have to use the term "libre software" (?)
% av: Libre-software is not so mainstream. Even Linux kernel is not libre with
% binary blobs for drivers.
% We may have to change the order of this list?
% - git and mercurial 2005
% - software repositories ~ 10 years before
% so I would exchange the too paragraph...
% av:That is probably because we did not mention patches and CVS, subversion
% etc.
\subsubsection{Source code management}
In the early years of FOSS development, collaboration was achieved by
emailing ``patches'' and, near the turn of the 21\textsuperscript{st} century,
through centralized Version Control Systems (VCS). Today, collaborative
development is streamlined by Distributed Version Control Systems (DVCS),
in particular Git and Mercurial.
%
Version control tools are designed to work on any kind of text files, not just
codes, so they could even be utilized to write scientific documents with
multiple authors.
%
In web-based source development platforms like GitHub, Bitbucket and GitLab,
DVCS repositories are enhanced with code-review tools, typically consisting of
an issue tracker and a pull-request mechanism.
%
Git has become the \textit{de facto} standard for source management. However,
for the FluidDyn project, we prefer to use Mercurial because we consider it to
be simpler for beginners and just as powerful for expert users\footnote{See
{``a detailed comparison''} on the Mercurial website,
\url{https://www.mercurial-scm.org/wiki/GitConcepts}.
}. Simplicity and low learning barriers for
scientists who are not experts in software development are crucial for our
project.
%
Note that it is now possible to use Mercurial to work with Git repositories,
for example those hosted on GitHub and GitLab\footnote{By using the Mercurial
extension \href{http://hg-git.github.io/}{hg-git}.}.
%
The use of DVCS in science is expected to bring long-term productivity benefits
\citep{wilson_best_2014}.
\subsubsection{Package management and software repositories}
The packaging process depends on the language used in a project; however, some
good practices are universal, such as licensing, supplying documentation (a README
file at the bare minimum), keeping installation paths flexible and relying on
official build tools \citep{taschuk_ten_2017}.
%
A credible repository which distributes packages, either as compressed archives
or as pre-compiled binaries, is indispensable to promote code reuse and
reproducibility of results.
%
The use of repositories started in the 1990s with the conception of the
CPAN and CRAN repositories for the languages Perl and R, modelled on
the Comprehensive \TeX\ Archive Network (CTAN) for \TeX\ packages.
%
Python as a medium for scientific computing owes its success to a rich
ecosystem of third-party packages. The \href{https://pypi.org}{Python Package
Index (PyPI)} and \href{https://anaconda.org}{Anaconda Cloud} are the major
repositories for publishing and downloading Python packages, assisted by the
package managers \pack{pip} and \pack{conda} respectively. The
\fnref{https://packaging.python.org}{``Python Packaging User Guide''} is an
excellent guide to learn how to install, package and distribute projects.
% Almost all major
% languages\footnote{\href{https://stackoverflow.com/questions/1693529/%
% list-of-top-repositories-by-programming-language}{https://stackoverflow.com/%
% questions/1693529/list-of-top-repositories-by-programming-language}} have
% similar repositories, with exceptions such as Fortran and Matlab.
Specialized package managers, such as \pack{easybuild} and \pack{spack}, have
also emerged to install scientific libraries and compilers.
\subsubsection{Continuous Integration and documentation} The need for
strong reliability of packages with increasing complexity has forced developers
to use Continuous Integration (CI).
%
When the code depends on a wide variety of external dependencies, CI is useful
to avoid \fnref{https://en.wikipedia.org/wiki/Software\_rot}{software rot},
i.e.\ to avoid obsolescence with newer versions of the compiler or of its
dependencies. CI is especially important for dynamic languages and
open-source code, since it allows open-source projects to welcome contributions
from other developers while maintaining a high degree of reliability.
%
Firstly, the code is \emph{built} against a known, repeatable emulation
environment on a server. Thereafter, a set of \emph{unit tests} is run for
each commit (modification of the code) to check that no bugs have been
introduced. It is also important to analyse what portion of the source code is
covered by these unit tests, a measure referred to as \emph{code coverage}.
%
Almost all CI platforms are free for open-source projects: Travis CI,
CircleCI and AppVeyor, to name a few. Bitbucket and GitLab have built-in CI
implementations. Websites such as Codecov and Coveralls help to chart and
analyse the code coverage history, viewable by anyone. To use these services,
one only has to write a short configuration file in YAML format and grant the
respective services ``read'' permissions.
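
To make the notion of a unit test concrete, the following minimal sketch uses
the \codeinline{unittest} module of the Python standard library. The function
\codeinline{rms} and its three test cases are hypothetical illustrations and
are not taken from the FluidDyn packages; a CI service would typically run
tests of this kind automatically for each commit.

\begin{verbatim}
import math
import unittest


def rms(values):
    """Return the root mean square of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("rms() requires at least one value")
    return math.sqrt(sum(v * v for v in values) / len(values))


class TestRms(unittest.TestCase):
    def test_constant_signal(self):
        # The RMS of a constant signal is the absolute value of the constant.
        self.assertAlmostEqual(rms([-2.0, -2.0, -2.0]), 2.0)

    def test_known_value(self):
        # sqrt((3**2 + 4**2) / 2) = sqrt(12.5)
        self.assertAlmostEqual(rms([3.0, 4.0]), math.sqrt(12.5))

    def test_empty_input(self):
        # Invalid input must fail loudly rather than return a wrong number.
        with self.assertRaises(ValueError):
            rms([])
\end{verbatim}

Such tests can be run locally with \codeinline{python -m unittest}; a command
of this kind is typically what the CI configuration file asks the service to
execute.
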
New users and developers often rely on good documentation and examples.
Documentation generators such as Sphinx (for Python) and Doxygen (for many
languages including C, C++, Fortran and Python) parse in-line comments into
documentation in PDF, HTML and other formats \citep{lee_ten_2018}.
%
The documentation generation can be automated online and deployed using any CI.
``Read the Docs'' has, in recent years, become the hub for building and hosting
documentation online, thus facilitating the process for developers and
considerably improving access to information for users.
%
In the FluidDyn project we limit ourselves to automating tests and
documentation of the latest version. It is possible to exploit
CI further by running ``linters'' alongside unit tests to check the code style,
by setting up \emph{continuous delivery} of package releases and by preserving
older versions of the documentation.
\subsubsection{Knowledge hubs and communication}
% Mutual aid and knowledge sharing have been important aspects of the web since
% its inception and these practices are now of foremost importance for many
% human activities.
%
%In particular,
Learning programming and computer science is now greatly aided by websites like
the community-driven encyclopaedia \href{https://www.wikipedia.org/}{Wikipedia}
or the community-driven forum \href{https://stackoverflow.com/}{Stack
Overflow}, where people write thousands of questions and answers on programming
each week. The IRC protocol has stood the test of time. Instant
messaging channels such as \codeinline{\#python} on Freenode facilitate
interactive discussions and form a community-level support system, a role
which used to be filled by mailing lists alone in the past. Alternatives such
as Riot (Matrix protocol), Gitter, and Slack have gained popularity by offering
different integration solutions on top of instant messaging.
%
Through instant messaging a geographically separated team, as in the case of
the FluidDyn project, can achieve near real-time collaboration without having
to arrange video conferences or meetings.
Thus, there are a number of new methods and tools in software development.
Taken all together, they greatly improve our collective efficiency and
open possibilities for interactions that were unthinkable a few years ago.
%
For example, the success of community-driven software like \pack{astropy} (a
subject-area research library with
\href{https://github.com/astropy/astropy}{more than 240 contributors}) and
\pack{scikit-learn} (a machine learning toolkit with
\href{https://github.com/scikit-learn/scikit-learn}{more than 1000
contributors}) would not have been possible without these new software
development methods.
%
As scientists, we can wonder how to fully exploit these new possibilities to do
science. The project FluidDyn is an attempt to advance the fluid dynamics
community in this direction.
\section{Python, a programming language adapted for open-science}
We now turn to the presentation of the
\href{https://www.python.org/}{programming language Python}, another
fundamental tool for the FluidDyn project. Note that this is not a detailed
technical presentation: we focus on the ideas necessary to understand why the
FluidDyn project is based on this language and on facts of interest to
potential FluidDyn users and developers.
The Python community has thrived remarkably in the last few years, which is
evident from the fact that by the end of 2017, Python had become the second most
popular language in terms of pull requests on GitHub\footnote{%
%See GitHub's annual report titled \href
\url{https://octoverse.github.com/2017}
% { ``The State of the Octoverse 2017''}.
%
} and
\fnref%
% [See the blog post ``The Incredible Growth of Python'']%
{https://stackoverflow.blog/2017/09/06/incredible-growth-python/}%
{questions on StackOverflow}.
%
The TIOBE index\footnote{\url{https://www.tiobe.com/tiobe-index}}, which assesses the
popularity of a language based on the number of skilled users, courses and
third-party vendors, ranks Python at number four. IEEE Spectrum\footnote{
%See the analysis titled
\url{https://spectrum.ieee.org/computing/software/the-2017-top-programming-languages}
%{``The 2017 Top Programming Languages''} on IEEE spectrum.
}
ranks Python as the most popular language in 2017 with a metric
based on trends, demand by employers and users on open-source hubs. All the
rankings agree that Python usage has been steadily growing and is here to stay.
According to these statistics, Python largely outranks languages like Fortran
and Matlab, and is now clearly one of the mainstream languages alongside C, Java,
C++ and JavaScript.
%
The language has been designed to boost the communication of technical ideas
between humans.
%
Let us summarize some characteristics of the Python language that have led to
such remarkable success.
%
Python differs from other mainstream languages for a variety of reasons:
\begin{itemize}
\item \textbf{Aesthetics.} The syntax is expressive, with an inherent emphasis
on readability. Blocks of code are defined with the indentation.
\item \textbf{Code style.} There are official (but optional) guidelines, named
\fnref{https://www.python.org/dev/peps/pep-0008/}{PEP~8}, advocating for
regularity of the code. Linters and code-formatters can evaluate and enforce
adherence to this style.
\item \textbf{Automatic memory management.} A garbage collector handles the
memory allocation and deallocation, which needs no user intervention. There
are, of course, ways to control the memory usage of a program, though not as
precisely as for low-level languages.
\item \textbf{Dynamic typing.}
Types are attached to objects rather than to variable names and are checked
at run time, so in most situations one does not need to declare them explicitly.
\item \textbf{Interpreted.} The standard way\footnote{There are also tools to
compile Python code (for example Cython and Pythran).} to run Python code is
to ``interpret'' it. A program called the interpreter executes the code
nearly\footnote{Strictly speaking, the source is first parsed into an Abstract
Syntax Tree and compiled to bytecode, which the interpreter then executes.}
instruction-by-instruction, with very few optimizations.
%
In contrast to compiled languages, such as Fortran, C or C++, there is no
proper compilation step: the code is not translated to optimized machine
instructions, but only to Python bytecode.
\item \textbf{Multi-paradigm.} Imperative, object-oriented, functional
and aspect-oriented programming are supported.
\item \textbf{Extensible}. It is easy to interface with code written in
several other languages (in particular C, C++ and Fortran).
\item \textbf{Batteries included.} Python comes with a large
\fnref{https://docs.python.org/3/library/index.html}{standard library}.
\item \textbf{Cross platform.} Python can be deployed on many different
machines with different operating systems (Linux, Windows, macOS, Android) and
architectures (from a microcontroller --- with MicroPython or CircuitPython ---
to a Blue Gene supercomputer).
\end{itemize}
%
Thus, Python is one of the best ``glue'' languages for fast prototyping.
Code development in Python is much faster and easier than with many other
languages. Fewer bugs are introduced by the developers simply because there are
fewer lines of code \citep{nanz_comparative_2015}.
%
Maintaining and modifying existing code is also facilitated by the very good code
readability associated with the language.
%
Another consequence of the apparent simplicity of Python is that the learning
curve is gentle. Getting started with Python is easy, and the complexity
encountered increases gradually with the level of the developer.
%
Its interpreted nature is advantageous for a rapid development cycle and the
possibility of interactive workflow (usually with IPython or Jupyter
notebooks).
%
Therefore, Python is good for developers at all levels: very gentle for
beginners and very powerful for advanced users.
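
Two of the characteristics listed above, dynamic typing and the translation to
bytecode rather than to machine instructions, can be observed directly with the
standard library. The sketch below is only an illustration: the function
\codeinline{double} is a hypothetical example, and the exact list of bytecode
instructions printed at the end depends on the version of CPython.

\begin{verbatim}
import dis


def double(x):
    """Return x + x; no type is declared for x."""
    return x + x


# Dynamic typing: the same function accepts any object that supports "+".
print(double(21))        # integer addition
print(double(1.5))       # floating-point addition
print(double("ab"))      # string concatenation
print(double([1, 2]))    # list concatenation

# Types are still respected: an unsupported operation fails at run time.
try:
    double(None)
except TypeError as error:
    print("TypeError:", error)

# A name can be rebound to objects of different types.
value = 1
print(type(value).__name__)
value = "one"
print(type(value).__name__)

# CPython compiles the function to bytecode, not to machine instructions.
instructions = [instr.opname for instr in dis.get_instructions(double)]
print(instructions)
\end{verbatim}
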
Python is an old language (first implementation in 1991!) but it continues to
evolve.
%
Note that there are many implementations of Python
interpreters\footnote{Notable ones are CPython (written in C), Jython (Java),
IronPython (C\#), PyPy (RPython, a restricted subset of Python) and
MicroPython/CircuitPython (C, targeted to micro-controllers).}. The default and
most widely used implementation is written in C and is called CPython.
%
A hard and controversial decision was taken to clean up Python from its
historical inconsistencies by introducing a new series of backwards-incompatible
versions of the language, starting with Python 3.0, released in December 2008.
After a long and difficult process, the transition from Python 2 to Python 3 is
complete for nearly all important packages\footnote{Close to
\href{https://python3wos.appspot.com/}{95\% of the most downloaded packages}
are Python 3 compatible as of 2018.}. Especially for scientific applications,
we can now work only in Python 3. For instance, \href{https://github.com/%
numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst}{future
versions of \Numpy} and \fnref{http://www.python3statement.org/}{the main
scientific packages} will not be compatible with Python 2. These changes
open doors to a very clean, coherent and potentially faster Python
experience. Therefore, one should not use Python 2 for science any more and
instead, adopt the newer versions.
%
New versions of Python bring features which adapt the language to new usages and
to recent trends in computer science. For example, three new features were
introduced in Python 3.5 (first released in September 2015): the \codeinline{@}
operator for matrix multiplication; the new \codeinline{async} and
\codeinline{await} keywords for concurrency; and type hinting\footnote{See also the
paragraph on the issue of type checking below.} (with the module
\codeinline{typing} and an associated syntax presented in
\href{https://www.python.org/dev/peps/pep-0484/}{PEP~484}).
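
These three features can be sketched together as follows. This is only an
illustration under stated assumptions: the class \codeinline{Matrix} and the
coroutines are hypothetical, the class is a pure-Python stand-in for a \Numpy
array, and the call to \codeinline{asyncio.run} requires Python 3.7 or later.

\begin{verbatim}
import asyncio
from typing import List


class Matrix:
    """Tiny dense matrix; a stand-in for a NumPy array in this sketch."""

    def __init__(self, rows: List[List[float]]) -> None:
        self.rows = rows

    def __matmul__(self, other: "Matrix") -> "Matrix":
        # Called by the expression "self @ other".
        columns = list(zip(*other.rows))
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in columns]
             for row in self.rows]
        )


async def delayed_square(x: float, delay: float) -> float:
    # "await" hands control back to the event loop while sleeping.
    await asyncio.sleep(delay)
    return x * x


async def main() -> List[float]:
    # Both coroutines wait concurrently; results keep the argument order.
    return await asyncio.gather(
        delayed_square(2.0, 0.02), delayed_square(3.0, 0.01)
    )


product = Matrix([[1.0, 2.0], [3.0, 4.0]]) @ Matrix([[0.0, 1.0], [1.0, 0.0]])
print(product.rows)  # [[2.0, 1.0], [4.0, 3.0]]

# asyncio.run requires Python 3.7 or later.
squares = asyncio.run(main())
print(squares)  # [4.0, 9.0]
\end{verbatim}

Note that the type hints are not enforced when the code runs; they are meant
to be verified by an external static type checker.
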
Being a very versatile language, Python is widely used for many different
applications\footnote{\href{https://www.jetbrains.com/research/devecosystem-2018/python/}{See
\emph{The State of Developer Ecosystem Survey in 2018} by JetBrains}}:
\begin{itemize}
\item Simple scripting
\item Web development
\item Data science
\item System, database and network administration
\item GNU/Linux distribution software
\item GUI desktop applications
\item Scripting layer for applications (for example Paraview, Visit, QGIS,
Blender)
\item Web scraping
\item Animation, gaming and film industry
\item Education
\item Science
\end{itemize}
%
Python
% has recently become one of the most popular language for teaching programming
% and developing scientific applications. It
now has a mature and powerful scientific ecosystem with well-established
foundational packages
%
(\Numpy for N-dimensional homogeneous arrays, \Scipy as the
fundamental toolkit for scientific computing, \pack{Matplotlib} for plotting
and \pack{Pandas} for data structures) and several more specialized packages
(to name a few, \pack{h5py}, \pack{mpi4py}, \pack{skimage}, \pack{sklearn}).
The \href{https://github.com/rougier/python-visualization-landscape}{%
landscape of visualization tools usable through Python} makes it possible to
integrate post-processing and analysis capabilities into a single application.
%
Python is one of the main languages for data science with packages such as
\pack{Pandas}, \pack{statsmodels}, \pack{sklearn}, \pack{Keras},
\pack{TensorFlow} and \pack{PyTorch}.
%
There are user-friendly and ready-to-use Python installers (similar to Matlab), in
particular the
\href{https://en.wikipedia.org/wiki/Anaconda_(Python_distribution)}{%
open-source distribution Anaconda}, which streamline the installation on
desktop platforms.
Integrated development environments (IDE) adapted for
scientists, for example \href{https://github.com/spyder-ide/spyder}{Spyder},
\href{https://pyzo.org}{Pyzo},
and \href{https://jupyterlab.readthedocs.io}{JupyterLab} are often useful for
interactive development.
Python has a very large and supportive community (see, for example,
\fnref{http://stackoverflow.com/tags/python}{Stack Overflow tags}).
%
Python development is also supported by companies, most of which specialize
in web technologies and data-driven research. However, companies
involved in scientific software development have recently become quite
influential. For example, Anaconda Inc.\ is strongly engaged in the development
of tools like the JIT Python accelerator \pack{Numba} and the parallel
computing framework \pack{dask}. Hardware giants such as Intel and Nvidia also
contribute to the Python scientific ecosystem.
%
To summarize, Python's numerous features and its adoption in several domains
make it a sensible language choice for scientific purposes.
\subsection{Some Python issues and their solutions}
In the context of scientific computing, Python developers commonly face a few
issues. They are listed below together with brief solutions, which are
explained in further detail in the subsequent paragraphs.
\begin{itemize}
\item \textbf{Performance in CPU-bound tasks.} Since the standard Python
interpreter does not compile code to native machine instructions, pure Python
code for some CPU-bound tasks can be too slow. Thus, one has to avoid writing
explicit loops and use vectorized array operations wherever possible.
One can also call \emph{foreign functions}, or use special
tools to build extensions using \emph{Ahead Of Time} or \emph{Just In
Time} compilation to speed up performance-critical code.
\item \textbf{Absence of type checking.} In contrast to compiled languages,
Python (like many interpreted languages) performs no type checking
before execution. Types are nevertheless enforced, and type errors,
if any, are raised at runtime.
\emph{Type hints} and \emph{static type checking} can mitigate this issue.
\item \textbf{Concurrent, but not parallel multi-threading.} Threads behave
like lightweight sub-programs. In principle, they can use the different cores
of the CPU at the same time, but the CPython interpreter forbids threads
from interpreting Python code at the same time\footnote{See
\url{https://opensource.com/article/17/4/grok-gil}
%{a blog article ``How to grok the GIL''}
% and \url{https://faster-cpython.readthedocs.io/cpython37.html}{a
% technical presentation by a CPython core developer}.
}. In other words, threads can coexist concurrently, but do not execute in
parallel.
%
This limitation comes from an important detail in the implementation of the
CPython interpreter: a Global Interpreter Lock (GIL) to prevent race conditions
that could corrupt data.
%
The GIL greatly simplifies the implementation of CPython, and
it is very difficult to remove it while keeping other nice technical properties
of CPython intact\footnote{See
\url{https://wiki.python.org/moin/GlobalInterpreterLock}. Note that PyPy
also uses a GIL, while two other implementations of the Python language, Jython
and IronPython, do not have this limitation.}.
\item \textbf{Lively, and thus, complicated ecosystem of packages.} For many
applications, the standard library is insufficient and external
packages are needed. Sometimes there are multiple options and the
user/developer has to make choices regarding the tools to use.
Especially for beginners who are not used to the open-source realm,
making good technological choices can be a confusing prospect, but
this can be solved through proper guidance and training.
Of course, such an abundance of projects can also be seen as an advantage.
\end{itemize}
\subsubsection{Foreign functions}
%
Python is primarily a ``glue'' language. A common solution to achieve
performance is to use existing code written in C, C++ or Fortran.
%
It is possible, without a compilation step, to call functions in C libraries
through pure Python code using the standard \pack{ctypes} module or the \pack{cffi}
package.
%
Similarly, \href{https://cppyy.readthedocs.io}{\pack{cppyy}} can interpret C++
code. Libraries with Python bindings (e.g.\ wrapped using SWIG or
Boost.Python) can also be used directly. Otherwise, one has to write
extensions, i.e.\
Python modules written in C or C++ using the CPython API that can be imported
directly from Python like any Python module.
%
However, relying on such techniques to call C libraries slows down the
development of new functionality, as these techniques have a steeper learning
curve than plain Python.
%
Nevertheless, this strategy has given rise to the base modules of the
scientific Python stack, namely \Numpy, \Scipy and \pack{Matplotlib}.
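%
As a minimal sketch of the first approach, the snippet below calls the
\codeinline{cos} function of the C mathematical library through
\pack{ctypes}, without any compilation step. It assumes a POSIX system
(GNU/Linux or macOS) on which this library can be located, and the variable
names are purely illustrative.
\begin{minted}[fontsize=\footnotesize]{python}
import ctypes
import ctypes.util

# Locate the C math library (assumption: POSIX system). If it is not found,
# find_library returns None and CDLL(None) falls back to the symbols
# already loaded in the current process.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double cos(double)
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

result = libm.cos(0.0)
print(result)
\end{minted}
%
Declaring the argument and return types explicitly matters here: without
them, \pack{ctypes} assumes that the function returns a C integer.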
\subsubsection{Ahead Of Time compilation}
Ahead Of Time compilation is a way of generating compiled extensions which
can be imported dynamically by Python.
%
Python developers do not actually have to write the extensions in C, C++ or
Fortran, since there are tools to generate them automatically from Python code
(Pythran) or from Cython code. Cython \citep{behnel_cython2011} is a smooth
blend of two languages with the possibility to add type declarations like in C/C++,
but with a syntax similar to Python.
%
There are also other projects to create extensions in C++ with \pack{pybind11},
and in Rust with \href{https://pyo3.rs}{PyO3}.
Pythran \citep{guelton2015pythran, guelton2018pythran} is a quite recent Python
compiler which gives impressive results. It creates compiled extensions from pure
Python code with simple type annotations written as comments.
%
The resulting extensions are usually as fast as Fortran or C++ written by
non-specialists \citep[see, for example, benchmarks for the packages
\fluidpack{fft} and \fluidpack{sim},][]{fluidfft, fluidsim}.
%
They are created with an interesting two-step compilation: first the code
is optimized at the Python level, and then the automatically generated C++ code
is compiled.
%
This two-step compilation opens possibilities for very clever optimizations%
\footnote{As demonstrated in this blog post by the main developer of
Pythran \url{http://serge-sans-paille.github.io/pythran-stories/%
being-more-than-a-translator.html}}.
%
Pythran supports OpenMP pragmas \citep{guelton2013compiling} and can use modern
vectorization with SIMD instructions \citep{guelton2014exploring}. Pythran
understands both Matlab-like vectorized code and C-like code with explicit loops
\citep{guelton2018pythran}.
%
In \citet{fluidfft, fluidsim}, we present examples of
highly efficient Python codes, optimized using a combination of Cython and
Pythran extensions.
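%
To give an idea of what these annotations look like, here is a minimal sketch
of a Pythran-ready module. The function name and its signature are
illustrative assumptions and are not taken from the packages cited above. The
export comment is ignored by the standard interpreter, so the file remains
valid pure Python, and compiling it with the \codeinline{pythran}
command-line tool should produce an extension exposing the same function.
\begin{minted}[fontsize=\footnotesize]{python}
# Type annotation written as a comment: it is only read by Pythran.
# pythran export dot_product(float list, float list)
def dot_product(u, v):
    """Naive dot product of two lists of floats, with an explicit loop."""
    result = 0.0
    for x, y in zip(u, v):
        result += x * y
    return result


print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
\end{minted}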
\subsubsection{Just In Time (JIT) compilation}
Another way to obtain optimized machine instructions is by using JIT
compilation, i.e.\ by compiling only the critical code at run time.
%
This strategy yields good results in other languages, for example Matlab
and Julia.
%
Adding a JIT to the CPython interpreter has been notoriously difficult%
\footnote{\url{https://faster-cpython.readthedocs.io/}}. One of the
problems seems to be the C API provided by CPython and used by many extensions.
%
A faster CPython with a JIT is actually not necessary for many purposes,
since Python extensions are enough. Therefore, fewer resources have been
invested in accelerating Python than in accelerating, for example, Java and
JavaScript.
PyPy, an alternative interpreter written in Python, has a JIT compiler. However,
PyPy is not widely used for scientific applications mainly because of
compatibility problems with the extensions written for CPython.
%
However, this could change since recent versions, starting with PyPy 6.0,
support the main packages of the Python scientific stack (\Numpy, \Scipy,
\pack{Matplotlib}, \pack{Pandas}, etc.).
%
Another strategy is to add a JIT to CPython through an external package
(\pack{Numba}).
%
\pack{Numba} is particularly interesting because it can take advantage of the
GPU\footnote{See for example
\url{https://devblogs.nvidia.com/parallelforall/seven-things-numba/}}.
\subsubsection{Type hints and static type checking}
%
Type-checking is useful since it can detect simple bugs.
\fnref{https://docs.python.org/3/whatsnew/3.5.html\#whatsnew-pep-484}{PEP-484}
introduced a syntax and a related \codeinline{typing} module to add
type-hints, typically for function parameters and return types.
%
Static type checkers like \href{http://mypy-lang.org/}{Mypy} and
\href{https://github.com/google/pytype}{pytype}%
\footnote{We plan to add type hinting in the most important FluidDyn modules to
investigate how we can take advantage of type-checking with Mypy.}
%
process type hints and report inconsistencies in the code.
However, even with statically typed languages, it is not good practice to rely
only on type checking to look for bugs in a code base.
%
It is known that a rigorous set of unit tests is also needed to obtain more
reliable code.
%
Dynamic languages rely heavily on unit tests, so it is very important to invest
at least some time and energy in writing a decent battery of tests.
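%
As a minimal sketch of the syntax, consider the hypothetical function below
(the names are ours and do not come from any FluidDyn package). With the
standard interpreter, the erroneous call only fails when it is actually
executed, whereas a static type checker such as Mypy is expected to report the
incompatible argument type without running the code.
\begin{minted}[fontsize=\footnotesize]{python}
from typing import List


def mean(values: List[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


print(mean([1.0, 2.0, 3.0]))

# The type hints are not enforced at runtime: the error below is raised only
# when the call is executed. A static checker would flag it beforehand.
try:
    mean("abc")
except TypeError as error:
    print("TypeError raised at runtime:", error)
\end{minted}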
\subsubsection{Multi-core computational parallelism}
As explained above, threads running Python code in the CPython interpreter
cannot use the CPU at the same time,
so it is not possible to use threads to do multi-core computational parallelism
with pure Python code to accelerate CPU-bound tasks.
%
Note however, that threads can be (and are widely) used in Python for
concurrency, i.e.\ to perform I/O tasks, which need little CPU, concurrently.
%
To do proper computational parallelism for CPU-bound tasks despite the GIL, one
has to use other strategies: extensions (fine-grained parallelism),
multiprocessing (coarse-grained parallelism) and inter-process communication,
for example, with MPI (with \pack{mpi4py}) or ZeroMQ.
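%
The remark on I/O concurrency can be illustrated with a minimal sketch using
only the standard library. The task function below is hypothetical: it
simulates a blocking I/O operation by sleeping, during which the GIL is
released. The four tasks should therefore complete in roughly the time of one,
whereas CPU-bound pure Python functions would not benefit from threads in this
way.
\begin{minted}[fontsize=\footnotesize]{python}
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(index):
    """Simulate a blocking I/O operation (hypothetical example)."""
    time.sleep(0.2)  # the GIL is released while the thread sleeps
    return index


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fake_io_task, range(4)))
elapsed = time.perf_counter() - start

# The four waits overlap, so the elapsed time is well below 4 x 0.2 s.
print(results, f"{elapsed:.2f} s")
\end{minted}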
% \subsubsection{Choosing the right package}
% %
% For a beginner in Python development, it is important to get a good
% introduction on how to work with the open-source workflows and understanding of
% the scientific Python stack.
% %
% It can also be very useful to ask to more advanced users and attend Python
% courses or workshops\footnote{This could be regular academic courses, MOOCs or
% workshops like ones organized by
% \href{https://software-carpentry.org/}{Software Carpentry}.}.
% %
\subsection{The state of languages for scientific computing}
So far, we have discussed the details of Python. However, there are many
interesting open-source languages that can be used for scientific applications.
All have strengths and weaknesses.
Fortran remains widely used for computing codes, at least in fluid mechanics.
C++ is used in many recent developments. Its recent improvements (C++11, C++14)
make it a modern and very useful language.
%
JavaScript, Java, Scala, Smalltalk, Haskell, R, Julia, Perl and Lua are used for
some scientific codes. Go and Rust are quite young languages with emphasis on
memory-safety and are not (yet) widely used for scientific applications.
%
It is, of course, very difficult to predict which technologies will be widely
used in even the near future (10 or 20 years for example).
%
It seems that the idea of ``one language to do everything for science'' will
not succeed, at least not in the near future.
%
The limited scope of these languages is likely the reason why the communities
around Fortran and Julia are not widespread outside academia.
%
An important aspect in the future would very likely be interoperability
between coexisting tools\footnote{See for example the cross-language development
platform for in-memory data, \href{https://arrow.apache.org/}{Apache Arrow}.}.
%
Considering what we have presented on Python, we think that it is reasonable to
bet that it will become a \textit{lingua franca} in science and in
fluid dynamics.
%
We have, therefore, chosen it to be the main language for the project FluidDyn.
\section{On software development methodologies}
\subsection{Productivity at individual, group and community levels}
% remark Julien Salort "It is well known that": bad pa: common sense says ? It
% is well known
It is not surprising
% av: "It is not surprising"
%
that short-term efficiency and long-term efficiency are sometimes incompatible.
%
Quick and dirty scripts can be efficient in the short term, but have a negative
impact on a longer time scale.
%
Similarly, we can also differentiate productivity at different levels of
collaboration.
%
An individual can be very efficient with a particular tool, but the same tool
can be very inefficient when used in a collaboration.
%
A group can be very efficient with a code which is closed-source. Although such
a strategy can be efficient for the group (at least in the short term), it may
lead to waste for the community. Other groups will need to develop codes with
the same features.
%
Good ideas will be spread in separate codes and since the number of users and
developers is small for each of these codes, the code quality will not
improve as fast as it would with an open-source strategy. In a nutshell, as
this proverb says, \textit{``If you want to go fast, go alone. But if you want
to go far, go together.''}
Thus, a community using open-source methods can be efficient in co-developing
its tools. However, some of its members have to spend more energy to sustain this
dynamic, at the risk of sacrificing their individual efficiency. We see that
short-term efficiency, long-term efficiency and productivity at different
levels can often be incompatible and that we have to consider these conflicts
when choosing between different technologies.
\subsection{Programming in the field of fluid mechanics}
We study fluid mechanics via laboratory experiments, in situ measurements,
analytic tools, numerical simulations and data processing.
%
Nowadays, programming is involved in all methods. It is usually difficult to do
fluid mechanics without software development.
%
However, on average, the level of software engineering expertise in the
community is very low.
%
Even today, it is quite commonplace to start a Ph.D. without any serious
training in GNU/Linux and modern programming tools.
%
A large majority of scientists and technicians are also unaware of the new
challenges and opportunities of open-source.
Since fluid dynamics is an engineering science, closed-source commercial
software has been dominant in the field historically.
%
Few people are aware of the vicious circle of the closed-source model for
people and groups relying on it.
%
In the closed-source model, a group pays for a license or a new development.
The group does not learn how to develop what has been paid for. The company, on
the other hand, gains from the feedback generated from users, through which it
improves the product. The group produces codes, books and courses using the
closed-source product, or acquires knowledge on how to use the closed-source
product, making the group increasingly dependent on the product and ready to
pay more for it.
%
It is difficult to break such vicious circles, but not impossible. This can be
done by introducing alternative open-source solutions.
Despite the reliance on commercial software for research in fluid mechanics,
there are also a lot of codes produced by scientists in fluid mechanics.
%
Without the technical knowledge on how to work collectively, such codes
are often substandard (compared to the information technology
industry standards) and are basically doomed to be abandoned after their use.
%
However, it is natural to try to reuse codes or at least to build the next steps
on what has been done already. Thousands of hours of highly qualified people
are spent in trying to understand and reuse badly written codes in
inappropriate languages!
From a technical perspective, the majority of the coding in the field involves
a mix of Fortran/C or C++, shell languages (such as Bash) and Matlab. For
experiments, the graphical programming environment
\href{http://www.ni.com/en-us/shop/labview.html}{Labview} is dominant for
control of physical objects and acquisition. Fortran is often used to implement
solvers and post-processing algorithms. Matlab is used for data processing and
visualization.
%
Often, languages are used for purposes to which they are not suited.
%
Fortran, C and C++ were not designed for fast prototyping of complex
programs.
%
It is known that for scientific purposes, one should restrict shell scripting
to extremely simple tasks.
%
Using compiled languages to develop new algorithms slows down the development
process and makes debugging harder.
%
Similarly, using Matlab for developing complex programs is a questionable
choice.
%
These technical solutions are also problematic for code reuse, sharing and
collaborative development.
%
We are now going to present factual arguments explaining why we think that the
massive usage of Matlab decreases the collective efficiency of the community.
\subsection{Discussion on the use of proprietary software for research}
% \paragraphbf{Discussion on Matlab, Labview and Mathematica}
Matlab is an example of a closed-source proprietary numerical computing
environment.
%
It is a good tool for simple processing with matrices, image processing, data
visualization and certain niche applications. The language is well adapted for
these tasks. The development environment is nice and the interpreter is quite
fast, especially now, when it has a JIT compiler.
%
However, the language suffers from serious technical issues which make Matlab
an inadequate tool
% \citationneeded[opinion] % pa: I think what we say is fine for an article. We
% are not writing a wikipedia page which has the ambition to be neutral. Here,
% we gives an argued opinion, which is totally normal in a scientific article.
% av: fair enough :)
for doing more than simple processing and data plotting. The comparison with
Python is often debated. We list a few striking Matlab weaknesses:
\begin{itemize}
\item One file for each function (or class) available outside the file where it
is implemented. No notion of package or module.
\item No real organization of the standard library. All built-in and
user-defined functions are available in a huge flat namespace. Due to this
absence of an import mechanism we cannot deduce, simply by looking at the code,
where a function is defined.
\item A standard way to organize multi-file code is to write scripts that
modify and define global variables (see for example
\fnref{https://www.damtp.cam.ac.uk/user/jrt51/files/diablo\_mat.tar.gz}{%
the Matlab version of the code Diablo}). It is so simple to do this that we
can assert that the language strongly encourages this practice. Matlab files are
not self-contained, i.e.\ it is normal to use in a file a global variable
defined outside of the file.
\item Very bad default argument mechanism.
%
Default arguments for functions are a very common feature in many programming
languages. In Python, we can write:
\begin{minted}[fontsize=\footnotesize]{python}
def myfunc(a, b, c=1, has_to_print=True):
    if has_to_print:
        print('a =', a, 'b =', b, 'c =', c)
    return c * (a + b)
\end{minted}
%
An implementation for approximately the same behaviour in Matlab could be (as
advised in the official Matlab documentation):
\begin{minted}[fontsize=\footnotesize]{matlab}
function ret = myfunc(a, b, varargin)
if nargin < 2 | nargin > 4
    error(['The number of arguments has to be ' ...
           'at least 2 and at most 4'])
end
if nargin == 4
    has_to_print = varargin{2};
else
    has_to_print = 1;
end