%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% LaTeX report template for the TFG/TFM - Universidad Rey Juan Carlos
%%
%% By Gregorio Robles <grex at gsyc.urjc.es>
%% Felipe Ortega <[email protected]>
%% Grupo de Sistemas y Comunicaciones (GSyC)
%% Escuela Técnica Superior de Ingenieros de Telecomunicación
%% Universidad Rey Juan Carlos
%%
%% (Many ideas taken from the Internet, GSyC colleagues, former students...
%% etc. Many thanks to everyone)
%%
%% The latest version of this template is always available at:
%% https://github.com/glimmerphoenix/plantilla-memoria
%%
%% - Building on a local system:
%%   To produce the PDF document, run in a shell:
%%   make
%%
%% Unlike the previous version, which used pdfLaTeX to compile the document,
%% this new version of the template uses XeLaTeX. It is a more modern
%% compiler that, among other improvements, includes native support for
%% UTF-8 encoded characters, polyglot translation of references (via
%% Biblatex) and support for OTF fonts. The latter feature makes it
%% possible, for instance, to insert icons from the Fontawesome collection
%% in the text.
%%
%% XeLaTeX is already included in all modern LaTeX distributions.
%%
%% - Online editing and compilation:
%%   You can download this template and upload it to Overleaf, a
%%   collaborative online LaTeX editor. Overleaf already has all the LaTeX
%%   packages and other software dependencies installed so that this
%%   template compiles correctly.
%%
%% IMPORTANT: If you compile this document on Overleaf, remember to change
%% the settings ("Menu" button in the top-left corner of the interface)
%% and choose Compiler --> XeLaTeX. Otherwise it will not work.
%%
%% - Note: images must be in PNG, JPG, EPS or PDF format. Images in other
%%   formats can also be used with some changes in the document preamble.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[a4paper, 12pt]{book}
%%-- Main geometry (leave the following line active in the final version)
\usepackage[a4paper, left=2.5cm, right=2.5cm, top=3cm, bottom=3cm]{geometry}
%%-- In draft mode, enable this line and comment out the previous one, to allow margin notes
%\usepackage[a4paper, left=2.5cm, right=2.5cm, top=3cm, bottom=3cm, marginparwidth=60pt]{geometry}
%%-- Must be loaded before the translations
\usepackage{listing} % Code listings
% Translations in XeLaTeX
\usepackage{polyglossia}
\setmainlanguage{english}
%\setmainlanguage{spanish} % Comment this line if your report is in English
\usepackage[normalem]{ulem} %% For tables
\useunder{\uline}{\ul}{} %% For tables
% Specific translations for Spanish
% Table captions
\gappto\captionsspanish{
\def\tablename{Table}
\def\listingscaption{Code}
\def\refname{Bibliography}
\def\appendixname{Appendix}
\def\listtablename{Table index}
\def\listingname{Code}
\def\listlistingname{Fragments of code index}
}
%% Typography and styles
\usepackage[OT1]{fontenc} % Keeps eulervm happy about accents encoding
% Elegant mathematical symbols and fonts: Euler virtual math fonts
% Important! Always load the AMS Euler math fonts BEFORE fontspec
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage[OT1,euler-digits,euler-hat-accent,small]{eulervm}
% In XeLaTeX, fonts are specified with fontspec
\usepackage{fontspec}
\defaultfontfeatures{Scale=MatchLowercase, Ligatures=TeX} % Default option in font config
% Fix for fonts used with operators and \mathrm
\DeclareSymbolFont{operators}{\encodingdefault}{\familydefault}{m}{n}
% Set the main (serif) font: TeX Gyre Pagella
\setmainfont[Scale=0.96]{TeX Gyre Pagella}
% Set the sans-serif font (\sffamily): Lato
\setsansfont[Scale=MatchLowercase]{Lato}
% Set the monospaced font: Source Code Pro, scale 0.85
\setmonofont[Scale=0.85]{Source Code Pro}
%%-- Specific font families
%%-- Labels can be defined for custom font families,
%%-- which you can later use to change the format of a piece of text
%%-- Example:
% \newfontfamily{\myriadprocond}{Myriad Pro Semibold Condensed.otf}
%%-- Line spacing and paragraph spacing options
\linespread{1.07} % Increase line spacing for Palatino-style fonts
\setlength{\parskip}{\baselineskip} % Separate paragraphs with a blank line
\usepackage[parfill]{parskip} % Keeps paragraph spacing consistent
%%-- Hyperlinks
\usepackage{url}
%%-- Graphics and tables
\PassOptionsToPackage{
dvipdfmx,usenames,dvipsnames,
x11names,table}{xcolor} % Colour definitions
\PassOptionsToPackage{xetex}{graphicx}
\usepackage{subfig} % Subfigures
\usepackage{pgf}
\usepackage{svg} % Support for images in SVG format
\usepackage{float} % H to position figures
\usepackage{booktabs} % Already loads package xcolor
\usepackage{multicol} % multiple column layout facilities
\usepackage{colortbl} % For coloured tables
\usepackage{lscape} % For landscape format
%%-- Bibliography with Biblatex and Biber
% More info:
% https://www.overleaf.com/learn/latex/Biblatex_bibliography_styles
% https://www.overleaf.com/learn/latex/biblatex_citation_styles
\usepackage[
backend=biber,
style=numeric,
sorting=nty
]{biblatex}
\addbibresource{memoria.bib}
\DeclareFieldFormat{url}{\mkbibacro{URL}\addcolon\nobreakspace\url{#1}}
%\usepackage[nottoc, notlot, notlof, notindex]{tocbibind} %% Index options
%%-- Mathematics and engineering
% The units package typesets units with correct spacing and font style
\usepackage[ugly]{units}
% Usage example: $\unit[100]{m}$ or $\unitfrac[100]{m}{s}$
% Mathematical environments
\newtheorem{theorem}{Theorem}
% Additional packages (url, float and metalogo were already loaded above)
\usepackage[nottoc, notlot, notlof, notindex]{tocbibind} %% Index options
\usepackage{metalogo} %% XeTeX and XeLaTeX logos
% Special fonts and glyphs
\usepackage{ccicons} % Creative Commons icons
\usepackage{fontawesome5} % Fontawesome 5 icons
\usepackage{adforn}
% Blindtext
% Options: pangram, bible, random (default)
\usepackage[pangram]{blindtext}
% Lorem ipsum
\usepackage{lipsum}
% Kant lipsum
\usepackage{kantlipsum}
\usepackage{fancyvrb} % Extended verbatim environments
\fvset{fontsize=\normalsize} % Default font size in fancy-verbatim
% Configure lists (itemize, enumerate) with custom icons
% Easy numbering restart with enumerate
% Info: http://ctan.org/pkg/enumitem
\usepackage[shortlabels]{enumitem}
% Use \usageitem to set custom icons in lists
\newcommand{\usageitem}[1]{%
\item[%
{\makebox[2em]{\strut\color{GSyCblue} #1}}%
]
}
%%-- Custom colour definitions
% \definecolor{LightGrey}{HTML}{EEEEEE}
% \definecolor{darkred}{rgb}{0.5,0,0} %% Cross-references
% \definecolor{darkgreen}{rgb}{0,0.5,0} %% Bibliographic citations
% \definecolor{darkblue}{rgb}{0,0,0.5} %% Ordinary hyperlinks (also ToC)
%%-- Code fragment configuration
%%-- Minted needs Python Pygments installed on the system to work
%%-- This dependency is already installed on Overleaf
% \usepackage[center, labelfont=bf]{caption}
\usepackage{minted}
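%%-- Example (illustrative) of a short Python listing typeset with minted:
% \begin{minted}{python}
% def greet(name):
%     print(f"Hello, {name}!")
% \end{minted}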
%%-- Must be loaded here to avoid warnings
\usepackage{csquotes} % For translations with biblatex
%%-- Glossary of terms
\usepackage[acronym]{glossaries}
\makeglossaries
\loadglsentries{glossary}
% % Document header definitions, using fancyhdr
% \usepackage{fancyhdr}
% %% Header configuration for the main body of the document
% \pagestyle{fancy}
% \fancyhead{}
% \fancyhead[RO,LE]{\myriadprocond{\thepage}}
% \renewcommand{\chaptermark}[1]{\markboth{\chaptername\ \thechapter.\ #1}{}}
% \renewcommand{\sectionmark}[1]{\markright{\thesection.\ #1}}
% \fancyhead[RE]{\myriadprocond{\leftmark}}
% \fancyhead[LO]{\myriadprocond{\rightmark}}
% \renewcommand{\headrulewidth}{0pt}
% \setlength{\headheight}{15pt} %% At least 15pt to avoid a warning when compiling
% \fancyfoot{}
% %% Configuration for pages with a blank header
% \fancypagestyle{plain}{%
% \fancyhf{}% clear all header and footer fields
% \fancyhead[RO,LE]{\myriadprocond{\thepage}}
% \renewcommand{\headrulewidth}{0pt}%
% \renewcommand{\footrulewidth}{0pt}%
% }
%%-- Document metadata
\title{Automatic identification of bot accounts in open-source projects}
\author{Miguel Ángel Fernández Sánchez}
%%-- Hyperlinks; always load at the end of the preamble
\usepackage[colorlinks]{hyperref}
\hypersetup{
pdftoolbar=true, % Show toolbar in Adobe Acrobat
pdfmenubar=true, % Show menu in Adobe Acrobat
pdftitle={MSc Thesis},
pdfauthor={Miguel Ángel Fernández},
pdfcreator={ETSII/ETSIT, URJC},
pdfproducer={XeLaTeX},
pdfsubject={Topic1, Topic2, Topic3},
pdfnewwindow=true, % links open in new window
colorlinks=true, % false: boxed links; true: coloured links
linkcolor=Firebrick4, % internal links
citecolor=Aquamarine4, % links to bibliographic citations
urlcolor=RoyalBlue3, % ordinary hyperlinks
linktocpage=true % Links on page numbers in ToC
}
%%%---------------------------------------------------------------------------
% Inline review comments
% This block can be deleted once the draft is finished
% \usepackage[colorinlistoftodos]{todonotes}
% \usepackage{verbatim}
%%%---------------------------------------------------------------------------
\begin{document}
%%-- Common configuration for all listing environments
%%-- Uncomment to use and customise values
%\lstset{%
%breakatwhitespace=true,
% breaklines=true,
% basicstyle=\footnotesize\ttfamily,
% keywordstyle=\color{blue},
% commentstyle=\color{green!40!black},
% language=Python}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% TITLE PAGE
\begin{titlepage}
\begin{center}
\begin{tabular}[c]{c c}
%\includegraphics[bb=0 0 194 352, scale=0.25]{logo} &
\includegraphics[scale=1.5]{img/LogoURJC.png}
%&
%\begin{tabular}[b]{l}
%\Huge
%\textsf{UNIVERSIDAD} \\
%\Huge
%\textsf{REY JUAN CARLOS} \\
%\end{tabular}
\\
\end{tabular}
\vspace{3cm}
\Large
MÁSTER EN DATA SCIENCE
% MASTER'S DEGREE IN DATA SCIENCE
\vspace{0.4cm}
\large
Curso Académico 2022/2023
% Academic Year 2022/2023
\vspace{0.8cm}
Trabajo Fin de Máster
% Master's Thesis
\vspace{2cm}
\LARGE AUTOMATIC IDENTIFICATION OF BOT ACCOUNTS IN OPEN-SOURCE PROJECTS
\vspace{3cm}
\large
Autor : Miguel Ángel Fernández Sánchez \\
Tutor : Dr. José Felipe Ortega Soto
\end{center}
\end{titlepage}
\newpage
\mbox{}
\thispagestyle{empty} % so this page is not numbered
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Signature page
\clearpage
\pagenumbering{gobble}
\chapter*{}
\vspace{-4cm}
\begin{center}
\LARGE
\textbf{Trabajo Fin de Máster}
\vspace{1cm}
\large
Automatic Identification of Bot Accounts in Open-Source Projects.
\vspace{1cm}
\large
\textbf{Autor :} Miguel Ángel Fernández Sánchez \\
\textbf{Tutor :} Dr. José Felipe Ortega Soto
\end{center}
\vspace{1cm}
La defensa del presente Trabajo Fin de Máster se realizó el día 20 de abril
\newline de 2023, siendo calificada por el siguiente tribunal:
\vspace{0.5cm}
\textbf{Presidente:}
\vspace{0.8cm}
\textbf{Secretario:}
\vspace{0.8cm}
\textbf{Vocal:}
\vspace{0.8cm}
y habiendo obtenido la siguiente calificación:
\vspace{0.8cm}
\textbf{Calificación:}
\vspace{0.8cm}
\begin{flushright}
Móstoles, a 20 de abril de 2023
\end{flushright}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Dedication
\chapter*{}
%\pagenumbering{Roman} % to start page numbering with Roman numerals
\begin{flushright}
\textit{A mi familia y amigos, \\
gracias por vuestro apoyo.}
\end{flushright}
\vspace{2cm}
\begin{flushright}
\textit{To my family and friends, \\
thank you for your support.}
\end{flushright}
% \vspace{4cm}
% \begin{flushright}
% \textit{Have you ever been light years away \\
% from what you want to be? \\
% (...) \\
% I was light years away \\
% Now I've got that sunshine in my life.\\
% - "Light Years", by Jamiroquai.}
% \end{flushright}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Acknowledgements
\chapter*{Acknowledgements}
%\addcontentsline{toc}{chapter}{Acknowledgements} % if we want it to appear in the table of contents
\markboth{Acknowledgements}{Acknowledgements} % header
% The acknowledgements go here\ldots
% There is room to elaborate and explain who you thank for their support or help in
% finishing the project: family, partner, friends, classmates\ldots
% In some cases, people even thank their project tutor(s)
% for the help provided\ldots
I want to thank my classmates from this Master's degree for their help and support throughout the courses and this final phase, especially Edgli, Enrique, and David. The COVID-19 pandemic arrived in our lives in the middle of our degree, so special recognition goes to them and to the Master's professors for the extra help and cooperation.

To my family and friends, who stood by my side all this time, bearing with me through endless promises of finishing this thesis once and for all: your support kept me confident in the most challenging times. I especially want to thank my friend Quan, who also helped me solve technical issues and questions during the project.

To my tutor, Dr. Felipe Ortega, for accepting such a challenge and for the great help and guidance he provided during this process, with a lot of patience, great advice, and wise teachings. This encouraged me to keep improving and growing, academically and personally, throughout this project and beyond.

Last but not least, to Bitergia, for supporting me in the course of this Master's degree; and also to Professor Tom Mens, Professor Alexandre Decan, and PhD student Mehdi Golzadeh from the University of Mons (Belgium), for guiding me during the early stages of this project with their generous ideas and knowledge.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Abstract
%%%% Abstract in English
\chapter*{Summary}
%\addcontentsline{toc}{chapter}{Summary} % if we want it to appear in the table of contents
\markboth{SUMMARY}{SUMMARY} % header
People participating in software projects (in particular, in Free, Open-Source projects) rely on many tools and platforms to support their activity on many facets, such as code review or bug management. Within this scenario, automatic accounts (also known as \emph{bot} accounts) are commonly used in software development to automate and ease repetitive or particular tasks.
Identifying these bot accounts and their activity in the projects is crucial for anyone willing to measure many aspects of the software project and the community of contributors behind it. \emph{GrimoireLab} is a tool that provides metrics about the software development process, including a component to manage the contributors' identities, with an option to mark individual profiles as bots. Nonetheless, this labelling process is entirely manual.
In this MSc thesis, a \emph{Python} tool is developed to detect bots automatically, based on their profile information and their activity in the project. This tool can be integrated as a component inside the \emph{GrimoireLab} toolchain. To this aim, we used GrimoireLab to analyse the code changes produced between January 2008 and September 2021 in a set of software projects from the Wikimedia Foundation, manually labelling the bot accounts generating activity in order to create an input dataset for training a binary classifier that detects whether a given profile is a bot.

After testing different classification models using the \emph{Scikit-learn} module for \emph{Python}, the best-performing model was a ``Random Forest'' classifier, whose most relevant features were a term score calculated from domain-related heuristics and statistical values obtained from the individuals' activity, such as the number of changes in source code or the number of words and files per code change submitted to the projects.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter*{Resumen}
%\addcontentsline{toc}{chapter}{Resumen} % if we want it to appear in the table of contents
\markboth{RESUMEN}{RESUMEN} % header
Las personas que participan en proyectos de \emph{software} (y en particular en proyectos de \emph{software} libre y código abierto) se apoyan en varias herramientas y plataformas para interactuar y tratar con diferentes aspectos de estos proyectos, tales como la revisión de código o la gestión de errores o \emph{bugs}. En este contexto, las cuentas automáticas (también conocidas como cuentas \emph{bot}) se usan frecuentemente en el desarrollo de software para automatizar y simplificar ciertas tareas repetitivas o específicas.

Para cualquier persona interesada en medir ciertos aspectos de un proyecto de software y de la comunidad de personas que lo sustenta, es crucial identificar estas cuentas \emph{bot} y su actividad. \emph{GrimoireLab} es una herramienta que proporciona métricas sobre el proceso de desarrollo de \emph{software}, que incluye un componente para la gestión de los perfiles de contribuidores. Dicho componente cuenta con una opción para marcar aquellos perfiles que pertenezcan a una cuenta \emph{bot}. Sin embargo, este proceso de etiquetado es enteramente manual.

En este Trabajo de Fin de Máster se propone una herramienta desarrollada en \emph{Python} para detectar automáticamente cuentas \emph{bot}, integrable como un componente dentro de \emph{GrimoireLab}, utilizando como base la información de los perfiles de los diferentes individuos y de su actividad en el proyecto analizado. Para desarrollar esta herramienta se han analizado con \emph{GrimoireLab} los cambios en el código de un conjunto de proyectos de \emph{software} de la Fundación Wikimedia, producidos entre enero de 2008 y septiembre de 2021, etiquetando manualmente aquellas cuentas \emph{bot} activas en ese periodo, con el propósito de crear un conjunto de datos (\emph{dataset}) de entrada para entrenar un clasificador binario que detecte si un determinado perfil pertenece a una cuenta \emph{bot} o no.

Tras probar diferentes modelos de clasificación usando el módulo \emph{Scikit-learn} para \emph{Python}, el modelo que mejores resultados obtuvo fue un clasificador de tipo \emph{Random Forest}. Entre sus características más relevantes destaca el empleo de una puntuación numérica calculada a partir de heurísticos de este dominio de aplicación, junto con valores estadísticos obtenidos de la actividad de los individuos, tales como el número de cambios o el número de palabras y ficheros de cada cambio producido en los proyectos analizados.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%--------------------------------------------------------------------
% List of review comments
% This block can be deleted once the draft is finished
%\listoftodos
%\markboth{TODO LIST}{TODO LIST} % header
%%%%--------------------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INDEXES %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% The good news is that the indexes are generated automatically.
% All you have to do is choose which ones you want generated,
% and comment/uncomment the corresponding LaTeX instruction.
%%-- Table of contents
\tableofcontents
\cleardoublepage
%%-- List of figures
\addcontentsline{toc}{chapter}{List of Figures} % so it appears in the table of contents
\listoffigures % list of figures
\cleardoublepage
%%-- List of tables
\addcontentsline{toc}{chapter}{List of Tables} % so it appears in the table of contents
\listoftables % list of tables
\cleardoublepage
%%-- List of code listings
\addcontentsline{toc}{chapter}{List of Listings} % so it appears in the table of contents
\listoflistings
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INTRODUCTION %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\cleardoublepage
\chapter{Introduction}
\label{chap:intro}
\pagenumbering{arabic} % to start page numbering with Arabic numerals
% This chapter introduces the project.
% It should contain general information about it, describing the context in which it was developed.
% Don't forget to have a look at the page with the five most frequent writing mistakes\footnote{\url{http://www.tallerdeescritores.com/errores-de-escritura-frecuentes}}.
% I advise everyone to look at (and take inspiration from) past reports.
% The reports of the projects I have supervised are (almost) all available on my GSyC website\footnote{\url{https://gsyc.urjc.es/~grex/pfcs/}}.
% \section{Section}
% \label{sec:seccion}
% This is a section, a structure one level below a chapter.
% By the way, you sometimes tell me that the document does not compile because of accented characters.
% That is an encoding problem.
% When saving the file, switch the encoding from ``ISO-Latin-1'' to ``UTF-8'' (or vice versa) and it will work.
% \subsection{Style}
% \label{subsec:estilo}
% I recommend reading Diomidis Spinellis' practical advice on writing scientific documents in \LaTeX\footnote{\url{https://github.com/dspinellis/latex-advice}}.
% Read about the use of commas\footnote{\url{http://narrativabreve.com/2015/02/opiniones-de-un-corrector-de-estilo-11-recetas-para-escribir-correctamente-la-coma.html}}.
% In Spanish, commas are not placed at random.
% And never, ever between the subject and the predicate (e.g. the comma in ``Yo, hago el TFG'').
% A comma must not separate the subject from the predicate of a sentence, as it would break the natural flow of the discourse.
% The use of the so-called breathing comma or \emph{criminal comma} is not considered appropriate.
% A comma is usually only written to mark the place left by an omitted verb, a case that occurs very infrequently in scientific writing (e.g. ``El Real Madrid, campeón de Europa'').
% Next comes a figure, Figure~\ref{figura:foro_hilos}.
% Note that the text inside the reference is the identifier of the figure (matching the ``label'' defined inside it).
% You will also have noticed how ``double quotes'' are written so that they display correctly.
% Note that there are opening quotes (``) and closing quotes (''), and that they are different.
% Back to references: on the first compilation pass a dictionary of references is created, and on the second pass those references are ``filled in''.
% That is why you must compile your report twice.
% Otherwise, the references will not be resolved.
People contributing to software projects (in particular, FLOSS projects) rely on several tools to support their activity on many aspects of the project, such as source code changes, project management and coordination, and software bugs or issues~\cite{dabbish-et-al-socialcoding-github12}. Data generated by such interactions can be used to extract valuable information that project managers and leaders can use to make the right decisions for the future of the project (so-called data-driven decisions). Some of the most common questions when analysing an open-source project are:
\begin{itemize}
\item How many contributors are participating?
\item How many companies contribute to the project?
\item How good are these participants at handling issues?
\end{itemize}
These data are also interesting for academic purposes, as researchers and practitioners may be interested in answering a set of questions about a given project~\cite{hemmati-et-al-msr-cookbook13}.
\section{Identity problems}
\label{sec:intro-identities}
From a project management perspective, a person (generally with a manager role) needs to know their community or project. In order to get valuable insights, that person may ask two main questions:
\begin{itemize}
\item How many unique contributors does the project have?
\item How many different organisations are contributing to the project?
\end{itemize}
To answer these questions, we must manage contributor identities within the project.
Given the plethora of different tools used within FLOSS projects, it is important to note that, to interact with each of these tools, every project contributor must be identified in some way. This is usually done by creating an account or setting up a set of credentials, typically a combination of name and email. This means each contributor will end up with one or more different ``accounts'' or ``identities'' across the services the project is using.

In such a scenario, some contributors may use multiple accounts or sets of credentials (from now on, \textit{identities}) for the same tool or service, for instance, to differentiate the contributions made through an organisational account from those made from a personal or academic account. We use the term \textit{individual} for the entity that represents the many identities of a contributor, together with their profile and enrolment information. This alone entails one of the hardest challenges: how to merge the identities owned by the same individual.
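To make this terminology concrete, the fragment below sketches a simplified data model in \emph{Python}. It is a toy structure for illustration only, not SortingHat's actual schema; all class and field names here are assumptions.
\begin{minted}{python}
from dataclasses import dataclass, field

@dataclass
class Identity:
    source: str        # e.g. "git", "github", "gerrit"
    name: str
    email: str
    username: str = ""

@dataclass
class Individual:
    profile_name: str
    is_bot: bool = False       # profile flag for automatic accounts
    identities: list = field(default_factory=list)
    enrollments: list = field(default_factory=list)  # organisations

# The same contributor seen through two different tools:
jane = Individual("Jane Doe",
                  identities=[Identity("git", "Jane Doe", "jane@corp.example"),
                              Identity("github", "Jane Doe", "jane@uni.example",
                                       username="jdoe")],
                  enrollments=["Corp", "University"])
\end{minted}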
This is where SortingHat, a component that is part of the GrimoireLab toolset (see~\ref{ssec:GrimoireLab}), comes into play. This tool aims to ease the task of managing contributors' identities within a project or set of projects~\cite{moreno_et_al-sortinghat}. It will be described in detail in Section~\ref{sec:SortingHat}.
\begin{figure}
\centering
\includegraphics[width=16cm, keepaspectratio]{img/example-identity}
\caption{A project contributor can use many accounts across different tools and platforms, besides having a number of organisational affiliations.}
\label{fig:example-identity}
\end{figure}
\section{Automatic accounts: bots}
\label{sec:intro-bots}
It is essential to state that some interactions occurring within software development tools are not directly created by humans. Instead, they stem from automated processes set up with a specific purpose and permission level to produce output that affects the state of the project and its members.
This type of interaction is very common in open-source projects~\cite{erlenhov-et-al-emprirical-study-bots-oss20}, including top-level projects and communities such as GitLab\footnote{\url{https://gitlab.com/gitlab-org/gitlab}}, Wikimedia Foundation\footnote{\url{https://wikimediafoundation.org/our-work/wikimedia-projects/}} and OpenStack\footnote{\url{https://www.openstack.org/software/}}.
Some bots scan and re-post information, whereas others can also have a formal authority role associated with task evaluation. They can even play a management role combining evaluation and formal authority with interactive coordination, among other examples~\cite{hukal-et-al-bots-coordinating-oss}.
But why is it important to identify bot accounts in open-source projects? A substantial reason is that their presence challenges any researcher or stakeholder interested in analysing the activity within a software project. Although these accounts are usually ignored in studies, they may play an important role, as there are cases where they account for a significant percentage of the total activity (e.g., projects where bots are responsible for accepting or rejecting 25\% of all pull requests\footnote{A request for integrating changes into a repository.})~\cite{golzadeh-mens-ground-truth-github2021}.
The number of bot accounts and their interactions depends on many factors, like:
\begin{itemize}
\item Type and purpose of the tool or service (issue management, messaging, bug tracker, etc.).
\item Whether this is an option provided by default by the tool or it is an \textit{ad-hoc} feature.
\item The way these automated accounts (bots) are configured: triggered by events, periodic execution, etc.
\item The amount of activity generated by humans or by other automatic accounts within the project.
\end{itemize}
SortingHat provides a way to mark unique identities as ``bot'' accounts by editing the identity's profile (setting a Boolean field named \texttt{is\_bot}). There is currently no automated way to identify which individuals in the whole data set could be marked as ``bots''.

Up to now, this has been an entirely manual process that consumes substantial time: a person actively searches for identities suspected of being bot accounts, looking at key values such as username, email, or contribution type. This person must also double-check the original source of the data, looking for helpful extra information to verify their guesses.
\section{How this project was born}
\label{sec:intro-project-origin}
In this thesis, an approach to identify individuals corresponding to automatic accounts (bots) is proposed, using machine learning techniques to build a classifier based on the contributions produced by all identities in a given set of projects. As an additional goal, this classifier is to be incorporated as a new feature in SortingHat, integrated with the recommendation engine already implemented in it.

It is worth mentioning that this project was born within a strong research context. I had already started this project when Prof. Tom Mens, Head of the Software Engineering Lab at the Faculty of Sciences of the University of Mons (Belgium), contacted our Bitergia team to let us know about a research article on bot classification that they were developing at that time (September 2020). As soon as I became aware of their work, I contacted Prof. Tom Mens and his team to discuss the scope of their research and the possibility of starting a collaboration between Bitergia (the company I work for at the time of writing this thesis) and the Software Engineering Lab of the University of Mons.

From Bitergia's point of view, this was a long-desired topic to explore, as identifying bots is a crucial part of the identity management process that the company offers to customers. For the Software Engineering Lab researchers, it was helpful to promote their new bot classification tool, BoDeGHa\footnote{\url{https://github.com/mehdigolzadeh/BoDeGHa}} (previously BoDeGa), and their goal of having better ground-truth datasets for research purposes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%-- Project objectives
% %%-- If the previous section has become too long, consider turning
% %%-- the following three sections into a separate chapter of the report
\section{Project objectives}
\label{sec:objectives}
\subsection{General objective} % subsection title (shown)
\label{ssec:general-objective} % subsection identifier (not shown; it allows referencing the subsection)
The main objective of this project is to assess whether it is possible to develop an automated or semi-automated way to classify individuals from GrimoireLab's SortingHat into human users and bot accounts. This goal should be achieved using data obtained from each individual through specific channels (data sources), identifying the variables relevant to performing this classification effectively.
% Remember that objectives are always written in the infinitive.
\subsection{Specific objectives: Goals and Questions}
\label{ssec:specific-objectives}
The specific goals for this project have been defined following the ``Goal, Question, Metric'' approach. See Section~\ref{ssec:gqm-approach} for more information.
% Specific objectives can be understood as the tasks into which the general objective is broken down. And yes, they are also written in the infinitive.
% It is usually best to use an unnumbered list, as follows:
% \begin{itemize}
% \item A specific objective.
% \item Another specific objective.
% \item A third specific objective.
% \item \ldots
% \end{itemize}
\textbf{Goal 1}: Creating an automated process to discriminate between human users and bot accounts, integrated with the GrimoireLab toolset.
\begin{itemize}
\item \textbf{Q1.1.} How can bot accounts be separated from human users?
\item \textbf{Q1.2.} Is the profile information from a given individual enough to classify it as human or bot?
\item \textbf{Q1.3.} Are there differences between activity generated by humans and bots?
\item \textbf{Q1.4.} How can this classifier be integrated into GrimoireLab’s toolchain?
\end{itemize}
\textbf{Goal 2}: Finding which channels and footprints can be used to classify a user as human or bot.
\begin{itemize}
\item \textbf{Q2.1.} Are there any particular channels and footprints, as a combination of interactions, which can be used to classify a user as a human or bot?
\item \textbf{Q2.2.} Can the message content (commit messages, issue texts, etc.) be used to validate this classification?
\begin{itemize}
\item \textbf{Q2.2.1.} Does a richer syntax give any hint about the nature of a user?
\item \textbf{Q2.2.2.} Can the entropy of a comment give a hint about the nature of a user?
\end{itemize}
\item \textbf{Q2.3.} Do activity details (such as working hours or frequency of contributions) help with this classification?
\end{itemize}
\textbf{Goal 3}: Obtaining a curated dataset from real open-source communities with real examples of bot accounts.
\begin{itemize}
\item \textbf{Q3.1.} Which open-source communities should be analysed?
\item \textbf{Q3.2.} Which data sources are we taking into account?
\begin{itemize}
\item \textbf{Q3.2.1.} Which data should we consider from these sources?
\end{itemize}
\end{itemize}
\section{Time planning}
\label{sec:time-planning}
% It is advisable to include a description of how long the work took.
% Some people add a GANTT chart.
% The important thing is to make clear how much natural time you spent on the
% TFG/TFM (e.g. 6 months) and at what level of effort (e.g. mainly on
% weekends).
In terms of natural time, I spent roughly one year and seven months working mostly during weekends, as I combined this project with my full-time job. While the initial conversations about the project began in September 2020, the first stage started in March 2021. The time I devoted during the first stages was quite uneven, but from April 2021 onwards I was able to keep a more regular pace until completion in January 2023. This is an estimate of when each task was carried out and how much time was spent on it:
\begin{itemize}
\item First design of the tool, data retrieval and curation: March 2021-September 2021.
\item Designing the tool, additional work on data curation: September 2021-October 2021.
\item Building the input dataset and first experiments: October 2021-January 2022.
\item Second round of experiments: April 2022-July 2022.
\item Third round of experiments and writing the thesis: July 2022-January 2023.
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Structure of the thesis}
\label{sec:structure}
This thesis is outlined as follows:
\begin{itemize}
\item Chapter~\ref{chap:intro}, ``Introduction'', describes the general context and motivation for the problem we aim to solve. The objectives of the project were detailed in Section~\ref{sec:objectives}, ``Project objectives''.
\item Next, Chapter~\ref{chap:state-art}, ``State of the Art'', reviews previous research in this field and briefly describes the technologies used during the process.
\item The design process and architecture of the tool are detailed in Chapter~\ref{chap:design-implementation}, ``Design and Implementation'', including a breakdown of its components and a detailed analysis of the dataset obtained for this project.
\item In Chapter~\ref{chap:experiments}, ``Experiments and validation'', the classifiers' performance and results are examined through different experiments, including a description of the technical challenges encountered and how they were addressed.
\item Wrapping up, Chapter~\ref{chap:conclusions}, ``Conclusions'', evaluates whether the stated objectives were met, and includes a discussion of the tool's limitations, lessons learned and future work.
\item Furthermore, Appendix~\ref{app:app-definitions}, ``Definitions'', provides additional explanations of several key terms.
\end{itemize}
There is a website dedicated to this final project\footnote{\url{https://mafesan.github.io/Memoria-TFM}}. It includes this thesis and complementary content such as the Notebooks for the exploratory data analysis and the classification experiments. In addition, the source code for this tool is available in another dedicated GitHub repository\footnote{\url{https://github.com/mafesan/2021-tfm-code}}.
% Finally, this section introduces, at a high level, the organisation of the rest of
% the document and the contents of each chapter.
% \begin{itemize}
% \item The first chapter gives a brief introduction to the project, describes its objectives and presents the time planning.
% \item The next chapter describes the technologies used in the development of this TFM/TFG (Chapter~\ref{chap:tecnologias}).
% \item Chapter~\ref{chap:diseño} describes the development process
% of the tool \ldots
% \item Chapter~\ref{chap:experimentos} presents the main tests carried out
% to validate the platform/tool\ldots (or the results of the experiments
% performed).
% \item Finally, the conclusions of the project are presented, as well as future work that could derive from it (Chapter~\ref{chap:conclusiones}).
% \end{itemize}
\cleardoublepage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% STATE OF THE ART %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{State of the Art} %% a.k.a. "Technologies used"
\label{chap:state-art}
As mentioned in the Introduction chapter, this project was born within a strong research context. When I contacted Prof. Tom Mens and his team about this Master's thesis, we first exchanged ideas regarding the scope of the project and how both initiatives could complement one another. They then shared with me the scientific paper they were working on, aimed at detecting bots in issue and PR comments on GitHub.

A discussion followed about which research lines could be addressed in this Master's thesis. The Software Engineering Lab of the University of Mons had not implemented a tool, nor the underlying classifier, to detect bots based on Git commit messages or any other Git-related information. While it would be relatively easy to extend their tool to also consider Git commit messages, it was likely that the classifier features they used to distinguish bots from humans in GitHub issue and pull request comments would not work as well on Git commit messages. Studying, testing, and extending this behaviour was one of the main ideas they proposed to me for this Master's thesis. Likewise, the study could be extended to other systems and data sources. Last but not least, it was interesting for them to learn how this classification was going to be integrated with identity merging (mainly regarding GrimoireLab's SortingHat component).

After reviewing the paper from the Software Engineering Lab at the University of Mons, by Mehdi Golzadeh et al., I discovered that it pointed to other interesting articles on the same topic, which are relevant to this Master's thesis.
\section{Research}
\label{sec:research}
In the following subsections, I summarise the two most relevant research articles on which this project builds, as well as the technologies used in this project.
\begin{itemize}
\item The first one is the article by researcher Mehdi Golzadeh, Prof. Tom Mens et al.: \textbf{``A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments''}~\cite{golzadeh-mens-ground-truth-github2021}, on detecting bots in issue and pull request comments from OSS projects.
\item The second one is by Dey, B. Vasilescu et al.: \textbf{``Detecting and characterising bots that commit code''}~\cite{dey-et-al-detecting-bots}, on detecting bots contributing code in OSS\footnote{Open-Source Software.} projects.
\end{itemize}
\subsection{A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments}
\label{ssec:golzadeh}
The main goal of this paper is to propose an automated classification model to detect bots through comments submitted in GitHub issues and pull requests.
This article is divided into three large sections: first, the authors elaborate a ground-truth dataset of pull request and issue comments from $5K$ GitHub accounts, of which $527$ were identified as bots. Then, they propose a classification model that relies on comment-related features to classify accounts as either bot or human. Finally, they present an open-source tool based on the classification model that allows GitHub contributors to detect which accounts in their repositories correspond to bots.
\subsubsection{Creating the ground-truth dataset}
\label{sssec:golzadeh-dataset}
As the objective of the study is to focus on software development repositories, the authors need a way to identify which GitHub repositories were created for that purpose. Hence, they rely on \textit{libraries.io}, a monitoring service that indexes information on several million packages distributed through package registries such as PyPI, npm, etc.

Their initial dump contains more than $3.3$ million GitHub repositories, from which they randomly select around $136K$. From each of these repositories, they extract the last $100$ comments of the last $100$ issues and pull requests, over $4$ days in February 2020, using GitHub's API. They obtain over $10M$ comments from more than $837K$ contributors and more than $3.5M$ issues and pull requests.

Given the size of the initial dataset, they apply some constraints to reduce it. First, they exclude users who made fewer than $10$ comments; this threshold comes from a previous study. Then, they extract a subset of $5K$ commenters, selected both randomly and manually, adding $438$ commenters who had been identified as bots in previous studies or whose GitHub account name contained a specific substring, such as ``bot'', ``ci'', ``cla'', ``auto'', ``logic'', ``code'', ``io'' and ``assist''.
For the labelling process, they develop a web application where each commenter is presented to at least two of the four authors of the paper. Comments belonging to a given user are displayed in batches of $20$ (with the option of showing more if needed). The rater then selects whether the commenter is a bot or a human. All cases the raters agreed upon are included in the ground-truth dataset.
\subsubsection{Creating the classification model}
\label{sssec:golzadeh-classification}
These are the selected features to create the classification model:
\begin{enumerate}
\item Text distance between comments.
\begin{itemize}
\item The main hypothesis is that bot commenters post more repetitive comments than humans do. Therefore, the metrics considered are text distance metrics commonly used in natural language processing (NLP): the Jaccard and Levenshtein distances. The Jaccard distance~\ref{sec:jaccard-definition} quantifies the similarity of two texts based on their content, while the Levenshtein distance~\ref{sec:levenshtein-definition} captures their structural difference by counting single-character edits.
\item After a tokenisation process, for each commenter they compute the mean of the Jaccard and Levenshtein distances between all pairs of comments. Results show that humans get higher median values for both distances than bots. Nonetheless, there is overlap between the values of both classes, indicating that these mean distances alone are not enough to properly distinguish between the two classes.
\item Finally, a combination of both the Jaccard and Levenshtein distances is used~\ref{sec:jacc-lev-comb-definition} (a sketch of these computations follows this list).
\end{itemize}
\item Repetitive comment patterns.
\begin{itemize}
\item Observations suggest that bots tend to have sets of similar comments, while most comments from humans are unique, except some of them that seem to follow a pattern (mostly, short answers such as ``Thank you!'', ``\texttt{+1}'' or ``\textit{LGTM}''\footnote{ LGTM: Shorthand for ``Looks good to me''. ``\texttt{+1}'', as a common way to express agreement with something proposed in a previous comment or description.}).
\item To capture the comment patterns, they select \textit{DBSCAN}, a density-based clustering algorithm. To capture both the structural and content distance between comments, a combination of the Levenshtein and Jaccard distances is computed. For each commenter, \textit{DBSCAN} is applied to their set of comments.
\item When the number of comment patterns (clusters) is plotted against the number of comments per commenter, a clearer separation between humans and bots appears. The number of comment patterns for bots remains stable and low, regardless of the number of comments.
\end{itemize}
\item Inequality between comments in patterns
\begin{itemize}
\item The inequality in the number of comments in each pattern is used as an additional feature to distinguish between bots and humans, measured with the \textit{Gini} coefficient (a value of $0$ expresses perfect equality, while a value of $1$ expresses maximum inequality among values).
\item Humans show a lower inequality than bots with respect to the spread of comments across patterns, a consequence of many of their patterns containing a single comment.
\end{itemize}
\item Number of comments and empty comments.
\begin{itemize}
\item This feature makes it easier to distinguish between commenters having a similar number of patterns (the one with more comments per pattern is more likely to be a bot).
\item Regarding the number of empty comments, although the GitHub interface does not allow empty comments in a discussion, it does not prevent comments composed of whitespace characters. Data shows that these empty comments are mostly created by human commenters.
\end{itemize}
\end{enumerate}
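To make these features more concrete, the following fragment sketches how the distance computation, the per-commenter clustering and the \textit{Gini} coefficient could be implemented in \emph{Python} with \emph{Scikit-learn}. It is a minimal illustration rather than the authors' original code: the whitespace tokenisation, the equal-weight combination of the two distances and the \texttt{eps} threshold are assumptions made for this example.
\begin{minted}{python}
import numpy as np
from sklearn.cluster import DBSCAN

def jaccard_distance(a, b):
    # Content-based distance over sets of whitespace-separated tokens
    sa, sb = set(a.split()), set(b.split())
    if not (sa | sb):
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def levenshtein(a, b):
    # Structural distance: minimum number of single-character edits
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def combined_distance(a, b):
    # Equal-weight mix of content and structural distance (assumed combination)
    lev_norm = levenshtein(a, b) / max(len(a), len(b), 1)
    return (jaccard_distance(a, b) + lev_norm) / 2

def comment_patterns(comments, eps=0.5):
    # One DBSCAN run per commenter over a precomputed distance matrix;
    # each resulting cluster is one "comment pattern"
    dist = np.array([[combined_distance(x, y) for y in comments] for x in comments])
    return DBSCAN(eps=eps, min_samples=1, metric="precomputed").fit(dist).labels_

def gini(counts):
    # Inequality of comments across patterns: 0 = equality, 1 = max inequality
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * np.sum(x)) - (n + 1) / n
\end{minted}
Intuitively, a bot posting near-identical comments collapses into very few patterns with many comments each (a high \textit{Gini} value), whereas a human with mostly unique comments produces many single-comment patterns.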
For selecting the classifier, they rely on a standard grid-search \textit{10-fold} cross-validation process to compare five families of classifiers (random forest, k-nearest neighbours, decision trees, logistic regression, and support vector machines) over the training set ($60\%$ of the ground-truth dataset) using \textit{Scikit-learn}~\ref{sssec:scikit-learn}. In addition, the class weight parameter is set for each supported classifier to address the class imbalance problem.

The $10$ subsets are created using a stratified shuffle split, to preserve the same proportion of bots and humans as in the complete training set.

The selected classifier is a random forest: using the \textit{Gini} split criterion, the best configuration uses $10$ estimators (trees) with a maximum depth of $10$. Results are available in Table~\ref{table:golzadeh-table-results}.
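As an illustration of this selection procedure, the fragment below sketches a grid search over a random forest with \emph{Scikit-learn}, using a stratified shuffle split and balanced class weights, as described above. The synthetic data, parameter grid and scoring metric are placeholders chosen for this example, not the exact values used in the paper.
\begin{minted}{python}
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic stand-in for the training set (features per commenter, bot/human label)
X_train, y_train = make_classification(n_samples=500, weights=[0.9], random_state=0)

# A stratified shuffle split preserves the bot/human proportion in every fold
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [5, 10, 20],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced"),  # addresses class imbalance
    param_grid,
    cv=cv,
    scoring="f1",
)
search.fit(X_train, y_train)
print(search.best_params_)  # the paper ends up with 10 trees of depth 10
\end{minted}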
\smallskip
% We have discussed how to include figures, but not how to include tables.
% Below is an example table, Table \ref{tabla:ejemplo} (note how a reference
% to the table is introduced).
\begin{table}[htb] % [htb] suggests table position in page
% h: here; t: top; b: bottom
% remove any id to restrict options
\renewcommand{\arraystretch}{1.2} % Increase space between rows
\begin{center}
\begin{tabular}{ l c c r r r }
% \begin{tabular} { | l | c | r |} % three columns: the first left-aligned (l), the second centred (c), the third right-aligned (r). The | symbol, which would draw a separator line between columns, has been removed.
\toprule % Top table rule (horizontal) line
& \textbf{Classified as bot} & \textbf{Classified as human} & \textbf{P} & \textbf{R} & \textbf{$F_{1}$} \\
\midrule % Replace \hline for horizontal rule after table header
\textbf{Bot} & TP: $192$ & FN: $19$ & $0.94$ & $0.91$ & $0.92$ \\ %\hline
\textbf{Human} & FP: $13$ & TN: $1,776$ & $0.99$ & $0.99$ & $0.99$ \\ %\hline
\textbf{Weighted avg} & & & $0.98$ & $0.98$ & $0.98$ \\
\bottomrule % Bottom table rule (horizontal) line
\end{tabular}
\caption{Evaluation of the classification model using the test set.}
\label{table:golzadeh-table-results}
\end{center}
\end{table}
\subsubsection{BoDeGHa: an open-source tool to detect bots in GitHub repositories}
\label{sssec:golzadeh-tool}
The tool accepts as inputs the name of a GitHub repository and a GitHub API key. The output is computed in three steps:
\begin{enumerate}
\item Download all comments from that repository through GitHub's GraphQL API, which is transformed into a list of commenters and their corresponding comments.
\item Compute the features for the classification model: number of comments, empty comments, comment pattern, and inequality between the number of comments within patterns.
\item Apply the pre-trained model and outputs the prediction made by the model.
\end{enumerate}
\subsubsection{Conclusions}
\label{sssec:golzadeh-conclusions}
Of the $15$ bots classified as humans, most cases correspond to bots that use, convert, copy or translate text initially produced by humans. Looking at the $51$ humans classified as bots, most have unfilled issue templates, use repetitive comments such as ``Thank you'' or ``LGTM'', or post empty comments. About $85\%$ of the misclassified humans and about $75\%$ of the misclassified bots were initially rated as ``I don't know'', ``difficult'', or ``very difficult'' by at least one of the raters.

They also find several examples of commenters whose behaviour and comments correspond to those of both humans and bots; that is, mixed commenters: GitHub accounts belonging to humans that also allow automatic tools to use the account for specific tasks. These cases represent $1.5\%$ of the dataset ($78$ commenters out of $5,082$) and are excluded from the ground-truth dataset, as the authors cannot decide whether these commenters should be classified as bots or as humans. The mixed commenters are used to test how the model behaves in such cases, resulting in $29$ being classified as bots ($37.2\%$) and $49$ as humans ($62.8\%$).

Although other articles, such as the one by Dey et al. (explained in the following subsection), proposed approaches for identifying bot accounts based on their commit messages or their author information, such as checking for the presence of the string ``bot'' in the account name or the committer name, this led to numerous false positives and false negatives.
\subsection{Detecting and characterising bots that commit code}
\label{ssec:dey}
The main goal of this article is to find an automated way of identifying bots (and their contributions) that commit code in open-source projects and characterise them according to their activity.
To do so, they propose a systematic approach named \textbf{BIMAN} (\emph{Bot Identification by commit Message, commit Association and author Name}) to detect bots considering different aspects of the commits made by an author:
\begin{enumerate}
\item Commit Message: Identify if commit messages are being generated from templates.
\item Commit Association: Predict whether an author is a bot using a random forest model, with features derived from commit information as predictors.
\item Author Name: Match the author's name and email to common bot patterns.
\end{enumerate}
This method is applied to the \emph{World of Code} dataset~\cite{mockus-woc}, obtaining a subset of the data with information about $461$ bots detected by this approach and manually verified as such, each with more than $1,000$ commits.

Their method to extract information about the authors consists of the following steps: first, obtaining a list of all authors from the \emph{World of Code} dataset\footnote{Author IDs are represented by a combination of name and email address.}; second, identifying all commits from those authors; and third, extracting, for each commit of every author, the list of files modified, the list of projects the commit is associated with, and the commit content.

\textbf{BIMAN}, the proposed technique for detecting bots, thus comprises three methods, which are detailed in the following subsections. The resulting dataset is also used to characterise the bots according to patterns such as the type of files they modify, the time distribution of their work, and the programming languages they use.
\subsubsection{Identifying bots by name (BIN)}
\label{sssec:dey-bin}
After inspecting the dataset, regular expressions are used to decide whether an author is a bot by checking if the author's name or email contains the substring \texttt{bot}. These expressions are restricted to matches where the string is preceded and followed by non-alphabetic characters (to avoid false positives such as names like ``Abbot''), and the email domain is excluded from the search.
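A minimal sketch of such a check (the authors' actual expressions are more elaborate) could be:
\begin{verbatim}
import re

# "bot" must not be part of a longer word: it has to be preceded and
# followed by a non-alphabetic character (or the string boundary).
BOT_PATTERN = re.compile(r"(^|[^a-z])bot([^a-z]|$)", re.IGNORECASE)

def looks_like_bot(name: str, email: str) -> bool:
    local_part = email.split("@", 1)[0]  # drop the domain before matching
    return bool(BOT_PATTERN.search(name) or BOT_PATTERN.search(local_part))

print(looks_like_bot("dependabot[bot]", "support@dependabot.com"))  # True
print(looks_like_bot("Abbot", "abbot@example.org"))   # False: "bot" inside a word
print(looks_like_bot("Jane Doe", "jane@my-bot.org"))  # False: domain is excluded
\end{verbatim}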
The initial assumption is that bots are very active and produce significantly more commits than humans. However, the observations show that the numbers of commits from humans and bots do not differ much. One reason is that, since an author ID consists of a name--email combination, slight variations in this combination lead to treating some cases as different authors when they are not. Besides, some bots might have been implemented as an experiment or as part of a course and never used afterwards. Another reason can be that some bots were designed for a project but, in the end, were never fully adopted.
\subsubsection{Detecting bots by commit messages (BIM)}
\label{sssec:dey-bim}
The primary assumption is that bots use template messages as the starting point for their commit messages. Thus, the goal is to detect whether a commit message came from a template. Although humans can also generate commit messages with similar patterns, the hypothesis is that the variability of content within messages generated by bots is lower than within messages coming from humans.
The \textbf{BIM} approach uses a document template score algorithm, which compares document pairs and uses a similarity measure to group documents. A group represents documents suspected of conforming to a similar base document. Each group has a single template document assigned to it, and this document is used for comparisons. A new group is created when the similarity of a document does not reach the threshold with the template document of any existing group. Once all documents have been grouped, a score is calculated based on the ratio of the number of template documents to the number of documents.
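A minimal sketch of this idea, using \texttt{difflib}'s similarity ratio as a stand-in for the paper's actual similarity measure and scoring conventions, could be:
\begin{verbatim}
from difflib import SequenceMatcher

def template_score(messages, threshold=0.8):
    # Keep one representative ("template") message per group of similar ones.
    templates = []
    for msg in messages:
        if not any(SequenceMatcher(None, msg, t).ratio() >= threshold
                   for t in templates):
            templates.append(msg)  # no group is close enough: open a new one
    # Few templates relative to messages => repetitive, bot-like output.
    return len(templates) / len(messages)

bot_like = ["Bump lodash from 4.17.11 to 4.17.15",
            "Bump lodash from 4.17.15 to 4.17.19"]
human_like = ["Fix race condition in scheduler",
              "Add unit tests for the parser"]
print(template_score(bot_like))    # low ratio: messages share a template
print(template_score(human_like))  # 1.0: every message is distinct
\end{verbatim}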
\subsubsection{Detecting bots by files changed and projects associated with commits (BICA)}
\label{sssec:dey-bica}
Twenty metrics are used as a starting point, based on the files changed by each commit, the projects each commit is associated with, and the timestamp and timezone of the commits.
A random forest model performs better than other approaches for predicting whether an author is a bot from these numerical features. Out of the $20$ variables, only six are retained as predictors (see Table~\ref{table:dey-table-predictors}); a code sketch follows the table.
\begin{table}[htb]
\renewcommand{\arraystretch}{1.2}
\begin{center}
\begin{tabular}{ p{5cm} % p{Xcm} paragraph mode, row length
>{\raggedright\arraybackslash} p{10cm} }
% >{\raggedright\arraybackslash} avoids word splitting between row lines
\toprule
%\rowcolor[HTML]{C0C0C0} % Do not mix colors w/ horizontal rules
\textbf{Variable name} & \textbf{Variable description}\\
\midrule
Tot.FilesChanged & Number of files changed by author across commits \\ %\hline
Uniq.File.Exten & Num. of unique file extensions in all the author's commits \\ %\hline
Std.File.pCommit & Std. dev. of number of files per commit \\ %\hline
Avg.File.pCommit & Mean number of files per commit \\ %\hline
Tot.uniq.Projects & Num. of unique projects associated with commits\\ %\hline
Median.Project.pCommit & Median num. of projects associated with commits\\
\bottomrule
\end{tabular}
\caption{Predictors used in the random forest model for BICA.}
\label{table:dey-table-predictors}
\end{center}
\end{table}
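A minimal sketch of this setup, with synthetic data standing in for the six predictors and no attempt to reproduce the paper's hyperparameters, could be:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative only: random data stands in for the six BICA predictors
# (Tot.FilesChanged, Uniq.File.Exten, Std.File.pCommit,
#  Avg.File.pCommit, Tot.uniq.Projects, Median.Project.pCommit).
rng = np.random.default_rng(0)
X = rng.random((500, 6))          # one row per author, six features
y = rng.integers(0, 2, 500)       # 1 = bot, 0 = human (synthetic labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 70/30 split, as in the paper

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of being a bot
print("AUC-ROC:", roc_auc_score(y_test, probs))
\end{verbatim}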
\subsubsection{Ensemble model}
\label{sssec:dey-ensemble}
The ensemble model is implemented as another random forest model and combines the outputs of the three methods explained so far (\textbf{BIN}, \textbf{BIM} and \textbf{BICA}) as predictors to make the final decision on whether an author is a bot.
Since the golden dataset is generated using the BIN method, the authors do not use it for training the ensemble model. Instead, they create a new training dataset consisting of $67$ bot author IDs ($57$ of them associated with eight bots, and ten linked to three other known bots that are not in the golden dataset) and $67$ human authors, selected at random and manually validated.
The output from \textbf{BIN} is a binary value stating whether the author ID matches the regular expressions detailed before; the output from \textbf{BIM} is a score, with higher values corresponding to a higher probability of the author being a bot; and the output from \textbf{BICA} is the probability of an author being a bot.
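Under these conventions, a minimal sketch of such an ensemble (with made-up rows in place of the real training data) could be:
\begin{verbatim}
from sklearn.ensemble import RandomForestClassifier

# The ensemble takes the three method outputs as predictors.
# Columns: BIN flag (0/1), BIM score (higher = more bot-like),
# BICA probability of being a bot. Rows and labels are made up.
X = [[1, 0.92, 0.91],
     [0, 0.10, 0.08],
     [0, 0.78, 0.77]]
y = [1, 0, 1]  # 1 = bot, 0 = human

ensemble = RandomForestClassifier(random_state=0).fit(X, y)
print(ensemble.predict([[1, 0.85, 0.85]]))  # final bot/human decision
\end{verbatim}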
\begin{figure}
\centering
\includegraphics[width=15cm, keepaspectratio]{img/BIMAN-workflow.png}
\caption{BIMAN workflow: Scores from each method are used by an ensemble model that classifies the given author as a bot or not a bot (taken from the original paper).}
\label{fig:biman-workflow}
\end{figure}
\subsubsection{BIMAN results}
\label{sssec:dey-results-biman}
\textbf{BIMAN} identifies $58$ ($87\%$) out of the $67$ bot author IDs as bots; $6$ of the $9$ remaining IDs could be identified via manual investigation as not actually being bots: they are either spoofing a bot's name or simply using the same name.
\textbf{BIN performance}: during the creation of the golden dataset, BIN obtains a precision close to $99\%$, which indicates that any author considered a bot by this method has a very high probability of actually being one; in general, humans do not try to disguise themselves as bots. The recall is not high, because BIN misses the many bots that do not explicitly have the substring ``\texttt{bot}'' in their name.
\textbf{BIM performance}: the document template score algorithm solely relies on the commit messages. The AUC-ROC value using the ratio values as predicted probabilities is $0.7$. Some details about the classification output:
\begin{itemize}
\item \textbf{True Positive}: The cases where this model can correctly identify
bots are cases where the bots actually use templates or repeat the same commit message.
\item \textbf{False Negative}: The cases where this model cannot correctly identify bots are mostly cases where the bots review code added by humans and create a commit message by adding a few words to the commit message written by a human.
\item \textbf{True Negative}: The human authors correctly identified have some variation in the text, with the usual descriptions of change.
\item \textbf{False Positive}: Humans who are misclassified as bots usually have short commit messages that are not descriptive, and they reuse the same commit message multiple times.
\end{itemize}
\textbf{BICA performance}: The golden dataset generated using the \textbf{BIN} method is used for training the model and testing its performance: $70\%$ of the data, randomly selected, is used for training, and the remaining $30\%$ for testing. This procedure is repeated $100$ times with different random seeds. The model shows good performance, with an AUC-ROC value of $0.89$.
\textbf{Ensemble model performance}: The dataset used for training and testing this model has only $134$ observations, for the reasons described in Section~\ref{sssec:dey-ensemble}. $80\%$ of the data is used for training and $20\%$ for testing. The process is repeated $100$ times with different random seeds. The AUC-ROC value varies between $0.89$ and $0.95$, with a median of $0.90$.
\subsubsection{Conclusions}
\label{sssec:dey-conclusions}
After studying the results, the authors conclude that a significant portion of authors can be identified as bots using the proposed method.
Among the limitations of this approach, they mention the lack of a golden dataset and of a ground truth to validate it against. As in the previous article, another threat is that a number of developers use automated scripts to handle some of their work, and these scripts use the developers' Git credentials when making commits.
Moreover, they mention that \textbf{BIM}'s performance varies with the language of the commit messages (e.g., Spanish or Chinese), and that it does not support multilingual sets of commit messages.
They do not address the problem of multiple IDs belonging to the same author; extending \textbf{BIMAN} to handle this is planned as future work.
\section{Technologies}
\label{sec:Technologies}
\subsection{GQM approach}
\label{ssec:gqm-approach}
The ``Goal Question Metric'' (GQM) approach~\cite{Basili94-gqm} is based on the assumption that, to measure purposefully, one must first specify the goals of the project, then trace those goals to the data that are intended to define them operationally, and finally provide a framework for interpreting the data with respect to the stated goals.
This approach helps to define the metrics that matter in each case, avoiding frequent bad practices. For example, people tend to use or define a set of metrics without a clear idea of the specific goals they pursue. This usually leads to a ``bottom-up'' approach: besides being misaligned with the project or business goals, the set of metrics may also be biased by the technology currently used to obtain them.
The lack of a well-defined strategy also hinders practitioners from understanding which metrics are important and why. By using a ``top-down'' approach (first goals, then metrics), it becomes easier to materialise a targeted set of questions for the situation at hand, and then to check which metrics could be useful in the future, and how these metrics can help reach the different goals by answering the questions that were raised.
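As a hypothetical instance of this hierarchy (not taken from~\cite{Basili94-gqm}), a goal in the context of this work could be refined as follows:
\begin{itemize}
\item \textbf{Goal}: assess the impact of bot activity on a project's development indicators.
\item \textbf{Question}: what proportion of the project's commits and comments is produced by bots?
\item \textbf{Metric}: number of commits (or comments) authored by accounts flagged as bots, divided by the total number of commits (or comments) in the period under study.
\end{itemize}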
\begin{figure}
\centering
\includegraphics[width=12cm, keepaspectratio]{img/example-gqm-schema}
\caption{Example: Goal, Question, Metric approach hierarchy.}
\label{fig:example-gqm-schema}
\end{figure}
\subsection{GrimoireLab}
\label{ssec:GrimoireLab}
GrimoireLab\footnote{\url{https://chaoss.github.io/grimoirelab/}} is a free, open-source toolset for producing software development analytics.
This toolset provides a whole platform that supports automatic and incremental data gathering from many tools (data sources or \emph{backends}) related to open-source development (source code management, issue tracking systems, messaging tools, mailing lists, etc.).
Data obtained from these tools are stored in JSON documents following a uniform format, no matter the source. These JSON documents are stored in ElasticSearch and then undergo a data enrichment process, which adds information such as time calculations (delays, durations), contributors' affiliations, and more.
Once the data have been augmented, they can be consumed by visualisation tools or directly through the ElasticSearch API. The GrimoireLab toolset comes with a tool named ``Kibiter'', a \emph{fork} of Elastic's Kibana, and includes a set of predefined dashboards and visualisations for each data source.
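As an illustration, data gathering from a Git repository can be scripted with \emph{Perceval}, the GrimoireLab component in charge of data retrieval; the repository URL and local clone path below are placeholders:
\begin{verbatim}
from perceval.backends.core.git import Git

# Fetch commits from a Git repository (placeholder URL and clone path).
repo = Git(uri='https://github.com/chaoss/grimoirelab.git',
           gitpath='/tmp/grimoirelab.git')

for item in repo.fetch():
    # Every item is a JSON document with a uniform envelope,
    # regardless of the backend that produced it.
    commit = item['data']
    print(commit['Author'], commit['commit'])
\end{verbatim}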
GrimoireLab is part of CHAOSS\footnote{\url{https://chaoss.community/about-chaoss/}}, a project sponsored by The Linux Foundation. It is mainly developed by the Spanish company Bitergia, and represents an evolution of more than $10$ years of work at Bitergia and the LibreSoft research group at Rey Juan Carlos University.
\subsection{SortingHat}
\label{sec:SortingHat}
SortingHat is the GrimoireLab component for identity management. It provides more than 20 commands to manipulate identities, including support for:
\begin{enumerate}[label=\roman*)]
\item identity merging based on email addresses, usernames, and full names found on many tools used in software development;
\item enrolling members to organisations for a given time span, marking identities as automatic accounts (bots);
\item gender assessment, among other features~\cite{moreno_et_al-sortinghat}.
\end{enumerate}
This tool maintains a relational database with identities and related information extracted from the different tools used in software development. An identity is a tuple composed of a name, email, username, and the name of the source from which it was extracted. Tuples are converted to unique identifiers (uuids), which provide a quick means of comparing identities with each other. By default, SortingHat considers all identities as unique. Heuristics automatically merge identities based on perfect matches of \((i)\) uuid, \((ii)\) name, \((iii)\) email, or \((iv)\) username.
In case of a positive match, one identity is randomly selected as the unique one, and the other identities are linked to it.
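A minimal sketch of how an identity tuple can be reduced to such an identifier, assuming a SHA-1 hash over the concatenated fields (SortingHat's actual implementation details may differ):
\begin{verbatim}
import hashlib

def identity_uuid(source, name=None, email=None, username=None):
    # Hash the identity tuple into a single comparable identifier
    # (assumed scheme; SortingHat's real implementation may differ).
    data = ':'.join(str(field) for field in (source, name, email, username))
    return hashlib.sha1(data.encode('utf-8')).hexdigest()

a = identity_uuid('git', 'Jane Doe', 'jane@example.org')
b = identity_uuid('github', 'jdoe', 'jane@example.org', 'jdoe')
print(a == b)  # False: distinct tuples are distinct identities by default
\end{verbatim}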
Currently, SortingHat is evolving into a service-based application implementing a \emph{GraphQL API} in \emph{Python}.
\subsection{Python}
\label{ssec:python}
\textbf{Python}\footnote{\url{https://www.python.org/}} is an interpreted, object-oriented, high-level, open-source, general-purpose programming language created by Guido van Rossum in 1991~\cite{van2007python}. Nowadays, the most recent version is $3.11.2$, released in February 2023. Its design focuses on code readability and a clear syntax, making it possible to program using fewer lines of code than in other programming languages such as \emph{C++} or \emph{Ada}.