<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="http://dyanarose.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="http://dyanarose.github.io/" rel="alternate" type="text/html" /><updated>2022-08-17T21:23:52+01:00</updated><id>http://dyanarose.github.io/feed.xml</id><title type="html">These Things Happen</title><author><name>Dyana Rose</name></author><entry><title type="html">Reducing Costs by Switching to Cheaper AWS Services</title><link href="http://dyanarose.github.io/blog/2022/06/08/reducing-costs-by-switching-to-cheaper-aws-services/" rel="alternate" type="text/html" title="Reducing Costs by Switching to Cheaper AWS Services" /><published>2022-06-08T12:11:11+01:00</published><updated>2022-06-08T12:11:11+01:00</updated><id>http://dyanarose.github.io/blog/2022/06/08/reducing-costs-by-switching-to-cheaper-aws-services</id><content type="html" xml:base="http://dyanarose.github.io/blog/2022/06/08/reducing-costs-by-switching-to-cheaper-aws-services/"><p>I run a tiny website for tracking item prices in Guild Wars 2, <a href="http://www.gw2roar.com/">gw2roar.com</a>. I don’t run it for the money (it has 1 or 2 visitors per day), it’s simply a fun site that suits my needs.</p>
<p>Because it doesn’t make money, every penny of running cost comes out of my pocket. When I first started the site, it ran entirely on 12-month free tier AWS services, so I didn’t care about cost; later I took part in an Alexa promotion that gave me a $100 credit against my monthly bill. All good things come to an end, as they say. After I was notified that the promotion was ending, I started looking at how I could significantly reduce my costs.</p>
<h2 id="step-1-use-the-cost-explorer-to-find-the-high-cost-services">Step 1: Use the Cost Explorer to find the high cost services</h2>
<p>Open up <a href="https://console.aws.amazon.com/cost-management/home">Cost Management</a> and click Cost Explorer in the list on the left.</p>
<p>In this view you can explore the costs of each service you use.</p>
<p>I had two spikes in my cost explorer: ElastiCache and RDS. ElastiCache stores candlestick data per item per day for quick retrieval. RDS runs a PostgreSQL instance that stores the item price data retrieved every 30 minutes from the Guild Wars 2 API.</p>
<h2 id="step-2-imagine-your-site-without-the-high-cost-service">Step 2: Imagine your site without the high cost service</h2>
<p>What if ElastiCache didn’t exist?</p>
<p>Calculating candlestick data on the fly is slow, and the result never changes after a day is done, so I used ElastiCache to speed up loading times.</p>
<p>Ultimately I needed a quick, distributed cache I could query by item id and time. ElastiCache isn’t the only service that provides that; I could move to nearly any distributed cache. My decision ultimately came down to cost.</p>
<h2 id="step-3-cost-up-the-changes">Step 3: Cost up the changes</h2>
<p>The AWS pricing calculator has gotten much easier to use over the past few years. Now that you can easily price by service, there’s less room for missing costs or adding them in by mistake.</p>
<p>A 1 node, t2.micro <a href="https://calculator.aws/#/addService/ElastiCache">ElastiCache cluster costs about $12.41 per month</a>.</p>
<p>What would the price be if I switched to, say, DynamoDB?</p>
<p>DynamoDB is a bit more difficult to price; you need to know:</p>
<ul>
<li>the size of your data</li>
<li>the baseline read/write rates</li>
<li>the peak read/write rates</li>
<li>how long the peaks last</li>
<li>what different options and features mean</li>
</ul>
<p>I used the information from ElastiCache to estimate the size of the data and the web statistics to estimate the read capacity. The write capacity was harder to estimate because it came from two sources.</p>
<p>A visitor could cause a write by loading the data for an item, but that would happen at a much lower frequency than the number of users on the site. So the baseline write rate is very low.</p>
<p>At the end of every month a batch job calculates and loads all the data for the month into the cache. Given I knew how many items there were, and that the data needed to be loaded within 15 minutes (the timeout of a Lambda function), I could work out what the peak write rate would be.</p>
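<p>For illustration only (hypothetical numbers, not the site’s real figures): if the batch job had to write 27,000 items within the 900-second Lambda window, the peak rate would be 27,000 / 900 = 30 writes per second, while the visitor-driven baseline might be well under 1 write per second.</p>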
<p>Did I need:</p>
<ul>
<li><a href="https://aws.amazon.com/blogs/aws/dynamodb-price-reduction-and-new-reserved-capacity-model/">reserved capacity</a>? No, that would drive up the cost and provide no benefit at a low rate of reads/writes.</li>
<li><a href="https://aws.amazon.com/dynamodb/pricing/on-demand/">on-demand capacity</a>? It’s nice in theory to pay for what you use, but in practice it would be more expensive for my site than using <a href="https://aws.amazon.com/dynamodb/pricing/provisioned/">provisioned capacity</a> and auto-scaling.</li>
</ul>
<p>With everything in the calculator, AWS predicted a cost of about $3.20 per month. However, the calculator doesn’t take into account the <a href="https://aws.amazon.com/free/?all-free-tier.sort-by=item.additionalFields.SortRank&amp;all-free-tier.sort-order=asc&amp;awsf.Free%20Tier%20Types=tier%23always-free&amp;awsf.Free%20Tier%20Categories=categories%23databases">“always free” allowances</a>, so switching to DynamoDB would actually bring my costs to $0.</p>
<h2 id="step-4-make-the-changes-and-evaluate">Step 4: Make the changes and evaluate</h2>
<p>The cost estimate is just an estimate. The real world can be different.</p>
<p>For example, with my first implementation of the switch to DynamoDB, I put some of the code into the same Lambda function that called RDS. That function was running in the RDS VPC, and it’s quite simple to set up a VPC endpoint for DynamoDB. So simple, in fact, that I missed that this would cost me <a href="https://aws.amazon.com/privatelink/pricing/">$0.01 per AZ per hour</a>.</p>
<p>Thankfully I saw the costs within the first few hours of creating the endpoint and was able to re-architect my functions.</p>
<p>Watching and evaluating the costs is essential to finding out fast if your estimates were wrong.</p>
<h2 id="step-5-repeat-from-step-1">Step 5: Repeat from Step 1</h2>
<p>Keep hunting for options and reducing costs until you’re satisfied.</p></content><author><name>Dyana Rose</name></author><category term="blog" /><summary type="html">I run a tiny website for tracking item prices in Guild Wars 2, gw2roar.com. I don’t run it for the money (it has 1 or 2 visitors per day), it’s simply a fun site that suits my needs.</summary></entry><entry><title type="html">Upserting in Postgres: It’s not just all or nothing</title><link href="http://dyanarose.github.io/blog/2022/06/01/upserting-in-postgres/" rel="alternate" type="text/html" title="Upserting in Postgres: It’s not just all or nothing" /><published>2022-06-01T12:11:11+01:00</published><updated>2022-06-01T12:11:11+01:00</updated><id>http://dyanarose.github.io/blog/2022/06/01/upserting-in-postgres</id><content type="html" xml:base="http://dyanarose.github.io/blog/2022/06/01/upserting-in-postgres/"><p>Upserting in Postgres lets you insert a new value or update an existing value in a single atomic statement. This avoids needing two separate statements, read and update/insert, wrapped in a transaction.</p>
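<p>As a point of comparison, here is a rough sketch (an added illustration, not from the original post) of the two-statement, read-then-write pattern that an upsert replaces, using the <code class="language-plaintext highlighter-rouge">temp_agg</code> table defined below:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN;
-- first, read to see whether the row already exists...
SELECT high FROM temp_agg
WHERE id = 1 AND day = DATE('2022-06-01');
-- ...then, depending on the result, run either an UPDATE or an INSERT
UPDATE temp_agg SET high = 20.00
WHERE id = 1 AND day = DATE('2022-06-01');
COMMIT;
</code></pre></div></div>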
<p>If you want to update only particular fields, and not all fields in the row, you can do that too.</p>
<p>Let’s explore upserting into a table tracking a location’s high temperatures per day.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="n">temp_agg</span>
<span class="p">(</span>
<span class="n">id</span> <span class="nb">integer</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="k">day</span> <span class="nb">date</span><span class="p">,</span>
<span class="n">high</span> <span class="nb">NUMERIC</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="k">CONSTRAINT</span> <span class="n">id_day</span> <span class="k">UNIQUE</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Every hour our application polls for the temperature at a set of location endpoints and inserts the current temperature. If a location endpoint is down, unresponsive, or returning “bad” data, it will be skipped and re-polled in the next hourly run.</p>
<p>Our first attempt at writing the temperature <code class="language-plaintext highlighter-rouge">INSERT</code> statement starts out like:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temp_agg</span>
<span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">,</span> <span class="n">high</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">),</span> <span class="mi">15</span><span class="p">.</span><span class="mi">00</span><span class="p">),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">),</span> <span class="mi">15</span><span class="p">.</span><span class="mi">50</span><span class="p">)</span>
</code></pre></div></div>
<p>This runs and the table now looks like:</p>
<table>
<thead>
<tr>
<th>id</th>
<th>high</th>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>15.00</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>2</td>
<td>15.50</td>
<td>2022-06-01</td>
</tr>
</tbody>
</table>
<p>But the next time the application tries to insert data for <code class="language-plaintext highlighter-rouge">2022-06-01</code> the INSERT statement returns the error <code class="language-plaintext highlighter-rouge">duplicate key value violates unique constraint "id_day"</code> because inserting the data would conflict with the constraint that each row have a unique (id, day) pair.</p>
<h2 id="on-conflict">On Conflict…</h2>
<p>The <a href="https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT">ON CONFLICT</a> clause acts a bit like a row-based “catch” in a “try… catch” statement: <code class="language-plaintext highlighter-rouge">try</code> to insert the row, <code class="language-plaintext highlighter-rouge">catch and handle any conflict</code>. In the <code class="language-plaintext highlighter-rouge">ON CONFLICT</code> clause you declare what conflict you are interested in and how you want to deal with it at the row level.</p>
<p>There are two options for using ON CONFLICT: <code class="language-plaintext highlighter-rouge">DO NOTHING</code> and <code class="language-plaintext highlighter-rouge">DO UPDATE</code>.</p>
<h3 id="on-conflict-do-nothing">ON CONFLICT DO NOTHING</h3>
<p><code class="language-plaintext highlighter-rouge">ON CONFLICT ... DO NOTHING</code> inserts each row that doesn’t have a conflict and skips each row that does.</p>
<p>The following SQL has conflicts on ids 1 and 2 on the day ‘2022-06-01’. It <em>ignores</em> those two rows and inserts the row with id 3.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temp_agg</span>
<span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">21</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">))</span>
<span class="k">ON</span> <span class="n">CONFLICT</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">DO</span> <span class="k">NOTHING</span>
</code></pre></div></div>
<p>Result:</p>
<table>
<thead>
<tr>
<th>id</th>
<th>high</th>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>15.00</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>2</td>
<td>15.50</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>3</td>
<td>21.00</td>
<td>2022-06-01</td>
</tr>
</tbody>
</table>
<h3 id="on-conflict-do-update">ON CONFLICT DO UPDATE</h3>
<p><code class="language-plaintext highlighter-rouge">ON CONFLICT ... DO UPDATE</code> inserts each row that doesn’t have a conflict and <em>updates</em> each row that does.</p>
<p>In the <code class="language-plaintext highlighter-rouge">DO UPDATE</code> statement you have access to both the existing row data and the new row data, though how to reference them is not entirely obvious. You reference them like so:</p>
<p>Existing data =&gt; <code class="language-plaintext highlighter-rouge">&lt;table_name&gt;.&lt;field&gt;</code>, e.g. <code class="language-plaintext highlighter-rouge">temp_agg.high</code></p>
<p>If you have aliased the table, for example <code class="language-plaintext highlighter-rouge">INSERT INTO temp_agg AS t</code>, then you reference the existing data as <code class="language-plaintext highlighter-rouge">t.high</code>.</p>
<p>New data =&gt; <code class="language-plaintext highlighter-rouge">EXCLUDED.&lt;field&gt;</code>, e.g. <code class="language-plaintext highlighter-rouge">EXCLUDED.high</code></p>
<p>If your insert statement already references a table named <code class="language-plaintext highlighter-rouge">excluded</code>, you need to alias it to avoid any naming conflicts.</p>
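<p>For instance, a minimal sketch of an upsert with an aliased target table might look like:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSERT INTO temp_agg AS t
(id, high, day)
VALUES
(1, 22.00, DATE('2022-06-01'))
ON CONFLICT (id, day)
DO UPDATE SET high = EXCLUDED.high
WHERE EXCLUDED.high &gt; t.high
</code></pre></div></div>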
<p>The result we’re looking for is to update the high temperature for locations 1 &amp; 2 on day <code class="language-plaintext highlighter-rouge">2022-06-01</code> while leaving location 3 unchanged.</p>
<p>The following SQL statements achieve this using different semantics.</p>
<p>In the first example, the conflict rows are only updated if the new (<code class="language-plaintext highlighter-rouge">EXCLUDED</code>) high temperature is greater than the existing high temperature.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temp_agg</span>
<span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">18</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">))</span>
<span class="k">ON</span> <span class="n">CONFLICT</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">DO</span> <span class="k">UPDATE</span> <span class="k">SET</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">temp_agg</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">EXCLUDED</span><span class="p">.</span><span class="n">high</span><span class="p">,</span> <span class="n">temp_agg</span><span class="p">.</span><span class="k">day</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">EXCLUDED</span><span class="p">.</span><span class="n">high</span> <span class="o">&gt;</span> <span class="n">temp_agg</span><span class="p">.</span><span class="n">high</span>
</code></pre></div></div>
<p>In the second example the <code class="language-plaintext highlighter-rouge">GREATEST</code> function is used to set the high temperature instead of filtering with a <code class="language-plaintext highlighter-rouge">WHERE</code> clause.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temp_agg</span>
<span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">18</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">))</span>
<span class="k">ON</span> <span class="n">CONFLICT</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">DO</span> <span class="k">UPDATE</span> <span class="k">SET</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">temp_agg</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">GREATEST</span><span class="p">(</span><span class="n">EXCLUDED</span><span class="p">.</span><span class="n">high</span><span class="p">,</span> <span class="n">temp_agg</span><span class="p">.</span><span class="n">high</span><span class="p">),</span> <span class="n">temp_agg</span><span class="p">.</span><span class="k">day</span><span class="p">)</span>
</code></pre></div></div>
<p>The result is that the high temperatures for locations 1 &amp; 2 on day <code class="language-plaintext highlighter-rouge">2022-06-01</code> have been updated and the high temp for location 3 has stayed the same.</p>
<p>Result:</p>
<table>
<thead>
<tr>
<th>id</th>
<th>high</th>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>20.00</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>2</td>
<td>20.00</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>3</td>
<td>21.00</td>
<td>2022-06-01</td>
</tr>
</tbody>
</table></content><author><name>Dyana Rose</name></author><category term="blog" /><summary type="html">Upserting in Postgres lets you insert a new value or update an existing value in a single atomic statement. This avoids needing two separate statements, read and update/insert, wrapped in a transaction.</summary></entry><entry><title type="html">Building and Showing</title><link href="http://dyanarose.github.io/blog/2017/11/05/building-and-showing/" rel="alternate" type="text/html" title="Building and Showing" /><published>2017-11-05T10:57:57+00:00</published><updated>2017-11-05T10:57:57+00:00</updated><id>http://dyanarose.github.io/blog/2017/11/05/building-and-showing</id><content type="html" xml:base="http://dyanarose.github.io/blog/2017/11/05/building-and-showing/"><p>On and off I’ve been working on a site that displays candlestick charts using data from the Guild Wars 2 trading post, but I’ve never publicly linked to it.</p>
<p>Part of the reason is that it’s not a finished product and I’m not a web designer. But why should that stop me, eh? It’s already been an interesting project, involving:</p>
<ul>
<li>a service retrieving the data from the GW2 API</li>
<li>optimising sql to calculate the candlesticks quickly</li>
<li>an ETL service that takes old data, moves it to a better long-term storage format, and then populates a cache</li>
<li>a website that must merge cached and uncached data before returning the candlesticks to the caller.</li>
</ul>
<p>It didn’t start out with 2 services, a website, and a cache though. It started with a program, written in Go, that would call an endpoint once an hour and then save the results in a SQLite database.</p>
<p>At that time I didn’t know what I wanted to do with the data, I just knew I wanted to do <em>something</em>.</p>
<p>Eventually, that something became me answering the question ‘what questions do I want the answers to?’ As it turns out, what I want is to answer yet another question: ‘should I buy, sell, or make this item, and should I do it now?’ Now this question I could answer! But my tools at the time would become problematic.</p>
<p>I chose to display historical data using candlestick charts, as I like the view of how prices are moving in a given time period. Calculating open, close, min, and max using SQLite proved to be an interesting problem. It was possible, though, with some sub-queries and some tradeoffs. For example, I could only get results for one item at a time, and the larger the dataset, the slower the calculation got. But, most importantly, it gave me a place to start. And it worked!</p>
<p>Once the requirements started shaking out, infrastructure changes became frequent. I was inside the AWS free tier, but as the SQLite file grew, I started to get worried about EBS storage and keeping things free. So my architecture had to change, and I moved to using RDS and PostgreSQL.</p>
<p>Then the fact that the data, once inserted into PostgreSQL, was effectively dead started grating on me, and it was also filling up my free tier allotment of space in RDS.</p>
<p>So I brought in an ETL process to take the dead data out of PostgreSQL, store it more efficiently, and use it to populate a cache.</p>
<p>And then the cache grew too fast and I started seeing evictions. But a new problem means a new solution. I needed a better way (or even a way) of compressing the data going into the cache. I’ve worked with Protobuf before, so after a bit of a search around to see if Avro would be an immediately better fit, I decided to go with compression via Protobuf. I also reworked the structure of the stored data, because once it was compressed, the keys were still a major source of bloat. And that worked beautifully. I went from being able to store a few months of data to a few years.</p>
<p>And that’s where things stand right now. The UI isn’t much to talk about but it’s clean (sparse some may say) and it’s pretty zippy. Which is a long way from the minutes it used to take to load only a single month of data.</p>
<p><a href="http://www.gw2roar.com">http://www.gw2roar.com</a></p>
<p>I’m going to keep talking about the choices I made (and continue to make) on this project, and why I had to make them. For example, the significant work around keeping both cache and database sizes sane, automating the ETL process, automating deployments, and anything that comes next.</p></content><author><name>Dyana Rose</name></author><category term="blog" /><summary type="html">On and off I’ve been working on a site that displays candlestick charts using data from the Guild Wars 2 trading post, but I’ve never publicly linked to it.</summary></entry><entry><title type="html">graceful shutdown of java apps under docker</title><link href="http://dyanarose.github.io/blog/2017/08/26/graceful-shutdown-of-java-apps-under-docker/" rel="alternate" type="text/html" title="graceful shutdown of java apps under docker" /><published>2017-08-26T13:39:00+01:00</published><updated>2017-08-26T13:39:00+01:00</updated><id>http://dyanarose.github.io/blog/2017/08/26/graceful-shutdown-of-java-apps-under-docker</id><content type="html" xml:base="http://dyanarose.github.io/blog/2017/08/26/graceful-shutdown-of-java-apps-under-docker/"><p>I ran across a problem recently while working on a Java application that needed to be allowed to finish processing its current batch before shutting down after receipt of a shutdown signal.</p>
<p>This app has a main loop that receives messages off an AWS SQS queue, processes those messages, takes action if required (making a put request to a third party), and then deletes the messages off the queue.</p>
<p>Each action must only ever send a unique record to the API. The third party doesn’t expose any unique identifier for a record (though it does provide an endpoint to get a list of all existing records), so the application itself must handle the idea of uniqueness.</p>
<p>So, in the case of a double whammy of my app being shut down after taking action, but before deleting the message from the queue, and the third party being slow to update the list of previous requests, I could end up sending duplicate records when the new app starts up.</p>
<h2 id="no-problem-weve-got-sigterm">No problem, we’ve got SIGTERM</h2>
<p>When <code class="language-plaintext highlighter-rouge">docker stop</code> is called on a container, SIGTERM is sent to PID 1, and in Java a <a href="https://docs.oracle.com/javase/8/docs/technotes/guides/lang/hook-design.html">shutdown hook</a> is used to catch the SIGTERM and clean up any resources before finally stopping.</p>
<p>In this case though, it’s not resources that need cleaning up. This is a continuously running application that needs to finish any work currently in process before returning out of the main loop and stopping.</p>
<p>As it happens, Java has a good way of letting a Thread know that the application would like it to shut down before it starts its next iteration of work.</p>
<h2 id="interrupts">Interrupts</h2>
<p>Java’s static <code class="language-plaintext highlighter-rouge">Thread.interrupted()</code> method returns true if the current Thread has been interrupted since the last time <code class="language-plaintext highlighter-rouge">Thread.interrupted()</code> was called; checking it also clears the Thread’s interrupt status.
(more on <a href="https://docs.oracle.com/javase/tutorial/essential/concurrency/interrupt.html">Interrupts</a>)</p>
<h3 id="why-are-interrupts-useful">Why are interrupts useful</h3>
<p>In the main loop, a condition of <code class="language-plaintext highlighter-rouge">while(!Thread.interrupted())</code> will allow the while block to run to completion, but also prevent the next execution if an interrupt occurred during the previous run, which is exactly what needs to happen to allow messages to complete processing before the app shuts down.</p>
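<p>The examples below pass a <code class="language-plaintext highlighter-rouge">RunnableLoop</code> to their Threads. The real class lives in the repo linked at the end of the post; this sketch is only a guess at its shape, to show where <code class="language-plaintext highlighter-rouge">Thread.interrupted()</code> sits in the loop:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class RunnableLoop implements Runnable {
    private final String name;
    private final long timeout;

    RunnableLoop(String name, long timeout) {
        this.name = name;
        this.timeout = timeout;
    }

    @Override
    public void run() {
        // run batches until an interrupt has been observed;
        // Thread.interrupted() also clears the flag
        while (!Thread.interrupted()) {
            try {
                // stand-in for receiving and processing a batch of messages
                Thread.sleep(timeout);
                System.out.println(name + " thread: batch complete");
            } catch (InterruptedException e) {
                // blocking calls throw instead of setting the flag;
                // restore it so the while condition sees the interrupt
                Thread.currentThread().interrupt();
            }
        }
        System.out.println(name + " thread: exiting main loop");
    }
}
</code></pre></div></div>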
<h4 id="how-to-interrupt-a-thread">How to interrupt a thread</h4>
<p>The short answer is ‘by invoking <code class="language-plaintext highlighter-rouge">Thread.interrupt</code>’ in the shutdown hook.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">private</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">threadInterruptedOnShutdown</span><span class="o">(</span><span class="kt">long</span> <span class="n">timeout</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Setting interrupt doesn't cause the application to wait for the thread to exit.</span>
<span class="c1">// The statements inside the interrupt block in the loop may or may not be executed.</span>
<span class="nc">String</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"threadInterruptedOnShutdown wait "</span> <span class="o">+</span> <span class="n">timeout</span><span class="o">;</span>
<span class="nc">Thread</span> <span class="n">t</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(</span><span class="k">new</span> <span class="nc">RunnableLoop</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">timeout</span><span class="o">));</span>
<span class="n">t</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>
<span class="nc">Runtime</span><span class="o">.</span><span class="na">getRuntime</span><span class="o">().</span><span class="na">addShutdownHook</span><span class="o">(</span><span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: setting interrupt"</span><span class="o">);</span>
<span class="n">t</span><span class="o">.</span><span class="na">interrupt</span><span class="o">();</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: shutting down"</span><span class="o">);</span>
<span class="o">}));</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Just calling interrupt on a Thread doesn’t give the control needed to ensure the current work is completed. It doesn’t necessarily wait on the Thread to exit before letting the application shut down. There’s no reason you couldn’t write the code to handle this of course, but the <a href="https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html">ExecutorService</a> already provides that for us.</p>
<p>When submitting a Runnable (or Callable) to the ExecutorService, a handle for a Future is returned. Working together, the Future and the ExecutorService give control over interrupting Threads, waiting for them to exit, and if anything fails to exit, providing an opportunity to do any damage control.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">private</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">futureFromExecutorService</span><span class="o">(</span><span class="kt">long</span> <span class="n">timeout</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// the executor service submit method allows us to get a handle on the thread</span>
<span class="c1">// via a future and set the interrupt in the shutdown hook</span>
<span class="nc">String</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"futureFromExecutorService wait "</span> <span class="o">+</span> <span class="n">timeout</span><span class="o">;</span>
<span class="nc">ExecutorService</span> <span class="n">service</span> <span class="o">=</span> <span class="nc">Executors</span><span class="o">.</span><span class="na">newSingleThreadExecutor</span><span class="o">();</span>
<span class="nc">Future</span><span class="o">&lt;?&gt;</span> <span class="n">app</span> <span class="o">=</span> <span class="n">service</span><span class="o">.</span><span class="na">submit</span><span class="o">(</span><span class="k">new</span> <span class="nc">RunnableLoop</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">timeout</span><span class="o">));</span>
<span class="nc">Runtime</span><span class="o">.</span><span class="na">getRuntime</span><span class="o">().</span><span class="na">addShutdownHook</span><span class="o">(</span><span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: setting interrupt"</span><span class="o">);</span>
<span class="n">app</span><span class="o">.</span><span class="na">cancel</span><span class="o">(</span><span class="kc">true</span><span class="o">);</span>
<span class="n">service</span><span class="o">.</span><span class="na">shutdown</span><span class="o">();</span>
<span class="k">try</span> <span class="o">{</span>
<span class="c1">// give the thread time to shutdown. This needs to be comfortably less than the</span>
<span class="c1">// time the docker stop command will wait for a container to terminate on its own</span>
<span class="c1">// before forcibly killing it.</span>
<span class="k">if</span> <span class="o">(!</span><span class="n">service</span><span class="o">.</span><span class="na">awaitTermination</span><span class="o">(</span><span class="mi">7</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">))</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: did not shutdown in time, forcing service shutdown"</span><span class="o">);</span>
<span class="n">service</span><span class="o">.</span><span class="na">shutdownNow</span><span class="o">();</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: shutdown cleanly"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">InterruptedException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: shutdown timer interrupted, forcing service shutdown"</span><span class="o">);</span>
<span class="n">service</span><span class="o">.</span><span class="na">shutdownNow</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}));</span>
<span class="o">}</span>
</code></pre></div></div>
<p>To interrupt the Thread when SIGTERM is received, inside the shutdown hook call <code class="language-plaintext highlighter-rouge">.cancel(true)</code> on the Future. The boolean parameter allows the cancel method to interrupt a running Thread. Without that parameter, the Thread will not be interrupted.</p>
<p>Once the Future is cancelled, the service can begin shutting down. The ExecutorService has two types of shutdown. <code class="language-plaintext highlighter-rouge">.shutdown()</code> will stop any more work from being submitted to the service while allowing existing work to execute. <code class="language-plaintext highlighter-rouge">.shutdownNow()</code> on the other hand actively attempts to stop all running tasks.</p>
<p>After calling <code class="language-plaintext highlighter-rouge">.shutdown()</code> use the ExecutorService’s <code class="language-plaintext highlighter-rouge">.awaitTermination</code> method to both give the Threads time to finish any current work and also to handle those that do not return in time. Set the timeout argument to be less than that of the <code class="language-plaintext highlighter-rouge">docker stop</code> command so that there will be time to do damage control and attempt a final <code class="language-plaintext highlighter-rouge">.shutdownNow()</code> before docker kills the container for being non-responsive.</p>
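<p>By default, <code class="language-plaintext highlighter-rouge">docker stop</code> waits 10 seconds after sending SIGTERM before killing the container with SIGKILL; the grace period can be changed with the <code class="language-plaintext highlighter-rouge">-t</code> flag, e.g. <code class="language-plaintext highlighter-rouge">docker stop -t 30 my-container</code> (the container name here is just a placeholder).</p>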
<p>All together, this allows a continuously running application to respond to shutdown requests in a timely manner and also complete the work that is currently in process.</p>
<p>tl;dr <a href="https://github.com/dyanarose/application-interrupts">application-interrupts</a></p></content><author><name>Dyana Rose</name></author><category term="blog" /><summary type="html">I ran across a problem recently while working on a Java application that needed to be allowed to finish processing its current batch before shutting down after receipt of a shutdown signal.</summary></entry><entry><title type="html">so you need to edit a parquet file</title><link href="http://dyanarose.github.io/blog/2017/08/04/so-you-need-to-edit-a-parquet-file/" rel="alternate" type="text/html" title="so you need to edit a parquet file" /><published>2017-08-04T11:40:32+01:00</published><updated>2017-08-04T11:40:32+01:00</updated><id>http://dyanarose.github.io/blog/2017/08/04/so-you-need-to-edit-a-parquet-file</id><content type="html" xml:base="http://dyanarose.github.io/blog/2017/08/04/so-you-need-to-edit-a-parquet-file/"><p>You’ve uncovered a problem in your beautiful Parquet files: some piece of data either snuck in, was calculated incorrectly, or there was just a bug. You know exactly how to correct the data, but how do you update the files?</p>
<p>tl;dr: <a href="https://github.com/dyanarose/parquet-edit-examples">spark-edit-examples</a></p>
<h3 id="its-all-immutable">It’s all immutable</h3>
<p>The problem we have when we need to edit the data is that our data structures are immutable.</p>
<p>You can add partitions to Parquet files, but you can’t edit the data in place. Spark DataFrames are immutable.</p>
<p>But ultimately we can mutate the data; we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files using a combination of schemas and UDFs to correct the bad data.</p>
<h2 id="schemas">Schemas</h2>
<p>Reading in data using a schema gives you a lot of power over the resultant structure of the DataFrame (not to mention it makes reading in JSON files a lot faster and will allow you to union compatible Parquet files).</p>
<h4 id="case-1-i-need-to-drop-an-entire-column">Case 1: I need to drop an entire column</h4>
<p>To drop an entire column, read the data in with a schema that doesn’t contain that column. When you write the DataFrame back out, the column will no longer exist.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/ColumnTransform.scala">ColumnTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">ColumnTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="c1">// read in the data with a new schema</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">ColumnDropSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h4 id="case-2-i-need-to-drop-full-rows-of-data">Case 2: I need to drop full rows of data</h4>
<p>To drop full rows, read in the data and select the data you want to save into a new DataFrame using a where clause. When you write the new DataFrame it will only have the rows that match the where clause.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/WhereTransform.scala">WhereTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">WhereTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">originalData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">RawDataSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// select only the good data rows</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myField is null"</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h2 id="user-defined-functions-udfs">User Defined Functions (UDFs)</h2>
<p><a href="https://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/">UDFs in Spark</a> are used to apply functions to a row of data. The result of the UDF becomes the field value.</p>
<p>Note that when using UDFs you must alias the resultant column, otherwise it will end up renamed to something like <code class="language-plaintext highlighter-rouge">UDF(fieldName)</code>.</p>
<h4 id="case-3-i-need-to-edit-the-value-of-a-simple-type-string-boolean-">Case 3: I need to edit the value of a simple type (String, Boolean, …)</h4>
<p>To edit a simple type, you first need to create a function that takes and returns the same type.</p>
<p>This function is then registered for use as a UDF, and it can then be applied to a field in a select clause.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/SimpleTransform.scala">SimpleTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">SimpleTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">originalData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">RawDataSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// take in a String, return a String</span>
<span class="c1">// cleanFunc takes the String field value and return the empty string in its place</span>
<span class="c1">// you can interrogate the value and return any String here</span>
<span class="k">def</span> <span class="nf">cleanFunc</span><span class="k">:</span> <span class="o">(</span><span class="kt">String</span> <span class="o">=&gt;</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span> <span class="s">""</span> <span class="o">}</span>
<span class="c1">// register the func as a udf</span>
<span class="k">val</span> <span class="nv">clean</span> <span class="k">=</span> <span class="nf">udf</span><span class="o">(</span><span class="n">cleanFunc</span><span class="o">)</span>
<span class="c1">// required for the $ column syntax</span>
<span class="k">import</span> <span class="nn">spark.sqlContext.implicits._</span>
<span class="c1">// if you have data that doesn't need editing, you can separate it out</span>
<span class="c1">// The data will need to be in a form that can be unioned with the edited data</span>
<span class="c1">// That can be done by selecting out the fields in the same way in both the good and transformed data sets.</span>
<span class="k">val</span> <span class="nv">alreadyGoodData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myField is null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myMap"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myStruct"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// apply the udf to the fields that need editing</span>
<span class="c1">// selecting out all the data that will be present in the final parquet file</span>
<span class="k">val</span> <span class="nv">transformedData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myField is not null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="nf">clean</span><span class="o">(</span><span class="n">$</span><span class="s">"myField"</span><span class="o">).</span><span class="py">as</span><span class="o">(</span><span class="s">"myField"</span><span class="o">),</span>
<span class="n">$</span><span class="s">"myMap"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myStruct"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// union the two DataFrames</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">alreadyGoodData</span><span class="o">.</span><span class="py">union</span><span class="o">(</span><span class="n">transformedData</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h4 id="case-4-i-need-to-edit-the-value-of-a-maptype">Case 4: I need to edit the value of a MapType</h4>
<p>MapTypes follow the same pattern as simple types. You write a function that takes a Map of the correct key and value types and returns a Map of the same types.</p>
<p>In the following example, an entire entry in the Map[String,String] is removed from the final data by filtering on the keyset.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/MapTypeTransform.scala">MapTypeTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">MapTypeTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">originalData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">RawDataSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// cleanFunc will simply take the MapType and return an edited Map</span>
<span class="c1">// in this example it removes one member of the map before returning</span>
<span class="k">def</span> <span class="nf">cleanFunc</span><span class="k">:</span> <span class="o">(</span><span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="o">=&gt;</span> <span class="nc">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">])</span> <span class="k">=</span> <span class="o">{</span> <span class="n">m</span> <span class="k">=&gt;</span> <span class="nv">m</span><span class="o">.</span><span class="py">filterKeys</span><span class="o">(</span><span class="n">k</span> <span class="k">=&gt;</span> <span class="n">k</span> <span class="o">!=</span> <span class="s">"editMe"</span><span class="o">)</span> <span class="o">}</span>
<span class="c1">// register the func as a udf</span>
<span class="k">val</span> <span class="nv">clean</span> <span class="k">=</span> <span class="nf">udf</span><span class="o">(</span><span class="n">cleanFunc</span><span class="o">)</span>
<span class="c1">// required for the $ column syntax</span>
<span class="k">import</span> <span class="nn">spark.sqlContext.implicits._</span>
<span class="c1">// if you have data that doesn't need editing, you can separate it out</span>
<span class="c1">// The data will need to be in a form that can be unioned with the edited data</span>
<span class="c1">// I do that here by selecting out all the fields.</span>
<span class="k">val</span> <span class="nv">alreadyGoodData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myMap.editMe is null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myMap"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myStruct"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// apply the udf to the fields that need editing</span>
<span class="c1">// selecting out all the data that will be present in the final parquet file</span>
<span class="k">val</span> <span class="nv">transformedData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myMap.editMe is not null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="nf">clean</span><span class="o">(</span><span class="n">$</span><span class="s">"myMap"</span><span class="o">).</span><span class="py">as</span><span class="o">(</span><span class="s">"myMap"</span><span class="o">),</span>
<span class="n">$</span><span class="s">"myStruct"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// union the two DataFrames</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">alreadyGoodData</span><span class="o">.</span><span class="py">union</span><span class="o">(</span><span class="n">transformedData</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
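<p>One caveat on the <code class="language-plaintext highlighter-rouge">filterKeys</code> call above: on Scala 2.11 and 2.12 it returns a lazy view rather than a strict Map, which has been known to cause serialization problems in Spark jobs. A hedged alternative sketch, assuming the same Map[String, String] column, drops the key with <code class="language-plaintext highlighter-rouge">-</code>, which always returns a strict Map:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.spark.sql.functions.udf

// `-` returns a strict Map, avoiding the lazy view that filterKeys
// produces on older Scala versions.
val cleanStrict = udf { m: Map[String, String] =&gt; m - "editMe" }
</code></pre></div></div>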
<h4 id="case-5-i-need-to-change-the-value-of-a-member-of-a-structtype">Case 5: I need to change the value of a member of a StructType</h4>
<p>Working with StructTypes requires an addition to the UDF registration statement. By supplying the schema of the StructType, you can manipulate the struct with a function that takes and returns a Row.</p>
<p>As Rows are immutable, a new Row must be created with the same field order, types, and field count as the schema. But since the schema of the data is known, it’s relatively easy to reconstruct a new Row with the correct fields.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/StructTypeTransform.scala">StructTypeTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">StructTypeTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">originalData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">RawDataSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// cleanFunc will take the struct as a Row and return a new Row with edited fields</span>
<span class="c1">// note that the ordering and count of the fields must remain the same</span>
<span class="k">def</span> <span class="nf">cleanFunc</span><span class="k">:</span> <span class="o">(</span><span class="kt">Row</span> <span class="o">=&gt;</span> <span class="kt">Row</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span> <span class="n">r</span> <span class="k">=&gt;</span> <span class="nv">RowFactory</span><span class="o">.</span><span class="py">create</span><span class="o">(</span><span class="nv">r</span><span class="o">.</span><span class="py">getAs</span><span class="o">[</span><span class="kt">BooleanType</span><span class="o">](</span><span class="mi">0</span><span class="o">),</span> <span class="s">""</span><span class="o">)</span> <span class="o">}</span>
<span class="c1">// register the func as a udf</span>
<span class="c1">// give the UDF a schema or the Row type won't be supported</span>
<span class="k">val</span> <span class="nv">clean</span> <span class="k">=</span> <span class="nf">udf</span><span class="o">(</span><span class="n">cleanFunc</span><span class="o">,</span>
<span class="nc">StructType</span><span class="o">(</span>
<span class="nc">StructField</span><span class="o">(</span><span class="s">"myField"</span><span class="o">,</span> <span class="nc">BooleanType</span><span class="o">,</span> <span class="kc">true</span><span class="o">)</span> <span class="o">::</span>
<span class="nc">StructField</span><span class="o">(</span><span class="s">"editMe"</span><span class="o">,</span> <span class="nc">StringType</span><span class="o">,</span> <span class="kc">true</span><span class="o">)</span> <span class="o">::</span>
<span class="nc">Nil</span>
<span class="o">)</span>
<span class="o">)</span>
<span class="c1">// required for the $ column syntax</span>
<span class="k">import</span> <span class="nn">spark.sqlContext.implicits._</span>
<span class="c1">// if you have data that doesn't need editing, you can separate it out</span>
<span class="c1">// The data will need to be in a form that can be unioned with the edited data</span>
<span class="c1">// I do that here by selecting out all the fields.</span>
<span class="k">val</span> <span class="nv">alreadyGoodData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myStruct.editMe is null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myStruct"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myMap"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// apply the udf to the fields that need editing</span>
<span class="c1">// selecting out all the data that will be present in the final parquet file</span>
<span class="k">val</span> <span class="nv">transformedData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myStruct.editMe is not null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="nf">clean</span><span class="o">(</span><span class="n">$</span><span class="s">"myStruct"</span><span class="o">).</span><span class="py">as</span><span class="o">(</span><span class="s">"myStruct"</span><span class="o">),</span>
<span class="n">$</span><span class="s">"myMap"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// union the two DataFrames</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">alreadyGoodData</span><span class="o">.</span><span class="py">union</span><span class="o">(</span><span class="n">transformedData</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
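<p>If the struct has many fields, rebuilding the Row positionally gets brittle. As an alternative sketch (assuming, as above, that the field to blank out is named editMe), you can rebuild the Row from its own schema so that only the targeted field changes; in Spark 2.x a struct column arrives in a UDF as a Row that carries its schema:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.spark.sql.Row

// Rebuild the Row from its schema, editing only the named field.
def cleanByName: (Row =&gt; Row) = { r =&gt;
  val values = r.schema.fieldNames.map {
    case "editMe" =&gt; ""                 // blank out the target field
    case name     =&gt; r.getAs[Any](name) // pass everything else through
  }
  Row(values: _*)
}
</code></pre></div></div>
<p>The registration step is unchanged: the function still needs the explicit StructType, because Spark can’t infer a return schema from Row.</p>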
<h3 id="finally">Finally</h3>
<p>Always test your transforms before you delete the original data!</p></content><author><name>Dyana Rose</name></author><category term="blog" /><summary type="html">You’ve uncovered a problem in your beautiful parquet files, some piece of data either snuck in, or was calculated incorrectly, or there was just a bug. You know exactly how to correct the data, but how do you update the files?</summary></entry><entry><title type="html">Exploring Spark SQL DataTypes</title><link href="http://dyanarose.github.io/blog/2016/04/09/exploring-spark-sql-datatypes/" rel="alternate" type="text/html" title="Exploring Spark SQL DataTypes" /><published>2016-04-09T12:57:06+01:00</published><updated>2016-04-09T12:57:06+01:00</updated><id>http://dyanarose.github.io/blog/2016/04/09/exploring-spark-sql-datatypes</id><content type="html" xml:base="http://dyanarose.github.io/blog/2016/04/09/exploring-spark-sql-datatypes/"><p>I’ve been exploring how different DataTypes in Spark SQL are imported from line delimited json to try to understand which DataTypes can be used for a semi-structured data set I’m converting to parquet files. The data won’t all be processed at once and the schema will need to grow, so it’s imperative that the parquet files have schemas that are compatible.</p>
<p>The only one I really can’t get working yet is the CalendarIntervalType.</p>
<p>Looking at the Spark source files <a href="https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala">literals.scala</a> and <a href="https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java">CalendarInterval.java</a>, I would assume that <code class="language-plaintext highlighter-rouge">CalendarInterval.fromString</code> is called with the value. However, I’m just getting nulls back when passing in a value like ‘interval 2 days’, even though that same string returns a non-null <code class="language-plaintext highlighter-rouge">CalendarInterval</code> when passed to <code class="language-plaintext highlighter-rouge">CalendarInterval.fromString</code> directly.</p>
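<p>For comparison, calling <code class="language-plaintext highlighter-rouge">CalendarInterval.fromString</code> directly does return a value. A minimal sketch, assuming <code class="language-plaintext highlighter-rouge">org.apache.spark.unsafe.types</code> is on the classpath:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.spark.unsafe.types.CalendarInterval

// A direct call parses the string into a non-null interval,
// unlike the same value read through the json source.
val interval = CalendarInterval.fromString("interval 2 days")
println(interval) // expected to print something like: interval 2 days
</code></pre></div></div>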
<p>Source code for the tests is at <a href="https://github.com/dyanarose/dlr-spark">https://github.com/dyanarose/dlr-spark</a>.</p>
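<p>Each test follows the same shape: read the line delimited json once letting Spark infer the schema, then read it again with the DataType under test set explicitly. A minimal sketch of that harness for the DecimalType case, using the SparkSession API for brevity (the file path and builder details are illustrative, not copied from the test repo):</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// Illustrative harness: infer the schema first, then force the type under test.
val spark = SparkSession.builder.appName("datatype-tests").getOrCreate()
val path = "decimal.json" // hypothetical file of line delimited json

val inferred = spark.read.json(path)
inferred.printSchema()
inferred.show()

val set = spark.read
  .schema(StructType(StructField("decimal", DecimalType(6, 3), true) :: Nil))
  .json(path)
set.printSchema()
set.show()
</code></pre></div></div>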
<p>Results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-------------- DecimalType --------------
------- DecimalType Input
{'decimal': 1.2345}
{'decimal': 1}
{'decimal': 234.231}
{'decimal': Infinity}
{'decimal': -Infinity}
{'decimal': NaN}
{'decimal': '1'}
{'decimal': '1.2345'}
{'decimal': null}
------- DecimalType Inferred Schema
root
|-- decimal: string (nullable = true)
+-----------+
| decimal|
+-----------+
| 1.2345|
| 1|
| 234.231|
| "Infinity"|
|"-Infinity"|
| "NaN"|
| 1|
| 1.2345|
| null|
+-----------+
------- DecimalType Set Schema
root
|-- decimal: decimal(6,3) (nullable = true)
+-------+
|decimal|
+-------+
| 1.235|
| 1.000|
|234.231|
| null|
| null|
| null|
| null|
| null|
| null|
+-------+
-------------- BooleanType --------------
------- BooleanType Input
{'boolean': true}
{'boolean': false}
{'boolean': 'false'}
{'boolean': 'true'}
{'boolean': null}
{'boolean': 1}
{'boolean': 0}
{'boolean': '1'}
{'boolean': '0'}
{'boolean': 'a'}
------- BooleanType Inferred Schema
root
|-- boolean: string (nullable = true)
+-------+
|boolean|
+-------+
| true|
| false|
| false|
| true|
| null|
| 1|
| 0|
| 1|
| 0|
| a|
+-------+
------- BooleanType Set Schema
root
|-- boolean: boolean (nullable = true)
+-------+
|boolean|
+-------+
| true|
| false|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------+
-------------- ByteType --------------
------- ByteType Input
{'byte': 'a'}
{'byte': 'b'}
{'byte': 1}
{'byte': 0}
{'byte': 5}
{'byte': null}
------- ByteType Inferred Schema
root
|-- byte: string (nullable = true)
+----+
|byte|
+----+
| a|
| b|
| 1|
| 0|
| 5|
|null|
+----+
------- ByteType Set Schema
root
|-- byte: byte (nullable = true)
+----+
|byte|
+----+
|null|
|null|
| 1|
| 0|
| 5|
|null|
+----+
-------------- CalendarIntervalType --------------
------- CalendarIntervalType Input
{'calendarInterval': 'interval 2 days'}
{'calendarInterval': 'interval 1 week'}
{'calendarInterval': 'interval 5 years'}
{'calendarInterval': 'interval 6 months'}
{'calendarInterval': 10}
{'calendarInterval': 'interval a'}
{'calendarInterval': null}
------- CalendarIntervalType Inferred Schema
root
|-- calendarInterval: string (nullable = true)
+-----------------+
| calendarInterval|
+-----------------+
| interval 2 days|
| interval 1 week|
| interval 5 years|
|interval 6 months|
| 10|
| interval a|
| null|
+-----------------+
------- CalendarIntervalType Set Schema
root
|-- calendarInterval: calendarinterval (nullable = true)
+----------------+
|calendarInterval|
+----------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+----------------+
-------------- DateType --------------
------- DateType Input
{'date': '2016-04-24'}
{'date': '0001-01-01'}
{'date': '9999-12-31'}
{'date': '2016-04-24 12:10:01'}
{'date': 1461496201000}
{'date': null}
------- DateType Inferred Schema
root
|-- date: string (nullable = true)
+-------------------+
| date|
+-------------------+
| 2016-04-24|
| 0001-01-01|
| 9999-12-31|
|2016-04-24 12:10:01|
| 1461496201000|
| null|
+-------------------+
------- DateType Set Schema
root
|-- date: date (nullable = true)
+----------+
| date|
+----------+
|2016-04-24|
|0001-01-01|
|9999-12-31|
|2016-04-24|
| null|
| null|
+----------+
-------------- DoubleType --------------
------- DoubleType Input
{'double': 1.23456}
{'double': 1}
{'double': 1.7976931348623157E308}
{'double': -1.7976931348623157E308}
{'double': Infinity}
{'double': -Infinity}
{'double': NaN}
{'double': '1'}
{'double': '1.23456'}
{'double': null}
------- DoubleType Inferred Schema
root
|-- double: string (nullable = true)
+--------------------+
| double|
+--------------------+
| 1.23456|
| 1|
|1.797693134862315...|
|-1.79769313486231...|
| "Infinity"|
| "-Infinity"|
| "NaN"|
| 1|
| 1.23456|
| null|
+--------------------+
------- DoubleType Set Schema
root
|-- double: double (nullable = true)
+--------------------+
| double|
+--------------------+
| 1.23456|
| 1.0|
|1.797693134862315...|
|-1.79769313486231...|
| Infinity|
| -Infinity|
| NaN|
| null|
| null|
| null|
+--------------------+
-------------- FloatType --------------
------- FloatType Input
{'float': 1.23456}
{'float': 1}
{'float': 3.4028235E38}
{'float': -3.4028235E38}
{'float': Infinity}
{'float': -Infinity}
{'float': NaN}
{'float': '1'}
{'float': '1.23456'}
{'float': null}
------- FloatType Inferred Schema
root
|-- float: string (nullable = true)
+-------------+
| float|
+-------------+
| 1.23456|
| 1|
| 3.4028235E38|
|-3.4028235E38|
| "Infinity"|
| "-Infinity"|
| "NaN"|
| 1|
| 1.23456|
| null|
+-------------+
------- FloatType Set Schema
root
|-- float: float (nullable = true)
+-------------+
| float|
+-------------+
| 1.23456|
| 1.0|
| 3.4028235E38|
|-3.4028235E38|
| Infinity|
| -Infinity|
| NaN|
| null|
| null|
| null|
+-------------+
-------------- IntegerType --------------
------- IntegerType Input
{'integer': 1}
{'integer': 2147483647}
{'integer': -2147483648}
{'integer': 2147483648}
{'integer': '1'}
{'integer': 1.23456}
{'integer': '1.23456'}
{'integer': null}
------- IntegerType Inferred Schema
root
|-- integer: string (nullable = true)
+-----------+
| integer|
+-----------+
| 1|
| 2147483647|
|-2147483648|
| 2147483648|
| 1|
| 1.23456|
| 1.23456|
| null|
+-----------+
------- IntegerType Set Schema
root
|-- integer: integer (nullable = true)
+-----------+
| integer|
+-----------+
| 1|
| 2147483647|
|-2147483648|
| null|
| null|
| null|
| null|
| null|
+-----------+
-------------- LongType --------------
------- LongType Input
{'long': 1}
{'long': 9223372036854775807}
{'long': -9223372036854775808}
{'long': '1'}
{'long': 1.23456}
{'long': '1.23456'}
{'long': null}
------- LongType Inferred Schema
root
|-- long: string (nullable = true)
+--------------------+
| long|
+--------------------+
| 1|
| 9223372036854775807|
|-9223372036854775808|
| 1|
| 1.23456|
| 1.23456|
| null|
+--------------------+
------- LongType Set Schema
root
|-- long: long (nullable = true)
+--------------------+
| long|
+--------------------+
| 1|
| 9223372036854775807|
|-9223372036854775808|
| null|
| null|