Home

Welcome to the unitex-pt-br wiki!

Balancing chunks of data

Big files are difficult to maintain under version control and to edit in a spreadsheet.

To split DELAS: chunks ranging from ~1000 to ~3000 lines.
Example: grep -E "^a[a-f]" DELAS.csv | wc -l (2605), ^a[g-m] (2185), ^a[n-q] (2023), ^a[r-z] (2743), ^b (3132), ^c[a-g] (2927), ^c[h-n] (1242), ...
To split DELACF: chunks of ~2000 lines.
Example: grep -E "^[a-m]" DELACF.csv | wc -l (2332), ^[n-z] (1745).
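
To compare several candidate ranges at once, the grep counts above can be wrapped in a small helper. This is only a sketch: `count_chunks` is a hypothetical name, not something that exists in the repository, and it assumes the dictionary is a plain-text file with one entry per line.

```shell
# count_chunks FILE PATTERN...
# Hypothetical helper: prints each prefix pattern and the number of
# lines in FILE that match it, separated by a tab.
count_chunks() {
  file=$1
  shift
  for pattern in "$@"; do
    # grep -c prints the count of matching lines (same as grep ... | wc -l)
    printf '%s\t%s\n' "$pattern" "$(grep -cE "$pattern" "$file")"
  done
}
```

Usage: count_chunks DELAS.csv '^a[a-f]' '^a[g-m]' '^a[n-q]' '^a[r-z]' '^b'. If a range comes out too large or too small, adjust its letter boundaries and run again.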

To test and demonstrate the conversion algorithms, use some basic samples... Need to check the most frequent ones... Selecting random ones:
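
One possible way to draw random sample entries is GNU `shuf`. This is a sketch only: `sample_lines` is a hypothetical helper name, and it assumes GNU coreutils is installed and that the file holds one entry per line.

```shell
# sample_lines FILE N
# Hypothetical helper: prints N randomly chosen lines from FILE.
# Requires GNU coreutils (shuf).
sample_lines() {
  shuf -n "$2" "$1"
}
```

Usage: sample_lines DELAS.csv 10. The output differs on every run, so save it to a file if the same samples need to be reused.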

Most and least frequent graphs: select graph, count(*) as n from dataset.vw2_delas group by 1 order by 2 desc