Peter edited this page Jan 14, 2018 · 4 revisions

Welcome to the unitex-pt-br wiki!

Balancing chunks of data

Big files are difficult to maintain, both under version control and when edited in a spreadsheet.

  • To split DELAS: chunks ranging from ~1000 to ~3000 lines.
    Example: grep -E "^a[a-f]" DELAS.csv | wc -l (2605), ^a[g-m] (2185), ^a[n-q] (2023), ^a[r-z] (2743), ^b (3132), ^c[a-g] (2927), ^c[h-n] (1242), ...

  • To split DELACF: chunks of ~2000 lines.
    Example: grep -E "^[a-m]" DELACF.csv | wc -l (2332), ^[n-z] (1745).
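
The counts above come from running grep -E ... | wc -l once per prefix. As a minimal sketch, the same check can be done in one pass over the file. The function name count_matches and the sample entries are hypothetical, chosen only for illustration; they are not real DELAS content.

```python
import re
from typing import Dict, Iterable, List


def count_matches(lines: Iterable[str], patterns: List[str]) -> Dict[str, int]:
    """Count the lines matching each regex, like `grep -E PATTERN | wc -l`."""
    compiled = [(p, re.compile(p)) for p in patterns]
    counts = {p: 0 for p in patterns}
    for line in lines:
        for pattern, regex in compiled:
            # re.search with a leading ^ anchors at the start of the line,
            # which is what grep does for each input line.
            if regex.search(line):
                counts[pattern] += 1
    return counts


# Hypothetical sample entries, for illustration only.
sample = ["abacate,N001", "afago,N001", "agora,ADV", "banana,N001", "cabo,N001"]
counts = count_matches(sample, ["^a[a-f]", "^a[g-m]", "^b", "^c[a-g]"])

for pattern, n in counts.items():
    print(pattern, n)
```

To try it on the real data, pass an open file instead of the sample list, for example open("DELAS.csv", encoding="utf-8"); the encoding is an assumption. Prefix ranges whose counts fall far outside the target chunk size can then be widened or narrowed.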

Graphs, didactic samples

To test and demonstrate the conversion algorithms, use some basic samples. The most frequent graphs still need to be checked; for now, random ones are picked.

Most and least frequent graphs: select graph, count(*) as n from dataset.vw2_delas group by 1 order by 2 desc

graph n
A201 9984
N001 9402
V005 9381
N101 9053
A301 3856
... ...
ADV 2628
N301 1730
N004+Pr 1723
A218 1702
...
A201D081 739
...
A001D024 1
A001D026A01 1
...
A011 1
A038 1
A039 1
... ...
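
For readers without access to the database, the frequency query can be approximated in plain Python. This is only a sketch: the rows below are hypothetical placeholders standing in for dataset.vw2_delas, and graph_frequencies is an illustrative name, not part of the project.

```python
from collections import Counter

# Hypothetical (entry, graph) rows standing in for dataset.vw2_delas.
rows = [
    ("entry_1", "A201"),
    ("entry_2", "N001"),
    ("entry_3", "A301"),
    ("entry_4", "A201"),
    ("entry_5", "N001"),
    ("entry_6", "A201"),
]


def graph_frequencies(rows):
    """Return (graph, n) pairs sorted by n descending, like the SQL query."""
    return Counter(graph for _, graph in rows).most_common()


freq = graph_frequencies(rows)

# Graphs that occur only once, such as the n = 1 rows at the end of the table.
singletons = [graph for graph, n in freq if n == 1]

for graph, n in freq:
    print(graph, n)
```

The head of the resulting list gives candidates for the most frequent samples, and the singletons list gives the rare graphs.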

Related issues at unitex-lingua