This dataset contains the searchterms that users looked for on the Berlin Open Data Portal (https://daten.berlin.de). Terms are collected per month (starting in February 2019, when we started using our new analytics software), and ranked by how often they were searched (i.e., the number of page impressions).
The code to extract the searchterm statistics is written in Python. It has been tested with Python 3.12.4.
The required libraries are defined in requirements.txt. In particuler, these are:
- ckan_metadata_updater
- keyring
- python-dateutil
- requests
Set up a virtual environment and install the required libraries there:
berlin_dataportal_searchterms % python -m venv venv
(venv) berlin_dataportal_searchterms % pip install -r requirements.txt
You need to set two environment variables:
MAPP_URL
: The base URL of the Mapp service, most likely https://intelligence.eu.mapp.com.MAPP_USER
: The username of the Mapp user that is going to run the query to retrieve the statistics.MAPP_PW
: The password of the Mapp user that is going to run the query to retrieve the statistics.- Alternatively, on MacOS, you can install the
keyring
library (pip install keyring
) and set the password there, as a key calledmapp_api
and with the user name given byMAPP_PW
.
- Alternatively, on MacOS, you can install the
Download here: daten_berlin_de.searchterms.json
For each searchterm that was entered in a given month, the page impressions, visits, average page duration (in seconds) and exit rate (%) are listed.
The following example illustrates the structure of the data file:
{
"timestamp": "2020-05-29T15:21:32+02:00",
"source": "Webtrekk",
"stats": {
"site_uri": "daten.berlin.de",
"earliest": "2019-02",
"latest": "2020-04",
"months": {
"2020-04": {
"terms": {
"corona": {
"impressions": 27,
"visits": 20,
"page_duration_avg": 36.81,
"exit_rate": 20.0
},
"verkehr": {
"impressions": 24,
"visits": 8,
"page_duration_avg": 38.08,
"exit_rate": 0.0
},
...
"new york": {
"impressions": 1,
"visits": 1,
"page_duration_avg": 0.0,
"exit_rate": 0.0
}
},
"removed_items": {
"comment": "Removed 13 searchterms as potentially personal information.",
"count": 13
}
},
"2020-03": {
...
},
...
}
}
}
All searchterms that potentially contain personal information are removed from the data before publishing it here.
In particular, the following categories of searchterms are removed:
- personal names
- (postal) addresses
- geographic coordinates
- personal e-mail adresses
- phone numbers
- land lots (German „Flurstück“)
Instances of these categories are currently not detected automatically, but rather manually via the use of a blocklist (not included in this repository), which is being extended each time the dataset is updated (i.e., every month).
There are exceptions where searchterms are included in the data, even though they belong to one of the exclusion categories. In particular, we allow the following kinds of searchterms:
- Personal names of public figures
The criterion for being a public figure is: there is a (stable) Wikipedia page for that person. The criteria for people to have Wikipedia page are defined here.
Another possible criterion is that a name has an entry in a bibliographic authority file (something like a database of all known authors), such as the Gemeinsame Normdatei. In other words, a name is the name of a published author. - Functional e-mail addresses
Functional e-mail addresses (addresses not tied to a particular person, but to a role or a post such as[email protected]
,[email protected]
etc.) do not contain personal information and can therefore be included.
Searchterms are currently not normalized in any way. This means that different spellings of the same term (most importantly: differences in case) are treated as different searchterms. It is possible to sum page impressions for each term. This is not possible for visits, because the same visit might include two or more searchterms, and so the actual number of visits for a set of searchterms might be less than the sum of visits for all of them.
For example:
{
...
"terms": {
"corona": {
"impressions": 27,
"visits": 20,
"page_duration_avg": 36.81,
"exit_rate": 20.0
},
...
"Corona": {
"impressions": 8,
"visits": 6,
"page_duration_avg": 21.25,
"exit_rate": 0.0
},
...
"covid": {
"impressions": 2,
"visits": 2,
"page_duration_avg": 13.0,
"exit_rate": 0.0
},
...
}
People searched for corona
(lower case), Corona
(upper case), covid
and possibly other related searchterms.
It would be valid to say that the total number of page impressions for all spellings of corona
is 27+8=35
, and 27+8+2=37
for all Corona-related searches.
However, the total number of visits for all spellings of corona
is 20+6=26
or less because some of these searches may have occured within the same visit.
Because adding a new month to the data involves manually editing the blocklist and allowlist, it is more complicated than just running a make target.
The following information is based on the old Ruby-based code and is no longer valid:
Here is what needs to be done:
make unfiltered
This will extract the search data from Webtrekk Analytics up until the last day of the previous month.
The ouput is written to data/temp/daten_berlin_de.searchterms.unfiltered.json
.
We need to manually go through all the new search terms, pick those that are problematic and then either add them to the blocklist or allowlist.
To generate a simple list of terms (without hits, visits etc.) for a given month, do make data/temp/terms_YYYY-MM.json
, e.g.:
make data/temp/terms_2020-06.json
Manually go through the list and extract all potentially problematic search terms. What I do is simply delete all unproblematic ones, leaving me with the list of problematic ones.
For each problematic term, decide if it really needs to be filtered out or if maybe it should be allowed after all. See Filtering Personal Information.
Each new addition to the blocklist simply needs to be added to the appropriate category (though the categories are just a way to structure the list for humans, they are not used otherwise).
Each new addition to the allowlist looks like this:
{
...
"friedrich wilhelm förster": {
"variants": [
"friedrich wilhelm foerster",
"friedrich wilhelm förster"
] ,
"reference": [
"https://de.wikipedia.org/wiki/Friedrich_Wilhelm_Foerster" ,
"https://d-nb.info/gnd/118692038"
]
}
...
So, either add a new variant to an existing entry or create a new one.
Remarks:
- It's possible that an entry has only one
variant
. - The entry's key and the grouping of variants are irrelevant, just a way to structure the list for a human reader.
- Also, the
reference
is technically not necessary, but helpful as a reminder why the decision was made to include a searchterm in the allowlist.
Now that we have the updated block- and allowlist, we can filter the data and create the final data file, which goes into data/target
.
make final
- search logo by FontAwesome under CC BY 4.0.
All software in this repository is published under the MIT License.
All data in this repository (in particular the .json
files) is published under CC BY 3.0 DE.
Dataset URL: https://daten.berlin.de/datensaetze/suchbegriffe-daten-berlin-de
This page was generated from the github repository at https://github.com/berlinonline/berlin_dataportal_searchterms.
2024, Knud Möller, BerlinOnline GmbH
Last changed: 2024-12-17