
Merge pull request #709 from pandas-profiling/develop
v2.11.0
sbrugman authored Feb 20, 2021
2 parents afc10c5 + 2cb1de1 commit 969c4e8
Showing 25 changed files with 911 additions and 135 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -17,7 +17,7 @@ repos:
additional_dependencies: [ pyupgrade==2.7.3 ]
args: [ --nbqa-mutate, --py36-plus ]
- repo: https://github.com/asottile/pyupgrade
rev: v2.9.0
rev: v2.10.0
hooks:
- id: pyupgrade
args: ['--py36-plus','--exit-zero-even-if-changed']
44 changes: 35 additions & 9 deletions README.md
@@ -37,8 +37,7 @@ For each column the following statistics - if relevant for the column type - are

## Announcements

**Version v2.10.1 released**: containing stability fixes for the previous release, which included a major overhaul of the type system, now fully reliant on visions.
See the changelog below to know what has changed.
**Version v2.11.0 released** featuring an exciting integration with Great Expectations that many of you requested (see details below).

**Spark backend in progress**: We are happy to announce that we're nearing v1 of the Spark backend for generating profile reports.
Stay tuned.
@@ -52,18 +51,18 @@ It's extra exciting that GitHub **matches your contribution** for the first year

Find more information here:

- [Changelog v2.10.1](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-10-1)
- [Changelog v2.11.0](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-11-0)
- [Sponsor the project on GitHub](https://github.com/sponsors/sbrugman)

_February 7, 2021 💘_
_February 20, 2021 💘_

---

_Contents:_ **[Examples](#examples)** |
**[Installation](#installation)** | **[Documentation](#documentation)** |
**[Large datasets](#large-datasets)** | **[Command line usage](#command-line-usage)** |
**[Advanced usage](#advanced-usage)** | **[Support](#supporting-open-source)** |
**[Types](#types)** | **[How to contribute](#contributing)** |
**[Advanced usage](#advanced-usage)** | **[Integrations](#integrations)** |
**[Support](#supporting-open-source)** | **[Types](#types)** | **[How to contribute](#contributing)** |
**[Editor Integration](#editor-integration)** | **[Dependencies](#dependencies)**

---
@@ -238,16 +237,43 @@ A set of options is available in order to adapt the report generated.
* `title` (`str`): Title for the report ('Pandas Profiling Report' by default).
* `pool_size` (`int`): Number of workers in thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
* `progress_bar` (`bool`): If True, `pandas-profiling` will display a progress bar.
* `infer_dtypes` (`bool`): When `True` (default), the `dtype` of variables is inferred by `visions` using its typeset logic (for instance, a column of integers stored as strings will be analyzed as numeric).
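As a rough illustration of what dtype inference does (plain Python, not the actual `visions` typeset logic), the sketch below detects that a column of numeric-looking strings can be treated as numeric:

```python
def infer_numeric(values):
    """Toy dtype inference: treat a column of numeric-looking strings as numeric."""
    try:
        return [float(v) for v in values]
    except (TypeError, ValueError):
        return values  # not numeric-looking; leave the column as-is


# Integers stored as strings, e.g. as read from a messy CSV
column = ["31", "27", "45"]
print(infer_numeric(column))  # [31.0, 27.0, 45.0]
```

With `infer_dtypes=False`, such a column would instead be profiled with its stored (string) type.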

More settings can be found in the [default configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml), [minimal configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml) and [dark themed configuration file](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_dark.yaml).

You can find the configuration docs on the [advanced usage page](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html).

**Example**
```python
profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file("output.html")
```

## Integrations

### Great Expectations

<table>
<tr>
<td>

<img alt="Great Expectations" src="https://github.com/great-expectations/great_expectations/raw/develop/generic_dickens_protagonist.png" width="900" />

</td>
<td>

Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics.
For that purpose, `pandas-profiling` integrates with [Great Expectations](https://www.greatexpectations.io).
This is a world-class open-source library that helps you maintain data quality and improve communication about data between teams.
Great Expectations allows you to create Expectations (which are basically unit tests for your data) and Data Docs (conveniently shareable HTML data reports).
`pandas-profiling` features a method to create a suite of Expectations based on the results of your ProfileReport, which you can store, and use to validate another (or future) dataset.

You can find more details on the Great Expectations integration [here](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/great_expectations_integration.html)

</td>
</tr>
</table>
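To give an idea of what such a suite contains, here is a conceptual sketch: an Expectation Suite is essentially a named collection of expectation configurations ("unit tests for your data"). Great Expectations wraps these in its own classes; plain dicts with illustrative column names are used here.

```python
# Conceptual sketch only -- not the Great Expectations API.
suite = {
    "expectation_suite_name": "titanic_expectations",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "PassengerId"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "Fare", "min_value": 0, "max_value": 600},
        },
    ],
}
print(len(suite["expectations"]))  # 2
```

`pandas-profiling` generates configurations like these automatically from the profiled statistics, so you don't have to write them out by hand.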

## Supporting open source

Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without support of our gracious sponsors.
@@ -269,15 +295,15 @@ Maintaining and developing the open-source code for pandas-profiling, with milli

We would like to thank our generous GitHub Sponsors who make pandas-profiling possible:

Martin Sotir, Joseph Yuen, Brian Lee, Stephanie Rivera, nscsekhar, abdulAziz
Martin Sotir, Brian Lee, Stephanie Rivera, abdulAziz, gramster

More info if you would like to appear here: [Github Sponsor page](https://github.com/sponsors/sbrugman)


## Types

Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float, etc.).
`pandas-profiling` currently recognizes the following types: _Boolean, Numerical, Date, Categorical, URL, Path, File_ and _Image_.
`pandas-profiling` currently recognizes the following types: _Boolean, Numerical, Date, Categorical, URL, Path, File_ and _Image_.

We have developed a type system for Python, tailored for data analysis: [visions](https://github.com/dylan-profiler/visions).
Selecting the right typeset drastically reduces the complexity of your analysis code.
1 change: 1 addition & 0 deletions docsrc/source/index.rst
@@ -15,6 +15,7 @@
pages/sensitive_data
pages/metadata
pages/integrations
pages/great_expectations_integration
pages/changelog

.. toctree::
2 changes: 2 additions & 0 deletions docsrc/source/pages/changelog.rst
@@ -2,6 +2,8 @@
Changelog
=========

.. include:: changelog/v2_11_0.rst

.. include:: changelog/v2_10_1.rst

.. include:: changelog/v2_10_0.rst
16 changes: 16 additions & 0 deletions docsrc/source/pages/changelog/v2_11_0.rst
@@ -0,0 +1,16 @@
Changelog v2.11.0
-----------------

🎉 Features
^^^^^^^^^^^
- Great Expectations integration `[430] <https://github.com/pandas-profiling/pandas-profiling/issues/430>`_ `docs <https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/great_expectations_integration.html>`_ (thanks @spbail, @talagluck and the Great Expectations team).
- Introduced the ``infer_dtypes`` parameter to control automatic inference of data types `[676] <https://github.com/pandas-profiling/pandas-profiling/issues/676>`_ (thanks @mohith7548 and @ieaves).
- Improved JSON representation for pd.Series, pd.DataFrame, numpy data and Samples.

🚨 Breaking changes
^^^^^^^^^^^^^^^^^^^
- Global config setting removed; config resets on report initialization.

⬆️ Dependencies
^^^^^^^^^^^^^^^^^^
- Update ``pyupgrade`` to ``2.10.0``.
30 changes: 30 additions & 0 deletions docsrc/source/pages/changelog/v2_12_0.rst
@@ -0,0 +1,30 @@
Changelog v2.12.0
-----------------

🎉 Features
^^^^^^^^^^^
-

🐛 Bug fixes
^^^^^^^^^^^^
-

👷‍♂️ Internal Improvements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-

📖 Documentation
^^^^^^^^^^^^^^^^
-

⚠️ Deprecated
^^^^^^^^^^^^^^^^^
-

🚨 Breaking changes
^^^^^^^^^^^^^^^^^^^
-

⬆️ Dependencies
^^^^^^^^^^^^^^^^^^
-
150 changes: 150 additions & 0 deletions docsrc/source/pages/great_expectations_integration.rst
@@ -0,0 +1,150 @@
====================================
Integration with Great Expectations
====================================

`Great Expectations <https://www.greatexpectations.io>`_ is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are basically *unit tests for your data*. Pandas Profiling features a method to create a suite of Expectations based on the results of your ProfileReport!


About Great Expectations
-------------------------

*Expectations* are assertions about your data. In Great Expectations, those assertions are expressed in a declarative language in the form of simple, human-readable Python methods. For example, in order to assert that you want values in a column ``passenger_count`` in your dataset to be integers between 1 and 6, you can say:

``expect_column_values_to_be_between(column="passenger_count", min_value=1, max_value=6)``

Great Expectations then uses this statement to validate whether the column ``passenger_count`` in a given table is indeed between 1 and 6, and returns a success or failure result. The library currently provides :ref:`several dozen highly expressive built-in Expectations <expectation_glossary>`, and allows you to write custom Expectations.
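Conceptually, such an Expectation boils down to a simple check over the column's values plus a structured success/failure result. A minimal pure-Python sketch (not Great Expectations' actual implementation, and with a simplified result shape):

```python
def expect_column_values_to_be_between(values, min_value, max_value):
    """Toy version of the check: report success and which values fall outside the range."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "result": {
            "unexpected_count": len(unexpected),
            "unexpected_list": unexpected,
        },
    }


passenger_count = [1, 2, 6, 3, 8]
outcome = expect_column_values_to_be_between(passenger_count, min_value=1, max_value=6)
print(outcome["success"])                    # False -- 8 falls outside [1, 6]
print(outcome["result"]["unexpected_list"])  # [8]
```

The real library additionally tracks partial-success thresholds, result formats, and metadata, but the core contract is the same: declarative assertion in, validation result out.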

Great Expectations renders Expectations to clean, human-readable documentation called *Data Docs*. These HTML docs contain both your Expectation Suites as well as your data validation results each time validation is run – think of it as a continuously updated data quality report.

For more information about Great Expectations, check out the `Great Expectations documentation <https://docs.greatexpectations.io/en/latest/>`_ and join the `Great Expectations Slack channel <https://www.greatexpectations.io/slack>`_ for help.


Creating Expectation Suites with Pandas Profiling
--------------------------------------------------

An *Expectation Suite* is simply a set of Expectations. You can create Expectation Suites by writing out individual statements, such as the one above, or by automatically generating them based on profiler results.

Pandas Profiling provides a simple ``to_expectation_suite()`` method that returns a Great Expectations ``ExpectationSuite`` object which contains a set of Expectations.

**Pre-requisites**: In order to run the ``to_expectation_suite()`` method, you will need to install Great Expectations:
``pip install great_expectations``

If you would like to use the additional features such as saving the Suite and building Data Docs, you will also need to configure a Great Expectations Data Context by running ``great_expectations init`` while in your project directory.

.. code-block:: python

    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv("titanic.csv")
    profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

    # Obtain an Expectation Suite with a set of default Expectations
    # By default, this also profiles the dataset, saves the suite,
    # runs validation, and builds Data Docs
    suite = profile.to_expectation_suite()

This assumes that the ``great_expectations`` Data Context directory is in the *same path* where you run the script. In order to specify the location of your Data Context, pass it in as an argument:

.. code-block:: python

    import great_expectations as ge

    data_context = ge.data_context.DataContext(
        context_root_dir="/Users/panda/code/my_ge_project/"
    )
    suite = profile.to_expectation_suite(data_context=data_context)

You can also configure each feature individually in the function call:

.. code-block:: python

    suite = profile.to_expectation_suite(
        suite_name="titanic_expectations",
        data_context=data_context,
        save_suite=False,
        run_validation=False,
        build_data_docs=False,
        handler=handler,
    )

See `the Great Expectations Examples <https://pandas-profiling.github.io/pandas-profiling/examples/master/features/great_expectations_example.html>`_ for complete examples.


Included Expectation types
--------------------------

The ``to_expectation_suite`` method returns a default set of Expectations if Pandas Profiling determines that the assertion holds true for the profiled dataset.
The Expectation types depend on the datatype of a column:

**All columns**

* ``expect_column_values_to_not_be_null``
* ``expect_column_values_to_be_unique``

**Numeric columns**

* ``expect_column_values_to_be_in_type_list``
* ``expect_column_values_to_be_increasing``
* ``expect_column_values_to_be_decreasing``
* ``expect_column_values_to_be_between``

**Categorical columns**

* ``expect_column_values_to_be_in_set``

**Datetime columns**

* ``expect_column_values_to_be_between``

**Filename columns**

* ``expect_file_to_exist``


The default logic is straightforward and can be found in `expectation_algorithms.py <https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/model/expectation_algorithms.py>`_.

Rolling your own Expectation Generation Logic
---------------------------------------------

If you would like to profile datasets at scale, your use case might require changing the default expectations logic.
The ``to_expectation_suite`` method takes a ``handler`` parameter, which gives you full control of the generation process.
Generating expectations takes place in two steps:

- mapping the detected type of each column to a generator function (which receives the column's summary statistics);
- generating expectations based on the summary (e.g. ``expect_column_values_to_not_be_null`` if ``summary["n_missing"] == 0``)
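A pure-Python sketch of that second step (summary statistics in, expectation names out; this is not the library's actual generator code, and all summary keys except ``n_missing`` are illustrative):

```python
def generic_expectations(name, summary):
    """Sketch: turn a column's summary statistics into expectation names.

    Only the ``n_missing`` check comes from the text above; ``n`` and
    ``n_unique`` are assumed key names, used here for illustration.
    """
    expectations = []
    if summary.get("n_missing") == 0:
        expectations.append("expect_column_values_to_not_be_null")
    if summary.get("n_unique") == summary.get("n"):
        expectations.append("expect_column_values_to_be_unique")
    return expectations


print(generic_expectations("user_id", {"n": 100, "n_missing": 0, "n_unique": 100}))
# ['expect_column_values_to_not_be_null', 'expect_column_values_to_be_unique']
```

A custom handler plugs functions with this shape into the type-to-generator mapping, as the example below shows.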

For instance, adding an expectation to columns with constant-length values can be achieved with this code:

.. code-block:: python

    def fixed_length(name, summary, batch, *args):
        """Add a length expectation to columns with constant-length values"""
        if summary["min_length"] == summary["max_length"]:
            batch.expect_column_value_lengths_to_equal(summary["min_length"])
        return name, summary, batch


    class MyExpectationHandler(Handler):
        def __init__(self, typeset, *args, **kwargs):
            mapping = {
                Unsupported: [expectation_algorithms.generic_expectations],
                Categorical: [
                    expectation_algorithms.categorical_expectations,
                    fixed_length,
                ],
                Boolean: [expectation_algorithms.categorical_expectations],
                Numeric: [expectation_algorithms.numeric_expectations],
                URL: [expectation_algorithms.url_expectations],
                File: [expectation_algorithms.file_expectations],
                Path: [expectation_algorithms.path_expectations],
                DateTime: [expectation_algorithms.datetime_expectations],
                Image: [expectation_algorithms.image_expectations],
            }
            super().__init__(mapping, typeset, *args, **kwargs)


    # (initiate report)
    suite = report.to_expectation_suite(handler=MyExpectationHandler(report.typeset))

You can automate even more by extending the typeset (by default the ``ProfilingTypeSet``) with semantic data types specific to your company or use case (for instance disease classification in healthcare, or currency and IBAN in finance).
For that, you can find details in the `visions <https://github.com/dylan-profiler/visions>`_ documentation.
68 changes: 68 additions & 0 deletions examples/features/great_expectations_example.py
@@ -0,0 +1,68 @@
import great_expectations as ge
import pandas as pd

from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

file_name = cache_file(
"titanic.csv",
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)

df = pd.read_csv(file_name)

profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

# Example 1
# Obtain expectation suite, this includes profiling the dataset, saving the expectation suite, validating the
# dataframe, and building data docs
suite = profile.to_expectation_suite(suite_name="titanic_expectations")

# Example 2
# Run Great Expectations while specifying the directory with an existing Great Expectations set-up by passing in a
# Data Context
data_context = ge.data_context.DataContext(context_root_dir="my_ge_root_directory/")

suite = profile.to_expectation_suite(
suite_name="titanic_expectations", data_context=data_context
)

# Example 3
# Just build the suite
suite = profile.to_expectation_suite(
suite_name="titanic_expectations",
save_suite=False,
run_validation=False,
build_data_docs=False,
)

# Example 4
# If you would like to use the method to just build the suite, and then manually save the suite, validate the dataframe,
# and build data docs

# First instantiate a data_context
data_context = ge.data_context.DataContext(context_root_dir="my_ge_root_directory/")

# Create the suite
suite = profile.to_expectation_suite(
suite_name="titanic_expectations",
data_context=data_context,
save_suite=False,
run_validation=False,
build_data_docs=False,
)

# Save the suite
data_context.save_expectation_suite(suite)

# Run validation on your dataframe
batch = ge.dataset.PandasDataset(df, expectation_suite=suite)

results = data_context.run_validation_operator(
"action_list_operator", assets_to_validate=[batch]
)
validation_result_identifier = results.list_validation_result_identifiers()[0]

# Build and open data docs
data_context.build_data_docs()
data_context.open_data_docs(validation_result_identifier)
2 changes: 1 addition & 1 deletion setup.py
@@ -11,7 +11,7 @@
with (source_root / "requirements.txt").open(encoding="utf8") as f:
requirements = f.readlines()

version = "2.10.1"
version = "2.11.0"

with (source_root / "src" / "pandas_profiling" / "version.py").open(
"w", encoding="utf-8"