Quickstart#
With a valid Python environment, nested-dask and its dependencies are easy to install using the pip package manager. The following command can be used to install it:
[1]:
# % pip install nested-dask
Nested-Dask is a package that enables parallelized computation of nested associated datasets.
Usage of Nested-Dask very closely follows the usage of Nested-Pandas, but with a layer of Dask concepts introduced on top. This quickstart guide will step through a basic example that mirrors the quickstart guide of Nested-Pandas. First, let’s load some toy data:
[2]:
from nested_dask.datasets import generate_data
# generate_data creates some toy data
ndf = generate_data(10, 100) # 10 rows, 100 nested rows per row
ndf
[2]:
| a | b | nested | |
|---|---|---|---|
| npartitions=1 | |||
| 0 | float64 | float64 | nested<t: [double], flux: [double], band: [string]> |
| 9 | ... | ... | ... |
The above is a Nested-Dask NestedFrame object. It’s currently a “lazy” representation of the data, meaning that no data has actually been brought into memory yet. This lazy view gives us some useful information on the structure of the data, with notable pieces of information being:
Shows us which columns are in the dataset and their respective dtypes.
npartitions=1indicates how many partitions the dataset has been split into.The
0and9tell us the “divisions” of the partitions. When the dataset is sorted by the index, these divisions are ranges to show which index values reside in each partition.
We can signal to Dask that we’d like to actually obtain the data as nested_pandas.NestedFrame by using compute.
[3]:
ndf.compute() # or could use ndf.head(n) to peak at the first n rows
[3]:
| a | b | nested | |||||||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.553232 | 0.694558 |
100 rows × 3 columns |
||||||
| 1 | 0.197014 | 1.166842 |
100 rows × 3 columns |
||||||
| 2 | 0.256170 | 1.864761 |
100 rows × 3 columns |
||||||
| 3 | 0.409416 | 1.070263 |
100 rows × 3 columns |
||||||
| 4 | 0.747570 | 1.222694 |
100 rows × 3 columns |
||||||
| 5 | 0.051989 | 1.041328 |
100 rows × 3 columns |
||||||
| 6 | 0.485369 | 0.831962 |
100 rows × 3 columns |
||||||
| 7 | 0.111242 | 1.332449 |
100 rows × 3 columns |
||||||
| 8 | 0.056775 | 1.770634 |
100 rows × 3 columns |
||||||
| 9 | 0.544903 | 1.191014 |
100 rows × 3 columns |
As with Nested-Pandas, this NestedFrame holds special nested columns in addition to normal Pandas columns. In this case, we have the top level dataframe with 10 rows and 2 typical columns, “a” and “b”. The “nested” column contains a dataframe in each row. We can inspect the contents of the “nested” column using the standard Pandas/Dask API.
[4]:
ndf.nested.compute()[0]
[4]:
| t | flux | band | |
|---|---|---|---|
| 0 | 12.175377 | 42.164104 | g |
| 1 | 19.233503 | 69.481572 | g |
| ... | ... | ... | ... |
| 98 | 6.708918 | 73.296905 | g |
| 99 | 9.784017 | 28.373037 | g |
100 rows × 3 columns
Here we see that within the “nested” column there are Nested-Pandas NestedFrame objects with their own data. In this case we have 3 columns (“t”, “flux”, and “band”).
Nested-Dask functionality mirrors Nested-Pandas, as we can see via the query function. In this case, we use a Nested-Pandas specific feature to query nested layers using a hierarchical column name (“nested.t” queries the “t” sub-column from the “nested” column of ndf).
[5]:
# Applies the query to "nested", filtering based on "t >17"
result = ndf.query("nested.t > 17.0")
result
[5]:
| a | b | nested | |
|---|---|---|---|
| npartitions=1 | |||
| 0 | float64 | float64 | nested<t: [double], flux: [double], band: [string]> |
| 9 | ... | ... | ... |
Once again, the result is lazy and no work has been performed. We can kick off some computation using compute as above or this time using head to just peek at the result:
[6]:
result.head(5)
[6]:
| a | b | nested | |||||||
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.553232 | 0.694558 |
14 rows × 3 columns |
||||||
| 1 | 0.197014 | 1.166842 |
14 rows × 3 columns |
||||||
| 2 | 0.256170 | 1.864761 |
13 rows × 3 columns |
||||||
| 3 | 0.409416 | 1.070263 |
15 rows × 3 columns |
||||||
| 4 | 0.747570 | 1.222694 |
13 rows × 3 columns |
In this case, the query has actually affected the rows of the “nested” column.
[7]:
result.head(5).nested[0] # no t value lower than 17.0
[7]:
| t | flux | band | |
|---|---|---|---|
| 0 | 19.233503 | 69.481572 | g |
| 1 | 18.837225 | 27.419595 | g |
| 2 | 18.575388 | 61.390772 | g |
| 3 | 18.964309 | 18.795985 | g |
| 4 | 17.992833 | 23.928589 | g |
| 5 | 19.359389 | 66.015598 | r |
| 6 | 19.221336 | 80.603016 | g |
| 7 | 19.726043 | 63.736115 | g |
| 8 | 19.202128 | 34.520226 | r |
| 9 | 17.117676 | 64.947844 | r |
| 10 | 18.935078 | 32.638825 | r |
| 11 | 17.736965 | 13.276942 | r |
| 12 | 19.753557 | 77.232290 | r |
| 13 | 18.749949 | 8.213464 | g |
Nested-Dask reduce functions near-identically to Nested-Pandas reduce, providing a way to call custom functions on NestedFrame data. The one addition is that we’ll need to provide the Dask meta value for the result. This is a dataframe-like or series-like object that has the same structure as the expected output. Let’s compute the mean flux for each dataframe in the “nested” column.
[8]:
import numpy as np
import pandas as pd
# The result will be a series with float values
meta = pd.DataFrame(columns=[0], dtype=float)
# use hierarchical column names to access the flux column
# passed as an array to np.mean
means = ndf.reduce(np.mean, "nested.flux", meta=meta)
means.compute()
[8]:
| 0 | |
|---|---|
| 0 | 49.355251 |
| 1 | 53.378265 |
| 2 | 53.379516 |
| 3 | 45.582065 |
| 4 | 48.958388 |
| 5 | 46.035598 |
| 6 | 48.913822 |
| 7 | 56.935945 |
| 8 | 46.562114 |
| 9 | 52.938622 |
10 rows × 1 columns
[ ]: