Quickstart

With a working Python environment, nested-dask and its dependencies can be installed with the pip package manager:

[1]:

# % pip install nested-dask

Nested-Dask is a package that enables parallelized computation of nested associated datasets.

Usage of Nested-Dask very closely follows the usage of Nested-Pandas, but with a layer of Dask concepts introduced on top. This quickstart guide will step through a basic example that mirrors the quickstart guide of Nested-Pandas. First, let’s load some toy data:

[2]:

from nested_dask.datasets import generate_data

# generate_data creates some toy data
ndf = generate_data(10, 100)  # 10 rows, 100 nested rows per row
ndf

[2]:

Nested-Dask NestedFrame Structure:

	a	b	nested
npartitions=1
0	float64	float64	nested<t: [double], flux: [double], band: [string]>
9	...	...	...

Dask Name: repartition, 3 expressions

The above is a Nested-Dask NestedFrame object. It’s currently a “lazy” representation of the data, meaning that no data has actually been brought into memory yet. This lazy view gives us some useful information on the structure of the data, with notable pieces of information being:

The header row shows which columns are in the dataset and their respective dtypes.
npartitions=1 indicates how many partitions the dataset has been split into.
The 0 and 9 are the “divisions” of the partitions. When the dataset is sorted by the index, the divisions mark the range of index values that resides in each partition.
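To make the role of divisions concrete, here is a minimal pure-Python sketch of how a sorted tuple of divisions maps an index value to a partition. It assumes Dask's usual convention (n partitions have n + 1 divisions, each partition covers a half-open range, and the last partition also includes the final division). The helper `partition_for` and the two-partition split are hypothetical and are not part of the Nested-Dask or Dask API.

```python
from bisect import bisect_right


def partition_for(index_value, divisions):
    """Return the number of the partition that holds index_value.

    Assumes n partitions are described by n + 1 sorted divisions, that
    partition i covers [divisions[i], divisions[i + 1]), and that the last
    partition also includes the final division value.
    """
    if not divisions[0] <= index_value <= divisions[-1]:
        raise KeyError(f"{index_value} is outside the divisions {divisions}")
    n_partitions = len(divisions) - 1
    # bisect_right counts how many division boundaries are <= index_value
    position = bisect_right(divisions, index_value) - 1
    # clamp so that the final division value falls in the last partition
    return min(position, n_partitions - 1)


# One partition with divisions 0 and 9, as in the output above:
# every index value from 0 to 9 lives in partition 0.
single = (0, 9)
print([partition_for(i, single) for i in range(10)])

# A hypothetical two-partition split of the same index.
split = (0, 5, 9)
print(partition_for(4, split), partition_for(5, split), partition_for(9, split))
```

With known divisions, a lookup by index value only needs to touch the one partition that can contain it, which is why sorted indexes are useful in Dask.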

We can signal to Dask that we’d like to actually obtain the data as a nested_pandas.NestedFrame by calling compute.

[3]:

ndf.compute()  # or use ndf.head(n) to peek at the first n rows

[3]:

	a	b	nested
0	0.553232	0.694558	t: 12.175377, flux: 42.164104, ... (100 rows × 3 columns)
1	0.197014	1.166842	t: 12.536513, flux: 37.396248, ... (100 rows × 3 columns)
2	0.256170	1.864761	t: 7.488097, flux: 3.570787, ... (100 rows × 3 columns)
3	0.409416	1.070263	t: 8.002979, flux: 13.827631, ... (100 rows × 3 columns)
4	0.747570	1.222694	t: 15.813757, flux: 93.281329, ... (100 rows × 3 columns)
5	0.051989	1.041328	t: 10.481528, flux: 55.746571, ... (100 rows × 3 columns)
6	0.485369	0.831962	t: 7.587258, flux: 74.04039, ... (100 rows × 3 columns)
7	0.111242	1.332449	t: 6.925255, flux: 71.580862, ... (100 rows × 3 columns)
8	0.056775	1.770634	t: 18.560089, flux: 14.05368, ... (100 rows × 3 columns)
9	0.544903	1.191014	t: 3.610144, flux: 85.842111, ... (100 rows × 3 columns)

10 rows x 3 columns

As with Nested-Pandas, this NestedFrame holds special nested columns in addition to normal Pandas columns. In this case, we have the top level dataframe with 10 rows and 2 typical columns, “a” and “b”. The “nested” column contains a dataframe in each row. We can inspect the contents of the “nested” column using the standard Pandas/Dask API.

[4]:

ndf.nested.compute()[0]

[4]:

	t	flux	band
0	12.175377	42.164104	g
1	19.233503	69.481572	g
...	...	...	...
98	6.708918	73.296905	g
99	9.784017	28.373037	g

100 rows × 3 columns

Here we see that within the “nested” column there are Nested-Pandas NestedFrame objects with their own data. In this case we have 3 columns (“t”, “flux”, and “band”).

Nested-Dask functionality mirrors Nested-Pandas, as we can see via the query function. In this case, we use a Nested-Pandas specific feature to query nested layers using a hierarchical column name (“nested.t” queries the “t” sub-column from the “nested” column of ndf).

[5]:

# Applies the query to "nested", keeping only the nested rows where "t > 17.0"
result = ndf.query("nested.t > 17.0")
result

[5]:

Nested-Dask NestedFrame Structure:

	a	b	nested
npartitions=1
0	float64	float64	nested<t: [double], flux: [double], band: [string]>
9	...	...	...

Dask Name: lambda, 4 expressions

Once again, the result is lazy and no work has been performed. We can kick off some computation using compute as above or this time using head to just peek at the result:

[6]:

result.head(5)

[6]:

	a	b	nested
0	0.553232	0.694558	t: 19.233503, flux: 69.481572, ... (14 rows × 3 columns)
1	0.197014	1.166842	t: 17.560908, flux: 30.381616, ... (14 rows × 3 columns)
2	0.256170	1.864761	t: 19.802621, flux: 74.559191, ... (13 rows × 3 columns)
3	0.409416	1.070263	t: 17.40892, flux: 79.339289, ... (15 rows × 3 columns)
4	0.747570	1.222694	t: 17.217121, flux: 9.007239, ... (13 rows × 3 columns)

5 rows x 3 columns

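In this case, the query filtered the rows inside each “nested” dataframe rather than the top-level rows: each of the five rows shown still appears, but its nested dataframe now holds only 13 to 15 of the original 100 rows.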

[7]:

result.head(5).nested[0]  # no t value lower than 17.0

[7]:

	t	flux	band
0	19.233503	69.481572	g
1	18.837225	27.419595	g
2	18.575388	61.390772	g
3	18.964309	18.795985	g
4	17.992833	23.928589	g
5	19.359389	66.015598	r
6	19.221336	80.603016	g
7	19.726043	63.736115	g
8	19.202128	34.520226	r
9	17.117676	64.947844	r
10	18.935078	32.638825	r
11	17.736965	13.276942	r
12	19.753557	77.232290	r
13	18.749949	8.213464	g
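The effect of a hierarchical query can be sketched in plain Python. The sketch below only models the behavior visible in the output above (nested rows are filtered, top-level rows are kept); it is not how Nested-Dask or Nested-Pandas implement query. The helper `query_nested` and the toy values are made up for illustration.

```python
def query_nested(rows, nested_column, predicate):
    """Keep only the nested records that satisfy predicate.

    Top-level rows and their other columns are left untouched.
    """
    filtered = []
    for row in rows:
        new_row = dict(row)  # shallow copy so the input rows are not modified
        new_row[nested_column] = [
            record for record in row[nested_column] if predicate(record)
        ]
        filtered.append(new_row)
    return filtered


# Two top-level rows, each holding a small list of nested records (toy values).
rows = [
    {"a": 0.55, "b": 0.69, "nested": [
        {"t": 12.2, "flux": 42.2, "band": "g"},
        {"t": 19.2, "flux": 69.5, "band": "g"},
    ]},
    {"a": 0.20, "b": 1.17, "nested": [
        {"t": 17.6, "flux": 30.4, "band": "r"},
        {"t": 12.5, "flux": 37.4, "band": "g"},
    ]},
]

# Rough analogue of query("nested.t > 17.0")
result = query_nested(rows, "nested", lambda record: record["t"] > 17.0)
print(len(result))                            # number of top-level rows
print([len(row["nested"]) for row in result])  # nested rows left per top-level row
```

Both top-level rows survive, while each nested list shrinks to the records that satisfy the condition.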

The Nested-Dask reduce function works almost identically to the Nested-Pandas reduce, providing a way to call custom functions on NestedFrame data. The one addition is that we need to provide the Dask meta value for the result: a dataframe-like or series-like object that has the same structure as the expected output. Let’s compute the mean flux for each dataframe in the “nested” column.

[8]:

import numpy as np
import pandas as pd

# The result will be a dataframe with a single float column named 0
meta = pd.DataFrame(columns=[0], dtype=float)

# use hierarchical column names to access the flux column
# passed as an array to np.mean
means = ndf.reduce(np.mean, "nested.flux", meta=meta)
means.compute()

[8]:

	0
0	49.355251
1	53.378265
2	53.379516
3	45.582065
4	48.958388
5	46.035598
6	48.913822
7	56.935945
8	46.562114
9	52.938622

10 rows × 1 columns
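As a rough mental model of what reduce does with a hierarchical column name, the following pure-Python sketch calls a function once per top-level row, passing the selected nested field as a list of values. The helper `reduce_rows` and the toy data are hypothetical; the real reduce operates on NestedFrame partitions and returns a lazy result, which is why it needs meta to describe the output up front.

```python
from statistics import mean


def reduce_rows(rows, func, *column_names):
    """Call func once per top-level row and collect the results.

    A name such as "nested.flux" selects the "flux" field of every record in
    that row's "nested" column and passes the values as one list; a plain
    name such as "a" passes the row's scalar value.
    """
    results = []
    for row in rows:
        args = []
        for name in column_names:
            if "." in name:
                column, field = name.split(".", 1)
                args.append([record[field] for record in row[column]])
            else:
                args.append(row[name])
        results.append(func(*args))
    return results


# Two top-level rows with nested flux measurements (toy values).
rows = [
    {"a": 1.0, "nested": [{"flux": 10.0}, {"flux": 30.0}]},
    {"a": 2.0, "nested": [{"flux": 5.0}, {"flux": 7.0}, {"flux": 9.0}]},
]

# Rough analogue of ndf.reduce(np.mean, "nested.flux", meta=meta)
print(reduce_rows(rows, mean, "nested.flux"))
```

One value comes back per top-level row, matching the one-row-per-index shape of the means output above.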
