Quickstart

With a valid Python environment, nested-dask and its dependencies can be installed with the pip package manager:

[1]:
# % pip install nested-dask

Nested-Dask is a package that enables parallelized computation of nested associated datasets.

Usage of Nested-Dask very closely follows the usage of Nested-Pandas, but with a layer of Dask concepts introduced on top. This quickstart guide will step through a basic example that mirrors the quickstart guide of Nested-Pandas. First, let’s load some toy data:

[2]:
from nested_dask.datasets import generate_data

# generate_data creates some toy data
ndf = generate_data(10, 100)  # 10 rows, 100 nested rows per row
ndf
[2]:
Nested-Dask NestedFrame Structure:
               a        b        nested
npartitions=1
0              float64  float64  nested<t: [double], flux: [double], band: [string]>
9              ...      ...      ...
Dask Name: repartition, 3 expressions

The above is a Nested-Dask NestedFrame object. It’s currently a “lazy” representation of the data, meaning that no data has actually been brought into memory yet. This lazy view still gives us some useful information on the structure of the data:

  • The header and the dtype row show which columns are in the dataset and their respective dtypes.

  • npartitions=1 indicates how many partitions the dataset has been split into.

  • The 0 and 9 are the “divisions” of the partitions. When the dataset is sorted by the index, the divisions mark the range of index values that resides in each partition.
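
To make the divisions concrete, the sketch below maps an index value to its partition using only a divisions tuple. It assumes the usual Dask convention that each partition covers a half-open range of index values, except the last one, whose upper bound is inclusive. The helper partition_for and the two-partition divisions (0, 5, 9) are hypothetical and exist only for illustration; they are not part of Nested-Dask.

```python
from bisect import bisect_right


def partition_for(index_value, divisions):
    """Return the partition number that holds index_value (hypothetical helper)."""
    if not divisions[0] <= index_value <= divisions[-1]:
        raise KeyError(f"{index_value} is outside the divisions {divisions}")
    # bisect_right finds the first division strictly greater than the value;
    # the final division is an inclusive upper bound, so clamp to the last partition.
    return min(bisect_right(divisions, index_value) - 1, len(divisions) - 2)


# The dataset above: one partition spanning index values 0 through 9
print(partition_for(0, (0, 9)), partition_for(9, (0, 9)))

# An assumed two-partition layout: [0, 5) and [5, 9]
print([partition_for(i, (0, 5, 9)) for i in (0, 4, 5, 9)])
```

With known divisions, a lookup by index value only needs to touch the one partition that can contain it.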

We can signal to Dask that we’d like to actually obtain the data as a nested_pandas.NestedFrame by calling compute.

[3]:
ndf.compute()  # or could use ndf.head(n) to peek at the first n rows
[3]:
                       nested (first row shown)
   a         b         t          flux       band
0  0.553232  0.694558  12.175377  42.164104  g     (100 rows × 3 columns)
1  0.197014  1.166842  12.536513  37.396248  g     (100 rows × 3 columns)
2  0.256170  1.864761  7.488097   3.570787   r     (100 rows × 3 columns)
3  0.409416  1.070263  8.002979   13.827631  r     (100 rows × 3 columns)
4  0.747570  1.222694  15.813757  93.281329  g     (100 rows × 3 columns)
5  0.051989  1.041328  10.481528  55.746571  g     (100 rows × 3 columns)
6  0.485369  0.831962  7.587258   74.04039   r     (100 rows × 3 columns)
7  0.111242  1.332449  6.925255   71.580862  r     (100 rows × 3 columns)
8  0.056775  1.770634  18.560089  14.05368   g     (100 rows × 3 columns)
9  0.544903  1.191014  3.610144   85.842111  g     (100 rows × 3 columns)

10 rows × 3 columns

As with Nested-Pandas, this NestedFrame holds special nested columns in addition to normal Pandas columns. In this case, we have the top level dataframe with 10 rows and 2 typical columns, “a” and “b”. The “nested” column contains a dataframe in each row. We can inspect the contents of the “nested” column using the standard Pandas/Dask API.

[4]:
ndf.nested.compute()[0]
[4]:
t flux band
0 12.175377 42.164104 g
1 19.233503 69.481572 g
... ... ... ...
98 6.708918 73.296905 g
99 9.784017 28.373037 g

100 rows × 3 columns

Here we see that within the “nested” column there are Nested-Pandas NestedFrame objects with their own data. In this case we have 3 columns (“t”, “flux”, and “band”).

Nested-Dask functionality mirrors Nested-Pandas, as we can see via the query function. In this case, we use a Nested-Pandas specific feature to query nested layers using a hierarchical column name (“nested.t” queries the “t” sub-column from the “nested” column of ndf).

[5]:
# Applies the query to "nested", filtering based on "t > 17.0"
result = ndf.query("nested.t > 17.0")
result
[5]:
Nested-Dask NestedFrame Structure:
               a        b        nested
npartitions=1
0              float64  float64  nested<t: [double], flux: [double], band: [string]>
9              ...      ...      ...
Dask Name: lambda, 4 expressions

Once again, the result is lazy and no work has been performed. We can kick off some computation using compute as above or this time using head to just peek at the result:

[6]:
result.head(5)
[6]:
                       nested (first row shown)
   a         b         t          flux       band
0  0.553232  0.694558  19.233503  69.481572  g     (14 rows × 3 columns)
1  0.197014  1.166842  17.560908  30.381616  r     (14 rows × 3 columns)
2  0.256170  1.864761  19.802621  74.559191  g     (13 rows × 3 columns)
3  0.409416  1.070263  17.40892   79.339289  g     (15 rows × 3 columns)
4  0.747570  1.222694  17.217121  9.007239   r     (13 rows × 3 columns)

5 rows × 3 columns

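In this case, the query has filtered the rows within each dataframe of the “nested” column rather than dropping top-level rows: each nested dataframe now holds only the rows with t greater than 17.0.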

[7]:
result.head(5).nested[0]  # no t value lower than 17.0
[7]:
t flux band
0 19.233503 69.481572 g
1 18.837225 27.419595 g
2 18.575388 61.390772 g
3 18.964309 18.795985 g
4 17.992833 23.928589 g
5 19.359389 66.015598 r
6 19.221336 80.603016 g
7 19.726043 63.736115 g
8 19.202128 34.520226 r
9 17.117676 64.947844 r
10 18.935078 32.638825 r
11 17.736965 13.276942 r
12 19.753557 77.232290 r
13 18.749949 8.213464 g
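
The effect of a query on a nested layer can be emulated with plain pandas. The sketch below is not how Nested-Dask implements query; it only illustrates the semantics, assuming each nested entry behaves like an ordinary DataFrame. The names toy_nested and filter_nested, and the toy values, are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for a "nested" column: one small DataFrame per top-level row
toy_nested = [
    pd.DataFrame({"t": [12.2, 19.2, 18.8], "flux": [42.2, 69.5, 27.4]}),
    pd.DataFrame({"t": [12.5, 17.6], "flux": [37.4, 30.4]}),
]


def filter_nested(frames, expr):
    """Apply a pandas query to every nested frame, returning one result per top-level row."""
    return [frame.query(expr).reset_index(drop=True) for frame in frames]


filtered = filter_nested(toy_nested, "t > 17.0")

# The number of top-level rows is unchanged; only the nested rows shrink
print(len(filtered))               # 2
print([len(f) for f in filtered])  # [2, 1]
```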

Nested-Dask’s reduce works almost identically to Nested-Pandas’ reduce, providing a way to apply custom functions to NestedFrame data. The one addition is that we need to provide the Dask meta value for the result: a dataframe-like or series-like object with the same structure (column names and dtypes) as the expected output. Let’s compute the mean flux for each dataframe in the “nested” column.

[8]:
import numpy as np
import pandas as pd

# The result will be a dataframe with a single float column named 0
meta = pd.DataFrame(columns=[0], dtype=float)

# Use the hierarchical column name to select the flux sub-column;
# each row's flux values are passed to np.mean as an array
means = ndf.reduce(np.mean, "nested.flux", meta=meta)
means.compute()
[8]:
0
0 49.355251
1 53.378265
2 53.379516
3 45.582065
4 48.958388
5 46.035598
6 48.913822
7 56.935945
8 46.562114
9 52.938622

10 rows × 1 columns
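
To see what the meta value describes, the sketch below builds the same empty frame with pandas alone and compares it to a hand-rolled per-row mean. The per-row loop is only an emulation of what the reduction produces, not Nested-Dask’s implementation, and toy_flux is a hypothetical stand-in for the “nested.flux” arrays.

```python
import numpy as np
import pandas as pd

# An empty frame that carries no data, only the output schema
meta = pd.DataFrame(columns=[0], dtype=float)
print(len(meta))       # 0
print(meta.dtypes[0])  # float64

# Hypothetical flux arrays, one per top-level row
toy_flux = [np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0])]

# Emulate the reduction: one mean per row, collected into a single column named 0
emulated = pd.DataFrame([np.mean(flux) for flux in toy_flux])
print(emulated[0].tolist())  # [2.0, 15.0]

# The emulated output has the column names and dtype that meta promised
print(list(emulated.columns) == list(meta.columns))
```

Dask uses meta to plan downstream operations without running the function, so a meta whose columns or dtypes differ from the real output can lead to confusing errors later in the graph.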
