Using the Nested-Dask nest Accessor

The nest accessor implements an additional API layer to support working with nested columns of a NestedFrame.

NOTE: The nest accessor in Nested-Dask implements only a subset of the nest accessor API available in Nested-Pandas.

[1]:
from nested_dask.datasets import generate_data

# generate_data creates some toy data
ndf = generate_data(10, 5)  # 10 rows, 5 nested rows per row
ndf
[1]:
Nested-Dask NestedFrame Structure:
               a        b        nested
npartitions=1
0              float64  float64  nested<t: [double], flux: [double], band: [string]>
9              ...      ...      ...
Dask Name: repartition, 3 expressions

The nest accessor is available when selecting a nested column of a NestedFrame. For example:

[2]:
ndf["nested"].nest
[2]:
<nested_dask.accessor.DaskNestSeriesAccessor at 0x77bd546e1420>

Nested column labels can be viewed using the fields property:

[3]:
ndf["nested"].nest.fields
[3]:
['t', 'flux', 'band']

Nested data can be viewed in different formats using nest accessor functions.

to_flat flattens the nested data into a single flat DataFrame, with one row per nested row:

[4]:
flat_nested = ndf["nested"].nest.to_flat()
flat_nested
[4]:
Dask DataFrame Structure:
               t                flux             band
npartitions=1
0              double[pyarrow]  double[pyarrow]  string[pyarrow]
9              ...              ...              ...
Dask Name: lambda, 5 expressions
[5]:
flat_nested.head(20)
[5]:
   t          flux       band
0  1.325903   61.153573  r
0  6.033141   72.901042  r
0  12.220836  43.097841  r
0  0.091875   38.133159  r
0  4.400336   61.888155  r
1  16.782632  1.105326   g
1  17.605098  14.738311  r
1  1.457988   49.323012  r
1  10.796511  16.801297  g
1  13.16986   68.882152  r
2  8.908296   42.003535  g
2  5.024717   51.351708  g
2  4.938491   40.938804  g
2  16.43555   37.173199  g
2  12.318743  72.938537  r
3  10.552214  20.061313  g
3  19.585625  44.054317  r
3  13.707732  3.786506   g
3  16.863835  21.510408  g
3  1.644076   92.095755  g

The index of the resulting flat DataFrame is repeated, once per nested row, so each flat row maps directly back to the row of the original NestedFrame it came from.
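As a rough illustration of why the repeated index is useful, the sketch below uses plain pandas rather than Nested-Dask. The `flat` and `base` frames are small hypothetical stand-ins (made-up values) for a computed `to_flat()` result and the base columns of a NestedFrame. Grouping on the repeated index yields one value per original row, which can then be joined back onto the base columns.

```python
import pandas as pd

# Hypothetical stand-in for a computed to_flat() result:
# the index repeats once per nested row (two nested rows per original row here).
flat = pd.DataFrame(
    {
        "t": [1.0, 2.0, 3.0, 4.0],
        "flux": [10.0, 20.0, 30.0, 50.0],
        "band": ["r", "g", "g", "r"],
    },
    index=[0, 0, 1, 1],
)

# Hypothetical stand-in for the base (non-nested) columns, one row per index value.
base = pd.DataFrame({"a": [0.5, 0.7]}, index=[0, 1])

# Group on the repeated index to reduce each set of nested rows to one value.
mean_flux = flat.groupby(level=0)["flux"].mean().rename("mean_flux")

# Because the flat index matches the original index, the result joins back directly.
summary = base.join(mean_flux)
print(summary)
```

The same idea should carry over to the Dask-backed result after calling `.compute()`, though that step is not shown here.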

Alternatively, to_lists packages each nested field into one numpy array per row of the original NestedFrame:

[6]:
list_nested = ndf["nested"].nest.to_lists()
list_nested.compute()
[6]:
t flux band
0 [ 1.32590268 6.03314089 12.22083592 0.091874... [61.15357327 72.9010419 43.09784106 38.133159... ['r' 'r' 'r' 'r' 'r']
1 [16.78263201 17.60509845 1.45798796 10.796510... [ 1.10532635 14.73831052 49.32301171 16.801296... ['g' 'r' 'r' 'g' 'r']
2 [ 8.90829613 5.02471726 4.93849145 16.435550... [42.00353481 51.35170808 40.93880372 37.173198... ['g' 'g' 'g' 'g' 'r']
3 [10.55221447 19.58562539 13.70773219 16.863835... [20.06131347 44.05431747 3.78650584 21.510407... ['g' 'r' 'g' 'g' 'g']
4 [14.33024211 9.54222321 4.15897558 9.475802... [64.34397867 72.18977285 87.4422239 41.103670... ['r' 'g' 'r' 'r' 'g']
5 [10.15234846 14.96664897 2.81065012 13.992988... [37.81706109 38.28169835 1.1854475 88.244109... ['r' 'r' 'g' 'r' 'g']
6 [10.12681637 5.03381176 11.52850486 6.838267... [44.87793133 24.3005897 42.6610081 23.490047... ['r' 'r' 'r' 'g' 'g']
7 [7.57029852 9.57362512 5.9748561 0.14549093 9... [28.17126776 86.51496141 88.49730501 23.257712... ['r' 'g' 'g' 'g' 'g']
8 [ 9.25392718 7.50260614 12.75698421 8.255835... [ 0.68343356 67.82475979 11.49455173 45.731410... ['g' 'r' 'r' 'r' 'r']
9 [14.14808928 7.02496436 7.50543544 11.957956... [31.83960187 97.7125154 58.38161137 20.607829... ['r' 'r' 'r' 'r' 'g']
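The list and flat layouts hold the same data in different shapes. The sketch below shows the relationship using plain pandas only. The `lists` frame is a hypothetical stand-in (made-up values, only two fields) for a computed `to_lists()` result, and `DataFrame.explode` is a pandas method used here for illustration, not part of the nest accessor.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a computed to_lists() result:
# one numpy array per field per row, with matching lengths within a row.
lists = pd.DataFrame(
    {
        "t": [np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])],
        "flux": [np.array([10.0, 20.0]), np.array([30.0, 40.0, 50.0])],
    },
    index=[0, 1],
)

# Exploding both array columns together produces one row per array element
# and repeats the index, which mirrors the flat layout.
flat = lists.explode(["t", "flux"]).astype(float)
print(flat)
```

Exploding several columns at once requires that the arrays in each row have equal lengths, which holds here because every field of a nested row describes the same set of observations.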