nested_dask.core
================

.. py:module:: nested_dask.core


Classes
-------

.. autoapisummary::

   nested_dask.core.NestedFrame


Module Contents
---------------

.. py:class:: NestedFrame(expr)

   Bases: :py:obj:`_Frame`, :py:obj:`dask.dataframe.DataFrame`


   An extension of a Dask DataFrame that adds Nested-Pandas functionality.

   .. rubric:: Examples

   >>> import nested_dask as nd
   >>> base = nd.NestedFrame(base_data)
   >>> layer = nd.NestedFrame(layer_data)
   >>> base.add_nested(layer, "layer")


   .. py:method:: __getitem__(item)

      Adds custom __getitem__ functionality for nested columns


   .. py:method:: __setitem__(key, value)

      Adds custom __setitem__ behavior for nested columns


   .. py:method:: from_pandas(data, npartitions=None, chunksize=None, sort=True) -> NestedFrame
      :classmethod:


      Returns a Nested-Dask NestedFrame constructed from a Nested-Pandas
      NestedFrame or Pandas DataFrame.

      :param data: Nested-Pandas NestedFrame or Pandas DataFrame containing the underlying data
      :type data: `NestedFrame` or `DataFrame`
      :param npartitions: The number of partitions of the index to create. Note that depending on
                          the size and index of the dataframe, the output may have fewer
                          partitions than requested.
      :type npartitions: `int`, optional
      :param chunksize: The desired number of rows per index partition to use. Note that
                        depending on the size and index of the dataframe, actual partition
                        sizes may vary.
      :type chunksize: `int`, optional
      :param sort: Whether to sort the frame by a default index.
      :type sort: `bool`, optional

      :returns: **result** -- The constructed Nested-Dask NestedFrame object.
      :rtype: `NestedFrame`


   .. py:method:: from_dask_dataframe(df: dask.dataframe.DataFrame) -> NestedFrame
      :classmethod:


      Converts a Dask DataFrame to a Nested-Dask NestedFrame.

      :param df: A Dask Dataframe to convert

      :rtype: `nested_dask.NestedFrame`


   .. py:method:: from_delayed(dfs, meta=None, divisions=None, prefix='from-delayed', verify_meta=True)
      :classmethod:


      Create Nested-Dask NestedFrames from many Dask Delayed objects.

      Docstring is copied from `dask.dataframe.from_delayed`.

      :param dfs: A ``dask.delayed.Delayed``, a ``distributed.Future``, or an iterable of either
                  of these objects, e.g. returned by ``client.submit``. These comprise the
                  individual partitions of the resulting dataframe.
                  If a single object is provided (not an iterable), then the resulting dataframe
                  will have only one partition.
      :param meta: An empty NestedFrame, pd.DataFrame, or pd.Series that matches the dtypes and column names of
                   the output. This metadata is necessary for many algorithms in dask dataframe
                   to work. For ease of use, some alternative inputs are also available. Instead of a
                   DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that
                   the order of the names should match the order of the columns). Instead of a series, a tuple of
                   (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead
                   to unexpected results, so providing meta is recommended. For more information, see
                   dask.dataframe.utils.make_meta.
      :param divisions: Partition boundaries along the index.
                        For a tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions
                        For the string 'sorted', the delayed values are computed to find the
                        index values. This assumes that the indexes are mutually sorted.
                        If None, index information is not used.
      :param prefix: Prefix to prepend to the keys.
      :param verify_meta: If True check that the partitions have consistent metadata, defaults to True.


   .. py:method:: from_map(func, *iterables, args=None, meta=None, divisions=None, label=None, enforce_metadata=True, **kwargs)
      :classmethod:


      Create a DataFrame collection from a custom function map

      WARNING: The ``from_map`` API is experimental, and stability is not
      yet guaranteed. Use at your own risk!

      :param func: Function used to create each partition. If ``func`` satisfies the
                   ``DataFrameIOFunction`` protocol, column projection will be enabled.
      :type func: callable
      :param \*iterables: Iterable objects to map to each output partition. All iterables must
                          be the same length. This length determines the number of partitions
                          in the output collection (only one element of each iterable will
                          be passed to ``func`` for each partition).
      :type \*iterables: Iterable objects
      :param args: Positional arguments to broadcast to each output partition. Note
                   that these arguments will always be passed to ``func`` after the
                   ``iterables`` positional arguments.
      :type args: list or tuple, optional
      :param meta: An empty NestedFrame, pd.DataFrame, or pd.Series that matches the dtypes and column names of
                   the output. This metadata is necessary for many algorithms in dask dataframe
                   to work. For ease of use, some alternative inputs are also available. Instead of a
                   DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that
                   the order of the names should match the order of the columns). Instead of a series, a tuple of
                   (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead
                   to unexpected results, so providing meta is recommended. For more information, see
                   dask.dataframe.utils.make_meta.
      :param divisions: Partition boundaries along the index.
                        For a tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions
                        For the string 'sorted', the delayed values are computed to find the
                        index values. This assumes that the indexes are mutually sorted.
                        If None, index information is not used.
      :type divisions: tuple, str, optional
      :param label: String to use as the function-name label in the output
                    collection-key names.
      :type label: str, optional
      :param enforce_metadata: Whether to enforce at runtime that the structure of the DataFrame
                               produced by ``func`` actually matches the structure of ``meta``.
                               This will rename and reorder columns for each partition,
                               and will raise an error if this doesn't work,
                               but it won't raise if dtypes don't match.
      :type enforce_metadata: bool, default True
      :param \*\*kwargs: Key-word arguments to broadcast to each output partition. These
                         same arguments will be passed to ``func`` for every output partition.


   .. py:method:: from_flat(df, base_columns, nested_columns=None, on=None, name='nested')
      :classmethod:


      Creates a NestedFrame with base and nested columns from a flat
      dataframe.

      :param df: A flat dataframe.
      :type df: dd.DataFrame or nd.NestedFrame
      :param base_columns: The columns that should be used as base (flat) columns in the
                           output dataframe.
      :type base_columns: list-like
      :param nested_columns: The columns that should be packed into a nested column. All columns
                             in the list are packed, where possible, into a single nested column
                             with the name provided in `name`. If None, is defined as all
                             columns not in `base_columns`.
      :type nested_columns: list-like, or None
      :param on: The name of a column to use as the new index. Typically, the index
                 should have a unique value per row for base columns, and should
                 repeat for nested columns. For example, a dataframe with two
                 columns; a=[1,1,1,2,2,2] and b=[5,10,15,20,25,30] would want an
                 index like [0,0,0,1,1,1] if a is chosen as a base column. If not
                 provided the current index will be used.
      :type on: str or None
      :param name: The name of the output column the `nested_columns` are packed into.

      :returns: A NestedFrame with the specified nesting structure.
      :rtype: NestedFrame
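
      The index layout described for ``on`` can be illustrated without any
      dataframe library. The sketch below is an assumption-laden model, not the
      library's implementation: ``pack_flat_sketch`` is a hypothetical helper
      that represents each output row as a plain dictionary.

      ```python
      from collections import defaultdict


      def pack_flat_sketch(index, columns, base_columns, name="nested"):
          """Group flat rows by index value into one record per unique index."""
          nested_columns = [c for c in columns if c not in base_columns]
          packed = {}
          for row, idx in enumerate(index):
              if idx not in packed:
                  # Base columns keep a single value per index value.
                  record = {c: columns[c][row] for c in base_columns}
                  record[name] = defaultdict(list)
                  packed[idx] = record
              for c in nested_columns:
                  # Nested columns collect every row that shares the index value.
                  packed[idx][name][c].append(columns[c][row])
          return {idx: {**rec, name: dict(rec[name])} for idx, rec in packed.items()}


      result = pack_flat_sketch(
          index=[0, 0, 0, 1, 1, 1],
          columns={"a": [1, 1, 1, 2, 2, 2], "b": [5, 10, 15, 20, 25, 30]},
          base_columns=["a"],
      )
      ```

      With ``a`` as the base column, each index value yields one row holding a
      single ``a`` value and the three matching ``b`` values under ``nested``.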


   .. py:method:: from_lists(df, base_columns=None, list_columns=None, name='nested')
      :classmethod:


      Creates a NestedFrame with base and nested columns from a dataframe
      with list-valued columns.

      :param df: A dataframe with list columns.
      :type df: dd.DataFrame or nd.NestedFrame
      :param base_columns: Any columns that have non-list values in the input df. These will
                           simply be kept as identical columns in the result
      :type base_columns: list-like, or None
      :param list_columns: The list-value columns that should be packed into a nested column.
                           All columns in the list are packed, where possible, into a single
                           nested column with the name provided in `name`. All columns
                           in list_columns must have pyarrow list dtypes, otherwise the
                           operation will fail. If None, is defined as all columns not in
                           `base_columns`.
      :type list_columns: list-like, or None
      :param name: The name of the output column the `list_columns` are packed into.

      :returns: A NestedFrame with the specified nesting structure.
      :rtype: NestedFrame

      .. note::

         As noted above, all columns in `list_columns` must have a pyarrow
         ListType dtype. This is needed for proper meta propagation. To convert
         a list column to this dtype, you can use this command structure:
         ``nf = nf.astype({"colname": pd.ArrowDtype(pa.list_(pa.int64()))})``

         Here ``pa.int64()`` should be replaced with the dtype of the
         underlying data.

         Additionally, it's a known issue in Dask
         (https://github.com/dask/dask/issues/10139) that columns with list
         values will by default be converted to the string type. This will
         interfere with the ability to recast these to pyarrow lists. We
         recommend setting the following dask config setting to prevent this:
         ``dask.config.set({"dataframe.convert-string": False})``


   .. py:method:: compute(**kwargs)

      Compute this Dask collection, returning the underlying dataframe or series.


   .. py:property:: all_columns
      :type: dict


      Returns a dictionary of columns for each base/nested dataframe.


   .. py:property:: nested_columns
      :type: list


      Retrieves the base column names for all nested dataframes.


   .. py:method:: add_nested(nested, name, how='outer') -> NestedFrame

      Packs a dataframe into a nested column

      :param nested: A flat dataframe to pack into a nested column
      :param name: The name given to the nested column
      :param how: How to handle the operation of the two objects.

                  * left: use calling frame’s index (or column if on is specified)

                  * right: use other’s index.

                  * outer: form union of calling frame’s index (or column if on is
                    specified) with other’s index, and sort it lexicographically.

                  * inner: form intersection of calling frame’s index (or column if
                    on is specified) with other’s index, preserving the order of the
                    calling frame’s index.

                  * cross: creates the cartesian product from both frames, preserves
                    the order of the left keys.
      :type how: {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘outer’

      :rtype: `nested_dask.NestedFrame`
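
      The effect of each ``how`` option on the resulting index can be sketched
      with plain lists. This is a simplified model of the join semantics listed
      above, assuming unique index values on both sides;
      ``result_index_sketch`` is a hypothetical helper, not part of the library.

      ```python
      def result_index_sketch(left, right, how="outer"):
          """Return the index that each ``how`` option would keep."""
          if how == "left":
              return list(left)
          if how == "right":
              return list(right)
          if how == "outer":
              # Union of both indexes, sorted.
              return sorted(set(left) | set(right))
          if how == "inner":
              # Intersection, in the order of the calling frame.
              keep = set(right)
              return [i for i in left if i in keep]
          if how == "cross":
              # Cartesian product, left keys varying slowest.
              return [(i, j) for i in left for j in right]
          raise ValueError(f"unknown how: {how!r}")


      base_index = [3, 1, 2]
      layer_index = [2, 3, 4]
      outer = result_index_sketch(base_index, layer_index, "outer")
      inner = result_index_sketch(base_index, layer_index, "inner")
      cross = result_index_sketch(base_index, layer_index, "cross")
      ```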


   .. py:method:: query(expr) -> Self

      Query the columns of a NestedFrame with a boolean expression. Queries
      can target nested columns in addition to the typical column set.

      Docstring copied from nested-pandas query.

      :param expr: The query string to evaluate.

                   Access nested columns using `nested_df.nested_col` (where
                   `nested_df` refers to a particular nested dataframe and
                   `nested_col` is a column of that nested dataframe).

                   You can refer to variables
                   in the environment by prefixing them with an '@' character like
                   ``@a + b``.

                   You can refer to column names that are not valid Python variable names
                   by surrounding them in backticks. Thus, column names containing spaces
                   or punctuation (besides underscores) or starting with digits must be
                   surrounded by backticks. (For example, a column named "Area (cm^2)" would
                   be referenced as ```Area (cm^2)```). Column names which are Python keywords
                   (like "list", "for", "import", etc.) cannot be used.

                   For example, if one of your columns is called ``a a`` and you want
                   to sum it with ``b``, your query should be ```a a` + b``.
      :type expr: str

      :returns: DataFrame resulting from the provided query expression.
      :rtype: DataFrame

      .. rubric:: Notes

      Queries that target a particular nested structure return a dataframe
      with rows of that particular nested structure filtered. For example,
      querying the NestedFrame "df" with nested structure "mynested" as
      below will return all rows of df, but with mynested filtered by the
      condition:

      >>> df.query("mynested.a > 2")
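
      The behaviour in this note can be modelled in plain Python. The sketch
      below is an illustration only: ``query_nested_sketch`` is a hypothetical
      helper, and nested layers are represented as lists of dictionaries rather
      than real nested columns.

      ```python
      def query_nested_sketch(rows, layer, predicate):
          """Keep every top-level row, but filter the rows of one nested layer."""
          return [
              {**row, layer: [r for r in row[layer] if predicate(r)]}
              for row in rows
          ]


      df = [
          {"id": 0, "mynested": [{"a": 1}, {"a": 3}]},
          {"id": 1, "mynested": [{"a": 2}]},
      ]
      filtered = query_nested_sketch(df, "mynested", lambda r: r["a"] > 2)
      ```

      Both top-level rows survive; only the contents of ``mynested`` change.
      How a real NestedFrame represents a nested entry with no remaining rows
      is not modelled here.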


   .. py:method:: dropna(*, axis: pandas._typing.Axis = 0, how: str | pandas._libs.lib.NoDefault = no_default, thresh: int | pandas._libs.lib.NoDefault = no_default, on_nested: bool = False, subset: pandas._typing.IndexLabel | None = None, inplace: bool = False, ignore_index: bool = False) -> Self

      Remove missing values for one layer of the NestedFrame.

      :param axis: Determine if rows or columns which contain missing values are
                   removed.

                   * 0, or 'index' : Drop rows which contain missing values.
                   * 1, or 'columns' : Drop columns which contain missing values.

                   Only a single axis is allowed.
      :type axis: {0 or 'index', 1 or 'columns'}, default 0
      :param how: Determine if row or column is removed from DataFrame, when we have
                  at least one NA or all NA.

                  * 'any' : If any NA values are present, drop that row or column.
                  * 'all' : If all values are NA, drop that row or column.
      :type how: {'any', 'all'}, default 'any'
      :param thresh: Require that many non-NA values. Cannot be combined with how.
      :type thresh: int, optional
      :param on_nested: If not False, applies the call to the nested dataframe in the
                        column with label equal to the provided string. If specified,
                        the nested dataframe should align with any columns given in
                        `subset`.
      :type on_nested: str or bool, optional
      :param subset: Labels along other axis to consider, e.g. if you are dropping rows
                     these would be a list of columns to include.

                     Access nested columns using `nested_df.nested_col` (where
                     `nested_df` refers to a particular nested dataframe and
                     `nested_col` is a column of that nested dataframe).
      :type subset: column label or sequence of labels, optional
      :param inplace: Whether to modify the DataFrame rather than creating a new one.
      :type inplace: bool, default False
      :param ignore_index: If ``True``, the resulting axis will be labeled 0, 1, …, n - 1.

                           .. versionadded:: 2.0.0
      :type ignore_index: bool, default ``False``

      :returns: DataFrame with NA entries dropped from it or None if ``inplace=True``.
      :rtype: DataFrame or None

      .. rubric:: Notes

      Operations that target a particular nested structure return a dataframe
      with rows of that particular nested structure affected.

      Values for `on_nested` and `subset` should be consistent in pointing
      to a single layer, multi-layer operations are not supported at this
      time.
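
      The difference between dropping on the base layer and on a nested layer
      can be sketched in plain Python. This is a simplified model with
      ``how='any'`` only; ``dropna_sketch`` and ``is_na`` are hypothetical
      helpers, and ``subset`` here names columns within the chosen layer
      directly rather than using the ``nested_df.nested_col`` form.

      ```python
      import math


      def is_na(value):
          return value is None or (isinstance(value, float) and math.isnan(value))


      def dropna_sketch(rows, on_nested=False, subset=None):
          """Drop rows with missing values from the base layer or one nested layer."""
          def keep(record):
              keys = subset if subset is not None else record.keys()
              return not any(is_na(record[k]) for k in keys)

          if on_nested is False:
              # Base layer: whole top-level rows are removed.
              return [row for row in rows if keep(row)]
          # Nested layer: top-level rows stay, nested rows are removed.
          return [{**row, on_nested: [r for r in row[on_nested] if keep(r)]} for row in rows]


      rows = [
          {"id": 0, "mag": 1.5, "lc": [{"flux": 1.0}, {"flux": None}]},
          {"id": 1, "mag": None, "lc": [{"flux": 2.0}]},
      ]
      base_result = dropna_sketch(rows)
      nested_result = dropna_sketch(rows, on_nested="lc", subset=["flux"])
      ```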


   .. py:method:: sort_values(by: str | list[str], npartitions: int | None = None, ascending: bool | list[bool] = True, na_position: Literal['first'] | Literal['last'] = 'last', partition_size: float = 128000000.0, sort_function: collections.abc.Callable[[pandas.DataFrame], pandas.DataFrame] | None = None, sort_function_kwargs: collections.abc.Mapping[str, Any] | None = None, upsample: float = 1.0, ignore_index: bool | None = False, shuffle_method: str | None = None, **options) -> Self

      Sort the dataset by one or more columns.

      Sorting a parallel dataset requires expensive shuffles and is generally
      not recommended. See ``set_index`` for implementation details.

      :param by: Column(s) to sort by.
      :type by: str or list[str]
      :param npartitions: The ideal number of output partitions. If None, use the same as the
                          input. If ‘auto’ then decide by memory use. Not used when sorting
                          nested layers.
      :type npartitions: int, None, or ‘auto’
      :param ascending: Sort ascending vs. descending. Defaults to True. Specify list for
                        multiple sort orders. If this is a list of bools, must match the
                        length of the by.
      :type ascending: bool or list[bool], optional
      :param na_position: Puts NaNs at the beginning if ‘first’, puts NaN at the end if
                          ‘last’. Defaults to ‘last’.
      :type na_position: {‘last’, ‘first’}, optional
      :param partition_size: The desired size of each partition in bytes. Defaults to 128e6
                             (128 MB). Not used in nested sorting.
      :type partition_size: float, optional
      :param sort_function: Sorting function to use when sorting underlying partitions. If
                            None, defaults to M.sort_values (the partition library’s
                            implementation of sort_values). Not used when sorting nested
                            layers.
      :type sort_function: function, optional
      :param sort_function_kwargs: Additional keyword arguments to pass to the partition sorting
                                   function. By default, by, ascending, and na_position are provided.
      :type sort_function_kwargs: dict, optional
      :param upsample: Used to increase the number of samples for quantiles. Not used
                       in nested sorting.
      :type upsample: float, optional
      :param ignore_index: If True, the resulting axis will be labeled 0, 1, …, n - 1.
                           Defaults to False.
      :type ignore_index: bool, optional
      :param shuffle_method: The method to use for shuffling data. Defaults to None. Not used
                             in nested sorting.
      :type shuffle_method: str, optional
      :param \*\*options: Additional options to pass to the sorting function.
      :type \*\*options: keyword arguments, optional

      :returns: DataFrame with sorted values.
      :rtype: DataFrame


   .. py:method:: reduce(func, *args, meta=dsk_no_default, infer_nesting=True, **kwargs) -> NestedFrame

      Takes a function and applies it to each top-level row of the NestedFrame.

      Docstring copied from nested-pandas.

      The user may specify which columns the function is applied to, with
      columns from the 'base' layer being passed to the function as
      scalars and columns from the nested layers being passed as numpy arrays.

      :param func: Function to apply to each nested dataframe. The first arguments to `func` should be
                   the columns to apply the function to. See the Notes for recommendations
                   on writing func outputs.
      :type func: callable
      :param args: Positional arguments to pass to the function. The first ``*args`` should be the names of the
                   columns to apply the function to.
      :type args: positional arguments
      :param meta: The dask meta of the output. If not provided, dask will try to
                   infer the metadata. This may lead to unexpected results, so
                   providing meta is recommended.
      :type meta: dataframe or series-like, optional
      :param infer_nesting: If True, the function will pack output columns into nested
                            structures based on column names adhering to a nested naming
                            scheme. E.g. "nested.b" and "nested.c" will be packed into a column
                            called "nested" with columns "b" and "c". If False, all outputs
                            will be returned as base columns.
      :type infer_nesting: bool, default True
      :param kwargs: Keyword arguments to pass to the function.
      :type kwargs: keyword arguments, optional

      :returns: `NestedFrame` with the results of the function applied to the columns of the frame.
      :rtype: `NestedFrame`

      .. rubric:: Notes

      By default, `reduce` will produce a `NestedFrame` with enumerated
      column names for each returned value of the function. For more useful
      naming, it's recommended to have `func` return a dictionary where each
      key is an output column of the dataframe returned by `reduce`.

      Example User Function:

      >>> def my_sum(col1, col2):
      ...     '''reduce will return a NestedFrame with two columns'''
      ...     return {"sum_col1": sum(col1), "sum_col2": sum(col2)}
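
      How such a function is applied row by row can be sketched in plain
      Python. This is a model of the behaviour described above, not the
      library's code: ``reduce_sketch`` is a hypothetical helper, nested layers
      are lists of dictionaries, nested columns are selected with an assumed
      ``layer.column`` naming, and they are passed as lists rather than numpy
      arrays.

      ```python
      def reduce_sketch(rows, func, *columns):
          """Apply ``func`` to the selected columns of each top-level row."""
          results = []
          for row in rows:
              inputs = []
              for col in columns:
                  if "." in col:
                      layer, field = col.split(".", 1)
                      # Nested columns arrive as one sequence per row.
                      inputs.append([r[field] for r in row[layer]])
                  else:
                      # Base columns arrive as scalars.
                      inputs.append(row[col])
              results.append(func(*inputs))
          return results


      def my_sum(col1, col2):
          return {"sum_col1": sum(col1), "sum_col2": sum(col2)}


      rows = [
          {"id": 0, "lc": [{"t": 1, "flux": 10}, {"t": 2, "flux": 20}]},
          {"id": 1, "lc": [{"t": 5, "flux": 7}]},
      ]
      out = reduce_sketch(rows, my_sum, "lc.t", "lc.flux")
      ```

      Each dictionary in ``out`` corresponds to one output row, with the
      dictionary keys acting as the output column names.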

      When using nesting inference (infer_nesting=True), the output may
      contain nested columns. In such cases, the meta should be provided with
      the appropriate dtype for these columns. For example, the following
      function, which produces a nested column "lc":

      >>> import numpy as np
      >>> def complex_output(flux):
      ...     return {"max_flux": np.max(flux),
      ...             "lc.flux_quantiles": np.quantile(flux, [0.1, 0.2, 0.3, 0.4, 0.5]),
      ...             "lc.labels": [0.1, 0.2, 0.3, 0.4, 0.5]}

      Would require the following meta:

      >>> import nested_pandas as npd
      >>> import pandas as pd
      >>> import pyarrow as pa
      >>> # create a NestedDtype for the nested column "lc"
      >>> from nested_pandas.series.dtype import NestedDtype
      >>> lc_dtype = NestedDtype(pa.struct([pa.field("flux_quantiles", pa.list_(pa.float64())),
      ...                                   pa.field("labels", pa.list_(pa.float64()))]))
      >>> # use the lc_dtype in meta creation
      >>> result_meta = npd.NestedFrame({'max_flux': pd.Series([], dtype='float'),
      ...                                'lc': pd.Series([], dtype=lc_dtype)})
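
      The naming scheme used by ``infer_nesting`` can be illustrated with a
      small plain-Python sketch. It only models how dotted output names map to
      nested structures for a single row; ``pack_outputs_sketch`` is a
      hypothetical helper and does not reflect how the library builds the
      actual nested columns.

      ```python
      def pack_outputs_sketch(output, infer_nesting=True):
          """Group "layer.column" keys under one nested entry per layer."""
          if not infer_nesting:
              # Every output stays a base column, dotted name included.
              return dict(output)
          packed = {}
          for key, value in output.items():
              if "." in key:
                  layer, column = key.split(".", 1)
                  packed.setdefault(layer, {})[column] = value
              else:
                  packed[key] = value
          return packed


      row_output = {
          "max_flux": 9.5,
          "lc.flux_quantiles": [1.0, 2.0, 3.0],
          "lc.labels": [0.1, 0.2, 0.3],
      }
      packed = pack_outputs_sketch(row_output)
      flat = pack_outputs_sketch(row_output, infer_nesting=False)
      ```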


   .. py:method:: to_parquet(path, by_layer=True, **kwargs) -> None

      Creates parquet file(s) with the data of a NestedFrame, either
      as a single parquet file directory where each nested dataset is packed
      into its own column or as an individual parquet file directory for each
      layer.

      Docstring copied from nested-pandas.

      Note that here we always opt to use the pyarrow engine for writing
      parquet files.

      :param path: The path to the parquet directory to be written.
      :type path: str
      :param by_layer: NOTE: by_layer=False does not currently preserve divisions
                       reliably. Loading from such a dataset will likely require you
                       to reset and set the index to regenerate divisions information.

                       If False, writes the entire NestedFrame to a single parquet
                       directory.

                       If True, writes each layer to a separate parquet sub-directory
                       within the directory specified by path. The output for each
                       layer is named after that layer; for the base layer this is
                       always "base".
      :type by_layer: bool, default True
      :param kwargs: Keyword arguments to pass to the function.
      :type kwargs: keyword arguments, optional

      :rtype: None
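
      The two output layouts can be sketched as follows. This is an
      illustration under stated assumptions rather than the writer itself:
      ``parquet_layout_sketch`` is a hypothetical helper, and the exact names
      written on disk (for example, any file extension) may differ.

      ```python
      from pathlib import PurePosixPath


      def parquet_layout_sketch(path, nested_layers, by_layer=True):
          """List the parquet directories implied by the ``by_layer`` option."""
          root = PurePosixPath(path)
          if not by_layer:
              # Everything is written to the single directory at ``path``.
              return [str(root)]
          # One sub-directory per layer; the base layer is always "base".
          return [str(root / layer) for layer in ["base", *nested_layers]]


      per_layer = parquet_layout_sketch("out", ["lc"])
      single = parquet_layout_sketch("out", ["lc"], by_layer=False)
      ```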