This is a collection of my open source contributions to pandas, a powerful data analysis toolkit in Python. Its code base is maintained on GitHub and has nearly 3000 contributors.
I have contributed 29 merged pull requests to pandas, and I am currently its Top #66 contributor.

Note that throughout this post, statements about a bug existing in pandas a.b.c do not take backporting into consideration. For instance, "a bug existed in pandas 2.1.3" and "fixed in pandas 2.1.4" only mean that the bug was fixed after the release of pandas 2.1.3; they do not guarantee that one would still see the bug with pandas 2.1.3 now, since the fix for pandas 2.1.4 may be backported to all 2.1.x releases.
Code Contributions
Items in each section are sorted in reverse chronological order by the time of merge.
Groupby
BUG: groupby sum turning inf+inf and (-inf)+(-inf) into nan #53623
In pandas 2.0.3, the sum method of GroupBy objects summed inf + inf and (-inf) + (-inf) to nan instead of inf and -inf respectively, which is incorrect. Moreover, this behavior was inconsistent with calling apply with a standard summation function, which returns the correct result. The problem was caused by the Cython function group_sum, which implements Kahan summation to reduce the numerical error caused by finite-precision floating-point operations. It maintains a compensation term for low-order bits and corrects the excess in later iterations, which can be interpreted as follows: when the input is inf, both y and t become inf, so computing the compensation c involves inf - inf, which gives nan. To fix this, I manually reset the compensation to zero whenever it becomes nan. Also note that, for efficiency, group_sum is written with cython.nogil, so the safe way to do the nan check is to compare c with itself, i.e., c != c. From pandas 2.1.0, this bug is fixed and inf + inf and (-inf) + (-inf) are correctly summed to inf and -inf respectively.
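The essence of the fix can be sketched in plain Python (the actual implementation lives in the Cython function group_sum; the helper name kahan_sum below is made up for illustration):

```python
def kahan_sum(values):
    """Kahan summation with the nan-compensation reset described above."""
    total = 0.0
    comp = 0.0  # compensation for low-order bits
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y  # with inf input, this computes inf - inf == nan
        if comp != comp:        # nogil-safe nan check: nan != nan
            comp = 0.0          # reset so nan does not poison later iterations
        total = t
    return total
```

Without the reset, summing `[inf, inf]` leaves `comp` as nan after the first iteration, and the second iteration turns the running total into nan, which is exactly the bug described above.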
FIX groupby with column selection not returning tuple when grouping by list of a single element #53517
From pandas 2.0.3, when grouping by a singleton list and iterating over the resulting GroupBy object, the correct behavior is that each group name is a tuple of length one, for consistency with non-singleton lists. However, in pandas 2.0.3, performing column selection on the GroupBy object would violate this behavior. This was caused by the keys getting lost in the first place when the GroupBy object was created, since the grouper was passed in instead. With @rhshadrach's help, I fixed this by passing keys explicitly instead of inferring them from the grouper. From pandas 2.1.0, the result is consistent with or without column selection.
BUG DataFrameGroupBy.agg with list not respecting as_index=False #53237
groupby with as_index=False should behave SQL-style, i.e., the group labels should be columns after aggregation rather than being moved to the index. However, if the aggregation functions were passed in as a list, as_index=False was not respected in pandas 2.0.3. This was caused by the result-processing logic, which fell into an early-return case for the above scenario without going through the step that handles as_index=False. I made a simple fix, i.e., calling the reset_index method on the result when as_index=False and the aggregation functions are list-like. From pandas 2.1.0, the aggregation result respects as_index=False in the above scenario.
In pandas 2.0.3, indices could get sorted even if groupby was called with sort=False. As an illustration, first create a DataFrame with an unsorted column "category", i.e., the B's all come before the A's. If we call groupby with sort=False and then perform .quantile(...).unstack(), the result is unexpectedly sorted, so that A goes before B. This was caused by index.levels being sorted while some methods were using those levels. index.levels would, however, not be sorted if created manually instead of via class methods like MultiIndex.from_product. Manually creating the MultiIndex in the _insert_quantile_level function resolved the issue, but at the cost of a nearly 50% performance regression, which came from always using 64-bit indices. This was fixed using coerce_indexer_dtype to find the most appropriate dtype. From pandas 2.1.0, the result is unsorted when so specified, at least for the quantile method, without experiencing the performance regression.
IO
BUG: complex Series/DataFrame display all complex nans as nan+0j #53844
This is a follow-up on pandas-dev/pandas#53764, also made by me, fixing the display of complex nan values. However, the trick of representing nan values as NaN+0.0j whenever the dtype is complex is wrong. In fact, both the real and the imaginary part of a complex nan value can be nan. The previous pull request thus led to all kinds of complex nan values being displayed as NaN+0.0j, even though the underlying data was stored correctly. In this pull request, I split the real and imaginary parts, format them separately, and concatenate them properly. Since the imaginary part can also be NaN, it has to be padded manually to the maximum length as well. From pandas 2.1.0, Series with complex nan values are displayed correctly.
BUG: bad display for complex series with nan #53764
This is a follow-up on pandas-dev/pandas#53862, which fixed the bug in pandas 2.0.3 of being unable to construct Series with complex nan values. However, that pull request did not account for display issues, which could result in garbled output such as N000a000N. The internal reason was that pandas 2.0.3 did not consider complex numbers to have nan values, thus splitting by +, -, and j while still representing complex nan values as NaN. I applied a trick here: if the dtype is complex, then nan values are represented as NaN+0.0j. I then split the real and imaginary parts with some regex, and padded not only the real and imaginary parts of a single number, but also all real (resp. imaginary) parts across the numbers, using an existing helper function. With the above fix, complex("nan") can be displayed properly starting from pandas 2.1.0. Update: This implementation turned out to be not thorough enough. Please refer to my follow-up fix in pandas-dev/pandas#53844 or the item above.
BUG: MultiIndex displays incorrectly with a long element #53044
In pandas 2.0.3, displaying MultiIndex objects with long elements would fail. This was caused by an important variable being set only when breaking out of a for loop during formatting, not addressing the case where the loop terminates naturally, and thus causing further errors. With an easy fix, such MultiIndex objects print correctly starting from pandas 2.1.0.
Missing Data
BUG: interpolate with fillna methods fail to fill across multiblocks #53962
In pandas 2.0.3, calling the interpolate method with method being some fillna method (e.g., "ffill") failed to fill across blocks. As an example, we first make a multi-block DataFrame object. Though not explicitly displayed, this DataFrame is internally multi-block, i.e., columns A through C are in one block and columns D and E are one block each. In pandas 2.0.3, interpolating it with method="ffill" yielded an incorrect result: the interpolation failed to pass across the block of column D. Although such usage has been deprecated, i.e., it is recommended to directly use df.ffill(axis=1) instead, it was still meaningful to fix the bug since deprecation warnings are not errors. The reason for this bug was that, internally, such a multi-block DataFrame needs to be transposed before being passed to the block manager, and then transposed back afterwards. However, the original logic was to transpose only when axis=1 and method is not a fillna method. I updated the logic so that it also transposes when detecting a multi-block internal layout (for reference, this information can be accessed via the _mgr private attribute). Starting from pandas 2.1.0 and until such usage is fully removed, the functionality works correctly despite the deprecation warning.
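A sketch of the multi-block setup (column names are made up; I use the recommended df.ffill(axis=1) spelling here, since interpolate(method="ffill") is deprecated):

```python
import numpy as np
import pandas as pd

# Columns A-C are constructed together and share one internal block;
# D and E, assigned afterwards, each get their own block.
df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 2.0], "C": [3.0, np.nan]})
df["D"] = np.nan
df["E"] = [np.nan, 5.0]

# Filling forward across columns must cross block boundaries,
# e.g., row 0 propagates C's value through D into E.
out = df.ffill(axis=1)
```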
In pandas 2.0.3, constructing Series or DataFrame objects that contain complex nan values (e.g., complex("nan")) would raise an error. This was caused by a hard conversion from complex nan into float dtype in the Cython code. When designing the constructor, pandas did not really take complex nan values into consideration, thus treating all nan as float nan. I added a check for complex nan to avoid this invalid conversion. Starting from pandas 2.1.0, complex("nan") is allowed in Series and DataFrame, displayed as NaN+0.0j. Update: The implementation is correct but the claim about the display is not thorough enough. Please refer to my follow-up fix in pandas-dev/pandas#53844 or the corresponding item in the IO section.
In pandas 2.0.3, the step size of a RangeIndex object was not correctly negated when the index was subtracted from a constant. I made an easy fix that changes step to -step when the operation is rsub. From pandas 2.1.0, constant minus RangeIndex works correctly.
Resample
BUG: resampling empty series loses time zone from dtype #53736
In pandas 2.0.3, empty DataFrame or Series objects would lose time zone information when resampled. The reason for this bug was that the case of an empty DataFrame or Series was handled separately from the non-empty case, and in the former an empty DatetimeIndex was simply created with the specified freq and name without explicitly setting the dtype. The case is similar for PeriodIndex. I fixed this bug by explicitly passing in the correct dtype, so from pandas 2.1.0 the behavior is correct.
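A minimal sketch of the fixed behavior (the time zone is an arbitrary choice):

```python
import pandas as pd

# An empty Series with a tz-aware (but empty) DatetimeIndex.
s = pd.Series([], dtype="float64",
              index=pd.DatetimeIndex([], tz="Europe/Paris"))

# Resampling keeps the time zone in the resulting index dtype.
out = s.resample("D").mean()
```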
Reshaping
BUG: pd.concat dataframes with different datetime64 resolutions #53641
This is about a bug introduced during the development of pandas 2.1.0, so it does not exist in any actual release. In short, when concatenating DataFrame objects with different datetime64 resolutions (e.g. "datetime64[s]" versus "datetime64[ns]"), an extremely weird ValueError would be raised. This was because pandas internally implements a class method _concat_same_type for extension arrays and the like, some with a default of axis=0 and some not supporting the axis keyword at all. Consequently, the original code called the class method without specifying axis, which was fine in many cases but not for datetime arrays, which require axis=1. I noticed that an ea_compat_axis argument is available, such that the axis keyword is not supported if ea_compat_axis=True. Based on this observation, I modified the logic to explicitly pass the axis keyword whenever possible instead of always using the default value. This avoided a regression in the function pd.concat from pandas 2.0.3 to pandas 2.1.0. It even resolved some other problems that may not explicitly use pd.concat but involve implicit concatenations, e.g., using df.loc[n] = ... where df is a DataFrame object with only n rows.
BUG Merge not behaving correctly when having MultiIndex with a single level #53215
In pandas 2.0.3, the method DataFrame.merge did not behave correctly when encountering a MultiIndex with a single level. As an example, we create two DataFrame objects, the only difference being that df1 has an Index while df2 has a MultiIndex with a single level. Now create a (left) DataFrame to check the merging behavior in pandas 2.0.3. The merging behavior of df2 was clearly incorrect, and should be the same as that of df1. This was caused by the logic for deciding whether to internally use a multi-index indexer or a single-index indexer. In particular, pandas 2.0.3 checked the length of the join keys, incorrectly putting df2 into the case of a single-index indexer. ("A",) would then fail to match "A", leading to NaN values. I fixed this by instead checking for MultiIndex instances, so that df2 is put into the right case. From pandas 2.1.0, the merging behavior of DataFrame objects with a single-level MultiIndex is correct, e.g., the merging behavior of df2 agrees with that of df1.
ENH include the first incompatible key in error message when merging #51947
This is my first merged pull request to pandas! In pandas 2.0.3, the method DataFrame.merge raised an error when there were incompatible keys, but provided no information indicating which keys were the culprits. This becomes problematic when the number of columns is large, making it hard to identify the culprit columns and resolve the failure. I improved the error message by mentioning the name of the first incompatible key, which is included starting from pandas 2.1.0.
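A minimal sketch (column names and dtypes are made up): merging on a column whose dtypes disagree raises a ValueError, and with the improvement the message names the offending key.

```python
import pandas as pd

left = pd.DataFrame({"a": [1], "b": ["x"]})
right = pd.DataFrame({"a": [1], "b": [2]})  # "b" has an incompatible dtype

try:
    left.merge(right, on=["a", "b"])
    raised = False
except ValueError as err:
    raised = True
    message = str(err)  # mentions the first incompatible key, "b"
```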
Others
BUG: avoid DeprecationWarning when the Series has index as list of Series #55395
In pandas 2.1.3, the Series constructor would raise a deprecation warning if the index argument was a list of Series. This was caused by the logic validating the index argument, which implicitly looked for the _data attribute; however, _data has been deprecated. The original intention was to check for instances of list, ABCIndex, and ABCSeries while avoiding a circular import. I fixed this by importing ABCIndex and ABCSeries at runtime and performing the explicit check. From pandas 2.1.4, the above construction is clear of this warning.
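For instance (the values are made up), constructing a Series whose index is a list of Series should not emit a DeprecationWarning; here the warning is escalated to an error so the construction would fail if it fired:

```python
import warnings
import pandas as pd

with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)  # fail if the warning fires
    s = pd.Series([1, 2], index=[pd.Series(["a", "b"]), pd.Series([3, 4])])
```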
Maintenance Contributions
Items are sorted in reverse chronological order by the time of merge.
BUG fix deprecation of limit and fill_method in pct_change #55527
CI: add empty line in no-bool-in-generic to avoid black complaining #54855
API: add NaTType and NAType to pandas.api.typing #53958
CI: linting check to ensure lib.NoDefault is only used for typing #53901
CLN: use lib.no_default instead of lib.NoDefault in .pivot #53877
DEPR fill_method and limit keywords in pct_change #53520
DEPR Rename keyword "quantile" to "q" in Rolling.quantile #53216
FIX typo in deprecation message of deprecate_kwarg decorator #53218
Documentation Contributions
Items are sorted in reverse chronological order by the time of merge.