This is a collection of my open source contributions to pandas, a powerful data analysis toolkit in Python. Its code base is maintained on GitHub and has nearly 3000 contributors.
I have contributed 29 merged pull requests to pandas, and I am currently its Top #66 contributor.

Note that throughout this post, statements about a bug existing in pandas a.b.c do not take backporting into consideration. For instance, "a bug existed in pandas 2.1.3" and "fixed in pandas 2.1.4" only mean that the bug was fixed after the release of pandas 2.1.3; they do not guarantee that one would still see the bug with pandas 2.1.3 now, since the fix for pandas 2.1.4 may be backported to all 2.1.x releases.
Code Contributions
Items in each section are sorted in reverse chronological order by the time of merge.
Groupby
BUG: groupby sum turning inf+inf and (-inf)+(-inf) into nan #53623
In pandas 2.0.3, the sum method of GroupBy objects summed inf + inf and (-inf) + (-inf) to nan instead of inf and -inf respectively, which is incorrect. Moreover, this behavior was inconsistent with calling apply with a standard summation function, which returns the correct result. The problem was caused by the Cython function group_sum, which implements Kahan summation to reduce the numerical error caused by finite-precision floating-point operations. It maintains a compensation term for low-order bits and corrects the excess in later iterations, which can be interpreted as follows: when the input is inf, both y and t become inf, so computing the compensation c involves inf - inf, which gives nan. To fix this, I manually reset the compensation to zero whenever it becomes nan. Also note that, for efficiency, group_sum is written with cython.nogil, so the safe way to do the nan check is to compare c with itself, i.e., c != c. From pandas 2.1.0, this bug is fixed and inf + inf and (-inf) + (-inf) are correctly summed to inf and -inf respectively.
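The essence of the fix can be sketched in plain Python (the actual implementation lives in the Cython function group_sum; the helper name kahan_sum below is made up for illustration):

```python
def kahan_sum(values):
    """Kahan summation with the nan-compensation reset described above."""
    total = 0.0
    comp = 0.0  # compensation for low-order bits
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y  # with inf input, this computes inf - inf == nan
        if comp != comp:        # nogil-safe nan check: nan != nan
            comp = 0.0          # reset so nan does not poison later iterations
        total = t
    return total
```

Without the reset, summing `[inf, inf]` leaves `comp` as nan after the first iteration, and the second iteration turns the running total into nan, which is exactly the bug described above.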
FIX groupby with column selection not returning tuple when grouping by list of a single element #53517
From pandas 2.0.3, when grouping by a singleton list and iterating over the resulting GroupBy object, the correct behavior is that each group name is a tuple of length one, for consistency with non-singleton lists. However, in pandas 2.0.3, performing column selection on the GroupBy object would violate this behavior. This was caused by the keys getting lost in the first place when the GroupBy object was created, since the grouper was passed in instead. With @rhshadrach's help, I fixed this by passing keys explicitly instead of inferring them from the grouper. From pandas 2.1.0, the result is consistent with or without column selection.
BUG DataFrameGroupBy.agg with list not respecting as_index=False #53237
groupby with as_index=False should behave SQL-style, i.e., the group labels should be columns after aggregation rather than being moved to the index. However, if the aggregation functions were passed in as a list, as_index=False was not respected in pandas 2.0.3. This was caused by the result-processing logic, which fell into an early-return case for the above scenario without going through the step that handles as_index=False. I made a simple fix, i.e., calling the reset_index method on the result when as_index=False and the aggregation functions are list-like. From pandas 2.1.0, the aggregation result respects as_index=False in the above scenario.
In pandas 2.0.3, indices could get sorted even if groupby was called with sort=False. As an illustration, first create a DataFrame with an unsorted column "category", i.e., the B's all come before the A's. If we call groupby with sort=False and then perform .quantile(...).unstack(), the result is unexpectedly sorted, so that A goes before B. This was caused by index.levels being sorted while some methods were using those levels. index.levels would, however, not be sorted if created manually instead of via class methods like MultiIndex.from_product. Manually creating the MultiIndex in the _insert_quantile_level function resolved the issue, but at the cost of a nearly 50% performance regression, which came from always using 64-bit indices. This was fixed using coerce_indexer_dtype to find the most appropriate dtype. From pandas 2.1.0, the result is unsorted when so specified, at least for the quantile method, without experiencing the performance regression.
IO
BUG: complex Series/DataFrame display all complex nans as nan+0j #53844
This is a follow-up on pandas-dev/pandas#53764, also made by me, fixing the display of complex nan values. However, the trick of representing nan values as NaN+0.0j whenever the dtype is complex is wrong. In fact, both the real and the imaginary part of a complex nan value can be nan. The previous pull request thus led to all kinds of complex nan values being displayed as NaN+0.0j, even though the underlying data was stored correctly. In this pull request, I split the real and imaginary parts, format them separately, and concatenate them properly. Since the imaginary part can also be NaN, it has to be padded manually to the maximum length as well. From pandas 2.1.0, Series with complex nan values are displayed correctly.
BUG: bad display for complex series with nan #53764
This is a follow-up on pandas-dev/pandas#53862, which fixed the bug in pandas 2.0.3 of being unable to construct Series with complex nan values. However, that pull request did not account for display issues, which could result in garbled output such as N000a000N. The internal reason was that pandas 2.0.3 did not consider complex numbers to have nan values, thus splitting by +, -, and j while still representing complex nan values as NaN. I applied a trick here: if the dtype is complex, then nan values are represented as NaN+0.0j. I then split the real and imaginary parts with some regex, and padded not only the real and imaginary parts of a single number, but also all real (resp. imaginary) parts across the numbers, using an existing helper function. With the above fix, complex("nan") can be displayed properly starting from pandas 2.1.0. Update: This implementation turned out to be not thorough enough. Please refer to my follow-up fix in pandas-dev/pandas#53844 or the item above.
BUG: MultiIndex displays incorrectly with a long element #53044
In pandas 2.0.3, displaying MultiIndex objects with long elements would fail. This was caused by an important variable being set only when breaking out of a for loop during formatting, not addressing the case where the loop terminates naturally, and thus causing further errors. With an easy fix, such MultiIndex objects print correctly starting from pandas 2.1.0.
Missing Data
BUG: interpolate with fillna methods fail to fill across multiblocks #53962
In pandas 2.0.3, calling the interpolate method with method being some fillna method (e.g., "ffill") failed to fill across blocks. As an example, we first make a multi-block DataFrame object. Though not explicitly displayed, this DataFrame is internally multi-block, i.e., columns A through C are in one block and columns D and E are one block each. In pandas 2.0.3, interpolating it with method="ffill" yielded an incorrect result: the interpolation failed to pass across the block of column D. Although such usage has been deprecated, i.e., it is recommended to directly use df.ffill(axis=1) instead, it was still meaningful to fix the bug since deprecation warnings are not errors. The reason for this bug was that, internally, such a multi-block DataFrame needs to be transposed before being passed to the block manager, and then transposed back afterwards. However, the original logic was to transpose only when axis=1 and method is not a fillna method. I updated the logic so that it also transposes when detecting a multi-block internal layout (for reference, this information can be accessed via the _mgr private attribute). Starting from pandas 2.1.0 and until such usage is fully removed, the functionality works correctly despite the deprecation warning.
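A sketch of the multi-block setup (column names are made up; I use the recommended df.ffill(axis=1) spelling here, since interpolate(method="ffill") is deprecated):

```python
import numpy as np
import pandas as pd

# Columns A-C are constructed together and share one internal block;
# D and E, assigned afterwards, each get their own block.
df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 2.0], "C": [3.0, np.nan]})
df["D"] = np.nan
df["E"] = [np.nan, 5.0]

# Filling forward across columns must cross block boundaries,
# e.g., row 0 propagates C's value through D into E.
out = df.ffill(axis=1)
```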
In pandas 2.0.3, constructing Series or DataFrame objects that contain complex nan values (e.g., complex("nan")) would raise an error. This was caused by a hard conversion from complex nan into float dtype in the Cython code. When designing the constructor, pandas did not really take complex nan values into consideration, thus treating all nan as float nan. I added a check for complex nan to avoid this invalid conversion. Starting from pandas 2.1.0, complex("nan") is allowed in Series and DataFrame, displayed as NaN+0.0j. Update: The implementation is correct but the claim about the display is not thorough enough. Please refer to my follow-up fix in pandas-dev/pandas#53844 or the corresponding item in the IO section.
In pandas 2.0.3, the step size of a RangeIndex object was not correctly negated when the index was subtracted from a constant. I made an easy fix that changes step to -step when the operation is rsub. From pandas 2.1.0, constant minus RangeIndex works correctly.
Resample
BUG: resampling empty series loses time zone from dtype #53736
In pandas 2.0.3, empty DataFrame or Series objects would lose time zone information when resampled. The reason for this bug was that the case of an empty DataFrame or Series was handled separately from the non-empty case, and in the former an empty DatetimeIndex was simply created with the specified freq and name without explicitly setting the dtype. The case is similar for PeriodIndex. I fixed this bug by explicitly passing in the correct dtype, so from pandas 2.1.0 the behavior is correct.
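A minimal sketch of the fixed behavior (the time zone is an arbitrary choice):

```python
import pandas as pd

# An empty Series with a tz-aware (but empty) DatetimeIndex.
s = pd.Series([], dtype="float64",
              index=pd.DatetimeIndex([], tz="Europe/Paris"))

# Resampling keeps the time zone in the resulting index dtype.
out = s.resample("D").mean()
```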
Reshaping
BUG: pd.concat dataframes with different datetime64 resolutions #53641
This is about a bug introduced during the development of pandas 2.1.0, so it does not exist in any actual release. In short, when concatenating DataFrame objects with different datetime64 resolutions (e.g. "datetime64[s]" versus "datetime64[ns]"), an extremely weird ValueError would be raised. This was because pandas internally implements a class method _concat_same_type for extension arrays and the like, some with a default of axis=0 and some not supporting the axis keyword at all. Consequently, the original code called the class method without specifying axis, which was fine in many cases but not for datetime arrays, which require axis=1. I noticed that an ea_compat_axis argument is available, such that the axis keyword is not supported if ea_compat_axis=True. Based on this observation, I modified the logic to explicitly pass the axis keyword whenever possible instead of always using the default value. This avoided a regression in the function pd.concat from pandas 2.0.3 to pandas 2.1.0. It even resolved some other problems that may not explicitly use pd.concat but involve implicit concatenations, e.g., using df.loc[n] = ... where df is a DataFrame object with only n rows.
BUG Merge not behaving correctly when having MultiIndex with a single level #53215
In pandas 2.0.3, the method DataFrame.merge did not behave correctly when encountering a MultiIndex with a single level. As an example, we create two DataFrame objects, the only difference being that df1 has an Index while df2 has a MultiIndex with a single level. Now create a (left) DataFrame to check the merging behavior in pandas 2.0.3. The merging behavior of df2 was clearly incorrect, and should be the same as that of df1. This was caused by the logic for deciding whether to internally use a multi-index indexer or a single-index indexer. In particular, pandas 2.0.3 checked the length of the join keys, incorrectly putting df2 into the case of a single-index indexer. ("A",) would then fail to match "A", leading to NaN values. I fixed this by instead checking for MultiIndex instances, so that df2 is put into the right case. From pandas 2.1.0, the merging behavior of DataFrame objects with a single-level MultiIndex is correct, e.g., the merging behavior of df2 agrees with that of df1.
ENH include the first incompatible key in error message when merging #51947
This is my first merged pull request to pandas! In pandas 2.0.3, the method DataFrame.merge raised an error when there were incompatible keys, but provided no information indicating which keys were the culprits. This becomes problematic when the number of columns is large, making it hard to identify the culprit columns and resolve the failure. I improved the error message by mentioning the name of the first incompatible key, which is included starting from pandas 2.1.0.
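A minimal sketch (column names and dtypes are made up): merging on a column whose dtypes disagree raises a ValueError, and with the improvement the message names the offending key.

```python
import pandas as pd

left = pd.DataFrame({"a": [1], "b": ["x"]})
right = pd.DataFrame({"a": [1], "b": [2]})  # "b" has an incompatible dtype

try:
    left.merge(right, on=["a", "b"])
    raised = False
except ValueError as err:
    raised = True
    message = str(err)  # mentions the first incompatible key, "b"
```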
Others
BUG: avoid DeprecationWarning when the Series has index as list of Series #55395
In pandas 2.1.3, the Series constructor would raise a deprecation warning if the index argument was a list of Series. This was caused by the logic validating the index argument, which implicitly looked for the _data attribute; however, _data has been deprecated. The original intention was to check for instances of list, ABCIndex, and ABCSeries while avoiding a circular import. I fixed this by importing ABCIndex and ABCSeries at runtime and performing the explicit check. From pandas 2.1.4, the above construction is clear of this warning.
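For instance (the values are made up), constructing a Series whose index is a list of Series should not emit a DeprecationWarning; here the warning is escalated to an error so the construction would fail if it fired:

```python
import warnings
import pandas as pd

with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)  # fail if the warning fires
    s = pd.Series([1, 2], index=[pd.Series(["a", "b"]), pd.Series([3, 4])])
```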
Maintenance Contributions
Items are sorted in reverse chronological order by the time of merge.
BUG fix deprecation of limit and fill_method in pct_change #55527
CI: add empty line in no-bool-in-generic to avoid black complaining #54855
API: add NaTType and NAType to pandas.api.typing #53958
CI: linting check to ensure lib.NoDefault is only used for typing #53901
CLN: use lib.no_default instead of lib.NoDefault in .pivot #53877
DEPR fill_method and limit keywords in pct_change #53520
DEPR Rename keyword "quantile" to "q" in Rolling.quantile #53216
FIX typo in deprecation message of deprecate_kwarg decorator #53218
Documentation Contributions
Items are sorted in reverse chronological order by the time of merge.