Hello, dear RG community.

Personally, I have found Xarray to be excruciatingly slow, especially for big datasets and nonstandard operations (like a custom filtering function). The only suggestion for speeding things up that I could find on the Internet was to use NumPy. When I adjusted my code accordingly (i.e., used NumPy), I laughed so hard, because I had to convert almost every single piece of Xarray-based code into NumPy-based code. Still, the remnants of the Xarray-based code kept slowing me down. I went ahead and wrote a crazy piece of code combining Dask, Xarray, and NumPy and finally got the speed up to an acceptable level. That was such a pain.
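For concreteness, here is a minimal sketch of the kind of conversion I mean: the same masked mean written once with Xarray operations and once by dropping to the raw NumPy array and wrapping the result back into a labeled array. The array contents and dimension names are made up for illustration.

```python
import numpy as np
import xarray as xr

# Toy labeled 3-D array (hypothetical data, just for illustration)
rng = np.random.default_rng(42)
da = xr.DataArray(rng.random((10, 20, 30)), dims=("time", "lat", "lon"))

# Pure-Xarray version: keeps labels, but every step carries Xarray overhead
slow = da.where(da > 0.5).mean(dim="time")

# NumPy version of the same computation: drop to the raw ndarray,
# compute, then re-wrap the result as a labeled DataArray
arr = da.values
masked = np.where(arr > 0.5, arr, np.nan)
fast = xr.DataArray(
    np.nanmean(masked, axis=0),  # axis 0 is the "time" dimension
    dims=("lat", "lon"),
)
```

The two results are numerically identical; the difference is only in how much per-operation overhead the labeled wrapper adds, which is what becomes noticeable on big arrays and custom element-wise logic.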

Pandas, of course, is essentially the same speed-wise. And I couldn't find anything else to handle named arrays in Python other than Xarray or Pandas (I work with multidimensional arrays, so I need Xarray anyway).
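To illustrate what "named multidimensional arrays" buy you, and why a 2-D Pandas DataFrame doesn't cover the same ground without MultiIndex workarounds, here is a toy labeled 3-D array; all names and values are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical 3-D temperature field with named, labeled dimensions
temps = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": [2020, 2021],
        "lat": [10.0, 20.0, 30.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
)

# Select by label along named dimensions; a DataFrame is inherently 2-D,
# so a third dimension would have to be faked with a MultiIndex
subset = temps.sel(time=2021, lat=20.0)  # 1-D array over "lon"
```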

I read the docs for Xarray. The authors say the reason for Xarray is to make it possible to work with labeled multidimensional arrays. I can't fully comprehend that. Why not just add this functionality to Pandas? I could understand starting such a big project for some big idea, but adding multidimensional functionality that would have been better added to Pandas, to spare users the time of learning two different libraries, does not seem like a good justification to me. To say nothing of the fact that Xarray has ended up being as slow as Pandas.

I think that a good justification for starting a new data-handling project for Python is to make it really fast, first and foremost. Such a project should follow NumPy's example: write the code base in lightning-fast C/C++ and then add Python wrappers on top of it.

I am wondering if anybody is aware of such an effort. If so, when should we expect the release?

Thank you in advance.

Ivan
