As these things typically go, last week I ran into an unusual error when using DataFrame.to_csv:
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
    354         )
    355
--> 356         libwriters.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 31-32: surrogates not allowed
The error was unusual to me because I was using pandas the way I typically would, on data that shouldn't have been meaningfully different in type from the data sets I'd used it on before. This was a real head-scratcher that no number of Stack Overflow answers, GitHub comments, or blog posts seemed to offer a good answer to.
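In hindsight, the cause is straightforward to demonstrate: a Python str is allowed to contain unpaired surrogate code points (often left behind by decoding bytes with errors="surrogateescape", or by malformed upstream data), but UTF-8 refuses to encode them. Here's a minimal sketch with made-up data:

import pandas as pd

# "\ud800" is a lone surrogate: perfectly legal inside a Python str,
# but impossible to encode as UTF-8.
df = pd.DataFrame({"text": ["fine", "broken \ud800 value"]})
df.to_csv("out.csv")  # raises UnicodeEncodeError: surrogates not allowed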
A lot of trial and error pointed to the same conclusion: the raw data itself was the problem, not some weird side effect of re.sub or the other munging operations I was doing. In short, I needed to clean up the encoding of every field in the entire DataFrame. Here's the solution, if you're in the same boat:
# Round-trip each cell through UTF-8, silently dropping anything (like lone surrogates) that can't be encoded.
new_df = original_df.applymap(lambda x: str(x).encode("utf-8", errors="ignore").decode("utf-8", errors="ignore"))
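One caveat: str(x) stringifies every cell, so NaN becomes the literal string "nan" and numeric columns come back as strings. A more surgical variant (my own sketch, with a hypothetical name) only rewrites string cells and leaves everything else untouched:

import pandas as pd

def strip_surrogates(df: pd.DataFrame) -> pd.DataFrame:
    # Round-trip only str cells; NaN, numbers, and other types pass through.
    return df.applymap(
        lambda v: v.encode("utf-8", errors="ignore").decode("utf-8")
        if isinstance(v, str)
        else v
    )

new_df = strip_surrogates(original_df)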
I fully expect this approach is imperfect and suboptimal, but it works. I'd be happy to hear suggestions.
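For what it's worth, newer pandas releases (1.1 and up, if I'm reading the docs right) also accept an errors argument on to_csv itself, forwarded to the underlying open() call, which might sidestep the cleanup pass entirely:

# pandas >= 1.1 only; "replace" substitutes unencodable characters instead of raising.
original_df.to_csv("out.csv", errors="replace")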
Relevant reading: