As these things typically go, last week I ran into an unusual error when using DataFrame.to_csv:
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
    354         )
    355
--> 356         libwriters.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 31-32: surrogates not allowed
The error was unusual to me because I was using pandas the way I typically would, on data that shouldn't have been meaningfully different in type from the data sets I'd used it on before. This was a real head-scratcher that no number of Stack Overflow answers, GitHub comments, or blog posts seemed to offer a good answer to.
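In hindsight, the cause is straightforward to demonstrate: a Python str is allowed to contain unpaired surrogate code points (often left behind by decoding bytes with errors="surrogateescape", or by malformed upstream data), but UTF-8 refuses to encode them. Here's a minimal sketch with made-up data:

import pandas as pd

# "\ud800" is a lone surrogate: perfectly legal inside a Python str,
# but impossible to encode as UTF-8.
df = pd.DataFrame({"text": ["fine", "broken \ud800 value"]})
df.to_csv("out.csv")  # raises UnicodeEncodeError: surrogates not allowed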
A lot of trial and error pointed to the same conclusion: the raw data itself was the problem, not some weird side effect of re.sub or the other munging operations I was doing. In short, I needed to clean up the encoding of every field in the entire DataFrame. Here's the solution, if you're in the same boat:
# Round-trip each cell through UTF-8, silently dropping anything (like lone surrogates) that can't be encoded.
new_df = original_df.applymap(lambda x: str(x).encode("utf-8", errors="ignore").decode("utf-8", errors="ignore"))
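One caveat: str(x) stringifies every cell, so NaN becomes the literal string "nan" and numeric columns come back as strings. A more surgical variant (my own sketch, with a hypothetical name) only rewrites string cells and leaves everything else untouched:

import pandas as pd

def strip_surrogates(df: pd.DataFrame) -> pd.DataFrame:
    # Round-trip only str cells; NaN, numbers, and other types pass through.
    return df.applymap(
        lambda v: v.encode("utf-8", errors="ignore").decode("utf-8")
        if isinstance(v, str)
        else v
    )

new_df = strip_surrogates(original_df)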
I fully expect this approach is imperfect and suboptimal, but it works. I'd be happy to hear suggestions.
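For what it's worth, newer pandas releases (1.1 and up, if I'm reading the docs right) also accept an errors argument on to_csv itself, forwarded to the underlying open() call, which might sidestep the cleanup pass entirely:

# pandas >= 1.1 only; "replace" substitutes unencodable characters instead of raising.
original_df.to_csv("out.csv", errors="replace")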
Relevant reading: