Pandas to_csv Encoding Error Solution

As these things typically go, last week I ran into an unusual error when using DataFrame.to_csv:

/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
354 )
355
--> 356 libwriters.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 31-32: surrogates not allowed

The error was unusual to me because I was using Pandas the way I typically do, on data that shouldn't have been meaningfully different in kind from the data sets I'd used it on before. This was a real head-scratcher that no number of Stack Overflow answers, GitHub comments, or blog posts seemed to offer a good answer to.
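For context, "surrogates not allowed" means a string contains a lone surrogate code point, which UTF-8 refuses to encode. One plausible way these sneak into otherwise normal data is decoding non-UTF-8 bytes with errors="surrogateescape" (used by some OS-level interfaces and file readers). A minimal sketch, with hypothetical sample bytes:

```python
# Latin-1 encoded "café" -- the trailing \xe9 is not valid UTF-8.
raw = b"caf\xe9"

# surrogateescape smuggles the bad byte through as a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")
print(repr(text))  # 'caf\udce9'

# Re-encoding that string is exactly what to_csv does under the hood,
# and it fails the same way.
try:
    text.encode("utf-8")
except UnicodeEncodeError as e:
    print(e)  # ... surrogates not allowed
```

The string looks fine when printed or inspected casually, which is part of why this is so hard to track down.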

After a lot of trial and error, it became apparent that the raw data itself was the problem, not some weird side effect of re.sub or the other munging operations I was doing. In short, I needed to clean up the encoding of every field in the entire DataFrame. Here's the solution, if you're in the same boat:

new_df = original_df.applymap(lambda x: str(x).encode("utf-8", errors="ignore").decode("utf-8", errors="ignore"))
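A small self-contained demonstration of the fix, using a hypothetical DataFrame with a lone surrogate ("\ud800") standing in for the problem data:

```python
import pandas as pd

# One clean row, one row containing a lone surrogate that utf-8 can't encode.
df = pd.DataFrame({"text": ["fine", "bad\ud800row"]})

# df.to_csv("out.csv") would raise UnicodeEncodeError here.

# Round-trip every cell through utf-8, silently dropping anything unencodable.
clean = df.applymap(
    lambda x: str(x).encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")
)
print(clean["text"].tolist())  # ['fine', 'badrow']
```

Two caveats worth knowing: errors="ignore" drops the bad characters silently (errors="replace" would leave a U+FFFD marker instead), and str(x) stringifies every cell, so NaN becomes the string "nan"; you may want to restrict the cleanup to object columns. In recent Pandas versions, applymap has also been deprecated in favor of DataFrame.map.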

I fully expect this approach is imperfect and suboptimal, but it works. I'd be happy to hear suggestions.


Relevant reading:

  1. pandas.DataFrame.applymap
  2. String encode()
  3. String decode()
  4. Python standard encodings