{"id":131,"date":"2019-12-12T09:39:12","date_gmt":"2019-12-12T15:39:12","guid":{"rendered":"https:\/\/varunpramanik.com\/chronicles\/?p=131"},"modified":"2019-12-12T09:39:13","modified_gmt":"2019-12-12T15:39:13","slug":"pandas-to_csv-encoding-error-solution","status":"publish","type":"post","link":"https:\/\/varunpramanik.com\/chronicles\/2019\/12\/12\/pandas-to_csv-encoding-error-solution\/","title":{"rendered":"Pandas to_csv Encoding Error Solution"},"content":{"rendered":"<p>As these things typically go, last week I ran into an unusual error when using DataFrame.to_csv:<\/p>\n<blockquote><p><code>\/usr\/local\/lib\/python3.6\/dist-packages\/pandas\/io\/formats\/csvs.py in _save_chunk(self, start_i, end_i)<br \/>\n354 )<br \/>\n355<br \/>\n--&gt; 356 libwriters.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)<\/code><\/p>\n<p><code>pandas\/_libs\/writers.pyx in pandas._libs.writers.write_csv_rows()<\/code><\/p>\n<p><code>UnicodeEncodeError: 'utf-8' codec can't encode characters in position 31-32: surrogates not allowed<\/code><\/p><\/blockquote>\n<p>The error was unusual to me because I was using Pandas in a way I typically would, on data that should not have been meaningfully different in type from the data sets I\u2019ve used it on before. This was a real head-scratcher that no number of Stack Overflow answers, GitHub comments or blog posts seemed to offer a good answer to.<\/p>\n<p>With a lot of trial and error, it appeared the raw data itself was the problem, not any weird side effect of <em>re.sub<\/em> or other munging operations I was doing. In short, I needed to clean up the encodings for every field in the entire DataFrame. 
Here&#8217;s the solution, if you\u2019re in the same boat:<\/p>\n<blockquote><p><code><strong>new_df<\/strong> = <strong>original_df<\/strong>.applymap(lambda x: str(x).encode(\"utf-8\", errors=\"ignore\").decode(\"utf-8\", errors=\"ignore\"))<\/code><\/p><\/blockquote>\n<p>I fully expect this approach is imperfect and non-optimal: <code>str(x)<\/code> turns every value into a string, so numeric columns and missing values lose their original types, and <code>errors=\"ignore\"<\/code> silently drops any character that cannot be encoded, such as the lone surrogates behind this error. But it works. I&#8217;d be happy to <a href=\"https:\/\/twitter.com\/varunpramanik\" target=\"_blank\" rel=\"noopener noreferrer\">hear suggestions<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>Relevant reading:<\/p>\n<ol>\n<li><a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.applymap.html\" target=\"_blank\" rel=\"noopener noreferrer\">pandas.DataFrame.applymap<\/a><\/li>\n<li><a href=\"https:\/\/www.w3schools.com\/python\/ref_string_encode.asp\" target=\"_blank\" rel=\"noopener noreferrer\">String encode()<\/a><\/li>\n<li><a href=\"https:\/\/www.tutorialspoint.com\/python\/string_decode.htm\" target=\"_blank\" rel=\"noopener noreferrer\">String decode()<\/a><\/li>\n<li><a href=\"https:\/\/docs.python.org\/3\/library\/codecs.html#standard-encodings\" target=\"_blank\" rel=\"noopener noreferrer\">Python standard encodings<\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>As these things typically go, last week I ran into an unusual error when using DataFrame.to_csv: \/usr\/local\/lib\/python3.6\/dist-packages\/pandas\/io\/formats\/csvs.py in _save_chunk(self, start_i, end_i) 354 ) 355 &#8211;&gt; 356 libwriters.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer) pandas\/_libs\/writers.pyx in pandas._libs.writers.write_csv_rows() UnicodeEncodeError: &#8216;utf-8&#8217; codec can&#8217;t encode characters in position 31-32: surrogates not allowed The error was unusual to me because I was &hellip; <a href=\"https:\/\/varunpramanik.com\/chronicles\/2019\/12\/12\/pandas-to_csv-encoding-error-solution\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> 
&#8220;Pandas to_csv Encoding Error Solution&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[15],"tags":[41,6],"class_list":["post-131","post","type-post","status-publish","format-standard","hentry","category-code","tag-pandas","tag-python"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p7tP59-27","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/posts\/131","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/comments?post=131"}],"version-history":[{"count":1,"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/posts\/131\/revisions"}],"predecessor-version":[{"id":135,"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/posts\/131\/revisions\/135"}],"wp:attachment":[{"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/media?parent=131"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/varunpramanik.c
om\/chronicles\/wp-json\/wp\/v2\/categories?post=131"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/varunpramanik.com\/chronicles\/wp-json\/wp\/v2\/tags?post=131"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}