Fighting with computers

Computers are not always friendly.

Wednesday, December 02, 2009

PostgreSQL UTF-8 weirdness

I was recently moving data from a PostgreSQL 8.0 database to an 8.4 one on a new server. The database dump was made on a UTF-8 system and moved to another system using the same encoding. However, I was getting errors when trying to restore the data: several encoding errors were popping up. A closer inspection revealed that these were indeed encoding-rule violations. For some odd reason a few data fields ended up with bad data: some double-byte characters had the first byte missing (it was 0xc2 in my case).
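To make this kind of violation concrete: in UTF-8, a byte in the range 0x80 to 0xBF is a continuation byte, and it is only valid right after a lead byte (or after an earlier continuation byte of the same character). A rough way to count the damaged spots in a dump is sketched below. The function name count_orphan_continuations is made up for this illustration, and the check is deliberately simplified (it ignores overlong forms and other corner cases), so treat it as a sketch rather than a full validator.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the original filter: counts continuation
   bytes (0x80..0xBF) in buf that do not belong to a preceding lead byte,
   i.e. bytes that look like the tail of a character whose start was lost. */
size_t count_orphan_continuations(const unsigned char *buf, size_t len)
{
    size_t orphans = 0;
    size_t expected = 0;  /* continuation bytes still owed to the current character */

    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];
        if (b >= 0x80 && b <= 0xBF) {          /* continuation byte */
            if (expected > 0)
                expected--;                    /* belongs to the current character */
            else
                orphans++;                     /* nothing was expecting it */
        } else if (b >= 0xC2 && b <= 0xDF) {
            expected = 1;                      /* lead byte of a two-byte character */
        } else if (b >= 0xE0 && b <= 0xEF) {
            expected = 2;                      /* lead byte of a three-byte character */
        } else if (b >= 0xF0 && b <= 0xF4) {
            expected = 3;                      /* lead byte of a four-byte character */
        } else {
            expected = 0;                      /* ASCII, or a byte never valid in UTF-8 */
        }
    }
    return orphans;
}
```

For instance, the bytes 61 A9 62 (an 'a', a lone 0xA9, a 'b') give a count of 1, while 61 C2 A9 62 gives 0.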

I solved the problem by creating a small filter program to add the missing byte of these characters in the database dump. Not very nice, but it worked. Why the problem developed in the first place I do not know.


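#include <stdio.h>

/* Copy stdin to stdout, keeping only ASCII and complete two-byte UTF-8
   characters; every other byte is dropped. */
int main(void) {
    int a, c;
    while ((c = getchar()) != EOF) {
        if (c >= 0xe0) continue;              /* lead bytes of longer sequences: drop */
        if (c >= 0xc2) {                      /* lead byte of a two-byte character */
            a = getchar();
            if (a >= 0x80 && a <= 0xbf) {     /* valid continuation: keep the pair */
                putchar(c);
                putchar(a);
            } else if (a != EOF) {
                ungetc(a, stdin);             /* lone lead byte: drop it, re-read a */
            }
        } else if (c <= 0x7f) {
            putchar(c);                       /* plain ASCII passes through */
        }
        /* remaining bytes (0x80..0xc1) have no valid lead byte and are dropped */
    }
    return 0;
}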
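
The filter above sidesteps the damage by discarding every byte it cannot pair up. If the goal is instead to put the lost byte back, a variant might look like the sketch below. It rests on a strong assumption: that every orphaned continuation byte lost exactly a 0xC2 lead byte, as was the case in this dump, which will not hold in general. The name restore_c2 and the buffer-based interface are made up for the example.

```c
#include <stddef.h>

/* Hypothetical sketch: copies in[0..len) to out, inserting 0xC2 in front of
   every continuation byte (0x80..0xBF) that does not belong to a preceding
   lead byte. ASSUMPTION: the missing lead byte is always 0xC2.
   out must have room for 2 * len bytes. Returns the number of bytes written. */
size_t restore_c2(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    size_t expected = 0;  /* continuation bytes still owed to the current character */

    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b >= 0x80 && b <= 0xBF) {
            if (expected > 0)
                expected--;                /* tail of an intact character */
            else
                out[n++] = 0xC2;           /* orphan: re-insert the assumed lead byte */
        } else if (b >= 0xC2 && b <= 0xDF) {
            expected = 1;                  /* two-byte lead */
        } else if (b >= 0xE0 && b <= 0xEF) {
            expected = 2;                  /* three-byte lead */
        } else if (b >= 0xF0 && b <= 0xF4) {
            expected = 3;                  /* four-byte lead */
        } else {
            expected = 0;                  /* ASCII or a byte never valid in UTF-8 */
        }
        out[n++] = b;                      /* the input byte itself is always copied */
    }
    return n;
}
```

Fed the bytes 61 A9 62, this writes 61 C2 A9 62; input that is already well formed is copied through unchanged. Reading the dump into a buffer and writing the result back out is left to the caller.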

2 Comments:

  • At 8:52 am, Anonymous Anonymous said…

    Did you send this issue to the mailing list? Let the project find out what went wrong. If it's a bug, they can fix it.

    But how did you make your backup? Did you use the latest version of pg_dump to create the backup of your old database version? That's the way to go when upgrading.

     
  • At 9:00 am, Blogger Miguel Sánchez said…

    I still have to look up the mailing list to see if it is a known issue, but you are right about reporting the problem.

    However, since the data comes from unfiltered web forms through PHP and Apache, my first thought is that I cannot directly blame PostgreSQL (other than perhaps for accepting UTF-8 encoding violations as input and then returning them when queried).

    Regarding the backup, I created it with pg_dump. Please note it was not a software upgrade but a system upgrade. The old data is still running happily on the old server, while I used the dump to set up a new server with a more recent version of PostgreSQL.

    I'll report back in this entry if I find out anything relevant to this problem.

     
