SQLAlchemy Unicode conundrum












2















I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)



My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.



Below I've included a minimal example that illustrates the problem.




  • utf8test.sql: Set up a database and create one row with a Unicode
    character in it


  • utf8test.py: Open DB using SQLAlchemy, insert 1 row with
    Python's idea of an UTF character, and retrieve both rows.



It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).



Can anybody shed any light on this?



This creates the test DB:



dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql


Here's the Pyhton script:



dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-

from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))

engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)

db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()

rs = db.query(X.text).all()
for r in rs:
print(r.text)

db.close()


This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):



dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte


Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.



dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|









share|improve this question























  • Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

    – Ilja Everilä
    Nov 26 '18 at 11:33


















2















I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)



My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.



Below I've included a minimal example that illustrates the problem.




  • utf8test.sql: Set up a database and create one row with a Unicode
    character in it


  • utf8test.py: Open DB using SQLAlchemy, insert 1 row with
    Python's idea of an UTF character, and retrieve both rows.



It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).



Can anybody shed any light on this?



This creates the test DB:



dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql


Here's the Pyhton script:



dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-

from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))

engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)

db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()

rs = db.query(X.text).all()
for r in rs:
print(r.text)

db.close()


This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):



dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte


Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.



dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|









share|improve this question























  • Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

    – Ilja Everilä
    Nov 26 '18 at 11:33
















2












2








2








I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)



My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.



Below I've included a minimal example that illustrates the problem.




  • utf8test.sql: Set up a database and create one row with a Unicode
    character in it


  • utf8test.py: Open DB using SQLAlchemy, insert 1 row with
    Python's idea of an UTF character, and retrieve both rows.



It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).



Can anybody shed any light on this?



This creates the test DB:



dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql


Here's the Pyhton script:



dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-

from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))

engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)

db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()

rs = db.query(X.text).all()
for r in rs:
print(r.text)

db.close()


This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):



dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte


Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.



dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|









share|improve this question














I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)



My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.



Below I've included a minimal example that illustrates the problem.




  • utf8test.sql: Set up a database and create one row with a Unicode
    character in it


  • utf8test.py: Open DB using SQLAlchemy, insert 1 row with
    Python's idea of an UTF character, and retrieve both rows.



It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).



Can anybody shed any light on this?



This creates the test DB:



dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql


Here's the Pyhton script:



dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-

from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))

engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)

db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()

rs = db.query(X.text).all()
for r in rs:
print(r.text)

db.close()


This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):



dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte


Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.



dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|






python mysql unicode sqlalchemy






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 26 '18 at 11:04









musburmusbur

947




947













  • Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

    – Ilja Everilä
    Nov 26 '18 at 11:33





















  • Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

    – Ilja Everilä
    Nov 26 '18 at 11:33



















Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

– Ilja Everilä
Nov 26 '18 at 11:33







Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

– Ilja Everilä
Nov 26 '18 at 11:33














2 Answers
2






active

oldest

votes


















0














c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here



http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.



Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python



But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.



Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.



Bottom line: You have multiple errors.






share|improve this answer
























  • The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

    – musbur
    Nov 27 '18 at 13:27











  • @musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

    – Rick James
    Nov 27 '18 at 18:59











  • The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

    – musbur
    Nov 29 '18 at 4:48











  • confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

    – musbur
    Nov 29 '18 at 5:20











  • @musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

    – Rick James
    Nov 29 '18 at 21:43



















0














Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.






share|improve this answer
























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53479763%2fsqlalchemy-unicode-conundrum%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here



    http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.



    Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python



    But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.



    Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.



    Bottom line: You have multiple errors.






    share|improve this answer
























    • The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

      – musbur
      Nov 27 '18 at 13:27











    • @musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

      – Rick James
      Nov 27 '18 at 18:59











    • The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

      – musbur
      Nov 29 '18 at 4:48











    • confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

      – musbur
      Nov 29 '18 at 5:20











    • @musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

      – Rick James
      Nov 29 '18 at 21:43
















    0














    c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here



    http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.



    Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python



    But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.



    Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.



    Bottom line: You have multiple errors.






    share|improve this answer
























    • The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

      – musbur
      Nov 27 '18 at 13:27











    • @musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

      – Rick James
      Nov 27 '18 at 18:59











    • The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

      – musbur
      Nov 29 '18 at 4:48











    • confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

      – musbur
      Nov 29 '18 at 5:20











    • @musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

      – Rick James
      Nov 29 '18 at 21:43














    0












    0








    0







    c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here



    http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.



    Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python



    But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.



    Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.



    Bottom line: You have multiple errors.






    share|improve this answer













    c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here



    http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.



    Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python



    But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.



    Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.



    Bottom line: You have multiple errors.







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Nov 26 '18 at 23:09









    Rick JamesRick James

    70.4k564104




    70.4k564104













    • The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

      – musbur
      Nov 27 '18 at 13:27











    • @musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

      – Rick James
      Nov 27 '18 at 18:59











    • The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

      – musbur
      Nov 29 '18 at 4:48











    • confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

      – musbur
      Nov 29 '18 at 5:20











    • @musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

      – Rick James
      Nov 29 '18 at 21:43



















    • The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

      – musbur
      Nov 27 '18 at 13:27











    • @musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

      – Rick James
      Nov 27 '18 at 18:59











    • The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

      – musbur
      Nov 29 '18 at 4:48











    • confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

      – musbur
      Nov 29 '18 at 5:20











    • @musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

      – Rick James
      Nov 29 '18 at 21:43

















    The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

    – musbur
    Nov 27 '18 at 13:27





    The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

    – musbur
    Nov 27 '18 at 13:27













    @musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

    – Rick James
    Nov 27 '18 at 18:59





    @musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

    – Rick James
    Nov 27 '18 at 18:59













    The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

    – musbur
    Nov 29 '18 at 4:48





    The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

    – musbur
    Nov 29 '18 at 4:48













    confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

    – musbur
    Nov 29 '18 at 5:20





    confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

    – musbur
    Nov 29 '18 at 5:20













    @musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

    – Rick James
    Nov 29 '18 at 21:43





    @musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

    – Rick James
    Nov 29 '18 at 21:43













    0














    Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.






    share|improve this answer




























      0














      Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.






      share|improve this answer


























        0












        0








        0







        Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.






        share|improve this answer













        Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 30 '18 at 14:55









        musburmusbur

        947




        947






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53479763%2fsqlalchemy-unicode-conundrum%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Wiesbaden

            Marschland

            Dieringhausen