SQLAlchemy Unicode conundrum

I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)

My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.

Below I've included a minimal example that illustrates the problem.

utf8test.sql: Set up a database and create one row with a Unicode
character in it

utf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of an UTF character, and retrieve both rows.

It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).

Can anybody shed any light on this?

This creates the test DB:

dh@jenna:~/python$ cat utf8test.sql

DROP DATABASE IF EXISTS utftest;

CREATE DATABASE utftest;

USE utftest;

CREATE TABLE x (

    id INTEGER PRIMARY KEY AUTO_INCREMENT,

        text VARCHAR(10)

        );

INSERT INTO x(text) VALUES ('xxxü');

COMMIT;

dh@jenna:~/python$ mysql < utf8test.sql

Here's the Pyhton script:

dh@jenna:~/python$ cat utf8test.py

# -*- encoding: utf8 -*-



from sqlalchemy import create_engine, Column, Unicode, Integer

from sqlalchemy.orm import sessionmaker

from sqlalchemy.ext.declarative import declarative_base



Base = declarative_base()

class X(Base):

    __tablename__ = 'x'

    id = Column(Integer, primary_key=True)

    text = Column(Unicode(10))



engine = create_engine('mysql://localhost/utftest',

    encoding='utf8')

Base.metadata.create_all(engine)

Session = sessionmaker(engine)



db = Session()

x = X(text=u'yyyä')

db.add(x)

db.commit()



rs = db.query(X.text).all()

for r in rs:

    print(r.text)



db.close()

This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):

dh@jenna:~/python$ python utf8test.py

Traceback (most recent call last):

  File "utf8test.py", line 23, in <module>

      rs = db.query(X.text).all()

[...]

UnicodeDecodeError: 'utf8' codec can't decode

    byte 0xfc in position 3: invalid start byte

Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.

dh@jenna:~/python$ mysqldump utftest | hd

00000000  2d 2d 20 4d 79 53 51 4c  20 64 75 6d 70 20 31 30  |-- MySQL dump 10|

00000010  2e 31 36 20 20 44 69 73  74 72 69 62 20 31 30 2e  |.16  Distrib 10.|

00000020  31 2e 33 37 2d 4d 61 72  69 61 44 42 2c 20 66 6f  |1.37-MariaDB, fo|

00000030  72 20 64 65 62 69 61 6e  2d 6c 69 6e 75 78 2d 67  |r debian-linux-g|

00000040  6e 75 20 28 69 36 38 36  29 0a 2d 2d 0a 2d 2d 20  |nu (i686).--.-- |

[...]

00000520  4c 45 20 4b 45 59 53 20  2a 2f 3b 0a 49 4e 53 45  |LE KEYS */;.INSE|

00000530  52 54 20 49 4e 54 4f 20  60 78 60 20 56 41 4c 55  |RT INTO `x` VALU|

00000540  45 53 20 28 31 2c 27 78  78 78 c3 bc 27 29 2c 28  |ES (1,'xxx..'),(|

00000550  32 2c 27 79 79 79 c3 83  c2 a4 27 29 3b 0a 2f 2a  |2,'yyy....');./*|

asked Nov 26 '18 at 11:04

musbur

947

Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

– Ilja Everilä
Nov 26 '18 at 11:33

add a comment |

Below I've included a minimal example that illustrates the problem.

utf8test.sql: Set up a database and create one row with a Unicode
character in it

utf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of an UTF character, and retrieve both rows.

Can anybody shed any light on this?

This creates the test DB:

dh@jenna:~/python$ cat utf8test.sql

DROP DATABASE IF EXISTS utftest;

CREATE DATABASE utftest;

USE utftest;

CREATE TABLE x (

    id INTEGER PRIMARY KEY AUTO_INCREMENT,

        text VARCHAR(10)

        );

INSERT INTO x(text) VALUES ('xxxü');

COMMIT;

dh@jenna:~/python$ mysql < utf8test.sql

Here's the Pyhton script:

dh@jenna:~/python$ cat utf8test.py

# -*- encoding: utf8 -*-



from sqlalchemy import create_engine, Column, Unicode, Integer

from sqlalchemy.orm import sessionmaker

from sqlalchemy.ext.declarative import declarative_base



Base = declarative_base()

class X(Base):

    __tablename__ = 'x'

    id = Column(Integer, primary_key=True)

    text = Column(Unicode(10))



engine = create_engine('mysql://localhost/utftest',

    encoding='utf8')

Base.metadata.create_all(engine)

Session = sessionmaker(engine)



db = Session()

x = X(text=u'yyyä')

db.add(x)

db.commit()



rs = db.query(X.text).all()

for r in rs:

    print(r.text)



db.close()

This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):

dh@jenna:~/python$ python utf8test.py

Traceback (most recent call last):

  File "utf8test.py", line 23, in <module>

      rs = db.query(X.text).all()

[...]

UnicodeDecodeError: 'utf8' codec can't decode

    byte 0xfc in position 3: invalid start byte

Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.

dh@jenna:~/python$ mysqldump utftest | hd

00000000  2d 2d 20 4d 79 53 51 4c  20 64 75 6d 70 20 31 30  |-- MySQL dump 10|

00000010  2e 31 36 20 20 44 69 73  74 72 69 62 20 31 30 2e  |.16  Distrib 10.|

00000020  31 2e 33 37 2d 4d 61 72  69 61 44 42 2c 20 66 6f  |1.37-MariaDB, fo|

00000030  72 20 64 65 62 69 61 6e  2d 6c 69 6e 75 78 2d 67  |r debian-linux-g|

00000040  6e 75 20 28 69 36 38 36  29 0a 2d 2d 0a 2d 2d 20  |nu (i686).--.-- |

[...]

00000520  4c 45 20 4b 45 59 53 20  2a 2f 3b 0a 49 4e 53 45  |LE KEYS */;.INSE|

00000530  52 54 20 49 4e 54 4f 20  60 78 60 20 56 41 4c 55  |RT INTO `x` VALU|

00000540  45 53 20 28 31 2c 27 78  78 78 c3 bc 27 29 2c 28  |ES (1,'xxx..'),(|

00000550  32 2c 27 79 79 79 c3 83  c2 a4 27 29 3b 0a 2f 2a  |2,'yyy....');./*|

asked Nov 26 '18 at 11:04

musbur

947

Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

– Ilja Everilä
Nov 26 '18 at 11:33

add a comment |

Below I've included a minimal example that illustrates the problem.

utf8test.sql: Set up a database and create one row with a Unicode
character in it

utf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of an UTF character, and retrieve both rows.

Can anybody shed any light on this?

This creates the test DB:

dh@jenna:~/python$ cat utf8test.sql

DROP DATABASE IF EXISTS utftest;

CREATE DATABASE utftest;

USE utftest;

CREATE TABLE x (

    id INTEGER PRIMARY KEY AUTO_INCREMENT,

        text VARCHAR(10)

        );

INSERT INTO x(text) VALUES ('xxxü');

COMMIT;

dh@jenna:~/python$ mysql < utf8test.sql

Here's the Pyhton script:

dh@jenna:~/python$ cat utf8test.py

# -*- encoding: utf8 -*-



from sqlalchemy import create_engine, Column, Unicode, Integer

from sqlalchemy.orm import sessionmaker

from sqlalchemy.ext.declarative import declarative_base



Base = declarative_base()

class X(Base):

    __tablename__ = 'x'

    id = Column(Integer, primary_key=True)

    text = Column(Unicode(10))



engine = create_engine('mysql://localhost/utftest',

    encoding='utf8')

Base.metadata.create_all(engine)

Session = sessionmaker(engine)



db = Session()

x = X(text=u'yyyä')

db.add(x)

db.commit()



rs = db.query(X.text).all()

for r in rs:

    print(r.text)



db.close()

This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):

dh@jenna:~/python$ python utf8test.py

Traceback (most recent call last):

  File "utf8test.py", line 23, in <module>

      rs = db.query(X.text).all()

[...]

UnicodeDecodeError: 'utf8' codec can't decode

    byte 0xfc in position 3: invalid start byte

Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.

dh@jenna:~/python$ mysqldump utftest | hd

00000000  2d 2d 20 4d 79 53 51 4c  20 64 75 6d 70 20 31 30  |-- MySQL dump 10|

00000010  2e 31 36 20 20 44 69 73  74 72 69 62 20 31 30 2e  |.16  Distrib 10.|

00000020  31 2e 33 37 2d 4d 61 72  69 61 44 42 2c 20 66 6f  |1.37-MariaDB, fo|

00000030  72 20 64 65 62 69 61 6e  2d 6c 69 6e 75 78 2d 67  |r debian-linux-g|

00000040  6e 75 20 28 69 36 38 36  29 0a 2d 2d 0a 2d 2d 20  |nu (i686).--.-- |

[...]

00000520  4c 45 20 4b 45 59 53 20  2a 2f 3b 0a 49 4e 53 45  |LE KEYS */;.INSE|

00000530  52 54 20 49 4e 54 4f 20  60 78 60 20 56 41 4c 55  |RT INTO `x` VALU|

00000540  45 53 20 28 31 2c 27 78  78 78 c3 bc 27 29 2c 28  |ES (1,'xxx..'),(|

00000550  32 2c 27 79 79 79 c3 83  c2 a4 27 29 3b 0a 2f 2a  |2,'yyy....');./*|

asked Nov 26 '18 at 11:04

musbur

947

Below I've included a minimal example that illustrates the problem.

utf8test.sql: Set up a database and create one row with a Unicode
character in it

utf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of an UTF character, and retrieve both rows.

Can anybody shed any light on this?

This creates the test DB:

dh@jenna:~/python$ cat utf8test.sql

DROP DATABASE IF EXISTS utftest;

CREATE DATABASE utftest;

USE utftest;

CREATE TABLE x (

    id INTEGER PRIMARY KEY AUTO_INCREMENT,

        text VARCHAR(10)

        );

INSERT INTO x(text) VALUES ('xxxü');

COMMIT;

dh@jenna:~/python$ mysql < utf8test.sql

Here's the Pyhton script:

dh@jenna:~/python$ cat utf8test.py

# -*- encoding: utf8 -*-



from sqlalchemy import create_engine, Column, Unicode, Integer

from sqlalchemy.orm import sessionmaker

from sqlalchemy.ext.declarative import declarative_base



Base = declarative_base()

class X(Base):

    __tablename__ = 'x'

    id = Column(Integer, primary_key=True)

    text = Column(Unicode(10))



engine = create_engine('mysql://localhost/utftest',

    encoding='utf8')

Base.metadata.create_all(engine)

Session = sessionmaker(engine)



db = Session()

x = X(text=u'yyyä')

db.add(x)

db.commit()



rs = db.query(X.text).all()

for r in rs:

    print(r.text)



db.close()

This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):

dh@jenna:~/python$ python utf8test.py

Traceback (most recent call last):

  File "utf8test.py", line 23, in <module>

      rs = db.query(X.text).all()

[...]

UnicodeDecodeError: 'utf8' codec can't decode

    byte 0xfc in position 3: invalid start byte

Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.

dh@jenna:~/python$ mysqldump utftest | hd

00000000  2d 2d 20 4d 79 53 51 4c  20 64 75 6d 70 20 31 30  |-- MySQL dump 10|

00000010  2e 31 36 20 20 44 69 73  74 72 69 62 20 31 30 2e  |.16  Distrib 10.|

00000020  31 2e 33 37 2d 4d 61 72  69 61 44 42 2c 20 66 6f  |1.37-MariaDB, fo|

00000030  72 20 64 65 62 69 61 6e  2d 6c 69 6e 75 78 2d 67  |r debian-linux-g|

00000040  6e 75 20 28 69 36 38 36  29 0a 2d 2d 0a 2d 2d 20  |nu (i686).--.-- |

[...]

00000520  4c 45 20 4b 45 59 53 20  2a 2f 3b 0a 49 4e 53 45  |LE KEYS */;.INSE|

00000530  52 54 20 49 4e 54 4f 20  60 78 60 20 56 41 4c 55  |RT INTO `x` VALU|

00000540  45 53 20 28 31 2c 27 78  78 78 c3 bc 27 29 2c 28  |ES (1,'xxx..'),(|

00000550  32 2c 27 79 79 79 c3 83  c2 a4 27 29 3b 0a 2f 2a  |2,'yyy....');./*|

python mysql unicode sqlalchemy

asked Nov 26 '18 at 11:04

musbur

947

asked Nov 26 '18 at 11:04

musbur

947

asked Nov 26 '18 at 11:04

musbur

947

asked Nov 26 '18 at 11:04

musbur

947

asked Nov 26 '18 at 11:04

musbur

947

Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

– Ilja Everilä
Nov 26 '18 at 11:33

add a comment |

Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

– Ilja Everilä
Nov 26 '18 at 11:33

Don't know the whole story, but c3 83 c2 a4 is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')

– Ilja Everilä
Nov 26 '18 at 11:33

add a comment |

2 Answers
2

active

oldest

votes

c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here

http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.

Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python

But this is scary: I see c3 bc (Mojibake for ü) and c3 83 c2 a4 (double-encoding of ä. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.

Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.

Bottom line: You have multiple errors.

answered Nov 26 '18 at 23:09

Rick James

70.4k564104

The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

– musbur
Nov 27 '18 at 13:27

@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

– Rick James
Nov 27 '18 at 18:59

The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

– musbur
Nov 29 '18 at 4:48

confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

– musbur
Nov 29 '18 at 5:20

@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

– Rick James
Nov 29 '18 at 21:43

add a comment |

Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.

answered Nov 30 '18 at 14:55

musbur

947

add a comment |

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53479763%2fsqlalchemy-unicode-conundrum%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here

http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.

Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python

Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.

Bottom line: You have multiple errors.

answered Nov 26 '18 at 23:09

Rick James

70.4k564104

The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

– musbur
Nov 27 '18 at 13:27

@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

– Rick James
Nov 27 '18 at 18:59

The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

– musbur
Nov 29 '18 at 4:48

confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

– musbur
Nov 29 '18 at 5:20

@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

– Rick James
Nov 29 '18 at 21:43

add a comment |

c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here

http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.

Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python

Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.

Bottom line: You have multiple errors.

answered Nov 26 '18 at 23:09

Rick James

70.4k564104

The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

– musbur
Nov 27 '18 at 13:27

@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

– Rick James
Nov 27 '18 at 18:59

The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

– musbur
Nov 29 '18 at 4:48

confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

– musbur
Nov 29 '18 at 5:20

@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

– Rick James
Nov 29 '18 at 21:43

add a comment |

c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here

http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.

Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python

Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.

Bottom line: You have multiple errors.

answered Nov 26 '18 at 23:09

Rick James

70.4k564104

c3 83 c2 a4 is the "double encoding" for ä. as Ilja points out. It is discussed further here

http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE to fix the data.

Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python

Possibly the only issue is the absence of # -*- encoding: utf8 -*- from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.

Bottom line: You have multiple errors.

answered Nov 26 '18 at 23:09

Rick James

70.4k564104

answered Nov 26 '18 at 23:09

Rick James

70.4k564104

answered Nov 26 '18 at 23:09

Rick James

70.4k564104

answered Nov 26 '18 at 23:09

Rick James

70.4k564104

The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

– musbur
Nov 27 '18 at 13:27

@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

– Rick James
Nov 27 '18 at 18:59

The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

– musbur
Nov 29 '18 at 4:48

confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

– musbur
Nov 29 '18 at 5:20

@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

– Rick James
Nov 29 '18 at 21:43

add a comment |

The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

– musbur
Nov 27 '18 at 13:27

@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

– Rick James
Nov 27 '18 at 18:59

The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

– musbur
Nov 29 '18 at 4:48

confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

– musbur
Nov 29 '18 at 5:20

@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

– Rick James
Nov 29 '18 at 21:43

The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.

– musbur
Nov 27 '18 at 13:27

@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.

– Rick James
Nov 27 '18 at 18:59

The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy

– musbur
Nov 29 '18 at 4:48

confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.

– musbur
Nov 29 '18 at 5:20

@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.

– Rick James
Nov 29 '18 at 21:43

add a comment |

Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.

answered Nov 30 '18 at 14:55

musbur

947

add a comment |

Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.

answered Nov 30 '18 at 14:55

musbur

947

add a comment |

Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.

answered Nov 30 '18 at 14:55

musbur

947

Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.

answered Nov 30 '18 at 14:55

musbur

947

answered Nov 30 '18 at 14:55

musbur

947

answered Nov 30 '18 at 14:55

musbur

947

answered Nov 30 '18 at 14:55

musbur

947

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Ytukyg