SQLAlchemy Unicode conundrum
I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)
My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.
Below I've included a minimal example that illustrates the problem.
utf8test.sql: Set up a database and create one row with a Unicode
character in itutf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of an UTF character, and retrieve both rows.
It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).
Can anybody shed any light on this?
This creates the test DB:
dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql
Here's the Pyhton script:
dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-
from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))
engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)
db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()
rs = db.query(X.text).all()
for r in rs:
print(r.text)
db.close()
This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):
dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte
Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.
dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|
python mysql unicode sqlalchemy
add a comment |
I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)
My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.
Below I've included a minimal example that illustrates the problem.
utf8test.sql: Set up a database and create one row with a Unicode
character in itutf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of an UTF character, and retrieve both rows.
It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).
Can anybody shed any light on this?
This creates the test DB:
dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql
Here's the Pyhton script:
dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-
from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))
engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)
db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()
rs = db.query(X.text).all()
for r in rs:
print(r.text)
db.close()
This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):
dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte
Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.
dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|
python mysql unicode sqlalchemy
Don't know the whole story, butc3 83 c2 a4
is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8:'ä'.encode('utf-8').decode('latin-1').encode('utf-8')
– Ilja Everilä
Nov 26 '18 at 11:33
add a comment |
I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)
My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.
Below I've included a minimal example that illustrates the problem.
utf8test.sql: Set up a database and create one row with a Unicode
character in itutf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of an UTF character, and retrieve both rows.
It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).
Can anybody shed any light on this?
This creates the test DB:
dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql
Here's the Pyhton script:
dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-
from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))
engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)
db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()
rs = db.query(X.text).all()
for r in rs:
print(r.text)
db.close()
This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):
dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte
Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.
dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|
python mysql unicode sqlalchemy
I'm having a weird problem regarding Unicode handling with SQLAlchemy.
In short, when I insert a Python unicode string into an Unicode column
of my MySQL database, I have no trouble getting it back out. On the database
side, however, it gets stored as a weird 4-byte sequence (and no, this
doesn't seem to have anything to do with the 'utf8mb4' default on
MySQL)
My problem is that I have a MySQL dump from another machine that
contains straight UTF8 characters in the SQL. When I try to retrieve
data imported from that other machine I get UnicodeDecodeErrors all the
time.
Below I've included a minimal example that illustrates the problem.
utf8test.sql: Set up a database and create one row with a Unicode
character in itutf8test.py: Open DB using SQLAlchemy, insert 1 row with
Python's idea of an UTF character, and retrieve both rows.
It turns out that Python can retrieve the data it inserted itself fine,
but it balks at the literal 'ä' I put into the SQL import script.
Investigation of the hexdumps of both an mysqldumped dataset
and the binary data files of MySQL itself shows that the UTF character
inserted via SQL is the real deal (German umlaut 'ä' = UTF 'c3 bc'),
whereas the Python-inserted 'ä' gets converted to the sequence
'c3 83 c2 a4' which I don't understand (see hexdump down below;
I've used 'xxx' and 'yyy' as markers to faciliate finding them
in the hexdump).
Can anybody shed any light on this?
This creates the test DB:
dh@jenna:~/python$ cat utf8test.sql
DROP DATABASE IF EXISTS utftest;
CREATE DATABASE utftest;
USE utftest;
CREATE TABLE x (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
text VARCHAR(10)
);
INSERT INTO x(text) VALUES ('xxxü');
COMMIT;
dh@jenna:~/python$ mysql < utf8test.sql
Here's the Pyhton script:
dh@jenna:~/python$ cat utf8test.py
# -*- encoding: utf8 -*-
from sqlalchemy import create_engine, Column, Unicode, Integer
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class X(Base):
__tablename__ = 'x'
id = Column(Integer, primary_key=True)
text = Column(Unicode(10))
engine = create_engine('mysql://localhost/utftest',
encoding='utf8')
Base.metadata.create_all(engine)
Session = sessionmaker(engine)
db = Session()
x = X(text=u'yyyä')
db.add(x)
db.commit()
rs = db.query(X.text).all()
for r in rs:
print(r.text)
db.close()
This happens when I run the script (runs without error when I
omit the INSERT INTO bit in utf8test.sql):
dh@jenna:~/python$ python utf8test.py
Traceback (most recent call last):
File "utf8test.py", line 23, in <module>
rs = db.query(X.text).all()
[...]
UnicodeDecodeError: 'utf8' codec can't decode
byte 0xfc in position 3: invalid start byte
Here's a hexdump to confirm that the two ä's are indeed stored
differently in the DB. Using hd I've also conformed that both the
Python as well as the SQL scripts are indeed UTF.
dh@jenna:~/python$ mysqldump utftest | hd
00000000 2d 2d 20 4d 79 53 51 4c 20 64 75 6d 70 20 31 30 |-- MySQL dump 10|
00000010 2e 31 36 20 20 44 69 73 74 72 69 62 20 31 30 2e |.16 Distrib 10.|
00000020 31 2e 33 37 2d 4d 61 72 69 61 44 42 2c 20 66 6f |1.37-MariaDB, fo|
00000030 72 20 64 65 62 69 61 6e 2d 6c 69 6e 75 78 2d 67 |r debian-linux-g|
00000040 6e 75 20 28 69 36 38 36 29 0a 2d 2d 0a 2d 2d 20 |nu (i686).--.-- |
[...]
00000520 4c 45 20 4b 45 59 53 20 2a 2f 3b 0a 49 4e 53 45 |LE KEYS */;.INSE|
00000530 52 54 20 49 4e 54 4f 20 60 78 60 20 56 41 4c 55 |RT INTO `x` VALU|
00000540 45 53 20 28 31 2c 27 78 78 78 c3 bc 27 29 2c 28 |ES (1,'xxx..'),(|
00000550 32 2c 27 79 79 79 c3 83 c2 a4 27 29 3b 0a 2f 2a |2,'yyy....');./*|
python mysql unicode sqlalchemy
python mysql unicode sqlalchemy
asked Nov 26 '18 at 11:04
musburmusbur
947
947
Don't know the whole story, butc3 83 c2 a4
is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8:'ä'.encode('utf-8').decode('latin-1').encode('utf-8')
– Ilja Everilä
Nov 26 '18 at 11:33
add a comment |
Don't know the whole story, butc3 83 c2 a4
is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8:'ä'.encode('utf-8').decode('latin-1').encode('utf-8')
– Ilja Everilä
Nov 26 '18 at 11:33
Don't know the whole story, but
c3 83 c2 a4
is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')
– Ilja Everilä
Nov 26 '18 at 11:33
Don't know the whole story, but
c3 83 c2 a4
is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8: 'ä'.encode('utf-8').decode('latin-1').encode('utf-8')
– Ilja Everilä
Nov 26 '18 at 11:33
add a comment |
2 Answers
2
active
oldest
votes
c3 83 c2 a4
is the "double encoding" for ä
. as Ilja points out. It is discussed further here
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE
to fix the data.
Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python
But this is scary: I see c3 bc
(Mojibake for ü
) and c3 83 c2 a4
(double-encoding of ä
. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.
Possibly the only issue is the absence of # -*- encoding: utf8 -*-
from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.
Bottom line: You have multiple errors.
The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.
– musbur
Nov 27 '18 at 13:27
@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.
– Rick James
Nov 27 '18 at 18:59
The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy
– musbur
Nov 29 '18 at 4:48
confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.
– musbur
Nov 29 '18 at 5:20
@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.
– Rick James
Nov 29 '18 at 21:43
add a comment |
Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.
add a comment |
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53479763%2fsqlalchemy-unicode-conundrum%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
c3 83 c2 a4
is the "double encoding" for ä
. as Ilja points out. It is discussed further here
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE
to fix the data.
Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python
But this is scary: I see c3 bc
(Mojibake for ü
) and c3 83 c2 a4
(double-encoding of ä
. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.
Possibly the only issue is the absence of # -*- encoding: utf8 -*-
from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.
Bottom line: You have multiple errors.
The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.
– musbur
Nov 27 '18 at 13:27
@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.
– Rick James
Nov 27 '18 at 18:59
The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy
– musbur
Nov 29 '18 at 4:48
confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.
– musbur
Nov 29 '18 at 5:20
@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.
– Rick James
Nov 29 '18 at 21:43
add a comment |
c3 83 c2 a4
is the "double encoding" for ä
. as Ilja points out. It is discussed further here
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE
to fix the data.
Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python
But this is scary: I see c3 bc
(Mojibake for ü
) and c3 83 c2 a4
(double-encoding of ä
. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.
Possibly the only issue is the absence of # -*- encoding: utf8 -*-
from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.
Bottom line: You have multiple errors.
The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.
– musbur
Nov 27 '18 at 13:27
@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.
– Rick James
Nov 27 '18 at 18:59
The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy
– musbur
Nov 29 '18 at 4:48
confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.
– musbur
Nov 29 '18 at 5:20
@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.
– Rick James
Nov 29 '18 at 21:43
add a comment |
c3 83 c2 a4
is the "double encoding" for ä
. as Ilja points out. It is discussed further here
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE
to fix the data.
Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python
But this is scary: I see c3 bc
(Mojibake for ü
) and c3 83 c2 a4
(double-encoding of ä
. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.
Possibly the only issue is the absence of # -*- encoding: utf8 -*-
from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.
Bottom line: You have multiple errors.
c3 83 c2 a4
is the "double encoding" for ä
. as Ilja points out. It is discussed further here
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases provides an UPDATE
to fix the data.
Here is a checklist of things that may need to be fixed in your Python: http://mysql.rjweb.org/doc.php/charcoll#python
But this is scary: I see c3 bc
(Mojibake for ü
) and c3 83 c2 a4
(double-encoding of ä
. This implies that you have two different problems happening in the same code. Back up to ground zero, make sure you are using utf8 (or utf8mb4) at all stages of things. Your database may be too messed up to recover from, so consider starting over.
Possibly the only issue is the absence of # -*- encoding: utf8 -*-
from one of the python scripts. But, no. You do need that, yet the double-encoding occurred when you used it.
Bottom line: You have multiple errors.
answered Nov 26 '18 at 23:09
Rick JamesRick James
70.4k564104
70.4k564104
The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.
– musbur
Nov 27 '18 at 13:27
@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.
– Rick James
Nov 27 '18 at 18:59
The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy
– musbur
Nov 29 '18 at 4:48
confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.
– musbur
Nov 29 '18 at 5:20
@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.
– Rick James
Nov 29 '18 at 21:43
add a comment |
The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.
– musbur
Nov 27 '18 at 13:27
@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.
– Rick James
Nov 27 '18 at 18:59
The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy
– musbur
Nov 29 '18 at 4:48
confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.
– musbur
Nov 29 '18 at 5:20
@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.
– Rick James
Nov 29 '18 at 21:43
The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.
– musbur
Nov 27 '18 at 13:27
The "scary" bit in fact isn't scary, if you look at my original post you see that I indeed typed 'xxxü' and 'yyyä' which, fortunately, is just confusing. Of course I had meant to use 'ä' both times.
– musbur
Nov 27 '18 at 13:27
@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.
– Rick James
Nov 27 '18 at 18:59
@musbur - Is one of them from the normal processing, and the other was a manual test? If so, let's focus on the former.
– Rick James
Nov 27 '18 at 18:59
The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy
– musbur
Nov 29 '18 at 4:48
The 'xxxü' I typed directly into the SQL code, the 'yyyä' was inserted (and correctly retrieved) by Python / SQLAlchemy
– musbur
Nov 29 '18 at 4:48
confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.
– musbur
Nov 29 '18 at 5:20
confirmed that my data gets double-encoded on entering the DB (as evidenced by the appearance of double-encoded sequences in the internal DB data as well as SQL dumps), and it gets double-decoded when read back into Python by SQLAlchemy. From what I can see, everything is set correctly to utf8(mb4) on the database side. When I declare the "text" column as String in the SQLAlchemy model, the 'ä' goes into the DB as utf8 and also comes out correctly, but Python doesn't recognize it as UTF8 any more, leading to problems later.
– musbur
Nov 29 '18 at 5:20
@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.
– Rick James
Nov 29 '18 at 21:43
@musbur - Since there are about 5 places where utf8 needs to be specified, it is all too easy to focus on the parts that were done correctly and fail to find the other places.
– Rick James
Nov 29 '18 at 21:43
add a comment |
Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.
add a comment |
Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.
add a comment |
Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.
Adding ?use_utf8=0 to the DB URL solves the problem. Found that in the SQLAlchemy docs.
answered Nov 30 '18 at 14:55
musburmusbur
947
947
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53479763%2fsqlalchemy-unicode-conundrum%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Don't know the whole story, but
c3 83 c2 a4
is UTF-8 bytes of "ä" decoded as latin-1 and re-encoded as UTF-8:'ä'.encode('utf-8').decode('latin-1').encode('utf-8')
– Ilja Everilä
Nov 26 '18 at 11:33