LANG setting for MS CP 1252...

syd_p...
Posted: Wed Aug 05, 2009 12:20 pm
Guest
Hello,
I have an application running on CentOS 3.8 which fetches some data
from an MS SQL Server db and writes it to disk.
However, there are some special characters (u with 2 dots overhead,
for example) in the data which appear as ? in the Linux file created.
I am told the database uses CP 1252, which means the u with 2 dots
overhead is character 252.
The output of locale is:
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
What can I do to fix this problem please?
Syd

Nico Kadel-Garcia...
Posted: Wed Aug 05, 2009 4:51 pm
Guest
On Aug 5, 6:20 pm, syd_p <sydneypue... at (no spam) yahoo.com> wrote:
Quote: Hello,
I have application running on centos 3.8 which brings back some data
from a MS SQL server db, and writes it to disk.
OK. Stop right there. *WHY* are you using a 6 year old operating
system for anything you care about? Seriously, if at all possible,
update to CentOS 4.7 at a minimum, preferably 5.3. You'll get much
better international language support.
Quote: However there are some special characters ( u with 2 dots overhead,
for example) in the data which appear as ? in the linux file created.
I am told the database uses CP 1252, which means the u with 2 dots
overhead,is character 252,
The output of locale is;
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
What can I do to fix this problem please?
Well, it depends. The strings you are handling are not 7-bit ASCII
text, which is what the 'C' locale is generally for; they're
effectively binary data. Treat them as such. If you need them to be
visible, consider setting your LANG and other settings to German or
whatever language with umlauts they were originally written in.
What are you passing this data to? Is it possible that your viewer for
the Linux text file is simply mishandling the generated non-English
character set?

Bill Marcum...
Posted: Thu Sep 24, 2009 1:06 pm
Guest
On 2009-09-24, Marcel Bruinsma <mb at (no spam) nomail.afraid.org> wrote:
Quote: On Thursday 24 September 2009 11:57, syd_p wrote:
Based on the example I mentioned using LANG=de would be
a possible solution.
No, the default CTYPE for de is ISO-8859-1.
CP1252 is a superset of ISO-8859-1. The accented letters are the same.
CP1252 has additional punctuation marks and the euro and trademark
symbols, among other things (code values 128-159, which hold no
printable characters in the ISO-8859-* character sets).
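A minimal sketch of that relationship, assuming an iconv that knows both charsets (glibc's does) and a POSIX od. The helper name to_utf8_hex is made up for this illustration.

```shell
# Hypothetical helper: decode one byte (given as a printf octal escape)
# from charset $1 and print the resulting UTF-8 bytes as hex.
to_utf8_hex() {
    printf "$2" | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# Byte 252 (octal 374) is u-umlaut in both charsets.
to_utf8_hex CP1252     '\374'; echo    # c3bc
to_utf8_hex ISO-8859-1 '\374'; echo    # c3bc

# Byte 147 (octal 223) is a curly opening quote only in CP1252;
# ISO-8859-1 maps it to the C1 control code U+0093.
to_utf8_hex CP1252     '\223'; echo    # e2809c
to_utf8_hex ISO-8859-1 '\223'; echo    # c293
```

So text that only uses accented letters decodes identically under either name, and differences show up only for bytes 128-159.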

Marcel Bruinsma...
Posted: Thu Sep 24, 2009 3:41 pm
Guest
On Thursday 24 September 2009 11:57, syd_p wrote:
Quote: Based on the example I mentioned using LANG=de would be
a possible solution.
No, the default CTYPE for de is ISO-8859-1.
Quote: But we are seeing French, Spanish and German "special"
characters which are supported by MS's CP 1252.
Check if your libc supports CP1252 :
$ locale -m | grep '^CP'
CP10007
CP1125
CP1250
CP1251
CP1252
CP1253
CP1254
CP1255
CP1256
CP1257
CP1258
CP737
CP775
CP949
If it does: LANG=en_US.CP1252
Of course, you can replace "en_US" with whatever you prefer.
The important part here is the ".CP1252", which defines the
locale's character set (and encoding). This is independent
of language (the "en") and region (the "_US").
--
printf -v email $(echo \ 155 141 162 143 145 154 142 162 165 151 \
156 163 155 141 100 171 141 150 157 157 056 143 157 155|tr \ \\\\)
# Live every life as if it were your last! #
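To illustrate the three parts of a locale name, here is a small sketch using POSIX parameter expansion. It assumes the name has the full language_REGION.CHARSET form (no "@modifier" suffix); the function name split_locale is hypothetical.

```shell
# Split a locale name of the form language_REGION.CHARSET into its parts.
split_locale() {
    name=$1
    charset=${name#*.}            # text after the first dot
    lang_region=${name%%.*}       # text before the first dot
    language=${lang_region%%_*}   # text before the underscore
    region=${lang_region#*_}      # text after the underscore
    printf 'language=%s region=%s charset=%s\n' "$language" "$region" "$charset"
}

split_locale en_US.CP1252
split_locale de_DE.ISO-8859-15
```

To see which charset the current settings actually resolve to, `locale charmap` prints it directly.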

Marcel Bruinsma...
Posted: Sun Sep 27, 2009 6:04 pm
Guest
On Sunday, 27 September 2009 23:40, syd_p wrote:
Quote: → printf '“„”\n' | iconv -tlatin1 | iconv -flatin1
iconv: illegal input sequence at position 0
→ printf '“„”\n' | iconv -tcp1252 | iconv -fcp1252
“„”
Not quite sure how you did the printf above tho.
The three quotes above are actually encoded in UTF-8,
because that is what my terminal understands.
The first iconv on the second printf line converts from
UTF-8 (my default in LANG) to CP1252 and doesn't
report an error, meaning that those characters are
valid in CP1252 encoding. The second iconv does the
inverse : translate from CP1252 to UTF-8, and the
result is the original string.
The first printf passes the same UTF-8 encoded quotes
to iconv, but asks to convert to latin1 (ISO-8859-1), and
this time iconv says "illegal input sequence", because
these quotes do not exist in latin1.
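The same experiment can be repeated without depending on the terminal's encoding by writing the three quotes as explicit UTF-8 bytes. This is only a sketch; it assumes an iconv that accepts the names UTF-8, CP1252 and ISO-8859-1, and the variable name quotes is arbitrary.

```shell
# The three quotes (U+201C, U+201E, U+201D) as explicit UTF-8 bytes.
quotes='\342\200\234\342\200\236\342\200\235\n'

# UTF-8 -> CP1252 -> UTF-8: succeeds and returns the original bytes.
printf "$quotes" | iconv -f UTF-8 -t CP1252 | iconv -f CP1252 -t UTF-8 | od -An -b

# UTF-8 -> ISO-8859-1: iconv exits non-zero, because latin1 has no
# code points for these characters.
if printf "$quotes" | iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1; then
    echo "latin1: converted"
else
    echo "latin1: cannot encode these characters"
fi
```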
Quote: And not quite sure what I should set to say LANG
and LC_ALL to en_us first and check that out?
Try,
LANG=en_US.CP1252 locale
LANG=en_US.ISO-8859-15 locale
LANG=en_US.ISO-8859-1 locale
LANG=en_US.UTF-8 locale
and see, if any of these does *not* produce an error
like this :
$ LANG=en_US.FOO locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
Obviously, character encoding FOO doesn't exist.
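The four checks can be wrapped in a loop. This is a rough sketch that assumes the glibc `locale` utility, which reports an unusable locale on stderr as shown above; it sets LC_ALL rather than LANG so that no other variable in the environment can mask the test. The function name check_locale is made up.

```shell
# Report whether the C library accepts a given locale name.
# Any message on stderr from `locale` is treated as "not available".
check_locale() {
    if [ -z "$(LC_ALL=$1 locale 2>&1 >/dev/null)" ]; then
        echo "$1: ok"
    else
        echo "$1: not available"
    fi
}

for candidate in en_US.CP1252 en_US.ISO-8859-15 en_US.ISO-8859-1 en_US.UTF-8; do
    check_locale "$candidate"
done
```

Note that a charmap listed by `locale -m` is not enough on its own; the locale name must also be one the system can load, which is what this loop tests.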
Quote: I did not originally set up the box (actually there are 6 or 8
of them) but I think that LANG=C was done cos there was
a problem with LANG-en_us.
Anything is possible, but centos 3.8 isn't that old.
In your OP you write:
« However there are some special characters (u with 2 dots
overhead, for example) in the data which appear as ? in
the linux file created. »
Is that a normal question mark, or is it inverse (white in
a black hexagon or square), like this: �
In the latter case, all you would have to do is convert the
output from the db application with 'iconv -fcp1252 -tutf8'.
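A small sketch of that conversion, with the flags spelled out. It only helps if the CP1252 bytes are still intact in the file; if the application has already written a literal "?" there is nothing left to convert. The function name cp1252_to_utf8 is hypothetical.

```shell
# Convert CP1252 text on stdin to UTF-8 on stdout.
cp1252_to_utf8() {
    iconv -f CP1252 -t UTF-8
}

# Demo: "A", byte 0xD1 (capital N with tilde in CP1252), "O", newline.
printf 'A\321O\n' | cp1252_to_utf8 | od -An -tx1
# prints:  41 c3 91 4f 0a
```

For a real file this would be used as `cp1252_to_utf8 <infile >outfile`, where both file names are placeholders.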

syd_p...
Posted: Tue Sep 29, 2009 1:43 am
Guest
On 29 Sep, 10:37, Marcel Bruinsma <m... at (no spam) nomail.afraid.org> wrote:
Quote: On Tuesday, 29 September 2009 10:31, syd_p wrote:
But with LANG=C which I thought was only 7 bits the following
printfs work just fine.
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character ë
printf "(octal 361) is the character \0361\n"
(octal 361) is the character ñ
Good, the typeface (font) has the characters you need.
These are two of the characters in the MSSQL db which the
application (not open source) handles as "?".
Check if the application is really the cause of the problem.
For file 'foo', generated by application, run :
LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
$ cat foo
A?O
$ od -b foo
0000000 101 077 117 012
0000004
octal 101 = A
octal 077 = ?
octal 117 = O
middle char should be capital ñ
$ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
0000000
not quite sure what this does ;-)

syd_p...
Posted: Tue Sep 29, 2009 1:53 am
Guest
On 29 Sep, 12:43, syd_p <sydneypue... at (no spam) yahoo.com> wrote:
Quote: [...]
$ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
0000000
not quite sure what this does
Ah yes I am - it deletes all "normal" chars and passes the remainder
to od...
not quite sure what all zeros as the out means tho...
And on a Centos 5.3 box I just invoked
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character 3
on the centos 3.8 box I got the expected output of ë

Marcel Bruinsma...
Posted: Tue Sep 29, 2009 3:20 am
Guest
On Monday, 28 September 2009 12:52, syd_p wrote:
Quote: I entered the commands as suggested
LANG=en_US.CP1252 locale -> Bad
LANG=en_US.ISO-8859-15 locale -> Good
LANG=en_US.ISO-8859-1 locale -> Good
LANG=en_US.UTF-8 locale -> Good
Then run your application with latin9 :
LANG=en_US.ISO-8859-15 application ...
It should no longer convert the ‘special characters’ to
question marks. Simple test:
LANG=en_US.ISO-8859-15 tr -d '\000-\177' <file | od -b

syd_p...
Posted: Tue Sep 29, 2009 3:35 am
Guest
On 29 Sep, 12:53, syd_p <sydneypue... at (no spam) yahoo.com> wrote:
Quote: [...]
And on a Centos 5.3 box I just invoked
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character 3
on the centos 3.8 box I got the expected output of ë
Ahh - I see need to specify the hex value thusly:
$ printf "(hex EB) is the character \xEB\n"
(hex EB) is the character ë
this works on 3.8 and 5.3

Marcel Bruinsma...
Posted: Tue Sep 29, 2009 3:37 am
Guest
On Tuesday, 29 September 2009 10:31, syd_p wrote:
Quote: But with LANG=C which I thought was only 7 bits the following
printfs work just fine.
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character ë
printf "(octal 361) is the character \0361\n"
(octal 361) is the character ñ
Good, the typeface (font) has the characters you need.
Quote: These are two of the characters in the MSSQL db which the
application (not open source) handles as "?".
Check if the application is really the cause of the problem.
For file 'foo', generated by application, run :
LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b

Marcel Bruinsma...
Posted: Tue Sep 29, 2009 8:26 am
Guest
On Tuesday, 29 September 2009 13:53, syd_p wrote:
Quote: $ od -b foo
0000000 101 077 117 012
0000004
octal 101 = A
octal 077 = ?
octal 117 = O
middle char should be capital ñ
Ok, the problem is caused by your application.
Try running it with latin9 ctype :
LANG=en_US.ISO-8859-15 application ...
Quote: $ LANG=en_US.ISO-8859-15 tr -d '\000-\177' <foo | od -b
0000000
not quite sure what this does
Ah yes I am - it deletes all "normal" chars and passes
the remainder to od...
Yes, I thought 'foo' might be a big file, with mostly us-ascii.
In a file with 10000 ascii characters and only 10 non-ascii
the output of od (without the tr filter) might be a bit of a
challenge. ;-)
Quote: not quite sure what all zeros as the out means tho...
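Od starts each line with an address, the offset of the first
byte in that line. The first line starts at offset 0, unless you
invoke od with the -j option. With empty input, od prints only
that final offset, so a lone 0000000 means tr left nothing over:
'foo' contains no bytes above octal 177.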
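A short sketch of both cases, using made-up sample input in place of the real 'foo'. LC_ALL=C is set here only so that tr treats its input strictly as bytes.

```shell
# Only ASCII bytes, like the 'foo' quoted above: tr deletes everything,
# od receives empty input and prints just the end offset.
printf 'A?O\n' | LC_ALL=C tr -d '\000-\177' | od -b
# prints: 0000000

# The same text with a real 0xD1 byte (octal 321) in the middle:
# that byte survives the filter and shows up in the dump.
printf 'A\321O\n' | LC_ALL=C tr -d '\000-\177' | od -b
# prints: 0000000 321
#         0000001
```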
Quote: And on a Centos 5.3 box I just invoked
$ printf "(octal 353) is the character \0353\n"
(octal 353) is the character 3
Yes, that is what posix printf is required to do. From,
http://www.opengroup.org/onlinepubs/9699919799/utilities/printf.html
« [...] "\ddd", where ddd is a one, two, or three-digit octal
» number, shall be written as a byte with the numeric value
» specified by the octal number. »
In the printf above, "\035" is an escape sequence, and the
following "3" is a normal digit. To write octal 353, the
printf format string should be '(octal 353) ... \353\n'.
Quote: on the centos 3.8 box I got the expected output of ë
Are you using zsh on the centos 3.8 box?
The zsh built-in printf expects octal escape sequences to
start with '\0' followed by zero, one, three or four octal
digits.
posix: \353 => zsh: \0353
posix: \75 => zsh: \075
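The difference is easy to see by piping printf into od. This sketch assumes a printf that follows the POSIX rules quoted above (bash 3 or later and dash do, as far as I know).

```shell
# POSIX format string: backslash plus one to three octal digits.
printf '\353' | od -An -b          # one byte: 353

# With a leading zero the escape ends after "\035"; the "3" is literal.
printf '\0353' | od -An -b         # two bytes: 035 063

# Inside a %b argument the rule differs: "\0" plus up to three digits.
printf '%b' '\0353' | od -An -b    # one byte: 353
```

The second case matches the CentOS 5.3 output above: byte 035 is an invisible control character, followed by a visible "3".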

Marcel Bruinsma...
Posted: Tue Sep 29, 2009 9:17 am
Guest
On Tuesday, 29 September 2009 15:54, syd_p wrote:
Quote: $ printf "(hex EB) is the character \xEB\n"
(hex EB) is the character ë
this works on 3.8 and 5.3
This is a nice one to try. It shows the terminal mapping:
printf '\xa4 \x80\n'
To understand that:
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/ISO-8859-15.gz
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/ISO-8859-1.gz
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/CP1252.gz
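If the charmap files are not installed, iconv can show the same thing. A sketch, assuming an iconv that knows all three charset names; the helper name decode_byte is made up.

```shell
# Hypothetical helper: decode one byte (printf octal escape in $2)
# from charset $1 and print the resulting UTF-8 bytes as hex.
decode_byte() {
    printf "$2" | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# 0xA4 is octal 244, 0x80 is octal 200.
for cs in ISO-8859-1 ISO-8859-15 CP1252; do
    printf '%-12s 0xA4 -> %-7s 0x80 -> %s\n' \
        "$cs" "$(decode_byte "$cs" '\244')" "$(decode_byte "$cs" '\200')"
done
```

Here e282ac is the euro sign, c2a4 the generic currency sign and c280 a C1 control code. The euro sits at 0xA4 in ISO-8859-15 and at 0x80 in CP1252, and ISO-8859-1 has no euro at all.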

syd_p...
Posted: Sun Nov 01, 2009 2:44 am
Guest
On 29 Sep, 15:17, Marcel Bruinsma <m... at (no spam) nomail.afraid.org> wrote:
Quote: On Tuesday, 29 September 2009 15:54, syd_p wrote:
$ printf "(hex EB) is the character \xEB\n"
(hex EB) is the character ë
this works on 3.8 and 5.3
This a nice one to try. It shows the terminal mapping:
printf '\xa4 \x80\n'
To understand that:
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/ISO-8859-15.gz
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/ISO-8859-1.gz
zgrep -E '\x(80|a4)' /usr/share/i18n/charmaps/CP1252.gz
Finally I have had some success with:
LC_ALL=en_GB.iso88591
LANG=en_GB.iso88591
Does that make sense?
I have not yet tried en_US.iso88591;
presumably that should work too? What difference would replacing GB
with US make, in general?
Thanks very much for your input.
Syd