VB.NET Forum (Visual Basic .NET): change XML encoding...

Keith G Hicks (Guest)
Posted: Thu Nov 05, 2009 4:27 pm
Okay, I need to clean up these files. They are coming out of this goofy
system with this header:
<?xml version=?1.0? encoding=?UTF-8??>
The quotes around things are not coming in as quotes. And it's not the
correct encoding anyway. It needs to be this:
<?xml version="1.0" encoding="ISO-8859-1"?>
So I guess I need to change the encoding of each file before I can open it
up as an XML doc and read it there. I have no idea what is the best way to
do this programmatically in vb.net. Do I need to open with StreamWriter or
is there an easier way? I can't find anything out there that explains this
clearly. If I need to do this with streamwriter could someone point me
somewhere that shows how to do this?
Thanks,
Keith

Scott M. (Guest)
Posted: Thu Nov 05, 2009 6:05 pm
Well, if the XML files were "well-formed", you'd simply load them up into a
W3C compliant XML DOM Document, which Microsoft makes available in the
System.Xml namespace with the XmlDocument class. Now, with LINQ, we also
have the XDocument type, which I believe is much easier to work with and can
be declared with an inference, as in:
Dim someXML = XDocument.Parse("...xml goes here...")
The problem is that right now, you can't use either of these because your
XML isn't well-formed. Your first goal should be to try to get the XML that
you are initially receiving to be well-formed.
As for the encoding, you can read the original XML into a new XML DOM
Document or XDocument and set the encoding of that new document.
Where are these XML streams coming from in the first place?
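
A minimal sketch of that last step, assuming the file has already been made well-formed and that its bytes match whatever its declaration claims. The paths and the module name are placeholders, not anything from the original post. It loads the file into an XDocument and writes it back through an XmlWriter whose settings name the target encoding, so the declaration and the bytes on disk agree:

```vbnet
Imports System.Text
Imports System.Xml
Imports System.Xml.Linq

Module ReencodeSketch
    Sub Main()
        ' Hypothetical paths, for illustration only.
        Dim inputPath As String = "C:\xml\article.xml"
        Dim outputPath As String = "C:\xml\article-latin1.xml"

        ' Load only succeeds if the XML is well-formed.
        Dim doc As XDocument = XDocument.Load(inputPath)

        ' The writer encodes the characters and also writes a matching
        ' encoding name into the XML declaration.
        Dim settings As New XmlWriterSettings()
        settings.Encoding = Encoding.GetEncoding("ISO-8859-1")
        settings.Indent = True

        Using writer As XmlWriter = XmlWriter.Create(outputPath, settings)
            doc.Save(writer)
        End Using
    End Sub
End Module
```

This does not help while the declaration itself is unreadable; the header has to be repaired first.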

Keith G Hicks (Guest)
Posted: Thu Nov 05, 2009 6:13 pm
Never mind. I figured it out:
Dim TheFileLines As New List(Of String)
TheFileLines.AddRange(System.IO.File.ReadAllLines(xmlFilesLocation & "\" & sArticleToPost))
TheFileLines.RemoveAt(0)
TheFileLines.Insert(0, "<?xml version=""1.0"" encoding=""ISO-8859-1""?>")
System.IO.File.WriteAllLines(xmlFilesLocation & "\" & sArticleToPost, TheFileLines.ToArray)

Keith G Hicks (Guest)
Posted: Thu Nov 05, 2009 6:26 pm
They're coming from a crappy Mac system that is very inflexible. They have
almost no control over how these get output. I wish I could get them
well-formed, but I'm sort of stuck.

Scott M. (Guest)
Posted: Thu Nov 05, 2009 6:48 pm
You realize that just because you've said what you want the encoding to be
doesn't mean that the characters are actually encoded that way, right?
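
To make that concrete, here is a small sketch (not from the thread; the module name is made up) showing that the same character is stored as different bytes depending on the encoding used to write it. Changing the label in the header changes none of these bytes:

```vbnet
Imports System
Imports System.Text

Module EncodingBytesSketch
    Sub Main()
        ' U+00F1, n with a tilde. ChrW avoids depending on the
        ' encoding of this source file.
        Dim sample As String = ChrW(&HF1)

        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

        ' Same character, different bytes on disk.
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes(sample)))        ' F1
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(sample))) ' C3-B1

        ' Decoding UTF-8 bytes as ISO-8859-1 yields two wrong characters
        ' in place of the one that was written.
        Dim misread As String = latin1.GetString(Encoding.UTF8.GetBytes(sample))
        Console.WriteLine(misread.Length) ' 2
    End Sub
End Module
```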

Keith G Hicks (Guest)
Posted: Thu Nov 05, 2009 7:04 pm
Yeah, I found that out. I'm kind of stabbing in the dark here. I'm asking
for help and trying to figure things out while waiting. I'm figuring a few
things out but not enough.
I have no way of getting these files in a better format than they already
are. I'm kind of stuck. I need to know how to take a file and change the
encoding to <?xml version="1.0" encoding="ISO-8859-1"?>
If I open the file manually in a tool I have called EditPad Pro I can paste
the above into the header. Then when I save it EditPad asks if I want to
change to the new encoding or not. Works quite well. I also discovered that
if I change the header in Notepad the characters I'm having trouble with
actually come out fine after I save it and reopen in XML editor. So that's
why I thought that changing it in vb code would do the same thing. Guess
not. Not sure why it works in Notepad.
So anyone that can help me write code to encode these files properly would
get my sincerest thanks.
Thanks,
Keith

Scott M. (Guest)
Posted: Thu Nov 05, 2009 7:15 pm
Keith,
Take a look at http://www.15seconds.com/Issue/050616.htm and look at the
XmlWriterSettings section. This is what you want.
-Scott

Keith G Hicks (Guest)
Posted: Thu Nov 05, 2009 11:37 pm
The first line of the files I'm getting is fouled up, so I cannot
open/read them at all using any XML features in VB. The first line is not
recognizable. It's coming to me saying it's UTF-8 but it's not, and the
double quotes in the header are not coming to me as double quotes.
When I use StreamReader, alter the first line and then save it as a new
file, that almost works but the characters that need to have the correct
encoding actually get changed to something else in the save process. I'm
guessing the stream reader is interpreting them funny and so it doesn't
really matter what I change the header to, the characters themselves change
(I checked in a hex editor to be sure).
So since it works to manually open these files in notepad and simply change
the header to the correct encoding, the characters themselves MUST have the
correct binary values. All that needs to be done is to change that header to
the right encoding without fouling up the characters in the body.
So how can I open the file in the most raw form of text, replace that first
line and save it without changing the characters in question in the process?
I made some progress with this:
Dim sr As New StreamReader(xmlFilesLocation & "\" & sArticleToPost, Encoding.UTF7)
Dim text As String = sr.ReadToEnd
Dim text2() As String
ReDim text2(1)
text2(0) = text.Replace("<?xml version=1.0 encoding=UTF-8?>", "<?xml version=""1.0"" encoding=""ISO-8859-1""?>")
System.IO.File.WriteAllLines(xmlFilesLocation & "\x" & sArticleToPost, text2)
The text2 variable shows the correct characters and when I copy its value
into notepad it's fine. But it doesn't save right. I still get weirder
characters than I want. It's supposed to have characters like N with a
tilde, O with a tilde, O with an accent mark, etc. There are about 6 or 7 I
expect to see in this file. But when I open the newly saved files, those
characters are converted into very strange characters that I'd have to show
you.
I have a question regarding all of this. The encoding header merely tells
the program that's opening the file how to read the characters that are in
it. The characters are of course ultimately stored in binary so the encoding
knows how to interpret the binary into readable characters. If I open a file
using one encoding and the characters look a certain way and then save it
using another, the binary values of the characters change. Is this all true? Am I
understanding this or not? I mean the 0's and 1's that are stored on disk
don't change just cuz of the way you open it. If you open it using one
interpreter (encoding) and they look this way then open using another
encoding you'll see different characters. That makes sense to me. So the
only way I could see the binary changing is if the encoding used when saving
reinterprets the characters to a different string of 1's and 0's. Yes?
Okay, so when I choose the "encoding" parameter of StreamReader, there are
only about 5 options (UTF-7, UTF-8, UTF-32, ASCII, Default, ...) How do I
tell it I want it to read AND SAVE as ISO-8859-1????
Opening UTF-7 seems to help but OMG when I save using UTF-7 things are a big
mess.
Thanks,
Keith
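
On the question of opening the file in its rawest form: one option is to skip text decoding altogether and work on bytes. The sketch below is only an illustration under stated assumptions. It assumes the declaration occupies the whole first line and that the line ends with CR or LF; ReplaceFirstLine, the module name, and the path are hypothetical names, not anything from the thread. Every byte after the first line is copied through untouched:

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module RawHeaderSketch
    ' Replaces everything before the first line break with newHeader and
    ' copies all remaining bytes (line break included) through unchanged.
    Function ReplaceFirstLine(ByVal data As Byte(), ByVal newHeader As String) As Byte()
        ' Find the end of the first line (CR or LF).
        Dim lineEnd As Integer = data.Length
        For i As Integer = 0 To data.Length - 1
            If data(i) = 13 OrElse data(i) = 10 Then
                lineEnd = i
                Exit For
            End If
        Next

        ' The declaration is plain ASCII, so ASCII encoding is safe here.
        Dim headerBytes As Byte() = Encoding.ASCII.GetBytes(newHeader)

        ' VB array bounds are inclusive, hence the - 1.
        Dim result(headerBytes.Length + (data.Length - lineEnd) - 1) As Byte
        Array.Copy(headerBytes, 0, result, 0, headerBytes.Length)
        Array.Copy(data, lineEnd, result, headerBytes.Length, data.Length - lineEnd)
        Return result
    End Function

    Sub Main()
        ' Hypothetical path, for illustration only.
        Dim filePath As String = "C:\xml\article.xml"
        Dim fixedBytes As Byte() = ReplaceFirstLine( _
            File.ReadAllBytes(filePath), _
            "<?xml version=""1.0"" encoding=""ISO-8859-1""?>")
        File.WriteAllBytes(filePath, fixedBytes)
    End Sub
End Module
```

This only produces a correct file if the body bytes really are ISO-8859-1 already; it relabels the file and converts nothing.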

Branco Medeiros (Guest)
Posted: Fri Nov 06, 2009 8:11 am
Göran Andersson wrote:
<snip>
Quote: Specify the encoding when you read the lines, that way you will read the
rest of the file correctly:
Dim fileName As String = Path.Combine(xmlFilesLocation, sArticleToPost)
Dim iso = Encoding.GetEncoding("ISO-8859-1")
Dim lines As String() = System.IO.File.ReadAllLines(fileName, iso)
lines(0) = "<?xml version=""1.0"" encoding=""ISO-8859-1""?>"
System.IO.File.WriteAllLines(fileName, lines, iso)
<snip>
I may be wrong, but it seems to me that your code tries to read the
file as if it were in ISO-8859-1 encoding, while the OP's first post
states that the files are in UTF-8 and he wants them in ISO-8859-1. He
is probably reading garbage atm.
If he wants to pursue this approach, then the *reading* encoding
should be set to UTF8:
<aircode>
Dim FileName As String = _
Path.Combine(xmlFilesLocation, sArticleToPost)
Dim Lines As String() = _
System.IO.File.ReadAllLines(FileName, Encoding.UTF8)
Lines(0) = "<?xml version=""1.0"" encoding=""ISO-8859-1""?>"
System.IO.File.WriteAllLines( _
FileName, Lines, Encoding.GetEncoding("ISO-8859-1"))
</aircode>
HTH.
Regards,
Branco.

Keith G Hicks (Guest)
Posted: Fri Nov 06, 2009 11:51 am
Yeah, I found that out too. I got that from a post somewhere and it seemed
a bit clunky. I ended up with this instead (also a bit clunky)
Dim sr As New StreamReader(xmlFilesLocation & "\" & sArticleToPost, Encoding.GetEncoding("ISO-8859-1"))
Dim text As String = sr.ReadToEnd
sr.Close()
sr = Nothing
Dim text2() As String
ReDim text2(1)
text2(0) = text.Replace("<?xml version=1.0 encoding=UTF-8?>", "<?xml version=""1.0"" encoding=""ISO-8859-1""?>").Replace("&", "&amp;")
System.IO.File.WriteAllLines(xmlFilesLocation & "\" & sArticleToPost, text2, Encoding.GetEncoding("ISO-8859-1"))
But what you suggested below is cleaner and works just fine. Thanks for the
input. Really appreciated.
"Göran Andersson" <guffa at (no spam) guffa.com> wrote in message
news:OzpGp0vXKHA.3720 at (no spam) TK2MSFTNGP02.phx.gbl...
Quote: Keith G Hicks wrote:
Never mind. I figured it out:
Dim TheFileLines As New List(Of String)
TheFileLines.AddRange(System.IO.File.ReadAllLines(xmlFilesLocation & "\" & sArticleToPost))
TheFileLines.RemoveAt(0)
TheFileLines.Insert(0, "<?xml version=""1.0"" encoding=""ISO-8859-1""?>")
System.IO.File.WriteAllLines(xmlFilesLocation & "\" & sArticleToPost,
TheFileLines.ToArray)
You are shuffling a lot of data around that you don't need to. You don't
have to turn the array to a list and back again. Just replace the first
item instead of removing it and inserting a new one.
Specify the encoding when you read the lines, that way you will read the
rest of the file correctly:
Dim fileName As String = Path.Combine(xmlFilesLocation, sArticleToPost)
Dim iso = Encoding.GetEncoding("ISO-8859-1")
Dim lines As String() = System.IO.File.ReadAllLines(fileName, iso)
lines(0) = "<?xml version=""1.0"" encoding=""ISO-8859-1""?>"
System.IO.File.WriteAllLines(fileName, lines, iso)
You could also save the file as UTF-8 if you like. Once you have the text
as strings, it can be encoded using any encoding that has a character set
that supports the characters that you have. As strings in .NET are Unicode,
you can always encode a string using any Unicode encoding.
lines(0) = "<?xml version=""1.0"" encoding=""UTF-8""?>"
System.IO.File.WriteAllLines(fileName, lines)
--
Göran Andersson
_____
http://www.guffa.com
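
On Göran's point that any encoding whose character set covers the text will do, a rough way to check this before saving is to ask for the encoding with exception fallbacks, so unsupported characters raise an error instead of being silently replaced. CanEncode and the module name are hypothetical helper names used only for this sketch:

```vbnet
Imports System
Imports System.Text

Module CanEncodeSketch
    ' Returns True when every character of text exists in the named encoding.
    Function CanEncode(ByVal text As String, ByVal encodingName As String) As Boolean
        ' Exception fallbacks make unsupported characters throw instead of
        ' being swapped for a substitute character.
        Dim strictEncoding As Encoding = Encoding.GetEncoding( _
            encodingName, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback)
        Try
            strictEncoding.GetBytes(text)
            Return True
        Catch ex As EncoderFallbackException
            Return False
        End Try
    End Function

    Sub Main()
        ' n with a tilde (U+00F1) is part of ISO-8859-1.
        Console.WriteLine(CanEncode("Espa" & ChrW(&HF1) & "a", "ISO-8859-1"))      ' True
        ' The euro sign (U+20AC) is not part of ISO-8859-1.
        Console.WriteLine(CanEncode("Price: " & ChrW(&H20AC) & "5", "ISO-8859-1")) ' False
        ' UTF-8 can represent any Unicode character.
        Console.WriteLine(CanEncode("Price: " & ChrW(&H20AC) & "5", "UTF-8"))      ' True
    End Sub
End Module
```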

Scott M. (Guest)
Posted: Fri Nov 06, 2009 12:02 pm
You can drop the sr = Nothing line.
Setting a variable to Nothing in VB .NET really doesn't accomplish anything,
except in the rarest cases. For a StreamReader, Close already performs the
same cleanup as Dispose, so your Close call is enough; a Using block is the
safer habit because it disposes the reader even if an exception is thrown.
-Scott
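
A minimal sketch of the Using pattern for a reader like the one in Keith's code, assuming ISO-8859-1 input. ReadAllAsLatin1 and the module name are made up for illustration:

```vbnet
Imports System.IO
Imports System.Text

Module UsingSketch
    ' Reads the whole file as ISO-8859-1 text.
    Function ReadAllAsLatin1(ByVal filePath As String) As String
        ' The Using block disposes the reader when it exits,
        ' whether normally or through an exception.
        Using reader As New StreamReader(filePath, Encoding.GetEncoding("ISO-8859-1"))
            Return reader.ReadToEnd()
        End Using
    End Function
End Module
```

With this form there is no separate Close call to forget and no variable to set to Nothing.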

Keith G Hicks (Guest)
Posted: Fri Nov 06, 2009 1:35 pm
Thanks for the info Branco but Göran was right. It's not UTF-8. I can see
why you'd think that based on the header that was in the original file but
that's in there incorrectly, which is why I need to change it.

All times are GMT - 5 Hours