When parsing XML, the character é is missing

I have an XML document as input to a Java function that parses it and produces an output. Somewhere in the XML there is the word "stratégie". The output is "stratgie". How should I parse the XML so that I get the "é" character as well?
The XML is not produced by myself; I get it as a response from a web service, and I am positive that "stratégie" is included in it as "strat&eacute;gie".
In the parser, I have:
public List<Item> GetItems(InputStream stream) {
    try {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(stream);
        doc.getDocumentElement().normalize();
        NodeList nodeLst = doc.getElementsByTagName("item");
        List<Item> items = new ArrayList<Item>();
        Item currentItem = new Item();
        Node node = nodeLst.item(0);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element item = (Element) node;
            if (node.getChildNodes().getLength() == 0) {
                return null;
            }
            NodeList title = item.getElementsByTagName("title");
            Element titleElmnt = (Element) title.item(0);
            if (null != titleElmnt)
                currentItem.setTitle(titleElmnt.getChildNodes().item(0).getNodeValue());
            ....
Using the debugger, I can see that titleElmnt.getChildNodes().item(0).getNodeValue() is "stratgie" (without the é).
Thank you for your help.

I strongly suspect that either you're parsing it incorrectly or (rather more likely) it's just not being displayed properly. You haven't really told us how you're using the result, which makes it hard to give very concrete advice.
As ever with encoding issues, the first thing to do is work out exactly where data is getting lost. Lots of logging tends to be the way forward: create a small test case that demonstrates the problem (as small as you can get away with) and log everything about the data. Don't just try to log it as raw text: log the Unicode value of each character. That way your log will have all the information even if there are problems with the font or encoding you use to view the log.
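For example, a minimal sketch of that kind of logging (the helper name dumpCodePoints is made up for illustration):
static void dumpCodePoints(String s) {
    for (int i = 0; i < s.length(); i++) {
        // Print each character alongside its Unicode value, e.g. "é = U+00E9",
        // so the log stays meaningful even if the console font mangles the glyph.
        System.out.printf("%c = U+%04X%n", s.charAt(i), (int) s.charAt(i));
    }
}
Calling it on titleElmnt.getChildNodes().item(0).getNodeValue() would show whether the é is already gone at parse time or merely not being displayed.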

The answer was here: http://www.yagudaev.com/programming/java/7-jsp-escaping-html

You can either use UTF-8 and have the 'é' char in your document instead of &eacute;, or you need a parser that understands this entity, which exists in HTML and XHTML and maybe other XML dialects but not in pure XML: in pure XML there are "only" &quot;, &lt;, &gt;, &amp; and maybe &apos;, I don't remember.
You may need to declare those special-char entities in your DTD or XML Schema (I don't know which one you use) and tell your parser about it.
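For illustration, a minimal sketch of declaring such an entity in an internal DTD subset (the element names here are made up):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE items [
  <!ENTITY eacute "&#233;">
]>
<items>
  <item><title>strat&eacute;gie</title></item>
</items>
With the declaration, a standard XML parser expands &eacute; to é; without it, the undeclared reference is an error in pure XML.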

Related

Dart Markdown package, how to handle new lines

I am trying to make a WYSIWYG internal tool, and we decided to implement this feature with contentEditable. However, we save data to our databases in markdown, so I have to be able to parse from HTML to MD and back. For HTML to MD I use the package html2md, and for the other way around I use the Markdown package.
The issue I've been having is that when you write text like this to my editor:
HEY

After many lines some text
it produces this in MD:
HEY

After many lines some text
Notably, it uses 2 whitespace and 2 LF characters (or at least I think so, but I might be slightly wrong). I solved this issue by parsing it like this:
markdownToHtml(data.replaceAll('&', '&amp;').replaceAll('<', '&lt;').replaceAll('>', '&gt;'), inlineSyntaxes: [TextSyntax(String.fromCharCodes([32, 32, 10, 10]), sub: "<div><br></div>")], inlineOnly: true);
The inlineOnly parameter was necessary because without it the text syntax wasn't applied for some reason. However, inlineOnly then bit me in the arse when I tried to implement parsing of unordered lists, which are parsed as blocks. So I need a way to correctly parse these empty lines without using inlineOnly.
class EmptyLineBlockSyntax extends BlockSyntax {
  RegExp get pattern => RegExp(r'^(?:[ \t][ \t]+)$');

  const EmptyLineBlockSyntax();

  Node parse(BlockParser parser) {
    parser.encounteredBlankLine = true;
    parser.advance();
    return Element('p', [Element.empty('br')]);
  }
}

return markdownToHtml(data.replaceAll('&', '&amp;').replaceAll('<', '&lt;').replaceAll('>', '&gt;'), blockSyntaxes: [EmptyLineBlockSyntax()]);

Confirming existence of a string in an xml table Lua

Good afternoon everyone,
My problem is that I have 2 XML lists
<List1> <Agency>String</Agency> </List1>
and
<List2><Agency2>String</Agency2></List2>.
In Lua I need to create a program which parses these lists, and when the user inputs a matching string from List 1 or List 2, the program needs to confirm to the user whether the string belongs to List 1 or List 2, or whether it is nonexistent. I'm new to Lua and to programming generally speaking, and I would be very grateful for your answers. I have LuaExpat as a plugin, but I can't seem to be able to actually read from a file; I can only do some beginner tricks if the XML list is written in the code. At a later time this small program will be fed by an RSS feed.
require("lxp")
local stuff = {}
xmldata="<Top><A/> <B a='1'/> <B a='2'/><B a='3'/><C a='3'/></Top>"
function doFunc(parser, name, attr)
if not (name == 'B') then return end
stuff[#stuff+1]= attr
end
local xml = lxp.new{StartElement = doFunc}
xml:parse(xmldata)
xml:close()
print(stuff[3].a)
This code is from a tutorial on the web and works fine: it prints nr. 3. Now I want to know how to do the same from an actual file, because if I use io.read on a file (with "r" or "rb") to fill the xmldata variable and run the same thing, it returns either empty space or nil.
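A minimal sketch of reading the whole file first (the filename list.xml is just an example); note that file:read() with no argument returns only one line, and io.read reads from standard input, either of which would explain the empty or nil result:
local lxp = require("lxp")

local f = assert(io.open("list.xml", "r"))  -- "rb" also works; XML is plain text
local xmldata = f:read("*a")                -- "*a" slurps the entire file
f:close()

local xml = lxp.new{StartElement = doFunc}  -- same callback as above
xml:parse(xmldata)
xml:close()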

How to remove non-ascii char from MQ messages with ESQL

CONCLUSION:
For some reason the flow wouldn't let me convert the incoming message to a BLOB by changing the Message Domain property of the Input Node, so I added a Reset Content Descriptor node before the Compute Node with the code from the accepted answer. On the line that parses the XML and creates the XMLNSC child for the message I was getting a 'CHARACTER: Invalid wire format received' error, so I took that line out and added another Reset Content Descriptor node after the Compute Node instead. Now it parses and replaces the Unicode characters with spaces, so it no longer crashes.
Here is the code for the added Compute Node:
CREATE FUNCTION Main() RETURNS BOOLEAN
BEGIN
    DECLARE NonPrintable BLOB X'0001020304050607080B0C0E0F101112131415161718191A1B1C1D1E1F7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF1F2F3F4F5F6F7F8F9FAFBFCFDFEFF';
    DECLARE Printable BLOB X'20202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020';
    DECLARE Fixed BLOB TRANSLATE(InputRoot.BLOB.BLOB, NonPrintable, Printable);
    SET OutputRoot = InputRoot;
    SET OutputRoot.BLOB.BLOB = Fixed;
    RETURN TRUE;
END;
UPDATE:
The message is being parsed as XML using XMLNSC. I thought that would cause a problem, but it does not appear to be the issue.
Now I'm using PHP. I've created a node to plug into the legacy flow. Here's the relevant code:
class fixIncompetence {
    function evaluate($output_assembly, $input_assembly) {
        $output_assembly->MRM = $input_assembly->MRM;
        $output_assembly->MQMD = $input_assembly->MQMD;
        $tmp = htmlentities($input_assembly->MRM->VALUE_TO_FIX, ENT_HTML5|ENT_SUBSTITUTE, 'UTF-8');
        if (!empty($tmp)) {
            $output_assembly->MRM->VALUE_TO_FIX = $tmp;
        }
        // Ensure there are no null MRM fields. MessageBroker is strict.
        foreach ($output_assembly->MRM as $key => $val) {
            if (empty($val)) {
                $output_assembly->MRM->$key = '';
            }
        }
    }
}
Right now I'm getting a vague error about read only messages, but before that it wasn't working either.
Original Question:
For some reason I am unable to impress upon the senders of our MQ messages that smart quotes, en-dashes, em-dashes, and such crash our XML parser.
I managed to make a working solution with SQL queries, but it wasted too many resources. Here's the last thing I tried, but it didn't work either:
CREATE FUNCTION CLEAN(IN STR CHAR) RETURNS CHAR
BEGIN
    SET STR = REPLACE('–', STR, '&ndash;');
    SET STR = REPLACE('—', STR, '&mdash;');
    SET STR = REPLACE('·', STR, '&middot;');
    SET STR = REPLACE('“', STR, '&ldquo;');
    SET STR = REPLACE('”', STR, '&rdquo;');
    SET STR = REPLACE('‘', STR, '&lsqo;');
    SET STR = REPLACE('’', STR, '&rsquo;');
    SET STR = REPLACE('•', STR, '&bull;');
    SET STR = REPLACE('°', STR, '&deg;');
    RETURN STR;
END;
As you can see I'm not very good at this. I have tried reading about various ESQL string functions without much success.
So in ESQL you can use the TRANSLATE function.
The following is a snippet I use to clean up a BLOB containing non-ASCII low hex values so that it can then be cast into a usable character string.
You should be able to modify it to change your undesired characters into something more benign. Basically, each hex value in NonPrintable gets translated into its positional equivalent in Printable, in this case always a full stop, i.e. X'2E' in ASCII. You'll need to make your BLOBs long enough to cover the desired range of hex values.
DECLARE NonPrintable BLOB X'000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F';
DECLARE Printable BLOB X'2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E';
SET WorkBlob = TRANSLATE(WorkBlob, NonPrintable, Printable);
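The cast mentioned above might then look like this; a sketch, where CCSID 1208 (UTF-8) is an assumption about the incoming data:
DECLARE WorkChars CHARACTER;
-- Turn the cleaned-up BLOB into a character string in the given code page
SET WorkChars = CAST(WorkBlob AS CHARACTER CCSID 1208);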
BTW if messages with invalid characters only come in every now and then I'd probably specify BLOB on the input node and then use something similar to the following to invoke the XMLNSC parser.
CREATE LASTCHILD OF OutputRoot DOMAIN 'XMLNSC'
PARSE(InputRoot.BLOB.BLOB CCSID InputRoot.Properties.CodedCharSetId ENCODING InputRoot.Properties.Encoding);
With the exception terminal wired up, you can then correct the BLOBs of any messages containing parser-breaking invalid characters before attempting to reparse.
Finally, my best wishes, as I've had a number of battles over the years with being forced to correct invalid message content in the "Integration Layer"; after all, that's what it's meant to do.

Open search server : basic questions

I am evaluating OSS to implement crawling, indexing and searching a mid-sized ASP.NET (MVC4) website.
So far it looks promising.
Here are some basic questions which I could not find in the docs:
1. German umlauts: the Renderer/Search for the German umlauts 'ä, ü, ö' fails:
http://localhost:8080/renderer?use=haas&name=gSearch&query=küche
returns "küche" in the search box with no results, although there should be results in the index! (I created a query "gSearch" with language=German.)
2. Can OSS return synonyms ("...did you mean...") WITHOUT having to insert every thinkable or unthinkable synonym MANUALLY?
3. I did not get results until I added "aspx" in Schema -> Parser_list -> HTML -> supported extensions. Is this correct, or should I add another parser for ASP? Can I have more than one parser for HTML, ASP, PDF, etc.?
4. After doing 3 I got results, both aspx and pdf documents, but I did not get a clickable link (filename) for the PDF files.
5. What would be the best way to call search from MVC? Via web services? I do not want to include an IFRAME.
It's always troublesome when several different questions are gathered in one, but here's my take on the last one, calling search from MVC:
I use a WebRequest, very straightforward.
var webRequest = WebRequest.Create("http://localhost:8080/select?use=haas&query=kitchen");
webRequest.Timeout = 10000;
WebResponse webResponse;
try
{
    webResponse = webRequest.GetResponse();
}
catch (WebException ex)
{
    WriteToEventLog(ex.Message);
    return; // without this, webResponse would be used unassigned below
}
var xmlStream = webResponse.GetResponseStream();
var reader = XmlReader.Create(xmlStream);
var doc = XDocument.Load(reader, LoadOptions.PreserveWhitespace);
Then you have yourself an XML with the returned fields set up in your OSS index query.
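Pulling the fields out is then plain LINQ to XML; a sketch, where the element name "field" and its "name" attribute are assumptions about the OSS response layout, so check them against a real response:
foreach (var field in doc.Descendants("field"))
{
    // Each hit carries its field values as elements; print name/value pairs.
    Console.WriteLine("{0}: {1}", (string)field.Attribute("name"), field.Value);
}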

DBF Large Char Field

I have a database file that I believe was created with Clipper, but I can't say for sure (I have .ntx files for the indexes, which I understand is what Clipper uses). I am trying to create a C# application that will read this database using the System.Data.OleDb namespace.
For the most part I can successfully read the contents of the tables; there is one field, however, that I cannot. This field, called CTRLNUMS, is defined as CHAR(750). I have read various articles found through Google searches suggesting that fields larger than 255 chars have to be read through a different process than the normal assignment to a string variable, but so far I have not been successful with any approach I have found.
The following is a sample code snippet I am using to read the table, including two options I used to read the CTRLNUMS field. Both options resulted in 238 characters being returned even though there are 750 characters stored in the field.
Here is my connection string:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\datadir;Extended Properties=DBASE IV;
Can anyone tell me the secret to reading larger fields from a DBF file?
using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    using (OleDbCommand cmd = new OleDbCommand())
    {
        cmd.Connection = conn;
        cmd.CommandType = CommandType.Text;
        cmd.CommandText = string.Format("SELECT ITEM,CTRLNUMS FROM STUFF WHERE ITEM = '{0}'", stuffId);
        using (OleDbDataReader dr = cmd.ExecuteReader())
        {
            if (dr.Read())
            {
                stuff.StuffId = dr["ITEM"].ToString();
                // OPTION 1
                string ctrlNums = dr["CTRLNUMS"].ToString();
                // OPTION 2
                char[] buffer = new char[750];
                int index = 0;
                int readSize = 5;
                while (index < 750)
                {
                    long charsRead = dr.GetChars(dr.GetOrdinal("CTRLNUMS"), index, buffer, index, readSize);
                    index += (int)charsRead;
                    if (charsRead < readSize)
                    {
                        break;
                    }
                }
            }
        }
    }
}
You can find a description of the DBF structure here: http://www.dbf2002.com/dbf-file-format.html
What I think Clipper used to do was modify the field structure so that, in character fields, the decimal-places byte held the high-order byte of the size, so character field sizes were really 256*decimals + size (a small sketch of that computation follows below).
I may have a C# class that reads DBFs (natively, not via ADO/DAO); it could be modified to handle this case. Let me know if you're interested.
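For reference, a minimal sketch of that computation over the 32-byte field descriptor from the DBF header (offsets per the format description linked above; the helper name is made up):
static int ClipperCharFieldLength(byte[] fieldDescriptor)
{
    byte length = fieldDescriptor[16];    // documented field length byte (max 255)
    byte decimals = fieldDescriptor[17];  // decimal count, reused as the size's high byte
    return decimals * 256 + length;       // e.g. 2 * 256 + 238 = 750 for CHAR(750)
}
That would also explain the 238 characters seen above: a driver that ignores the hack reads only the low byte, and 750 = 2 * 256 + 238.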
Are you still looking for an answer? Is this a one-off job or something that needs doing regularly?
I have a Python module that is primarily intended to extract data from all kinds of DBF files ... it doesn't yet handle the length_high_byte = decimal_places hack, but it's a trivial change. I'd be quite happy to (a) share this with you and/or (b) get a copy of such a DBF file for testing.
Added later: Extended-length feature added, and tested against files I've created myself. Offer to share code with anyone who would like to test it still stands. Still interested in getting some "real" files myself for testing.
3 suggestions that might be worth a shot...
1 - use Access to create a linked table to the DBF file, then use .NET to hit the table in the Access database instead of going direct to the DBF.
2 - try the FoxPro OLE DB provider (a connection-string sketch follows below).
3 - parse the DBF file by hand. Example is here.
My guess is that #1 should work the easiest, and #3 will give you the opportunity to fine-tune your cussing skills. :)
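On suggestion 2, a connection-string sketch, assuming the Visual FoxPro OLE DB provider (VFPOLEDB) is installed; whether it honors the Clipper length hack is untested:
using (OleDbConnection conn = new OleDbConnection(@"Provider=VFPOLEDB.1;Data Source=C:\datadir;"))
{
    conn.Open();
    // Query the DBF exactly as with the Jet provider above.
}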
