After upgrading to Ruby 1.9.3-p392 today, REXML throws a RuntimeError when attempting to retrieve an XML response over a certain size. Everything works fine and no error is thrown when receiving fewer than 25 XML records, but once a certain response-length threshold is reached, I get this error:
Error occurred while parsing request parameters.
Contents:
RuntimeError (entity expansion has grown too large):
/.rvm/rubies/ruby-1.9.3-p392/lib/ruby/1.9.1/rexml/text.rb:387:in `block in unnormalize'
I realize this was changed in the most recent Ruby version:
http://www.ruby-lang.org/en/news/2013/02/22/rexml-dos-2013-02-22/
As a quick fix, I've raised REXML::Document.entity_expansion_text_limit to a larger number, and the error goes away.
Is there a less risky solution?
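For reference, the quick fix above boils down to a single assignment (the 100 KB figure here is just an illustrative choice, not a recommendation):
require 'rexml/document'

# Raise REXML's per-node expansion text limit from its 10 KB default.
REXML::Document.entity_expansion_text_limit = 100 * 1024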
This issue occurs when too much content is sent in an XML response.
To fix it, restrict the data in each individual node to under 10 KB (instead of sending the whole data, show truncated data and provide a separate link to view the full content).
The error is raised from the following file:
ruby-2.1.2/lib/ruby/2.1.0/rexml/text.rb
# Unescapes all possible entities
def Text::unnormalize( string, doctype=nil, filter=nil, illegal=nil )
  sum = 0
  string.gsub( /\r\n?/, "\n" ).gsub( REFERENCE ) {
    s = Text.expand($&, doctype, filter)
    if sum + s.bytesize > Security.entity_expansion_text_limit
      raise "entity expansion has grown too large"
    else
      sum += s.bytesize
    end
    s
  }
end
The limit in ruby-2.1.2/lib/ruby/2.1.0/rexml/text.rb defaults to 10240 bytes, which means 10 KB of data per node.
REXML already defaults to allowing only 10000 entity substitutions per document, so the maximum amount of text that can be generated by entity substitution is around 98 megabytes. (See https://www.ruby-lang.org/en/news/2013/02/22/rexml-dos-2013-02-22/ )
That sounds like a LOT of XML. Do you really need all of it? Maybe you can request only certain fields from the remote server? One option might be to try another XML parser (Nokogiri, for example). Another option might be to use something other than XML as a transport (JSON? Binary?).
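As a rough sketch of the Nokogiri route (assuming the nokogiri gem is installed; response_body and the //record XPath are placeholders for your actual response and element names):
require 'nokogiri'

# Parse the raw XML response with Nokogiri instead of REXML.
doc = Nokogiri::XML(response_body)
doc.xpath('//record').each do |node|
  puts node.text
end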
I am interacting with OpenAI using the ruby-openai gem, but I get a timeout error. Is there a way I can increase the timeout limit?
response = @client.completions(
  parameters: {
    model: "text-davinci-003",
    prompt: "In the style of #{@as_written_by}, write a longer article in HTML of at least 750 words using the #{article} as the primary source and basis for the new article, and include interesting facts from the #{secondary_sources}, with a tags around the source of the information pointing to the original URLs",
    max_tokens: 3000
  })
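One likely fix is to raise the client's request timeout when constructing it. This is a sketch assuming a reasonably recent version of ruby-openai, where the option is named request_timeout (worth verifying against the version you have); the 300-second value is just an example:
@client = OpenAI::Client.new(
  access_token: ENV["OPENAI_ACCESS_TOKEN"],
  request_timeout: 300  # seconds; raise this if long completions keep timing out
)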
With my Saxon extension function code I get the following log messages:
> java -cp ./saxon-he-10.2.jar:./ net.sf.saxon.Transform -t -init:MyInitializer -xsl:./exttest.xsl -o:./out.xml -it:initialtemplate
Saxon-HE 10.2J from Saxonica
Java version 14.0.2
Stylesheet compilation time: 305.437652ms
Processing (no source document) initial template = initialtemplate
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for null using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 0.850325ms
Tree size: 3 nodes, 0 characters, 0 attributes
Execution time: 29.965658ms
Memory used: 14Mb
It is not clear to me whether Building tree for null using class net.sf.saxon.tree.tiny.TinyBuilder means that there is something wrong with my code (https://gitlab.com/ms452206/socode20200915), and if so, how to avoid it.
It's a poor message and I will improve it; the "null" would be the base URI (or systemId) of the document if it had one. The fact that the document has no known base URI could be a predictor of trouble downstream, since some things rely on a document having a known base URI; but it's not an error in itself.
It's most likely to happen if you build a document using a JAXP Source object whose systemId property is null. Which is what you have done when you wrote:
new StreamSource(new StringReader("<foo/>"))
This is likely to cause failures only if the document contains relative URIs (for example in external entity references or href or xml:base attributes), which is not the case with your simple XML document.
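If you do want the in-memory document to have a base URI (and a less puzzling log line), one way is to pass a systemId to the StreamSource constructor. A sketch, with a made-up file URI standing in for whatever base you want:
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;

// Supplying a systemId gives the in-memory document a base URI, so Saxon
// should report that URI instead of "null" when building the tree.
StreamSource source = new StreamSource(new StringReader("<foo/>"), "file:///path/to/foo.xml");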
We have an API as a proxy between clients and Google Pub/Sub: it basically receives a JSON body and publishes it to the topic. The messages are then processed by Dataflow, which stores them in BigQuery. We also use a transform UDF to, for instance, convert a field value to upper case; it parses the JSON sent and produces a new one.
The problem is the following. The number of bytes reaching the destination table is much smaller than the number going to the dead-letter table, and in about 99% of cases the error message says that the JSON sent is invalid. And that's true: the payloadstring column contains distorted JSONs: they can be truncated, concatenated with other ones, or even both. I've added logs on the API side to see where the messages get corrupted, but neither the JSON bodies received by the API nor the ones it sends are invalid.
How can I debug this problem? Is there any chance that Pub/Sub or Dataflow corrupts messages? If so, what can I do to fix it?
UPD. By the way, we use a Google-provided template called "Pub/Sub Topic to BigQuery".
UPD2. The API is written in Go, and we send the message simply by calling
res := p.topic.Publish(ctx, &pubsub.Message{Data: msg})
The res variable is then used for error logging. p here is a custom struct.
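As an extra check on the API side, here is a sketch assuming p.topic is a *pubsub.Topic from the standard cloud.google.com/go/pubsub client: it copies the payload before the asynchronous publish (purely a precaution in case the caller reuses the msg slice before the publish completes) and blocks on the result so any publish error surfaces immediately.
package api

import (
    "context"
    "log"

    "cloud.google.com/go/pubsub"
)

// publishJSON copies the payload before handing it to the asynchronous Publish
// call, then blocks on the result so any publish error is reported at once.
func publishJSON(ctx context.Context, topic *pubsub.Topic, msg []byte) error {
    data := make([]byte, len(msg))
    copy(data, msg)

    res := topic.Publish(ctx, &pubsub.Message{Data: data})
    id, err := res.Get(ctx) // blocks until the server acknowledges the message
    if err != nil {
        return err
    }
    log.Printf("published message %s", id)
    return nil
}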
The message we send is a JSON object with 15 fields; to be concise, I'll mock both it and the UDF.
Message:
{"MessageName":"Name","MessageTimestamp":123123123",...}
UDF:
function transform(inJson) {
  var obj;
  try {
    obj = JSON.parse(inJson);
  } catch (error) {
    throw 'parse JSON error: ' + error;
  }
  if (Object.keys(obj).length !== 15) {
    throw "Message is invalid";
  }
  if (!(obj.hasOwnProperty('MessageName') && typeof obj.MessageName === 'string' && obj.MessageName.length > 0)) {
    throw "MessageName is absent or invalid";
  }
  /*
  other fields check
  */
  obj.MessageName = obj.MessageName.toUpperCase();
  /*
  other fields transform
  */
  return JSON.stringify(obj);
}
UPD3:
Besides being corrupted, I've noticed that every single message is duplicated at least once, and the duplicates are often truncated.
The problem started several days ago, when there was a massive increase in the number of messages; the volume is back to normal now, but the error is still there. The problem was seen before, but it was a much rarer case.
The behavior you describe suggests that the data is corrupt before it gets to Pub/Sub or Dataflow.
I have performed a test, sending JSON messages containing 15 fields. Your UDF function as well as the Dataflow template work fine, since I was able to insert the data into BigQuery.
Based on that, it seems your messages are already corrupted before getting to Pub/Sub. I suggest you check your messages once they arrive in Pub/Sub and see whether they have the correct format.
Please note that the message schema is required to match the BigQuery table schema.
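One way to do that inspection (a sketch; debug-sub and YOUR_TOPIC are placeholder names) is to attach a temporary subscription to the topic and pull a few raw messages with gcloud:
# Attach a temporary debug subscription and pull raw messages from it
gcloud pubsub subscriptions create debug-sub --topic=YOUR_TOPIC
gcloud pubsub subscriptions pull debug-sub --limit=10 --auto-ack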
I'm trying to run the example from https://cran.r-project.org/web/packages/text2vec/vignettes/files-multicore.html, but with my own file "text": 3.7 GB of plain text built from a Wikipedia XML dump with the Perl script from here: http://mattmahoney.net/dc/textdata.html
setwd("c:/rtest")
library(text2vec)
library(doParallel)
N_WORKERS = 2
registerDoParallel(N_WORKERS)
it_files_par = ifiles_parallel(file_paths = "text")
it_token_par = itoken_parallel(it_files_par, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it_token_par)
This causes the error:
Error in unserialize(socklist[[n]]) : error reading from connection
I have 8 GB of RAM; a word2vec model is created from this file without any errors.
First of all, it doesn't make sense to use parallel iterators on a single file: each file is processed in a separate R worker process, so here it will be worse than plain itoken. It also involves sending the result from each worker to the master process, and here we see that the result is too big to be sent through a socket.
Long story short - just use itoken or split your file into several smaller files.
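A minimal single-process version of the same pipeline looks like this (a sketch only: it reads the whole file into memory, so with a 3.7 GB file and 8 GB of RAM it may still be tight, and splitting the file first remains the safer route):
library(text2vec)

# Read the single file and build the vocabulary in the master process.
lines = readLines("text")
it = itoken(lines, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it)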
I'm having some problems with the Freebase API. I have managed to pass a key to the Freebase provider, so I don't see any 403 errors relating to quota restrictions. But since I used a Google API key, Commons is not being recognized when I hit "Alt + Enter", even though the provider manages to show me data while I'm writing.
[<Literal>]
let FreebaseApiKey = "AIzaSyCOn15-T31Ls"
type FreebaseDataWithKey = FreebaseDataProvider<Key=FreebaseApiKey>
let dataWithKey = FreebaseDataWithKey.GetDataContext()
let travelDestinations = dataWithKey.Commons.Travel.``Travel destinations``
let all = travelDestinations |> Seq.toList
let first = all.Head.Name
As you can see, I have access to Travel Destinations, so the provider shows me data correctly, but when I execute it:
Script.fsx(17,38): error FS0039: The field, constructor or member 'Commons' is not defined
The weird thing is that if I delete the Google key and use the provider, this error does not happen. Any clues?
In a provider like Freebase, we need to do things asynchronously, and that unfortunately means the error reporting is not very good (see https://github.com/fsharp/fsharp/issues/280).
What's probably happening is that Freebase is returning errors because the API key isn't right, or something similar, and those errors are not surfacing because the top-level objects are already cached. You can either look in Fiddler to see the JSON being returned, use data.DataContext.SendingQuery or data.DataContext.SendingRequest, or clear the cache by deleting the FreebaseSchema and FreebaseRuntime folders under your temporary internet files system folder.
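For example, a rough sketch of the SendingRequest option (assuming the data context exposes it as a standard F# event, which may vary by FSharp.Data version):
let dataWithKey = FreebaseDataWithKey.GetDataContext()
// Log each outgoing Freebase request so API-key errors can be spotted.
dataWithKey.DataContext.SendingRequest.Add(fun args -> printfn "Freebase request: %A" args)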
We recently tried to change this so that errors surface by generating the top-level types synchronously, but that caused other problems (https://github.com/fsharp/FSharp.Data/issues/522).