SEC company filings: Is the <SEC-HEADER> tag valid SGML? If so, how to parse it? - parsing

I tried to parse SEC company filings from sec.gov. Starting from a Facebook 10-Q index.htm page, let's look at the corresponding complete submission text filing. It has a structure like:
<SEC-DOCUMENT>
<SEC-HEADER>
<ACCEPTANCE-DATETIME>"some content" This tag is not closed.
"some lines resembling yaml markup"
These are indented lines with a
"key": "value" structure.
</SEC-HEADER>
<DOCUMENT>
.
.
some content
.
.
</DOCUMENT>
"several DOCUMENT tags" ...
</SEC-DOCUMENT>
I tried to figure out the structure of the <SEC-HEADER> tag and found some information in the Public Dissemination Service (PDS) Technical Specification (PDF), and concluded that the content of the header should be SGML.
Nevertheless, I am clueless about the formatting, since there are no angle brackets, and the key-value pairs are separated by colons, like key: value instead of <key>value</key>. In the PDF I could not find anything about colons.
Question: Is the <SEC-HEADER> tag valid SGML? If it is, how to parse it?
I'd be glad of any help.

The short answer is no. The <SEC-HEADER> tag in the raw filing is not valid SGML.
However, it is my understanding that this section in the raw filing is generated automatically from the header file <accession_num>.hdr.sgml, which does follow SGML. This header file can be found in the same directory as the raw filing (i.e., the <accession_num>.txt file).
I use a REGEX of the form ^<(.+?)>(.+?)$ (with the re.MULTILINE option) to capture each (tag, value) tuple and load the results directly into a dict().
I believe the only tag in that file that has a closing tag is <FILER>, and there can be multiple filers in each filing. You can first extract those blocks using a REGEX of the form <FILER>(.+?)</FILER> and then apply the same REGEX as above to get the inner tags for each filer.
Note that other than 'FILER', there could be other tags representing different relations of the entities to the filing: 'ISSUER', 'SUBJECT COMPANY', 'FILED BY', 'FILED FOR', 'SERIAL COMPANY', 'REPORTING OWNER'.
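For illustration, here is a minimal Python sketch of that approach, assuming header_text already holds the contents of the <accession_num>.hdr.sgml file (the function name is a placeholder):
import re

def parse_header(header_text):
    # Each header line looks like <TAG>value; entity blocks such as
    # <FILER>...</FILER> are the only parts with closing tags.
    tag_re = re.compile(r'^<(.+?)>(.+?)$', re.MULTILINE)
    filer_re = re.compile(r'<FILER>(.+?)</FILER>', re.DOTALL)

    # Pull out the FILER blocks first and parse their inner tags.
    filers = [dict(tag_re.findall(block)) for block in filer_re.findall(header_text)]

    # Parse the remaining top-level tags into a flat dict.
    header = dict(tag_re.findall(filer_re.sub('', header_text)))
    return header, filers
The same pattern extends to the other entity tags listed above by swapping the tag name in filer_re.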

Related

importxml of URL with Hebrew returns encoding other than UTF-8 that Chrome doesn't recognize

For example, in the dummy spreadsheet (tab 'desired outcome'), under "Link 1" you will see this URL:
http://www.promotion-il.co.il/service/%5DE%5E4%5D9%5E5-%5E8%5D9%5D7-%5D7%5E9%5DE%5DC%5D9-%5DC%5E2%5E1%5E7%5D9%5DD/
However, the actual URL in UTF-8 is:
http://www.promotion-il.co.il/service/%D7%9E%D7%A4%D7%99%D7%A5-%D7%A8%D7%99%D7%97-%D7%97%D7%A9%D7%9E%D7%9C%D7%99-%D7%9C%D7%A2%D7%A1%D7%A7%D7%99%D7%9D/
The actual URL string that contains Hebrew is:
http://www.promotion-il.co.il/service/מפיץ-ריח-חשמלי-לעסקים/
I will also add that the same URL has returned with a proper UTF-8 encoding for other blog posts. (See second example in the same tab).
Why is it happening?
How can it be fixed?
Thanks in advance!
This is the solution I came up with eventually:
I saw that for the imported URLs, two replacements were needed in order to fix a broken URL:
5D --> D7%9
5E --> D7%A
I used this formula in a separate column to achieve it:
=ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE((<COLUMN WITH IMPORTED URLS HERE>),"5D","D7%9"),"5E","D7%A"))
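If you need the same fix outside of Google Sheets, here is a minimal Python sketch of the same two replacements, applied to the broken URL from the question (note that, like the formula, it blindly replaces every occurrence of 5D and 5E):
broken = "http://www.promotion-il.co.il/service/%5DE%5E4%5D9%5E5-%5E8%5D9%5D7-%5D7%5E9%5DE%5DC%5D9-%5DC%5E2%5E1%5E7%5D9%5DD/"
# 5D --> D7%9 and 5E --> D7%A, exactly as in the spreadsheet formula
fixed = broken.replace("5D", "D7%9").replace("5E", "D7%A")
print(fixed)  # prints the proper UTF-8 percent-encoded URL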

regex to extract URLs from text - Ruby

I am trying to detect the URLs in a text and wrap them in quotes, like below:
original text: Hey, it is a url here www.example.com
required text: Hey, it is a url here "www.example.com"
original text shows my input value and required text represents the required output. I searched a lot on the web but could not find any possible solution. I have already tried the URI.extract feature, but that doesn't seem to detect URLs without http or https. Below are examples of some of the URLs I want to deal with. Kindly let me know if you know the solution.
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Find words that look like URLs:
str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"
str.split.select{|w| w[/(\b+\.\w+)/]}
This will give you an array of words which have no spaces and include one or more . characters, which MIGHT work for your use case.
puts str.split.select{|w| w[/(\b+\.\w+)/]}
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Updated
Complete solution to modify your string:
str_with_quote = str.clone # make a clone for the `gsub!`
str.split.select{|w| w[/(\b+\.\w+)/]}
.each{|url| str_with_quote.gsub!(url, '"' + url + '"')}
Now your cloned object wraps the URLs inside double quotes.
puts str_with_quote
This will give you the following output:
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.
"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"
"www.jstor.org/stable/24084454"
"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"
"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"
"www.cerege.fr/spip.php?page=pageperso&id_user=94"

How to read string stored in hdf5 format files by DM

I am scripting with DM and would like to read the HDF5 file format.
I borrowed Tore Niermann's gms_HDF5_Plug-In (hdf5_GMS2X_amd64.dll) and his CMD_import_hdf5.s script. It uses h5_read_dataset(filename, datapath) to read an image dataset.
I am trying to figure out the way to read string info stored in the same file. I am particularly interested in reading the angle stored as a string, as shown in this figure: (figure: demonstrated string to read).
There is a help file (hdf5_plugin.chm) with a list of functions, but unfortunately I can't open it to see more info: (figure: hdf5_plugin.chm showing the function list).
I suppose the right functions to read strings should be something like h5_read_attr() or h5_info(), but I didn't test them out. DM always says the two functions don't exist.
After reading out the angle as a string, I will also need a bit of help converting the string to a double datatype.
Thank you.
Converting a string to a number is done with the Val() command.
There is no integer/double/float concept for variables in DM-script; all are just number. (This is different for images, where you can define the numeric type. Also: for file import/export, a type differentiation can be made using the TagGroup streaming commands in the other answer.)
Example script:
string numStr = "1.234e-2"
number num = val( numStr )
ClearResults()
Result( "\n As string:" + numStr )
Result( "\n As value:" + num )
Result( "\n As value, formatted:" + Format(num,"%3.2f") )
A potential answer regarding the .chm files: when you download (or email) .chm files on Windows, the OS classifies them as "potentially dangerous" (because they could contain executable HTML code, I think). As a result, these files cannot be shown by default. However, you can right-click these files and "unblock" them in the file properties.
I think this will most likely be a question specific to that plugin and not general DM scripting, so it might be better to contact the plugin author directly.
The alternative (not good) solution would be to "rewrite" your own HDF5 file reader, if you know the file format. For this you would need the "Streaming" commands of the DM script language to browse through the (binary?) source file to the appropriate file location. The starting point for reading up on this in the F1 help documentation would be here:

google translate messes up the coding of my file

I am trying to use Google Translate for localization of an XML file. It has nearly 350K lines, but some of them contain coding for in-game font size and color, like so:
<replacement><p horizontalalignment="center"><br/><image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/><br/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Six_Superior" scalerate="1.5"/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/><br/><image enablescale="true" imagesetpath="00009499.Field_Boss" scalerate="1.4"/>Хмельной лик<br/><br/></p>Уничтожить зараженных насекомых<br/>возле мест обитания их королевы。<br/></replacement>
Now, for God knows what reason, Google Translate alters that code in the process of translation into some unacceptable coding, like so:
<replacement> <p horizontalalignment="center"> <br/> <image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/> <br/> <image enablescale = "true "imagesetpath =" 00015590.Tag_Dungeon_Six_Superior "scalerate =" 1.5 "/> <image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/> <br/> <image enablescale = "true" imagesetpath = "00009499.Field_Boss" scalerate = "1.4" /> Intoxicated face <br/> <br/> </ p> Destroy infected insects <br/> habitats near their queen. <br/> </ replacement>
Is there any way to avoid that, and why exactly is it happening? Any help is appreciated on that matter, thanks.
EDIT: I am also looking for a way to input my text and have it come out in the exact same language, with only the coding mishaps changing, so I can isolate those, build a comparison table, and then use it to fix the errors after the actual translation is done. But I don't see a way to select the same language as both input AND output in Google Translate; it always forces me to choose a different one for input or output. That kind of makes sense, but if there is a way to do it, I might be able to work around the problem.
Do not feed Google Translate your XML file; as far as I know, it doesn't understand XML.
Extract the text from the XML file.
Feed the text to translate.
Transform the text back to XML.
You could simply transform the XML to a text document with a single line per XML element, so it would be easier to turn it back into XML (see the sketch below).
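As a rough Python sketch of that workflow (assuming the game file is well-formed XML; file names are placeholders, and text that follows inline tags like <br/> lives in .tail and would need the same treatment):
import xml.etree.ElementTree as ET

tree = ET.parse("game_text.xml")
root = tree.getroot()

# 1. Extract: one line of plain text per element, in document order.
elements = [e for e in root.iter() if e.text and e.text.strip()]
with open("to_translate.txt", "w", encoding="utf-8") as f:
    for elem in elements:
        f.write(elem.text.replace("\n", " ") + "\n")

# 2. Translate to_translate.txt as plain text, keeping one line per line.

# 3. Re-insert: line N of the translation goes back into element N.
with open("translated.txt", encoding="utf-8") as f:
    translated = f.read().splitlines()
for elem, new_text in zip(elements, translated):
    elem.text = new_text
tree.write("game_text_translated.xml", encoding="utf-8")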
More detail
According to the Toolkit you can upload:
HTML (.HTML)
Microsoft Word (.DOC/.DOCX)
OpenDocument Text (.ODT)
Plain Text (.TXT)
Rich Text (.RTF)
Wikipedia URLs
And a couple of extras such as JSON. So no XML.
The best way I see is to transform your XML document into one of these types (I would probably use JSON), in such a way that it can easily be transformed back again, either by position (line 1 in the text file is the first element in the XML document) or by an id (add the id or position of the element in the XML hierarchy to the JSON element).
My guess is that the Toolkit recognizes the HTML tags in the XML and escapes them. So another option might be to un-escape the &gt; back to > and the &lt; back to <.

Lucene/Solr unexpected query answer

I'm using Solr 4.4.0 running on Tomcat 7.0.29.
The solrconfig.xml is as-delivered (except for the Solr home directory, of course).
I could pass on the schema.xml, though I doubt this would help much, as the following will show.
If I select all documents containing "russia" in the text, which is the default field, i.e., if I execute the query "russia", I find only 1 document, which is correct.
If I select all documents containing "web" in the text ("web"), the result is 29, which is also correct.
If I search for all documents that do not contain "russia" ("NOT(russia)"), the result is still correct (202).
If I search for all documents that contain "web" and do not contain "russia" ("web AND NOT(russia)"), the result is, once again, correct (28, because the document containing "russia" also contains "web").
But if I search for all documents that contain "web" or do not contain "russia" ("web OR NOT(russia)"), the result is still 28, though I should get 203 matches (the whole set).
Has anyone got an explanation?
For information, AND and OR work correctly if I don't use a NOT somewhere in the query, i.e.:
"web AND russia" --> OK
"web OR russia" --> OK
I got a solution from Yonik Seeley, which is to translate NOT(russia) into (*:* -russia), so that there is a positive set (*:*, i.e. all documents) from which -russia can be subtracted. This solution works very well. I still believe it would be a good idea to modify the parser so that the straightforward request "web OR NOT(russia)" would work without translation.
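To make the rewrite concrete, here are the two forms side by side (hit counts taken from the question; *:* matches all documents in standard Lucene syntax):
web OR NOT(russia)      --> 28 hits (the pure-negative clause matches nothing on its own)
web OR (*:* -russia)    --> 203 hits (the negation is applied against the full document set)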
