What's the format of the OpenOffice dictionaries? - parsing

Does anyone know what the format of the OpenOffice dictionary files is? As far as I can see there is one word per line, plus some flags that presumably tell me something about the word.
Here are a couple of lines from the English dictionary as an example:
absoluteness/S
absorbency/SM
abstract/ShTVDPiGY
absurdness/S
And from the Norwegian dictionary, which is what I'll use:
flatorm/AEG
flatpresse/W
flatseng/ACEG
flatside/ACDEFGHJ
flatskjerm/A
What do, for instance, "/AEG" and "/S" mean? I assume each letter/flag has a specific meaning, so that the A in "/AEG" means the same as the A in "/ACDEFGHJ".
I have googled all over the place, but I can't find any information.

OO uses the hunspell engine for spell-checking. The stuff after the "/" is linked to data in the corresponding affix file: each flag names a prefix or suffix rule there that tells the spell-checker which inflected forms of the base word are valid.
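For illustration, the S flag on "absoluteness/S" would point at a suffix rule block in the English .aff file that looks roughly like this (a hedged sketch; the exact rules vary from dictionary to dictionary):
SFX S Y 4
SFX S   y     ies        [^aeiou]y
SFX S   0     s          [aeiou]y
SFX S   0     es         [sxzh]
SFX S   0     s          [^sxzhy]
Each rule line says: strip the given ending (0 means strip nothing), add the given suffix, provided the end of the word matches the pattern on the right. Applied to "absoluteness", which ends in a character matching [sxzh], the third rule yields the plural "absolutenesses".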


Open and extract information from large text file (Geonames)

I want to make a list of all major towns and cities in the UK.
Geonames seems like a good place to start, although I need to use it locally (as opposed to the API) as I will be working offline while using the information.
Due to the large size of the geonames "allCountries.txt" file, it won't open in Notepad, Notepad++, or Sublime. I've tried opening it in Excel (including the data-modelling function), but the file has more than a million rows, so that won't work either.
Is it possible to open this file, extract the UK-only cities, and manipulate them in Excel and/or some other software? I am only after place name, lat, long, country name, and continent.
dedek's suggestion (in the comments) to use GB.txt is definitely the best answer for your particular case.
I've added another answer because this technique is much more flexible and will allow you to filter by country or any other column, i.e. you can adapt this solution to filter by language, region in the UK, population, etc., or apply it to the cities5000.txt file, for example.
Solution:
Use grep to find data that matches a particular pattern. In essence, the command below says: find all rows where the 9th column (the country code) is exactly "GB".
grep -P "[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\tGB\t" allCountries.txt > UK.txt
(grep comes standard with most Unix systems but there are definitely tools out there that can do it on Windows too.)
Details:
grep: The command being executed.
\t: Shorthand for the TAB character.
-P: Tells grep to use a Perl-style regular expression (grep might not recognize \t as a TAB character otherwise). (This might be a bit different if you are using another version of grep.)
[^\t]*: zero or more non-tab characters, i.e. one (possibly empty) column value.
> UK.txt: writes the output of the command to a file called "UK.txt".
Again, you could adapt this example to filter on any column in any file.
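If you are on Windows without grep, the same filter is a few lines of Python (a sketch assuming, as the grep command above does, that the country code is the 9th tab-separated field):
with open('allCountries.txt', encoding='utf-8') as src, \
     open('UK.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        cols = line.rstrip('\n').split('\t')
        # keep only rows whose 9th tab-separated field (country code) is GB
        if len(cols) > 8 and cols[8] == 'GB':
            dst.write(line)
Because it streams the file line by line, it never holds the whole thing in memory, which is what defeated Notepad and Excel.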

F#: Link actual word/definitions dictionary to code

I'm running into a search issue with my question. I'm trying to link an actual dictionary (e.g., words with definitions) to some code I'm writing in F#. Specifically, I'm using FsVerbalExpressions to identify whitespace-separated strings and would like to look each string up in an actual dictionary to determine if they're words or not.
The problem I'm having is that when I search on SO (or Google or anywhere else) for "link dictionary to F#" or "F# dictionary library" or some other permutation of "F#" and "dictionary," I get hits on how to use the dictionary collection in F#.
I'm hoping someone out there has some insight into how to link a dictionary, though this answer has given me some pointers toward alternatives if I can't find exactly what I'm looking for.
Thanks for your help!

Character #\u009C cannot be represented in the character set CHARSET:CP1252 - how to fix it

As already pointed out in the topic, I got the following error:
Character #\u009C cannot be represented in the character set CHARSET:CP1252
while trying to print out a string returned by drakma:http-request. As far as I understand the error, the problem is that the Windows encoding (CP1252) does not support this character.
Therefore, to be able to process it, I may have to convert the whole string.
My question is: which package/library supports converting strings to particular character sets efficiently?
A similar question is this one, but just ignoring the error would not help in my case.
Drakma already does the job of "converting strings": after all, when it reads from some random web server, it just gets a stream of bytes, which it then has to convert to a Lisp string. You probably want to bind *drakma-default-external-format* to something else, although I can't remember off-hand what the allowable values are. Maybe something like :utf-8?
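For illustration, a minimal Common Lisp sketch (assuming :utf-8 is an acceptable external-format designator here, which you should check against the Drakma documentation):
(let ((drakma:*drakma-default-external-format* :utf-8))
  (drakma:http-request "http://www.example.com/"))
Binding the special variable with let confines the new default to requests made inside the body.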

Parsing PDF files

I'm finding it difficult to parse a PDF file that was created in a non-English language. I used PDFBox and iText, but couldn't find anything in them that could help parse this file. Here's the PDF file I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252) with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
Each number is a code point; the names that follow it replace the characters at that code point and the ones immediately after it, so "47 /BB" means code point 47 becomes the glyph named BB.
There are no such character names as BB, BP, BQ, C6, and so on. So when you copy and paste that text, you get the above garbage.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in each font used in the PDF:
Render it to PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
Then, for each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array.
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
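As a rough sketch of the matching step (hypothetical Python; the extraction of the charproc streams themselves, e.g. with a PDF library, is omitted, and every name below is made up for illustration):
def build_translation_table(known_charprocs, pdf_charprocs):
    # known_charprocs: {stream_bytes: character}, built from the reference
    #                  PDFs you rendered one known character at a time
    # pdf_charprocs:   {glyph_name: stream_bytes}, pulled from the Type 3
    #                  fonts in the target PDF
    table = {}
    for glyph_name, stream in pdf_charprocs.items():
        if stream in known_charprocs:
            table[glyph_name] = known_charprocs[stream]
    return table
Decoding then becomes: byte -> glyph name (via the encoding's differences array) -> real character (via the table above).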

Need help gathering data from a txt file and inserting it in a web page?

Could someone advise me on the most efficient way to gather data from one source, select a specific piece of data, and insert it in a web page? Specifically, I wish to:
Call up this buoy data text file: http://www.ndbc.noaa.gov/data/realtime2/46237.txt
Find the water temperature and insert that value in my web page.
First big question: What scripting language should I use? (I'm assuming Fortran is not an option :-)
Second, not-so-big question: this same data set is available in graphic and XML formats. Would either of those be more useful than the .txt file?
Thanks in advance.
Use Perl.
(Hey, you asked. Normally one programs in whatever language one would normally use.)
The XML format won't be much more useful than the text format.
This text file format is just about as simple as it could ever get. Just about any scripting or general-purpose programming language will work. The critical part is to split each line on the regex \s+, i.e. one or more whitespace characters. In Python it would be:
import re

with open('/path/to/downloaded/file.txt') as the_file_object:
    for line in the_file_object:
        if line.startswith('#'):
            continue  # skip the '#'-prefixed column-header lines at the top
        columns = re.split(r'\s+', line.strip())
        # each whitespace-separated field is columns[0], columns[1], ...
So basically, choose whatever programming language seems easiest to you. Any .NET language would be equally capable, as would Ruby, Python, Scheme, etc. I personally have a distaste for Perl because I find it very difficult to read.
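To make the Python route concrete, here is a hedged end-to-end sketch (it assumes the file keeps the usual NDBC layout: a '#'-prefixed header row naming the columns, including WTMP for water temperature, then a units row, then the newest observation first; verify this against the actual file):

import re
import urllib.request

url = 'http://www.ndbc.noaa.gov/data/realtime2/46237.txt'
lines = urllib.request.urlopen(url).read().decode('ascii').splitlines()

# locate the water-temperature column by name in the header row
header = re.split(r'\s+', lines[0].lstrip('#').strip())
wtmp = header.index('WTMP')

# skip the header and units rows; the first data row is the latest reading
latest = re.split(r'\s+', lines[2].strip())
print('Water temperature:', latest[wtmp])  # units are given in the second header row

From there, inserting the value into your web page is ordinary templating in whichever server-side language you end up choosing.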
