Open and extract information from a large text file (GeoNames)

I want to make a list of all major towns and cities in the UK.
Geonames seems like a good place to start, although I need to use it locally (as opposed to the API) as I will be working offline while using the information.
Due to the large size of the GeoNames "allCountries.txt" file, it won't open in Notepad, Notepad++, or Sublime. I've tried opening it in Excel (including the Data Model feature), but the file has more than a million rows, so this won't work either.
Is it possible to open this file, extract the UK-only cities, and manipulate them in Excel and/or some other software? I am only after place name, latitude, longitude, country name, and continent.

dedek's suggestion (in the comments) to use GB.txt is definitely the best answer for your particular case.
I've added another answer because this technique is much more flexible and will allow you to filter by country or any other column, i.e. you can adapt this solution to filter by language, region in the UK, population, etc., or apply it to the cities5000.txt file, for example.
Solution:
Use grep to find data that matches a particular pattern. In essence, the command below says: find all rows where the 9th column (the country code) is exactly "GB".
grep -P "[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\tGB\t" allCountries.txt > UK.txt
(grep comes standard with most Unix systems but there are definitely tools out there that can do it on Windows too.)
Details:
grep: The command being executed.
\t: Shorthand for the TAB character.
-P: Tells grep to use a Perl-style regular expression (grep might not recognize \t as a TAB character otherwise). (This might be a bit different if you are using another version of grep.)
[^\t]*: zero or more non-tab characters i.e. an optional column value.
> UK.txt: writes the output of the command to a file called "UK.txt".
Again, you could adapt this example to filter on any column in any file.
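If grep isn't handy (e.g. on Windows without extra tools), the same filter is a few lines of Python; a minimal sketch, assuming the tab-separated layout from the GeoNames readme (country code in the 9th column, index 8):
with open("allCountries.txt", encoding="utf-8") as src, \
     open("UK.txt", "w", encoding="utf-8") as dst:
    for line in src:                               # stream the file; never load it all at once
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 8 and fields[8] == "GB":  # index 8 = 9th column, the country code
            dst.write(line)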

Extracting PDF Tables into Excel in Automation Anywhere

I have a PDF with tabular data that runs over 50+ pages, and I want to extract this table into an Excel file using Automation Anywhere (I am using the community version of AA 11.3). I watched videos on the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging... and the main reason for that is the values that contain multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also face challenges with Automation Anywhere, since it does not really provide the right tools for such a thing, and you may need to resort to scripting (VBScript) or MetaBots.
Solution 1
This one will try to use purely text extraction and Regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to look at what the exported data actually looks like. You can export it as either Plain or Structured text.
The Plain export is not useful at all, as the data is all over the place without any clear pattern.
The Structured export is much better, as its layout resembles the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First you need to ensure that the Structured data contains only the table, nothing else. You can probably use the Before-After string command for that.
Then you need to check whether you can reliably identify the character width of every column. You can try this for yourself if you copy the text into Excel and use Text to Columns with the Fixed Width option, playing around with the sliders.
Then you need to find a way to reliably identify each data row and prepare it for the Split command in AA. For that you need a delimiter, but since each data row can actually consist of multiple text rows, you need to create a delimiter of your own. I used the Replace function with the Regular Expression option to replace a specific pattern with a delimiter (a pipe).
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consist of several text rows, you will need to use Split again, this time with [ENTER] as the delimiter. Then loop through each text line of a single data row, use the Substring function to extract data based on column width, and concatenate the pieces into a single value that you store somewhere else.
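For comparison, here is roughly what the replace/split/substring sequence looks like outside AA, sketched in Python; the anchor pattern follows the observation above (a number, two or more spaces, a dollar amount), but the file name and column widths are made-up placeholders you would have to measure on the real export:
import re

structured = open("export.txt", encoding="utf-8").read()  # the Structured export, table only

# Inject a custom delimiter (a pipe) after each data row, keyed on the anchor
# at the end of the row: a number, two or more spaces, then a dollar amount.
delimited = re.sub(r"(\d+ {2,}\$[\d.,]+)", r"\1|", structured)

column_widths = [(0, 25), (25, 50), (50, 65), (65, 75), (75, 90)]  # placeholders

# Split on the pipe to get one entry per data row, then split each entry on
# line breaks and stitch the fixed-width slices back together per column.
for data_row in delimited.split("|"):
    cells = [""] * len(column_widths)
    for text_line in data_row.splitlines():
        if not text_line.strip():
            continue                      # skip the blank separator rows
        for i, (start, end) in enumerate(column_widths):
            cells[i] = (cells[i] + " " + text_line[start:end].strip()).strip()
    if any(cells):
        print(cells)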
All in all, a painful process.
Solution 2
This may not be applicable, but it's worth a try: open the PDF in Microsoft Word. It will give you a warning; ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier and you will be able to use macros/VBA or even simple copy and paste. I tried it on a random PDF of my own and it works quite well.

Informix 4GL report to screen - Reverse

I have a generated report in Informix 4GL that prints to the screen.
I need to have one column displayed in reverse format.
I tried the following:
print line_image attribute(reverse)
But that doesn't work. Is this possible at all?
Adding on to the previous answer, you can try the following:
print "\033[7mHello \033[0mWorld"
\033[7m means print in reverse video, and \033[0m means go back to standard.
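If you want to check that your terminal honours these sequences at all, the same escape codes work from any language; for example, in Python:
print("\033[7mHello \033[0mWorld")  # \033[7m = reverse video on, \033[0m = back to normal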
If you mean "is there any way at all to do it", the answer's "yes". If you mean "is there a nice easy built-in way to do it", the answer's "no".
What you'll need to do is:
Determine the character sequence that switches to 'reverse' video — store the characters in a string variable brv (begin reverse video; choose your own name if you don't like mine).
Determine the character sequence that switches to 'normal' video — store the characters in a string variable erv (end reverse video).
Arrange for your printing to use:
PRINT COLUMN 1, first_lot_of_data,
COLUMN 37, brv, reverse_data,
COLUMN 52, erv,
COLUMN 56, next_lot_of_data
There'll probably be 3 or 4 characters needed to switch. Those characters will be counted by the column-counting code in the report.
Different terminal types will have different sequences. These days, the chances are you're not dealing with the huge variety of actual green-screen terminals that were prevalent in the mid-80s, so you may be able to hardwire your findings for the brv and erv strings. OTOH, you may have to do some fancy footwork to find the correct sequences for different terminals at runtime. Shout if you need more information on this.
A simple way which might allow you to discover the relevant sequences is to run a program such as (this hasn't been anywhere near an I4GL compiler — there are probably syntax errors in it):
MAIN
DISPLAY "HI" AT 1,1
DISPLAY "REVERSE" AT 1,4 ATTRIBUTE(REVERSE)
DISPLAY "LO" AT 1, 12
SLEEP 2
END MAIN
Compile that into terminfo.4ge and run:
./terminfo.4ge # So you know what the screen looks like
./terminfo.4ge > out.file
There's a chance that won't capture the display attributes. If you run cat out.file and don't see the reverse video flash up, then we have to work harder.
You could also look at the terminal entry in the termcap file or from the terminfo entry. Use infocmp $TERM (with the correct terminal type set in the environment variable) and look for the smso (enter standout mode) and rmso (exit standout mode) capabilities. Decipher those (I have rmso=\E[27m and smso=\E[7m for an xterm-256color terminal; the \E is ASCII ESC or \033) and use them in the brv and erv strings. Note that rmso is 5 characters long.
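If Python is available on the box, its curses module can pull those same capabilities out of terminfo for you, which saves deciphering the infocmp output by hand; a small sketch, assuming TERM is set correctly:
import curses

curses.setupterm()                                # read the terminfo entry for $TERM
smso = (curses.tigetstr("smso") or b"").decode()  # enter standout (reverse) mode
rmso = (curses.tigetstr("rmso") or b"").decode()  # exit standout mode
print(smso + "REVERSE" + rmso + " normal")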

How can these strings be different?

I am facing a weird problem.
I have extracted data from an Excel file. It should contain an IBAN account number.
Then I tried to analyze the set of account numbers (which the source guarantees to be good) with a Java library.
To keep the scope of the question narrow: I can't explain the following. The two strings below are different:
030​69
03069
The first is a copy and paste from the Excel file; the second is handwritten. Google returns different results for abi [above number], and in fact for the second one I can find that it is the bank code (ABI) for the Intesa Sanpaolo bank.
So, to keep the scope narrow: how is that possible? Is it something to do with the encoding?
Try it yourself: press CTRL+F and type "030"; it will select both lines. Now type 6, and it will match only the 2nd line.
The same happens in Notepad++.
There's a U+200B ZERO WIDTH SPACE between 030 and 69 in the first string.
Paste the text into https://www.branah.com/unicode-converter, for example, or inspect it in a hex-capable editor.
One way to clean such strings is to whitelist characters: anything that isn't A-Z or 0-9 gets scrubbed.
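For illustration, here is the reveal-and-scrub in Python, with a sample string mirroring the one above:
import re

s = "030\u200b69"                              # the pasted value, hidden U+200B included
print(repr(s))                                 # '030\u200b69' - makes the invisible character visible
cleaned = re.sub(r"[^A-Z0-9]", "", s.upper())  # whitelist: keep only A-Z and 0-9
print(cleaned)                                 # 03069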

How can I search for <item1> AND <item2> using the Delphi XE2 IDE search?

I use searching all the time to locate stuff within my (huge) application source, so search effectiveness is very important to me. Presently in the Delphi XE2 IDE I like to use:
Find in Files
Include subdirectories.
Nothing else fancy, just a text keyword. This works OK, but what I would really like is to extend what I'm doing now to include lines that contain 'A' AND 'B', where A and B are any group of characters (one type of boolean search). Exact matches against A and B are fine, because this allows you to put in two very partial keywords and still find a unique occurrence. I've been using this method in my own search engine for years. Is there an easy way of doing this in the Delphi IDE?
Thanks
You can use regular expressions (just check the regular expressions checkbox on the right side of the Find window). The regex support is somewhat limited; it's documented for XE2 on the XE2 docwiki.
I use GExperts Grep Search instead (part of the GExperts IDE experts set), which offers fuller regex support (although still not great) and a better display (IMO) of the search results. The Grep Search dialog in the original screenshot used a regular expression that matches WordA or WordB in either order in the file, so it satisfies your search logic within the limited regex support in GExperts. It matches single words on the line as well, but the results dialog makes it easy to find the lines you're interested in, and double-clicking a line takes you to that match in the IDE's code editor.
(The original answer included screenshots of the results dialog for a single-file search and for a multi-file search, in this case just two files.)
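For reference, an order-independent AND over a single line can be expressed as an alternation of the two orders. A quick way to test the idea outside the IDE (here in Python, with placeholder words):
import re

# Matches lines that contain both words, in either order.
pattern = re.compile(r"WordA.*WordB|WordB.*WordA")
for line in ["x WordA y WordB z", "WordB then WordA", "only WordA here"]:
    print(bool(pattern.search(line)), line)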

How to make a small engine like Wolfram|Alpha?

Let's say I have three models/tables: operating_systems, words, and programming_languages:
# operating_systems
name:string created_by:string family:string
Windows Microsoft MS-DOS
Mac OS X Apple UNIX
Linux Linus Torvalds UNIX
UNIX AT&T UNIX
# words
word:string definitions:string
window (serialized hash of definitions)
hello (serialized hash of definitions)
UNIX (serialized hash of definitions)
# programming_languages
name:string created_by:string example_code:text
C++ Bjarne Stroustrup #include <iostream> etc...
HelloWorld Jeff Skeet h
AnotherOne Jon Atwood imports 'SORULEZ.cs' etc...
When a user searches for hello, the system shows the definitions of 'hello'. This is relatively easy to implement. However, when a user searches for UNIX, the engine must choose: word or operating_system. Also, when a user searches for windows (with a small letter 'w'), the engine chooses word, but should also show "Assuming 'windows' is a word. Use as an operating system instead."
Can anyone point me in the right direction with parsing and choosing the topic of the search query? Thanks.
Note: it doesn't need to be able to perform calculations the way WA can.
Have a new index table called terms that contains a tokenised version of each valid term. That way, you only have to search one table.
# terms
Id Name Type Priority
1 window word false
2 Windows operating_system true
Then you can see how close a match the user's search term is, i.e. "Windows" would be a 100% match with 2, so assume that, but it is a close match to 1 also, so suggest that as an alternative. You'd have to write your own rules engine that decides how closely a word matches (i.e. what gets assumed with "windows" vs "Windows"?). The Priority field could be the final decider if the rules engine can't decide, and could in theory be driven by user activity so it learns what users are more likely referring to.
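A minimal sketch of that matching logic (in Python; the scoring threshold and tie-breaking are assumptions, just to show the shape of such a rules engine):
import difflib

terms = [
    (1, "window",  "word",             False),  # Id, Name, Type, Priority
    (2, "Windows", "operating_system", True),
]

def lookup(query, threshold=0.8):
    scored = []
    for term_id, name, term_type, priority in terms:
        score = difflib.SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if name == query:
            score = 1.0                           # exact, case-sensitive match wins outright
        if score >= threshold:
            scored.append((score, priority, term_id, name, term_type))
    scored.sort(key=lambda t: (-t[0], not t[1]))  # best score first, Priority breaks ties
    return scored                                 # head = assumption, rest = suggestions

print(lookup("Windows"))  # term 2 first, term 1 offered as the alternative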
What about making a cache in the form of a database table holding all the keywords?
The search query would be something like this:
SELECT * FROM keywords WHERE keyword = '<YourKeyWord>' /* mysql */
The keywords table would contain some kind of reference to your modules.
The advantage of this approach is, of course, fast searching.
You may use two queries in order to simulate the behaviour you ask for:
Exact match (no problem in MySQL)
Case-insensitive search
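For example, sketched in Python with MySQL-style placeholders (the cursor cur is assumed to come from a MySQL driver such as mysql-connector-python, and the keywords table is the one above):
def lookup(cur, keyword):
    # Exact, case-sensitive match first: BINARY forces a byte-wise comparison in MySQL.
    cur.execute("SELECT * FROM keywords WHERE keyword = BINARY %s", (keyword,))
    exact = cur.fetchall()
    # Default MySQL collations compare case-insensitively, so this catches near misses.
    cur.execute("SELECT * FROM keywords WHERE keyword = %s", (keyword,))
    loose = cur.fetchall()
    return exact, [row for row in loose if row not in exact]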
Wolfram Alpha is far more complex than your example... I'm not certain of its inner workings (I have done very little reading on it), but I believe it is a very large and complex automated inference system. Such systems are rather trivial to implement (Prolog is basically a general-purpose one you can put whatever data you need into), but they're very hard to make useful.
