Is it possible to delete small islands from a GeoJSON/TopoJSON file?

Is there a way to remove small islands from my TopoJSON file?
I currently have islands that belong to countries like Spain and Portugal, but I don't want to display these small islands. I tried geojson.io, but deleting an island there deletes everything that belongs to the country, including the mainland in Europe, which is the only part that I want to keep.

Open the TopoJSON map file in a text editor. Break it into separate lines at the string ']],[['.
Then find the MultiPolygon object that you want to reduce.
Take note of the arc numbers in this MultiPolygon.
Now look at the list of arcs.
Each arc is a small part of a Polygon/MultiPolygon boundary. Each arc has an ID, which is its position in the list of arcs in the file.
If you count them, you can see which arcs are used in the MultiPolygon that you are trying to reduce.
In general, small islands/areas are represented by very short arcs (few points in the arc definition).
By removing those arc IDs from the list in your main MultiPolygon you can switch the islands off on the map.
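The manual procedure above can also be scripted. The following is a minimal Python sketch, assuming a standard TopoJSON topology in which each MultiPolygon lists its polygons as rings of arc indexes and a negative index `~i` refers to arc `i` traversed in reverse. The function names and the `min_points` threshold are hypothetical and would need tuning for a real file.

```python
def arc_length(topology, arc_id):
    """Return the number of points in an arc.

    A negative ID refers to arc ~arc_id traversed in reverse.
    """
    index = arc_id if arc_id >= 0 else ~arc_id
    return len(topology["arcs"][index])


def drop_small_polygons(topology, geometry, min_points=20):
    """Return a copy of a MultiPolygon geometry without its small polygons.

    A polygon counts as small when the arcs of its outer ring hold fewer
    than min_points points in total (an assumed, tunable threshold).
    """
    if geometry.get("type") != "MultiPolygon":
        return geometry
    kept = []
    for polygon in geometry["arcs"]:
        outer_ring = polygon[0]  # first ring is the exterior boundary
        if sum(arc_length(topology, a) for a in outer_ring) >= min_points:
            kept.append(polygon)
    return {**geometry, "arcs": kept}
```

The arcs that are no longer referenced stay in the file's arc list. They are simply not drawn, which mirrors the "switch them off" effect of editing the list by hand.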

I was able to do this using this GeoJSON online tool.
http://geojson.io/
I uploaded my TopoJSON file and then selected the island I wanted to delete then clicked "Delete feature".
After that I copied the JSON text back into my file. I had to make sure to keep the first part of my file so that it still worked in the code:
{"type":"Topology","objects":{"states":{"type":"GeometryCollection","crs":{"type":"name","properties":{"name":"urn:ogc:def:crs:OGC:1.3:CRS84"}},
I only pasted the "geometries" section. Then it worked!
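If the file is first converted to plain GeoJSON, the same result can be approximated without clicking through each island. This is a different technique from the manual deletion described above: a Python sketch that keeps only the largest polygon of a MultiPolygon, using the planar shoelace formula on the raw longitude/latitude values. That area is only a rough proxy (it ignores the map projection), and the function names are hypothetical.

```python
def ring_area(ring):
    """Planar shoelace area of a closed ring of [x, y, ...] positions."""
    total = 0.0
    for p, q in zip(ring, ring[1:]):
        total += p[0] * q[1] - q[0] * p[1]
    return abs(total) / 2.0


def keep_largest_polygon(geometry):
    """Reduce a GeoJSON MultiPolygon to its single largest polygon."""
    if geometry.get("type") != "MultiPolygon":
        return geometry
    # Each polygon is a list of rings; the first ring is the exterior.
    largest = max(geometry["coordinates"], key=lambda poly: ring_area(poly[0]))
    return {"type": "Polygon", "coordinates": largest}
```

For a country whose mainland is by far its biggest part this keeps the mainland. A size threshold instead of `max` would be needed where several large parts must survive.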


Extracting PDF Tables into Excel in Automation Anywhere

I have a PDF with tabular data that runs over 50+ pages, and I want to extract this table into an Excel file using Automation Anywhere (I am using the community version of AA 11.3). I watched videos of the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging, and the main reason for that is the values that span multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also be facing challenges with Automation Anywhere, since it does not really provide the right tools to do such a thing and you may need to resort to scripting (VBScript) or Metabots.
Solution 1
This one will try to use purely text extraction and Regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to understand what the exported data looks like. You can export to either Plain or Structured text.
The Plain one is not useful at all as the data is all over the place, without any clear pattern.
The Structured one is much better as the data structure resembles the data from the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First you need to ensure that the Structured data contains only the table, nothing else. You can probably use the Before-After string command for that.
Then you need to check whether you can reliably identify the character width of every column. You can try this yourself by copying the text into Excel, using Text to Columns with the Fixed Width option, and playing around with the sliders.
Then you need to find a way to reliably identify each row and prepare it for the Split command in AA. For that you need a delimiter. But since each data row can actually consist of multiple text rows, you need to create a delimiter of your own. I used the Replace function with the Regular Expression option to replace a specific pattern with a delimiter (pipe). See here.
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consist of several text rows, you will need to use Split again, this time with [ENTER] as the delimiter. Now you need to loop through each text line of a single data row, use the Substring function to extract data based on column width, and concatenate the pieces into a single value that you store somewhere else.
All in all, a painful process.
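To make the logic of Solution 1 concrete, here is a minimal sketch in Python rather than in Automation Anywhere commands (for example, as the basis of a helper script). It assumes the Structured export separates data rows with blank lines and that the columns have fixed character widths. The `COLUMNS` boundaries are hypothetical and have to be measured on the real export.

```python
import re

# Hypothetical (start, end) character offsets of the 5 columns.
COLUMNS = [(0, 12), (12, 32), (32, 52), (52, 60), (60, 72)]


def parse_structured_text(text, columns=COLUMNS):
    """Turn fixed-width text blocks into one list of cell values per data row."""
    records = []
    # Data rows are separated by blank lines; one data row may span
    # several text lines when a text column wraps.
    for block in re.split(r"\n\s*\n", text.strip("\n")):
        if not block.strip():
            continue
        cells = [[] for _ in columns]
        for line in block.splitlines():
            for i, (start, end) in enumerate(columns):
                piece = line[start:end].strip()
                if piece:
                    cells[i].append(piece)
        # Join the wrapped pieces of each column into a single value.
        records.append([" ".join(parts) for parts in cells])
    return records
```

Each returned record corresponds to one spreadsheet row, so the result can be written out as CSV and opened in Excel.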
Solution 2
This may not be applicable, but it's worth a try: open the PDF in Microsoft Word. It will give you a warning; ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier and you will be able to use Macros/VBA or even simple Copy&Paste. I tried it on a random PDF of my own and it works quite well.

Parse PDF file and output single character locations

I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the pdf to svg (via pdf2svg), and then parse the resulting svg to extract single character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to svg essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters, if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paying option, and replacing my whole infrastructure to handle just one limit case seems a big overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
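Once the XML has been produced, the per-glyph boxes can be collected with the standard library. The sketch below assumes the output shape described above: `<page>` elements containing `<text>` elements whose `bbox` attribute is a comma-separated `x0,y0,x1,y1`, with spacing elements that carry no `bbox`. The sample document is made up for illustration and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape of the "-t xml" output.
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000">
<textbox id="0" bbox="72.000,700.000,81.000,712.000">
<textline bbox="72.000,700.000,81.000,712.000">
<text font="ABCDEF+Times" bbox="72.000,700.000,78.000,712.000" size="12.000">H</text>
<text font="ABCDEF+Times" bbox="78.000,700.000,81.000,712.000" size="12.000">i</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def glyph_boxes(root):
    """Collect (page id, character, font, bbox) for every positioned glyph."""
    glyphs = []
    for page in root.iter("page"):
        for text in page.iter("text"):
            bbox = text.get("bbox")
            if bbox is None:
                # Spacing inserted by layout analysis has no position.
                continue
            box = tuple(float(v) for v in bbox.split(","))
            glyphs.append((page.get("id"), text.text, text.get("font"), box))
    return glyphs


glyphs = glyph_boxes(ET.fromstring(SAMPLE_XML))
```

For a real file saved from pdf2txt.py, `ET.parse(path).getroot()` would supply the root element instead of the sample string.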

How can I cluster similar type of sentences based on their context and extract keywords from them

I wanted to cluster sentences based on their context and extract common keywords from similar context sentences.
For example
1. I need to go to home
2. I am eating
3. He will be going home tomorrow
4. He is at restaurant
Sentences 1 and 3 will be similar, with keywords like go and home and maybe their synonyms like travel and house.
A pre-existing API would be helpful, for example using IBM Watson somehow.
This API actually is doing what you are exactly asking for (Clustering sentences + giving key-words):
http://www.rxnlp.com/api-reference/cluster-sentences-api-reference/
Unfortunately, the algorithm used for clustering and for generating the keywords is not available.
Hope this helps.
You can use RapidMiner with Text Processing Extension.
Put each sentence in a separate file and put all the files in one folder.
Put the operators and make a design like below.
Click on the Process Documents from files operator and in the right bar side choose "Edit list" on "Text directories" field. Then choose the folder that contains your files.
Double click on Process Documents from files operator and in the new window add the operators like below design(just the ones you need).
Then run your process.
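If neither an external API nor RapidMiner is an option, the basic idea can be tried in a few lines of Python. This is not what either tool does internally; it is a deliberately naive sketch that groups sentences greedily by Jaccard overlap of their content words and reports the words shared by each group. The stop-word list, the crude "-ing" stripping and the threshold are assumptions, and synonyms such as travel/house are not handled.

```python
# Hypothetical, very small stop-word list for the example sentences.
STOPWORDS = {"i", "am", "is", "at", "to", "he", "will", "be", "need", "the", "a"}


def normalize(word):
    """Lowercase a word and strip a trailing 'ing' (a crude stand-in for stemming)."""
    word = word.lower().strip(".,!?")
    if word.endswith("ing") and len(word) > 4:
        word = word[:-3]
    return word


def content_words(sentence):
    return {normalize(w) for w in sentence.split()} - STOPWORDS - {""}


def cluster_sentences(sentences, threshold=0.3):
    """Greedy clustering: join the first cluster whose vocabulary overlaps enough."""
    clusters = []
    for sentence in sentences:
        words = content_words(sentence)
        for cluster in clusters:
            union = words | cluster["vocabulary"]
            if union and len(words & cluster["vocabulary"]) / len(union) >= threshold:
                cluster["sentences"].append(sentence)
                cluster["vocabulary"] |= words
                cluster["keywords"] &= words  # keep only words common to all
                break
        else:
            clusters.append(
                {"sentences": [sentence], "vocabulary": set(words), "keywords": set(words)}
            )
    return clusters


SENTENCES = [
    "I need to go to home",
    "I am eating",
    "He will be going home tomorrow",
    "He is at restaurant",
]
clusters = cluster_sentences(SENTENCES)
```

On the four example sentences this puts sentences 1 and 3 in one group with the shared keywords "go" and "home", and leaves the other two on their own.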

Finding Possible Words in an Ever Changing Grid

I am working on a word game where the user creates words from an ever changing grid of letters. Validating the users selection is easy enough to do using a wordlist.
Since the playing grid is randomly generated and previously played tiles are removed and replaced with new letters, I need to be able to efficiently check for possible valid plays between each user submission, so if there are no possible valid words I can reset the grid or something to that effect. The solution only needs to detect that there is at least one valid 3 - 7 letter word within the current set of tiles. It does not need to know every possible combination. A user can start on any tile and build a word by moving one tile away in any direction from the currently selected letter.
Important: The solution can't slow the game play down. As soon as the user submits the current word and the new tiles appear they can start a new selection without delay.
Any direction would be greatly appreciated as I have not been able to find what I think I'm looking for with any google searches so far.
Building with Swift for iOS8+
As #jamesp mentioned, a trie is your best bet. You have a couple of advantages here, the first one being that you can bail out the second you have found a word. You don't have to scan the whole grid for all possible words, just find one and be done with it. Start with any random letter and look at the ones around it, and match those up against the trie. If you find a match, continue with the letters around that one and so on until you have found a word or a dead end. If the current start tile doesn't have a full word around it, go on to the next tile in the grid and try from there again.
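The trie walk described above can be sketched as follows. The example is in Python for brevity (the same structure carries over to Swift) and makes a few assumptions that are not stated in the question: the grid is a list of rows of lowercase letters, a tile may be used only once per word, and diagonal neighbours count as adjacent. The `"$"` end-of-word marker is an arbitrary choice.

```python
END = "$"  # marks a node where a complete word ends


def build_trie(words, min_len=3, max_len=7):
    """Build a nested-dict trie from the words within the allowed length range."""
    root = {}
    for word in words:
        if min_len <= len(word) <= max_len:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True
    return root


def has_valid_word(grid, trie):
    """Return True as soon as any word in the trie can be traced on the grid."""
    rows, cols = len(grid), len(grid[0])

    def search(r, c, node, used):
        child = node.get(grid[r][c])
        if child is None:
            return False  # dead end: no word continues with this letter
        if END in child:
            return True  # bail out on the first complete word
        used.add((r, c))
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols \
                        and (nr, nc) not in used:
                    if search(nr, nc, child, used):
                        return True
        used.discard((r, c))
        return False

    return any(search(r, c, trie, set()) for r in range(rows) for c in range(cols))
```

The trie only needs to be built once at launch. Each check then prunes a path as soon as its prefix is not in the dictionary, and the recursion depth is bounded by the longest allowed word.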
It's going to take quite a bit of processing time to get through a problem like that, especially since the words can twist and turn.
There's a couple of ways to approach that, but if you have a fast dictionary lookup, you'll probably want to step through your puzzle, starting at the upper-left tile, and look at the letter there. Say it is "S". Your dictionary will provide you with a list of acceptable "S" words. You can step through those words looking at the second letter of each word and seeing if there is an adjacent tile to the current tile that has that letter. If not, you're done - move to the next tile. If so, you can do that exact same process again recursively for words starting with "S" and the next letter.
For instance, say the current tile is "S". You'd look up your list of "S" words, and start looping through them. One of them is "Syzygy". If there is no adjacent "Y" tile, you're done - move to the next word. If there is a "Y" tile adjacent, look around that tile for a "Z". If there is none, you're done. Otherwise, move to the second "Y", and so on. (If tiles can only be used once for a word, you may have to remember tiles to exclude from later letters, so that the player can't use the same "Y" three times to spell that "Syzygy".)
I'd wager that the vast majority of your tiles would be excluded quickly with this approach, but it still could take a long time to process, especially if your grid is large. You can address this by running that check in the background, and letting the player continue to play while it's checking, and then show an alert when you finally ascertain that there are no valid plays.
Keep in mind that just because there is one valid word in the puzzle, it doesn't mean that it's still really solvable by the end user. This sort of puzzle isn't like your typical match-three games where if you just look long enough, you'll find the match. Most "valid word" lists are going to contain many words that most people aren't going to know. Words like "propale" and "helctic" and "syzygy". (And you can't exclude words like that, because then when someone finds one, instead of the intense satisfaction of finding an obscure word, they get the intense frustration of "But that's a real word, dang it!")
So, probably what you want is an assessment of how obscure the existing words are. If "dog" is the only word available, that's still probably pretty solvable for most people, whereas if the only words available are "propale" and "helctic" and "syzygy", that's probably impossible for most people, even though there are more words available.
To do that, you'll need to rank your dictionary words as to how common they are, and then add up the ubiquity score for the existing words to make that sort of assessment, and calling the puzzle "unsolvable" if it doesn't reach a certain threshold for that score. Same algorithm, but you'll be adding a score for each word you find. And you can't just quit when you find the first word, but you can quit when you reach that threshold.
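The threshold idea can be sketched independently of how the words are found. The Python function below assumes some search yields the words present in the grid one at a time, and that `ubiquity` is a hypothetical mapping from word to a commonness score. It stops consuming words as soon as the running total reaches the threshold.

```python
def is_practically_solvable(found_words, ubiquity, threshold):
    """Sum ubiquity scores of found words and stop once the threshold is reached."""
    total = 0.0
    for word in found_words:
        total += ubiquity.get(word, 0.0)  # unknown words contribute nothing
        if total >= threshold:
            return True
    return False
```

Because `found_words` can be a generator, the grid search does not have to enumerate every word when a few common ones are enough.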
If this sounds daunting, then that's because you've designed yourself into a corner. A better approach might be letting the user swap some tiles if they're stumped, or letting them add a wildcard letter, etc. Things that they manage, so that even if there are no real words in the puzzle, they still have strategic options. Then you don't even have to solve this problem, and it solves the deeper problem of knowing whether a puzzle is practically solvable rather than technically solvable.
Why not construct your grid by first putting a randomly selected valid word somewhere on it and then filling in the blank spaces with random letters?
Edit
So if you can't do that, one way that might be quick enough is to organise your words in a trie. This might be enough to make the search fast enough. The idea would be to iterate through the letters in the grid and for each one select the appropriate first letter in the trie. This makes the search for each allowed neighbour smaller and so on until either your trie runs out or you find a word.
I would also select the random letters in a distribution that mirrors the distribution of letters in your dictionary. One way to do this is to
count the number of each letter in your dictionary to give each one a weight
generate a random number between 0 and the total of all the weights
iterate through the letters, subtracting each one's weight from your random number
when the subtraction gets below zero, the letter you are on is the one you want.
You can speed the above up by using a binary search, but there are only 26 letters, so the extra complication isn't worth it.
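The four steps above can be sketched in Python as follows. The function names are hypothetical, and the `rng` parameter is only there so the random source can be swapped out for testing.

```python
import random
from collections import Counter


def letter_weights(words):
    """Count how often each letter occurs across the dictionary."""
    return Counter(ch for word in words for ch in word)


def pick_letter(weights, rng=random):
    """Pick a letter with probability proportional to its weight."""
    remaining = rng.uniform(0, sum(weights.values()))
    letter = None
    for letter, weight in weights.items():
        remaining -= weight
        if remaining < 0:
            return letter
    # Only reached if the random number equals the total exactly.
    return letter
```

In Python specifically, `random.choices(list(weights), weights=list(weights.values()))` gives the same distribution; the explicit loop is shown because it ports directly to other languages.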

XNA 2D windows phone level editor

Can anyone kindly point me in the right direction as to what the best way is to make a 2D Level Editor in XNA for Windows Phone?
I have a game that's almost finished but I wish to create multiple versions of it in the future with different levels and themes etc.
What would be the best way to do this?
I suggest not making one and using something like Tiled instead. It will save stages in a relatively simple XML format, and there's even a C# library to read the Tiled files.
There are plenty of other good editors as well. I recommend going in this direction because quite frankly you will spend way too much time making your own.
To expand further on Tiled, you can use the aforementioned library to parse a TMX file, which is what Tiled produces when you save your map. Read the Usage section on the GitHub page; it looks pretty simple to use.
When you parse a Tiled element, say a specific tile index or a Tiled "object", you have to map that to something useful in your game (a graphical sprite texture, an enemy or object the player can interact with, etc.). For tiles, you can manage this via enums (create an alias for each tile type and assign it the exact tile number from your tilesheet), or even just an array that follows the same mapping. For objects, use Tiled's object properties to assign meaningful values that you define, which then get saved along with the TMX and you can parse them using that library.
For example, you could define a property in a Tiled object called "enemytype" and give it the value "lizard". The code when parsing could look for this property and value, and create a Lizard object when it's parsed.
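The "enemytype" mapping can be illustrated without the C# library, since a TMX file is plain XML. The sketch below is in Python with the standard XML parser; it is not the API of the library mentioned above. The `Lizard` class, the `ENEMY_TYPES` table and the sample map are hypothetical.

```python
import xml.etree.ElementTree as ET


class Lizard:
    """Hypothetical game object created from a Tiled object."""

    def __init__(self, x, y):
        self.x, self.y = x, y


# Maps the value of the custom "enemytype" property to a game class.
ENEMY_TYPES = {"lizard": Lizard}

SAMPLE_TMX = """<map version="1.0" orientation="orthogonal" width="2" height="2" tilewidth="32" tileheight="32">
  <objectgroup name="enemies">
    <object id="1" x="64" y="96">
      <properties>
        <property name="enemytype" value="lizard"/>
      </properties>
    </object>
    <object id="2" x="10" y="20"/>
  </objectgroup>
</map>"""


def spawn_enemies(tmx_text):
    """Create a game object for every Tiled object with a known enemytype."""
    enemies = []
    for obj in ET.fromstring(tmx_text).iter("object"):
        props = {p.get("name"): p.get("value") for p in obj.iter("property")}
        factory = ENEMY_TYPES.get(props.get("enemytype"))
        if factory is not None:
            enemies.append(factory(float(obj.get("x")), float(obj.get("y"))))
    return enemies


enemies = spawn_enemies(SAMPLE_TMX)
```

Adding a new enemy then only requires a new entry in the lookup table and a matching property value in Tiled.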
If you've got a nearly-finished game, then I'd assume that somewhere in there is code to load and display the map you have. Extract that code and you're halfway there. Then you just need to add some way of adding to the map it's reading from, and save it back to the same format.
If your maps are currently created in memory, then you'll need to figure out a file format you can save them as (XML or JSON works, but a big CSV of ints for tile types works too and is simpler). Then you'll need code to read in from that format and populate your current map model.
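As an illustration of the "big CSV of ints" option, here is a minimal Python round trip. The real game would do the equivalent in C#; the function names and the meaning of the tile numbers are hypothetical.

```python
import csv
import io


def save_map(tiles):
    """Serialize a 2D list of tile type numbers as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(tiles)
    return buffer.getvalue()


def load_map(text):
    """Parse CSV text back into a 2D list of ints, skipping empty lines."""
    return [[int(cell) for cell in row] for row in csv.reader(io.StringIO(text)) if row]
```

Each row of the file is one row of the map, so a level can also be edited by hand in a spreadsheet or text editor.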
