Combining GeoJSON Features - geojson

I have a large (200 MB) GeoJSON file containing many complex polygons and multipolygons. A heavily truncated example is at https://gist.github.com/jinky32/81f61e1fc118822ba103?short_path=d16949b
As you can see, this file consists of polygons and multipolygons that have a String property of either 1 or 2. Below is an example of how these shapes look on mapshaper.org when highlighting a multipolygon of either value in the same tile (essentially, 90% or more of this tile is made up of a multipolygon with one value or the other).
I do not need to differentiate between these values, so polygons and multipolygons with a String value of either 1 or 2 can be combined, which I hope will reduce the file size.
Can anyone advise how I can achieve this, preferably with a CLI tool?

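The answer is to use ogr2ogr with the SQLite dialect, whose st_union aggregate merges every feature whose string property is '1' or '2' into a single geometry. (In GDAL 2.2 and later the GeoJSON layer is normally named after the input file, here geojsontest, rather than OGRGeoJSON.)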
ogr2ogr -f "GeoJSON" -dialect sqlite -sql "select st_union(geometry) as geometry from OGRGeoJSON where string in ('1','2')" gj_union_test.json geojsontest.json
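Before running the union on the full file, it can help to confirm which features the WHERE clause will actually select. The following is a minimal sketch in plain Python, not part of the original answer: it counts features per property value and geometry type. The property name string is taken from the SQL above, the helper name summarize is hypothetical, and the sample collection is made up.

```python
import json
from collections import Counter


def summarize(collection, prop="string"):
    """Count features per (property value, geometry type) pair."""
    counts = Counter()
    for feature in collection.get("features", []):
        # "properties" and "geometry" may legally be null in GeoJSON.
        value = (feature.get("properties") or {}).get(prop)
        geom_type = (feature.get("geometry") or {}).get("type")
        counts[(value, geom_type)] += 1
    return counts


# Tiny made-up collection standing in for the real file.
sample = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"string": "1"},
   "geometry": {"type": "Polygon", "coordinates": [[[0,0],[1,0],[1,1],[0,0]]]}},
  {"type": "Feature", "properties": {"string": "2"},
   "geometry": {"type": "MultiPolygon", "coordinates": [[[[2,2],[3,2],[3,3],[2,2]]]]}},
  {"type": "Feature", "properties": {"string": "2"},
   "geometry": {"type": "Polygon", "coordinates": [[[4,4],[5,4],[5,5],[4,4]]]}}
]}
""")

counts = summarize(sample)
for (value, geom_type), n in sorted(counts.items()):
    print(value, geom_type, n)
```

For the real data, load the file with json.load instead of the inline sample; note that this reads the whole 200 MB file into memory.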

Related

Problems plotting time-series interactively with Altair

Description of the problem
My goal is quite basic: to plot time series in an interactive plot. After some research I decided to give Altair a try.
There are already QGIS plugins for time-series visualisation, but as far as I'm aware, none for plotting time-series at vector-level, interactively clicking on a map and selecting a Polygon. So that's why I decided to go for a self-made solution using Altair, maybe combining it with Folium to add functionalities later on.
I'm totally new to the Altair library (as well as Vega and Vega-Lite), and quite new to data science and data visualisation as well... so apologies in advance for my ignorance!
There are already well-explained tutorials on how to plot time series with Altair (for example here, or on the official website). However, my use case has some particularities that, as far as I've seen, have not yet been addressed together.
The data is produced using the Python API for Google Earth Engine and preprocessed with Python and the pandas/geopandas libraries:
In Google Earth Engine, a vegetation index (NDVI in the current case) is computed at pixel level for a certain region of interest (ROI). Then the function image.reduceRegions() is mapped across the ImageCollection to compute the mean NDVI in every polygon of a FeatureCollection, whose polygons represent agricultural parcels. The resulting vector file is exported.
In a JupyterLab environment, the data is loaded into a geopandas GeoDataFrame and preprocessed (transposing the DataFrame and creating a datetime column, among other steps) so that it is well shaped for time-series representation with Altair.
Data overview after preprocessing:
My "final" goal would be to show, in a single graphic, an interactive line plot with one line per agricultural parcel, with parcels categorized by crop type in different colours, e.g. corn in green, wheat in yellow, pear trees in brown... (the crop type of each parcel can be added to the DataFrame by joining it with another DataFrame).
I am thinking of something looking more or less like the following example, with legend's years being the parcels coloured by crop types:
But so far I haven't managed to make my data look this way... at all.
As you can see, there are many nulls in the data (this is due to the application of a cloud-masking function and to the fact that several Sentinel-2 orbits intersect the ROI). I would like to simply omit the null values for each column/parcel, but I don't know if this data configuration can pose problems (any advice on that?).
So far I got:
Generating the preceding graphic, for a single parcel, already takes around 23 seconds, which is something that maybe should/could be improved (how?).
And more importantly, the expected line representing the item/polygon/parcel's values (NDVI) is not even shown in the plot (note that I chose a parcel containing rather few non-null values).
For sure I am doing many things wrong. Would be great to get some advice to solve (some of) them.
Sample of the data and code to reproduce the issue
Here's a text sample of the data in JSON format, and the code used to reproduce the issue is the following:
import pandas as pd
import geopandas as gpd
import altair as alt
df = pd.read_json(r"path\to\json\file.json")
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)
df
Output:
lines = alt.Chart(df).mark_line().encode(
    x='date:O',
    y='17811:Q',
    color=alt.Color(
        '17811:Q', scale=alt.Scale(scheme='redyellowgreen', domain=(-1, 1)))
)
lines.properties(width=700, height=600).interactive()
Output:
Thanks in advance for your help!
If I understand correctly, it is mostly the format of your dataframe that needs to be changed from wide to long, which you can do either via .melt in pandas or .transform_fold in Altair. With melt, the default names for the melted columns are 'variable' (the previous column names) and 'value' (the value for each column):
alt.Chart(df.melt(id_vars='date'), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
The gaps come from the NaNs; if you want Altair to connect the line across missing values, you can drop the NaNs:
alt.Chart(df.melt(id_vars='date').dropna(), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
If you want to do it all in Altair, the following is equivalent to the last pandas example above (the transform uses 'key' instead of 'variable' as the name for the former columns). I also use an ordinal instead of a nominal type for the color encoding to show how to make the colors more similar to your example:
alt.Chart(df, width=500).mark_line().encode(
    x='date:T',
    y='value:Q',
    color=alt.Color('key:O')
).transform_fold(
    df.drop(columns='date').columns.tolist()
).transform_filter(
    'isValid(datum.value)'
)
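To see the wide-to-long reshaping on its own, without any plotting, here is a minimal pandas sketch. The frame is made up for illustration: the second parcel column 17812 and all values are invented, and only the column name 17811 comes from the question.

```python
import pandas as pd

# Made-up wide frame: one row per date, one column per parcel (values invented).
wide = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-06", "2019-01-11"]),
    "17811": [0.41, None, 0.47],
    "17812": [None, 0.22, 0.25],
})

# melt stacks the parcel columns into (variable, value) pairs,
# giving one row per date and parcel.
long = wide.melt(id_vars="date")

# Dropping the NaN rows keeps only real observations, so a line mark
# has no missing points to break on.
long_valid = long.dropna(subset=["value"])

print(long_valid)
```

The long frame has the columns date, variable and value, which is the shape the encodings above expect.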

What are the best options for tippecanoe to process LineString features?

I have a GeoJSON file with a FeatureCollection (more than 300,000 features) of LineStrings containing road traffic records. I need to convert it to the MVT format using Tippecanoe. I'm trying to convert the GeoJSON with these parameters:
tippecanoe data.geojson -pf -pS -zg --detect-shared-borders -o data.mbtiles -f
Then I upload it to my Mapbox account as a tileset and render it with Mapbox GL JS. The problem is that not all the features are visible. Moreover, if I reconvert the GeoJSON file, I get a different result! So what are the best options to use with tippecanoe to convert all the features (LineStrings) without oversimplification, for use with Mapbox GL JS?
P.S. One more thing I noticed is that datasets uploaded with Mapbox Studio and then converted to a tileset carry a note like "This layer contains mostly LineStrings", whereas for my own tilesets converted with tippecanoe I see the message "* No dominant geometry type*".
-ae (--extend-zooms-if-still-dropping) will automatically increase the maxzoom if features are still being dropped at that zoom level. When zoomed out, however, the result doesn't always look good, depending on the type of features (e.g. a cadastre with missing parcels doesn't look good)...

Store GIS Quadtree in Raw File (w/ Geohash) or PNG?

I am collecting GIS data consisting of four normalized values for the whole world. I am curious what the best way to store this data would be and would like your advice. Would it be more efficient (in terms of size) to store the four values of the quadtree along with a Geohash index based on a Z-order (Morton) or Hilbert curve? Or would it be more efficient to store the data in a PNG file, using alpha = 0 for empty areas and lossless compression? The enclosed image 1 visualizes only one of the four values over Google Maps, and I need to store this global data every day. Please note that I will store only the leaf nodes, as visualized in image 1, rather than the whole quadtree. I will also store this data over time, so I would also like to hear your ideas on how much video compression would help.
Thank you all in advance for your time and consideration!
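For reference, the Z-order (Morton) index mentioned above is obtained by interleaving the bits of a cell's column and row indices. The following is a minimal sketch, assuming leaf cells are addressed by non-negative integer (x, y) grid coordinates; the function names and the 16-bit default are illustrative choices, not part of the question.

```python
from typing import Tuple


def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x takes the even bit positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_decode(code: int, bits: int = 16) -> Tuple[int, int]:
    """Recover (x, y) by pulling the even and odd bits apart again."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y


# Cells that are close in the grid usually get close indices, so sorting
# leaf nodes by this code keeps spatial neighbours near each other on disk.
cells = [(3, 3), (0, 1), (1, 0), (0, 0)]
ordered = sorted(cells, key=lambda c: morton_encode(*c))
print(ordered)
```

Sorting the leaf nodes by this key before writing them out is one way to make a raw file friendlier to a general-purpose compressor; whether it beats PNG for this data would have to be measured.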

Why does ELKI need db.in file in addition to distance matrix? Also what should db.in file contain?

I tried to follow this tutorial on using ELKI with pre-computed distances for clustering.
http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances
I used the following set of command line options:
-dbc.filter FixedDBIDsFilter -dbc.startid 0 -algorithm clustering.OPTICS
-algorithm.distancefunction external.FileBasedDoubleDistanceFunction
-distance.matrix /path/to/matrix -optics.minpts 5 -resulthandler ResultWriter
ELKI fails with a configuration error saying that a db.in file is needed to make the computation.
The following configuration errors prevented execution:
No value given for parameter "dbc.in":
Expected: The name of the input file to be parsed.
No value given for parameter "parser.distancefunction":
Expected: Distance function used for parsing values.
My question is: what is the db.in file? Why should I provide it in addition to the distance matrix file, since the pairwise distance matrix file completely specifies all the information about the point cloud? (Also, I don't have access to any information other than the pairwise distances.)
What should I do about db.in? Should I override it, or specify some dummy information, etc.? Kindly help me understand.
Thank you.
This is documented in the ELKI HowTos:
http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances
Using without primary data
-dbc DBIDRangeDatabaseConnection -idgen.count 100
However, there is a bug (a patch is on the HowTo page, and the fix will be in the next release), so right now you can't fully use this; as a workaround, you can use a text file that enumerates the objects.
The reason for this is that ELKI is designed to work on multi-relational data. It does not just process matrices: some algorithms may need, for example, a geographic representation of an object, some measurements for this object, and a label for evaluation. That is three relations.
What the DBIDRange data source essentially does is create a single "fake" relation that is just the DBIDs 0 to 99. On algorithms that don't need actual data, but only distances (e.g. LOF or DBSCAN or OPTICS), it is sufficient to have object IDs and a distance matrix.
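If the pairwise distances currently live in a square matrix, the sketch below shows one way to write them out as explicit ID pairs that line up with DBIDs starting at 0. It assumes the distance parser expects one whitespace-separated "id1 id2 distance" triple per line; that format and the helper name matrix_to_triples are assumptions to check against the HowTo page for your ELKI version, and the matrix values are made up.

```python
def matrix_to_triples(matrix, start_id=0):
    """Turn a square distance matrix into 'id1 id2 distance' lines.

    Every ordered pair is written (including i == j), so the output does
    not rely on the reader treating the matrix as symmetric.
    """
    n = len(matrix)
    lines = []
    for i, row in enumerate(matrix):
        if len(row) != n:
            raise ValueError("distance matrix must be square")
        for j, dist in enumerate(row):
            lines.append(f"{start_id + i} {start_id + j} {dist}")
    return lines


# Made-up 3 x 3 matrix for illustration.
distances = [
    [0.0, 1.5, 2.0],
    [1.5, 0.0, 0.5],
    [2.0, 0.5, 0.0],
]

triples = matrix_to_triples(distances)
print("\n".join(triples))
```

With three objects the IDs run from 0 to 2, which would correspond to -idgen.count 3 in the command above.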

How to read Mahout clustering output

I have run the k-Means clustering algorithm on the synthetic control data from the Mahout tutorial, and was wondering if someone could explain how to interpret the output. I ran clusterdump and received output that looks something like this (truncated to save space):
CL-592{n=57 c=[30.726, 29.813...] r=[3.528, 3.597...]}
Weight : [props - optional]: Point:
1.0 : [distance=27.453962995925863]: [24.672, 35.261, 30.486...]
1.0 : [distance=27.675053294846002]: [25.592, 29.951, 34.188...]
1.0 : [distance=28.97727289419493]: [30.696, 32.667, 34.223...]
1.0 : [distance=21.999685652862784]: [32.702, 35.219, 30.143...]
...
CL-598{n=50 c=[29.611, 29.769...] r=[3.166, 3.561...]}
Weight : [props - optional]: Point:
1.0 : [distance=27.266203490250472]: [27.679, 33.506, 23.594...]
1.0 : [distance=28.749781351838173]: [34.727, 28.325, 30.331...]
1.0 : [distance=32.635136046420186]: [27.758, 33.859, 29.879...]
1.0 : [distance=29.328974057024624]: [29.356, 26.793, 25.575...]
Could someone explain to me how to read this? From what I understand, CL-__ is a cluster ID, followed by n=number of points in the cluster, c=centroid as a vector, r=radius as a vector, and then each point in the cluster. Is this correct? Furthermore, how do I know which clustered point matches up with which input point? i.e. are the points described as a key-value pair where the key is some kind of ID for the point and the value is the vector? If not is there some way I can set it up so it is?
I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
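That reading can be checked mechanically. The following is a rough sketch in plain Python that pulls the cluster ID, the point count n, and each point's distance out of text shaped like the dump in the question. The regular expressions are written against that sample only and are not an official description of the clusterdump format.

```python
import re

# Cluster header, e.g. "CL-592{n=57 c=[...] r=[...]}"
HEADER = re.compile(r"^(?P<id>[A-Z]+-\d+)\{n=(?P<n>\d+)")
# Point line, e.g. "1.0 : [distance=27.45]: [24.672, ...]"
POINT = re.compile(r"^(?P<weight>[\d.]+)\s*:\s*\[distance=(?P<distance>[\d.eE+-]+)\]")


def parse_clusterdump(text):
    """Return one dict per cluster with its id, n, and listed point distances."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            clusters.append(
                {"id": header.group("id"), "n": int(header.group("n")), "distances": []}
            )
            continue
        point = POINT.match(line)
        if point and clusters:
            clusters[-1]["distances"].append(float(point.group("distance")))
    return clusters


sample = """CL-592{n=57 c=[30.726, 29.813...] r=[3.528, 3.597...]}
Weight : [props - optional]: Point:
1.0 : [distance=27.453962995925863]: [24.672, 35.261, 30.486...]
1.0 : [distance=27.675053294846002]: [25.592, 29.951, 34.188...]
CL-598{n=50 c=[29.611, 29.769...] r=[3.166, 3.561...]}
Weight : [props - optional]: Point:
1.0 : [distance=27.266203490250472]: [27.679, 33.506, 23.594...]
"""

clusters = parse_clusterdump(sample)
for cluster in clusters:
    print(cluster["id"], cluster["n"], len(cluster["distances"]))
```

On an untruncated dump, the number of point lines collected for each cluster should equal its n.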
As for linking points back to the input that created them, I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints), you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors, which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types to choose from, and it also provides a Vector wrapper class called NamedVector, which allows you to identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
    new SequentialAccessSparseVector(vectorDimensions),
    vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();
// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
// append the VectorWritable wrapper set above, not the bare vector
writer.append(new Text(nVec.getName()), writable);
You can now use this file as input to one of the clustering algorithms.
After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated to each vector. It'll look something like this:
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();
while (reader.next(clusterId, vector))
{
    NamedVector nVec = (NamedVector)vector.getVector();
    // you now have access to the original name using nVec.getName()
}
Try adding the option -of CSV to clusterdump; the result will be easier to use for further processing.
I have the same problem (using Mahout 0.6). I am also a beginner. I need to display the documents to the users in the form of clusters, so I need document names rather than the words corresponding to the clusters. I have been clustering the text documents from a shell script.
