I have a large collection of documents containing geospatial point data (among other data). I have another large collection of documents containing polygons (among other data).
I want to filter queries on the point data by whether the points are included in any of the polygons that have a certain property.
Can I do this with RavenDB and if so, how?
Things I thought about:
I can't see how I could do this with an index, because indexes only map (and/or reduce), so I cannot query one collection by another.
I can't just make the query and rely on Raven's result caching, because querying by the set of polygons would quickly make the query length exceed any sensible query length limit.
In RavenDB you can define indexes to deal with spatial data.
Let's assume you have a document with a spatial polygon defined as WKT
(WKT = Well-Known Text markup: https://en.wikipedia.org/wiki/Well-known_text):
public class EventWithWKT
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string WKT { get; set; }
}
By polygon WKT I mean something like:
POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))
Then you can define an index that handles the WKT like this:
public class EventsWithWKT_ByNameAndWKT : AbstractIndexCreationTask<EventWithWKT>
{
    public EventsWithWKT_ByNameAndWKT()
    {
        Map = events => from e in events
                        select new
                        {
                            Name = e.Name,
                            WKT = e.WKT
                        };

        Spatial(x => x.WKT, options => options.Geography.Default());
    }
}
By calling the "Spatial" method in the index definition, RavenDB creates a special spatial field that can then be queried with spatial queries.
Now, filtering on whether a certain point is within a document's polygon can be done with the following query:
var results = session
    .Query<EventWithWKT, EventsWithWKT_ByNameAndWKT>()
    .Customize(x => x.RelatesToShape("WKT", "POINT (30 10)", SpatialRelation.Within))
    .ToList();
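To get at the original question — restricting matches to polygons that have a certain property — the spatial customization can be combined with an ordinary filter on an indexed field. A minimal sketch, assuming the property in question is the indexed Name field from the example above ("Festival" is a hypothetical value):

// Find polygon documents named "Festival" that contain the point (30 10);
// if any documents match, the point lies inside at least one such polygon.
var matches = session
    .Query<EventWithWKT, EventsWithWKT_ByNameAndWKT>()
    .Customize(x => x.RelatesToShape("WKT", "POINT (30 10)", SpatialRelation.Within))
    .Where(e => e.Name == "Festival") // hypothetical property filter
    .ToList();

bool pointIsCovered = matches.Any();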
That is not the only way to define and use spatial data. You can read more about RavenDB spatial indexes in these documentation articles:
defining spatial indexes
querying spatial indexes
Looking into the Java query approach or the Kotlin-based DNQ, I can't see how to make queries similar to 'group by' ... What is a proper approach for such queries? E.g. when I have invoice entities and I would like to group them by company name and sum the sales.
Something like this can help you achieve an emulated "GROUP BY":
entityStore.executeInTransaction(
    new StoreTransactionalExecutable() {
        @Override
        public void execute(@NotNull final StoreTransaction txn) {
            // Count the entities of each type, emulating GROUP BY entity type
            final Map<String, Long> countsByType = new HashMap<>();
            txn.getEntityTypes().forEach(type -> {
                final EntityIterable result = txn.getAll(type);
                countsByType.put(type, result.count());
            });
        }
    });
Essentially, this queries all entity types and then does a count for each, similar to what GROUP BY does. The countsByType map above collects each entity type together with its count.
I'm trying to compare Vespa's query capabilities with both ES and MongoDB, but I'm having a hard time figuring out what kind of support YQL has for advanced queries over JSON. (By the way, this would be an awesome post for the Vespa blog.)
For example, given an object Person (see example below) with nested documents and/or an array of objects, how could I do the following:
Select all Persons whose hobbies contains 'sport'.
Select all Persons whose Phones area code equals 'NY'.
Select all Persons whose Mother's Birthdate is greater than 1960.
Person = {
    Name: 'Joe',
    Hobbies: ['sports', 'books', 'bonzais'],
    Phones: [{Number: '12-3456-7890', areaCode: 'NY'}, {Number: '22-3456-7890', areaCode: 'CA'}],
    Mother: {
        Name: 'Mom',
        Birthdate: '1961-24-02'
    }
}
Also, are there any best practices regarding how should I model an object for Vespa/YQL?
Thanks in advance.
A clarification first: YQL is just a query syntax. The JSON query language (https://docs.vespa.ai/documentation/reference/select-reference.html) is another. Yet another way (the most common) is to construct queries directly from the data received from clients in a Searcher (Java) component.
Below I show how to construct your three examples in each of these variants. Vespa does not have a date type, so here I've assumed you have a 'Birthyear' integer field instead.
Select all Persons whose hobbies contains 'sport'.
// YQL (as GET URL parameters)
?query=select * from Persons where hobbies contains 'sports';&type=yql
// JSON (POST body)
{"select" : { "where" : { "contains" : [ "hobbies", "sports" ] } } }
// Java code
query.getModel().getQueryTree().setRoot(new WordItem("sports", "hobbies"));
Select all Persons whose Phones area code equals 'NY'.
// YQL (as GET URL parameters)
?query=select * from Persons where phones.areaCode contains 'NY';&type=yql
// JSON (POST body)
{"select" : { "where" : { "contains" : [ "phones.areaCode", "NY" ] } } }
// Java code
query.getModel().getQueryTree().setRoot(new WordItem("NY", "phones.areaCode"));
Select all Persons whose Mother's Birthdate is greater than 1960.
// YQL (as GET URL parameters)
?query=select * from Persons where mother.Birthyear > 1960;&type=yql
// JSON (POST body)
{"select" : { "where" : { "range" : [ "mother.Birthyear", { ">": 1960}] } } }
// Java code
query.getModel().getQueryTree().setRoot(new IntItem(">1960", "mother.Birthyear"));
Note:
Structured fields are referenced by dotting into the structures.
'Contains' becomes (has these tokens) or (equals) depending on the field's matching setting.
I have a rather huge Table on Azure (30 million rows, 5–100 KB each).
Each RowKey is a Guid and the PartitionKey is the first part of the Guid, for example:
PartitionKey = "1bbe3d4b"
RowKey = "1bbe3d4b-2230-4b4f-8f5f-fe5fe1d4d006"
The table gets 600 reads and 600 writes (updates) per second with an average latency of 60 ms. All queries use both PartitionKey and RowKey.
BUT, some reads take up to 3,000 ms (!). On average, more than 1% of all reads take over 500 ms, and there's no correlation with entity size (a 100 KB row may be returned in 25 ms and a 10 KB one in 1,500 ms).
My application is an ASP.NET MVC 4 website running on 4-5 Large instances.
I have read all the MSDN articles regarding Azure Table Storage performance goals and have already done the following:
UseNagle is turned Off
Expect100Continue is also disabled
MaxConnections for the table client is set to 250 (setting it to 1000–5000 makes no difference); these are applied as in the sketch below
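For reference, here is roughly how I apply those three settings at startup via System.Net's ServicePointManager (a minimal sketch; the table endpoint URI below is a placeholder):

// Applied once before any storage requests are made (e.g., in Application_Start).
var tableServicePoint = ServicePointManager.FindServicePoint(
    new Uri("https://myaccount.table.core.windows.net")); // placeholder URI
tableServicePoint.UseNagleAlgorithm = false;
tableServicePoint.Expect100Continue = false;
tableServicePoint.ConnectionLimit = 250;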
Also I checked that:
Storage account monitoring counters have no throttling errors
There are "waves" in performance, though they do not depend on load
What could be the reason for such performance issues, and how can I improve them?
I use the MergeOption.NoTracking setting on the DataServiceContext.MergeOption property for extra performance if I have no intention of updating the entity anytime soon. Here is an example:
var account = CloudStorageAccount.Parse(RoleEnvironment.GetConfigurationSettingValue("DataConnectionString"));
var tableStorageServiceContext = new AzureTableStorageServiceContext(account.TableEndpoint.ToString(), account.Credentials);
tableStorageServiceContext.RetryPolicy = RetryPolicies.Retry(3, TimeSpan.FromSeconds(1));
tableStorageServiceContext.MergeOption = MergeOption.NoTracking;
tableStorageServiceContext.AddObject(AzureTableStorageServiceContext.CloudLogEntityName, newItem);
tableStorageServiceContext.SaveChangesWithRetries();
Another problem might be that you are retrieving the entire entity with all its properties even though you intend to use only one or two of them. This is of course wasteful, but can't easily be avoided. However, if you use Slazure, you can use query projections to retrieve only the entity properties you are interested in from table storage and nothing more, which would give you better query performance. Here is an example:
using SysSurge.Slazure;
using SysSurge.Slazure.Linq;
using SysSurge.Slazure.Linq.QueryParser;

namespace TableOperations
{
    public class MemberInfo
    {
        public string GetRichMembers()
        {
            // Get a reference to the table storage
            dynamic storage = new QueryableStorage<DynEntity>("UseDevelopmentStorage=true");

            // Build a table query that only returns members who earn more than $60k/yr
            // by using a "Where" query filter, and make sure that only the "Name" and
            // "Salary" entity properties are retrieved from the table storage to make
            // the query quicker.
            QueryableTable<DynEntity> membersTable = storage.WebsiteMembers;
            var memberQuery = membersTable.Where("Salary > 60000").Select("new(Name, Salary)");

            var result = "";

            // Cast each query result to a dynamic so that we can access its dynamic properties
            foreach (dynamic member in memberQuery)
            {
                // Show some information about the member
                result += "LINQ query result: Name=" + member.Name + ", Salary=" + member.Salary + "<br>";
            }

            return result;
        }
    }
}
Full disclosure: I coded Slazure.
You could also consider pagination if you are retrieving large data sets, for example:
// Skip the first 50 members and retrieve the next 50
var memberQuery = membersTable.Where("Salary > 60000").Skip(50).Take(50);
Typically, if a specific query requires scanning a large number of rows, it will take longer. Is the behavior you are seeing specific to a particular query / data? Or are you seeing performance vary for the same data and query?
If you have a Car model with 20 or so properties (and several table joins) for a carDetail page, then your LINQ to SQL query will be quite large.
If you have a carListing page which uses under 5 properties (all from 1 table), then you use a CarSummary model. Should the CarSummary model be populated using the same query as the Car model?
Or should you use a separate LINQ to SQL query which would be more precise?
I am just thinking of performance, but LINQ uses lazy loading anyway, so I am wondering if this is an issue or not.
Create View Models to represent the different projections you require, and then use a select projection as follows.
from c in Cars
select new CarSummary
{
    Registration = c.Registration,
    ...
}
This will create a query that only selects the properties needed.
Relationships will be resolved if they are represented in the data context diagram (.dbml):
select new CarSummary
{
    OwnerName = c.Owner.FirstName
}
You can also nest objects inside the projection:
select new CarSummary
{
    ...
    Owner = new OwnerSummary
    {
        OwnerName = c.Owner.FirstName,
        OwnerAge = c.Owner.Age
    }
    ...
}
If you are using the same projection in many places, it may be helpful to write a method as follows, so that the projection happens in one place.
public IQueryable<CarSummary> CreateCarSummary(IQueryable<Car> cars)
{
    return from c in cars
           select new CarSummary
           {
               ...
           };
}
This can then be used where required:
public IQueryable<CarSummary> GetNewCars()
{
    var cars = from c in Cars
               select c;

    return CreateCarSummary(cars);
}
I think that in your case lazy loading doesn't bring much benefit: you are going to use one property from each table, so sooner or later, to render the page, you will have to perform all the joins. In my opinion you could use the same query and convert from a Car model to a CarSummary model.
If performance is actually a concern or an issue currently, you should use a separate projection LINQ query so that the SQL query selects only the 5 fields you need to populate your view model instead of returning all 20 fields.
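For illustration, a minimal sketch of such a dedicated projection query, assuming hypothetical Registration, Make, Model, Year and Price columns on the Cars table:

// Translates to a SELECT of just these five columns rather than all 20.
var summaries = (from c in db.Cars
                 select new CarSummary
                 {
                     Registration = c.Registration,
                     Make = c.Make,
                     Model = c.Model,
                     Year = c.Year,
                     Price = c.Price
                 }).ToList();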
I'm attempting to implement complete search functionality in my ASP.NET MVC (C#, LINQ to SQL) website.
The site consists of about 3-4 tables that each have about 1-2 columns that I want to search.
This is what I have so far:
public List<SearchResult> Search(string keywords)
{
    string[] split = keywords.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    List<SearchResult> ret = new List<SearchResult>();

    foreach (string s in split)
    {
        IEnumerable<BlogPost> results = db.BlogPosts.Where(x => x.Text.Contains(s) || x.Title.Contains(s));

        foreach (BlogPost p in results)
        {
            if (ret.Exists(x => x.BlogPostID == p.BlogPostID))
                continue;

            ret.Add(new SearchResult
            {
                PostTitle = p.Title,
                BlogPostID = p.BlogPostID,
                Text = p.Text
            });
        }
    }

    return ret;
}
As you can see, I have a foreach for the keywords and an inner foreach that runs over a table (I would repeat it for each table).
This seems inefficient and I wanted to know if there's a better way to create a search method for a database.
Also, what can I do to the columns in the database so that they can be searched faster? I read something about indexing them, is that just the "Full-text indexing" True/False field I see in SQL Management Studio?
Also, what can I do to the columns in the database so that they can be searched faster? I read something about indexing them, is that just the "Full-text indexing" True/False field I see in SQL Management Studio?
Yes, enabling full-text indexing will normally go a long way towards improving performance for this scenario. But unfortunately it doesn't work automatically with the LIKE operator (and that's what your LINQ query is generating). So you'll have to use one of the built-in full-text searching functions like FREETEXT, FREETEXTTABLE, CONTAINS, or CONTAINSTABLE.
Just to explain, your original code will be substantially slower than full-text searching as it will typically result in a table scan. For example, if you're searching a varchar field named title with LIKE '%ABC%' then there's no choice but for SQL to scan every single record to see if it contains those characters.
However, the built-in full-text searching will actually index the text of every column you specify to include in the full-text index. And it's that index that drastically speeds up your queries.
Not only that, but full-text searching provides some cool features that the LIKE operator can't give you. It's not as sophisticated as Google, but it has the ability to search for alternate versions of a root word. But one of my favorite features is the ranking functionality where it can return an extra value to indicate relevance which you can then use to sort your results. To use that look into the FREETEXTTABLE or CONTAINSTABLE functions.
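LINQ to SQL won't generate these functions for you, so one common workaround is to drop down to raw SQL for the search itself. A minimal sketch using DataContext.ExecuteQuery, assuming a full-text index already exists on the Title and Text columns of BlogPosts, and where searchCondition is a hypothetical pre-formatted search string:

// CONTAINS uses the full-text index instead of a table scan; {0} is parameterized.
// searchCondition must follow CONTAINS syntax, e.g. "\"word1\" OR \"word2\"".
var results = db.ExecuteQuery<SearchResult>(
    @"SELECT Title AS PostTitle, BlogPostID, Text
      FROM BlogPosts
      WHERE CONTAINS((Title, Text), {0})",
    searchCondition).ToList();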
Some more resources:
Full-Text Search (SQL Server)
Pro Full-Text Search in SQL Server 2008
The following should do the trick. I can't say off the top of my head whether the let kwa = ... part will actually work or not, but something similar will be required to make the array of keywords available within the context of SQL Server. I haven't used LINQ to SQL for a while (I've been using LINQ to Entities 4.0 and NHibernate for some time now, which have a different set of capabilities). You might need to tweak that part to get it working, but the general principle is sound:
public List<SearchResult> Search(string keywords)
{
    var searchResults = from bp in db.BlogPosts
                        let kwa = keywords.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                        where kwa.Any(kw => bp.Text.Contains(kw) || bp.Title.Contains(kw))
                        select new SearchResult
                        {
                            PostTitle = bp.Title,
                            BlogPostID = bp.BlogPostID,
                            Text = bp.Text
                        };

    return searchResults.ToList();
}