Spring-data-elasticsearch: Result window is too large (index.max_result_window)

We retrieve information from Elasticsearch 2.7.0 and we allow the user to go through the results. When the user requests a high page number we get the following error message:
Result window is too large, from + size must be less than or equal to:
[10000] but was [10020]. See the scroll api for a more efficient way
to request large data sets. This limit can be set by changing the
[index.max_result_window] index level parameter
The thing is, we use pagination in our requests, so I don't see why we get this error:
@Autowired
private ElasticsearchOperations elasticsearchTemplate;
...
elasticsearchTemplate.queryForPage(buildQuery(query, pageable), Document.class);
...
private NativeSearchQuery buildQuery(String query, Pageable pageable) {
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    boolQueryBuilder.should(QueryBuilders.boolQuery().must(QueryBuilders.termQuery(term, query.toUpperCase())));
    NativeSearchQueryBuilder nativeSearchQueryBuilder = new NativeSearchQueryBuilder().withIndices(DOC_INDICE_NAME)
            .withTypes(indexType)
            .withQuery(boolQueryBuilder)
            .withPageable(pageable);
    return nativeSearchQueryBuilder.build();
}
I don't understand the error because we retrieve pageable.size (20 elements) every time... Do you have any idea why we get this?

Unfortunately, even when you page the results, Spring Data Elasticsearch asks Elasticsearch for a result window of from + size documents, and from grows with the page number: page 501 with a page size of 20 gives from + size = 10,020, which is over the default index.max_result_window of 10,000. So you have two options. The first is to raise the value of this parameter (a sketch follows).
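For the first option, a minimal sketch of raising the limit through the index settings REST API (Python; the index name and cluster address are assumptions, and keep in mind that deeper pages cost Elasticsearch more memory per shard):

import requests

# Hypothetical index name and local cluster address; adjust to your setup.
resp = requests.put(
    "http://localhost:9200/documents/_settings",
    json={"index": {"max_result_window": 50000}},
)
print(resp.json())  # {"acknowledged": true} on success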
The second is to use the scan/scroll API. However, as far as I understand, in this case the pagination is done manually, as it is meant for infinite sequential reading (like scrolling with your mouse).
A sample:
List<Pessoa> allItens = new ArrayList<>();
// Open a scroll context (1000 ms scroll timeout), then pull pages until empty.
String scrollId = elasticsearchTemplate.scan(build, 1000, false, Pessoa.class);
Page<Pessoa> page = elasticsearchTemplate.scroll(scrollId, 5000L, Pessoa.class);
while (page.hasContent()) {
    allItens.addAll(page.getContent());
    page = elasticsearchTemplate.scroll(scrollId, 5000L, Pessoa.class);
}
elasticsearchTemplate.clearScroll(scrollId); // release the scroll context when done
This code shows you how to read ALL the data from your index; to serve a specific page you have to slice out the requested page while scrolling.
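To make that concrete, here is a rough Python sketch of the same idea against the REST scroll API (the cluster address, index name and match_all query are assumptions; the point is that you discard batches until you reach the page the user asked for):

import requests

ES = "http://localhost:9200"  # assumed cluster address
INDEX = "documents"           # illustrative index name
PAGE, SIZE = 501, 20          # the deep page that triggered the error

# Open a scroll context, then keep pulling batches until the target page.
resp = requests.post(f"{ES}/{INDEX}/_search?scroll=1m",
                     json={"size": SIZE, "query": {"match_all": {}}}).json()
skipped = 0
while skipped < (PAGE - 1) * SIZE and resp["hits"]["hits"]:
    skipped += len(resp["hits"]["hits"])
    resp = requests.post(f"{ES}/_search/scroll",
                         json={"scroll": "1m", "scroll_id": resp["_scroll_id"]}).json()
page_items = resp["hits"]["hits"]  # the requested page (or [] past the end)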

Related

YouTube API - retrieve more than 5k items

I just want to fetch all my liked videos, roughly 25k items. As far as my research goes, this is not possible via the YouTube v3 API.
I have already found multiple issues (issue, issue) on the same problem; some claim to have fixed it, but the fix only works for them because they have fewer than 5,000 items in their liked videos list.
The playlistItems list API endpoint with the playlist id set to "liked videos" (LL) has a limit of 5000.
The videos list API endpoint has a limit of 1000.
Unfortunately those endpoints don't provide parameters that I could use to paginate the requests myself (e.g. give me all the liked videos between date x and y), so I'm forced to take the provided order (which I can't get past 5k entries).
Is there any possibility I can fetch all my likes via the API?
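For reference, a minimal sketch of the pageToken loop that runs into this cap (Python; the access token is a hypothetical placeholder and needs the youtube.readonly OAuth scope, since LL is a private playlist):

import requests

ACCESS_TOKEN = "ya29...."  # hypothetical OAuth token
URL = "https://www.googleapis.com/youtube/v3/playlistItems"
params = {"part": "snippet", "playlistId": "LL", "maxResults": 50}
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

items, page_token = [], None
while True:
    if page_token:
        params["pageToken"] = page_token
    resp = requests.get(URL, params=params, headers=headers).json()
    items.extend(resp.get("items", []))
    page_token = resp.get("nextPageToken")
    if not page_token:
        break  # in practice this stops around 5,000 items for LL
print(len(items))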
A few more thoughts on the reply from #Yarin_007:
If there are deleted videos in the timeline, they appear as "Liked https://...url"; the script doesn't like that format and fails, as the underlying elements don't have the same structure as existing videos.
This can easily be fixed with a try/catch:
function collector(all_cards) {
    var liked_videos = {};
    all_cards.forEach(card => {
        try {
            // ignore Dislikes
            if (card.innerText.split("\n")[1].startsWith("Liked")) {
                ....
            }
        }
        catch {
            console.log("error, probably a deleted video")
        }
    })
    return liked_videos;
}
To scroll down to the bottom of the page I've used this simple script; no need to spin up something big:
var millisecondsToWait = 1000;
// Keep jumping to the bottom; stop with clearInterval(scrollInterval) when done.
var scrollInterval = setInterval(function() {
    window.scrollTo(0, document.body.scrollHeight);
    console.log("scrolling")
}, millisecondsToWait);
If more people want to retrieve this kind of data, one could think about building a proper script that is more convenient to use. If you check the network requests, you can find the desired data in the responses of requests called batchexecute. One could copy the authentication from one of them and provide it to a script that queries those endpoints and prepares the data like the other script I currently inject manually.
Hmm. perhaps Google Takeout?
I have verified that the YouTube data contains a CSV called "liked videos.csv". The header is Video Id,Time Added, and the rows look like
dQw4w9WgXcQ,2022-12-18 23:42:19 UTC
prvXCuEA1lw,2022-12-24 13:22:13 UTC
for example.
So you would need to retrieve video metadata per video ID. Not too bad though (a sketch follows the note below).
Note: the export could take a while, especially with 25k videos. (select only YouTube data)
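A rough sketch of that hydration step (Python; the API key is a hypothetical placeholder, and videos.list accepts up to 50 IDs per call):

import csv
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical; videos.list works with a plain API key
URL = "https://www.googleapis.com/youtube/v3/videos"

# Read the video IDs out of the Takeout export.
with open("liked videos.csv", newline="") as f:
    video_ids = [row["Video Id"] for row in csv.DictReader(f)]

# Hydrate metadata in batches of 50 (the documented per-call maximum).
videos = []
for i in range(0, len(video_ids), 50):
    batch = ",".join(video_ids[i:i + 50])
    resp = requests.get(URL, params={"part": "snippet,contentDetails",
                                     "id": batch, "key": API_KEY}).json()
    videos.extend(resp.get("items", []))
print(len(videos))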
I also had an idea that involves scraping the actual liked videos page (which would save you 25k HTTP requests), but I'm unsure whether it breaks with more than 5,000 songs. Also, emulating the POST requests on that page may prove quite difficult, albeit not impossible (they fetch /browse?key=..., with some kind of obfuscated/encrypted base64 strings in the request body, among other parameters).
EDIT:
Look. There's probably a normal way to get a complete dump of all your Google data. (I mean, other than Takeout. Email them? idk.)
Anyway, the following is the other idea...
Follow this deep link to your liked videos history.
Scroll to the bottom... maybe with Selenium, maybe with AutoIt, maybe put something on the "End" key of your keyboard until you reach your first liked video.
Hit F12 and run this in the developer console:
// https://www.youtube.com/watch?v=eZPXmCIQW5M
// https://myactivity.google.com/page?utm_source=my-activity&hl=en&page=youtube_likes
// go over all "cards" in the activity webpage (after scrolling down to the absolute bottom of it)
// create a dictionary - the key is the video ID, the value is a list of the video's properties
function collector(all_cards) {
    var liked_videos = {};
    all_cards.forEach(card => {
        // ignore Dislikes
        if (card.innerText.split("\n")[1].startsWith("Liked")) {
            // horrible parsing, your mileage may vary. I tried to avoid using any gibberish class names.
            let a_links = card.querySelectorAll("a")
            let details = a_links[0];
            let url = details.href.split("?v=")[1]
            let video_length = a_links[3].innerText;
            let time = a_links[2].parentElement.innerText.split(" • ")[0];
            let title = details.innerText;
            let date = card.closest("[data-date]").getAttribute("data-date")
            liked_videos[url] = [title, video_length, date, time];
            // console.log(title, video_length, date, time, url);
        }
    })
    return liked_videos;
}
// https://stackoverflow.com/questions/57709550/how-to-download-text-from-javascript-variable-on-all-browsers
function download(filename, text, type = "text/plain") {
    // Create an invisible A element
    const a = document.createElement("a");
    a.style.display = "none";
    document.body.appendChild(a);
    // Set the HREF to a Blob representation of the data to be downloaded
    a.href = window.URL.createObjectURL(
        new Blob([text], { type })
    );
    // Use the download attribute to set the desired file name
    a.setAttribute("download", filename);
    // Trigger the download by simulating a click
    a.click();
    // Cleanup
    window.URL.revokeObjectURL(a.href);
    document.body.removeChild(a);
}
function main() {
    // gather relevant elements
    var all_cards = document.querySelectorAll("div[aria-label='Card showing an activity from YouTube']")
    var liked_videos = collector(all_cards)
    // download json
    download("liked_videos.json", JSON.stringify(liked_videos))
}
main()
Basically it gathers all the liked videos' details and creates a video_ID: [title, video_length, date, time] entry for each liked video.
It then automatically downloads the json as a file.

Why is the server response after a Drag&Drop so large and slow

I'm having some performance issues dragging cards between grids. From a backend perspective, storing the data from the grids after a change takes about 200ms.
But then, when the backend work seems to be done, it takes another 2.5 seconds for the frontend to get the response to the request. The request that's taking so long contains 2 RPC events: grid-drop and grid-dragend.
The response is also unusually large, I think. Just to give you an idea, see the screenshot ... notice the tiny scrollbar at the right. 🙂
TTFB is 2.42 s, and the download size is about half a MB.
Any ideas what's going on here and how I can eliminate this?
I'm using Vaadin 21.0.4 and Spring Boot 2.5.4.
Steps I've taken to optimise performance:
Optimize db query + indexing
Use @Cacheable where possible
Implemented the cards using LitElement
This is the drop listener:
ComponentEventListener<GridDropEvent<Task>> dropListener = event -> {
    if (dragSource != null) {
        // The item on top of or below which the source item is dropped. Used to calculate the index of the newly dropped item(s).
        Optional<Task> targetItem = event.getDropTargetItem();
        // If the item is dropped on an existing row and the dragged items contain the item being dropped on, do nothing.
        if (targetItem.isPresent() && draggedItems.contains(targetItem.get())) {
            return;
        }
        // Add dragged items to the grid of the target room
        Grid<Task> targetGrid = event.getSource();
        Optional<Room> room = dayPlanningView.getRoomForGrid(targetGrid);
        // The items of the target grid. Using the list data view so this does not retrigger the query.
        List<Task> targetItems = targetGrid.getListDataView().getItems().toList();
        // Calculate the position of the dropped item
        int index = targetItem.map(task -> targetItems.indexOf(task)
                + (event.getDropLocation() == GridDropLocation.BELOW ? 1 : 0))
                .orElse(0);
        room.ifPresent(r -> service.plan(draggedItems, r, index, dayPlanningView.getSelectedDate()));
        // Send an event to update other users
        Optional<ScheduleUpdatedEvent> scheduleUpdatedEvent = room.map(r -> new ScheduleUpdatedEvent(PlanningMasterDetailView.this, r.getId()));
        scheduleUpdatedEvent.ifPresent(Broadcaster::broadcast);
        // Remove items from the source grid. Using the list provider so items can be removed without a DB round-trip.
        productionOrderGrid.getListDataView().removeItems(draggedItems);
    }
};
I'm a bit stuck now, as I'm kinda out of ideas 😦
Thanks
You should use a TemplateRenderer/LitRenderer instead of a ComponentRenderer, because the generated server-side components are affecting performance.
Read more here: https://vaadin.com/blog/top-5-most-common-vaadin-performance-pitfalls-and-how-to-avoid-them

How do I reset the ScanStreamTransformer Accumulator?

I am writing a Flutter application using the BLOC pattern. I'm currently trying to search a database and return a list of results on the screen.
The first search will complete fine. The issue is when I hit the back button and try to do a second search. The first search results aren't cleared out. Instead they remain on the screen and the new search results are placed below them.
In a nutshell, here's what I'm doing (I'm using the RxDart library):
1.) Defining input and output streams:
PublishSubject<SearchRowModel> _searchFetcher =
    PublishSubject<SearchRowModel>();
BehaviorSubject<Map<int, SearchRowModel>> _searchOutput =
    BehaviorSubject<Map<int, SearchRowModel>>();
2.) Piping the streams together in the class constructor.
_searchFetcher.stream.transform(_resultsTransformer()).pipe(_searchOutput);
3.) Adding the results to the fetcher stream
results.searchRows.forEach((SearchRowModel row) {
    _searchFetcher.sink.add(row);
});
4.) Using a ScanStreamTransformer to create a map of the results.
_resultsTransformer() {
    return ScanStreamTransformer(
        (Map<int, SearchRowModel> cache, SearchRowModel row, index) {
            cache[index] = row;
            return cache;
        },
        <int, SearchRowModel>{},
    );
}
From my debugging, I've found that the code in step 4 appears to be the issue. That cache (or the Accumulator) isn't getting reset between searches. It just keeps adding additional results to the map.
I've yet to find a way of resetting the map in the accumulator/cache that works. I even tried completely destroying the streams and recreating them, but the original search data in the ScanStreamTransformer accumulator still persisted.

AWS DynamoDB session table keeps growing, can't delete expired sessions

The ASP.NET_SessionState table grows all the time, already at 18 GB, with no sign of expired sessions ever being deleted.
We have tried to execute DynamoDBSessionStateStore.DeleteExpiredSessions, but it seems to have no effect.
Our system is running fine, sessions are created, and end-users are not aware of the issue. However, it doesn't make sense that the table keeps growing all the time...
We have triple-checked permissions/security, and everything seems to be in order. We use SDK version 3.1.0. What else remains to be checked?
With your table being over 18 GB, which is quite large (in this context), it does not surprise me that this isn't working after looking at the code for the DeleteExpiredSessions method on GitHub.
Here is the code:
public static void DeleteExpiredSessions(IAmazonDynamoDB dbClient, string tableName)
{
    LogInfo("DeleteExpiredSessions");
    Table table = Table.LoadTable(dbClient, tableName, DynamoDBEntryConversion.V1);
    ScanFilter filter = new ScanFilter();
    filter.AddCondition(ATTRIBUTE_EXPIRES, ScanOperator.LessThan, DateTime.Now);

    ScanOperationConfig config = new ScanOperationConfig();
    config.AttributesToGet = new List<string> { ATTRIBUTE_SESSION_ID };
    config.Select = SelectValues.SpecificAttributes;
    config.Filter = filter;

    DocumentBatchWrite batchWrite = table.CreateBatchWrite();
    Search search = table.Scan(config);

    do
    {
        List<Document> page = search.GetNextSet();
        foreach (var document in page)
        {
            batchWrite.AddItemToDelete(document);
        }
    } while (!search.IsDone);

    batchWrite.Execute();
}
The above algorithm is executed in two parts. First it performs a Search (table scan), using a filter to identify all expired records. These are then added to a DocumentBatchWrite request that is executed as the second step.
Since your table is so large the table scan step will take a very, very long time to complete before a single record is deleted. Basically, the above algorithm is useful for lazy garbage collection on small tables, but does not scale well for large tables.
The best I can tell is that the execution of this is never actually getting past the table scan and you may be consuming all of the read throughput of your table.
A possible solution for you would be to run a slightly modified version of the above method on your own. You would want to call the DocumentBatchWrite inside of the do-while loop so that records start to be deleted before the table scan is concluded.
That would look like:
public static void DeleteExpiredSessions(IAmazonDynamoDB dbClient, string tableName)
{
    LogInfo("DeleteExpiredSessions");
    Table table = Table.LoadTable(dbClient, tableName, DynamoDBEntryConversion.V1);
    ScanFilter filter = new ScanFilter();
    filter.AddCondition(ATTRIBUTE_EXPIRES, ScanOperator.LessThan, DateTime.Now);

    ScanOperationConfig config = new ScanOperationConfig();
    config.AttributesToGet = new List<string> { ATTRIBUTE_SESSION_ID };
    config.Select = SelectValues.SpecificAttributes;
    config.Filter = filter;

    Search search = table.Scan(config);

    do
    {
        // Perform a batch delete for each page returned
        DocumentBatchWrite batchWrite = table.CreateBatchWrite();
        List<Document> page = search.GetNextSet();
        foreach (var document in page)
        {
            batchWrite.AddItemToDelete(document);
        }
        batchWrite.Execute();
    } while (!search.IsDone);
}
Note: I have not tested the above code. It is just a simple modification of the open-source code, so it should work correctly, but it would need to be tested to ensure the pagination behaves correctly on a table whose records are being deleted as it is scanned.

How to implement pagination when using amazon Dynamo DB in rails

I want to use Amazon DynamoDB with Rails, but I have not found a way to implement pagination.
I will use AWS::Record::HashModel as ORM.
This ORM supports limits like this:
People.limit(10).each {|person| ... }
But I could not figure out how to implement the following MySQL query in DynamoDB.
SELECT *
FROM `People`
LIMIT 1, 30
You issue queries using LIMIT. If the subset returned does not contain the full table, a LastEvaluatedKey value is returned. You use this value as the ExclusiveStartKey in the next query. And so on...
From the DynamoDB Developer Guide.
You can provide 'page-size' in your query to set the result set size.
The response from DynamoDB contains 'LastEvaluatedKey', which indicates the last key as per the page size. If the response doesn't contain 'LastEvaluatedKey', there are no results left to fetch.
Use the 'LastEvaluatedKey' as 'ExclusiveStartKey' when fetching the next page.
I hope this helps.
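A minimal sketch of that loop in Python with boto3 (the question uses a Ruby ORM, but the cursor mechanics are identical; the table name is illustrative):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('People')  # illustrative table name

def fetch_page(page_size, exclusive_start_key=None):
    # Returns one page of items plus the cursor for the next page.
    kwargs = {'Limit': page_size}
    if exclusive_start_key:
        kwargs['ExclusiveStartKey'] = exclusive_start_key
    response = table.scan(**kwargs)
    # LastEvaluatedKey is absent on the final page.
    return response['Items'], response.get('LastEvaluatedKey')

# Walk through all pages, 30 items at a time.
cursor = None
while True:
    items, cursor = fetch_page(30, cursor)
    print(len(items))
    if cursor is None:
        break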
DynamoDB Pagination
Here's a simple copy-paste-run proof of concept (Node.js) for stateless forward/reverse navigation with DynamoDB. In summary: each response includes the navigation history, allowing the user to explicitly and consistently request either the next or previous page (while next/prev params exist):
GET /accounts -> first page
GET /accounts?next=A3r0ijKJ8 -> next page
GET /accounts?prev=R4tY69kUI -> previous page
Considerations:
If your ids are large and/or users might do a lot of navigation, then the potential size of the next/prev params might become too large.
Yes you do have to store the entire reverse path - if you only store the previous page marker (per some other answers) you will only be able to go back one page.
It won't handle changing pageSize midway; consider baking pageSize into the next/prev value.
base64-encode the next/prev values, and you could also encrypt them.
Scans are inefficient; while this suited my current requirement, it won't suit all!
// demo.js
const mockTable = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

const getPagedItems = (pageSize = 5, cursor = {}) => {
    // Parse cursor
    const keys = cursor.next || cursor.prev || [] // fwd first
    let key = keys[keys.length-1] || null // eg ddb's PK

    // Mock query (mimic dynamodb response)
    const Items = mockTable.slice(parseInt(key) || 0, pageSize+key)
    const LastEvaluatedKey = Items[Items.length-1] < mockTable.length
        ? Items[Items.length-1] : null

    // Build response
    const res = {items:Items}
    if (keys.length > 0) // add reverse nav keys (if any)
        res.prev = keys.slice(0, keys.length-1)
    if (LastEvaluatedKey) // add forward nav keys (if any)
        res.next = [...keys, LastEvaluatedKey]
    return res
}

// Run test ------------------------------------
const runTest = () => {
    const PAGE_SIZE = 6
    let x = {}, i = 0
    // Page to end
    while (i == 0 || x.next) {
        x = getPagedItems(PAGE_SIZE, {next:x.next})
        console.log(`Page ${++i}: `, x.items)
    }
    // Page back to start
    while (x.prev) {
        x = getPagedItems(PAGE_SIZE, {prev:x.prev})
        console.log(`Page ${--i}: `, x.items)
    }
}
runTest()
I faced a similar problem.
The generic pagination approach is to use a "start index" or "start page" and a "page length".
The "ExclusiveStartKey" and "LastEvaluatedKey" based approach is very DynamoDB-specific, and I feel this DynamoDB-specific implementation of pagination should be hidden from the API client/UI.
Also, if the application is serverless, using a service like Lambda, it will not be possible to maintain state on the server. The flip side is that the client implementation becomes very complex.
I came up with a different approach, which I think is generic (and not specific to DynamoDB):
When the API client specifies the start index, fetch all the keys from the table and store them in an array.
Find the key for the start index, as specified by the client, in the array.
Use it as the ExclusiveStartKey and fetch the number of records specified in the page length.
If the start index parameter is not present, the above steps are not needed; we don't need to specify the ExclusiveStartKey in the scan operation.
This solution has some drawbacks:
We will need to fetch all the keys when the user needs pagination with a start index.
We will need additional memory to store the ids and the indexes.
Additional database scan operations are needed (one or multiple, to fetch the keys).
But I feel this will be a very easy approach for the clients that are using our APIs. The backward scan will work seamlessly. If the user wants to see the "nth" page, this will be possible. A sketch of the idea follows.
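A rough Python/boto3 sketch of those steps (the table name and key attribute are assumptions, not from the question):

import boto3

dynamodb = boto3.client('dynamodb')
TABLE = 'People'  # illustrative table name
KEY_ATTR = 'id'   # illustrative partition-key attribute

def page_by_index(start_index, page_length):
    if start_index == 0:
        # No ExclusiveStartKey needed for the first page.
        return dynamodb.scan(TableName=TABLE, Limit=page_length)['Items']
    # Step 1: fetch every key (paginated keys-only scan).
    keys, kwargs = [], {'TableName': TABLE, 'ProjectionExpression': KEY_ATTR}
    while True:
        resp = dynamodb.scan(**kwargs)
        keys.extend(resp['Items'])
        if 'LastEvaluatedKey' not in resp:
            break
        kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']
    # Steps 2-3: the key just before the start index becomes the ExclusiveStartKey.
    return dynamodb.scan(TableName=TABLE,
                         ExclusiveStartKey=keys[start_index - 1],
                         Limit=page_length)['Items']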
In fact I faced the same problem, and I noticed that LastEvaluatedKey and ExclusiveStartKey were not working well, especially when using Scan. So I solved it like this:
GET/?page_no=1&page_size=10 =====> first page
The response will contain the count of records and the first 10 records.
Retry and increase the page number until all records have come back.
The code is below (PS: I am using Python):
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('People')  # table name is illustrative

def handler(page_no, page_size):
    # Note: this re-scans the table on every request, so it only suits
    # small tables (a single Scan call returns at most 1 MB of data).
    response = table.scan()
    first_index = (page_no - 1) * page_size
    second_index = page_no * page_size
    if second_index > len(response['Items']):
        second_index = len(response['Items'])
    return {
        'statusCode': 200,
        'count': response['Count'],
        'response': response['Items'][first_index:second_index]
    }
