YouTube API - loop through huge data

I need to retrieve the view count for each video in a set of channels, and I'm using this library.
My code is shown below.
The code works fine and prints the view count for each video, except that for some videos I get the following warnings and no view count is printed:
A PHP Error was encountered Severity: Warning Message:
simplexml_load_string() [function.simplexml-load-string]: Entity:
line 547: parser error : attributes construct error
Message:
simplexml_load_string() [function.simplexml-load-string]:
outube_gdata'/>
Message:
simplexml_load_string() [function.simplexml-load-string]: ^
Message:
simplexml_load_string() [function.simplexml-load-string]: Entity:
line 547: parser error : Couldn't find end of Start Tag link line 547
Message: simplexml_load_string() [function.simplexml-load-string]:
outube_gdata'/>
How can I deal with this large number of videos and channels without triggering these warnings and wasting time? If I try the same code on one channel with fewer videos, I get no errors.
$channels = array('google','apple','mac','xyz','abc','test');
for ($j = 0; $j < count($channels); $j++)
{
    // Total number of uploads for this channel
    $JSON = file_get_contents("https://gdata.youtube.com/feeds/api/users/".$channels[$j]."/uploads?v=2&alt=jsonc&max-results=0");
    $JSON_Data = json_decode($JSON);
    $total_videos = $JSON_Data->{'data'}->{'totalItems'};

    // Walk the uploads feed 20 entries at a time
    for ($i = 1; $i <= $total_videos; )
    {
        $this->get_userfeed($channels[$j], $maxresult = 20, $start = $i);
        $i += 20;
    }
}
public function get_userfeed($ch_id, $maxresult = 10, $start = 0, $do = null)
{
    $output = $this->youtube->getUserUploads($ch_id, array('max-results' => $maxresult, 'start-index' => $start));
    $xml = simplexml_load_string($output);
    // single entry for testing
    foreach ($xml->entry as $entry)
    {
        foreach ($entry->id as $key => $val)
        {
            $id = explode('videos/', (string)$val);
            $JSON = file_get_contents("https://gdata.youtube.com/feeds/api/videos/".$id[1]."?v=2&alt=json");
            $JSON_Data = json_decode($JSON);
            $v_count = $JSON_Data->{'entry'}->{'yt$statistics'}->{'viewCount'};
            if ($v_count == NULL) $v_count = 0;
            echo $v_count;
            // store the v_count into database
        }
    }
}

You're doing a few things wrong.
First off, if you want to minimize the number of calls you're making to the API, you should be setting max-results=50, which is the largest value that the API supports.
Second, I don't understand why you're making individual calls to http://.../videos/VIDEO_ID to retrieve the statistics for each video, since that information is already returned as part of the video entries you're getting from the http://.../users/USER_ID/uploads feed. You can just store the values returned by that feed and avoid having to make all those additional calls to retrieve each video.
Finally, the underlying issue is almost certainly that you're running into quota errors, and you can read more about them at http://apiblog.youtube.com/2010/02/best-practices-for-avoiding-quota.html
Taking any of the steps I mention should cut down on the total requests that you're making and potentially get around any quota problems, but you should familiarize yourself with the quota system anyway.
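For illustration, here is a rough sketch of that single-feed approach, written in Java rather than the thread's PHP (the use of java.net.http and Gson is my own choice), against the same v2 feed shape the question already relies on (feed -> entry -> yt$statistics -> viewCount). The GData API has since been retired, so treat it purely as a sketch of the call pattern, not as working code:
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChannelViewCounts {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Walks a channel's uploads feed 50 entries at a time and reads viewCount
    // directly from each entry, so no per-video request is needed.
    static void printViewCounts(String channel) throws Exception {
        int start = 1; // GData's start-index is 1-based
        while (true) {
            String url = "https://gdata.youtube.com/feeds/api/users/" + channel
                    + "/uploads?v=2&alt=json&max-results=50&start-index=" + start;
            HttpResponse<String> response = HTTP.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());

            JsonObject feed = JsonParser.parseString(response.body())
                    .getAsJsonObject().getAsJsonObject("feed");
            JsonArray entries = feed.getAsJsonArray("entry");
            if (entries == null || entries.size() == 0) break; // no more pages

            for (int i = 0; i < entries.size(); i++) {
                JsonObject entry = entries.get(i).getAsJsonObject();
                JsonObject stats = entry.getAsJsonObject("yt$statistics");
                String viewCount = (stats == null) ? "0" : stats.get("viewCount").getAsString();
                System.out.println(viewCount); // or store it in the database instead
            }
            start += entries.size();
        }
    }
}
Each request covers 50 videos, so a channel with 1,000 uploads takes about 20 calls instead of the per-video requests the original code makes.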

Related

YouTube API - retrieve more than 5k items

I just want to fetch all my liked videos, roughly 25k items. As far as my research goes, this is not possible via the YouTube v3 API.
I have already found multiple issues (issue, issue) on the same problem; some claim to have fixed it, but it only works for them because they have fewer than 5,000 items in their liked-videos list.
playlistItems list API endpoint with playlist id set to "liked videos" (LL) has a limit of 5000.
videos list API endpoint has a limit of 1000.
Unfortunately those endpoints don't provide me with parameters that I could use to paginate the requests myself (e.g. give me all the liked videos between date x and y), so I'm forced to take the provided order (which I can't get past 5k entries).
Is there any possibility I can fetch all my likes via the API?
Some additional thoughts on the reply from #Yarin_007:
If there are deleted videos in the timeline, they appear as "Liked https://...url". The script doesn't like that format and fails, because the underlying elements don't have the same structure as existing videos.
This can easily be fixed with a try/catch:
function collector(all_cards) {
    var liked_videos = {};
    all_cards.forEach(card => {
        try {
            // ignore Dislikes
            if (card.innerText.split("\n")[1].startsWith("Liked")) {
                ....
            }
        }
        catch {
            console.log("error, prolly deleted video")
        }
    })
    return liked_videos;
}
To scroll down to the bottom of the page I've used this simple script; no need to spin up anything big:
var millisecondsToWait = 1000;
setInterval(function() {
    window.scrollTo(0, document.body.scrollHeight);
    console.log("scrolling")
}, millisecondsToWait);
When more people want to retrieve this kind of data, one could think about building a proper script that is more convenient to use. If you check the network requests, you can find the desired data in the responses of the requests called batchexecute. One could copy the authentication from one of them and provide it to a script that queries those endpoints and prepares the data like the other script I currently inject manually.
Hmm. Perhaps Google Takeout?
I have verified that the YouTube data contains a CSV called "liked videos.csv". The header is Video Id,Time Added, and the rows look like
dQw4w9WgXcQ,2022-12-18 23:42:19 UTC
prvXCuEA1lw,2022-12-24 13:22:13 UTC
for example.
So you would need to retrieve video metadata per video ID. Not too bad, though.
Note: the export could take a while, especially with 25k videos. (Select only YouTube data.)
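If you go the Takeout route, the follow-up lookups are cheap because videos.list in the Data API v3 accepts up to 50 IDs per call. Here is a rough Java sketch of that idea; the file name, the API key placeholder and the use of Gson for parsing are my assumptions, and error handling is left out:
import com.google.gson.JsonParser;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TakeoutLikes {
    private static final String API_KEY = "YOUR_API_KEY"; // placeholder, not a real key
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // Skip the "Video Id,Time Added" header and keep just the video IDs.
        List<String> lines = Files.readAllLines(Path.of("liked videos.csv"));
        List<String> ids = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            if (!line.isBlank()) ids.add(line.split(",")[0].trim());
        }

        // videos.list accepts a comma-separated list of up to 50 IDs per request.
        for (int i = 0; i < ids.size(); i += 50) {
            String batch = String.join(",", ids.subList(i, Math.min(i + 50, ids.size())));
            String url = "https://www.googleapis.com/youtube/v3/videos"
                    + "?part=snippet,statistics&id=" + batch + "&key=" + API_KEY;
            HttpResponse<String> response = HTTP.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            // Each response carries an "items" array with snippet/statistics per video.
            int returned = JsonParser.parseString(response.body())
                    .getAsJsonObject().getAsJsonArray("items").size();
            System.out.println("batch starting at " + i + ": " + returned + " videos");
        }
    }
}
With roughly 25k liked videos that is about 500 requests, which should fit comfortably within the default daily quota.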
I also had an idea that involves scraping the actual liked-videos page (which would save you 25k HTTP requests), but I'm unsure whether it breaks with more than 5,000 entries. (Also, emulating the POST requests on that page may prove quite difficult, albeit not impossible: they fetch /browse?key=... and have some kind of obfuscated/encrypted base64 strings in the request body, among other parameters.)
EDIT:
Look. There's probably a normal way to get a complete dump of all your Google data. (I mean, other than Takeout. Email them? I don't know.)
Anyway, the following is the other idea...
Follow this deep link to your liked videos history.
Scroll to the bottom... maybe with Selenium, maybe with AutoIt, maybe put something on the "End" key of your keyboard until you reach your first liked video.
Hit F12 and run this in the developer console:
// https://www.youtube.com/watch?v=eZPXmCIQW5M
// https://myactivity.google.com/page?utm_source=my-activity&hl=en&page=youtube_likes
// go over all "cards" in the activity webpage. (after scrolling down to the absolute bottom of it)
// create a dictionary - the key is the Video ID, the value is a list of the video's properties
function collector(all_cards) {
    var liked_videos = {};
    all_cards.forEach(card => {
        // ignore Dislikes
        if (card.innerText.split("\n")[1].startsWith("Liked")) {
            // horrible parsing. your mileage may vary. I Tried to avoid using any gibberish class names.
            let a_links = card.querySelectorAll("a")
            let details = a_links[0];
            let url = details.href.split("?v=")[1]
            let video_length = a_links[3].innerText;
            let time = a_links[2].parentElement.innerText.split(" • ")[0];
            let title = details.innerText;
            let date = card.closest("[data-date]").getAttribute("data-date")
            liked_videos[url] = [title, video_length, date, time];
            // console.log(title, video_length, date, time, url);
        }
    })
    return liked_videos;
}
// https://stackoverflow.com/questions/57709550/how-to-download-text-from-javascript-variable-on-all-browsers
function download(filename, text, type = "text/plain") {
    // Create an invisible A element
    const a = document.createElement("a");
    a.style.display = "none";
    document.body.appendChild(a);
    // Set the HREF to a Blob representation of the data to be downloaded
    a.href = window.URL.createObjectURL(
        new Blob([text], { type })
    );
    // Use download attribute to set the desired file name
    a.setAttribute("download", filename);
    // Trigger the download by simulating click
    a.click();
    // Cleanup
    window.URL.revokeObjectURL(a.href);
    document.body.removeChild(a);
}
function main() {
    // gather relevant elements
    var all_cards = document.querySelectorAll("div[aria-label='Card showing an activity from YouTube']")
    var liked_videos = collector(all_cards)
    // download json
    download("liked_videos.json", JSON.stringify(liked_videos))
}
main()
Basically it gathers all the liked videos' details and builds an object with one entry per liked video, where the key is the video ID and the value is [title, video_length, date, time].
It then automatically downloads the JSON as a file.

Spring-data-elasticsearch: Result window is too large (index.max_result_window)

We retrieve information from Elasticsearch 2.7.0 and we allow the user to go through the results. When the user requests a high page number we get the following error message:
Result window is too large, from + size must be less than or equal to:
[10000] but was [10020]. See the scroll api for a more efficient way
to request large data sets. This limit can be set by changing the
[index.max_result_window] index level parameter
The thing is we use pagination in our requests so I don't see why we get this error:
@Autowired
private ElasticsearchOperations elasticsearchTemplate;
...
elasticsearchTemplate.queryForPage(buildQuery(query, pageable), Document.class);
...
private NativeSearchQuery buildQuery(String query, Pageable pageable) {
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    boolQueryBuilder.should(QueryBuilders.boolQuery().must(QueryBuilders.termQuery(term, query.toUpperCase())));
    NativeSearchQueryBuilder nativeSearchQueryBuilder = new NativeSearchQueryBuilder().withIndices(DOC_INDICE_NAME)
            .withTypes(indexType)
            .withQuery(boolQueryBuilder)
            .withPageable(pageable);
    return nativeSearchQueryBuilder.build();
}
I don't understand the error because we retrieve pageable.size (20 elements) every time... Do you have any idea why we get this?
Unfortunately, even when paging, Spring Data Elasticsearch translates the page request into a plain from + size query, so a high page number still asks Elasticsearch for a result window beyond 10,000 (the 10,020 in your error). So you have two options: the first is to change the value of this parameter (index.max_result_window).
The second is to use the scan/scroll API. However, as far as I understand, in that case the pagination is done manually, as it is intended for sequential reading of the whole result set (like scrolling with your mouse).
A sample:
List<Pessoa> allItens = new ArrayList<>();
String scrollId = elasticsearchTemplate.scan(build, 1000, false, Pessoa.class);
Page<Pessoa> page = elasticsearchTemplate.scroll(scrollId, 5000L, Pessoa.class);
while (true) {
    if (!page.hasContent()) {
        break;
    }
    allItens.addAll(page.getContent());
    page = elasticsearchTemplate.scroll(scrollId, 5000L, Pessoa.class);
}
This code shows you how to read ALL the data from your index; you would have to pick out the requested page yourself while scrolling.
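For completeness, the first option (raising the limit) is a one-off settings update on the index. Here is a minimal sketch over Elasticsearch's REST API, assuming a local node and an index called documents; bear in mind that a larger result window lets deep from + size queries consume correspondingly more memory on the nodes:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RaiseResultWindow {
    public static void main(String[] args) throws Exception {
        // PUT <index>/_settings with a higher index.max_result_window.
        String body = "{ \"index\": { \"max_result_window\": 50000 } }";
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:9200/documents/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}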

Akka Router with multiple actors not receiving messages properly

Here I created a router with a SmallestMailboxRouter:
ActorRef ruleStandardActorRouter = this?.getContext()?.actorOf(new Props(RuleStandardActor.class).withRouter(new SmallestMailboxRouter(38)), "standardActorRouter")
Now in a for loop I send 38 messages to the router (one per actor):
for (int i = 0; i < 38; i++) {
    ruleStandardActorRouter?.tell(new StandardActorMessage(standard: standard, responseVO: responseVO, report: report), getSelf());
}
Each actor processes the logic and returns a score and a message. I receive the messages by overriding the onReceive method and adding them to a list:
If I run the program multiple times I get different scores, but it should always return the same score since I am giving the same input.
if (message instanceof StandardActorResponse) {
    StandardActorResponse standardActorResponse = message
    standardActorResponseList?.add(standardActorResponse)
}
Here standardActorResponse contains a message and a score. If I use the same logic with a plain for loop instead of the Akka framework, I get consistent results, but with Akka I randomly get different results. For example, I have rules like loginexistence, navigationexistence and alertsexistence, and I give one HTML source to these rules to check whether the source has login, alerts and navigation links. Using Akka routers and actors, sometimes I get that login doesn't exist, sometimes that navigation doesn't exist, sometimes that alerts don't exist; with a for loop I always get the same result.
Can anyone help me find the problem? I am using Akka 2.1.4.
Probably the for loop is already finished before the mailbox size is recognised. Try adding a sleep in the for loop to see the results.
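A minimal sketch of that suggestion, dropped into the question's send loop (plain Java here rather than the question's Groovy, assuming an equivalent StandardActorMessage constructor); the 50 ms pause is an arbitrary value just to make the effect visible, not a fix you would ship:
for (int i = 0; i < 38; i++) {
    ruleStandardActorRouter.tell(
            new StandardActorMessage(standard, responseVO, report), getSelf());
    try {
        // Pause so routee mailbox sizes can be observed before the next send;
        // purely for diagnosing the behaviour, not a production fix.
        Thread.sleep(50);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}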

YouTube video API not working with IDs beginning with a dash

I am accessing data from YouTube's API. I have everything working fine, but the problem I'm having is that when there's a dash (-) at the beginning of the video ID, it doesn't return the JSON data.
$videoID = "-FIHqoTcZog";
$json = json_decode(file_get_contents("http://gdata.youtube.com/feeds/api/videos?q={$videoID}&alt=json"));
I am however able to return the thumbnail as always with it using this:
$thumbnail = "http://i4.ytimg.com/vi/".$videoID."/mqdefault.jpg";
This is the code that I use to pull the information I want from the above JSON.
$title = $json->{'feed'}->{'entry'}[0]->{'title'}->{'$t'};
$description = $json->{'feed'}->{'entry'}[0]->{'media$group'}->{'media$description'}->{'$t'};
$thumbnail = "http://i4.ytimg.com/vi/".$videoID."/mqdefault.jpg";
$ratings = ((round($json->{'feed'}->{'entry'}[0]->{'gd$rating'}->{'average'}, 1)/$json->{'feed'}->{'entry'}[0]->{'gd$rating'}->{'max'})*100)."%";
$views = number_format($json->{'feed'}->{'entry'}[0]->{'yt$statistics'}->{'viewCount'});
$duration = $json->{'feed'}->{'entry'}[0]->{'media$group'}->{'yt$duration'}->{'seconds'};
Are you sure you're only getting a problem with IDs that have a dash in front of them? The code you pasted shouldn't be working with any YouTube ID, because the gdata feed returns, as part of the JSON, some text with the '$' character in it. That character is the PHP variable prefix, so you'll get 500 errors trying to run the json_decode function on whatever the feed returns.
One way to solve the problem is to use json_decode's 2nd parameter to give you an associative array rather than an object, like this:
$json = json_decode(file_get_contents("http://gdata.youtube.com/feeds/api/videos?q={$videoID}&alt=json"),true);
Of course, that requires you to work with an array, too, but the subsequent code changes should be minimal.
If you aren't getting errors with other videos using the exact same code, perhaps you could post it here?

Faster reading of inbox in Java

I'd like to get a list of everyone who's ever been included on any message in my inbox. Right now I can use the javax mail API to connect via IMAP and download the messages:
Folder folder = imapSslStore.getFolder("[Gmail]/All Mail");
folder.open(Folder.READ_ONLY);
Message[] messages = folder.getMessages();
for (int i = 0; i < messages.length; i++) {
    // This causes the message to be lazily loaded and is slow
    String[] from = messages[i].getFrom();
}
The line messages[i].getFrom() is slower than I'd like because it causes the message to be lazily loaded. Is there anything I can do to speed this up? E.g. is there some kind of bulk loading I can do instead of loading the messages one by one? Does this load the whole message, and is there something I can do to only load the To/From/Cc fields or headers instead? Would POP be any faster than IMAP?
You want to add the following before the for loop
FetchProfile fetchProfile = new FetchProfile();
fetchProfile.add(FetchProfile.Item.ENVELOPE);
folder.fetch(messages, fetchProfile);
This will prefetch the "envelope" for all the messages, which includes the from/to/subject/cc fields.
You can use the fetch method in Folder. According to the Javadocs:
Clients use this method to indicate that the specified items are
needed en-masse for the given message range. Implementations are
expected to retrieve these items for the given message range in an
efficient manner. Note that this method is just a hint to the
implementation to prefetch the desired items.
For fetching FROM, the appropriate FetchProfile item is ENVELOPE. Of course, it is still up to the implementation and the mail server whether that really helps.
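If you only care about the address fields, you can also ask for individual headers instead of the whole envelope; FetchProfile has an add(String) overload for exactly that. A small untested sketch, and the server is free to return more than you asked for:
import javax.mail.FetchProfile;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.MessagingException;

public class HeaderPrefetch {
    // Prefetch only the address headers in one bulk fetch, then read them locally.
    static void printSenders(Folder folder, Message[] messages) throws MessagingException {
        FetchProfile profile = new FetchProfile();
        profile.add("From"); // FetchProfile.add(String) asks for an individual header
        profile.add("To");
        profile.add("Cc");
        folder.fetch(messages, profile);

        for (Message message : messages) {
            String[] from = message.getHeader("From"); // served from the prefetched headers
            System.out.println(from == null ? "(no From header)" : from[0]);
        }
    }
}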
