After a lot of debugging, it finally occurred to me that YouTube seemingly only returns the first 100 comments when using the v2 YouTube API to fetch comments. I finally tried using:
curl -Lk -X GET "http://gdata.youtube.com/feeds/api/videos/MShbP3OpASA/comments?alt=json&start-index=100&max-results=50"
And all I get is a response without an entry parameter. That is to say, I do not receive an error response or something like that - I get a perfectly good response, but without the entry parameter.
Digging a little deeper, in my response the value of openSearch$totalResults is 100, so according to this resource this seems to be the expected result (although it mentions some kind of error message, which I don't get?).
But here comes the kicker: When I use
curl -Lk -X GET "http://gdata.youtube.com/feeds/api/videos/MShbP3OpASA/comments?alt=json&start-index=1&max-results=50&orderby=published"
openSearch$totalResults equals 3141, the actual count of the comments.
Now here is my question: since the v2 API was officially deprecated about a week ago, is it possible that Google has simply set a limit on the comments, so that only the first 100 are accessible? Since the v3 API does not allow for comment retrieval, that would be a real bummer for me.
Does anyone have any ideas?
I've figured out how to retrieve all the comments using the navigation links embedded in the json response.
Suppose you retrieve the first page using a link like this (Python here, but you get the point):
r'https://gdata.youtube.com/feeds/api/videos/' + aVideoID + r'/comments?alt=json&start-index=1&max-results=50&prettyprint=true&orderby=published'
Embedded in the JSON under "feed" (and before the comments) will be a four-element array called "link". The fourth element will have "rel": "next", and under "href" there will be a link you can use to get the next 50 comments. The link will look something like:
https://gdata.youtube.com/feeds/api/videos/fH0cEP0mvlU/comments?alt=json&orderby=published&alt=json&start-token=EgkI2NqyoZDRvgIosK%2FPosPRvgIw653cmsXRvgI4AUAC&max-results=50&orderby=published
for an original URL of:
https://gdata.youtube.com/feeds/api/videos/fH0cEP0mvlU/comments?alt=json&start-index=1&max-results=50&prettyprint=true&orderby=published
If you follow the next link, it returns JSON similar to the original link, with another 50 comments. Continue this process over and over until you have all the comments (in my code I check for either the absence of this item in the JSON or zero comments in the JSON to determine when to stop).
You need the "&orderby=published" in the original URL because otherwise the "next" links eventually grow too large and cause an error (something in the token the API uses to track which comments you've seen under the default orderby takes a lot of space). Something about the published orderby keeps the "start-token" small, whereas with the default orderby you will start getting 414 Request-URI Too Long errors after about 500 comments.
Hope this helps.
I'm making a new iOS (Swift) app to test some concepts, and I'm using the GitHub Search API to retrieve a list of filtered repositories.
The request is working fine so far, but I'm having trouble understanding the pagination process and how to know I reached the end of the results.
From what I've seen, the Search API returns a maximum of 1k results, broken into pages of at most 100 results. But the field in the returned JSON with the total count shows far more available results (I imagine it shows the total number of repositories that satisfy the query, not the maximum available for return from the API).
The only way I have found so far to obtain information about the pages (and the pagination process) in the GitHub documentation comes in the header of the response, like:
Status: 200 OK
Link: <https://api.github.com/resource?page=2>; rel="next",
<https://api.github.com/resource?page=5>; rel="last"
X-RateLimit-Limit: 20
X-RateLimit-Remaining: 19
Can anyone suggest the best approach to detect the end of the pages in this case?
Should I try to parse the information from the header or infer it somehow based on the returned json? I even got the "Link" header value but don't know how to parse it.
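For what it's worth, the Link header is just a comma-separated list of <url>; rel="..." pairs, and the absence of a rel="next" entry is the usual signal that you're on the last page. Here's a minimal sketch of the parsing (in Python; the same string-splitting logic ports directly to Swift):

def parse_link_header(header):
    # Turn '<url1>; rel="next", <url2>; rel="last"' into {'next': url1, 'last': url2}.
    links = {}
    for part in header.split(','):
        url_part, _, rel_part = part.partition(';')
        url = url_part.strip().lstrip('<').rstrip('>')
        rel = rel_part.strip().replace('rel=', '').strip('"')
        links[rel] = url
    return links

links = parse_link_header('<https://api.github.com/resource?page=2>; rel="next", '
                          '<https://api.github.com/resource?page=5>; rel="last"')
print('next' in links)  # True here; on the last page's header there is no "next" entry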
I am using the https://pushshift.io/ API endpoint to get the count of submissions on my user's subreddit page (e.g. 'u_username').
From what I understand, it must have something to do with the API, because I get the JSON response correctly, but when I count the objects there are more than expected.
I tried several methods to count the JSON array with PHP count() and sizeof(), but these do not seem to be the problem. The problem may be in the API.
In PHP:
$data = json_decode(file_get_contents('https://api.pushshift.io/reddit/search/submission/?subreddit=u_USERNAME&filter=id&sort=desc&size=500'), true);
echo count($data['data']); // the submissions are in the "data" array of the JSON response
For more than 500 posts, you would have to change the 'size' parameter.
I have 232 posts and I get back 240... How is this possible?
Is there any other way to count my subreddit user page posts/submissions?
Thank you.
If a post has been deleted or removed, it is still recorded by Pushshift (they aren't removed from Pushshift when they are deleted or removed). This is the principle behind removeddit and ceddit.
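If you still need an accurate total despite the 'size' cap, one option is to page backwards in time with the 'before' parameter until nothing comes back. A rough Python sketch (the 'before'/'created_utc' usage here is an assumption based on Pushshift's usual parameters, and note that the count will still include deleted/removed posts, per the above):

import requests

def count_submissions(subreddit):
    base = 'https://api.pushshift.io/reddit/search/submission/'
    params = {'subreddit': subreddit, 'size': 500, 'sort': 'desc',
              'filter': 'id,created_utc'}
    total = 0
    while True:
        data = requests.get(base, params=params).json()['data']
        if not data:
            return total
        total += len(data)
        # page backwards from the oldest submission seen so far
        params['before'] = data[-1]['created_utc']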
For certain videos, the call to commentThreads does not return the complete set of top-level comments. For example, the following call returns only 16 top-level comments when in fact the video has over 3458 total comments, and the top-level comments clearly number more than 16.
https://www.googleapis.com/youtube/v3/commentThreads?part=snippet&videoId=h_9-3Fj3ZdI&key=[KEY]&maxResults=50
I have run into this issue on a couple of videos, and in both cases the result seems to break on a comment which is in fact missing from the web UI. I have looked around and tried multiple approaches (i.e. trying to skip ahead using start-index etc.), but I haven't found a solution. You can verify the missing comment in the above result: 'z131gd2gwqnby3lea23byp3yyt3pshcqb04'
I don't know if this is still important, but I can get all the comments. You need to use the nextPageToken to get all comments (i.e. more than your maxResults), and the comments request to get the replies.
You can verify this in python as well. See the following code:
import requests

api_key = '...'
url = 'https://youtube.googleapis.com/youtube/v3/commentThreads?part=snippet&videoId=h_9-3Fj3ZdI&maxResults=100&key=' + api_key
response = requests.get(url=url).json()
print(response)  # contains a nextPageToken field while more pages remain
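For the replies mentioned above, a comment thread's id is also the id of its top-level comment, so it can be passed as parentId to the separate comments endpoint. A short continuation of the snippet (just the first thread, to keep it minimal):

# Fetch the replies to the first comment thread from the response above.
thread_id = response['items'][0]['id']
replies = requests.get(
    url='https://youtube.googleapis.com/youtube/v3/comments'
        '?part=snippet&parentId=' + thread_id + '&maxResults=100&key=' + api_key
).json()
print(replies)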
I have been trying to get a list of comments out using the new V3 Data API with mixed results.
For some videos, you only get a subset of comments out. I have noticed this on a few videos, but for this specific case I will use video ID = U55NGD9Jm7M
You can find all the comments on this video in the WebUI here: https://www.youtube.com/all_comments?v=U55NGD9Jm7M
At the time of posting, there were 5,499 comments on this video.
API Results:
When querying https://www.googleapis.com/youtube/v3/commentThreads?part=id,snippet,replies&textFormat=plainText&maxResults=100&videoId=U55NGD9Jm7M&key={YOUR_API_KEY} I am only getting about 317 comments (including paging, and counting all replies), sorted chronologically.
Verification Research:
If you select "Top Comments" from the drop down and then scroll down and click "More" over and over again, you get over 1,000 comments (I stopped at about 1,000)
If you then select "Newest First" from the dropdown and repeat the process (more ... more ... more) you will find that there are about 317 comments before you are unable to show any more comments.
I find it quite odd that there is a discrepancy in the UI, but thankful that the API lines up with part of the UI. Has anyone else noticed this? Is there a way to get the full text of all 5,499 comments?
Thanks!
Jason
Follow-up 1
As a follow-up, I was able to isolate one comment using View -> Source (thread ID z12wzfzhtybgz13kj22ocvsz2unrtn1qj04) and fetch all the information for this comment from the API here: https://www.googleapis.com/youtube/v3/commentThreads?part=id%2Csnippet%2Creplies&id=z12wzfzhtybgz13kj22ocvsz2unrtn1qj04&maxResults=100&key={YOUR_API_KEY}
It even mentions the correct VideoID that the comment is associated with. However, when you query by Video, this comment ID is not returned.
Follow-up 2
I refreshed the Web UI of the All Comments page, and a significantly different list of comments was returned.
The commentThreads.list call can only return a maximum of 100 results per request (see maxResults in the documentation). If you want to get more comment threads, you have to pass the nextPageToken value you get from your initial call into a subsequent API call (via the pageToken parameter).
For example:
https://www.googleapis.com/youtube/v3/commentThreads?part=snippet&videoId=U55NGD9Jm7M&maxResults=100&key=API_KEY
gives you 100 comment threads, and the nextPageToken is Cg0Qk9fa7fHgxgIgACgBEhQIARCY49LZ5eDGAhi4rNGIrZrGAhgCIGM. If you include that token in a new API call, like so:
https://www.googleapis.com/youtube/v3/commentThreads?part=snippet&videoId=U55NGD9Jm7M&maxResults=100&pageToken=Cg0Qk9fa7fHgxgIgACgBEhQIARCY49LZ5eDGAhi4rNGIrZrGAhgCIGM&key=API_KEY
You get a completely different set of comment threads. You can double-check this by specifying order=time in both API calls: you'll see that the earliest comment threads returned by the two calls are different, and you won't find the comment thread IDs from one call in the other call's results. To get even more comment threads, take the nextPageToken from the newer call's results and do the same thing again, until a call doesn't give you another nextPageToken, meaning you're on the last page and there are no more comment threads to return.
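Put together, the fetch loop looks something like this (a Python sketch; pageToken is the request parameter that carries each response's nextPageToken):

import requests

API_KEY = 'API_KEY'
url = 'https://www.googleapis.com/youtube/v3/commentThreads'
params = {'part': 'snippet', 'videoId': 'U55NGD9Jm7M', 'order': 'time',
          'maxResults': 100, 'key': API_KEY}
threads = []
while True:
    page = requests.get(url, params=params).json()
    threads.extend(page.get('items', []))
    if 'nextPageToken' not in page:
        break  # last page: no more comment threads
    params['pageToken'] = page['nextPageToken']
print(len(threads), 'comment threads')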
I am working with the YouTube Content ID API to fetch assets which were added today.
First I explored the API in the YouTube Content ID API explorer and found that asset search fits my criteria. I provided the required parameters and got a response, but it included only 25 results each time, so I used the nextPageToken received from the response to get the next assets. So far so good, but across these responses I noticed that resultsPerPage varies with each request, which confused me: I had assumed that resultsPerPage indicates the total number of assets for the particular content owner, and I started coding under that assumption. Now I'm unable to decide how to proceed.
Can anyone help me understand this?
According to employees at YouTube, neither totalResults nor resultsPerPage is reliable; you can only rely on the data that actually comes through. You can verify this with your TAM/Partner Manager.
In order to get the real count of your assets (I'm assuming you're using AssetSearch?), you have to keep paginating until there's no "nextPageToken" in the response, and count the results as you go.
By the way, if you set the parameter "maxResults=50" in your request, you'll get 50 per page (until there are fewer than 50 left to return, which should only happen on the last page, assuming your number of assets isn't divisible by 50).
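As a rough sketch, the counting loop might look like this in Python (the endpoint path, OAuth usage, and parameter names here are assumptions about the youtubePartner v1 assetSearch method, so double-check them against the docs):

import requests

session = requests.Session()
session.headers['Authorization'] = 'Bearer YOUR_OAUTH_TOKEN'  # Content ID API requires OAuth
url = 'https://www.googleapis.com/youtube/partner/v1/assetSearch'  # assumed endpoint
params = {'onBehalfOfContentOwner': 'CONTENT_OWNER_ID', 'maxResults': 50}
total = 0
while True:
    page = session.get(url, params=params).json()
    total += len(page.get('items', []))
    if 'nextPageToken' not in page:
        break  # no token means this was the last page
    params['pageToken'] = page['nextPageToken']
print(total, 'assets')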