I use the YouTube Reporting API to get video IDs and some metrics. I also use the YouTube Data API to get a list of ALL video names. But when I combine these two sets (to match names to the IDs), I find that a lot of names are missing.
HTTP request: GET https://www.googleapis.com/youtube/v3/playlistItems
What is the best HTTP request to get ALL existing video names historically?
Why doesn't playlistItems work properly and show all video names?
Thank you
def get_videos():
    # delete any JSON files left over from a previous run
    for f in glob.glob(f'YoutubeAnalytics/videos/*.json'):
        os.unlink(f)
    for ch_name, token_file, ch_id in channels:
        print(ch_name)
        print(ch_id, 'UU' + ch_id[2:])
        jsn = json.load(open(TOKEN_PATH + token_file))
        svc = get_youtube_data(jsn)
        name = token_file.replace('.json', '')
        rsp = svc.playlistItems().list(part='snippet', playlistId='UU' + ch_id[2:], maxResults=50).execute()
        # rsp = svc.channels().list(part='id,snippet', mine=True).execute()
        i = 0
        while 1:
            # this gets downloaded into the original Python folder
            with open(f'YoutubeAnalytics/videos/{name}_{i:04d}.json', 'w') as w:
                json.dump(rsp, w)
            if 'nextPageToken' in rsp:
                i += 1
                if i % 10 == 0:
                    print(i)
                rsp = svc.playlistItems().list(part='snippet', playlistId='UU' + ch_id[2:], maxResults=50, pageToken=rsp['nextPageToken']).execute()
            else:
                break
def make_videos_csv():
    htag = re.compile(r"\s#\S+")
    with open(f'YoutubeAnalytics/videos/videos.csv', 'w', encoding='utf-8', newline='') as csvf:
        wrt = csv.writer(csvf)
        for f in glob.glob(f'YoutubeAnalytics/videos/*.json'):
            jsn = json.load(open(f))
            for i in jsn['items']:
                snip = i['snippet']
                descr = snip['description']
                tags = ','.join([t[1:] for t in htag.findall(descr)])
                wrt.writerow((snip['resourceId']['videoId'], i['id'], i['etag'], snip['channelId'], snip['publishedAt'][:-1], snip['title'], snip['description'], tags))
Consider that some videos may be marked as private; in that case, if you're not the owner of the video(s), you won't be able to get the details of such video(s).
In your comment, you added these example video IDs:
zzr8YwY0y2U, zypHHsc3Q_Y, zyXCdTAdL2s, zvgtoZvL-Gs
Here are the results of each one:
zzr8YwY0y2U = Catch the Crooks - LEGO City - mini movie: Ep. 13
zypHHsc3Q_Y = LEGO Marvel: Spider-Man 'Vexed By Venom' Episode 4: Paging Ghost-Spider
zyXCdTAdL2s = PUBLIC INFO NOT AVAILABLE - this video is private.
zvgtoZvL-Gs = ~どんなどうぶつが暮らすサファリをつくる?篇~「みやぞんとあらぽんがつくる!つながる、ひろがる、 レゴ シティ!」
See the answers here and modify your code to handle any unavailable video(s) you might get with your current code, or just accept that you cannot get data for videos that are not publicly available.
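As an illustration only, here is a minimal sketch of how the playlistItems loop could skip unavailable uploads by also requesting the status part. It reuses the svc client from your code, but the helper name and the decision to drop non-public items are assumptions, not part of your original code:

def list_public_uploads(svc, uploads_playlist_id):
    # Hypothetical helper: collect (videoId, title) for publicly available uploads only.
    videos = []
    page_token = None
    while True:
        kwargs = dict(part='snippet,status', playlistId=uploads_playlist_id, maxResults=50)
        if page_token:
            kwargs['pageToken'] = page_token
        rsp = svc.playlistItems().list(**kwargs).execute()
        for item in rsp.get('items', []):
            # Private or deleted uploads come back with placeholder titles
            # ("Private video", "Deleted video") and a non-public privacyStatus.
            if item.get('status', {}).get('privacyStatus') != 'public':
                continue
            videos.append((item['snippet']['resourceId']['videoId'], item['snippet']['title']))
        page_token = rsp.get('nextPageToken')
        if not page_token:
            break
    return videos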
I have tried to apply the advice in this question, but I still don't seem to get the code to work. I am trying to pass two variables to the SQL query.
How to pass variables in python flask to mysqldb?
cur = mysql.connection.cursor()
for row in jdict:
    headline = row['title']
    url = row['url']
    cur.execute("INSERT INTO headlinetitles (Title, Url) VALUES ('%s','%s');", str(headline), str(url))
mysql.connection.commit()
I get the error TypeError: execute() takes from 2 to 3 positional arguments but 4 were given
I found the solution; maybe this will help someone. The TypeError came from passing the two values as separate positional arguments. cur.execute() expects the SQL string plus a single sequence of parameters, so pass the values as one tuple and let the driver fill in the %s placeholders (no quotes around them):
cur = mysql.connection.cursor()
for row in jdict:
    headline = row['title']
    url = row['url']
    # pass the values as one tuple; the driver substitutes the %s placeholders safely
    cur.execute("INSERT INTO headlinetitles (Title, Url) VALUES (%s, %s)", (str(headline), str(url)))
mysql.connection.commit()
Using AKSamplerDescriptor
I am using an adapted AKSampler example, in which I try to use the Sforzando output of the Fluid.sf3 melodic sounds. Sforzando creates an .sfz file for each instrument, but they all point, for the global sample, to one huge .wav file.
In all the instrument .sfz files there is an offset and end-point description for the part of the wave file to be used.
When I load the .sfz file I get a crash due to memory problems. It seems that for every region defined in the .sfz file the complete .wav file (140 MB) is loaded again.
Most likely, loading the sample file with AKSampleDescriptor, as done in the AKSampler example, ignores the offset and end point (AKSampleDescriptor.startPoint and AKSampleDescriptor.endPoint) and reloads the complete .wav file each time.
Is there a way to load just the wanted start-to-end part of the sample file? The complete file holds the sample data for all the instruments. (I know and use polyphony that extracts only one instrument at a time and works fine, but this is for another use.)
Or, and that seems best to me, just load the file once and then have the sample descriptors point to the data in memory?
Good suggestions, Rob. I just ran into this one-giant-WAV issue myself, having never seen it before. I was also using Sforzando for conversion. I'll look into adding the necessary capabilities to AKSampler. In the meantime, it might be easier to write a program to cut up the one WAV file into smaller pieces and adjust the SFZ accordingly.
Here is some Python 2.7 code to do this, which I have used successfully with a Sforzando-converted sf2 soundfont. It might need changes to work for you--there is huge variability among sfz files--but at least it might help you get started. This code requires the PyDub library for manipulating WAV audio.
import os
import re
from pydub import AudioSegment

def stripComments(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

def updateSplitList(splitList, regionLabels, values):
    if len(values) > 3:
        start = int(values['offset'])
        length = int(values['end']) - start
        name = regionLabels.pop(0)
        splitList.add((name, start, length))
def lookupSplitName(splitList, offset, end):
    # splitList holds (name, start, length) tuples; match on both start and end point
    for (name, start, length) in splitList:
        if offset == start and end == start + length:
            return name
    return None
def outputGroupAndRegion(outputFile, splitList, values):
    if values.has_key('lokey') and values.has_key('hikey') and values.has_key('pitch_keycenter'):
        outputFile.write('<group> lokey=%s hikey=%s pitch_keycenter=%s\n' % (values['lokey'], values['hikey'], values['pitch_keycenter']))
    elif values.has_key('key') and values.has_key('pitch_keycenter'):
        outputFile.write('<group> key=%s pitch_keycenter=%s\n' % (values['key'], values['pitch_keycenter']))
    if len(values) > 3:
        outputFile.write(' <region> ')
        if values.has_key('lovel') and values.has_key('hivel'):
            outputFile.write('lovel=%s hivel=%s ' % (values['lovel'], values['hivel']))
        if values.has_key('tune'):
            outputFile.write('tune=%s ' % values['tune'])
        if values.has_key('volume'):
            outputFile.write('volume=%s ' % values['volume'])
        if values.has_key('offset'):
            outputFile.write('offset=0 ')
        if values.has_key('end'):
            outputFile.write('end=%d ' % (int(values['end']) - int(values['offset'])))
        if values.has_key('loop_mode'):
            outputFile.write('loop_mode=%s ' % values['loop_mode'])
        if values.has_key('loop_start'):
            outputFile.write('loop_start=%d ' % (int(values['loop_start']) - int(values['offset'])))
        if values.has_key('loop_end'):
            outputFile.write('loop_end=%d ' % (int(values['loop_end']) - int(values['offset'])))
        outputFile.write('sample=samples/%s' % lookupSplitName(splitList, int(values['offset']), int(values['end'])) + '.wav\n')
def process(inputFile, outputFile):
    # create a list of region labels
    regionLabels = list()
    for line in open(inputFile):
        if line.strip().startswith('region_label'):
            regionLabels.append(line.strip().split('=')[1])

    # read entire input SFZ file
    sfz = open(inputFile).read()

    # strip comments and create a mixed list of <header> tags and key=value pairs
    sfz_list = stripComments(sfz).split()

    inSection = "none"
    default_path = ""
    global_sample = None
    values = dict()
    splitList = set()

    # parse the input SFZ data and build up splitList
    for item in sfz_list:
        if item.startswith('<'):
            inSection = item
            updateSplitList(splitList, regionLabels, values)
            values.clear()
            continue
        elif item.find('=') < 0:
            #print 'unknown:', item
            continue
        key, value = item.split('=')
        if inSection == '<control>' and key == 'default_path':
            default_path = value.replace('\\', '/')
        elif inSection == '<global>' and key == 'sample':
            global_sample = value.replace('\\', '/')
        elif inSection == '<region>':
            values[key] = value

    # split the wav file
    bigWav = AudioSegment.from_wav(global_sample)
    #print "%d channels, %d bytes/sample, %d frames/sec" % (bigWav.channels, bigWav.sample_width, bigWav.frame_rate)
    frate = float(bigWav.frame_rate)
    for (name, start, length) in splitList:
        startMs = 1000 * start / frate
        endMs = 1000 * (start + length) / frate
        wav = bigWav[startMs : endMs]
        wavName = 'samples/' + name + '.wav'
        wav.export(wavName, format='wav')

    # parse the input SFZ data again and generate the output SFZ
    for item in sfz_list:
        if item.startswith('<'):
            inSection = item
            outputGroupAndRegion(outputFile, splitList, values)
            values.clear()
            continue
        elif item.find('=') < 0:
            #print 'unknown:', item
            continue
        key, value = item.split('=')
        if inSection == '<control>' and key == 'default_path':
            default_path = value.replace('\\', '/')
        elif inSection == '<global>' and key == 'sample':
            global_sample = value.replace('\\', '/')
        elif inSection == '<region>':
            values[key] = value

dirPath = '000'
fileNameList = os.listdir(dirPath)
for fileName in fileNameList:
    if fileName.endswith('.sfz'):
        inputFile = os.path.join(dirPath, fileName)
        outputFile = open(fileName, 'w')
        print fileName
        process(inputFile, outputFile)
I wanted to extract noun phrases from tweets; the code follows. The problem is that it only processes 300 tweets at a time and takes 5 minutes. How can I speed it up?
By the way, some of the code was adapted from TextBlob.
I use the gate-EN-twitter model (https://gate.ac.uk/wiki/twitter-postagger.html) and the NLTK interface to the Stanford POS tagger to tag tweets.
from nltk.tag import StanfordPOSTagger
from nltk.tokenize import word_tokenize
import time, nltk

start_time = time.time()

CFG = {
    ('NNP', 'NNP'): 'NNP',
    ('NN', 'NN'): 'NNI',
    ('NNI', 'NN'): 'NNI',
    ('JJ', 'JJ'): 'JJ',
    ('JJ', 'NN'): 'NNI',
}

st = StanfordPOSTagger('/models/gate-EN-twitter.model', '/twitie_tagger/twitie_tag.jar', encoding='utf-8')

def _normalize_tags(chunk):
    '''Normalize the corpus tags.
    ("NN", "NN-PL", "NNS") -> "NN"
    '''
    ret = []
    for word, tag in chunk:
        if tag == 'NP-TL' or tag == 'NP':
            ret.append((word, 'NNP'))
            continue
        if tag.endswith('-TL'):
            ret.append((word, tag[:-3]))
            continue
        if tag.endswith('S'):
            ret.append((word, tag[:-1]))
            continue
        ret.append((word, tag))
    return ret
def noun_phrase_count(text):
    matches1 = []
    print('len(text)', len(text))
    for i in range(len(text) // 1000):
        # process the text in 1000-character chunks
        tokenized_text = word_tokenize(text[i * 1000:i * 1000 + 1000])
        classified_text = st.tag(tokenized_text)
        tags = _normalize_tags(classified_text)
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = t1[1], t2[1]
                value = CFG.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = '%s %s' % (t1[0], t2[0])
                    pos = value
                    tags.insert(x, (match, pos))
                    break
        matches = [t[0] for t in tags if t[1] in ['NNP', 'NNI']]
        matches1 += matches
    print("--- %s seconds ---" % (time.time() - start_time))
    fdist = nltk.FreqDist(matches1)
    return [(tag, num) for (tag, num) in fdist.most_common()]

noun_phrase_count(tweets)
This looks like a duplicate of Stanford POS tagger with GATE twitter model is slow, so you may find more info there.
Additionally, if there's any chance of encountering identical inputs (tweets) twice or more, you can consider a dictionary with the tweet (plain str) as key and the tagged result as value, so that when you encounter a tweet you first check whether it's already in your dict. If not, tag it and put it there (and if this route is viable, why not pickle/unpickle that dictionary so that debugging and subsequent runs of your code go faster as well).
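For instance, here is a minimal sketch of that caching idea, assuming the st tagger and word_tokenize from your code; the cache file name is just a placeholder:

import os
import pickle

CACHE_FILE = 'tag_cache.pkl'  # hypothetical cache file name

# load a previously pickled cache if one exists, otherwise start empty
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, 'rb') as f:
        tag_cache = pickle.load(f)
else:
    tag_cache = {}

def tag_cached(tweet):
    # tag each distinct tweet only once and reuse the stored result afterwards
    if tweet not in tag_cache:
        tag_cache[tweet] = st.tag(word_tokenize(tweet))
    return tag_cache[tweet]

def save_cache():
    # persist the cache so subsequent runs can skip already-tagged tweets
    with open(CACHE_FILE, 'wb') as f:
        pickle.dump(tag_cache, f)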
I am trying to export JIRA tasks via the API and I hit a wall in Excel because JIRA only allows a limit of 1000 results. I can do a manual export to CSV and get over 1000 results, and I was wondering if anyone has had any luck with large JIRA exports via the REST API and can point me in the right direction.
I'm guessing an export to CSV and then pulling it into Excel for reporting might work?
Thanks!
JIRA's REST API supports pagination to prevent clients of the API from putting too much load on the application. This means you cannot pull in all issue data with one REST call.
You can only retrieve "pages" of max 1000 issues using the paging query parameters startAt and maxResults. See the Pagination section here.
If you run a JIRA standalone server then you can tweak the maximum number of results that JIRA returns, but for a cloud instance this is not possible. See this KB article for more info.
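To illustrate the paging described above, here is a minimal sketch using the Python requests library; the base URL, credentials and JQL are placeholders, not anything from your instance:

import requests

BASE_URL = 'https://jira.example.com'  # placeholder base URL
AUTH = ('user', 'api-token')           # placeholder credentials

def fetch_all_issues(jql, page_size=1000):
    # page through /rest/api/2/search with startAt/maxResults until 'total' is reached
    issues = []
    start_at = 0
    while True:
        rsp = requests.get(
            BASE_URL + '/rest/api/2/search',
            params={'jql': jql, 'startAt': start_at, 'maxResults': page_size},
            auth=AUTH,
        )
        rsp.raise_for_status()
        body = rsp.json()
        page = body['issues']
        issues.extend(page)
        start_at += len(page)
        if not page or start_at >= body['total']:
            break
    return issues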
Using jira-python (according to your tag):
# search_issues can only return up to 1000 issues per call, so if there are more
# we have to search again, starting at startAt=count
issues = []
count = 0
while True:
    tmp_issues = jira_connection.search_issues('', startAt=count, maxResults=999)
    if len(tmp_issues) == 0:
        # Since Python does not offer do-while, we have to break here.
        break
    issues.extend(tmp_issues)
    # advance by the number actually returned, in case the server caps the page size
    count += len(tmp_issues)
The code below fetches results 200 records at a time until all records are exported.
You can export a maximum of 1000 records per request by increasing the page size; the callback keeps re-issuing the request until everything is exported.
const request = require('request')
const fs = require('fs')
const chalk = require('chalk')

var windowSlider = 200
var totlExtractedRecords = 0;

fs.writeFileSync('output.txt', '')

const option = {
    url: 'https://jira.yourdomain.com/rest/api/2/search',
    json: true,
    qs: {
        jql: "project in (xyz)",
        maxResults: 200,
        startAt: 0,
    }
}

const callback = (error, response) => {
    const body = response.body
    console.log(response.body)
    const dataArray = body.issues
    const total = body.total
    totlExtractedRecords = dataArray.length + totlExtractedRecords
    if (totlExtractedRecords > 0) {
        option.qs.startAt = windowSlider + option.qs.startAt
    }
    dataArray.forEach(element => {
        fs.appendFileSync('output.txt', element.key + '\n')
    })
    console.log(chalk.red.inverse('Total extracted data : ' + totlExtractedRecords))
    console.log(chalk.red.inverse('Total data: ' + total))
    if (totlExtractedRecords < total) {
        console.log('Re - Running with start as ' + option.qs.startAt)
        console.log('Re - Running with maxResult as ' + option.qs.maxResults)
        request(option, callback).auth('api-reader', 'APITOKEN', true)
    }
}

request(option, callback).auth('api-reader', 'APITOKEN', true)
I want to get as much video information as possible from YouTube for my project. I know that the page limit is 100.
I wrote the following code:
ArrayList<String> videos = new ArrayList<>();
int i = 1;
String peticion = "http://gdata.youtube.com/feeds/api/videos?category=Comedy&alt=json&max-results=50&page=" + i;
URL oracle = new URL(peticion);
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine = in.readLine();
while (in.readLine() != null) {
    inputLine = inputLine + in.readLine();
}
System.out.println(inputLine);
JSONObject jsonObj = new JSONObject(inputLine);
JSONObject jsonFeed = jsonObj.getJSONObject("feed");
JSONArray jsonArr = jsonFeed.getJSONArray("entry");
while (i <= 100) {
    for (int j = 0; j < jsonArr.length(); j++) {
        videos.add(jsonArr.getJSONObject(j).getJSONObject("id").getString("$t"));
        System.out.println("Numero " + videosTotales + jsonArr.getJSONObject(j).getJSONObject("id").getString("$t"));
        videosTotales++;
    }
    i++;
}
When the program finishes, I have 5000 videos per category, but I need much more, much much more, and the limit is page = 100.
So, how can I get more than 10 million videos?
Thank you!
Are those 5000 also unique IDs?
I see the use of max-results=50, but no start-index parameter in your URL.
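For example (as a sketch only, since the exact values depend on your query), paging in API v2 is done by advancing start-index in steps of max-results:

http://gdata.youtube.com/feeds/api/videos?category=Comedy&alt=json&max-results=50&start-index=51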
There is a limit on the results you can get per request. There is also a limit on the number of requests that you can send within some time interval. By checking the status code of the response and any error message, you can find these limits, as they may change over time.
Besides the category parameter, use some other parameters too. For instance, you may vary the q parameter (with some keywords) and/or the order parameter to get a different result set.
See the documentation for available parameters.
Note that you are using API version 2, which is deprecated. There is an API version 3.
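For reference, a roughly equivalent v3 request looks like the sketch below; it pages with pageToken instead of start-index, and YOUR_API_KEY, CATEGORY_ID and NEXT_PAGE_TOKEN are placeholders:

https://www.googleapis.com/youtube/v3/search?part=snippet&type=video&videoCategoryId=CATEGORY_ID&maxResults=50&pageToken=NEXT_PAGE_TOKEN&key=YOUR_API_KEY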