Reading the fileset from a torrent - parsing

I want to (quickly) put a program/script together to read the fileset from a .torrent file. I want to then use that set to delete any files from a specific directory that do not belong to the torrent.
Any recommendations on a handy library for reading this index from the .torrent file? Whilst I don't object to it, I don't want to be digging deep into the bittorrent spec and rolling a load of code from scratch for this simple purpose.
I have no preference on language.

I would use rasterbar's libtorrent which is a small and fast C++ library.
To iterate over the files you could use the torrent_info class (begin_files(), end_files()).
There's also a python interface for libtorrent:
import libtorrent
info = libtorrent.torrent_info('test.torrent')
for f in info.files():
    print "%s - %s" % (f.path, f.size)
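Since the goal is to delete anything in a directory that is not part of the torrent, here is a rough sketch of that second step built on the snippet above. It assumes the same files() interface shown there, that the hypothetical download_dir is the directory to clean, and that the paths stored in the torrent are relative to it; it only prints candidates until you swap in a real delete.
import os
import libtorrent

download_dir = "/path/to/downloads"   # assumption: the directory to clean up
info = libtorrent.torrent_info("test.torrent")
wanted = set(os.path.normpath(f.path) for f in info.files())

for root, dirs, names in os.walk(download_dir):
    for name in names:
        full = os.path.join(root, name)
        rel = os.path.normpath(os.path.relpath(full, download_dir))
        if rel not in wanted:
            # dry run: print first, switch to os.remove(full) once the list looks right
            print("would delete: " + full)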

Effbot has your question answered. Here is the complete code to read the list of files from a .torrent file (Python 2.4+):
import re

def tokenize(text, match=re.compile(r"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield "s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.next, src.next())
        for token in src:  # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

if __name__ == "__main__":
    data = open("test.torrent", "rb").read()
    torrent = decode(data)
    for file in torrent["info"]["files"]:
        print "%r - %d bytes" % ("/".join(file["path"]), file["length"])

Here's the code from Constantine's answer above, slightly modified to handle Unicode characters in torrent filenames and fileset filenames in torrent info:
import re

def tokenize(text, match=re.compile(r"([idel])|(\d+):|(-?\d+)").match):
    i = 0
    while i < len(text):
        m = match(text, i)
        s = m.group(m.lastindex)
        i = m.end()
        if m.lastindex == 2:
            yield "s"
            yield text[i:i+int(s)]
            i = i + int(s)
        else:
            yield s

def decode_item(next, token):
    if token == "i":
        # integer: "i" value "e"
        data = int(next())
        if next() != "e":
            raise ValueError
    elif token == "s":
        # string: "s" value (virtual tokens)
        data = next()
    elif token == "l" or token == "d":
        # container: "l" (or "d") values "e"
        data = []
        tok = next()
        while tok != "e":
            data.append(decode_item(next, tok))
            tok = next()
        if token == "d":
            data = dict(zip(data[0::2], data[1::2]))
    else:
        raise ValueError
    return data

def decode(text):
    try:
        src = tokenize(text)
        data = decode_item(src.next, src.next())
        for token in src:  # look for more tokens
            raise SyntaxError("trailing junk")
    except (AttributeError, ValueError, StopIteration):
        raise SyntaxError("syntax error")
    return data

n = 0

if __name__ == "__main__":
    data = open("C:\\Torrents\\test.torrent", "rb").read()
    torrent = decode(data)
    for file in torrent["info"]["files"]:
        n = n + 1
        filenamepath = file["path"]
        print str(n) + " -- " + ', '.join(map(str, filenamepath))
        fname = ', '.join(map(str, filenamepath))
        print fname + " -- " + str(file["length"])

bencode.py from the original Mainline BitTorrent 5.x client (http://download.bittorrent.com/dl/BitTorrent-5.2.2.tar.gz) would give you pretty much the reference implementation in Python.
It has an import dependency on the BTL package but that's trivially easy to remove. You'd then look at bencode.bdecode(filecontent)['info']['files'].
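For illustration, a minimal sketch of that lookup, assuming bencode.py has been copied next to your script (with the BTL import removed) and the torrent is a multi-file torrent:
import bencode  # the bencode.py module lifted from the BitTorrent 5.x tarball

with open("test.torrent", "rb") as fh:
    meta = bencode.bdecode(fh.read())

for entry in meta["info"]["files"]:
    print("/".join(entry["path"]) + " - %d bytes" % entry["length"])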

Expanding on the ideas above, I did the following:
~> cd ~/bin
~/bin> ls torrent*
torrent-parse.py torrent-parse.sh
~/bin> cat torrent-parse.py
# torrent-parse.py
import sys
import libtorrent

# get the input torrent file
if len(sys.argv) > 1:
    torrent = sys.argv[1]
else:
    print "Missing param: torrent filename"
    sys.exit()

# get names of files in the torrent file
info = libtorrent.torrent_info(torrent)
for f in info.files():
    print "%s - %s" % (f.path, f.size)
~/bin> cat torrent-parse.sh
#!/bin/bash
if [ $# -lt 1 ]; then
    echo "Missing param: torrent filename"
    exit 0
fi
python torrent-parse.py "$*"
You'll want to set permissions appropriately to make the shell script executable:
~/bin> chmod a+x torrent-parse.sh
Hope this helps someone :)

Related

How to handle large in-memory data in Apache Beam Pipeline to run on Google Dataflow Runner

I have the following simple code. The variable word_to_id takes about 50 MB of memory, which causes this error when submitting the pipeline to the Dataflow Runner:
413 Request Entity Too Large
word_to_id = {tok: idx for idx, tok in enumerate(vocab)}

def extract_word_ids(tokens):
    return [word_to_id[w] for w in tokens if word_to_id.get(w, None)]

with beam.pipeline.Pipeline(options=get_pipeline_option()) as p:
    lines = p | 'Read' >> beam.io.ReadFromText(path)

    word_ids = (
        lines
        | 'TokenizeLines' >> beam.Map(words)
        | 'IntergerizeTokens' >> beam.Map(extract_word_ids)
    )
Can anyone suggest an alternative solution for this?
You can use GCS buckets as sources for both the text and the variable, and pass the variable as a side input. A side input can be consumed as a list, a dict, or a singleton.
Here is an example of a word count that removes stopwords stored in a GCS bucket:
import re

import apache_beam as beam
from apache_beam import FlatMap
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.transforms.combiners import Count

with beam.Pipeline() as p:
    path = "gs://dataflow-samples/shakespeare/kinglear.txt"
    stopwords_path = "<BUCKET/stopwords>"
    output_path = "<BUCKET>"

    def split_words(text, stopwords):
        words = re.split(r'\W+', text)
        try:
            words.remove('')
        except ValueError:
            pass
        return [x for x in words if x.lower() not in stopwords]

    stopwords_p = (p | "Read Stop Words" >> ReadFromText(stopwords_path)
                     | FlatMap(lambda x: x.split(", ")))
    text = p | "Read Text" >> ReadFromText(path)

    (text | "Split Words" >> FlatMap(split_words, stopwords=beam.pvalue.AsList(stopwords_p))
          | "Count" >> Count.PerElement()
          | "Write" >> WriteToText(file_path_prefix=output_path, file_name_suffix=".txt"))
Finally, I managed to solve it and it works. I used DoFn.setup to initialize my variable from the GCS bucket.
class IntergerizeTokens(beam.DoFn):
    """Beam line-processing function."""

    def __init__(self, vocab_filename):
        self.vocab_filename = vocab_filename

    def setup(self):
        # read the vocabulary from the GCS bucket (assumes one token per line)
        with tf.io.gfile.GFile(tf.io.gfile.glob(self.vocab_filename + '*')[0], 'r') as fh:
            vocab = fh.read().splitlines()
        self.word_to_id = {tok: idx for idx, tok in enumerate(vocab)}
        print('Setup done!')

    def process(self, tokens):
        """Takes a list of tokens and yields the corresponding list of word ids."""
        return [[self.word_to_id[w] for w in tokens if self.word_to_id.get(w, None)]]
Now pass the DoFn to ParDo:
with beam.pipeline.Pipeline(options=get_pipeline_option()) as p:
    lines = p | 'Read' >> beam.io.ReadFromText(path)

    word_ids = (
        lines
        | 'TokenizeLines' >> beam.Map(words)
        | 'IntergerizeTokens' >> beam.ParDo(IntergerizeTokens(vocab_temp_path))
    )
This is one way to solve it. I think DoFn.setup is good for initializing large variables in memory.

Change Named Entity Recognition Format from ENAMEX to CoNLL

I have a dataset which is in ENAMEX format like this:
<ENAMEX TYPE="LOCATION">Italy</ENAMEX>'s business world was rocked by the announcement <TIMEX TYPE="DATE">last Thursday</TIMEX> that Mr. <ENAMEX TYPE="PERSON">Verdi</ENAMEX> would leave his job as vicepresident of <ENAMEX TYPE="ORGANIZATION">Music Masters of Milan, Inc</ENAMEX> to become operations director of <ENAMEX TYPE="ORGANIZATION">Arthur Andersen</ENAMEX>.
I want to change it into CoNLL format:
Italy LOCATION
's O
business O
world O
was O
rocked O
by O
the O
announcement O
last DATE
Thursday DATE
...
. O
How can I do that? Is there a standard script for such format conversion?
I wrote one myself; it worked for me, though it is not heavily tested:
from __future__ import unicode_literals
import os
import re

import en_core_web_sm  # spaCy English model

# to convert formats such as <ENAMEX type="LOCATION">Italy</ENAMEX> is experiencing an economic boom.

def xml_iter(file_):
    with open(file_, 'r') as fin:
        for line in fin:
            yield line.strip()


def markupline2bio(line):
    record = line.split('\t')[0]
    prev_start = 0
    prev_end = 0
    all_tokens = []
    all_tags = []
    for f in re.finditer(r'<ENAMEX\s+TYPE=\"(.+?)\">(.+?)</ENAMEX>', record):
        annotations = re.findall(r'<ENAMEX\s+TYPE=\"(.+?)\">(.+?)</ENAMEX>', record[f.start(0):f.end(0)])
        before_text = record[prev_end:f.start(0)]
        prev_start, prev_end = f.start(0), f.end(0)
        for tok in nlp(before_text):
            if str(tok).strip():
                all_tokens.append(tok)
                all_tags.append('O')
        for phrasetag in annotations:
            tag, phrase = annotations[0]
            tokens = nlp(phrase)
            for entity_tok_index, tok in enumerate(tokens):
                if str(tok).strip():
                    all_tokens.append(tok)
                    if entity_tok_index == 0:
                        all_tags.append("B-" + tag)
                    else:
                        all_tags.append("I-" + tag)
                else:
                    entity_tok_index -= 1
    after_text = record[prev_end:]
    for tok in nlp(after_text):
        if str(tok).strip():
            all_tokens.append(tok)
            all_tags.append('O')
    return all_tokens, all_tags


if __name__ == '__main__':
    data_dir = './data/indonesian_bert_all/Indonesian/ner/'
    xml_iterator = xml_iter(os.path.join(data_dir, 'data_train_ugm.txt'))
    output_file = os.path.join(data_dir, 'data_train_ugm.bio')
    #nlp = spacy.load("en_core_web_sm")
    nlp = en_core_web_sm.load()
    with open(output_file, 'w') as fout:
        for i, line in enumerate(xml_iterator):
            all_tokens, all_tags = markupline2bio(line.strip())
            for tok, tag in zip(all_tokens, all_tags):
                fout.write(str(tok) + '\t' + tag)
                fout.write('\n')
            fout.write('\n')

Using AKSampleDescriptor

I am using an adapted AKSampler example, in which I try to use the Sforzando output of the Fluid.sf3 melodic sounds. Sforzando creates a .sfz file for each instrument, but they all point, for the global sample, to one huge .wav file.
In all the instrument .sfz files there is an offset and end-point description for the part of the wave file to be used.
When I load the .sfz file I get a crash due to memory problems. It seems that for every defined region in the .sfz file the complete .wav file (140 MB) is loaded again.
Most likely, loading the sample file with the AKSampleDescriptor, as done in the AKSampler example, ignores the offset and end point (AKSampleDescriptor.startPoint and AKSampleDescriptor.endPoint) and reloads the complete .wav file each time.
Is there a way to load just the wanted start-to-end part of the sample file? The complete file has the sample data for all the instruments. (I know and use polyphony, which extracts only one instrument at a time and works fine, but this is for another use.)
Or, and that seems best to me, load the file once and then have the sample descriptors point to the data in memory.
Good suggestions, Rob. I just ran into this one-giant-WAV issue myself, having never seen it before. I was also using Sforzando for conversion. I'll look into adding the necessary capabilities to AKSampler. In the meantime, it might be easier to write a program to cut up the one WAV file into smaller pieces and adjust the SFZ accordingly.
Here is some Python 2.7 code to do this, which I have used successfully with a Sforzando-converted sf2 soundfont. It might need changes to work for you--there is huge variability among sfz files--but at least it might help you get started. This code requires the PyDub library for manipulating WAV audio.
import os
import re
from pydub import AudioSegment

def stripComments(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

def updateSplitList(splitList, regionLabels, values):
    if len(values) > 3:
        start = int(values['offset'])
        length = int(values['end']) - start
        name = regionLabels.pop(0)
        splitList.add((name, start, length))

def lookupSplitName(splitList, offset, end):
    # a split matches when both its start offset and its end point (start + length) agree
    for (name, start, length) in splitList:
        if offset == start and end == start + length:
            return name
    return None

def outputGroupAndRegion(outputFile, splitList, values):
    if values.has_key('lokey') and values.has_key('hikey') and values.has_key('pitch_keycenter'):
        outputFile.write('<group> lokey=%s hikey=%s pitch_keycenter=%s\n' % (values['lokey'], values['hikey'], values['pitch_keycenter']))
    elif values.has_key('key') and values.has_key('pitch_keycenter'):
        outputFile.write('<group> key=%s pitch_keycenter=%s\n' % (values['key'], values['pitch_keycenter']))
    if len(values) > 3:
        outputFile.write(' <region> ')
        if values.has_key('lovel') and values.has_key('hivel'):
            outputFile.write('lovel=%s hivel=%s ' % (values['lovel'], values['hivel']))
        if values.has_key('tune'):
            outputFile.write('tune=%s ' % values['tune'])
        if values.has_key('volume'):
            outputFile.write('volume=%s ' % values['volume'])
        if values.has_key('offset'):
            outputFile.write('offset=0 ')
        if values.has_key('end'):
            outputFile.write('end=%d ' % (int(values['end']) - int(values['offset'])))
        if values.has_key('loop_mode'):
            outputFile.write('loop_mode=%s ' % values['loop_mode'])
        if values.has_key('loop_start'):
            outputFile.write('loop_start=%d ' % (int(values['loop_start']) - int(values['offset'])))
        if values.has_key('loop_end'):
            outputFile.write('loop_end=%d ' % (int(values['loop_end']) - int(values['offset'])))
        outputFile.write('sample=samples/%s' % lookupSplitName(splitList, int(values['offset']), int(values['end'])) + '.wav\n')

def process(inputFile, outputFile):
    # create a list of region labels
    regionLabels = list()
    for line in open(inputFile):
        if line.strip().startswith('region_label'):
            regionLabels.append(line.strip().split('=')[1])

    # read entire input SFZ file
    sfz = open(inputFile).read()

    # strip comments and create a mixed list of <header> tags and key=value pairs
    sfz_list = stripComments(sfz).split()

    inSection = "none"
    default_path = ""
    global_sample = None
    values = dict()
    splitList = set()

    # parse the input SFZ data and build up splitList
    for item in sfz_list:
        if item.startswith('<'):
            inSection = item
            updateSplitList(splitList, regionLabels, values)
            values.clear()
            continue
        elif item.find('=') < 0:
            #print 'unknown:', item
            continue
        key, value = item.split('=')
        if inSection == '<control>' and key == 'default_path':
            default_path = value.replace('\\', '/')
        elif inSection == '<global>' and key == 'sample':
            global_sample = value.replace('\\', '/')
        elif inSection == '<region>':
            values[key] = value

    # split the wav file
    bigWav = AudioSegment.from_wav(global_sample)
    #print "%d channels, %d bytes/sample, %d frames/sec" % (bigWav.channels, bigWav.sample_width, bigWav.frame_rate)
    frate = float(bigWav.frame_rate)
    for (name, start, length) in splitList:
        startMs = 1000 * start / frate
        endMs = 1000 * (start + length) / frate
        wav = bigWav[startMs : endMs]
        wavName = 'samples/' + name + '.wav'
        wav.export(wavName, format='wav')

    # parse the input SFZ data again and generate the output SFZ
    for item in sfz_list:
        if item.startswith('<'):
            inSection = item
            outputGroupAndRegion(outputFile, splitList, values)
            values.clear()
            continue
        elif item.find('=') < 0:
            #print 'unknown:', item
            continue
        key, value = item.split('=')
        if inSection == '<control>' and key == 'default_path':
            default_path = value.replace('\\', '/')
        elif inSection == '<global>' and key == 'sample':
            global_sample = value.replace('\\', '/')
        elif inSection == '<region>':
            values[key] = value

dirPath = '000'
fileNameList = os.listdir(dirPath)
for fileName in fileNameList:
    if fileName.endswith('.sfz'):
        inputFile = os.path.join(dirPath, fileName)
        outputFile = open(fileName, 'w')
        print fileName
        process(inputFile, outputFile)

how to speed up "POS tag" with StanfordPOSTagger?

I want to extract noun phrases from tweets; the code is below. The problem is that it processes only 300 tweets at a time and takes 5 minutes. How can I speed this up?
By the way, some of the code was adapted from TextBlob.
I use the gate-EN-twitter model (https://gate.ac.uk/wiki/twitter-postagger.html) and the NLTK interface to the Stanford POS tagger to tag tweets:
from nltk.tag import StanfordPOSTagger
from nltk.tokenize import word_tokenize
import time, nltk

start_time = time.time()

CFG = {
    ('NNP', 'NNP'): 'NNP',
    ('NN', 'NN'): 'NNI',
    ('NNI', 'NN'): 'NNI',
    ('JJ', 'JJ'): 'JJ',
    ('JJ', 'NN'): 'NNI',
}

st = StanfordPOSTagger('/models/gate-EN-twitter.model', '/twitie_tagger/twitie_tag.jar', encoding='utf-8')

def _normalize_tags(chunk):
    '''Normalize the corpus tags.
    ("NN", "NN-PL", "NNS") -> "NN"
    '''
    ret = []
    for word, tag in chunk:
        if tag == 'NP-TL' or tag == 'NP':
            ret.append((word, 'NNP'))
            continue
        if tag.endswith('-TL'):
            ret.append((word, tag[:-3]))
            continue
        if tag.endswith('S'):
            ret.append((word, tag[:-1]))
            continue
        ret.append((word, tag))
    return ret

def noun_phrase_count(text):
    matches1 = []
    print('len(text)', len(text))
    for i in range(len(text)//1000):
        # tag the text in chunks of 1000 characters
        tokenized_text = word_tokenize(text[i*1000:i*1000+1000])
        classified_text = st.tag(tokenized_text)
        tags = _normalize_tags(classified_text)
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = t1[1], t2[1]
                value = CFG.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = '%s %s' % (t1[0], t2[0])
                    pos = value
                    tags.insert(x, (match, pos))
                    break
        matches = [t[0] for t in tags if t[1] in ['NNP', 'NNI']]
        matches1 += matches
        print("--- %s seconds ---" % (time.time() - start_time))
    fdist = nltk.FreqDist(matches1)
    return [(tag, num) for (tag, num) in fdist.most_common()]

noun_phrase_count(tweets)
Looks like a duplicate of Stanford POS tagger with GATE twitter model is slow, so you may find more info there.
Additionally, if there's any chance of encountering identical inputs (tweets) twice (or more), you can consider a dictionary with the tweet (plain str) as key and the tagged result as value, so that when you encounter a tweet you first check whether it's already in your dict. If not, tag it and put it there (and if this route is viable, why not pickle/unpickle that dictionary, so that debugging and subsequent runs of your code go faster as well).
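A rough sketch of that caching idea, reusing st and word_tokenize from the question's code; the cache file name is just a placeholder:
import os
import pickle

CACHE_FILE = "tag_cache.pkl"   # placeholder path for the pickled cache

# load the cache from a previous run, if there is one
tag_cache = {}
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, "rb") as fh:
        tag_cache = pickle.load(fh)

def tag_tweet_cached(tweet):
    # only call the (slow) Stanford tagger for tweets not seen before
    if tweet not in tag_cache:
        tag_cache[tweet] = st.tag(word_tokenize(tweet))
    return tag_cache[tweet]

def save_tag_cache():
    # persist the cache so subsequent runs skip already-tagged tweets
    with open(CACHE_FILE, "wb") as fh:
        pickle.dump(tag_cache, fh)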

How do I generate a websocket handshake from Lua?

Is there an existing function to generate the server response key in Lua? Here is the solution in python: websocket handshake problem
I do have the two key numbers captured, the spaces counted, the third string captured and hoping the rest lies in an existing function...
If you need the older handshake (protocol 0), you can use the following code to get the handshake value from the two keys:
md5 = require 'md5'
function getnumbers(str)
    local num = ""
    str:gsub('%d', function(d) num = num .. d end)
    return tonumber(num)
end

function countspaces(str)
    return select(2, str:gsub(' ', ' '))
end

function to32bitint(i)
    return string.char(i/256^3 % 256, i/256^2 % 256, i/256 % 256, i % 256)
end

function websocketresponse(key1, key2, ending8)
    local n1, s1 = getnumbers(key1), countspaces(key1)
    local n2, s2 = getnumbers(key2), countspaces(key2)
    local cat = to32bitint(n1/s1) .. to32bitint(n2/s2) .. ending8
    return md5.sum(cat)
end
websocket_key1 = "18x 6]8vM;54 *(5: { U1]8 z [ 8"
websocket_key2 = "1_ tx7X d < nw 334J702) 7]o}` 0"
ending8 = "Tm[K T2u"
print(websocketresponse(websocket_key1, websocket_key2, ending8))
--> fQJ,fN/4F4!~K~MH
This produces the same value as the example given in the protocol draft. The example uses the MD5 library to calculate the checksum, which is available precompiled in LuaForWindows.
The implementation for WebSocket protocol version 6 is much simpler:
crypto = require 'crypto'
mime = require 'mime'
function websocketresponse6(key)
    local magic = key .. "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
    return (mime.b64(crypto.digest('sha1', magic, true)))
end
key6 = "x3JJHMbDL1EzLkh9GBhXDw=="
print(websocketresponse6(key6))
--> HSmrc0sMlYUkAGmm5OPpG2HaGWk=
This example uses LuaCrypto for the SHA-1 sum and the MIME module from LuaSocket.
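Since the question links a Python solution, here is the same version-6 computation sketched with Python's standard library for comparison; the magic GUID is the fixed one from the protocol:
import base64
import hashlib

def websocket_accept(key):
    # SHA-1 of the client key plus the fixed GUID, then base64-encoded
    magic = key + "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"
    return base64.b64encode(hashlib.sha1(magic.encode("ascii")).digest()).decode("ascii")

print(websocket_accept("x3JJHMbDL1EzLkh9GBhXDw=="))
# --> HSmrc0sMlYUkAGmm5OPpG2HaGWk=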
Have a look at the lua-websockets implementation. Here is the sha1 stuff.
