I have ~70gb output of MD simulations. A pattern of a fixed-number-of-lines explanation and a fixed-number-of-lines data regularly repeat in the file. How can I read the file in Dask Dataframe chunk by chunk in which the explanation lines are ignored?
I successfully wrote a lambda function in the skiprows argument of the pandas.read_csv to ignore the explanation lines and only read the data lines. I converted the pandas-entered code to dask one but it does not work. Here you can see the dask code written by replacing pandas.read_csv with dd.read_csv:
# First extracting number of atoms and hence, number of data lines:
with open(filename[0],mode='r') as file: # The same as Chanil's code
line = file.readline()
line = file.readline()
line = file.readline()
line = file.readline() # natoms
natoms = int(line)
skiplines = 9 # Number of explanation lines repeating after nnatoms lines of data
def logic_for_chunk(index):
"""This function read a chunk """
if index % (natoms+skiplines) > 8:
return False
return True
df_chunk = dd.read_csv('trajectory.txt',sep=' ',header=None,index_col=False,skiprows=lambda x: logic_for_chunk(x),chunksize=natoms)
Here the indexes of the dataframe is line numbers of the file. Using above code, at the first chunk, lines 0 to 8 in file are ignored, then the lines 9 to 58 are read. At the next chunk, the line 59 to 67 are ignored and then a natoms-size chunk from line 68 to 117 are read. This happens until all the data snapshots are read.
Unfortunately, while the above code works well in pandas, it does not works in dask. How can I implement a similar procedure in dask dataframe?
The dask dataframe read_csv function cuts the file up at byte locations. It is unable to determine exactly how many lines are in each partition, so it is unwise to depend on the row index within each partition.
If there is some other way to detect a bad line then I would try that. Ideally you will be able to determine a bad line based on the content of the line, not on its location within the file (like every eighth line).
I am following this document clustering tutorial. As an input I give a txt file which can be downloaded here. It's a combined file of 3 other txt files divided with a use of \n. After creating a tf-idf matrix I received this warning:
,,UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.
'stop_words.' % sorted(inconsistent))".
I guess it has something to do with the order of lemmatization and stop words removal, but as this is my first project in txt processing, I am a bit lost and I don't know how to fix this...
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")
def tokenize_and_stem(text):
# first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
filtered_tokens = []
# filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
stems = [stemmer.stem(t) for t in filtered_tokens]
return stems
def tokenize_only(text):
# first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
filtered_tokens = []
# filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
return filtered_tokens
totalvocab_stemmed = []
totalvocab_tokenized = []
with open('shortResultList.txt', encoding="utf8") as synopses:
for i in synopses:
allwords_stemmed = tokenize_and_stem(i) # for each item in 'synopses', tokenize/stem
totalvocab_stemmed.extend(allwords_stemmed) # extend the 'totalvocab_stemmed' list
allwords_tokenized = tokenize_only(i)
totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print ('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print (vocab_frame.head())
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
min_df=0.2, stop_words='english',
use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
with open('shortResultList.txt', encoding="utf8") as synopses:
tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses
print(tfidf_matrix.shape)
The warning is trying to tell you that if your text contains "always" it will be normalised to "alway" before matching against your stop list which includes "always" but not "alway". So it won't be removed from your bag of words.
The solution is to make sure that you preprocess your stop list to make sure that it is normalised like your tokens will be, and pass the list of normalised words as stop_words to the vectoriser.
I had the same problem and for me the following worked:
include stopwords into tokenize function and then
remove stopwords parameter from tfidfVectorizer
Like so:
1.
stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")
def tokenize_and_stem(text):
tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
filtered_tokens = []
for token in tokens:
if re.search('[a-zA-Z]', token):
filtered_tokens.append(token)
#exclude stopwords from stemmed words
stems = [stemmer.stem(t) for t in filtered_tokens if t not in stopwords]
return stems
Delete stopwords parameter from vectorizer:
tfidf_vectorizer = TfidfVectorizer(
max_df=0.8, max_features=200000, min_df=0.2,
use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3)
)
I faced this problem because of PT-BR language.
TL;DR: Remove the accents of your language.
# Special thanks for the user Humberto Diogenes from Python List (answer from Aug 11, 2008)
# Link: http://python.6.x6.nabble.com/O-jeito-mais-rapido-de-remover-acentos-de-uma-string-td2041508.html
# I found the issue by chance (I swear, haha) but this guy gave the tip before me
# Link: https://github.com/scikit-learn/scikit-learn/issues/12897#issuecomment-518644215
import spacy
nlp = spacy.load('pt_core_news_sm')
# Define default stopwords list
stoplist = spacy.lang.pt.stop_words.STOP_WORDS
def replace_ptbr_char_by_word(word):
""" Will remove the encode token by token"""
word = str(word)
word = normalize('NFKD', word).encode('ASCII','ignore').decode('ASCII')
return word
def remove_pt_br_char_by_text(text):
""" Will remove the encode using the entire text"""
text = str(text)
text = " ".join(replace_ptbr_char_by_word(word) for word in text.split() if word not in stoplist)
return text
df['text'] = df['text'].apply(remove_pt_br_char_by_text)
I put the solution and references in this gist.
Manually adding those words in the 'stop_words' list can solve the problem.
stop_words = safe_get_stop_words('en')
stop_words.extend(['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'])
I'm trying to create a list of integers, similar to python where one would say
x = input("Enter String").split() # 1 2 3 5
x = list(map(int,x)) # Converts x = "1","2",3","5" to x = 1,2,3,5
Here's my code asking for the input, then splitting the input into a table, i need help converting the contents of the table to integers as they're being referenced later in a function, and i'm getting a string vs integer comparison error. I've tried changing the split for-loop to take a number but that doesn't work, I'm familiar with a python conversion but not with Lua so I'm looking for some guidance in converting my table or handling this better.
function main()
print("Hello Welcome the to Change Maker - LUA Edition")
print("Enter a series of change denominations, separated by spaces")
input = io.read()
deno = {}
for word in input:gmatch("%w+") do table.insert(deno,word) end
end
--Would This Work?:
--for num in input:gmatch("%d+") do table.insert(deno,num) end
Just convert your number-strings to numbers using tonumber
local number = tonumber("1")
So
for num in input:gmatch("%d+") do table.insert(deno,tonumber(num)) end
Should do the trick
I have been trying to produce a checksum based on a file header and am receiving conflicting results. In the slave devices manual, it states the following to produce the checksum:
"A simple eight-bit calculation is used for the header checksum. The steps required are as follows:
Calculate the sum of the header bytes in a single byte. Alternatively calculate
the sum and then AND the result with FFhex.
The checksum = FFhex - the sum from step 1."
Here, I have created the following code in Lua:
function header_checksum(string)
local sum = 0
for i = 1, #string do
sum = sum + string.byte(i)
end
local chksum = 255 - (sum & 255)
return chksum
end
If I send the following (4x byte) string down print(header_checksum("0181B81800")) I get the following result:
241 (string sent as you see it)
0 (each byte is changed to hex and then sent to function)
In the example given, it states that the byte should be AD, which is 173(dec) or \255.
Can someone please tell me what is wrong with what I am doing; either the code written, my approach, or both?
function header_checksum(header)
local sum = -1
for i = 1, #header do
sum = sum - header:byte(i)
end
return sum % 256
end
print(header_checksum(string.char(0x01,0x81,0xB8,0x18,0x00))) --> 173
I want to test if a string of bytes that I'm extracting from a file results in valid ISO-8859-15 encoded text.
The first thing I came across is this similar case about UTF-8 validation:
https://stackoverflow.com/a/5259160/1209004
So based on that, I thought I was being clever by doing something similar for ISO-8859-15. See the following demo code:
#! /usr/bin/env python
#
def isValidISO885915(bytes):
# Test if bytes result in valid ISO-8859-15
try:
bytes.decode('iso-8859-15', 'strict')
return(True)
except UnicodeDecodeError:
return(False)
def main():
# Test bytes (byte x95 is not defined in ISO-8859-15!)
bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'
isValidLatin = isValidISO885915(bytes)
print(isValidLatin)
main()
However, running this returns True, even though x95 is not a valid code point in ISO-8859-15! Am I overlooking something really obvious here? (BTW I tried this with Python 2.7.4 and 3.3, results are identical in both cases).
I think I've found a workable solution myself, so I might as well share it.
Looking at the codepage layout of ISO 8859-15 (see here), I really only need to check for the presence of code points 00 -1f and 7f - 9f. These corrrepond to the C0 and C1 control codes.
In my project I was already using something based on the code here for removing control characters from a string (C0 + C1). So, using that as a basis I came up with this:
#! /usr/bin/env python
#
import unicodedata
def removeControlCharacters(string):
# Remove control characters from string
# Based on: https://stackoverflow.com/a/19016117/1209004
# Tab, newline and return are part of C0, but are allowed in XML
allowedChars = [u'\t', u'\n',u'\r']
return "".join(ch for ch in string if
unicodedata.category(ch)[0] != "C" or ch in allowedChars)
def isValidISO885915(bytes):
# Test if bytes result in valid ISO-8859-15
# Decode bytes to string
try:
string = bytes.decode("iso-8859-15", "strict")
except:
# Empty string in case of decode error
string = ""
# Remove control characters, and compare result against
# input string
if removeControlCharacters(string) == string:
isValidLatin = True
else:
isValidLatin = False
return(isValidLatin)
def main():
# Test bytes (byte x95 is not defined in ISO-8859-15!)
bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'
print(isValidISO885915(bytes))
main()
There may be more elegant / Pythonic ways to do this, but it seems to do the trick, and works with both Python 2.7 and 3.3.