Don't estimate total runtime with tqdm

I'm using tqdm to generate the progress bar for a loop where iterations take an increasing amount of time with increasing value of the iterator. The iterations per second and estimated completion metrics are thus not particularly meaningful, as previous iterations cannot (easily) be used to predict the runtime of future iterations.
Is there an easy way to disable displaying the estimation of iterations per second and total runtime with tqdm?
Relevant example code:
from tqdm import tqdm
import time
for t in tqdm(range(10)):
    time.sleep(t)

tqdm's README describes the bar_format argument as follows:
Specify a custom bar string formatting. May impact performance.
[default: '{l_bar}{bar}{r_bar}'], where l_bar='{desc}: {percentage:3.0f}%|' and r_bar='| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]'...
Since the part you don't care about is mostly in "{r_bar}", you can just tweak that part of the default value so that it omits the "[{elapsed}<{remaining}, {rate_fmt}" part:
from time import sleep
from tqdm import tqdm
for time in tqdm(range(10),
             bar_format="{l_bar}|{bar}| {n_fmt}/{total_fmt}{postfix}"):
    sleep(time)
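If you still want to see the elapsed time and only drop the remaining-time estimate and the rate, the "{elapsed}" field documented above can be kept in the custom format (a small variation on the same idea, not part of the original answer):
from time import sleep
from tqdm import tqdm
# keep the elapsed time, drop the remaining-time estimate and the it/s rate
for t in tqdm(range(10),
              bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}{postfix}]"):
    sleep(t)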

Related

Give tqdm a hint about task duration to improve initial accuracy

I am using tqdm to track progress when processing 6 large files. Each file is very large and takes about 10 min. During the first 10 min, tqdm is unable to offer an accurate time estimate because it has not collected any data points yet. After the first file is done, the progress report works satisfactorily.
Is there a way to hint tqdm with my own estimate of how long the task will take? For example, something like tqdm(file_list, hint="10 min") would have tqdm start with "60 min remaining" rather than "? min remaining". As it actually progresses through the iterable, it would then update this initial estimate accordingly.
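tqdm has no hint-style option that I am aware of (the hint= keyword above is the asker's wish, not a real parameter). One workaround, sketched below with a stubbed-out process() function standing in for the real per-file work, is to let the bar count wall-clock seconds against your own total estimate: a background thread ticks the bar once per second, so it shows your estimate counting down from the start instead of "?". The display is only as good as your estimate, since it does not self-correct from the actual per-file timings.
import threading
from time import sleep
from tqdm import tqdm

file_list = ["part1.dat", "part2.dat", "part3.dat",
             "part4.dat", "part5.dat", "part6.dat"]   # hypothetical file names
est_total_seconds = len(file_list) * 10 * 60          # your own ~10 min/file estimate

def process(path):
    sleep(2)   # stand-in for the real ~10 minutes of work per file

done = threading.Event()

def tick(bar):
    # advance the bar by one "second" of estimated work per real second elapsed
    while not done.is_set():
        sleep(1)
        bar.update(1)

with tqdm(total=est_total_seconds, unit="s") as bar:
    threading.Thread(target=tick, args=(bar,), daemon=True).start()
    for path in file_list:
        process(path)
    done.set()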

Find the daily and monthly mean from daily data

I am very new to Python, so please bear with me.
So far I have a loop that identifies my netcdf files within a date range.
I now need to calculate the daily averages and then the monthly averages for each month and add it to my dataframe so I can plot a time series.
Heres my code so far
# !/usr/bin/python
# MODIS LONDON TIMESERIES
print ('Initiating AQUA MODIS Time Series')
import pandas as pd
import xarray as xr
from glob import glob
import netCDF4
import numpy as np
import matplotlib.pyplot as plt
import os
print ('All packages imported')
#list of files
days = pd.date_range (start='4/7/2002', end='31/12/2002')
#create dataframe
df = final_data = pd.DataFrame(index = days, columns = ['day', 'night'])
print ('Data frame created')
#for loop iterating through %daterange stated in 'days' to find path using string
for day in days:
    path = "%i/%02d/%02d/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY-%i%02d%02d000000-fv2.00.nc" % (day.year, day.month, day.day, day.year, day.month, day.day)
    print(path)
Welcome to SO! As suggested, please try to make a minimal reproducible example.
If you are able to create an Xarray dataset, here is how to take monthly averages:
import xarray as xr
# tutorial dataset with air temperature every 6 hours
ds = xr.tutorial.open_dataset('air_temperature')
# resample along the time dimension
ds_monthly = ds.resample(time='1MS').mean()
resample() is used for upscaling and downscaling the temporal resolution. If you are familiar with Pandas, it effectively works the same way.
resample(time='1MS') groups the data along the time dimension at the frequency '1MS': sample by 1 month (the '1M' part) and have the new time vector begin at the start of the month (the 'S' part). This is very powerful; you can supply many different frequencies (see the Pandas offset documentation).
.mean() takes the average of the data over the chosen frequency, in this case each month.
You could replace mean() with min(), max(), median(), std(), var(), sum(), and maybe a few others.
Xarray has wonderful documentation; the resample() docs are here.
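Applied to the question above, a rough sketch might look like the following; the glob pattern, the coordinate names (time, lat, lon) and the idea of averaging over the London domain first are assumptions about the asker's files, not something stated in the question:
import xarray as xr

# open all daily netCDF files at once (lazily, backed by dask)
ds = xr.open_mfdataset("2002/*/*/ESACCI-LST-*.nc", combine="by_coords")

# average over the spatial dimensions -> one value per day for the London box
daily = ds.mean(dim=["lat", "lon"])

# monthly means of those daily values, labelled at the start of each month
monthly = daily.resample(time="1MS").mean()

# convert to pandas for plotting the time series
df_daily = daily.to_dataframe()
df_monthly = monthly.to_dataframe()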

dask dataframe: merging two dataframes, imputing missing values and writing to csv only uses partial CPUs (20% on each CPU)

I want to merge two dask dataframes, impute missing values with column median and export the merged dataframe to csv files.
The problem: my current code cannot fully utilize all 8 CPUs (only ~20% of each CPU is used).
I am not sure which part limits the CPU usage. Here is a reproducible example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(
np.c_[(np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3))],
columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
np.c_[(np.array(range(100)), np.random.randn(100, 10000))],
columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id=df1.id.astype(int).astype(object)
df2.id=df2.id.astype(int).astype(object)
## some cells are missing in df2
df2.iloc[:, 1:] = df2.iloc[:,1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)
## dask codes starts here
import dask.dataframe as dd
from dask.distributed import Client
ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)
Although all 8 CPUs are invoked, only ~20% of each CPU is utilized. Can I change the code to improve the CPU usage?
Firstly, note that if you don't specify otherwise, Dask will use threads for execution. With threads, only one Python operation can run at a time (the "GIL"), except in lower-level code which explicitly releases the lock. The merge operation involves a lot of shuffling of data in memory, and I suspect it releases the lock some of the time.
Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck here: however fast other processing may be, you still need to feed all of it through the storage bus.
If the CPUs are working at ~20%, I daresay this is still faster than a single-core version. Put simply, some workloads just parallelise better than others.
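If the GIL is indeed the limiting factor, one thing worth trying (a sketch, not a guaranteed fix) is to run the same merge/fillna/to_csv code against a distributed Client configured with several single-threaded worker processes rather than the default threaded scheduler:
from dask.distributed import Client

if __name__ == "__main__":
    # 8 single-threaded worker processes: pandas-level work that holds the GIL
    # can then run in parallel across processes instead of sharing one GIL
    client = Client(n_workers=8, threads_per_worker=1)
    # ...then build ddf1/ddf2, merge, fillna and to_csv exactly as in the
    # question; dask picks up the active client automatically.
    # Whether this raises CPU utilisation depends on whether the GIL or the
    # disk writes are the real bottleneck.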

KNN classifier taking too much time even on gpu

I am classifying the MNIST digits using KNN on Kaggle, but the last step is taking too much time to execute, even though the MNIST data is just 15 MB. I am still waiting; can you point out any problem in my code? Thanks.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir("../input"))
#Loading datset
train=pd.read_csv('../input/mnist_test.csv')
test=pd.read_csv('../input/mnist_train.csv')
X_train=train.drop('label',axis=1)
y_train=train['label']
X_test=test.drop('label',axis=1)
y_test=test['label']
from sklearn.neighbors import KNeighborsClassifier
clf=KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train,y_train)
accuracy=clf.score(X_test,y_test)
accuracy
There isn't anything wrong with your code per se. KNN is just a slow algorithm: it's slow here because computing distances between images is expensive at scale, and because the problem is large enough that your cache can't really be used effectively.
Without using a different library or coding your own GPU kernel, you can probably get a speed boost by replacing
clf=KNeighborsClassifier(n_neighbors=3)
with
clf=KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
to at least use all of your cores.
You are not actually using the GPU on Kaggle: scikit-learn's KNeighborsClassifier does not support GPU.
In order to use the GPU for KNN you need a library that supports it, such as simbsig's KNeighborsClassifier, and you need to specify the device explicitly; otherwise it defaults to the CPU. The documentation is here: https://simbsig.readthedocs.io/en/latest/KNeighborsClassifier.html
knn = KNeighborsClassifier(n_neighbors=3, device = 'gpu')

Feedback in NaiveBayes Text Classification

I am a newbie in machine learning. I am building a complaint categorizer, and I want to provide a feedback mechanism so that the model can improve over time.
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
value=[
'drought',
'robber',
]
targets=[
'water_department',
'police_department',
]
classifier = MultinomialNB()
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(value)
classifier.partial_fit(counts[:1], targets[:1],classes=numpy.unique(targets))
for c, t in zip(counts[1:], targets[1:]):
    classifier.partial_fit(c, t.split())
value.append('dogs') #new value to train
targets.append('animal_department') #new target
vectorize = CountVectorizer()
counts = vectorize.fit_transform(value)
print(counts)
print(targets)
print(vectorize.vocabulary_)
####problem lies here
classifier.partial_fit(counts["""dont know the index of new value"""], targets[-1])
####problem lies here
Even if I somehow find the index of the newly inserted value, it gives the error
ValueError: Number of features 3 does not match previous data 2.
even though I made it insert only one value at a time.
I will try to answer the question from a general point of view. There are two sources of problems in the Naive Bayes (NB) approach described here:
Out-of-vocabulary (OOV) problem
Incremental training of NB
OOV problem: The simplest way to tackle the OOV problem is to decompose every word into character 3-grams. How many such 3-grams are possible? Assuming lower-casing, there are only 26 possible characters for each position, so the total number of possible character 3-grams is 26^3 = 17576, which is significantly lower than the number of possible English words you're likely to see in text.
Hence, generally speaking, while training NB, a good idea is to use probabilities of character n-grams (n=3,4,5). This will drastically reduce the OOV problem.
Incremental training: For incremental training, given a new sentence, decompose it into terms (character n-grams). Update the count of each term for its corresponding observed class label. For example, if count(t, c) denotes how many times the term t was observed in class c, simply update that count when you see t in class 0 (or class 1) during incremental training. Updating the counts updates the maximum likelihood probability estimates as well.
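One concrete way to get both pieces with scikit-learn, sketched under the assumption that all possible departments are known up front, is to combine a HashingVectorizer over character n-grams (its feature space is fixed, so the "number of features does not match" error goes away) with MultinomialNB.partial_fit for the incremental updates:
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# fixed-size hashed feature space: no vocabulary to refit, so every
# incremental update produces the same number of features
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                               n_features=2**18,
                               alternate_sign=False, norm=None)

# partial_fit needs every class up front on the first call (an assumption here)
classes = np.array(["water_department", "police_department", "animal_department"])

clf = MultinomialNB()
clf.partial_fit(vectorizer.transform(["drought"]), ["water_department"], classes=classes)
clf.partial_fit(vectorizer.transform(["robber"]), ["police_department"])

# later feedback arrives: update the model without retraining from scratch
clf.partial_fit(vectorizer.transform(["dogs"]), ["animal_department"])

print(clf.predict(vectorizer.transform(["stray dogs in the street"])))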
