Python pandas: print an element of a DataFrame

I have a pandas data frame named country_codes:
>>> country_codes.head(3)
       COUNTRY FIPS ISO2 ISO3
0  Afghanistan   AF   AF  AFG
1      Albania   AL   AL  ALB
2      Algeria   AG   DZ  DZA
Given a particular FIPS code:
>>> fips = 'RS'
I select the country name corresponding to that fips code:
>>> country = country_codes[country_codes['FIPS']==fips]['COUNTRY']
and print it:
>>> print(country)
201 Russia
Name: COUNTRY, dtype: object
I want to use that country name in the title of a matplotlib plot. I want the country name only. I do not want the index number or the line that says Name: COUNTRY, dtype: object.
How do I get the name only?

You're getting a Series from indexing the DataFrame:
>>> country = country_codes[country_codes['FIPS']==fips]['COUNTRY']
>>> type(country)
<class 'pandas.core.series.Series'>
For a Series, selection by position:
>>> country.iloc[0]
'Russia'
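That scalar drops straight into a plot title. A minimal sketch (the matplotlib part is my assumption about the intended use, not code from the original post):
import matplotlib.pyplot as plt

fips = 'RS'
country = country_codes[country_codes['FIPS'] == fips]['COUNTRY']

fig, ax = plt.subplots()
ax.set_title(country.iloc[0])  # 'Russia', with no index or dtype noise
plt.show()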

I think creating a Series with FIPS as the index and COUNTRY as the values will make the code simpler:
fips_to_country = pd.Series(country_codes["COUNTRY"].values, index=country_codes["FIPS"])
then you can get the country by:
fips_to_country["AL"]

If you have a pandas Series, here is how to access its elements by index label:
import numpy as np
import pandas as pd
data=np.array([176.2,158.4,167.6,156.2,161.4])
heights=pd.Series(data,index=['s1','s2','s3','s4','s5'])
print(heights['s2'])
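For positional access on the same Series, .iloc works here too, mirroring the first answer above:
print(heights.iloc[1])  # 158.4, the same element selected by position instead of label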

Related

Huggingface Load_dataset() function throws "ValueError: Couldn't cast"

My goal is to train a classifier able to do sentiment analysis in Slovak language using loaded SlovakBert model and HuggingFace library. Code is executed on Google Colaboratory.
My test dataset is read from this csv file:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv
and train dataset:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv
The data has two columns: a column of Slovak sentences and a second column of labels indicating the sentiment of each sentence. Labels have values -1, 0 or 1.
Load_dataset() function throws this error:
ValueError: Couldn't cast
Vrtuľník je veľmi zraniteľný pri dobre mierenej streľbe zo zeme. Brániť sa, unikať, alebo vedieť zneškodniť nepriateľa je vecou sekúnd, ak nie stotín, kedy ide život. : string
-1: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 954
to
{'Priestorovo a vybavenim OK.': Value(dtype='string', id=None), '1': Value(dtype='int64', id=None)}
because column names don't match
Code:
!pip install transformers==4.10.0 -qqq
!pip install datasets -qqq
from re import M
import numpy as np
from datasets import load_metric, load_dataset, Dataset
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import pandas as pd
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
#links to dataset
test = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
train = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv'
model_name = 'gerulata/slovakbert'
#Load data
dataset = load_dataset('csv', data_files={'train': train, 'test': test})
What am I doing wrong while loading the dataset?
The reason is that the delimiter (a comma) also occurs inside the sentences in the first column, so the loader cannot reliably infer the schema on its own: it sometimes segments a sentence into multiple columns because it cannot tell whether a comma is a delimiter or part of the sentence. In addition, the files have no header row, so the first data row is taken as the column names, and those differ between the train and test files, which is exactly what the error message shows.
But the solution is simple: just pass explicit column names:
dataset = load_dataset('csv', data_files={'train': train,'test':test},column_names=['sentence','label'])
output:
DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 89
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 91
    })
})
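To double-check the schema after loading, you can peek at the first training example (a small sketch; note that with explicit column names the row that was previously mistaken for a header becomes ordinary data):
print(dataset['train'][0])
# e.g. {'sentence': 'Priestorovo a vybavenim OK.', 'label': 1}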

Size of vocabulary SpaCy model 'en_core_web_sm'

I tried to see the number of words in vocabulary in SpaCy small model:
model_name="en_core_web_sm"
nlpp=spacy.load(model_name)
len(list(nlpp.vocab.strings))
which only gave me 1185 words. I also tried it on my colleagues' machines and got different results (1198 and 1183).
Is it supposed to have only such a small vocabulary for training part-of-speech tagging? When I use this on my dataset, I lose a lot of words. Why does the number of words vary across machines?
Thanks!
The vocabulary is dynamically loaded so you don't have all the words in the StringStore when you first load the vocab. You can see this if you try the following...
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> len(nlp.vocab.strings)
1180
>>> 'lawyer' in nlp.vocab.strings
False
>>> doc = nlp('I am a lawyer')
>>> 'lawyer' in nlp.vocab.strings
True
>>> len(nlp.vocab.strings)
1182
It's probably easiest to simply load the vocabulary from the raw file, like this:
>>> import json
>>> fn = '/usr/local/lib/python3.6/dist-packages/spacy/data/en/en_core_web_sm-2.0.0/vocab/strings.json'
>>> with open(fn) as f:
...     strings = json.load(f)
...
>>> len(strings)
78930
Note that the above file location is for Ubuntu 18.04. If you're on Windows there will be a similar file but in a different location.
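Rather than hard-coding the path, you can usually derive it from the installed model package, since spaCy models are ordinary Python packages. A sketch assuming the spaCy 2.x on-disk layout, where the string store lives in vocab/strings.json:
import json
from pathlib import Path
import en_core_web_sm  # the model is a regular, importable Python package

pkg_dir = Path(en_core_web_sm.__file__).parent
data_dir = next(pkg_dir.glob('en_core_web_sm-*'))  # e.g. en_core_web_sm-2.0.0
with open(data_dir / 'vocab' / 'strings.json') as f:
    strings = json.load(f)
print(len(strings))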

Dask Equivalent of pd.to_numeric

I am trying to read multiple CSV files, each around 15 GB, using dask read_csv. While performing this task, dask interprets a particular column as float; however, it has a few values of string type, and it later fails when I try to perform some operation, stating it cannot convert string to float. Hence I used the dtype=str argument to read all the columns as string. Now I want to convert that particular column to numeric with errors='coerce', so that the records containing strings are converted to NaN values and the rest are converted to float correctly. Can you please advise how this can be achieved using dask?
I have already tried astype conversion:
import dask.dataframe as dd
df = dd.read_csv("./*.csv", encoding='utf8',
assume_missing = True,
usecols =col_names.values.tolist(),
dtype=str)
df["mycol"] = df["mycol"].astype(float)
search_df = df.query('mycol >0').compute()
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+-----------------------------------+--------+----------+
| Column | Found | Expected |
+-----------------------------------+--------+----------+
| mycol | object | float64 |
+-----------------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- mycol
ValueError("could not convert string to float: 'cliqz.com/tracking'")
#Reproducible example
import dask.dataframe as dd
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column will appear as float but it has a couple of dirty values as string
search_df = df.query('count > 0').compute()  # this line will give the type conversion error

#Edit with one possible solution, but is this optimal while using dask?
import dask.dataframe as dd
import pandas as pd
to_n = lambda x: pd.to_numeric(x, errors="coerce")
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count > 0').compute()
I had a similar problem and I solved it using .where.
import numpy as np
import pandas
import dask.dataframe as ddf

p = ddf.from_pandas(pandas.Series(["1", "2", np.nan, "3", "4"]), 1)
p.where(~p.isna(), 999).astype("u4")
Or perhaps replacing the second line with:
p.where(p.str.isnumeric(), 999).astype("u4")
In my case my DataFrame (or Series) was the result of other operations, so I couldn't apply it directly to read_csv.
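When the Series is the result of other operations, one workaround from before dd.to_numeric existed (my sketch, not from the original answers) is to apply pd.to_numeric partition-wise, since each dask partition is a plain pandas Series:
import pandas as pd
import dask.dataframe as dd

# stand-in for a dask Series produced by earlier operations
s = dd.from_pandas(pd.Series(["1", "2", "oops", "4"]), npartitions=2)
numeric = s.map_partitions(pd.to_numeric, errors="coerce", meta=(None, "f8"))
print(numeric.compute())  # 'oops' becomes NaN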
As of March 2020, dask.dataframe.to_numeric() has been implemented; it is documented in the Dask API reference.
Here's a minimal example:
import pandas as pd
import dask.dataframe as dd
# create dask dataframe with dummy data incl. number as string
data = {'A': '1', 'B': 2, 'C': 3}
df = pd.DataFrame([data])
ddf = dd.from_pandas(df, npartitions=3)
# inspect dtypes
ddf.dtypes
> A object
> B int64
> C int64
> dtype: object
# apply to_numeric method
ddf.A = dd.to_numeric(ddf.A)
# verify dtypes
ddf.dtypes
> A int64
> B int64
> C int64
> dtype: object
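Applied to the question's column, that becomes (a sketch reusing the asker's mycol name):
import dask.dataframe as dd

df = dd.read_csv("./*.csv", encoding='utf8', dtype=str)
# errors='coerce' turns unparseable strings such as 'cliqz.com/tracking' into NaN
df["mycol"] = dd.to_numeric(df["mycol"], errors="coerce")
search_df = df.query('mycol > 0').compute()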

I am not able to train models in sklearn (scikit-learn) using Python

I have a data file containing data to predict admission to an MS program.
It contains 9 columns: 8 columns contain student data and the 9th column contains the student's chance of selection.
I am new to this and I don't understand the error that comes up while training the model.
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X,y)
print(classifier)
Traceback (most recent call last):
File "c.py", line 14, in <module>
classifier.fit(X,y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 977, in fit
hasattr(self, "classes_")))
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 324, in _fit
X, y = self._validate_input(X, y, incremental)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 920, in _validate_input
self._label_binarizer.fit(y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\label.py", line 413, in fit
self.classes_ = unique_labels(y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\multiclass.py", line 96, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array
Try this:
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPRegressor
classifier = MLPRegressor()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X, y)
print(classifier)
Explanation:
In machine learning we may have two types of problems:
1) Classification. Example: predict whether a person is male or female (discrete).
2) Regression. Example: predict the age of a person (continuous).
With this in hand, look at your problem: your label (chance of selection) is continuous, so we have a regression problem.
Note that you are using MLPClassifier, which results in the 'Unknown label type' error.
Try using the MLPRegressor instead.
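A quick way to confirm the diagnosis (my addition; type_of_target is the sklearn helper that classifies label arrays):
from sklearn.utils.multiclass import type_of_target

print(type_of_target(y.astype(float)))  # 'continuous' -> a regression target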

Biopython retrieving particular CDS from a whole genome

I am new to Stack Overflow. I am trying to automate a search process using Biopython. I have two lists, one with protein GI numbers and one with corresponding nucleotide GI numbers.
For example:
protein_GI=[588489721,788136950,409084506]
nucleo_GI=[588489708,788136846,409084493]
The second list was created using ELink. However, the nucleotide GIs correspond to whole genomes. I need to retrieve from each genome the particular CDS which matches the protein GI.
I tried using ELink again with different link names ("protein_nucleotide_cds", "protein_nuccore") but all I get are id numbers for whole genomes. Should I try some other link names?
I also tried the following EFetch code:
import Bio
from Bio import Entrez
Entrez.email = None
handle=Entrez.efetch(db="sequences",id="588489708,588489721",rettype="fasta",retmode="text")
print(handle.read())
This method gives me nucleotide and protein sequences in fasta file but the nucleotide sequence is a whole genome.
I would be very grateful, if somebody could help me.
Thanking you in advance!
I hope this helps you:
import Bio
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "mail@example.com"
gi_protein = "GI:588489721"
gi_genome = "GI:588489708"

# fetch the protein record
handle = Entrez.efetch(db="sequences", id=gi_protein, rettype="fasta", retmode="text")
protein = next(SeqIO.parse(handle, "fasta"))

# fetch the genome record with all its features
handle = Entrez.efetch(db="sequences", id=gi_genome, rettype="gbwithparts", retmode="text")
genome = next(SeqIO.parse(handle, "gb"))

# extract the feature whose db_xref matches the protein GI
feature = [f for f in genome.features if "db_xref" in f.qualifiers and gi_protein in f.qualifiers["db_xref"]]

# get the location of the CDS
start = feature[0].location.start.position
end = feature[0].location.end.position
strand = feature[0].location.strand

seq = genome[start:end]
if strand == 1:
    print(seq.seq)
else:
    # if the strand is -1, take the reverse complement
    print(seq.reverse_complement().seq)
print(protein.seq)
then you get:
ATGGATTATATTGTTTCAGCACGAAAATATCGTCCCTCTACCTTTGTTTCGGTGGTAGGG
CAGCAGAACATCACCACTACATTAAAAAATGCCATTAAAGGCAGTCAACTGGCACACGCC
TATCTTTTTTGCGGACCGCGAGGTGTGGGAAAGACGACTTGTGCCCGTATCTTTGCTAAA
ACCATCAACTGTTCGAATATATCAGCTGATTTTGAAGCGTGTAATGAGTGTGAATCCTGT
AAGTCTTTTAATGAGAATCGTTCTTATAATATTCATGAACTGGATGGAGCCTCCAATAAC
TCAGTAGAGGATATCAGGAGTCTGATTGATAAAGTTCGTGTTCCACCTCAGATAGGTAGT
TATAGTGTATATATTATCGATGAGGTTCACATGTTATCGCAGGCAGCTTTTAATGCTTTT
CTTAAAACATTGGAAGAGCCACCCAAGCATGCCATCTTTATTTTGGCCACTACTGAAAAA
CATAAAATACTACCAACGATCCTGTCTCGTTGCCAGATTTACGATTTTAATAGGATTACC
ATTGAAGATGCGGTAGGTCATTTAAAATATGTAGCAGAGAGTGAGCATATAACTGTGGAA
GAAGAGGGGTTAACCGTCATTGCACAAAAAGCTGATGGAGCTATGCGGGATGCACTTTCC
ATCTTTGATCAGATTGTGGCTTTCTCAGGTAAAAGTATCAGCTATCAGCAAGTAATCGAT
AATTTGAATGTATTGGATTATGATTTTTACTTTAGGTTGGTGGATGCTTTTCTGGCAGAA
GATACTACACAAACACTATTGATTTTTGATGAGATATTGAAACGGGGATTTGATGCACAT
CATTTTATTTCCGGTTTAAGTTCTCATTTGCGTGATTTACTTGTATGTAAGGATGCAGCC
ACCATTCAGTTGCTGGATGTGGGTGCTAAAATTAAGGAGAAGTACGGTGTTCAGGCGCAA
AAAAGTACGATTGACTTTTTAATGGATGCTTTAAATATTACCAACGATTGCGATTTGCAA
TATAGGGTGGCTAAAAATAAGCGTTTGCATGTGGAGTTTGCTCTTCTTAAGATAGCACGT
GTATTAGATGAACAAAGAAAAAAGTAG
MDYIVSARKYRPSTFVSVVGQQNITTTLKNAIKGSQLAHAYLFCGPRGVGKTTCARIFAK
TINCSNISADFEACNECESCKSFNENRSYNIHELDGASNNSVEDIRSLIDKVRVPPQIGS
YSVYIIDEVHMLSQAAFNAFLKTLEEPPKHAIFILATTEKHKILPTILSRCQIYDFNRIT
IEDAVGHLKYVAESEHITVEEEGLTVIAQKADGAMRDALSIFDQIVAFSGKSISYQQVID
NLNVLDYDFYFRLVDAFLAEDTTQTLLIFDEILKRGFDAHHFISGLSSHLRDLLVCKDAA
TIQLLDVGAKIKEKYGVQAQKSTIDFLMDALNITNDCDLQYRVAKNKRLHVEFALLKIAR
VLDEQRKK
