Mahout Recommender no output - about input file format

I'm using mahout-distribution-0.9. I have a problem in my program.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
class RecommenderIntro {
    public static void main(String[] args) throws Exception {
        DataModel model =
                //new FileDataModel(new File("F:\\ml-10M100K\\intro.csv"));
                new FileDataModel(new File("F:\\ml-10M100K\\ratingsShort.dat"), "::");
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(1, 2);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}
The content of the file intro.csv looks like this:
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
When I run this with intro.csv, I get output in Eclipse:
RecommendedItem[item:104, value:4.257081]
RecommendedItem[item:106, value:4.0]
The content of the file ratingsShort.dat looks like this:
1::122::5::838985046
1::185::5::838983525
1::231::5::838983392
1::292::5::838983421
2::733::3::868244562
2::736::3::868244698
or, after changing the content of ratingsShort.dat to:
1,539,5
1,589,5
2,110,5
2,151,3
2,733,3
2,802,2
2,1210,4
2,1544,3
3,1246,4
3,1408,3.5
3,1552,2
3,1564,4.5
When I use ratingsShort.dat, there is no output in Eclipse.
FileDataModel(File dataFile, String delimiterRegex)
Mahout supports this usage via the constructor above, so why is there no output?
Can anybody give me some advice? Thanks a lot!

OK, I figured out my problem. I switched my MovieLens data from ml-10m.zip to ml-1m.zip, and now it does produce output. So the issue was that the dataset I extracted was not appropriate: the intro.csv from the Internet gives Mahout enough overlap between users to calculate recommendation values, but in the subset I cut out at will the users rate disjoint sets of items, so no user similarity (and hence no recommendation) can be computed.

You need to translate your IDs into Mahout IDs. Mahout treats user and item IDs as the row and column numbers of the rating matrix, so the first row/user ID will be "0", which corresponds to your ID of "1"; the same goes for column/item IDs. If your IDs were only the ones shown above, they would need to be translated into Mahout IDs as below:
0,2,5
0,3,5
1,0,5
1,1,3
1,4,3
1,5,2
1,6,4
1,10,3
2,7,4
2,8,3.5
2,9,2
2,11,4.5
It doesn't matter how you map row/user and column/item IDs to Mahout IDs (I did it above by sort order, but this is not required), but the Mahout IDs must be contiguous non-negative integers. Then, when you get recommendations, they must be translated back into your IDs.
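The translation can be done as a small pre-processing step before the file is handed to Mahout. Below is a minimal sketch in Python (the remap_ids helper, the file names and the delimiter are assumptions for illustration, and IDs are assigned in order of first appearance rather than sort order); the returned dictionaries let you translate the recommended item IDs back afterwards:

def remap_ids(in_path, out_path, delimiter=","):
    # Map each original user/item ID to a contiguous 0-based integer.
    user_ids, item_ids = {}, {}
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            user, item, rating = line.strip().split(delimiter)[:3]
            u = user_ids.setdefault(user, len(user_ids))
            i = item_ids.setdefault(item, len(item_ids))
            dst.write(f"{u},{i},{rating}\n")
    return user_ids, item_ids

users, items = remap_ids("ratingsShort.dat", "ratings_mahout.csv", delimiter="::")
# Invert the item map to translate recommended item IDs back to the originals.
items_back = {v: k for k, v in items.items()}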

Related

GluonTS example airpassengers dataset not found

I am trying to run the GluonTS example code. After some struggle installing the libraries, I now get the following error:
FileNotFoundError: C:\Users\abcde\.mxnet\gluon-ts\datasets\airpassengers\test
The directory C:\Users\abcde\.mxnet\gluon-ts\datasets\airpassengers\ does exist, but it contains only a train folder. I have tried reinstalling, but to no avail. Any ideas how to fix this and run the example, even if it means finding the dataset in the correct format elsewhere?
EDIT: To clarify, I was referring to an example on https://ts.gluon.ai/stable/
import matplotlib.pyplot as plt
from gluonts.dataset.util import to_pandas
from gluonts.dataset.pandas import PandasDataset
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx import DeepAREstimator, Trainer
dataset = get_dataset("airpassengers")
deepar = DeepAREstimator(prediction_length=12, freq="M", trainer=Trainer(epochs=5))
model = deepar.train(dataset.train)
# Make predictions
true_values = to_pandas(list(dataset.test)[0])
true_values.to_timestamp().plot(color="k")
prediction_input = PandasDataset([true_values[:-36], true_values[:-24], true_values[:-12]])
predictions = model.predict(prediction_input)
for color, prediction in zip(["green", "blue", "purple"], predictions):
    prediction.plot(color=f"tab:{color}")
plt.legend(["True values"], loc="upper left", fontsize="xx-large")
There was an incorrect import in an earlier version of the example, which has since been corrected. I also needed to specify regenerate=True when getting the dataset:
dataset = get_dataset("airpassengers", regenerate=True)
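As a quick sanity check (a sketch, assuming the corrected call above), you can force the re-download and confirm that both splits are now materialized:

from gluonts.dataset.repository.datasets import get_dataset

# regenerate=True re-creates the dataset folder, so the previously missing
# test split should appear under ~/.mxnet/gluon-ts/datasets/airpassengers/
dataset = get_dataset("airpassengers", regenerate=True)

print(dataset.metadata)          # frequency, prediction length, etc.
print(len(list(dataset.train)))  # should be non-zero
print(len(list(dataset.test)))   # the split that was missing before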

Huggingface Load_dataset() function throws "ValueError: Couldn't cast"

My goal is to train a classifier able to do sentiment analysis in the Slovak language, using the loaded SlovakBert model and the HuggingFace library. The code is executed on Google Colaboratory.
My test dataset is read from this csv file:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv
and train dataset:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv
The data has two columns: a column of Slovak sentences and a second column of labels indicating the sentiment of each sentence. Labels have the values -1, 0 or 1.
The load_dataset() function throws this error:
ValueError: Couldn't cast
Vrtuľník je veľmi zraniteľný pri dobre mierenej streľbe zo zeme. Brániť sa, unikať, alebo vedieť zneškodniť nepriateľa je vecou sekúnd, ak nie stotín, kedy ide život. : string
-1: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 954
to
{'Priestorovo a vybavenim OK.': Value(dtype='string', id=None), '1': Value(dtype='int64', id=None)}
because column names don't match
Code:
!pip install transformers==4.10.0 -qqq
!pip install datasets -qqq
from re import M
import numpy as np
from datasets import load_metric, load_dataset, Dataset
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import pandas as pd
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
#links to dataset
test = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
train = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv'
model_name = 'gerulata/slovakbert'
#Load data
dataset = load_dataset('csv', data_files={'train': train, 'test': test})
What am I doing wrong while loading the dataset?
The reason is that the delimiter (a comma) also appears many times inside the sentences in the first column, so the loader cannot reliably determine the number of columns: it sometimes splits a sentence into several columns because it cannot tell whether a comma is a delimiter or part of the sentence. In addition, since the files have no header row, the first data row is used as the column names, and those differ between the train and test files, hence the "column names don't match" error.
But the solution is simple: just pass the column names explicitly:
dataset = load_dataset('csv', data_files={'train': train, 'test': test}, column_names=['sentence', 'label'])
output:
DatasetDict({
train: Dataset({
features: ['sentence', 'label'],
num_rows: 89
})
test: Dataset({
features: ['sentence', 'label'],
num_rows: 91
})
})
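As a follow-up check (a sketch, reusing the train and test URLs from the question), you can confirm that both splits now share the declared schema:

from datasets import load_dataset

dataset = load_dataset(
    'csv',
    data_files={'train': train, 'test': test},
    column_names=['sentence', 'label'],
)

# Both splits use the same column names, so the cast error no longer occurs.
print(dataset['train'].features)  # sentence -> string, label -> int64
print(dataset['train'][0])        # first row as a dict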

pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist

I am trying to perform topic modelling and sentiment analysis on text data with SparkNLP. I have done all the pre-processing steps on the dataset, but I am getting an error in LDA.
The error is the IllegalArgumentException shown in the title (posted as a screenshot).
The program is:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF, Tokenizer, RegexTokenizer, Normalizer
from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col, lit, concat, regexp_replace, udf
from pyspark.sql.types import IntegerType
from pyspark.sql.utils import AnalysisException
dataframe_new = spark.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('/home/cdh#psnet.com/Gourav/chap3/abcnews-date-text.csv')
get_tokenizers = Tokenizer(inputCol="headline_text", outputCol="get_tokens")
get_tokenized = get_tokenizers.transform(dataframe_new)
remover = StopWordsRemover(inputCol="get_tokens", outputCol="row")
get_remover = remover.transform(get_tokenized)
counter_vectorized = CountVectorizer(inputCol="row", outputCol="get_features")
getmodel = counter_vectorized.fit(get_remover)
get_result = getmodel.transform(get_remover)
idf_function = IDF(inputCol="get_features", outputCol="get_idf_feature")
train_model = idf_function.fit(get_result)
outcome = train_model.transform(get_result)
lda = LDA(k=10, maxIter=10)
model = lda.fit(outcome)
Schema of the DataFrame after the IDF (posted as a screenshot):
According to the documentation, LDA includes a featuresCol argument, with default value featuresCol='features', i.e. the name of the column that holds the actual features; according to your shown schema, such a column is not present in your dataframe, hence the expected error.
It is not exactly clear which column contains the features in your dataframe - get_features or get_idf_feature (they look identical in the sample you show); assuming it is get_idf_feature, you should change the LDA call to:
lda = LDA(featuresCol='get_idf_feature', k=10, maxIter=10)
The Spark (including PySpark) ML API follows quite a distinct logic from, say, scikit-learn and similar frameworks; one of the differences is indeed that the features all have to be in a single column of the respective dataframe. For a general demonstration of the idea, see my own answer to KMeans clustering in PySpark (it is about K-Means, but the logic is identical).
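For illustration only (the toy dataframe and column names below are made up, mirroring the linked K-Means answer rather than the question's data), this is the single-features-column convention in practice:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

# Toy dataframe with two plain numeric columns
df = spark.createDataFrame([(1.0, 0.5), (2.0, 1.5), (9.0, 8.0)], ["x1", "x2"])

# Assemble them into the single vector column that Spark ML estimators expect
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembled = assembler.transform(df)

# Any estimator is then pointed at that one column via featuresCol
kmeans = KMeans(featuresCol="features", k=2, seed=1)
model = kmeans.fit(assembled)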

Combing the NLTK text features with sklearn Vectorized features

I am trying to combine the dict-type features used in NLTK with the scikit-learn TF-IDF features for each instance.
Sample Input:
instances = [["I am working with text data"], ["This is my second sentence"]]
instance = "I am working with text data"

def generate_features(instance):
    featureset = {}
    featureset["suffix"] = tokenize(instance)[-1]
    featureset["tfidf"] = self.tfidf.transform(instance)
    return featureset

from sklearn.linear_model import LogisticRegressionCV
from nltk.classify.scikitlearn import SklearnClassifier

self.classifier = SklearnClassifier(LogisticRegressionCV())
self.classifier.train(feature_sets)
The TF-IDF vectorizer is trained on all the instances, but when I train the NLTK classifier with this feature set it throws the following error:
self.classifier.train(feature_sets)
File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
File "/Library/Python/2.7/site
packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
return self._transform(X, fitting=True)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 174, in _transform
values.append(dtype(v))
TypeError: float() argument must be a string or a number
I understand the issue: it cannot vectorize features that are already vectorized. But is there a way to fix this?
For those who might visit this question in the future, I did the following, which solved the issue:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

def generate_features(instance):
    # keep only the dict-type features here; TF-IDF is handled separately below
    featureset = {}
    featureset["suffix"] = tokenize(instance)[-1]
    return featureset

# each instance is assumed to come paired with its label
feature_sets = [(generate_features(instance), label) for instance, label in instances]

# self.vec is the DictVectorizer used for the dict features
X = self.vec.fit_transform([item[0] for item in feature_sets]).toarray()
Y = [item[1] for item in feature_sets]

# fit TF-IDF separately and stack it next to the dict features
tfidf = TfidfVectorizer().fit_transform([text for text, label in instances])
X = hstack((X, tfidf))

classifier = LogisticRegressionCV()
classifier.fit(X, Y)
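For completeness, here is a self-contained toy version of the same approach (the sentences, labels, suffix rule and cv=2 are made-up assumptions for illustration): vectorize the dict features and the TF-IDF features separately, then hstack them before fitting the classifier.

from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Made-up toy data
instances = [
    "I am working with text data",
    "this is my second sentence",
    "more text data to work with",
    "yet another short sentence",
]
labels = [0, 1, 0, 1]

# Dict-style features (here just the suffix feature from the question)
dict_feats = [{"suffix": text.split()[-1]} for text in instances]
X_dict = DictVectorizer().fit_transform(dict_feats)

# TF-IDF features, fitted separately
X_tfidf = TfidfVectorizer().fit_transform(instances)

# Stack both sparse blocks column-wise and train on the combined matrix
X = hstack((X_dict, X_tfidf))
clf = LogisticRegressionCV(cv=2).fit(X, labels)
print(clf.predict(X))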
I don't know if this helps or not, but in my case the values of featureset["suffix"] had to be a string or a number. For example:
featureset["suffix"] = "some value"

Biopython retrieving particular CDS from a whole genome

I am new to Stack Overflow. I am trying to automate a search process using Biopython. I have two lists, one with protein GI numbers and one with the corresponding nucleotide GI numbers.
For example:
protein_GI=[588489721,788136950,409084506]
nucleo_GI=[588489708,788136846,409084493]
The second list was created using ELink. However, the nucleotide GIs correspond to whole genomes. I need to retrieve the particular CDS from each genome that matches the protein GI.
I tried ELink again with different link names ("protein_nucleotide_cds", "protein_nuccore"), but all I get are ID numbers for whole genomes. Should I try some other link names?
I also tried the following EFetch code:
import Bio
from Bio import Entrez
Entrez.email = None
handle=Entrez.efetch(db="sequences",id="588489708,588489721",rettype="fasta",retmode="text")
print(handle.read())
This method gives me the nucleotide and protein sequences as FASTA, but the nucleotide sequence is a whole genome.
I would be very grateful if somebody could help me. Thank you in advance!
I hope this helps:
import Bio
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "mail@example.com"

gi_protein = "GI:588489721"
gi_genome = "GI:588489708"

# fetch the protein record
handle = Entrez.efetch(db="sequences", id=gi_protein, rettype="fasta", retmode="text")
protein = next(SeqIO.parse(handle, "fasta"))

# fetch the genome record with all of its features
handle = Entrez.efetch(db="sequences", id=gi_genome, rettype="gbwithparts", retmode="text")
genome = next(SeqIO.parse(handle, "gb"))

# extract the feature whose db_xref matches the protein GI
feature = [f for f in genome.features
           if "db_xref" in f.qualifiers and gi_protein in f.qualifiers["db_xref"]]

# get the location of the CDS
start = int(feature[0].location.start)
end = int(feature[0].location.end)
strand = feature[0].location.strand

seq = genome[start:end]
if strand == 1:
    print(seq.seq)
else:
    # if the strand is -1, take the reverse complement
    print(seq.reverse_complement().seq)

print(protein.seq)
Then you get:
ATGGATTATATTGTTTCAGCACGAAAATATCGTCCCTCTACCTTTGTTTCGGTGGTAGGG
CAGCAGAACATCACCACTACATTAAAAAATGCCATTAAAGGCAGTCAACTGGCACACGCC
TATCTTTTTTGCGGACCGCGAGGTGTGGGAAAGACGACTTGTGCCCGTATCTTTGCTAAA
ACCATCAACTGTTCGAATATATCAGCTGATTTTGAAGCGTGTAATGAGTGTGAATCCTGT
AAGTCTTTTAATGAGAATCGTTCTTATAATATTCATGAACTGGATGGAGCCTCCAATAAC
TCAGTAGAGGATATCAGGAGTCTGATTGATAAAGTTCGTGTTCCACCTCAGATAGGTAGT
TATAGTGTATATATTATCGATGAGGTTCACATGTTATCGCAGGCAGCTTTTAATGCTTTT
CTTAAAACATTGGAAGAGCCACCCAAGCATGCCATCTTTATTTTGGCCACTACTGAAAAA
CATAAAATACTACCAACGATCCTGTCTCGTTGCCAGATTTACGATTTTAATAGGATTACC
ATTGAAGATGCGGTAGGTCATTTAAAATATGTAGCAGAGAGTGAGCATATAACTGTGGAA
GAAGAGGGGTTAACCGTCATTGCACAAAAAGCTGATGGAGCTATGCGGGATGCACTTTCC
ATCTTTGATCAGATTGTGGCTTTCTCAGGTAAAAGTATCAGCTATCAGCAAGTAATCGAT
AATTTGAATGTATTGGATTATGATTTTTACTTTAGGTTGGTGGATGCTTTTCTGGCAGAA
GATACTACACAAACACTATTGATTTTTGATGAGATATTGAAACGGGGATTTGATGCACAT
CATTTTATTTCCGGTTTAAGTTCTCATTTGCGTGATTTACTTGTATGTAAGGATGCAGCC
ACCATTCAGTTGCTGGATGTGGGTGCTAAAATTAAGGAGAAGTACGGTGTTCAGGCGCAA
AAAAGTACGATTGACTTTTTAATGGATGCTTTAAATATTACCAACGATTGCGATTTGCAA
TATAGGGTGGCTAAAAATAAGCGTTTGCATGTGGAGTTTGCTCTTCTTAAGATAGCACGT
GTATTAGATGAACAAAGAAAAAAGTAG
MDYIVSARKYRPSTFVSVVGQQNITTTLKNAIKGSQLAHAYLFCGPRGVGKTTCARIFAK
TINCSNISADFEACNECESCKSFNENRSYNIHELDGASNNSVEDIRSLIDKVRVPPQIGS
YSVYIIDEVHMLSQAAFNAFLKTLEEPPKHAIFILATTEKHKILPTILSRCQIYDFNRIT
IEDAVGHLKYVAESEHITVEEEGLTVIAQKADGAMRDALSIFDQIVAFSGKSISYQQVID
NLNVLDYDFYFRLVDAFLAEDTTQTLLIFDEILKRGFDAHHFISGLSSHLRDLLVCKDAA
TIQLLDVGAKIKEKYGVQAQKSTIDFLMDALNITNDCDLQYRVAKNKRLHVEFALLKIAR
VLDEQRKK
