Querying an inverted file index - search-engine

I have a school project in which I need to create a search engine using an inverted index, and I'm a bit stuck on how to continue.
I stored all the words from my docs (4 docs) in an inverted file index, in such a way that each word in a specific file has its own row. So, say, the word 'like' appears in doc 2 three times and in doc 4 twice - it will have 2 rows: word:like docid:2 hit:3 instoplist:0 and word:like docid:4 hit:2 instoplist:0 (hit is the number of times the word appeared inside the doc, and instoplist marks whether it is one of the words in the stop-list).
Now I need to be able to run queries against that index.
Let's say I need to find - car and (motorcycle or bicycle)
What is the best way to do that? How do I work out the order of the search? How do I know to take motorcycle and bicycle first, do 'or' between them, and then do 'and' with car?
*ps - I'm using PHP to write the code.
I will appreciate any kind of help,
Thanks

You can take the intersection of the documents containing car with the union of the documents containing motorcycle or bicycle:
car: doc1, doc2, doc3
motorcycle: doc1, doc4
bicycle: doc1, doc2
So your final list of documents should be doc1, doc2.
For finding the intersection and union in PHP, let's say you have 3 arrays, $car, $motorcycle and $bicycle, each containing the documents that have the given word:
<?php
$car = ['doc1', 'doc2', 'doc3'];
$motorcycle = ['doc1', 'doc4'];
$bicycle = ['doc1', 'doc2'];

// Union of the OR terms; array_merge keeps duplicates, so deduplicate
$union = array_unique(array_merge($motorcycle, $bicycle));

// Intersection with the AND term
$result = array_intersect($car, $union);

// array_intersect preserves the original keys, so iterate with foreach
foreach ($result as $doc) {
    echo $doc;
    echo "<br>";
}
?>
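On the ordering question from the post: the parentheses decide the evaluation order, so the bracketed OR is computed first and its result is fed to the AND. A minimal set-based sketch of the same example, in Python for brevity (the postings lists are made up for illustration; in PHP the equivalent calls are array_merge/array_unique and array_intersect):

```python
# Hypothetical postings lists: word -> set of ids of docs containing it
index = {
    'car':        {'doc1', 'doc2', 'doc3'},
    'motorcycle': {'doc1', 'doc4'},
    'bicycle':    {'doc1', 'doc2'},
}

# car and (motorcycle or bicycle): evaluate the parentheses first
inner = index['motorcycle'] | index['bicycle']   # 'or'  -> union
result = index['car'] & inner                    # 'and' -> intersection
print(sorted(result))  # ['doc1', 'doc2']
```

For arbitrary query strings you would first parse the query into a tree (innermost parentheses deepest), then evaluate each node bottom-up with exactly these two set operations.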

How to aggregate multiple rows into one in CSV?

I have the following problem:
I have a CSV file, which looks like this:
1,12
1,15
1,18
2,10
2,11
3,20
And I would like to parse it somehow to get this:
1,12,15,18
2,10,11
3,20
Do you have any solution? Thanks!
Here is one solution for you.
This first part just sets up the example for testing. I am assuming you already have a file with values, so you will only need the second part of the script.
$path = "$env:TEMP\csv.txt"
$data = @"
1,12
1,15
1,18
2,10
2,11
3,20
"@
$data | Set-Content $path
This should be all you need:
$path = "$env:TEMP\csv.txt"
$results = @{}
foreach($line in (Get-Content $path))
{
    $split = $line -split ','
    $rowid = $split[0]
    $data = $split[1]
    if(-not($results.$rowid))
    {
        $results.$rowid = $rowid
    }
    $results.$rowid += "," + $data
}
$results.Values | Sort-Object
Your original dataset does not need to be sorted for this one to work. I slice the data up and insert it into a hashtable.
I don't know your exact code requirements, but I will try to write out some logic which may help you!
CSV means a text file, which I can read into a string or an array.
If one looks at the above CSV data, there is a common pattern, i.e. a separator between each pair.
So my parsing would depend on 2 phases:
parse on ' ' (a single space) and insert the pieces into an array (say elements);
then parse each element on ',' (comma) and save the pieces into another array (say details), where odd indexes will contain the left-hand values and even indexes will contain the right-hand values.
Then, while printing or using the data, skip the odd index if you already have an existing value.
Hope this helps...
Satyaranjan,
thanks for your answer! To clarify - I don't have any code requirements; I can use any language to achieve the result. The point is to take the unique values from the first position (1, 2, 3) and put all related numbers to the right (1 - 12, 15 and 18, etc.). It is something like the GROUP_CONCAT function in MySQL - but unfortunately I don't have such a function here, so I am looking for some workaround.
Hope it is more clear now. Thanks
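Since any language is fine, one possible GROUP_CONCAT-style workaround sketched in Python (the in-memory list of lines stands in for reading the actual CSV file):

```python
# Group the values in column 2 under the key in column 1, GROUP_CONCAT-style.
# Plain dicts keep insertion order on Python 3.7+, so the input need not be sorted.
def aggregate(lines):
    groups = {}
    for line in lines:
        key, value = line.strip().split(',', 1)
        groups.setdefault(key, []).append(value)
    return [','.join([key] + values) for key, values in groups.items()]

lines = ['1,12', '1,15', '1,18', '2,10', '2,11', '3,20']
print('\n'.join(aggregate(lines)))
# 1,12,15,18
# 2,10,11
# 3,20
```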

Categorizing Hashtags based on similarities

I have different documents with a list of hashtags in each. I would like to group them under the most relevant hashtag (which would be present in the document itself).
E.g.: if there are #Eco, #Ecofriendly, #GoingGreen - I would like to group all of these under the most relevant and representative hashtag (say #Eco). How should I approach this, and what techniques and algorithms should I be looking at?
I would create a bipartite graph of documents-hashtags and use clustering on a bipartite graph:
http://www.cs.utexas.edu/users/inderjit/public_papers/kdd_bipartite.pdf
This way I am not using the content of the document, but just clustering the hashtags, which is what you wanted.
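The paper describes a spectral method; as a much simpler stand-in (my own simplification, not the paper's algorithm), you can already get coarse groups by taking connected components of the hashtag co-occurrence graph:

```python
from collections import defaultdict

def group_hashtags(documents):
    """Group hashtags into connected components of the co-occurrence graph:
    two hashtags are linked if some document contains both.
    (A crude stand-in for proper bipartite clustering.)"""
    # Adjacency: hashtag -> set of hashtags it co-occurs with
    adj = defaultdict(set)
    for tags in documents:
        for t in tags:
            adj[t].update(x for x in tags if x != t)
    # Depth-first search to collect connected components
    seen, groups = set(), []
    for t in adj:
        if t in seen:
            continue
        stack, component = [t], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            component.append(u)
            stack.extend(adj[u] - seen)
        groups.append(sorted(component))
    return groups

docs = [['#Eco', '#Ecofriendly'], ['#Ecofriendly', '#GoingGreen'], ['#Python']]
print(group_hashtags(docs))  # [['#Eco', '#Ecofriendly', '#GoingGreen'], ['#Python']]
```

Within each group you could then pick the most frequent hashtag as the representative label, as the other answer below suggests.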
Your question is not very strict and as such may have multiple answers. However, if we assume that you literally want to "group all these under the most common hashtag", then simply loop through all hashtags, compute how often each comes up, and then for each document select the one with the highest number of occurrences.
Something like
N = {}
for D in documents:
    for h in D.hashtags:
        if h not in N:
            N[h] = 0
        N[h] += 1

for D in documents:
    best = None
    for h in D.hashtags:
        if best is None or N[best] < N[h]:
            best = h
    print('Document', D, 'should be tagged with', best)

R package Twitter to analyze tweets text

I'm using the twitteR package (specifically, the searchTwitter function) to export to a CSV file all the tweets containing a specific hashtag.
I would like to analyze their text and discover how many of them contain a specific list of words that I have saved in a file called importantwords.txt.
How can I create a function that returns a score of how many tweets contain the words written in importantwords.txt?
Pseudocode:
for (every word in importantwords.txt):
    i = 0
    for (every line in tweets.csv):
        if (line contains word):
            i = i + 1
    print(word: i)
Is that along the lines of what you wanted?
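Written out in Python for concreteness (the same loop translates directly to R with sapply and grepl; the sample tweets below are made up):

```python
def word_scores(important_words, tweets):
    """For each important word, count how many tweets contain it
    (case-insensitive substring match, at most one count per tweet)."""
    return {word: sum(1 for tweet in tweets if word.lower() in tweet.lower())
            for word in important_words}

tweets = ['I love #rstats and data', 'Data science is fun', 'hello world']
print(word_scores(['data', 'love'], tweets))  # {'data': 2, 'love': 1}
```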
I think your best bet is to use the tm package.
http://cran.r-project.org/web/packages/tm/index.html
This fella uses it to create Word Clouds with the information. Looking through his code will probably help you out too.
http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/
If your important words are just there to filter out "the", "a" and things like that, this will work fine. If it's for something more particular, you'll need to loop over the corpus with your list of words, retrieving the counts.
Hope it helps
Nathan

Parsing a CSV file with rows of varying lengths

I am calling a web service that returns a comma-separated dataset with varying columns and multiple text-qualified rows (the first row denotes the column names). I need to insert each row into a database while concatenating the columns that vary.
The data is returned like so
"Email Address","First Name","Last Name", "State","Training","Suppression","Events","MEMBER_RATING","OPTIN_TIME","CLEAN_CAMPAIGN_ID"
"scott@example.com","Scott","Staph","NY","Campaigns and activism","Social Media","Fundraiser",1,"2012-03-08 17:17:42","Training"
There can be up to 60 columns between State and MEMBER_RATING, and the data in those fields is to be concatenated and inserted into one database column. The first four fields and the last three fields in the list will always be the same. I'm unsure of the best way to tackle this.
I am not sure if this solution fits your needs; I hope so. It's a Perl script that joins all fields but the first four and the last three with - surrounded by spaces. It uses a non-standard module, Text::CSV_XS, that must be installed using CPAN or a similar tool.
Content of infile:
"Email Address","First Name","Last Name","State","Training","Suppression","Events","MEMBER_RATING","OPTIN_TIME","CLEAN_CAMPAIGN_ID"
"scott@example.com","Scott","Staph","NY","Campaigns and activism","Social Media","Fundraiser",1,"2012-03-08 17:17:42","Training"
Content of script.pl:
use warnings;
use strict;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    allow_whitespace => 1,
});

open my $fh, q[<], $ARGV[0] or die qq[Open: $!\n];

while ( my $row = $csv->getline( $fh ) ) {
    my $concat = join q[ - ], (@$row)[4 .. @$row - 4];
    splice @$row, 4, scalar @$row - (3 + 4), $concat;
    $csv->print( \*STDOUT, $row );
    print qq[\n];
}
Run it like:
perl script.pl infile
With following output:
"Email Address","First Name","Last Name",State,"Training - Suppression - Events",MEMBER_RATING,OPTIN_TIME,CLEAN_CAMPAIGN_ID
scott@example.com,Scott,Staph,NY,"Campaigns and activism - Social Media - Fundraiser",1,"2012-03-08 17:17:42",Training

Data collection task

I have data that follows this kind of pattern:
ID Name1 Name2 Name3 Name4 .....
41242 MCJ5X TUAW OXVM4 Kcmev 1
93532 AVEV2 WCRB3 LPAQ 2 DVL2
.
.
.
As of now this is just formatted in a spreadsheet and has about 6000 lines. What I need to do is create a new row for each Name after Name1 and associate it with the ID on its current row. For example, see below:
ID Name1
41242 MCJ5X
41242 TUAW
41242 OXVM4
41242 Kcmev 1
93532 AVEV2
93532 WCRB3
93532 LPAQ 2
93532 DVL2
Any ideas how I could do this? I feel like this shouldn't be too complicated, but I'm not sure of the best approach. Whether it's a script or some function, I'd really appreciate the help.
If possible, you might want to use a CSV file. These files are plain text and most spreadsheet programs can open/modify them (I know Excel and the OpenOffice equivalent can). If you go with this approach, your algorithm will look something like this:
read everything into a string array
create a 1-to-many data structure (maybe a Dictionary<string, List<string>> or a list of (string, string) tuple types)
loop over each line of the file
split the current line on the ','s and loop over those pieces
if this is the first piece, add a new item to the 1-to-many data structure with the current piece as the ID
otherwise, add this piece to the "many" (name) part of the last item in the data structure
create a new CSV file or open the old one for writing
output the "ID, Name1" row
loop over each 1-many item in the data collection
loop over the many items in the current 1-many item
output the 1 (id) + "," + current many item (current name)
You could do this in just about any language. If it's a one-time script then Python, Ruby, or PowerShell (depending on platform) would probably be a good choice.
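The steps above could be sketched like this in Python (the sample rows are made up; csv.reader/csv.writer would handle the actual file I/O and quoting):

```python
def widen_to_long(rows):
    """Turn [id, name1, name2, ...] rows into one (id, name) row per name --
    the wide-to-long reshape described above."""
    out = [['ID', 'Name1']]                 # header row
    for row in rows:
        row_id, names = row[0], row[1:]
        for name in names:
            out.append([row_id, name])
    return out

rows = [['41242', 'MCJ5X', 'TUAW', 'OXVM4', 'Kcmev 1'],
        ['93532', 'AVEV2', 'WCRB3', 'LPAQ 2', 'DVL2']]
for r in widen_to_long(rows):
    print(','.join(r))
```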
