How to find abnormal IDs among many IDs - machine-learning

We run an affiliate program. Users who sign up can gain points when they successfully recruit other users. However, spammers are abusing this program, and automatically signing up large numbers of accounts. We want to prevent this from happening by closing down clearly machine-generated accounts. My idea for this is to write a program to identify machine-generated account names, or at least select a subset for manual inspection.
So far, we have found two types of abnormal IDs:
The first type is IDs that look very similar to others, such as:
wss12345
wss12346
wss12347
test1
test2
...
The second type is IDs that look randomly generated, with no apparent pattern, such as:
MiDjiSxxiDekiE
NiMjKhJixLy
DAFDAB7643
...
For the first type, I use the Levenshtein (edit) distance. This method finds IDs like those illustrated under type 1. (I have done this, and it performs well.)
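As a rough illustration of the edit-distance approach, here is a minimal Python sketch. The function names and the `max_distance=1` cut-off are assumptions made for the example, not the poster's actual code, and the pairwise comparison is quadratic, so a real system would probably need to bucket IDs first (for instance by prefix).

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def near_duplicates(ids, max_distance=1):
    """Return the IDs that have at least one other ID within max_distance edits."""
    flagged = set()
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if levenshtein(a, b) <= max_distance:
                flagged.update((a, b))
    return flagged
```

With a cut-off of 1, sequential names such as wss12345 and wss12346 are flagged together, while unrelated names are left alone.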
For the second type, I can calculate the probability of an ID, like this:
id = "DAFDAB7643"
p(id) = p(D)*p(A|D)*p(F|A)*p(D|F)*...*p(3|4)
So I can use the probability to filter out the abnormal ids. (Just an idea; I haven't tried it out.)
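The chain of conditional probabilities above is a character bigram (first-order Markov) model. A minimal sketch in Python follows, assuming you have a list of known-good user names to train on. The start marker, the add-one smoothing, and the per-character averaging (so that long names are not penalised simply for being long) are choices made for this example, not part of the original idea.

```python
import math
from collections import Counter, defaultdict

START = "^"  # hypothetical start-of-name marker, so p(D) becomes p(D | START)


def train_bigrams(names):
    """Count character transitions in a list of known-good names."""
    counts = defaultdict(Counter)
    alphabet = {START}
    for name in names:
        chars = [START] + list(name)
        alphabet.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            counts[prev][cur] += 1
    return counts, alphabet


def avg_log_prob(name, counts, alphabet):
    """Average log-probability per transition, with add-one smoothing."""
    chars = [START] + list(name)
    vocab = len(alphabet) + 1  # one extra bucket for characters never seen in training
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        seen = counts.get(prev, Counter())
        total += math.log((seen[cur] + 1) / (sum(seen.values()) + vocab))
    return total / max(len(chars) - 1, 1)
```

Names whose score falls below some threshold could then be queued for manual inspection. The threshold would have to be tuned on real data.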
Can anyone give me other suggestions about this topic? How else could I approach this problem? Can you see flaws or omissions in my attempts?

1. Assuming that these new accounts refer back to the recruiter's ID, I'd look at the rate and/or sheer number of new accounts associated with a given recruiter.
2. Some analysis of IP addresses or similar may also indicate whether multiple users are coming from the same computer.
3. I'd use a dictionary of words, and do roughly the reverse of detecting poor passwords: human user names should contain dictionary words or personal names, lack punctuation, not include repeated characters, be mostly lower case, etc.
4. Going back to 1. above: if a recruiter has an anomalously tight cluster of IDs, measured using the features you've already identified, that would be a good flag. I think this is essentially @larsmans' comment directly under the question.
I'd be curious to know if re-purposing password checking algorithms (item 3) provides any benefit.
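To make the first suggestion concrete, here is a small Python sketch that flags recruiters with an unusually high sign-up rate. The input shape (pairs of recruiter ID and sign-up time), the one-hour window, and the threshold of 20 are all assumptions for illustration, so tune them against your own data.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_fast_recruiters(signups, window=timedelta(hours=1), threshold=20):
    """signups: iterable of (recruiter_id, signup_time) pairs.

    Returns the recruiters with at least `threshold` sign-ups
    inside any sliding time span of length `window`.
    """
    by_recruiter = defaultdict(list)
    for recruiter, when in signups:
        by_recruiter[recruiter].append(when)

    flagged = set()
    for recruiter, times in by_recruiter.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Shrink the window from the left until it spans at most `window`.
            while t - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(recruiter)
                break
    return flagged
```

The same grouping could be keyed on IP address instead of recruiter ID to cover the second suggestion.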

You're not telling us what sort of site you are running, so this is a bit on the speculative side; but consider Stack Overflow as a prime example of successfully promoting good behavior through the use of a user reputation system, and weeding out many kinds of unwanted behaviors.
A quick, hackish fix might be to progressively deduct from the score as the number of dormant recruit accounts grows, but a more rewarding and compelling fix is to award higher reputation scores for actually contributing to the site's content. However, this depends on the type of site you have; a stock market tips site, say, obviously works quite differently from a technical discussion forum.

Related

How should BDD statements be properly constructed? Is there a convention used in teams?

Is there a preferred way of creating BDD scenarios in small agile teams and amongst the community? I'm using Courgette, and it gives an example at https://courgette-testing.com/bdd
Scenario: Refunded items should be returned to stock
Given a customer previously bought a black sweater from me
And I have three black sweaters in stock.
When they return the black sweater for a refund
Then I should have four black sweaters in stock.
Does this sound like a good idea? Would this work well for communication in teams?
I've used their web steps bit, and am now doing the refactor bit to make it clear to the business.
Any links would help. Thanks
The conversations in BDD are more important than the tools. Rather than starting with the finely-grained specification in Courgette's example, try talking to the business first. Ask them for an example of the kind of behaviour they want.
When you write it down, start by just writing it the way they describe it. It's amazing how few people listen properly! After you've got the example from them, take a look at it. Can you see which bits are the contexts (Givens) and which are the outcomes (Thens)? Which is the step that's associated with triggering the behaviour you're interested in (Whens)?
Once you've worked that out, there are a couple more questions I like to ask:
Is there any other context which, for this same event, gives a different outcome?
Is there any other outcome that's important?
For instance, if I was implementing this behaviour for a big supermarket, I might come across an example like:
"Oh! No, don't add food back to stock. We don't know how it's been stored. We refund it if there's something wrong with it, but we bin it."
You can probably see how that might change your code!
Testers are really great at asking these questions and spotting missing scenarios! This leads us to the "Three Amigos" pattern. I like to include:
A business person, Product Owner, subject matter expert or person with the problem
A tester
A dev (or a pair of devs).
You can also include UI designers, technical writers, etc. - Matt Wynne says it's "Three Amigos where three is a number between 3 and 7".
I really like it when the developer writes the scenarios down, in any form that allows them to get to the "Given, When, Then". Sometimes I'll do it in the meeting; sometimes I do it later and show it or send it to my business person.
Courgette's example is something that typically happens when people don't have these conversations. If you start with the conversations, you're much more likely to get something that matches the above. Not only are those declarative steps easier for business to read and for the whole team to talk about, but they're also easier to maintain, as the detail of how they're achieved is hidden (usually in Step Definitions, and further in Page Objects).
There's all kinds of useful posts for BDD newcomers on my blog if you want to know more!

BDD: Should I add multiple givens and outcomes to a single scenario or split scenarios based on outcome?

Given a scenario that tests sending a message to a 3rd-party API, I can add multiple givens and outcomes to a single scenario, for each property of the message. This makes the scenario quite complex.
I can also break these out into separate scenarios. But they really are not different scenarios.
This is a scenario with multiple givens and outcomes:
Scenario 1: An order
Given an order
And that has order ID equal to 42
And that has affiliate reference equal to foo
When the conversion for the order is sent
Then the conversion has an ID equal to 42
And the conversion has an affiliate ID equal to foo
And here I have broken it up into multiple scenarios:
Scenario 1: An order with a specific order ID
Given an order that has order ID equal to 42
When the conversion for the order is sent
Then the conversion has an ID equal to 42
Scenario 2: An order with a specific affiliate reference
Given an order that has affiliate reference equal to foo
When the conversion for the order is sent
Then the conversion has an affiliate ID equal to foo
Try having a conversation with someone in the business about the order. Ask them for an example of the kind of order that has an affiliate reference.
If they naturally talk about an order with a certain ID and affiliate reference, and those two things come together, it's fine to put it in one scenario. You'll probably hear them talk about both things in the same clause, for instance:
Bus: So, when we send the order for conversion, it should have the same ID and
affiliate reference.
MvO: Can you give me an example of those? The ID and affiliate
reference?
Bus: Sure, an ID is a simple integer, so, 42, and the affiliate
reference is the name of our affiliate, so something like 'Foo'.
(By the way, use realistic affiliate names if you can - it makes it easier for the business to spot if you've missed something!)
When we convert this to Gherkin, keeping the language as natural as possible (I wrote a blog post on this), we get something like:
Given an order with ID 42 and affiliate reference "foo"
When we send the order for conversion
Then the conversion should have the same ID and affiliate reference.
If, however, there are some orders which don't have affiliate references, or retaining the affiliate reference is a completely separate capability and the business talk about it separately, probably you want two scenarios.
Note there are some other benefits to talking to your business representatives!
First, they'll probably phrase the "when" in the active voice (we send the order) rather than the passive (the order is sent) which makes it much easier to see who's doing what. This is especially important in scenarios with multiple roles, and helps us think about who or what triggers the outcome. (Here's a blog post about tenses and voices in BDD.)
Second, you get a chance to question them! "Are there any orders which don't have affiliate references? Do all orders have IDs like that, or do you have some old orders with old-style IDs floating around in the system?" And so forth. If you can't think of questions to ask easily, bring a tester with you. Testers are great at thinking of questions to ask. (I wrote a blog post on this, too.)
Third, you're more likely to carry the same language the business use into the code, so it's going to be easier to maintain, and you'll be able to have conversations about it more easily too.
If your business aren't actually interested in conversations around what the API does then don't use Gherkin-based tools for the API tests. You can maintain a little DSL in plain old XUnit much more easily than in English.
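As a sketch of what such a "little DSL" might look like in an xUnit-style framework, here is a Python `unittest` version. `Order`, `an_order` and `send_conversion` are hypothetical names invented for this example, and `send_conversion` is a stand-in for whatever code really builds the message for the third-party API.

```python
import unittest
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int = 1
    affiliate_reference: str = "none"


def an_order(**overrides):
    """Builder so tests read like 'an order with ID 42 and affiliate reference foo'."""
    return Order(**overrides)


def send_conversion(order):
    """Hypothetical stand-in for the production code that builds the API message."""
    return {"id": order.order_id, "affiliate_id": order.affiliate_reference}


class ConversionTest(unittest.TestCase):
    def test_conversion_keeps_id_and_affiliate_reference(self):
        # Given
        order = an_order(order_id=42, affiliate_reference="foo")
        # When
        conversion = send_conversion(order)
        # Then
        self.assertEqual(conversion["id"], 42)
        self.assertEqual(conversion["affiliate_id"], "foo")
```

Saved as a module, this can be run with `python -m unittest`. The Given/When/Then comments keep the structure of the scenario without the overhead of mapping English sentences to step definitions.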
To cover your question more generically: yes, it's fine to have multiple givens and outcomes in a scenario. I generally reckon that once you've got more than seven steps, you want to be splitting it into separate scenarios.
Make sure you have conversations around the scenarios, though, because a lot of these problems go away when you do.

How precise should user stories be?

I've just started using SpecFlow. It's a tool for creating business understandable test scenarios in a BDD manner. Basically it transforms user stories to unit tests.
I'm a beginner with user stories and I wonder about their length. Is it good practice to create very precise user stories? Here's an example:
In order to get help
As a StackOverflow user
I want to add post
with name and content
and add tags to it
and format the content
and the information about my post edits to be stored in the system
and some more things like that
Should I keep my stories compact? If so, how can I manage detailed requirements? Or is there nothing wrong with a very long and precise "I want" section in a user story?
If you could develop an entire system in a couple of weeks, and do that reliably, nobody would ever worry about "user stories". They'd just get you to develop the system, sit with you, and tweak it as it went.
User stories only exist in order to get feedback from people who can't be with you all the time, and to help you learn what it is that your users (and other stakeholders) really want.
Here's how I treat a list like this:
In order to get help
As a StackOverflow user
I want to add post
with name and content
and add tags to it
and format the content
and the information about my post edits to be stored in the system
You want to get help. Which of these actually add to your ability to get help? Is it you wanting help, or do you want to offer help to other people? Do you want recognition for the help you're offering other people? The top part of this seems false (and it's why it's really difficult to have these conversations with fake requirements).
I think there are multiple requirements here, and far beyond the scope of just one user story. With an analyst hat on, here's how I might break this down:
In order to award great content with appropriate recognition,
as Stack Exchange,
we want people's usernames to appear with their content.
Of course, the users want this too, but they're not paying for it (except through adverts). So work out who's paying for this, and why.
In order to get more page impressions and keep people on the site for longer,
as Stack Exchange,
we want users to be able to find similar content really easily.
Hm. This one's a bit trickier. See, the user doesn't really want to spend their entire life on StackOverflow. It's just that if we give them the appropriate recognition, and make it easier for others to find their content, they might do that. Not all "user stories" actually benefit users. Find out who's paying for them, and why; then you find your real stakeholder. It's also OK for a story to benefit more than one stakeholder, and it's easy to see how to rephrase this from the user's point of view as well.
format the content
Honestly not sure about this one. It might be about being able to emphasise important points, etc. There are a ton of aesthetic ideals that don't lend themselves well to BDD and automated scenarios. Sometimes the only way to do this is to try, and get feedback.
In order to avoid retyping my request every time
As the user
I want the information about my post edits to be stored in the system
Well, yes, that would be nice.
The thing is that each of these can be developed independently. If you can think of any feature, any item that you could get rid of and still have the release be valuable, put it in a separate story.
If you can replace "I want to..." with "I want to be able to..." it's likely that what you have there isn't a story, but an entire capability. Most people do this instinctively. Lots of people call those "epics".
I've just shown you how I break them down. It's a pretty simple process.
First, look at your requirements. If there's anything for which you can say, "I want to be able to..." or "Someone wants to be able to..." then you know that's a completely different capability, which means it's going to be a separate story.
You can then separate those into contexts. So you might have stories like:
In order to free up our junior traders
We want them to be able to generate contracts automatically
So that they can help with the trade analysis instead of typing.
If that seems too big for the feedback cycle (typically a two-week sprint), you can divide it further.
In order to free up our junior traders
We want them to be able to generate *orange juice* contracts automatically
So that they can help with the trade analysis instead of typing.
Here, we're focusing on being able to trade orange juice, but we could equally narrow the story down to the FTSE, or the US, or the NY stock exchange. This is how we focus the efforts on the thing that will deliver: protecting revenue, lowering costs or generating value.
To turn these into scenarios, I ask, "Can you give me an example of an OJ trade on the NY stock exchange?" If I see anything generic that I don't understand, I ask, "Can you give me an example of that?"
That example becomes my first scenario. The context (given) is defined by the limits of the story. The event (when) is the performance of the capability. The outcome (then) is the resulting value.
In answer to your question - yes, I think it's important to create precise user stories. That means knowing why it's valuable, defining the context that you're going to cover, and suggesting an example of what the outcome might be.
The example you gave is more than just one story, though. It's not precise enough. Hopefully the advice here will help you to narrow stories down to something useful. One or two days is a good length for a story, but if you're starting down this path and find they're a bit longer, that's OK.
Your changes are also stories.
I always advise the following:
Try cutting your stories in scenarios. The more scenarios, the better you can pinpoint when something is going wrong. Give all scenarios subjective names.
Take your test, for example: if step 1 goes wrong, none of your other steps will get tested.
Also use the Given, When and Then tags to read your scenarios easily.
So instead, you could say:
Feature: As a StackOverflow user I want to add a post
Scenario: I go to stackoverflow website
Given I open the browser
And I go to the stackoverflow website
When I click New Post
Then a new page appears to insert my data
Scenario: I fill in data for my post - Name and content
Given I do not modify this page
When I fill in name
And I fill in content
Then I add tags to it
And I format the content
Scenario: Check if information about post edits are stored in the system
Given...
Guess you will get where this is going :-)
There is no right detail level of user stories, as user stories shrink in size (scope) and grow in detail (specifications) over time. This slide shows a nice visualization from Gojko Adzic about this: http://www.slideshare.net/chassa/2015-0214agile-reqend2endcomplete/6
For the question on how precise and detailed a Gherkin scenario should be: scenarios should reveal interesting aspects of the user story to be implemented. They should use concrete (key) examples rather than abstract descriptions. The examples should focus on the aspect that should be illustrated. The scenario title should be an abstract description of the rule or aspect that is illustrated with the example(s) provided in the scenario.
You usually start with a main aspect (happy path) scenario, and then try to “break the model” by coming up with new examples (cases) that explore other aspects of the story. You start by asking the questions “How would you try out the story when it was implemented?” (happy path) and “What should happen if …?” to collect potential scenarios to consider (probably defining some of the questions to be out of scope for this story).
After that, you’re trying to answer these questions (scenario title) and illustrate them with concrete examples (scenario steps). This slide gives an idea of “break the model”: http://www.slideshare.net/chassa/2015-0214agile-reqend2endcomplete/61

How to document non-functional requirements (NFRs) in a story/feature?

The Specification By Example book states the non-functional requirements (commonly referred to as NFRs) can be specified using examples.
I've also been told by a colleague that non-functional requirements may be specified using SBE stories using the format:
Scenario: ...
Given ...
When ...
Then ...
Here is an example functional and non-functional requirement taken from wikipedia:
A system may be required to present the user with a display of the
number of records in a database. This is a functional requirement. How
up-to-date this number needs to be is a non-functional requirement. If
the number needs to be updated in real time, the system architects
must ensure that the system is capable of updating the displayed
record count within an acceptably short interval of the number of
records changing.
Question 1: Can the non-functional requirement be specified as a story?
Question 2: Should the non-functional requirement be specified as a story?
Question 3: What would the story look like?
I'll give an answer by working through an example.
Let us say that your team has already implemented the following story:
Scenario: User can log in to the website
Given I have entered my login credentials
When I submit these credentials
Then I get navigated to my home screen
To answer Question 1) - Can the non-functional requirement be specified as a story?
The project stakeholders have given you a NFR which reads:
For all website actions, a user should wait no longer than five
seconds for a response.
You could create a story for this as follows:
Scenario: User can log in to the website in a timely fashion
Given I have entered my login credentials
When I submit these credentials
Then I get navigated to my home screen
And I should have to wait no longer than the maximum acceptable wait time
Note that instead of imperatively specifying '5' seconds, I have kept the scenario declarative and instead specified "wait no longer than the maximum acceptable wait time".
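Keeping the scenario declarative means the concrete figure has to live somewhere in the step implementation. A minimal Python sketch of that idea follows. The constant and helper names are hypothetical, and a real step definition would time the actual login request rather than an arbitrary callable.

```python
import time

# The stakeholders' figure lives here, not in the scenario text.
MAX_ACCEPTABLE_WAIT_SECONDS = 5.0


def timed(action):
    """Run action() and return (result, elapsed_seconds)."""
    started = time.perf_counter()
    result = action()
    return result, time.perf_counter() - started


def within_max_wait(elapsed_seconds, limit=MAX_ACCEPTABLE_WAIT_SECONDS):
    """True if the measured wait is no longer than the maximum acceptable wait time."""
    return elapsed_seconds <= limit
```

If the stakeholders later change the limit, only the constant changes and the scenario wording stays the same.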
To answer question 2) - Should the non-functional requirement be specified as a story?
The NFRs should definitely be specified as a story.
Creating a story will allow this task's complexity to be estimated (so that the team can determine how difficult it is relative to past stories), plus the team can break the story down into tasks (which can be estimated in hours, so that you can work out if the team can implement this story in the current sprint).
Hence in my contrived example, the team would have already implemented the code to log in, but they'd then determine how to implement the requirement that it must take no longer than 5 seconds to log in. You will also be able to explore the inverse of this problem, i.e. what happens if it takes longer than five seconds to log in? For example:
Scenario: User encounters a delay when logging in to the website
Given I have entered my login credentials
When I submit these credentials
And I wait for over the maximum acceptable wait time
Then the Production team is informed
And the problem is logged
And I get navigated to my home screen
And finally, regarding question 3) - What would the story look like?
I've detailed what the stories would look like in answers 1) and 2).
Q1: Yes, definitely they can.
Take a look at this article on Handling Non Functional Requirements in User Stories.
Q2. From my perspective, if you are able to create them, it's really worth keeping and tracking them this way. But, to quote the article:
There is no magical agile practice that helps you uncover NFR. The
first step is to take responsibility. NFR can be represented as User
Stories if the team finds that this helps to keep these visible.
However, be aware that surfacing such stories may create issues around
the priority of work done on them against more obvious features.
Q3. Take a look at the article mentioned in Q1.
I think the boundaries of NFRs are still not fully agreed upon by everyone. Consider a story that says "As a manager, my employee must get all responses within 5 seconds to avoid hiring a second data entry person and adding $50,000 in payroll expenses." I consider that a fully functional business requirement, along with any performance requirements that focus on the end user experience.
I categorize "traditional" NFRs as stories where the impacted person is not in the end user's or stakeholder's organization. "As a support person I need logs of the web site traffic to help me troubleshoot problems," or "As a software maintainer, I need a block architecture diagram to help me make changes." Including the role as you would with any user story helps with prioritization. It also helps identify the stakeholder for that NFR, should you have any questions about it.
NFRs may include some aspects of performance, at least those that don't impact the end user. "As a system administrator, I want to allocate no more than 10GB of disk space to the database in order to use SQL Express and avoid expensive SQL Server licenses."
Consider a typical NFR that might only state "Databases are limited to 10GB." It's an arbitrary number with no meaning or rationale, and there's no way to question it. Having the story-like role and explanation helps everyone understand that there is a valid reason for the NFR, so when you're prioritizing them you can ask smart questions. They lead to conversations like "I need to expand my table space to 20GB, but the sysadmin has this NFR about database size. How much do SQL Server licenses really cost him? OMG, that much? OK, I'll denormalize a few tables and save a few GB to fit it in there."
As both @bensmith and @siemic show, yes, you can capture NFRs as stories.
Should you capture them in this way?
I don't think you want to capture NFRs as part of regular feature stories.
Most NFRs apply to more than one story. "The system must be responsive" means every story needs to define maximum wait times. "The system must not consume more than 10GB of disk space" means every story needs to consider disk space. The list of "and"s in the story becomes unmanageable in even trivial cases.
You may want to capture NFRs as independent stories, if both the product owner and team are comfortable with this.
For instance:
Given I have a PC with at least a dual core processor
and 8GB of RAM
and a gigabit connection to the system
when I interact with the system
then I never have to wait more than 5 seconds for a response
and 90% of attempts respond within 1 second
This provides a clear requirement, with measurable targets. You just have to make sure that each story takes all of the NFRs into account.
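Targets like these are straightforward to check once response times have been measured. The Python sketch below assumes a list of measured response times in seconds and uses a nearest-rank 90th percentile. The function name and the choice of percentile method are assumptions for illustration.

```python
def meets_response_targets(times_s, hard_limit_s=5.0, p90_limit_s=1.0):
    """Check 'never more than 5 seconds' and '90% of attempts within 1 second'."""
    if not times_s:
        raise ValueError("need at least one measurement")
    ordered = sorted(times_s)
    # Nearest-rank 90th percentile: ceil(0.9 * n), computed in integer arithmetic.
    rank = (9 * len(ordered) + 9) // 10
    p90 = ordered[rank - 1]
    return ordered[-1] <= hard_limit_s and p90 <= p90_limit_s
```

A check like this could run against load-test results in development as well as against sampled production timings.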
I think you need to look at a few things.
NFRs should follow the life span of the application, software, product, etc. Backup and recovery scenarios should be covered regularly, and security scans and performance should be measured in production as well as in development.
Many NFRs need validation from teams outside the development group, so they would not be expected to have a script or code written to verify them. Obviously, security, performance, scalability, resilience, etc. can and should be tested within the development phase, or before code gets promoted to live.
Most NFRs can be written up as stories but, as I said, I don't think all of them need development effort to cover them.
regards
Martin

How specific do I get in BDD scenarios?

Take two different ways of stating the same behavior.
Option A:
Given a customer has 50 items in their shopping cart
When they check out
Then they will receive a 10% discount on their order
Option B:
Given a customer has a high volume of items in their shopping cart
When they check out
Then they will receive a high volume discount on their order
The former is far more specific. If someone has some question about exactly when a customer gets a high volume discount or how much to give them, reading this scenario makes it very clear. Serving the purposes of documenting the behavior, it's about as specific as it can be, although any change in those values will require changing the scenario.
The second is more generalized and doesn't have the clarity of the first. Automating it would require incorporating the values "50" and "10" in the step implementations. On the other hand, the scenario captures the core business need: a high volume customer gets a discount. If we later decide to use "40" and "15", the scenario doesn't have to change because the core business need hasn't really changed (though the step implementation would). Also, the term "high volume customer" communicates something about why we're giving them the discount.
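For illustration, here is roughly where the values would end up under Option B: in the code behind the steps. This is a Python sketch with hypothetical names. Treating "50 items" as "50 or more" is an assumption, since the scenario only gives a single example.

```python
from decimal import Decimal

# Under Option B these business values are invisible to anyone reading the scenario.
HIGH_VOLUME_THRESHOLD = 50               # items (assumed to mean "50 or more")
HIGH_VOLUME_DISCOUNT = Decimal("0.10")   # 10%


def is_high_volume(item_count):
    """'a customer has a high volume of items in their shopping cart'"""
    return item_count >= HIGH_VOLUME_THRESHOLD


def checkout_total(item_count, subtotal):
    """Apply the high volume discount at checkout when it is earned."""
    if is_high_volume(item_count):
        return subtotal * (1 - HIGH_VOLUME_DISCOUNT)
    return subtotal
```

Changing the rule to "40" and "15" would touch only the two constants, which is exactly the trade-off being asked about: the scenario stays stable, but the reader has to open the code to learn the numbers.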
So, which is better? Rather, under what circumstances should I favor the former or the latter?
I think I'll go for option A.
The thing is that BDD scenarios must serve as documentation of the system.
So if a non-technical person (a business person, a tester or someone from the customer support team) wants to know how your discount system works, they will surely want to know what counts as a high volume of items and what discount is applied.
And they would not want to have to go in the plumbing code to get this information back.
I think this information is important and can not be hidden from the reader.
Another benefit is that it allows a non-developer (a tester, for example) to write new scenarios and check what happens if there is 1 item in the cart, or 100 items.
When you get too abstract about things, it gets harder to apply deliberate discovery.
So with a scenario like Option B, you lose the opportunity to ask yourself these questions:
What happens if we have more than 50 items, say 100? Is there any other discount available?
What happens if we have 1 item? Surely we should not apply a discount. Or should we apply a discount based on the total price of the cart instead of the number of items in it? Someone buying only one really expensive item should get a discount too?
Is 10% the only available type of discount? Do we have, for example, fixed-amount discounts? Do we have more complex discount strategies?
When the business variables are visible, you can play around with them and figure out things you may have forgotten, or think about new interesting (or not) features.
As a general rule, I'd hide whatever the reader does not need to know in a scenario; in this case, the number of items and the applied discount value really do matter to the reader.
