Running a crowd-sourced experiment on using LLMs for testing

Vipul Kocher

Published Sep 30, 2023

[1] Since this is an ongoing experiment, I have decided to keep updating the article. The process I am going to use is to mark the text added in the first revision as [1] ... [/1], in the second revision as [2] ... [/2] etc. so that people can understand what has changed. I will also link the comments etc. inline wherever possible and required. The idea is to provide widely relevant information in the article and more specific answers in the comments. I will also use the Dividers provided in the formatting to delineate new paragraphs.

Several comments raised the following points. (a) The tests are very generic: please understand that I gave Bard just one screenshot as input and asked it to provide test cases on that basis, so of course they would be generic. I am hoping that the wisdom of crowds will help provide specifics so that the LLM can give me specific tests. I personally am more interested in the test ideas than in specific test data. (b) LLMs will not be able to replace thinking testers: answer below. (c) A thought process focused on test cases and scripts is less useful: answer below.

I have to say the following: the title of my post is about an experiment, a crowd-sourced one. It is not about whether LLMs are better than testing experts. Further, I wrote clearly in the original post that the objectives of the experiment are -

"I have never tested a banking application. I wanted to know what kind of tests might be required for testing such an application. If I am totally unaware of the domain, I don't have much to go by in terms of the requirements, will I be able to take advantage of these LLMs to

Design meaningful test cases
Design them quickly
Iteratively improve my tests and my own understanding of a banking application
Use multiple LLMs to cross-check their respective results and gain more insights and knowledge about the domain"

Which means that I am clearly looking at using LLMs to help me design tests. While I didn't specifically state that I would also evaluate the capability of LLMs to generate a good set of tests, let us add that to my objectives too.

I ran a workshop for some testers in Denmark on testing AI applications and on using AI to help with testing. One of them worked for a pension department and another for a bank. An informal discussion with them revealed that they were happy with the Bard-generated tests I had documented, and that the pension-related tests were also impressive (Bard was able to give good tests for the Danish pension system in such a short time). So they concluded that it was a useful tool. That doesn't prove anything, but it provides encouragement to continue the experiment. I am looking for core-banking testers who can evaluate these tests and suggest prompts that can take the test design to a deeper level.

[/1]

LLMs/ChatGPT: food isn't flavorful without mentioning them at least a few times. Debates have been raging, and articles, from inane to insightful, have been written. Jason Arbon and Rahul Verma have exchanged comments; James Bach and Michael Bolton have weighed in with their articles. I tried staying away from the discussion as I am more involved with the testing of AI models than with using AI for testing. However, for reasons best not explained here, I had to start doing some experiments, and what better way than to publish what I am doing, what I have done, and how the collective wisdom of the crowd might help everyone gain more insights.

Before I begin, I need to state my position on the matter. I run a startup that has developed a tool, or toolchain, for AI governance, testing of AI models, and generally improving model quality using generative AI. So I am not totally unaware of the technology behind ML and LLMs, or of concepts like explainability. As I said, we use generative AI in our product. Totally sold on AI/ML, I know (I believe I know) that

AI will be able to do in the near future what a very large percentage of testers do today for test design and execution. In fact, I predicted this way back in 2008/2009.
It is now just a matter of a little time and a little work on training the models with the right set of data for it to be able to do a good job of test idea generation. Will it surpass geniuses? Possibly not. Not at least in the next 5 years but I believe that too will happen sooner or later.

I have never tested a banking application. I wanted to know what kind of tests might be required for testing such an application. If I am totally unaware of the domain, I don't have much to go by in terms of the requirements, will I be able to take advantage of these LLMs to

Design meaningful test cases
Design them quickly
Iteratively improve my tests and my own understanding of a banking application
Use multiple LLMs to cross-check their respective results and gain more insights and knowledge about the domain
[1] Evaluate the ability of LLMs to provide a good set of tests and, if they can, figure out a process for arriving at good tests [/1]

I had no access to any banking application, and I didn't want to burden my friends with questions about the features of a core-banking application and the kinds of tests that should be done. So I searched the net for a banking requirements document that I could give to a text-based LLM like ChatGPT to design tests. Another possibility was to use an LLM that also ingests images and give it a screenshot of a banking application. The latter was easier than the former, and Google Bard seemed a good choice for starting my experiment. So I fed the screenshot presented at the top of this article to Bard and asked it to generate tests. I will now present the conversation I had with Bard.

Please note that my instructions to Bard are shown as -

Instructions to Bard

My own comments or thoughts appear in bold as -

My own comments or thoughts

and Bard's answers appear as normal text as shown below -

Bard's answers

_____________________________________________________________________

To Bard - Write test cases for this banking software screen. The software is finacle from infosys.

Here are some test cases for the Finacle banking software screen from your image:

Functional testing:

Verify that the user can successfully log in to the system.
Verify that the user can access the Interest Adjustment Register Maintenance (IARM) screen.
Verify that the user can create a new IARM entry with valid data.
Verify that the user can update an existing IARM entry with valid data.
Verify that the user can delete an existing IARM entry.
Verify that the user cannot create or update an IARM entry with invalid data.
Verify that the user cannot delete an IARM entry that is associated with other transactions.

Negative testing:

Try to log in to the system with invalid credentials.
Try to access the IARM screen without logging in.
Try to create a new IARM entry with invalid data.
Try to update an existing IARM entry with invalid data.
Try to delete an IARM entry that is associated with other transactions.

Additional test cases:

Verify that the user can view the history of IARM entries.
Verify that the user can export IARM entries to a file.
Verify that the user can search for IARM entries by various criteria.
Verify that the system generates accurate error messages when the user makes mistakes.
Verify that the system performs well under load conditions.

UI/UX testing:

[I have removed these generic tests from here to save some space]

These are just a few examples of test cases that can be written for the Finacle banking software screen from your image. The specific test cases that need to be executed will depend on the specific features and functionality of the software.

I liked the fact that Bard could figure out the main purpose of the screen and give me some generic test cases and some specific test cases. I was more interested in "more" banking-related tests. Since I know little about banking software, I asked Bard the following:

Give detailed test cases for Interest Adjustment Register Maintenance

Here are some detailed test cases for Interest Adjustment Register Maintenance (IARM):

Create a new IARM entry:

Verify that the user can access the IARM screen.
Verify that the user can click the "New" button to create a new IARM entry.
Verify that the user can enter all the required fields in the new IARM entry dialog box.
Verify that the user can click the "Submit" button to save the new IARM entry.
Verify that the system displays a confirmation message indicating that the new IARM entry was saved successfully.

Update an existing IARM entry:

Verify that the user can access the IARM screen.
Verify that the user can double-click on an existing IARM entry to open it in edit mode.
Verify that the user can update any of the fields in the IARM entry dialog box.
Verify that the user can click the "Submit" button to save the updated IARM entry.
Verify that the system displays a confirmation message indicating that the updated IARM entry was saved successfully.

Delete an existing IARM entry:

Verify that the user can access the IARM screen.
Verify that the user can select an existing IARM entry to delete.
Verify that the user can click the "Delete" button to delete the selected IARM entry.
Verify that the system displays a confirmation message indicating that the IARM entry was deleted successfully.

Negative testing:

Try to create a new IARM entry with invalid data.
Try to update an existing IARM entry with invalid data.
Try to delete an IARM entry that is associated with other transactions.

Additional test cases:

Verify that the user can view the history of IARM entries.
Verify that the user can export IARM entries to a file.
Verify that the user can search for IARM entries by various criteria.
Verify that the system generates accurate error messages when the user makes mistakes.
Verify that the system performs well under load conditions.

Here are some additional test cases that can be written for IARM:

Verify that the system calculates the interest adjustment amount correctly.
Verify that the system can handle complex interest adjustments involving multiple accounts and currencies.
Verify that the system can integrate with other banking systems, such as the core banking system and the general ledger system.
Verify that the system can generate reports on IARM activity.
Verify that the system can be audited to ensure that all IARM transactions are accurate and compliant.

These are just a few examples of test cases that can be written for IARM. The specific test cases that need to be executed will depend on the specific features and functionality of the IARM system.

My bad! I should not have said "detailed test cases", as the model simply gave details of the test cases it had already given me. It impressed me nonetheless, because Bard could figure out from the screenshot that a "New" button has to be used to create a new IARM entry (actually it is the "Add" button, but that's not really important here) and that "Submit" needs to be pressed to save the new entry! It added on its own that a confirmation message should be shown after a new entry is added successfully! Neat!

The detailed steps are reproduced below.

Verify that the user can access the IARM screen.
Verify that the user can click the "New" button to create a new IARM entry.
Verify that the user can enter all the required fields in the new IARM entry dialog box.
Verify that the user can click the "Submit" button to save the new IARM entry.
Verify that the system displays a confirmation message indicating that the new IARM entry was saved successfully.

I also liked the additional tests, which are specific to core banking; I would not have been able to write some of them so quickly. These additional tests are reproduced below -

Verify that the system calculates the interest adjustment amount correctly.
Verify that the system can handle complex interest adjustments involving multiple accounts and currencies.
Verify that the system can integrate with other banking systems, such as the core banking system and the general ledger system.
Verify that the system can generate reports on IARM activity.
Verify that the system can be audited to ensure that all IARM transactions are accurate and compliant.

Time to figure out how deep the rabbit hole is.

Interest Adjustment Register Maintenance - write some scenarios that you would like to test as a banking clerk who is doing user acceptance testing of the software

As a banking clerk doing user acceptance testing of the Interest Adjustment Register Maintenance (IARM) software, I would like to test the following scenarios:

Create a new IARM entry for a customer who has been overcharged interest.
Update an existing IARM entry to reflect changes in the customer's account status.
Delete an IARM entry that is no longer needed.
View the history of IARM entries for a particular customer or account.
Export IARM entries to a file for analysis.
Search for IARM entries by various criteria, such as customer name, account number, date, and amount.
Verify that the system calculates the interest adjustment amount correctly.
Verify that the system can handle complex interest adjustments involving multiple accounts and currencies.
Verify that the system can integrate with other banking systems, such as the core banking system and the general ledger system.
Verify that the system can generate reports on IARM activity.
Verify that the system can be audited to ensure that all IARM transactions are accurate and compliant.

In addition to these general scenarios, I would also like to test specific scenarios that are relevant to my bank's policies and procedures. For example, I would like to test the following scenarios:

Create an IARM entry for a customer who has been granted a waiver of interest charges due to financial hardship.
Update an IARM entry to reflect a change in the customer's interest rate.
Delete an IARM entry that is associated with a fraudulent transaction.
Search for IARM entries that exceed a certain amount.
Generate a report of all IARM entries that were created or updated in the past month.
Run an audit of all IARM entries for a particular customer or account.

By testing these scenarios, I can help to ensure that the IARM software is working properly and that it meets the needs of my bank and its customers.

I was not sure what IARM is, so I asked Bard for more details on the concept.

what is Interest Adjustment Register Maintenance entry?

Interest Adjustment Register Maintenance (IARM) entry is a record of a change made to the interest charged on a customer's account. IARM entries can be used to adjust interest for a variety of reasons, such as:

Correcting an error in the interest calculation
Providing a customer with a waiver of interest charges
Reflecting a change in the customer's interest rate
Reversing interest charges that were associated with a fraudulent transaction

IARM entries are important because they ensure that customers are charged the correct amount of interest on their accounts. IARM entries are also used to track and report on interest adjustments, which can help banks to identify and address potential problems.

IARM entries are typically maintained in a software system that is designed for this purpose. The IARM system typically allows users to create, update, and delete IARM entries, as well as to view the history of IARM entries for a particular customer or account.

Here are some examples of IARM entries:

An IARM entry to correct an overcharge of interest on a customer's account
An IARM entry to waive interest charges for a customer who is experiencing financial hardship
An IARM entry to reflect a change in a customer's interest rate due to a change in their credit rating
An IARM entry to reverse interest charges that were associated with a fraudulent transaction

IARM entries are an important part of the banking process and help to ensure that customers are charged the correct amount of interest on their accounts.

Can I get into details of each test now?

Write test cases for An IARM entry to correct an overcharge of interest on a customers account

Here are some test cases for an IARM entry to correct an overcharge of interest on a customer's account:

Positive test cases:

Verify that the IARM system allows the user to create a new IARM entry with the following information: Customer ID, Account number, Interest amount to be adjusted, Reason for adjustment (in this case, to correct an overcharge of interest)
Verify that the IARM system automatically calculates the correct interest adjustment amount.
Verify that the IARM system updates the customer's account balance to reflect the interest adjustment.
Verify that the IARM system creates a record of the interest adjustment in the IARM register.

Negative test cases:

Try to create a new IARM entry with invalid data (e.g., a negative interest amount).
Try to create a new IARM entry for a customer who does not exist.
Try to create a new IARM entry for an account that does not exist.
Try to create a new IARM entry with a reason for adjustment that is not valid.

Additional test cases:

[Duplicate tests so removed from here]

These are just a few examples of test cases that can be written for an IARM entry to correct an overcharge of interest on a customer's account. The specific test cases that need to be executed will depend on the specific features and functionality of the IARM system.
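To make the positive and negative cases above concrete, here is a minimal sketch of how they could be turned into executable checks. Everything in it is an assumption made for illustration: `IarmRegister`, `create_entry`, the reason codes and the `interest_due` mapping are hypothetical stand-ins, not Finacle's actual interface or rules.

```python
from dataclasses import dataclass, field
from decimal import Decimal

# Hypothetical reason codes; a real system would define its own list.
VALID_REASONS = {"OVERCHARGE_CORRECTION", "WAIVER", "RATE_CHANGE", "FRAUD_REVERSAL"}


@dataclass
class IarmRegister:
    """Toy in-memory stand-in for an IARM system (not a real banking API)."""
    interest_due: dict  # maps (customer_id, account_no) -> interest currently charged
    entries: list = field(default_factory=list)

    def create_entry(self, customer_id, account_no, amount, reason):
        key = (customer_id, account_no)
        # All validation happens before any state is changed.
        if key not in self.interest_due:
            raise ValueError("unknown customer or account")
        if amount <= 0:
            raise ValueError("adjustment amount must be positive")
        if reason not in VALID_REASONS:
            raise ValueError("invalid reason for adjustment")
        # Reduce the interest charged and record the adjustment in the register.
        self.interest_due[key] -= amount
        entry = {"customer_id": customer_id, "account_no": account_no,
                 "amount": amount, "reason": reason}
        self.entries.append(entry)
        return entry


def run_checks():
    reg = IarmRegister(interest_due={("C001", "A100"): Decimal("150.00")})

    # Positive case: the entry is recorded and the interest charged is corrected.
    entry = reg.create_entry("C001", "A100", Decimal("25.00"), "OVERCHARGE_CORRECTION")
    assert reg.interest_due[("C001", "A100")] == Decimal("125.00")
    assert reg.entries == [entry]

    # Negative cases: each invalid input must be rejected.
    bad_inputs = [
        ("C001", "A100", Decimal("-5.00"), "OVERCHARGE_CORRECTION"),  # negative amount
        ("C999", "A100", Decimal("25.00"), "OVERCHARGE_CORRECTION"),  # unknown customer
        ("C001", "A999", Decimal("25.00"), "OVERCHARGE_CORRECTION"),  # unknown account
        ("C001", "A100", Decimal("25.00"), "NO_SUCH_REASON"),         # invalid reason
    ]
    rejected = 0
    for args in bad_inputs:
        try:
            reg.create_entry(*args)
        except ValueError:
            rejected += 1
    assert rejected == len(bad_inputs)

    # Rejected attempts must leave the register and the interest charged untouched.
    assert len(reg.entries) == 1
    assert reg.interest_due[("C001", "A100")] == Decimal("125.00")
    return rejected


run_checks()
```

Writing the cases out this way also exposes what the generic list leaves open, for example whether an adjustment may exceed the interest actually charged. That is the kind of rule a domain expert would have to supply.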

I stopped at this point. I could have done the following:

Get some requirements specific to IARM overcharge, design some tests and compare the tests.
Ask Bard about various rules specific to IARM so that I can create some EQ/BV (equivalence partitioning and boundary value) based tests.
Ask Bard to give me test ideas based on tours, such as a user tour.
Ask for tests related to various types of testing - Localization, Internationalization, Database testing, Round tripping of data, Performance testing and so on...
Ask the same questions to ChatGPT, Bing etc. and compare the results.
Call up friends who are experts in this area and ask them about the quality of the tests provided by the AI and how these tests stack up against their own tests.
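As an illustration of the EQ/BV idea in the list above, the sketch below derives boundary-value test data once a rule is known. The rule itself (an adjustment amount must lie between 0.01 and 10000.00 inclusive) is invented for the example; the real limits would have to come from the requirements or from a domain expert.

```python
from decimal import Decimal


def boundary_values(low, high, step=Decimal("0.01")):
    """Return (value, expected_valid) pairs around the edges of the closed range [low, high]."""
    return [
        (low - step, False),   # just below the lower bound
        (low, True),           # on the lower bound
        (low + step, True),    # just above the lower bound
        (high - step, True),   # just below the upper bound
        (high, True),          # on the upper bound
        (high + step, False),  # just above the upper bound
    ]


def is_valid_adjustment(amount, low, high):
    """Hypothetical validation rule: the amount must lie within [low, high]."""
    return low <= amount <= high


# Assumed limits, for illustration only.
LOW, HIGH = Decimal("0.01"), Decimal("10000.00")

cases = boundary_values(LOW, HIGH)
for amount, expected in cases:
    assert is_valid_adjustment(amount, LOW, HIGH) is expected
```

The three equivalence classes here are below the range, inside the range, and above the range; the boundary values simply pick the members of each class that sit closest to an edge.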

It is because of this last item that I am calling this post an experiment. Maybe you can all contribute tests specific to IARM, and we can try to figure out whether we can coax similar tests from the AI models and what it takes to get those answers out. We can also call out the test ideas provided by the models that were really good. You can also suggest what else should be done, or what should be done differently!
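One rough way to check whether a model has produced tests similar to the contributed ones is to match each expert test idea against the model's list. The sketch below uses Python's standard `difflib.SequenceMatcher` for a purely lexical comparison. The function names and the 0.6 threshold are arbitrary assumptions, and two ideas worded very differently will not match even if they mean the same thing, so a human review would still be needed.

```python
from difflib import SequenceMatcher


def normalize(idea):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(idea.lower().split())


def best_match(idea, candidates):
    """Return (similarity, candidate) for the candidate most similar to the idea."""
    scored = [
        (SequenceMatcher(None, normalize(idea), normalize(c)).ratio(), c)
        for c in candidates
    ]
    return max(scored)


def coverage(expert_ideas, model_ideas, threshold=0.6):
    """Split expert ideas into those the model roughly covered and those it missed."""
    covered, missed = [], []
    for idea in expert_ideas:
        if not model_ideas:
            missed.append(idea)
            continue
        score, _ = best_match(idea, model_ideas)
        (covered if score >= threshold else missed).append(idea)
    return covered, missed


# Model ideas taken from Bard's answers above; the expert list is a placeholder
# to be filled in with tests contributed by domain experts.
model_ideas = [
    "Verify that the system calculates the interest adjustment amount correctly.",
    "Verify that the system can generate reports on IARM activity.",
]
expert_ideas = [
    "verify that the system calculates the interest adjustment amount correctly.",
]

covered, missed = coverage(expert_ideas, model_ideas)
```

The `missed` list is the interesting part: each item in it is a candidate for a follow-up prompt to see what it takes to get that test out of the model.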

आ नो भद्राः क्रतवो यन्तु विश्वतः (Aano bhadra krtavo yantu vishwatah) meaning "Let noble thoughts come to me from all directions"

I look forward to the inputs from all sources...

Michael Stahl

9mo

Vipul - the only downside I see to this article is that it, and the responses, are so interesting I am not getting any work done :-) . Thanks for sharing the experience. While you can argue back and forth on "will AI replace us", I think you clearly showed that spending a relatively short time interrogating an LLM about a test challenge will bring up some directions and ideas you may not have thought of. So until we have better data that AI can totally replace us at test design, I'd go with:
- First, do your own, unbiased thinking about what to test
- Then "talk" to LLM to get more ideas and re-affirm that what you have already thought of is valid
Which means that effectively you add the LLM as one of your best reviewers.

Michael Bolton

9mo

I've done what I predicted no one else would do: provide a detailed response to Vipul's experiment here. There are some remarks to contextualize my reply here: https://lnkd.in/gxYXhC9H And then there's my analysis here: https://lnkd.in/guspPZND Annoyingly, on my own feed, LinkedIn is showing that I've reposted with the text above. But on Vipul's post, my repost isn’t listed. So I'm trying again.

Leonard Haasbroek

Integration, Testing and Automation

9mo

Interesting discussion. The crux of the issue is probably centered around the perceived quality vector of the generated test cases. Test experts in the field are signaling be careful with how much value to attribute to these AI generated test cases. The ability to pass the Turing Test is not the bar here.

Sanjay Sharma

owner

9mo

You are a torch bearer for whole of the testing community. You always lead for something new always.

Brad Thompson

9mo

One of the only valuable outcomes of creating test cases is the thinking. If we remove this why as a human would I even read them? I’m not saying AI generated ones would be bad but at that point maybe just bang on the software and report back to me. Only give me cases in which you found an issue. Idk interesting stuff
