The Measure of Integrity of Software of Students

A lightweight tool for comparing similarities between projects

2022-02-28

Table of contents

The Solution
What it does
Interpreting Results
How to Interpret
The common objections
What if the computer makes a mistake and accuses a student falsely?
What if instructor bias leads to a mistake, and falsely accuses a student?
Routine for Use
Conclusion
Future Development
Other Uses
Code Refactoring
Distributed Open Source Analysis


This article describes the use of a browser-based tool for comparing and visualising student assignments. The tool, M.I.S.S, is a tribute to the classic tool MOSS (Measure Of Software Similarity), but respects organisational privacy and legal constraints by operating completely locally.

In a previous article, the use of the GPU to perform the comparisons was discussed.

After twenty years as an IT professional, having been a software developer for a diverse set of industries, having built systems that solve problems for some of the largest organisations on the planet … I decided to take up teaching.

I had spent a lot of time working with newly hired developers to give them the basics of working within various corporate environments and had been disappointed by the lack of interest in problem-solving. It took a lot of work to get a recent graduate to think about how their actions affected others, and how their individual decisions resulted in effects that had consequences beyond their immediate deadline.

Too often, new hires would race through a solution, submit it, and (proudly) declare that it was complete; a cursory inspection would demonstrate large failings. This was common among all new hires. My goal in teaching was to head this problem off. Graduates were coming out of school, wildly unprepared for the field, and I was going to change that.

I was going to increase the quality of software developers available to organisations, not the quantity of developers.

I wanted to reach out to students and hopefully guide them to learning how to problem solve, not just type code.

Toward the end of my first semester, I noticed two significant problems with this:

Some students are just not inherently curious about programming
Before signing off on the quality of students, I had to first identify which was their work

Students work together, and collaboration is good, but under the pressure to perform, some people are tempted to cheat. This offends me in different ways depending on the motivation behind it:

I dislike bullies. When a cool kid tries to bribe or bully a socially awkward student into doing his homework, I get angry.
I dislike fraudsters. In some cases, students came in with no interest in the work and paid professionals to do their homework.
I love people learning. Under pressure, a good person may take a shortcut, but, if someone is hiding a problem, I can't help them with it.

An Aside

This is not a moral judgement; it's a factual one. I make a bad MMA fighter because I'm too delicate, and that's OK … unless I try to take up fighting.

One problem I noticed with administration was a strong desire to not have students fail. If a student could not perform an activity, excuses were made because nobody wants to hurt anyone's feelings. This resulted in students being put in positions they were not ready for, which led to my original observation: graduates putting customers at risk.

Remember my motive: Integrity is important because judging individuals on their ability to do the work is a safety issue, and a certification trust issue.

Compounding this problem was a concern that other faculty and administration did not want to find the problem. It is a hard problem to deal with, charged with emotions, subjective reasoning, academic courts, and formality. Life is so much easier if we just ignore it. I heard on more than one occasion “Industry will straighten them out”. To me, this makes the certification advertised by the facility valueless (see “I Was Shocked To Catch A Candidate Cheating In An Online Interview”).

So, to summarise the problem: I'm teaching JavaScript programming, visualisation, and data analysis, have just left a position where distributed computing was my bread and butter, and am faced with a problem. Add to that, the administration forbade the transmission of student assignments outside of the facility and ominously reminded me that the data could not leave the country.

So I wrote my own solution.

Completely in the browser.

No servers involved.

There was no need for permission to install the software. There were no more concerns for legal or privacy constraints. No questionable licenses. No fees. Just a simple tool to find out who needed my help.

The Solution

M.I.S.S is a piece of software that runs completely in the browser. It can be found hosted at https://gitlab.com/jefferey-cave/miss

Before you begin with the software, it is worth having a set of files you want to compare. If you work with a submission tool like BrightSpace, you are in luck: M.I.S.S accepts and interprets the zip file you download from your submissions folder. It can be uploaded exactly as you downloaded it. Alternatively, you can download the sample used for testing.

The first page you see when you get to the interface

On first connection, the tool opens on an “FAQ” page. The intent is to answer all the questions you may have right away. The most important thing you can do at this point is to upload a collection of student assignments for comparison (stored as a zip). In the background, the tool will unpack the zip and extract each file. Each root folder is considered to represent one student.

The points represent student submissions. An orange line shows a comparison in progress, while a grey line shows a completed comparison.
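The "one root folder per student" rule can be sketched in a few lines. This is only an illustration of the idea, not the tool's actual code: the function name `groupByStudent`, the sample paths, and the choice to ignore files sitting at the root of the archive are all assumptions of mine.

```javascript
// Hypothetical helper: groups the file paths of an unpacked archive
// by their root folder, treating each root folder as one student.
function groupByStudent(paths) {
  const students = new Map();
  for (const path of paths) {
    const parts = path.split('/').filter(Boolean);
    // Assumption: a file at the archive root belongs to no student.
    if (parts.length < 2) continue;
    const [student, ...rest] = parts;
    if (!students.has(student)) students.set(student, []);
    students.get(student).push(rest.join('/'));
  }
  return students;
}

// Made-up sample listing, for illustration only.
const submissions = groupByStudent([
  'alice/index.js',
  'alice/lib/util.js',
  'bob/index.js',
  'README.txt',
]);
```

Here `submissions` ends up with two entries, one for `alice` (two files) and one for `bob` (one file).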

Once uploaded, you should be able to navigate to the “force” tab to see that progress is being made. The graphics update in real-time, giving a sense of progress, so go get a coffee and wait for the calculations to complete.

What it does

Now that the sample is loaded and processing has begun, it's time to ask what it is actually doing.

The process for measuring the level of similarity between two programs is not just a straight textual comparison. Rather, M.I.S.S goes through several phases to try to be as accurate as possible.

1. Compilation/Normalization

The first part of the process is to run the code through an appropriate interpreter depending on the language. When most people initially think of code comparison, they consider text comparisons: letter by letter. However, by using a language-specific interpreter, we can change the text into a series of tokens:

instead of f, o, r, the word for is recognised as a single thing (and assigned a number)
instead of f, o, r, w, a, r, d, the word is recognised as a variable (and assigned a number)

This prevents the shared first three letters from being counted as a partial match. Instead, token1 is compared to token2 as a single comparison between two things, not as three similarities and four differences.

This allows us to recognise small changes that may have a large visual impact (like variable name changes).
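To make the idea concrete, here is a minimal sketch of such a normalisation step. It is not the parser M.I.S.S uses: the keyword table, the numeric ids, and the regular expression are all assumptions chosen for illustration, and a real tokenizer would also handle strings, comments, and multi-character operators.

```javascript
// Hypothetical token ids: a few keywords get their own number,
// every identifier shares one number, every numeric literal another.
const KEYWORDS = new Map([['for', 1], ['function', 2], ['return', 3], ['if', 4]]);
const IDENTIFIER = 100;
const NUMBER = 101;

function tokenize(source) {
  const tokens = [];
  // Words, runs of digits, or any single non-space character.
  const pattern = /[A-Za-z_$][\w$]*|\d+|\S/g;
  for (const [lexeme] of source.matchAll(pattern)) {
    if (KEYWORDS.has(lexeme)) tokens.push(KEYWORDS.get(lexeme));
    else if (/^[A-Za-z_$]/.test(lexeme)) tokens.push(IDENTIFIER);
    else if (/^\d/.test(lexeme)) tokens.push(NUMBER);
    else tokens.push(1000 + lexeme.charCodeAt(0)); // one id per punctuation symbol
  }
  return tokens;
}

// Renaming a variable leaves the token stream unchanged.
const original = tokenize('for (i = 0;');
const renamed = tokenize('for (forward = 0;');
```

In this sketch `original` and `renamed` are identical arrays, while `for` and `forward` on their own map to different ids (a keyword and an identifier).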

2. Full-Text Comparison

The text is compared at a full-textual level. This is important for ensuring that order does not matter:

function FunA(){...}
function FunB(){...}

is seen as the same code as

function FunB(){...}
function FunA(){...}

Please see my previous article on how the Smith-Waterman algorithm was implemented to achieve this.

This allows us to recognise small changes that may have a large visual impact (like re-ordering functions).
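For readers who have not met Smith-Waterman before, the following is a textbook sketch of its scoring step over two arrays of token ids. It is not the GPU implementation from the previous article; the function name and the scoring values (match +2, mismatch -1, gap -1) are assumptions picked for illustration.

```javascript
// Textbook Smith-Waterman local alignment score (hypothetical scoring values).
// Returns the score of the best-matching region shared by sequences a and b.
function smithWaterman(a, b, match = 2, mismatch = -1, gap = -1) {
  let prev = new Array(b.length + 1).fill(0);
  let best = 0;
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      const diag = prev[j - 1] + (a[i - 1] === b[j - 1] ? match : mismatch);
      // Scores never drop below zero, so an alignment may start anywhere.
      curr[j] = Math.max(0, diag, prev[j] + gap, curr[j - 1] + gap);
      best = Math.max(best, curr[j]);
    }
    prev = curr;
  }
  return best;
}

// Two "functions" of three tokens each, submitted in opposite order.
const swapped = smithWaterman([1, 2, 3, 7, 8, 9], [7, 8, 9, 1, 2, 3]);
```

Because the alignment is local, a block that has been moved to a different position is still found: `swapped` equals the score of one fully matching three-token block. Accounting for every matching block, rather than only the best one, takes additional bookkeeping that this sketch leaves out.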

Interpreting Results

This program does a blind comparison. It does not have any bias about what it is comparing; it just compares. Unfortunately, all creative work will have common elements to it. Good solutions will be independently discovered. This means that it is not possible for this program to catch cheaters.

What this program can do is filter out the people that probably aren't directly copying one another's work: people that appear to be exercising their own skills in attempting to solve the problem.

Given a batch of 120 assignments to grade there are 7140 comparisons that would need to be performed. Multiply this by multiple batches assigned to different graders and the problem becomes untenable, resulting in faculty (faultily) relying on intuition.
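The figure comes from counting unordered pairs: each of the n submissions is compared against the n - 1 others, and each pair is counted once. A small sketch of the arithmetic (the function name `pairCount` is mine):

```javascript
// Number of distinct pairs among n submissions: n * (n - 1) / 2.
function pairCount(n) {
  return (n * (n - 1)) / 2;
}

const comparisons = pairCount(120); // 120 * 119 / 2 = 7140
```

The count grows quadratically, so doubling the class size roughly quadruples the work.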

Instead, it is probably necessary to review only 5 to 7 of the submissions, and M.I.S.S can be used to filter the list down to just those items by removing items that are obviously different.

How to Interpret

The goal of the visualisations is to reduce the workload of faculty. Once a batch of assignments has completed processing, interpreting the results is relatively straightforward.

The force-directed graph is more a curiosity than anything else (we were building them in my class). Clusters that appear are indicative of social groups within the class. This can be indicative of students that are studying together and making shared mistakes.

The first real step is for faculty to look at the Listing tab.

An example of a run. Most submissions have around a 5–10% similarity. However, highlighted in red, are two assignments that have about 50% similarity to one another. This is a major divergence from the norm, making them worth investigating. Be careful: the calculation is not complete, so we don't yet know what “normal” actually is.

The listing includes the comparison of every submission to every submission, with the amount of similarity indicated as a percentage. To reduce effort, items are sorted from most similar to least. The ones at the bottom can probably be ignored. Results are also colour coded to represent not compared (grey), normal (blue), and suspicious (red). If you see grey, the run is not complete; go get more coffee.

“Suspicious” is defined as being abnormally similar compared to the rest of the group. This is calculated by sorting the items in order of similarity, calculating the difference between each step, and taking the largest change. Any value above the largest change is considered to be worth inspection.

For example,

14%         }
   }- 3     }
11%         }  Suspicious Range of values
   }- 1     }
10%         }
   }- 4   <-- Greatest Change
 6% 
   }- 1
 5% 
   }- 0 
 5%

This technique was chosen to identify normal “sharing” within the group. It is assumed that the steps will represent students leaning on their peers for support, until it reaches a point where it is “more” than leaning. Even if the amount returns to normal after that point, the amount of sharing is still abnormally high. This technique is believed to offer better utility than a simple average, as it can identify a group of people where the minority of students are working independently.
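The largest-gap rule can be sketched as follows, using the percentages from the example above. The function name, the shape of the return value, and the tie-breaking choice (the first largest gap from the top wins) are my assumptions, not necessarily what M.I.S.S does.

```javascript
// Hypothetical sketch of the "largest change" cut-off.
// Takes similarity percentages and returns those above the biggest gap.
function suspiciousRange(similarities) {
  if (similarities.length < 2) return { threshold: null, suspicious: [] };
  const sorted = [...similarities].sort((x, y) => y - x); // most similar first
  let largestGap = -Infinity;
  let cut = 0;
  for (let i = 0; i < sorted.length - 1; i++) {
    const gap = sorted[i] - sorted[i + 1];
    // Assumption: on a tie, keep the gap nearest the top of the list.
    if (gap > largestGap) {
      largestGap = gap;
      cut = i;
    }
  }
  return { threshold: sorted[cut], suspicious: sorted.slice(0, cut + 1) };
}

const result = suspiciousRange([5, 14, 6, 10, 5, 11]);
```

For these values the gaps are 3, 1, 4, 1, and 0, so the cut falls between 10% and 6%: `result.threshold` is 10 and `result.suspicious` is `[14, 11, 10]`.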

Clicking on any of these comparisons will take you to the Comparison tab.

The comparison tab allows you to view the actual assignments side by side with colour coded sections of similarity. Don't miss the “maximize” button in the top right corner.

Here you can view the actual software, side by side. Similar blocks of code are colour coded to allow for inspection. Clicking on the maximise button in the top right corner is probably useful. Single-clicking on the percentage indicators will jump to the block of code; double-clicking brings both pieces of code into alignment.

Hopefully, your inspection turns up nothing of interest at all.

Action buttons that are useful: report a bug, print the results, delete the current data, upload a result file, or download a result file.

As this runs fully in the browser, it is also important that you save a copy of the comparisons. This can be done by clicking on the download button, which downloads a copy of the current comparison to keep for later. This is useful as a backup, but also in case you need to stop a run and resume it later (perhaps with the addition of a late submission). The downloaded file stores all the data needed to resume comparisons at a later date.

Give your file a meaningful name before downloading it. Then delete the current data and start on another class's assignments.

Having said that, the point of this software is to be independent of the tool. Regardless of your findings, you should always print the results for long-term archiving. The printed copy can act as a permanent record of your findings and will exist as simple HTML (or PDF) should you need to access the results even decades later, when you have forgotten about M.I.S.S.

Always remember to maintain consistent evidence of your consistent decisions.

The common objections

Most objections regarding the results amount to straw-man attacks on the process, and focus on not challenging student integrity.

What if the computer makes a mistake and accuses a student falsely?

Used properly, this tool simply identifies works that bear closer investigation. It does not give any information about the nature or motives of the similarity. That is left to the investigator to determine from the context.

In fact, let me reverse the objection:

What if instructor bias leads to a mistake, and falsely accuses a student?

In its first use, this tool actually indicated that someone suspected of cheating was, in fact, innocent.

At the time, I was exhausted. It was the end of the semester, and students had been given a second (and a third, also a fourth and fifth) chance to submit assignments. Administrators, counsellors, and fellow faculty bullied me into allowing multiple late submissions from some key students. Naturally, struggling to keep up with grading and being pulled into multiple meetings regarding these students did not put me in a pleasant mood. Then a particular submitted assignment looked suspiciously similar to a unique solution I had been particularly impressed with (submitted by a particularly strong student).

Being ill-tempered, and in an effort to stave off the next set of objections, I decided to gather evidence from an unbiased computational resource, and found…

The level of similarity between the two students was normal for the class

The student had a couple of variable names that were similar (probably due to tutoring), but beyond that, the style was very different. My personal emotions and experiences had led to bias because... well because I'm human.

The computational solution was unbiased and treated the student fairly and without emotion, even when I did not.

It also (thanks to its brute-force capabilities) identified two students who had copied one another's work exactly with only variable name changes. They had been missed purely due to being lost in the mass of papers needing grading.

Routine for Use

I strongly recommend reading a paper on the use of MOSS in the classroom: “Experience Using "MOSS" to Detect Cheating On Programming Assignments”. This details the routines and experience of faculty at the University of South Florida.

Objections to the use of tools like this focus on accusations of vindictiveness by the user. However, for faculty who do choose to use a tool like this, the goal is to avoid problems, not to cause them. After having developed the tool, I began to develop and document a routine for use:

run it against every batch of assignments consistently
be open with students that you are running it
pursue only the most egregious case each semester

I would normally start the first semester with the first assignment and run all the students' submissions through the tool, in front of the students. The visualisations caught their eye, and they could see clusters starting to form. Naturally, I was careful to anonymize student IDs, but with the assignment being so introductory, the amount of similarity was incredibly high anyway. No student is singled out, but the demonstration serves as a warning that you are watching, and as an opportunity to discuss the difference between copying and collaborating. It is also a great introduction to how the skills they are learning can be applied to solve personal work problems.

While the tool reduces the effort of evidence gathering, it does not eliminate it. There is a large bureaucratic burden in even discussing an academic concern. Personally, I felt that a balance had to be struck between pursuing the issue and spending time with my family. My conclusion was to pursue only one case per semester: pick the worst, chase it down. I also tended to do this in the middle of the semester:

To informally warn students when things are new and busy
To leave myself time to pursue the requisite paperwork
To give the students the opportunity to correct their behaviour

Part of the reason I felt it was important to be consistent in filing and pursuing the issue was my first experience with the formal process. As I began to pursue a case, other faculty came forward with their own anecdotal cases involving the exact same group of students. These students had multiple incidents that had been dealt with individually; however, the pattern had not been identified across space-time.

Another reason for consistency is simple bias management: are you letting a student slide due to favouritism? Anytime you actively withhold consequences based on discretion, you are actively applying punitive measures to those you choose to pursue. Are you therefore applying punitive measures without bias? If you consistently act based on the evidence, you do not have an opportunity to express unconscious bias, and therefore do not need to question your own integrity.

Conclusion

While I no longer teach programming, I do think that academic integrity is important. Colleges and Universities are certifying that students are knowledgeable in a domain of expertise, and failure to keep that process honest can have dire consequences.


Future Development

I would love to continue work on M.I.S.S, but without more data (more assignments) it is difficult to comprehend the needs of an audience.

If you find the tool useful, I would love to hear from you.

Suggest a feature (or up-vote one)
Submit a patch
Just tell me about your experience (good or bad) in the comments

Knowing how (or even “that”) the system is used would help drive development forward.

Right away, I would be very interested in working on adding cluster indicators to the force-directed graph. I think it would be useful to have a bubble drawn around the grouped suspicious items. The difference analysis already identifies which items these are, but in a case where you have more than one cluster, it would be interesting to see them separated out.

Other Uses

There are several future capabilities for a tool such as this. While I am interested in exploring some of these, I'm interested in a lot of things and have had to make some hard choices.

Code Refactoring

One interesting capability is in the refactoring of code. While this tool compares different applications to one another, this same technique could be used to compare a software application to itself. This self-comparison could aid in software refactoring, identifying parts of code where copy/paste has resulted in duplication, or even where two developers have independently implemented the same idea. These parts that are similar, and repeated, would likely represent a refactoring opportunity.

Distributed Open Source Analysis

Another interesting idea would be to extend this to larger-scale projects. Rather than taking student software as submissions, scan GitHub or GitLab for content duplication. Aside from forks, who is copying one another's code? This would require a server and database to allow for checkouts of code and to store the comparisons in a longer-term repository. This project was developed shortly after a distributed project I had worked on and was constructed with distributed computing in mind, so it should scale to distributed environments easily.

If someone has some investment money they would like to throw at these ideas, I would be thrilled to chase them down.

M.I.S.S.
A measure of software similarity completely contained in the browser.
jefferey-cave.gitlab.io