Sure, I can provide further details but it’s pointless.
- There are 5 types of row
- 3 of them are the types ‘T’, ‘F’ and ‘C’
- One of the tasks is to find all sequences of 3 rows which are a ‘T’ followed by a ‘F’ and then a ‘C’
- For such rows, there are approx 10 other columns where we need to check for consistency, meaning the values should match
- Except for the ‘T’ row, where not all data is present so not all of the columns are a match
- Additionally, some columns are never a match
- For one column, there are three values ‘A’, ‘B’ and ‘N’. For the ‘T’ row, the value should be ‘A’ if ‘F’ and ‘C’ rows are ‘B’, it should be ‘B’ if ‘F’ and ‘C’ rows are ‘A’, and if ‘T’ is ‘N’ then there’s some other special logic (I forget exactly what) which has to be run
Hopefully it is now obvious why I didn’t provide these details. It’s just a set of arbitrary rules: complex, but not complicated.
This data comes out of the back of someone else’s order matching engine, so it’s really just a log of operations which needs special filtering logic to extract important features.
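To make the rules listed above a bit more concrete, here is a minimal row-by-row sketch in Python. Everything named in it is made up for illustration and is not the real schema: the column names (`type`, `order_id`, `price`, `qty`, `side`), the extra row type ‘X’, and the choice of which columns the ‘T’ row carries. The ‘N’ case is only flagged, since the special logic for it isn't spelled out here.

```python
# Hypothetical columns; the real log has roughly 10 of these.
MATCH_FC = ("order_id", "price", "qty")  # must agree between the F and C rows
MATCH_T = ("order_id",)                  # subset the T row also carries (assumption)
FLIP = {"A": "B", "B": "A"}              # T must hold the opposite of F and C


def check_triple(t, f, c):
    """Return the names of the columns that break a rule for one T/F/C triple."""
    problems = []
    for col in MATCH_FC:
        values = {f[col], c[col]}
        if col in MATCH_T:
            values.add(t[col])
        if len(values) > 1:
            problems.append(col)
    if t["side"] == "N":
        # The special 'N' logic is not specified, so it is only flagged.
        problems.append("side: N needs special handling")
    elif f["side"] != c["side"] or FLIP.get(f["side"]) != t["side"]:
        problems.append("side")
    return problems


def find_tfc(rows):
    """Yield (start_index, problems) for every T, F, C run of three rows."""
    for i in range(len(rows) - 2):
        t, f, c = rows[i:i + 3]
        if (t["type"], f["type"], c["type"]) == ("T", "F", "C"):
            yield i, check_triple(t, f, c)


rows = [
    {"type": "X", "order_id": 0, "price": None, "qty": None, "side": "N"},
    {"type": "T", "order_id": 1, "price": None, "qty": None, "side": "A"},
    {"type": "F", "order_id": 1, "price": 10.0, "qty": 5, "side": "B"},
    {"type": "C", "order_id": 1, "price": 10.0, "qty": 5, "side": "B"},
    {"type": "T", "order_id": 2, "price": None, "qty": None, "side": "A"},
    {"type": "F", "order_id": 2, "price": 11.0, "qty": 5, "side": "A"},
    {"type": "C", "order_id": 2, "price": 11.5, "qty": 5, "side": "A"},
]
results = list(find_tfc(rows))
```

With these sample rows, the first triple passes and the second is reported for `price` and `side`. Doing it row by row like this is slow, but each rule lives in one obvious place, so it can be stepped through in a debugger.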
It is possible to process this data in a vectorized way. I implemented that first. It’s faster, but I abandoned it because it’s hard to debug, and hard to tell whether it’s doing the correct thing. It also requires creating literally dozens of pseudo columns to hold intermediate results, and then naming all of those new columns becomes a problem in itself. The names matter because the algorithms that operate on the columns need to refer to them in a systematic way. This kind of thing is also hard to write tests for, because the number of possible combinations is huge.
I must admit - in terms of testing, this is a weird situation. I essentially have the output from some system. I am processing this data to get the input. If I feed the input into a similar system, I should get the same output. That’s my testing.
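That round-trip idea can be sketched as follows. This is only a toy under heavy assumptions: `toy_engine` and `recover_orders` are hypothetical stand-ins for the real matching engine and the real filtering logic, and the row layout is invented.

```python
def toy_engine(orders):
    """Stand-in for the matching engine: emits one log row per order."""
    return [
        {"type": "F", "order_id": o["order_id"], "qty": o["qty"]}
        for o in orders
    ]


def recover_orders(log):
    """Stand-in for the filtering logic: rebuilds the inputs from the log."""
    return [
        {"order_id": r["order_id"], "qty": r["qty"]}
        for r in log
        if r["type"] == "F"
    ]


def round_trip_ok(log, engine, recover):
    """True if replaying the recovered inputs reproduces the observed log."""
    return engine(recover(log)) == log


observed_log = toy_engine([{"order_id": 1, "qty": 5}, {"order_id": 2, "qty": 3}])
assert round_trip_ok(observed_log, toy_engine, recover_orders)
```

The check only shows that the recovered input is consistent with the log. It doesn't say which individual rule is wrong when the comparison fails.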