Programming Best Practices

Monday, October 23, 2006

SQL Detectives

SQL Detectives
Data Centre Times, Oct 23, 2006
Special Correspondent, Yours Truly, Reporting from the Scene


This episode attempts to trace the track of a mysterious Data Leak that managed to escape the sophisticated dragnet prepared by SD and MP. Special detective HK of TRAIT was drafted to assist SD in recapturing the felon at large. A first-person account of the thrilling chase is provided by our friendly neighbourhood HK.

[Over to HK].

I was steamed out from back-breaking work from the HP-UX terminal when I was approached with this alarming problem. Realising the important nature of the situation, and the untold misery it was causing my colleague, I started out with the following queries:~

SQL> SELECT tname
FROM tab
WHERE tname LIKE '%CHEQUE%';

SQL> SELECT tname
FROM tab
WHERE tname LIKE '%LOG%';

And I got the names of the tables I wanted:~

MM_FT_TA_CHEQUE_TAID /* dummy table */
LOG_TEST2 /* history table */

SQL> DESC LOG_TEST2
Name Null? Type
--------------------- -------- -----------------
USER_NAME NOT NULL VARCHAR2(20)
ACTIVITY NOT NULL VARCHAR2(1000)
DATE_TIME NOT NULL DATE
IP_ADDRESS VARCHAR2(20)
ACTUAL_NAME VARCHAR2(50)

Okay, I'm in business. I suspect that you have already updated the master tables, but I do a sanity check (I have to find the master tables). So I play around, and find a likely candidate.

SQL> SELECT COUNT (*), COUNT (cheque_dt), COUNT (entry_date), COUNT (dd_dt)
FROM mm_ft_ta_cheque;

COUNT(*) COUNT(CHEQUE_DT) COUNT(ENTRY_DATE) COUNT(DD_DT)
---------- ---------------- ----------------- ------------
1101 829 865 0


Okay, we have some deficiencies. RV said that the entry_date was made mandatory only recently. OK. We still need to find out the other tables and the other stats.

I recall a value of 157 (as the result of some count) on SD's SQL*Plus session. So I do this:~

SQL> SELECT COUNT (*)
FROM mm_ft_ta_cheque_taid
WHERE taapp_id IN (SELECT taapp_id
FROM mm_ft_ta_cheque
WHERE entry_date IS NULL);

And I get:

COUNT(*)
----------
157

Voila! So this is what SD's been trying to do:~
  1. Some of the entries in the master table (mm_ft_ta_cheque) do not have entry_date.
  2. All the transactions are logged in the history table, which has the date_time value logged as well. It has all the necessary fields.
  3. The problem becomes one of identifying the records with null entry_dates (in mm_ft_ta_cheque), and getting matching date_time values from the history table (log_test2 - unusual name for a log table)
  4. The solution is now trivial: you only have to count the position and issue the appropriate SUBSTR command to extract the embedded taapp_id from activity.
  5. Even then RV asks: "Why do you need a GROUP BY?" The reason became self-evident in a matter of seconds.
  6. If some values are missng then it is clear that no contraints are placed against that field (and none ever could).
  7. So, if some values are missing, some values (records) can be duplicated as well.And indeed, that's what we find.

SQL>SELECT DISTINCT SUBSTR (activity, 55, 17), COUNT (*)
FROM log_test2
GROUP BY SUBSTR (activity, 55, 17)
HAVING COUNT (*) > 1;

SUBSTR(ACTIVITY,5 COUNT(*)
----------------- ----------
TAKRKMYR200600029 4
TAKRKMYR200600037 8
TAKRKMYR200600159 3
TAKRKMYR200600164 2
TAKRKMYR200600172 2
TAKRLCNN200600119 2
TAKRLCNN200600121 2
TAKRLCNN200600401 2
TAKRLCNN200600412 2
TAKRLCNN200600499 2
TAKRLCNN200600669 2
TAKRLCNN200600759 2
TAKRLCNN200600895 2

13 rows selected.

Hmm...duplication. We now need to check if it is duplication or constraint violation (multiple entries with different values). So we finally have the Query to check it:~

SQL>SELECT user_name, SUBSTR (activity, 1, 150), date_time, ip_address,
actual_name
FROM log_test2
WHERE SUBSTR (activity, 55, 17) IN
('TAKRKMYR200600029',
'TAKRKMYR200600037',
'TAKRKMYR200600159',
'TAKRKMYR200600164',
'TAKRKMYR200600172',
'TAKRLCNN200600119',
'TAKRLCNN200600121',
'TAKRLCNN200600401',
'TAKRLCNN200600412',
'TAKRLCNN200600499',
'TAKRLCNN200600669',
'TAKRLCNN200600759',
'TAKRLCNN200600895'
)
/

Rest assured: from the output we can see that the table has merely replicated values. (If you're not convinced, just cut and paste the Query given previously in SQL*Plus. It will produce a cluttered, but truncated, report from which you can verify that only duplication has taken place, nothing more serious. For want of space I am not including the screendump of the above Query. I have included a much cleaner report at the end, one that fully leverages the power of the STRING functions in PL/SQL):

It is merely repetition. But your ASP may not work correctly if multiple rows are returned.

What it needs is a good (unique) index on ACTIVITY. I think that's the only problem here.

The biggest problem, of course, is "hairy" SQL. Don't EVER embed SQL, especially UPDATEs, in table, no matter what you are going to do. Never store SQL on tables, it's terrible programming. At times it's deadly.

Of course, it's a trivial matter to remove duplication. But the problem is, programmers who are smart enough to embed DML in tables should be able to figure out the way themselves! We have been crying wolf over the programming practices at IT Cell, and it seems to be a lost cause, with the widespread adoption of trick programming (it is too easy to handle application complexity - something that arises from a bad design - by embedding hairy SQL in tables)

Here's the SQL you needed:~

SELECT SUBSTR (l.activity, 55, 17) log_taapp, l.date_time log_dt,
f.taapp_id f_taapp
FROM log_test2 l, mm_ft_ta_cheque_taid f
WHERE SUBSTR (l.activity, 55, 17) = f.taapp_id

And of course, you should have written the UPDATE SQL yourself, since I see that the table has been updated.

May I ask a few simple questions?
  1. From the data in the ACTIVITY field, I infer that it is simply a listing of the Querystring generated by ASP.
  2. I cannot make out how this could actually be useful. I can only assume that the table LOG_TEST2 might be used to roll back some of the changes made to the master (data) tables. I presume this is how the new applications at IT Cell make certain at what stage the ball is at the present moment. What a crude technique! Instead of setting a flag, use the Querystring! How symbolically foolish!
  3. If this is indeed the case, all I can say is that a database application should not have this functionality. Once something is done - whether it is in error is another question - there should not be provision to roll it back. This is not the same as advice note flow, where giving a connection might not be feasible and the outdoor section might have to return the advice note to the commercial officer. This is simply not the same case. I can understand it if Medallist wrote this logic, because he doesn't know the difference between a cat and a mouse.
  4. I'm assuming that the value in the ACTIVITY field is meant to do something meaningful. Even then, embedding DML in a table is dangerous. I've been crying myself hoarse that bulk updates are dangerous (ust run a bulk update at the SQL*Plus prompt - whether the data changed or not, it will say 'xx rows updated').
  5. Programming is not a trick. Like any other means of livelihood, it has its own dignity and rules. Don't think that there is an easy way to learn programming. There is no easy way.
  6. I think you have fundamentally demonstrated to yourself what actually went wrong at Eranakulam, and what is actually wrong with the Cash Management module: a database application that thinks it is a Web application, with a primary duty of smoothly displaying 'pages'. A lot of loose ends, incomplete transactions ("failed final submission" in Anil sir's terminology) means the application itself is "swaaha".
  7. Please do not subvert the meaning of "table". Table is used to hold data, not hairy SQL to be embedded in lame ASP. That is not the way things have been designed. Do not suppose that ASP programmers in different parts of the world are fools. There are some things you cannot do using ASP. NONE of the fancy things Medallist tried to do with SSANet 4 can be done in ASP. Unfortunately, there are no workarounds.
Still, if you use all the features of the language, and use the database the way it should be used, almost any problem can be solved. Believe me.

I do not believe that Manoj, Anil, and myself are lacking ideas or capacity. I think we're all perfectly capable of doing things competently. (RV will immediately go off, saying, "We need training... we need help..." and when she does that I feel awful) When it comes to designing an application and programming, I don't think we lack anything. Understand the reason why we (we=TRA Integrators) are silent. Understand why we should work in such a way as to add to our knowledge, our skills, and above all, help us further our careers as programmers. I think, for you and for me and for everyone at IT Cell, it's probably too late to change tack. And hence, try to reason for yourself, and understand why our silence carries a lot of meaning.

Back to Shalla Bal: The Final Report



SQL> COLUMN TAAPP_ID FORMAT A20
SQL> COLUMN OFF_CODE FORMAT A15
SQL> COLUMN CIR_CODE FORMAT A10
SQL> COLUMN UNIT_CODE FORMAT A12

SQL> SELECT RTRIM (SUBSTR (activity, 55, 17)) taapp_id,
RTRIM (SUBSTR (activity,
INSTR (activity, '#', -1, 6) + 1,
(INSTR (activity, '#', -1, 5) - 1
)
- (INSTR (activity, '#', -1, 6))
)
) off_code,
RTRIM (SUBSTR (activity,
INSTR (activity, '#', -1, 4) + 1,
(INSTR (activity, '#', -1, 3) - 1
)
- (INSTR (activity, '#', -1, 4))
)
) cir_code,
RTRIM (SUBSTR (activity,
INSTR (activity, '#', -1, 2) + 1,
(INSTR (activity, '#', -1, 1) - 1
)
- (INSTR (activity, '#', -1, 2))
)
) unit_code
FROM log_test2
WHERE SUBSTR (activity, 55, 17) IN
('TAKRKMYR200600029',
'TAKRKMYR200600037',
'TAKRKMYR200600159',
'TAKRKMYR200600164',
'TAKRKMYR200600172',
'TAKRLCNN200600119',
'TAKRLCNN200600121',
'TAKRLCNN200600401',
'TAKRLCNN200600412',
'TAKRLCNN200600499',
'TAKRLCNN200600669',
'TAKRLCNN200600759',
'TAKRLCNN200600895'
)
ORDER BY SUBSTR (activity, 45, 100)
/

And running this Query will give you the following Report: ~

TAAPP_ID OFF_CODE CIR_CODE UNIT_CODE
-------------------- --------------- ---------- ------------
TAKRKMYR200600029 SDEIDCNR KRK MYR
TAKRKMYR200600029 SDEIDCNR KRK MYR
TAKRKMYR200600029 SDEIDCNR KRK MYR
TAKRKMYR200600029 SDEIDCNR KRK MYR
TAKRKMYR200600037 SDENAJGROUPS KRK MYR
TAKRKMYR200600037 SDENAJGROUPS KRK MYR
TAKRKMYR200600037 SDENAJGROUPS KRK MYR
TAKRKMYR200600037 SDENAJGROUPS KRK MYR
TAKRKMYR200600037 SDENAJGROUPS KRK MYR
TAKRKMYR200600037 SDENAJGROUPS KRK MYR
TAKRKMYR200600037 SDENAJGROUPS KRK MYR
TAKRKMYR200600037 SDENAJGROUPS KRK MYR
TAKRKMYR200600159 SDENAJPHONES KRK MYR
TAKRKMYR200600159 SDENAJPHONES KRK MYR
TAKRKMYR200600159 SDENAJPHONES KRK MYR
TAKRKMYR200600164 SDESNGLR KRK MYR
TAKRKMYR200600164 SDESNGLR KRK MYR
TAKRKMYR200600172 SDEEXTLMR KRK MYR
TAKRKMYR200600172 SDEEXTLMR KRK MYR
TAKRLCNN200600119 SDEESSIIKNG KRL CNN
TAKRLCNN200600119 SDEESSIIKNG KRL CNN
TAKRLCNN200600121 SDENSCNN KRL CNN
TAKRLCNN200600121 SDENSCNN KRL CNN
TAKRLCNN200600401 SDEKOD KRL CNN
TAKRLCNN200600401 SDEKOD KRL CNN
TAKRLCNN200600412 AOTRICNN KRL CNN
TAKRLCNN200600412 AOTRICNN KRL CNN
TAKRLCNN200600499 SDEGPNR KRL CNN
TAKRLCNN200600499 SDEGPNR KRL CNN
TAKRLCNN200600669 SDOPKSZ KRL CNN
TAKRLCNN200600669 SDOPKSZ KRL CNN
TAKRLCNN200600759 SDEIIITMB KRL CNN
TAKRLCNN200600759 SDEIIITMB KRL CNN
TAKRLCNN200600895 AOTRCCIICNN KRL CNN
TAKRLCNN200600895 AOTRCCIICNN KRL CNN

35 rows selected.

Imagine the lengths I went to! Imagine the power! The glory! All this just to make sure that only duplication has taken place! All this to prove that 1+1+1+1=1 if the table doesn't have a constraint!

So, in the end, science triumphs and time flies. The following things have been achieved by a combination of persistence, serendipity, and confidence in SQL:~

  1. HK gets a chance to brush up his string manipulation (and a chance to open Feuerstein again).
  2. SD can breathe a sigh of relief; the only problem is data duplication. Probably all that she needs are a few indexes/constraints that would prevent spurious data entry in future. (But the query to remove duplicate entries is a secret until and when she steps down from her fool's paradise).
  3. A nicely formatted PL/SQL report has been generated (one that is perfectly useless and serves to feed the vanity).
  4. A case against "hairy programming" has been built.
  5. A live demonstration has been made of the common ills of programming practices being followed at Data Centre (as exemplified by Eranakulam and Cash Management).

I cringed when SD showed me what she wanted to do with the dynamic SQL. It all sounded INCREDIBLY difficult! I was thinking all the while how SQL could become so tough so suddenly... From SD's description, it was clear to me that she was not thinking straight... was just imagining that things were too complicated! Actually, at TRA we're doing just this sort of work all the time (we used to). I think we (SM and HK) are both excellent 'SQL Detectives'. What SD wanted to do was, simply stated, back-end data manipulation to cover up for a leaky database application (SSANet also has leaks). All ASP applications that do not use SQL procedures to check the integrity of data suffer from data leaks at some point or another. (As far as I know, all the Web applications made at IT Cell have data leaks. A case in point is the celebrated data leak that caused a terrific smashup between Trivandrum Computer Cell and IT Cell). At best, data leaks are a pain in the neck (like the present case). At worst, they destroy data. In some cases, if there are no redundant mechanisms to collect data, it may even be impossible to rebuild the data, no matter how clever you are with SQL. (I think, if there is data in some form somewhere, we can use it to rebuild the broken data; if it is theoretically possible then we'll do it). One way of ensuring that data is not corrupted is to enforce constraints (primary/foreign/unique/check). I think such precautions need to be taken at the back-end for every database application. And this is where the database rules the game.

I know SD is having a stiff load at the present moment, but I think she has only herself to blame. She took all this upon herself. She used to complain of having no work to do, and now she's got a full plate! I think her former mentor might have found her services invaluable at this stage. WTMS might become a busy application sometime in the future. Whatever his faults, he makes robust, dependable programs. I don't think the same can be said about Medallist. (Forgive my ignorance, but I can't seem to remember one decent page he has written.)

I hope that the author of the mischief would seriously consider "straight" programming as a viable alternative in the future.

0 Comments:

Post a Comment

<< Home