This post discusses a method to run FlexeLint so as to receive messages only on code that differs from a specific baseline. The examples use FlexeLint but the ideas are relatively simple and can be easily modified to work with other tools as well.
Let's start with a very simple example of what we want to accomplish. Below is a function that contains several issues:
#include <stdlib.h> #include <string.h> char * dup_string(char *s) { char *copy = malloc(strlen(s+1)); strcpy(copy, s); return copy; }
Here are the messages produced by my FlexeLint configuration:
FlexeLint for C/C++ (Unix) Vers. 9.00h, Copyright Gimpel Software 1985-2011 --- Module: dupstring.c (C) dupstring.c: 5: Warning 432: Suspicious argument to malloc dupstring.c: 5: Warning 613: Possible use of null pointer 's' in left argument to operator 'ptr+int' dupstring.c: 6: Warning 668: Possibly passing a null pointer to function 'strcpy(char *, const char *)', arg. no. 1 [Reference: file dupstring.c: line 5] dupstring.c: 5: Info 831: Reference cited in prior message dupstring.c: 6: Warning 668: Possibly passing a null pointer to function 'strcpy(char *, const char *)', arg. no. 2 dupstring.c: 8: Info 818: Pointer parameter 's' (line 4) could be declared as pointing to const - C Coding Standards Rule 6.2 dupstring.c: 4: Info 830: Location cited in prior message dupstring.c: 8: Note 953: Variable 'copy' (line 5) could be declared as const dupstring.c: 5: Info 830: Location cited in prior message --- Global Wrap-up : 0: Info 714: Symbol 'dup_string(char *)' (line 4, file dupstring.c) not referenced dupstring.c: 4: Info 830: Location cited in prior message
Now suppose we make some changes:
#include <stdlib.h> #include <string.h> char * dup_string(char *s) { char *copy = malloc(strlen(s+1)); int i; strcpy(copy, s + 1); i++; return copy; return i; }
The goal is to get output that corresponds to only the new and changed lines:
FlexeLint for C/C++ (Unix) Vers. 9.00h, Copyright Gimpel Software 1985-2011 --- Module: dupstring.c (C) dupstring.c: 7: Warning 668: Possibly passing a null pointer to function 'strcpy(char *, const char *)', arg. no. 1 [Reference: file new/dupstring.c: line 5] dupstring.c: 5: Info 831: Reference cited in prior message dupstring.c: 7: Warning 613: Possible use of null pointer 's' in left argument to operator 'ptr+int' dupstring.c: 8: Warning 530: Symbol 'i' (line 6) not initialized dupstring.c: 6: Info 830: Location cited in prior message dupstring.c: 9: Warning 438: Last value assigned to variable 'i' (defined at line 6) not used dupstring.c: 6: Info 830: Location cited in prior message dupstring.c: 10: Warning 527: Unreachable code at token 'return' dupstring.c: 10: Error 64: Type mismatch (return) (char * = int)
Note that although line 5 has not changed, line 7, for which this message is referencing, has changed so this message should be displayed as well.
There are several potential methods for achieving the desired effect which are considered below.
Line Filter method
Perhaps the most naive method is to run the original codebase through lint, obtain the line numbers and message numbers produced, and then filter the same messages that reference the same line numbers in subsequent lintings on updated versions of the codebase on a per-source file basis. This quickly falls apart though as line numbers quickly shift when new code is added or removed. Adding a line to the top of a file would cause the line numbers of the rest of the file to change and all of the original messages would be generated again. This is obviously not a feasible method and won't be discussed further.Annotation method
Another idea is to run the original codebase through lint, extract the messages produced, and append a single-line suppression comment to each line that produces a message.The method has the following drawbacks:
- It may not be acceptable to make a potentially significant amount of annotations to the code base to suppress messages.
- If the linting configuration changes, updating the code base to accommodate the changes can be messy.
- Single-line suppressions do not work for all messages. For example, they are ineffective for suppressing wrap-up messages.
The advantages to this method are:
- The messages suppressed become documented within the code and can easily be examined and addressed as time permits.
- As long as the annotations are performed when the linting configuration for the project has been finalized, nothing needs to be done during subsequent lintings to maintain the suppressions.
- Implementing such a method is very straight-forward and can be accomplished with a simple script.
This may be a reasonable method for smaller codebases with fewer messages to worry about but is probably not a good general solution.
Contextual method
The next method to consider is the contextual method. This scheme is identical to the Line Filter method but instead of using line numbers, the lines of code themselves are used. For each message produced by lint, the source line and message number are saved. The results of future lintings are filtered based on the saved results and messages that were originally produced for identical lines of code are withheld.The advantages of this method are:
- No changes to the code base is required
- The baseline can be easily changed
- The saved results can easily be recreated based on different linting configurations
The main disadvantage is that newly added source code lines that are identical to lines with previously suppressed messages could wind up being suppressed, such as with copy-pasted code. This is particularly problematic for lines containing little or no useful context which could cause a significant suppression of new messages.
Some other products use a method similar to this where a hash of the message details (typically some combination of file name, function name, symbol name, and message number) are used to create an identifier each issue which can be used to provide baseline suppressions. The hash method can result in collisions where similar issues with the same combination of information produce the same hash value. Including source code into the hash can help improve the uniqueness of the hash but usually requires more than one line of source code to be useful and this is not easily accomplished with FlexeLint. There is, however, significant potential for developing a solution around this method.
Diff method
The last method considered here is using a diff between the current version of the codebase and a baseline version to build a list of the files and line numbers that have been added or changed. A filter can then be applied to the output of linting the current version to only emit the messages that relate to new or changed code.The advantages to this method are:
- No change to the codebase are required
- The baseline can be easily changed
- Unlike the other methods, the baseline code base does not need to be linted
The disadvantages are:
- A diff against the established baseline must be performed for every updated version of the codebase that is to be linted
- Diff algorithms aren't perfect and significant changes to files can cause large chunks of relocated code to be incorrectly marked as changed
Since the shortcomings of this method are probably more tolerable than the other methods in a general case and it is not too complex to implement, I have selected to implement a solution using this method.
Implementation
The basic idea is pretty simple:- Run a diff against an established baseline and a newer version, extract the location of the new/changed code
- Run lint over the new code base
- Filter the lint output to only display messages related to the new/changed code
There are a few complications to consider though:
- Some messages produce 830 and 831 messages, these need to be handled properly and be grouped together with their parent message.
- When grouping messages with their 830/831 messages, the locations of each of the references should also be considered since the base message could specify a location that hasn't changed but the reference could specify a location that has.
- Verbosity messages and specific walk announcements, etc. can be interspersed throughout messages and should be handled properly.
Because of this, something more intelligent than a simple line-by-line suppression mechanism is required.
The strategy used here is to consider all lines from the end of a previous message until the new message (including any following 830/831 messages) as a single block which is displayed if any of the location information corresponds to new code. Regular expressions are used to specify patterns to match verbose messages (such as
--- Module:
, /// Start of Pass 2 ///
, etc) which are always displayed. Below is the python script (also available on github) that does this.#!/usr/bin/python import re def process_diff_list(fh): '''Obtain the filenames and locations that represent new/changed code from the output of the diff command and return a dict of sets describing them.''' diff_pattern = re.compile(r'^[,\d]+[ac](?P<start>\d+)(,(?P<end>\d+))?') file_pattern = re.compile(r'^diff .* (?P<file>\S+)$') cur_file = None diff_list = {} for line in fh: m = file_pattern.search(line) if m: cur_file = m.group('file') continue m = diff_pattern.search(line) if not m: continue if cur_file not in diff_list: diff_list[cur_file] = set() start = int(m.group('start')) end = start if m.group('end'): end = int(m.group('end')) for i in xrange(start, end + 1): diff_list[cur_file].add(i) return diff_list def process_lint_output(fh, diff_list): '''Process the output from linting, filter messages not related to new/changed code, print the ones that are.''' messages = get_messages(fh) for locations, message_list in messages: if not locations: print ''.join(message_list), continue for filename, lineno in locations: if filename in diff_list and lineno in diff_list[filename]: print ''.join(message_list), break def get_messages(fh): '''Process the next message group and return it along with all of the file locations it references''' # Pattern for lint messages, only need filename, line, and message number message_pattern = re.compile(r'^(?P<filename>[^:]*):\s*(?P<lineno>\d+):' r'\s*(Warning|Error|Info|Note)\s*(?P<msgno>\d+)') # The pattern that specifies verbose messages that are not specific # to a particular message and should never be suppressed verbose_pattern = re.compile(r'^(---|///|===|\s{4})') # List of tuples representing locations related to the current message locations = [] # List of lines representing the current message buffer = [] # Flag indicating if the contents of the current buffer contains a # recorgnized lint message with file information in_message = False for line in fh: # If line is a verbose line, emit buffer and line m = verbose_pattern.search(line) if m: if buffer: yield locations, buffer yield [], line locations = [] buffer = [] in_message = False continue else: m = message_pattern.search(line) if m: filename = m.group('filename') lineno = int(m.group('lineno')) msgno = m.group('msgno') if in_message and msgno not in ('830', '831'): # A new message has been encountered, emit current # message and queue this one yield locations, buffer locations = [] buffer = [] locations.append((filename, lineno)) buffer.append(line) continue else: # New message or 830/831 to be grouped with current one in_message = True locations.append((filename, lineno)) buffer.append(line) continue else: if in_message: # Message that doesn't match pattern starts a new message # group, emit current message and queue new one yield locations, buffer locations = [] buffer = [] buffer.append(line) in_message = False continue else: # If we haven't seen a recognized message yet, just add # this line to the buffer buffer.append(line) continue # Don't forget to send any leftover data yield locations, buffer if __name__ == '__main__': import sys import subprocess if len(sys.argv) < 4: print "Usage: lint_diff.py orig_dir new_dir [lint options] files" sys.exit(1) orig_dir = sys.argv[1] new_dir = sys.argv[2] lint_args = sys.argv[3:] diff_fh = subprocess.Popen(('diff', '-N', orig_dir, new_dir), stdout=subprocess.PIPE) diff_list = process_diff_list(diff_fh.stdout) lint_fh = subprocess.Popen((('lint',) + tuple(lint_args)), stdout=subprocess.PIPE) process_lint_output(lint_fh.stdout, diff_list)
I use the "normal" mode for
diff
because it is easier to extract the new and changed line information, the -N
option is needed to properly handle new files. The process_diff_list
function handles the diff output and returns a dict of filenames each containing a dict of line numbers representing the new/changed code. The
process_lint_output
function calls the generator function get_messages
to iterate over the groups of messages and obtain the file location information for each one which is compared to the diff_list
to determine whether or not to output it. If no location information is available, the message is displayed by default.The script is called with the name of the baseline directory, the directory containing the modified code, and any lint arguments including the files to process. For example, using the dupstring.c example from earlier, if the original version is in orig/ and the new version is in new/, we can run our incremental lint like so:
lint_diff.py orig/ new/ new/dupstring.c
and obtain incremental results:
FlexeLint for C/C++ (Unix) Vers. 9.00h, Copyright Gimpel Software 1985-2011 --- Module: dupstring.c (C) new/dupstring.c: 7: Warning 668: Possibly passing a null pointer to function 'strcpy(char *, const char *)', arg. no. 1 [Reference: file new/dupstring.c: line 5] new/dupstring.c: 5: Info 831: Reference cited in prior message new/dupstring.c: 7: Warning 613: Possible use of null pointer 's' in left argument to operator 'ptr+int' new/dupstring.c: 8: Warning 530: Symbol 'i' (line 6) not initialized new/dupstring.c: 6: Info 830: Location cited in prior message new/dupstring.c: 9: Warning 438: Last value assigned to variable 'i' (defined at line 6) not used new/dupstring.c: 6: Info 830: Location cited in prior message new/dupstring.c: 10: Warning 527: Unreachable code at token 'return' new/dupstring.c: 10: Error 64: Type mismatch (return) (char * = int)
The diff process could be broken into a separate piece that writes the intermediate information to a file if desired. The solution could also be integrated to perform diffs from a SCM system.
This is a solution that works well in practice but it isn't perfect and some tweaks may need to be made for particular purposes:
- The message_pattern assumes a single-line message format that starts with "
%f: %l: %t %n
", the script would need to be modified to handle alternative formats. - Since thread messages do not currently contain file or line information, they cannot be treated differently from the baseline and either all of them will be displayed or none of them will (which is the case in this implementation).
- The output of the
-summary
option will necessarily include information related to messages from the baseline code.
how to support incremental lint?
ReplyDeleteI want to lint as FlexeLint does.
and then parse the lint output and filter with diff result.
like this:
$ lint file.c > result.txt
$ lint_filter orig_dir new_dir result.txt > diff_result.txt
maybe lint_diff.py is also usefull.
ReplyDeletebut i can't run it:
[test]$ python lint_diff.py old new new/test
.c
Traceback (most recent call last):
File "lint_diff.py", line 146, in
lint_fh = subprocess.Popen((('lint',) + tuple(lint_args)), stdout=subprocess
.PIPE)
File "/usr/lib/python2.6/subprocess.py", line 633, in __init__
errread, errwrite)
File "/usr/lib/python2.6/subprocess.py", line 1139, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
=============
lint is:
../LINT-NT.EXE -"format=%f: %l: %t %n" $*
I can run lint:
[test]$ ./lint new/test.c
PC-lint for C/C++ (NT) Vers. 9.00h, Copyright Gimpel Software 1985-2011
--- Module: new\test.c (C)
_
#include
new\test.c: 1: Error 322
If "lint" is a batch script, try replacing 'lint' with 'lint.bat' at the end of the python script. If that doesn't work, also try specifying the full path for the batch file.
ReplyDeleteI wrote a lint-diff by shell.
ReplyDeletehttps://github.com/sinojelly/lint-diff
This tool diffs the old and new files, only show warnings on the lines have difference.
It not impact the lint result, just filter the warnings by lint-diff.xsl.
You can use this lint-diff tool these ways:
1. show warnings relative to a workspace of old code.
a) modify variables in envsetup.sh
b) run lint-diff-wraper.sh
2. show warnings relative to the specified git commit
a) use lint-diff.sh, you shold take a look at it first.