Practical Example Of My New Event Driven Regex Engine - SAR

As I said in the previous post, SAR is very powerful when it comes to adding context to our regexps.
In this post I will try to demonstrate that claim.

Consider the following task: in a C source file, find every function whose body calls a certain other function. For example, we want to find all the functions in our code that call “malloc”.

When I tried to solve this task without SAR I ran into many problems. I thought I could split the code by scopes (i.e. the curly brackets), but then how do we know that a pair of brackets belongs to a function body and not to an if, a while, or some other construct?

We could use an ordinary regex to find function declarations, which should be easy enough, but then how do we know whether the bodies of those functions contain a malloc call? Yet another problem…

I tried many different approaches, but they all got complicated very quickly.

With SAR the problem suddenly becomes simple and straightforward. Before I explain how it was solved, I’d like to emphasize again a few key properties of SAR:

  1. We can add multiple regexps to the same SAR instance.
  2. Every regexp comes with a callback, and the callbacks are called on every match, in the same order the matches appear in the text.
  3. We can remember information between matches by using Python closures (basically, a callback can access and modify variables defined outside its own scope), and thus give the regexps context.
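
Property 3 can be illustrated without SAR at all. The following is only a minimal sketch: the callback takes the same (from_pos, to_pos) arguments used in this post, but the names make_word_tracker and get_last_word are made up for the illustration, and the callback is invoked by hand instead of by the engine.

```python
def make_word_tracker(text):
    """Build a callback and a getter that share one remembered value."""
    last_found_word = None  # lives in the enclosing scope, not at module level

    def found_name(from_pos, to_pos):
        # nonlocal lets the callback rebind the variable of the enclosing scope
        nonlocal last_found_word
        last_found_word = text[from_pos:to_pos]

    def get_last_word():
        return last_found_word

    return found_name, get_last_word


text = "int main(void)"
found_name, get_last_word = make_word_tracker(text)
found_name(0, 3)  # pretend the engine matched "int"
found_name(4, 8)  # pretend the engine matched "main"
print(get_last_word())  # prints "main"
```

The code in the rest of this post keeps its state in module-level variables and uses global, which gives the same effect for a single script.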

Now that we’ve recalled these properties we can move on to the solution.

First, read the content of the C file and create a SAR instance:

from regexp_sar import RegexpSar

with open("./c_code.c", "r") as c_fh:
    c_content = c_fh.read()

required_method = "malloc"

sar = RegexpSar()

Now let’s move on to solving the problem. The solution requires two things:

  1. Find the name of the function whose body we are currently looking at.
  2. Identify the required method name and print the name of the function we are in.

Now let’s look at how we can achieve these requirements:

  1. To find the name of the function we are in, we take the following steps:

    1. Keep track of the latest word we have seen (\w+).
      last_found_word = None
      def found_name(from_pos, to_pos):
          global last_found_word
          last_found_word = c_content[from_pos: to_pos]
          sar.continue_from(to_pos)
      
      sar.add_regexp("\\w+", found_name)
      
    2. Whenever we find an opening parenthesis, mark the last found word as a function name (we will use that context in the following steps).
      # name of last encountered function
      last_found_function = None
      def found_function(from_pos, to_pos):
          global last_found_function
          last_found_function = last_found_word
      
      sar.add_regexp("(", found_function)
      
    3. Keep track of the depth of our curly brackets: every time we encounter an opening bracket ("{") increase a counter by 1, and every time we encounter a closing bracket ("}") decrease it by 1.
    4. Before increasing the bracket counter, check whether it equals 0. If it does, we are currently not inside any function, and the function name we found previously is the name of the function whose body we are about to enter, so save it in another variable.
      # These are steps 3 + 4
      # name of the function whose body we are currently inside
      inside_function_name = None
      curly_bracket_count = 0
      def handle_open_curly_bracket(from_pos, to_pos):
          global curly_bracket_count, inside_function_name
          if curly_bracket_count == 0:
              inside_function_name = last_found_function
          curly_bracket_count += 1
      
      sar.add_regexp("{", handle_open_curly_bracket)
      
      def handle_close_curly_bracket(from_pos, to_pos):
          global curly_bracket_count
          curly_bracket_count -= 1
      
      sar.add_regexp("}", handle_close_curly_bracket)
      
  2. Now that we know how to find the name of the function we are inside, we do the following:

    1. Whenever we encounter the name of the required function, print the name of the function we are in, which is known from part 1.
      def handle_required_method_found(from_pos, to_pos):
          print(f"found at: {inside_function_name}")
      
      sar.add_regexp(f"\\^\\w{required_method}\\^\\w", handle_required_method_found)
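
The pattern above surrounds the required name with non-word characters, presumably so that longer identifiers which merely contain “malloc” are not reported. The sketch below shows the same idea with Python's standard re module rather than SAR; the lookarounds are a stand-in I chose for the illustration, not SAR syntax, and the sample strings are made up.

```python
import re

required_method = "malloc"

# Stand-in for the SAR pattern: the name must not touch another word character.
pattern = re.compile(rf"(?<!\w){re.escape(required_method)}(?!\w)")

samples = [
    "p = malloc(8);",              # a real call
    "p = my_malloc(8);",           # longer identifier ending in the name
    "n = malloc_usable_size(p);",  # longer identifier starting with the name
    "malloc(1)",                   # name at the very start of the text
]
matches = [bool(pattern.search(s)) for s in samples]
print(matches)  # prints [True, False, False, True]
```

Unlike a pattern that requires an actual non-word character on both sides, the lookarounds also accept the name at the very start or end of the text.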
      

Once we’ve implemented the steps above, all that’s left is to run the regexps:

sar.match(c_content)
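
To experiment with the logic without installing SAR, the same bookkeeping can be approximated with Python's standard re module. This is only a rough sketch under a few assumptions: find_callers and SAMPLE_C are names and data made up for the illustration (SAMPLE_C is not the C file from this post), a single tokenizing regex replaces the separate SAR regexps, and it takes two small liberties by ignoring matches outside any function body and by reporting each function once.

```python
import re


def find_callers(c_source, required_method):
    """Return names of top-level functions whose body mentions required_method."""
    token_re = re.compile(r"\w+|[({}]")  # words, "(", "{" and "}"
    last_found_word = None        # step 1: latest word seen
    last_found_function = None    # step 2: word seen right before a "("
    inside_function_name = None   # step 4: function whose body we are in
    curly_bracket_count = 0       # step 3: current bracket depth
    callers = []

    for match in token_re.finditer(c_source):
        token = match.group()
        if token == "(":
            last_found_function = last_found_word
        elif token == "{":
            if curly_bracket_count == 0:
                inside_function_name = last_found_function
            curly_bracket_count += 1
        elif token == "}":
            curly_bracket_count -= 1
        else:
            # A whole word: report it if it is the required name inside a body.
            if (token == required_method and curly_bracket_count > 0
                    and inside_function_name not in callers):
                callers.append(inside_function_name)
            last_found_word = token
    return callers


SAMPLE_C = """
char *copy_str(const char *s) {
    char *out = malloc(strlen(s) + 1);
    if (out) { strcpy(out, s); }
    return out;
}

int add(int a, int b) { return a + b; }

void nested_alloc(int n) {
    while (n--) {
        if (n % 2) { free(malloc(8)); }
    }
}
"""

print(find_callers(SAMPLE_C, "malloc"))  # prints ['copy_str', 'nested_alloc']
```

Like the SAR version, this does not understand comments, string literals, or top-level constructs such as struct definitions, so it is a heuristic rather than a C parser.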

In order to see the full code with the C file click here.

And indeed, SAR finds the functions that call “malloc” (these are: “cloneDoubleStr”, “getSubArr”, “someNestedMalloc”).

In this post we’ve learned what context means and how it is applied in SAR. It is important to note that SAR was designed to give developers maximum freedom in how they use it, and indeed, the way context is applied in SAR is entirely up to the developer.