Creating Quick and Effective Yara Rules: Working with Strings

This is a quick post to outline a few ways to extract and identify useful strings for creating quality Yara rules. This post focuses on Windows executable files, but can be adapted to other file types. Let’s start with an overview of the types of strings we are interested in when developing Yara rules.

tl;dr

In this post, you will learn:

  • How to extract ASCII and Encoded strings from malware samples.
  • How to analyse strings from a malware sample set and choose strings for your Yara rule.
  • Tips and other tools to assist in Yara rule creation.

ASCII vs. Encoded Strings

Windows executables normally contain both ASCII and encoded strings. A “string” typically refers to a sequence of alphanumeric and special characters arranged in a specific order. Strings are used to represent various types of data, including file names, paths, URLs, and other content within files. ASCII and encoded strings differ in how those characters are stored as bytes.

ASCII is a character encoding standard that uses numeric codes to represent characters, and an ASCII string is simply text stored one byte per character using those codes. ASCII is a straightforward encoding, but it has limitations when it comes to representing characters from other languages or special symbols. Encoded strings, as used in this post, are text stored with a different character encoding scheme, such as Unicode. In Windows executable files this usually means UTF-16 (16-bit Unicode Transformation Format), which is why these are sometimes referred to as “wide” strings. When writing Yara rules for Windows executables, we normally want to focus on both ASCII and Unicode strings. So, how do we extract these strings from an executable file? Glad you asked.
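Before we get to extraction, it can help to see the difference at the byte level. Here is a small sketch that prints the same text as ASCII and as UTF-16LE. The helper name hexbytes and the sample text “cmd.exe” are just my picks for illustration, and I am assuming od, xargs and iconv are available on your system:

```shell
# Print the bytes read from stdin as space-separated hex on one line.
hexbytes() {
    od -An -v -tx1 | xargs echo
}

# ASCII: one byte per character.
printf 'cmd.exe' | hexbytes
# 63 6d 64 2e 65 78 65

# UTF-16LE ("wide"): each of these characters is followed by a 0x00 byte.
printf 'cmd.exe' | iconv -f ASCII -t UTF-16LE | hexbytes
# 63 00 6d 00 64 00 2e 00 65 00 78 00 65 00
```

Those interleaved zero bytes are why a plain ASCII search misses wide strings, and why they need their own extraction pass.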

Extracting Strings

The simplest way to extract ASCII strings is using the strings tool in Linux/Unix (strings tools are also available for Windows and macOS, though their options may differ from the ones shown here). Execute the command on your malware executable target, and save the output to a text file like so:

strings -n 4 malware.exe > malware-ascii-strings.txt

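Encoded strings are also easy to extract. The -e l option tells strings to look for 16-bit little-endian characters, which is how wide strings are stored in Windows executables: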

strings -n 4 -e l malware.exe > malware-encoded-strings.txt
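If you would rather keep everything in one file, here is a rough sketch that runs both passes and tags each string with where it came from. The function name tag_strings and the output file name are hypothetical, and it assumes the GNU version of strings:

```shell
# Print every string in FILE, prefixed with "ascii" or "wide" and a tab.
# Usage: tag_strings FILE [MIN_LEN]
tag_strings() {
    file=$1
    min_len=${2:-4}
    # Single-byte (ASCII) strings.
    strings -n "$min_len" "$file" | awk '{ print "ascii\t" $0 }'
    # 16-bit little-endian strings.
    strings -n "$min_len" -e l "$file" | awk '{ print "wide\t" $0 }'
}

# Example: tag_strings malware.exe > malware-tagged-strings.txt
```

Keeping the tag around is handy later, because the encoding decides whether a string needs the “wide” attribute in the rule.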

Once we have our strings, let’s dump them into a Yara rule, shall we? Heh… Not so fast, cowboy. We have some strings analysis work to do first.

Analyzing Strings

One of the challenges with using strings for detecting malware is that there are so.. many.. strings. A single executable file could have thousands. How do we know the good strings, from the bad strings, from the ugly strings? How can we know which to include in our Yara rule?

If you have a single malware executable, you’ll have lots of strings to dig through (depending on the size of the executable file, of course). The trick is to identify the strings that are likely related to the malware itself and filter out the ones that are not (such as compiler data and code, common strings that also reside in benign files, etc.). It takes experience to know what to look for and what to ignore.

If you have a number of files of the same malware family, this process can be a bit more efficient. What we need to do is gather our malware sample set, extract all strings from these samples, and compare these strings to identify the strings we should zero in on for our Yara rule.

This malware sample set must meet the following requirements:

  • The malware samples should all be from the same malware family. For example, if you are developing a Yara rule for Ryuk ransomware, all samples should be Ryuk ransomware, otherwise bad samples/strings will taint your Yara rule.
  • The malware samples should be unpacked/deobfuscated. If the samples are packed, encrypted, obfuscated, etc., you are no longer writing a Yara rule for the malware itself, but rather for the packer/obfuscator. If this is your intention, that’s perfectly fine, as there are valid use cases for this as well!
  • The malware samples should be of the same file type. It’s not a good idea to mix Windows executables with MS Office documents, for example.
  • The more malware samples you have in your set, the more accurate your Yara rule could be.
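The file type requirement is easy to sanity-check before you start. Here is a minimal sketch, assuming your set contains Windows executables, that lists any file not starting with the “MZ” magic bytes. The function name list_non_pe is hypothetical:

```shell
# Print every regular file in DIR that does not start with "MZ" (0x4D 0x5A).
# Usage: list_non_pe DIR
list_non_pe() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # Read the first two bytes as hex, e.g. "4d5a" for a Windows executable.
        magic=$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')
        if [ "$magic" != "4d5a" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example: list_non_pe ./samples
```

No output means every file at least looks like a Windows executable. This only checks the first two bytes, so it says nothing about family or packing.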

We can extract and analyse all strings in a malware sample set with a one-liner command. First, make sure you have your malware samples together in one directory called “samples”. (I am assuming you are on a *Nix system here, but the following command can be adapted for Windows as well with a bit of work):

for file in ./samples/*; do strings -n 4 "$file" | sort -u; done | sort | uniq -c | sort -rn > count_malware_strings.txt

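In the above command, we create a for loop that iterates over all files in our samples directory (“samples”). Each file’s strings are extracted, sorted, and de-duplicated, so a string is counted at most once per sample. The combined output is then sorted again, each string is prefixed with a “count” value (the number of samples it appears in), and the result is saved to a text file “count_malware_strings.txt”, with the most common strings at the top. Here is a screenshot of the result: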

You may be able to spot some interesting strings. The number “9” next to each line denotes the number of samples this string resides in. My sample set consists of 9 samples, so each string with a 9 next to it means that this string resides in all my malware samples!

We should also run this same command, but for encoded strings:

for file in ./samples/*; do strings -n 4 -e l "$file" | sort -u; done | sort | uniq -c | sort -rn > count_malware_strings_encoded.txt

Here is the result:

See any interesting strings here? Perhaps the references to WMI (SELECT * …), the sandbox-related strings (“sandbox”), and strings such as “Running Processes.txt”?

Selecting Strings for the Yara Rule

So, now we have a much better idea of what strings to use in our Yara rule. Ideally, we’ll want to select strings that are in all or most of the sample set. Selecting strings that are in only one file may result in lots of false positives (depending on what type of rule you are creating and what your objectives are, of course). However, selecting only strings that appear in all files may result in your Yara rule being too specific. Again, this will depend on your objectives for the rule.
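To narrow the counted list down to strings that show up in “all or most” samples, a small filter does the job. This is a sketch that assumes the count files produced by uniq -c earlier. The function name strings_in_at_least is hypothetical, and the threshold is yours to choose:

```shell
# Print strings that occur in at least MIN samples, without the leading count.
# Usage: strings_in_at_least MIN COUNT_FILE
strings_in_at_least() {
    awk -v min="$1" '$1 + 0 >= min + 0 {
        sub(/^[ \t]*[0-9]+ /, "")   # strip the count added by uniq -c
        print
    }' "$2"
}

# Example: strings seen in 7 or more samples (7 is an arbitrary pick)
# strings_in_at_least 7 count_malware_strings.txt
```

With my set of 9 samples, a threshold of 9 keeps only strings present in every sample, while a lower value casts a wider net.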

Consider also that even though you are dealing with malware, there will be “benign” strings (sometimes called “goodware strings”) in these files that are not part of the malware’s code or functionalities. You’ll likely want to weed these out. Optionally, you could create a goodware strings database or list that simply contains strings you wish to exclude from your Yara rules. But this is a topic for another day.
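Just to give a taste of the idea, here is a very rough sketch of a goodware exclusion list. It assumes you have a directory of known-benign executables and a file with one candidate string per line. The function and file names are all hypothetical:

```shell
# Collect every string seen in a directory of known-benign files.
# Usage: build_goodware_list GOODWARE_DIR > goodware_strings.txt
build_goodware_list() {
    for f in "$1"/*; do
        [ -f "$f" ] && strings -n 4 "$f"
    done | sort -u
}

# Print candidate strings that do NOT appear in the goodware list.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
# Usage: filter_goodware CANDIDATES GOODWARE_LIST
filter_goodware() {
    grep -vxFf "$2" "$1"
}
```

The same approach should work for encoded strings if you add -e l to the strings call. A real goodware database would be much larger and smarter than this, but an exact-match filter already removes a lot of noise.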

Creating our Yara Rule

Based on the strings I observed in the strings text files I created previously, I chose the following strings and created my basic Yara rule:

Notice how I added the “wide” attribute to some of the strings. This tells Yara to match the string in its encoded form (two bytes per character, as in UTF-16), which is how these strings are stored in the samples. For the conditions at the bottom, I am specifically looking for samples whose first two bytes are “MZ” (0x5A4D when read as a little-endian 16-bit value), which marks a Windows executable, and the sample must contain 15 or more of these strings. Lowering this number will result in more of a “hunting” rule, where you may catch additional malware (with a wider net) but have more false positives. Increasing this number will create a higher-fidelity rule, but may be too specific.
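For readers who want something to copy and adapt, here is a stripped-down sketch of that kind of rule, written to a file from the shell. To be clear, this is not my original rule: the rule name and the two PLACEHOLDER strings are hypothetical, only “sandbox” and “Running Processes.txt” come from the output discussed above, and the threshold is scaled down to fit the shorter string list:

```shell
# Write an illustrative rule to the given path.
# Usage: write_sketch_rule RULE_FILE
write_sketch_rule() {
    cat > "$1" <<'EOF'
rule Sketch_Family_Strings
{
    strings:
        // Seen in the encoded (UTF-16) strings output, so marked "wide".
        $w1 = "sandbox" wide
        $w2 = "Running Processes.txt" wide
        // Hypothetical placeholders: replace with strings from your own set.
        $a1 = "PLACEHOLDER_ASCII_STRING_1" ascii
        $a2 = "PLACEHOLDER_ASCII_STRING_2" ascii

    condition:
        // "MZ" at offset 0, read as a little-endian 16-bit value.
        uint16(0) == 0x5A4D and 3 of them
}
EOF
}

write_sketch_rule sketch_rule.yar
# If yara is installed, scan the sample set recursively:
# yara -r sketch_rule.yar ./samples
```

Raising or lowering the number in “3 of them” is the same fidelity versus hunting trade-off described above.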

Other Tools and Tips

Here are a few other random tips/tricks for dealing with strings in Yara rules:

PE Studio – PE Studio is a great PE file analysis tool that also has a nice “goodware” and “malware” strings database built-in. You can open an executable file in PE Studio and the tool will provide you with some hints on which strings may be interesting.

Strings-Sifter – A tool created by Mandiant, it can “sift” through strings and sort them based on how unique or “malicious” they are. This is very useful for quickly identifying the interesting strings.

Yargen – A full-on cheatmode for Yara rules. Yargen is a tool from Florian Roth that takes an input sample set and automatically generates Yara rules based on interesting strings or code in the files. This is a great tool if you are pressed for time or if you have lots of rules to create. However, nothing beats a well-tuned, manually-written rule (in my humble, old-school, boomer opinion). Also, if you are new to Yara and/or malware analysis, stay away from the automatic tools and just do it manually, please 🙂

Conclusion

I hope this short post helps you create better Yara rules! If you have further suggestions or ideas, send them to me and I may include them in this post or in future posts!

@d4rksystem