Skip to content

How to write a filter file

If you are looking for templates for filters, you can find them here: filter templates

Sections of the filter file

The filter file is separated into two main sections: filters and attributes. A filter specifies if a project will be saved in the output corpus. The attributes specified in the attributes section define, which attributes of a project will be shown in the output corpus.

The two sections are specified by writing:

filters:

attributes:

Without any indentation.

The example above is also the minimal required filter file, when using corpus filter or the corpus build pipeline.

How to write a filter

A filter can be written directly under the section title filters: with an indentation (I always use 4 spaces or tab). Make sure to be consistent with the indentation or your filter file might not be read.

Any attribute a project has in a corpus, can be used as a filter option. To filter by an attribute, one has to define the operator and the value, that will be used in the evaluation.

Here is a small example:

filters:
    id:
        operator: "<"
        value: 12345

As we are writing a filter, we start with the keyword filters:. In the next line, we write the attribute by which we want to filter the projects. Here it is id:. The next two lines are for operator and value. Any non-numeric values need to be surrounded by double quotes (\") or single quotes (\').

Attributes with a string value can also be filtered by using a regular expression, as shown in the following example:

filters:
    name:
        operator: "regex"
        value: ".*machine\slearning.*"

Here we search for projects, which have the string \'machine learning\' in its name.

Special filter option: languages

GitLab provides the languages used in a project through its API. We can use this, to create a corpus of projects which use specific languages only.

Until now, there are four possible language filters:

any_languages

A project will only be added to the corpus, it is contains any of the languages defined here.

atleast_languages

A project will only be added to the corpus, if it contains at least the languages defined here.

atmost_languages

A project will only be added to the corpus, if it contains at most the languages defined here.

exact_languages

A project will only be added to the corpus, if it contains exactly the languages defined here.

Some examples can be found in the section [Examples]{.title-ref}.

How to specify the attributes

Defining the attributes to be shown in the corpus is straight forward. Simply add the name of the attribute in a list in the next line after attributes, like so:

attributes:
    - id
    - name
    - description
    - web_url

How to refer to a filter file

A filter file is needed, if you either run the command corpus build or corpus filter. The default location for a filter file is resources/filters.yaml from your current directory.

If you want to specify the location of the filter file manually, add the following to your command:

corpus build --filter-file=path/to/your/filter_file.yaml

or:

corpus filter --filter-file=path/to/your/filter_file.yaml

Examples

Assume we want to create a corpus of the projects of our GitLab instance, which currently only has two projects:

  1. Project 1, which has the following languages section:

    "C#": 52.7,
    "C++": 43.14,
    "C": 4.16
    
  2. Project 2, which has the following languages section:

    "HTML": 51.0,
    "Vue": 9.0,
    "JavaScript": 40.0
    

Examples for any_languages

We now want to filter out projects that have any of the languages C, C++ or Java. The filter for this would look like this:

filters:
    any_languages:
        C:
            operator: ">="
            value: 0.0
        C++:
            operator: ">="
            value: 0.0
        Java:
            operator: ">="
            value: 0.0

The resulting corpus would then contain Project 1 only. In the future it shall be necessary anymore, to write operator and value in this case.

Now we want to filter more detailed, by projects which have the languages C, C++ or Java with at least 60%:

filters:
    any_languages:
        C:
            operator: ">="
            value: 60.0
        C++:
            operator: ">="
            value: 60.0
        Java:
            operator: ">="
            value: 60.0

The resulting corpus would not contain any of the two projects.

Examples for atleast_languages

The following filter would only add Project 2 to the corpus, because Project 1 does not contain HTML or Vue:

filters:
    atleast_languages:
        HTML:
            operator: ">"
            value: 0.0
        Vue:
            operator: ">"
            value: 0.0

Here we filter out projects, which contain at least Vue, but it should not make up more than 50% of the projects languages:

filters:
    atleast_languages:
        Vue:
            operator: "<="
            value: 50.0

The corpus would then contain Project 2.

Examples for atmost_languages

We now want to filter out projects, which only contain the programming languages C and C++ and nothing more:

filters:
    atmost_languages:
        C:
            operator: ">"
            value: 0.0
        C++:
            operator: ">"
            value: 0.0

None of the above projects would be added to the corpus.

If we add C#, Python and ActionScript to the filters, Project 1 will be added to the corpus, because it contains C#, C++ and C:

filters:
    atmost_languages:
        C:
            operator: ">"
            value: 0.0
        C++:
            operator: ">"
            value: 0.0
        C#:
            operator: ">"
            value: 0.0
        Python:
            operator: ">"
            value: 0.0
        ActionScript:
            operator: ">"
            value: 0.0

Examples for exact_languages

We now want to filter out those projects, that contain exactly HTML, Vue and JavaScript with at least 30%:

filters:
    exact_languages:
        HTML:
            operator: ">"
            value: 30.0
        Vue:
            operator: ">"
            value: 30.0
        JavaScript:
            operator: ">="
            value: 30.0

The resulting corpus would contain Project 2 only.