BLOG

Analyzing Flat File data with Regular Expressions

29.04.2015 Ralf Glöckner

In this article I want to talk about some basic usage of regular expressions when searching for erroneous data in structured text files, like csv. This comes in handy if you have to work with data in text files without the opportunity to read it into SQL Server or Excel in order to analyze the data. All examples make use of the PowerShell, using the following commands:

  1. Get-ChildItem .\testdata.csv | Select-String [-Notmatch|-Pattern] „PATTERN“
  2. # returns the matched lines, including the line number
  1. Get-Content .\testdata.csv | Select-String [-Notmatch|-Pattern] „PATTERN“
  2. # returns only the matched lines

The main idea is, to find rows, containing invalid data. To check against a valid pattern, we have to use the -Notmatch parameter of the Select-String cmdlet to find the lines, containing data, that don’t match a specific pattern. That is, lines with data, which don’t validate against the pattern. For the other way round, matching a pattern, describing invalid data, one have to use the -Pattern paramter. The Select-String cmdlet is mainly used to search for text in the whole row or file. But it can also be used to search for content in a specific column of a csv file. Please notice, that the following only applies to text files with one data row per row. Embedded line breaks ar not supported.

Searching in columns

If you want to check data in specific columns, you can also do this with the Select-String cmdlet and regular expressions. The pattern of a column is defined, using the delimiter of the csv file. Let’s say, we have data, delimited by a semicolon (the spaces are added only for better visualization):

  1. Col1;      Col2;      Col3
  2. abcde;    12345;    2014-10-01
  3. fghijk;  67890;   2014-10-02

A column can then generally be matched by: [^;]*?;

Meaning:

  • [^;] : arbitrary text, which is not the delimiter (;)
  • *?   : this text zero to „inifinite“ times
  • ;    : followed by the delimiter (;)

Special cases: Matching the first and last column:

  • To match the first column you have to place the caret (^) in front of the pattern: ^[^;]*?;
  • To match the last column, you have to place the dollar sign ($) at the end of the pattern, and move the „;“ to the beginning: ;[^;]*?$

Changing the delimiter

Replace the semicolon to fit the delimiter to your needs. Using tabs (\t) as a delimiter would change the pattern to [^\t]*?\t. This applies to every usage of the pattern described here.

Building the pattern for specific columns

Checking a specific column involves using a general column-pattern for each column preceding the one u want to check, matching every text. You then put the pattern for the column you want to check at the end of the whole pattern. For example to check the 3rd column you do:

  1. Col1       ;Col2       ;Col3      ;Col4      ;Col5
  2. [Match all];[Match all];[Pattern to check];[ignore];[ignore]

One can easily build a pattern like this with the following steps, using the general „match all“-pattern for a column: [^;]*?;

  1. Set the pattern for the first column: ^[^;]*?;
  2. Set the pattern for the following columns: [^;]*?;
  3. Repeat the last pattern as often as needed with a quantifier ({}): ([^;]*?;){2}
  4. Set the last pattern to check the data in the desired column after the preceding matches, closing with a „;“: ([a-z]);

  To parse specific columns for the string „abc“, use:

  • for the 1st column: ^[^;]*?abc[^;]*?;,
  • for the 2nd column: ^[^;]*?;[^;]*?abc[^;]*?;,
  • for the 3rd column: ^[^;]*?;([^;]*?;){1}[^;]*?abc[^;]*?;,
  • for the 4th column: ^[^;]*?;([^;]*?;){2}[^;]*?abc[^;]*?;,
  • and so on…,
  • for the last column: ;[^;]*?abc[^;]*?$.

Use [^;]*? to match arbitrary text in your matches (like „%“ in SQL). Remove the pattern [^;]*? , if you don’t need it. E.g, to match „abc“ only at the beginning of the 2nd column use ^[^;]*?;abc[^;]*?; .

Hint: For better readability it is a good practice to put your pattern in parentheses, like (abc)^[^;]*?;[^;]*?(abc)[^;]*?;:

Now let’s look at some examples.

Examples

Check for invalid integers in the 1st column

Here we want to find integers, which match a pattern for invalid data. Therefore we have to use the -Pattern paramter.

  1. Get-ChildItem .\testdata.csv | Select-String -Pattern „^(\d{10,});“

This is just a coarse comparison. It displays all lines, containing a number greater or equal to 1.000.000.000 in the whole first column (^ to ;). So one can easily identify numbers by hand, that exceed the max value of unsigned int32 (-2.147.483.648 – 2.147.483.647) or signed int32 (4.294.967.295). If you want to check numbers with a higher range, change the quantifier accordingly, e.g. to check for bigints, change it to 19 or 20 (signed, unsigned bigint).

Check for invalid 2-letter country codes in the 3rd column

Here we use a pattern, that matches a correct 2-letter country code, so we have to use the -Notmatch paramter, to get the rows with wrong values. The country code should constist of two upper case letters only. Note: Here we have to match case-sensitive, so the -CaseSensitive parameter has to be added.

  1. Get-ChildItem .\testdata.csv | Select-String -Notmatch „^[^;]*?;([^;]*?;){1}([A-Z][A-Z]);“ -CaseSensitive

In order to avoid returning the line with the column names, you can extend the pattern with an alternation to also match the desired column by name, in this example „Col3“:

  1. Get-ChildItem .\testdata.csv | Select-String -Notmatch „^[^;]*?;([^;]*?;){1}([A-Z][A-Z]|Col3);“ -CaseSensitive

Check for invalid dates in the 4th column

Here we use a pattern, that matches a correct date. To find rows, containg invalid dates (not matching the pattern), we have to use the -Notmatch paramter.

  1. Get-ChildItem .\testdata.csv | Select-String -Notmatch „^[^;]*?;([^;]*?;){2}(\d\d\d\d\-\d\d-\d\d);“

The date pattern here is a very simple example. You can insert a pattern of your choice for checking valid dates. In order to avoid returning the line with the column names, you can extend the pattern with an alternation to also match the corresponding column by name („Col4“):

  1. Get-ChildItem .\testdata.csv | Select-String -Notmatch „^[^;]*?;([^;]*?;){2}(\d\d\d\d\-\d\d-\d\d|Col4);“

 

Check for spaces in the last column

Here we want to find rows, that do contain spaces in the last colummn (positive match), so we have to use the -Pattern parameter. The last column can be parsed directly, using the „end of line“ anchor ($), so we don’t have to use a stack of column patterns in the string.

  1. Get-ChildItem .\testdata.csv | Select-String -Pattern „;[^;]*?( )[^;]*?$“

[^;]*? works as a „match-all“-placeholder.

Variations: Checking for Space at the beginning of the column: remove the „match-all“-placeholder at the beginning, so the pattern starts after the delimiter:

  1. Get-ChildItem .\testdata.csv | Select-String -Pattern „;( )[^;]*?$“

Checking for Space at the end of the column: remove the „match-all“-placeholder at the end, so the pattern ends at the „end of line“-anchor ($):

  1. Get-ChildItem .\testdata.csv | Select-String -Pattern „;[^;]*?( )$“

Conclusion

Regular expressions are a great way to check for invalid data in text files und fun to work with!

Your email address will not be published. Required fields are marked *

Join #teamoraylispeople

Gestalte mit uns
die Welt der Daten