Analyzing Flat File data with Regular Expressions

Technical Value

In this article I want to talk about some basic usage of regular expressions when searching for erroneous data in structured text files, like csv. This comes in handy if you have to work with data in text files without the opportunity to read it into SQL Server or Excel in order to analyze the data. All examples make use of the PowerShell, using the following commands:

  1. Get-ChildItem .\testdata.csv | Select-String [-Notmatch|-Pattern] "PATTERN"
  2. # returns the matched lines, including the line number
  1. Get-Content .\testdata.csv | Select-String [-Notmatch|-Pattern] "PATTERN"
  2. # returns only the matched lines

The main idea is, to find rows, containing invalid data. To check against a valid pattern, we have to use the -Notmatch parameter of the Select-String cmdlet to find the lines, containing data, that don't match a specific pattern. That is, lines with data, which don't validate against the pattern. For the other way round, matching a pattern, describing invalid data, one have to use the -Pattern paramter. The Select-String cmdlet is mainly used to search for text in the whole row or file. But it can also be used to search for content in a specific column of a csv file. Please notice, that the following only applies to text files with one data row per row. Embedded line breaks ar not supported.

Searching in columns

If you want to check data in specific columns, you can also do this with the Select-String cmdlet and regular expressions. The pattern of a column is defined, using the delimiter of the csv file. Let's say, we have data, delimited by a semicolon (the spaces are added only for better visualization):

  1. Col1;      Col2;      Col3
  2. abcde;    12345;    2014-10-01
  3. fghijk;  67890;   2014-10-02

A column can then generally be matched by: [^;]*?;


  • [^;] : arbitrary text, which is not the delimiter (;)
  • *?   : this text zero to "inifinite" times
  • ;    : followed by the delimiter (;)

Special cases: Matching the first and last column:

  • To match the first column you have to place the caret (^) in front of the pattern: ^[^;]*?;
  • To match the last column, you have to place the dollar sign ($) at the end of the pattern, and move the ";" to the beginning: ;[^;]*?$

Changing the delimiter

Replace the semicolon to fit the delimiter to your needs. Using tabs (\t) as a delimiter would change the pattern to [^\t]*?\t. This applies to every usage of the pattern described here.

Building the pattern for specific columns

Checking a specific column involves using a general column-pattern for each column preceding the one u want to check, matching every text. You then put the pattern for the column you want to check at the end of the whole pattern. For example to check the 3rd column you do:

  1. Col1       ;Col2       ;Col3      ;Col4      ;Col5
  2. [Match all];[Match all];[Pattern to check];[ignore];[ignore]

One can easily build a pattern like this with the following steps, using the general "match all"-pattern for a column: [^;]*?;

  1. Set the pattern for the first column: ^[^;]*?;
  2. Set the pattern for the following columns: [^;]*?;
  3. Repeat the last pattern as often as needed with a quantifier ({}): ([^;]*?;){2}
  4. Set the last pattern to check the data in the desired column after the preceding matches, closing with a ";": ([a-z]);

  To parse specific columns for the string "abc", use:

  • for the 1st column: ^[^;]*?abc[^;]*?;,
  • for the 2nd column: ^[^;]*?;[^;]*?abc[^;]*?;,
  • for the 3rd column: ^[^;]*?;([^;]*?;){1}[^;]*?abc[^;]*?;,
  • for the 4th column: ^[^;]*?;([^;]*?;){2}[^;]*?abc[^;]*?;,
  • and so on...,
  • for the last column: ;[^;]*?abc[^;]*?$.

Use [^;]*? to match arbitrary text in your matches (like "%" in SQL). Remove the pattern [^;]*? , if you don't need it. E.g, to match "abc" only at the beginning of the 2nd column use ^[^;]*?;abc[^;]*?; .

Hint: For better readability it is a good practice to put your pattern in parentheses, like (abc)^[^;]*?;[^;]*?(abc)[^;]*?;:

Now let's look at some examples.


Check for invalid integers in the 1st column

Here we want to find integers, which match a pattern for invalid data. Therefore we have to use the -Pattern paramter.

  1. Get-ChildItem .\testdata.csv | Select-String -Pattern "^(\d{10,});"

This is just a coarse comparison. It displays all lines, containing a number greater or equal to in the whole first column (^ to ;). So one can easily identify numbers by hand, that exceed the max value of unsigned int32 (-2.147.483.648 - 2.147.483.647) or signed int32 (4.294.967.295). If you want to check numbers with a higher range, change the quantifier accordingly, e.g. to check for bigints, change it to 19 or 20 (signed, unsigned bigint).

Check for invalid 2-letter country codes in the 3rd column

Here we use a pattern, that matches a correct 2-letter country code, so we have to use the -Notmatch paramter, to get the rows with wrong values. The country code should constist of two upper case letters only. Note: Here we have to match case-sensitive, so the -CaseSensitive parameter has to be added.

  1. Get-ChildItem .\testdata.csv | Select-String -Notmatch "^[^;]*?;([^;]*?;){1}([A-Z][A-Z]);" -CaseSensitive

In order to avoid returning the line with the column names, you can extend the pattern with an alternation to also match the desired column by name, in this example "Col3":

  1. Get-ChildItem .\testdata.csv | Select-String -Notmatch "^[^;]*?;([^;]*?;){1}([A-Z][A-Z]|Col3);" -CaseSensitive

Check for invalid dates in the 4th column

Here we use a pattern, that matches a correct date. To find rows, containg invalid dates (not matching the pattern), we have to use the -Notmatch paramter.

  1. Get-ChildItem .\testdata.csv | Select-String -Notmatch "^[^;]*?;([^;]*?;){2}(\d\d\d\d\-\d\d-\d\d);"

The date pattern here is a very simple example. You can insert a pattern of your choice for checking valid dates. In order to avoid returning the line with the column names, you can extend the pattern with an alternation to also match the corresponding column by name ("Col4"):

  1. Get-ChildItem .\testdata.csv | Select-String -Notmatch "^[^;]*?;([^;]*?;){2}(\d\d\d\d\-\d\d-\d\d|Col4);"


Check for spaces in the last column

Here we want to find rows, that do contain spaces in the last colummn (positive match), so we have to use the -Pattern parameter. The last column can be parsed directly, using the "end of line" anchor ($), so we don't have to use a stack of column patterns in the string.

  1. Get-ChildItem .\testdata.csv | Select-String -Pattern ";[^;]*?( )[^;]*?$"

[^;]*? works as a "match-all"-placeholder.

Variations: Checking for Space at the beginning of the column: remove the "match-all"-placeholder at the beginning, so the pattern starts after the delimiter:

  1. Get-ChildItem .\testdata.csv | Select-String -Pattern ";( )[^;]*?$"

Checking for Space at the end of the column: remove the "match-all"-placeholder at the end, so the pattern ends at the "end of line"-anchor ($):

  1. Get-ChildItem .\testdata.csv | Select-String -Pattern ";[^;]*?( )$"


Regular expressions are a great way to check for invalid data in text files und fun to work with!


Do, 24.05.2018 - 10:32

Thank you for a informative page. How can I search with regex in a flat file? Looking in spesific position for 1 or more digits.

Flat file is blocked with 12 characters each line starts with a two digit recortype, ie 10=phonenumber 20 = zip code


I want to find value 3 in position 9 in every 10 recordtype.
resulting in finding this line

102620523565 this line
102620523024 this line

I use Notepad ++ latest version. I want tu put the regex code in search box using regex
I just need Notepad++ to highlight found digit

i do not need to print out this line

Ralf Glöckner
Fr, 25.05.2018 - 09:38


if I understood you correctly, you can use this regex to find the value as described:


It matches a "3" on a line, wich starts with a "10", followed by 6 digits.


(?<= # start of Lookbehind
^10 # literal 10 at the beginning of the line (2 digits)
\d{6} # any digit (6 times)
) # end of Lookbehind
3 # desired value (on position 9).

I tested it in notepad++ and it works, given the structure of the data and search requirements as described. It does also works for longer lines.

If you need to test for digits after the "3" also, you can use a Lookahead after "3" like this:



(?= # start of Lookahead
\d{3}$ # 3 digits until end of line
) # end of Lookahead

Lookbehinds and Lookaheads are kind of anchors which find the position of the given pattern before or behind the search string, but don't contribute to the match.

Please be aware, that "Lookbehind" is not supported by every RegEx-Flavor/Tool. But the major ones do support it (Perl, PCRE, .NET).

Hope this helps.

Don't hesitate to ask, if you have any further questions.


Neuen Kommentar schreiben

Der Inhalt dieses Feldes wird nicht öffentlich zugänglich angezeigt.


  • Keine HTML-Tags erlaubt.
  • HTML - Zeilenumbrüche und Absätze werden automatisch erzeugt.
  • Web page addresses and email addresses turn into links automatically.
Teilen auf

Newsletter Anmeldung

Abonnieren Sie unseren Newsletter!
Lassen Sie sich regelmäßig über alle Neuigkeiten rundum ORAYLIS und die BI- & Big-Data-Branche informieren.

Jetzt anmelden