Posts for the month of November 2007

ErlyBird Is Ready for R12B

OTP/Erlang R12B is coming soon, with support for binary comprehensions and the -spec and -type attributes. I've updated ErlyBird to support this new syntax, and added a new Emacs Standard color theme.

The new version of ErlyBird will be available around the release dates of OTP/Erlang R12B and NetBeans 6.0.

Regexp Syntax in Pattern Match?

Pattern matching in Erlang is very useful, but it has some limits. For example, to match something like the regexp:

\d\d\dx/\d\d\d\d/\d\d/\d\d

I have to write the code as:

<<_:S/binary, C1,C2,C3,$x,$/,Y1,Y2,Y3,Y4,$/,M1,M2,$/,D1,D2,$/,_/binary>> when
     C1 > 47, C1 < 58, C2 > 47, C2 < 58, C3 > 47, C3 < 58,
     Y1 > 47, Y1 < 58, Y2 > 47, Y2 < 58, Y3 > 47, Y3 < 58, Y4 > 47, Y4 < 58,
     M1 > 47, M1 < 58, M2 > 47, M2 < 58, D1 > 47, D1 < 58, D2 > 47, D2 < 58

But life can be simpler: by using parse_transform, we could write the above code as:

<<_:S/binary, d,d,d,$x,$/,d,d,d,d,$/,d,d,$/,d,d,$/,_/binary>>

where d stands for a digit. The parse_transform can process the AST to generate the real code.

What's more, since the current erl_parse accepts any atom in a binary skeleton, we could write the pattern match as:


Will somebody write such a parse_transform?
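To make the idea concrete, here is a minimal sketch of what the entry point of such a parse_transform could look like. The module name `digit_pattern` and the traversal are illustrative assumptions; only the parse_transform/2 callback convention is real:

```erlang
%% Minimal parse_transform skeleton (module name and traversal are
%% illustrative; a real digit-pattern transform would rewrite binary
%% match expressions inside walk/1).
-module(digit_pattern).
-export([parse_transform/2]).

%% The compiler calls parse_transform/2 with the module's full list of
%% forms (its AST) when the module is compiled with
%% {parse_transform, digit_pattern}.
parse_transform(Forms, _Options) ->
    [walk(F) || F <- Forms].

%% Placeholder traversal: a real implementation would pattern-match on
%% {bin, Line, Elements} nodes and expand each `d' atom into a fresh
%% variable plus the guards Var > 47, Var < 58.
walk(Form) ->
    Form.
```

Compiling a module with `-compile({parse_transform, digit_pattern}).` would then route its forms through walk/1.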

Tim Bray's Erlang Exercise "WideFinder" - After Find Wide

>>> Updated Nov 7:

A new version, tbray9a.erl, uses ets instead of dict; at 116 LoC, it took about 3.8 sec on a 5-million-line log file and 0.93 sec on a 1-million-line file on my 4-core Linux box.

  •  Results on a T5120, an 8-core 1.4 GHz machine with 2 integer instruction threads per core and support for 8 thread contexts per core; Solaris thinks it sees 64 CPUs:

Schedulers#   Elapsed(s)   User(s)   System(s)   (User+System)/Elapsed
          1        37.58     35.51        7.82                    1.15
          2        20.14     35.31        8.28                    2.16
          4        11.81     35.37        8.25                    3.69
          8         7.63     35.28        8.33                    5.72
         16         5.60     36.08        8.27                    7.92
         32         5.29     36.64        8.11                    8.46
         64         5.45     36.79        8.23                    8.26
        128         5.26     36.75        8.39                    8.58

When the number of schedulers was 16, (User+System)/Elapsed reached 7.92 and the elapsed time dropped to 5.60 sec (near the best), so the 8-core computing capacity and the 2 x 8 = 16 integer instruction threads were almost saturated. The 8 x 8 thread contexts seemed to help little in gaining further performance.

So the T5120 appears to behave as a box with 8-core parallel computing capacity and 16 hardware threads available to the schedulers.

On my 4-core Linux box, the ratio of the slowest time (1 scheduler) to the fastest time (128 schedulers) was 6.627/3.763 = 1.76. On this T5120, it was 37.58/5.26 = 7.14. So, with its 8-core, 16-integer-instruction-thread combination, the T5120 is pretty good at parallel computing.

An interesting point: as the number of schedulers increased, the elapsed time dropped while user time and system time stayed almost constant. This may be because I separated the parallel and sequential parts completely in my code.

  •  Results on a 2.80 GHz 4-core Intel Xeon Linux box (5-million-line log file):

Schedulers#   Elapsed(s)   User(s)   System(s)   (User+System)/Elapsed
          1        6.627     5.356       4.248                    1.45
          2        4.486     6.176       3.936                    2.25
          4        4.299     8.989       4.156                    3.06
          8        3.960     9.629       3.644                    3.35
         16        3.826     9.101       3.696                    3.34
         32        3.858     9.029       3.840                    3.34
         64        3.763     8.801       3.820                    3.35
        128        3.920     9.137       3.980                    3.35


I've been a widefinder these days, and after weeks of finding wide I wrote another concise and fast widefinder, tbray9.erl, based on Steve's, Anders', and my previous code, with Boyer-Moore searching (Python's findall seems to use this algorithm) and parallelized file reading. It's 120 LoC (a bit shorter than Fredrik Lundh's), took about 1 sec for a 1-million-line log file and 5.2 sec for 5 million lines on my 4-core Linux box, and got 5.29 sec on the T5120 per Tim's testing.

To evaluate:

erlc -smp tbray9.erl
erl -smp +A 1024 +h 10240 -noshell -run tbray9 start o1000k.ap

BTW, since I use parallelized I/O heavily, adding the flag +A 1024 gets the code down to 4.3 sec for the 5-million-line log file on my 4-core Linux box.

Binary efficiency in Erlang is an interesting topic. Besides the tips I discussed in a previous blog post, it also seems to depend on binary size, memory size, etc. The best buffer size for my code seems to be around 20000k to 80000k on both my 4-core Linux box and the T5120, but it may vary for different code.

Note: There is a maximum element size limit of 2^27 - 1 (about 131072k) for binary pattern matching in current Erlang, which is consistent with using a 32-bit word to store the size value (with 4 of those bits used for a type identifier and 1 bit for a sign indicator); for this topic, see Philip Robinson's blog. So the buffer size cannot be greater than 131072k.

I. Boyer-Moore searching

Thanks to Steve and Anders, who gave out a concise Boyer-Moore searching algorithm in Erlang; I modified it a bit to get a general BM searching module for ASCII-encoded binaries:

%% Boyer-Moore searching on ASCII encoded binary
-module(bmsearch).
-export([compile/1, match/3]).

-record(bmCtx, {pat, len, tab}).

compile(Str) ->
    Len = length(Str),
    Default = dict:from_list([{C, Len} || C <- lists:seq(1, 255)]),
    Dict = set_shifts(Str, Len, 1, Default),
    Tab = list_to_tuple([Pos || {_, Pos} <- lists:sort(dict:to_list(Dict))]),
    #bmCtx{pat = lists:reverse(Str), len = Len, tab = Tab}.

set_shifts([], _, _, Dict) -> Dict;
set_shifts([C|T], StrLen, Pos, Dict) ->
    set_shifts(T, StrLen, Pos + 1, dict:store(C, StrLen - Pos, Dict)).

%% @spec match(Bin, Start, #bmCtx{}) -> {true, Len} | {false, SkipLen}
match(Bin, S, #bmCtx{pat = Pat, len = Len, tab = Tab}) ->
    match_1(Bin, S + Len - 1, Pat, Len, Tab, 0).

match_1(Bin, S, [C|T], Len, Tab, Count) ->
    <<_:S/binary, C1, _/binary>> = Bin,
    case C1 of
        C ->
            match_1(Bin, S - 1, T, Len, Tab, Count + 1);
        _ ->
            case element(C1, Tab) of
                Len -> {false, Len};
                Shift when Shift =< Count -> {false, 1};
                Shift -> {false, Shift - Count}
            end
    end;
match_1(_, _, [], Len, _, _) -> {true, Len}.


> Pattern = bmsearch:compile("is a").
{bmCtx,"a si",4,{4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,...}}
> Bin = <<"this is a test">>.
<<"this is a test">>
> bmsearch:match(Bin, 1, Pattern).
{false, 1}
> bmsearch:match(Bin, 5, Pattern).
{true, 4}
> bmsearch:match(Bin, 7, Pattern).
{false, 4}
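The scan_chunk/1 function used later is not listed in full here; as one possible sketch of how it could drive this matcher, the loop below simply counts occurrences of the compiled pattern in a binary (the real version builds a dict of hits instead). It assumes the #bmCtx record is shared with bmsearch, e.g. via a .hrl file:

```erlang
%% Record shared with the bmsearch module (in practice via a .hrl);
%% duplicated here so the sketch is self-contained.
-record(bmCtx, {pat, len, tab}).

%% Illustrative, simplified scan_chunk/1: count pattern occurrences.
scan_chunk({Bin, #bmCtx{len = Len} = BmCtx}) ->
    scan_chunk_1(Bin, size(Bin), 0, Len, BmCtx, 0).

scan_chunk_1(Bin, Size, S, Len, BmCtx, Count) when S =< Size - Len ->
    case bmsearch:match(Bin, S, BmCtx) of
        {true, Len} ->
            %% Found a match: skip past it and keep counting.
            scan_chunk_1(Bin, Size, S + Len, Len, BmCtx, Count + 1);
        {false, Skip} ->
            %% No match here: advance by the Boyer-Moore shift.
            scan_chunk_1(Bin, Size, S + Skip, Len, BmCtx, Count)
    end;
scan_chunk_1(_, _, _, _, _, Count) -> Count.
```

Because match/3 always returns a skip of at least 1, the loop is guaranteed to terminate.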

II. Reading file in parallel and scan

To read the file in parallel, we should open a new file handle for each process. To resolve line breaks at chunk boundaries, we just split each chunk into its first line (Head), the line-bounded Data, and its last line (Tail), and mark Head and Tail with serial numbers as {I * 10, Head} and {I * 10 + 1, Tail}, so we can join all pending segments (the Head and Tail of each chunk) later in the proper order.

scan_file({FileName, Size, I, BmCtx}) ->
    {ok, File} = file:open(FileName, [raw, binary]),
    {ok, Bin} = file:pread(File, Size * I, Size),
    HeadL = split_on_first_newline(Bin),
    TailS = split_on_last_newline(Bin),
    DataL = TailS - HeadL,
    <<Head:HeadL/binary, Data:DataL/binary, Tail/binary>> = Bin,
    {scan_chunk({Data, BmCtx}), {I * 10, Head}, {I * 10 + 1, Tail}}.
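The helpers split_on_first_newline/1 and split_on_last_newline/1 are used above but not listed; here is one possible implementation, assuming the first returns the length of the leading partial line (including its newline) and the second returns the offset just past the last newline, where the trailing partial line starts:

```erlang
%% Length of the first line including its trailing newline
%% (the whole binary if it contains no newline).
split_on_first_newline(Bin) -> split_first(Bin, 0).

split_first(Bin, S) when S < size(Bin) ->
    case Bin of
        <<_:S/binary, $\n, _/binary>> -> S + 1;
        _ -> split_first(Bin, S + 1)
    end;
split_first(_, S) -> S.

%% Offset just past the last newline, i.e. where the trailing
%% partial line begins (0 if the binary contains no newline).
split_on_last_newline(Bin) -> split_last(Bin, size(Bin) - 1).

split_last(Bin, S) when S >= 0 ->
    case Bin of
        <<_:S/binary, $\n, _/binary>> -> S + 1;
        _ -> split_last(Bin, S - 1)
    end;
split_last(_, _) -> 0.
```

For <<"abc\ndef\nghi">> these return 4 and 8, so Head = <<"abc\n">>, Data = <<"def\n">>, and Tail = <<"ghi">> in scan_file/1.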

III. Spawn workers

["Luke's spawn_worker" Luke's spawn_worker] are two small functions, they are very useful, stable and good abstract on processing workers:

spawn_worker(Parent, F, A) ->
    erlang:spawn_monitor(fun() -> Parent ! {self(), F(A)} end).

wait_result({Pid, Ref}) ->
    receive
        {'DOWN', Ref, _, _, normal} -> receive {Pid, Result} -> Result end;
        {'DOWN', Ref, _, _, Reason} -> exit(Reason)
    end.

So, I can start a series of workers simply by:

read_file(FileName, Size, ProcN, BmCtx) ->
    [spawn_worker(self(), fun scan_file/1, {FileName, Size, I, BmCtx})        
     || I <- lists:seq(0, ProcN - 1)].

And then collect the results by:

    Results = [wait_result(Worker) || Worker <- read_file(FileName, ?BUFFER_SIZE, ProcN1, BmCtx)],

In this case, the returned Results is a list of {Dict, {Seq1, Head}, {Seq2, Tail}}.

IV. Concat segments to a binary and scan it

Each chunk is split into Head (first line), line-bounded Data, and Tail (last line). The Head and Tail segments are left pending for further processing. After all workers have finished scanning Data (each producing a Dict), we can finally sort these pending segments by sequence number, then concat and scan them in the main process.

Unzip the results into Dicts and Segs, and sort Segs by sequence number:

    {Dicts, Segs} = lists:foldl(fun ({Dict, Head, Tail}, {Dicts, Segs}) ->
                                        {[Dict | Dicts], [Head, Tail | Segs]}
                                end, {[], []}, Results),
    Segs1 = [Seg || {_, Seg} <- lists:keysort(1, Segs)],    

The sorted segments form a list of binaries; list_to_binary/1 can concat them into one binary efficiently. You do not need to care whether it's a deep list; it will be flattened automatically:

    Dict = scan_chunk({list_to_binary(Segs1), BmCtx}),
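The per-chunk Dicts still need to be folded into one final result. Assuming each Dict maps a key (e.g. a matched URI) to a count, a sketch of that merge using dict:merge/3 could be:

```erlang
%% Fold a list of count dicts into one, summing values for keys
%% that appear in more than one dict.
merge_dicts(Dicts) ->
    lists:foldl(fun(D, Acc) ->
                        dict:merge(fun(_Key, V1, V2) -> V1 + V2 end, D, Acc)
                end, dict:new(), Dicts).
```

The Dict produced from the joined segments can simply be included in the list passed to merge_dicts/1.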


Thinking in parallel is fairly simple in Erlang, right? Most of the code can run in one process, and if you want to spawn workers for parallelism, you do not need to modify the code much. In this case, scan_file/1 is sequential and just returns everything you want; you can spawn a lot of workers that do the scan_file/1 work, then collect the results later. That's all.