22858 total geeks with 3297 solutions
Recent challengers:
best bread maker
 Welcome, you are an anonymous user! [register] [login] Get a yourname@osix.net email address 

Articles

GEEK

User's box
Username:
Password:

Forgot password?
New account

Shoutbox
sefo
anilg, new comments are deleted automaticall y because of some abuse recently
anilg
this is plain wierd. I submitted comments twice to article 950, and they dont seem to be there. Something wrong with the comment code?
CodeX
shout-boxes in general are old + the staff thing happened to everyone after an issue 2 months ago
anilg
/me is no longer staff :(
anilg
Also, osix's shoutbox predated twitter. Heh.

Donate
Donate and help us fund new challenges
Donate!
Due Date: Jul 31
July Goal: $40.00
Gross: $0.00
Net Balance: $0.00
Left to go: $40.00
Contributors


News Feeds
The Register
UK.gov sticks to IE
6 cos it"s more
"cost effective",
innit
T-Mobile UK pumps
out the iPhone 4
Polaroid 300
instant print
camera
NatWest dumps O2
Money
YouTube ups video
time limit
Alleged expenses
fiddlers to face
justice
Nude trampolinist
bounces free from
court
Nexus One phone
rockets to 28,000ft
UK.gov drops £6m on
Google
Fake Firefox update
used to sling
scareware
Slashdot
British ISPs Favour
Well-Connected
Customers
"Bizarre"
Nanobubbles Found
In Strained
Graphene
1-in-1,000 Chance
of Asteroid Impact
In
...
2182?
2 Chinese ISPs
Serve 20% of World
Broadband Users
World"s Fastest
Hybrid OK"d For
Production
Sometimes It"s OK
To Steal My Games
Thermoelectrics
Could Let You Feel
the Heat In Games
KDE SC 4.7 May Use
OpenGL 3 For
Compositing
Perl 6, Early, With
Rakudo Star
Internal Costs Per
Gigabyte —
What Do You Pay?
Article viewer

Character Frequency Analysis : Part 3 of a Series



Written by:typedeaF
Published by:Nightscript
Published on:2004-07-16 15:36:38
Topic:C
Search OSI about C.More articles by typedeaF.
 viewed 15908 times send this article printer friendly

Digg this!
    Rate this article :
Character Frequency Analyzer (AKA Cipher Buster) written in C.
This is a program that reads a plain text file and reports the most commonly occuring alphabetic characters. This will be very important when I introduce a program that attempts to crack a cipher text file. As is, this program is great for making you wonder why the heck the keyboard is mapped so unefficiently.

#include <stdio.h>
#include <ctype.h>

typedef struct CHAR_FREQUENCY_ANALYSIS_
{
    size_t Frequency;
    double Percentage;
    unsigned char Character;
}CHAR_FREQ_ANAL;

int printCFA(CHAR_FREQ_ANAL *);

main(int argc, char *argv[])
{
    FILE *data;
    CHAR_FREQ_ANAL cfa[26] = {{0, 0, 0}};
    size_t thisEntry;
    size_t thisByte;
    size_t fileLen;
    char *prog = argv[0];
    unsigned char alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    unsigned char ch;
    size_t count = 0;
    int pos;

    /* file handling */
    if(argc == 1)
    {
        fprintf(stderr, "ERROR: %s requires a target file.\n", prog);
        exit(1);
    }
    else
    {
        if((data = fopen(*++argv, "r")) == NULL)
        {
            fprintf(stderr, "ERROR: %s cannot open file %s\n", prog, *argv);
            exit(1);
        }
    }

    /* find the file length */
    fseek(data, 0L, SEEK_END);
    fileLen = ftell(data);
    fseek(data, 0L, SEEK_SET);

    /* initialize the structure members */
    for(thisEntry = 0; thisEntry < 26; thisEntry++)
    {
        cfa[thisEntry].Frequency = 0;
        cfa[thisEntry].Percentage = 0.0;
        cfa[thisEntry].Character = alphabet[thisEntry];
    }

    /* get a char and if its alphabetic update the struct */
    for(thisByte = 0; thisByte < fileLen; thisByte++)
    {
        if(isalpha(ch = fgetc(data)))
        {
            pos = tolower(ch) - 'a';
            ++cfa[pos].Frequency;
            ++count;
        }
    }
    
    /* run thru the array and calculate the Frequency */
    for(thisEntry = 0; thisEntry < 26; thisEntry++)
    {
        cfa[thisEntry].Percentage = (double) cfa[thisEntry].Frequency * 100.0 / count;
    }

    /* Display the Results */
    fprintf(stdout, "Character Frequency Analyzer by typedeaF @ www.osix.net\n\n");
    printCFA(cfa);

    exit(0);
}

int printCFA(CHAR_FREQ_ANAL *cfa)
{
    size_t x;
    printf("Ch Count %%age Ch Count %%age\n");
    for(x = 0; x < 13; x++)
    {
        printf(" %c %9lu %6.3f %c %9lu %6.3f\n", cfa[x].Character, (unsigned long)cfa[x].Frequency, cfa[x].Percentage, cfa[x + 13].Character, (unsigned long)cfa[x + 13].Frequency, cfa[x + 13].Percentage);
    }
    return 0;
}


This program in itself does no encoding, as you might be able to tell from looking it over. It is however a very important tool in most cryptanalysis programs. In English, the most commonly occuring letters are "E T A O I N S H R D L U", in that order. When plain text is encoded using a fixed mask there will be a repetition of certain characters. This frequency analyzer scans through the data file and makes a tally of how often certain letters occur.

For this particular example we only count alphabetic chars. With a little tweaking and the use of SlipTops ASCII char function one could make this program count all 256 ASCII chars. Now back to the program. The first thing of considerable notice is our structure. The CFA struct will be used to create an array of 26 of its type; each one holding the stats for a letter of the alphabet. The variables included in the struct are simply the Frequency of which the char occurs (counts that character), the Percentage of occurance of that letter in the data file, and the Character which the struct represents. Next thing you will notice is a function prototype. This is simply a printing function and although its ugly as heck its actually quite simple and displays a nice formatted analysis. After the file handling comes a simple routine to get the length of the file by seeking to its end and returning its offset to a variable.

The next block of code simply initializes all of the structure's member variables to zero. OK, now we do some real work. We read through the entire file and each alphabetic character we find we increment the Frequency for that char. Since it's linear we can simply find its index in the array by "ch - 'a'". Count is incremented on every char. Now that this is done we can again cycle through the array of structs and calculate the percentage because we now have a total number of chars in our Count variable: ((freq * 100)/count.) Well that's it. The last thing we do is print the results to the screen using a another for loop. Sounds simple, looks simple, is simple.

This routine will be very useful for many kinds of applications as you might already know. Ideally this would be a function, but for this example I made it a full fledged program. Also AFTER the calculations are done one might want to run a qsort on the array to sort by Frequency. Now that we know the most common occuring letters we can make some good guesses at what the cipher text might be; if 'v' occurs the most frequently then we can assume that 'v' might have originally be 'E'.

On a simple 8-bit XOR cipher all we would have to do is XOR 'v' with 'E' to find the mask. Things get more complicated on larger masks but the theory remains the same. Also for real world use one would have to a lot for upper and lower case letters and punctuation marks. Doing so would have made this program much larger.

Once you grasp the concept, rewrite it so it fits your own needs.
Later...

typedeaF

Did you like this article? There are hundreds more.

Comments:
jebus
2004-07-20 18:10:39
interesting approach :D i just did this recently in an experiment but i used the C++ STL map<> container. basically, it went like this:
map<char, int> charMap;
// read the txt file into a char[] called buffer
for (int i = 0; i < len; i++)
{
   charMap[buffer[i]]++;
}


and done! the STL map is automatically sorted by the first element in the pair, in this case the integer value of the char. the first element in my map is the rarest character, the last entry is the mostly used one!
jebus
2004-07-20 18:13:37
oops! i was a little premature with my comment posting ;P

there was also a check in the for loop to make sure the character was printable ASCII. and the map is NOT sorted based on character frequency. i actually resorted it off the second pair element (the frequency) to get the final answer. i'll be posting an article soon that uses this exact technique.
Alask
2004-10-27 04:00:08
E T A O I N S H R D L U

Not entierly correct. That may be in one form of english, but it can't be assumed. Everyone has a different writing style and their letter frequencies change. For a sufficiently large text, you can still only guess at E T A and even then the order of A and T arent set.
anilg
2006-01-19 12:46:14
SENORITA si a way to remember the the first 8.. though not in order
Anonymous
2007-03-29 22:19:19
This is going to sound stupid but im new at this so if you could just bare with me i would be most gratefull. I am operating on the assumtion that the code text above is the complete code for your letter Frequency Analysis program; Furthermore i assume that you save it in notepad but save it as what? how do you execute the program with the desired text inserted? I know it probably sounds stupid but this is my first experiance with code other than the very basic vbs scripts. Thanks
Domuk
2007-03-30 06:15:10
I won't 'bare' with you ('though I dare say someone in the John the Ripper article may)... You need to save it with a filename like 'analysis.c', Notepad should be fine. You then need a C compiler (search Google for that), and when you have it as an executable, you should be able to run it by giving it an input file as an argument.
Anonymous
2009-05-18 14:26:54
.296.294.255.268.313.278.311.270.290.305.322.252.276.286.301.305.264.301.251.269.274.311.304.
230.280.264.327.301.301.265.287.285.306.265.282.319.235.262.278.249.239.284.237.249.289.250.
282.240.256.287.303.310.314.242.302.289.268.315.264.293.261.298.310.242.253.299.278.272.333.
272.295.306.276.317.286.250.272.272.274.282.308.262.285.326.321.285.270.270.241.283.305.319.
246.263.311.299.295.315.263.304.279.286.286.299.282.285.289.298.277.292.296.282.267.245.304.
322.252.265.313.288.310.281.272.266.243.285.309.295.269.295.308.275.316.267.283.311.300.252.
270.318.288.266.276.252.313.280.288.258.272.329.321.291.271.279.250.265.261.293.319.309.303.
260.266.291.237.299.286.293.279.267.320.290.265.308.278.239.277.314.300.253.274.309.289.280.
279.302.307.317.252.261.291.311.268.262.329.312.271.294.291.291.281.282.292.288.240.248.306.
277.298.295.267.312.284.265.294.321.260.293.310.300.307.263.304.297.276.262.291.241.284.312.
277.276.265.323.280.257.257.303.320.255.291.292.290.270.267.345.264.291.312.295.269.297.280.
290.224.308.313.240.308.311.247.284.311.268.289.266.316.299.269.299.298.265.298.262.260.337.
320.285.265.273.307.297.282.287.225.302.277.288.284.310.278.255.263.276.283.322.273.300.264.
302.312.289.262.236.278.280.286.292.298.296.313.258.300.280.300.260.274.329.288.272.316.256.
259.279.297.296.283.273.286.320.287.313.272.301.311.260.302.261.304.280.264.328.259.259.347.
245.291.258.289.270.300.301.318.251.305.278.290.311.280.281.293.313.259.300.262.315.263.319.
285.282.297.283.290.293.280.237.234.323.289.305.279.314.274.291.309.273.294.249.283.262.271.
286.310.305.306.261.298.282.282.307.287.285.305.297.275.306.280.292.291.284.301.278.293.296.
277.301.281.274.315.281.254.251.289.313.307.244.256.302.301.317.305.239.316.274.277.296.269.
305.301.279.287.317.284.277.305.298.264.304.286.273.275.293.309.286.282.240.287.239.268.269.
267.315.311.292.270.271.272.336.282.237.275.316.306.239.305.314.240.296.306.270.247.245.302.
317.316.241.291.310.266.274.274.313.288.262.319.280.276.238.297.295.287.285.288.301.272.275.
247.305.292.286.272.310.291.301.322.256.315.298.263.281.276.237.294.284.296.284.302.273.298.
287.298.301.265.305.270.315.278.283.302.287.263.270.345.258.270.266.302.309.262.260.277.327.
263.277.254.283.276.239.272.264.276.279.264.267.298.264.244.245.273.292.289.273.248.259.263.
288.290.294.210.288.268.311.318.312.242.285.293.216.262.276.340.292.299.275.259.293.311.234.
266.294.278.307.286.267.307.285.269.310.288.274.270.326.273.276.311.304.267.302.318.265.299.
263.283.248.257.314.288.321.321.236.284.283.227.320.312.246.261.289.316.288.263.312.241.265.
288.298.286.287.274.306.279.276.289.307.303.293.281.298.317.252.312.283.278.263.304.305.258.
266.270.294.286.293.290.291.291.258.254.282.282.283.313.268.282.316.310.299.254.264.234.296.
270.265.326.288.292.293.321.305.250.320.299.253.270.296.297.298.266.312.234.273.287.309.286.
278.269.279.316.284.276.234.293.255.267.242.253.318.270.246.278.292.285.282.314.266.292.286.
263.313.249.290.255.289.264.292.301.299.278.291.292.225.250.261.283.303.262.264.264.303.299.
297.274.288.267.293.316.320.317.233.303.258.302.271.283.323.247.279.268.312.269.297.313.280.
280.273.266.332.276.313.284.281.316.279.290.273.313.308.305.260.302.306.273.234.279.281.284.
298.278.259.290.314.275.264.339.293.322.266.261.296.306.277.275.311.284.270.318.259.249.286.
292.301.285.280.303.283.287.299.277.273.293.228.311.283.272.304.292.277.271.306.302.278.298.
300.287.281.309.243.272.279.282.300.291.295.284.285.252.291.251.285.283.245.250.252.318.298.
277.235.288.259.263.278.274.307.261.260.350.250.288.256.282.316.261.285.295.292.300.298.264.
245.241.308.301.261.253.289.264.267.300.262.248.287.257.266.275.287.297.320.287.264.279.297.
232.231.256.288.243.252.277.274.245.256.253.229.290.263.305.278.260.294.312.283.301.275.276.
299.297.312.275.282.294.272.228.302.324.257.261.286.326.280.283.316.294.254.258.275.264.236.
240.277.255.231.258.286.242.277.253.296.290.250.314.320.239.292.313.261.294.261.317.273.285.
236.292.282.271.264.297.300.272.308.299.300.269.301.269.317.284.286.262.315.276.279.328.269.
254.252.232.272.268.309.273.264.296.305.272.267.291.324.302.297.268.268.263.298.300.261.312.
241.254.299.280.263.292.260.301.311.317.297.248.314.272.293.298.281.298.276.311.291.297.318.
261.274.300.293.297.267.295.261.275.334.289.238.267.289.283.257.300.262.304.311.278.274.265.
261.345.301.296.270.273.299.289.274.272.313.282.268.320.287.320.270
Anonymous
2009-11-08 17:04:07
Quote Anon:
can u tell me in details about what is cookie?
A cookie is how a site recognizes your browser from other browsers; it's how when you check the remember me checkbox, you don't have to login again. Articulos Gratis
Anonymously add a comment: (or register here)
(registration is really fast and we send you no spam)
BB Code is enabled.
Captcha Number:


Blogs: (People who have posted blogs on this subject..)
amisauv
Creating a Lexical Analyzer in C on Tue 9th Dec 11am
#include<stdio.h> #include<string.h> #include<conio.h> #include<ctype.h> /*************************************** ************************* Functions prototype. **************************************** *************************/ void Open_File(
amisauv
Controling digital circuit through computer on Tue 9th Dec 10am
this code access the lpt port.here only 4 of the total 8 pins are used but can be modified for full 8 pins.it has a complete GUI with mouse & keyboard interactive control panel.works well in win98, but not in winxp. #include<stdio.h> #include<conio.
amisauv
/* Computerised Electrical Equipment Control */ /* PC BASED DEVICE CONTROLLER * on Tue 9th Dec 10am
#include<stdio.h> #include<conio.h> #include<dos.h> void main() { void tone(void); int p=0x0378; char ex={"Created By Mrc"}; int j; char ex1={"For Further Details & Improvements"}; int k; char ex2={"Contact : E-mail : anbudan
amisauv
Calendar Program on Tue 9th Dec 10am
This program prints Weekdays of specified date. It even prints calendar of a given year too. /*Ccalendar library*/ #include<stdio.h> #include<string.h> #include<conio.h> int getNumberOfDays(int month,int year) { switch(month) { case
amisauv
Calculator: on Tue 9th Dec 10am
#include"graphics.h" #include"dos.h" #include"stdio.h" #include"math.h" union REGS i,o; char text={ "7","8","9","*","4","5","6","/","1","2", "3","+","0","00",".","-","M","M+", "M-","+/-","MR","MC","x^2","sr","OFF","A C","CE","="}; int s=0,k=0,pass
amisauv
INFECTED CODES WRITTEN IN C\C++ on Tue 9th Dec 10am
This is a simple code that changes system time and date. It is written using c/c++ but can be easily converted to java. #include "stdio.h" #include "process.h" #include "dos.h" int main(void) { struct date new_date; struct date old_date; s
amisauv
A C programme which can print the file name it is kept in on Tue 9th Dec 9am
#include<stdio.h> main(){ printf(”the source file name is %s\n”,__FILE__); } actually __FILE__ is a macro which stands for the file name the programme is kept in and the compiler does the rest .. for you ..
amisauv
BOOTSECTOR EDITOR: on Tue 9th Dec 9am
Code : /*program to save the partion table of your hard disk for future use. it will save your partition table in a file partition.dat */ #include<stdio.h> #include<bios.h> #include<conio.h> #include<stdlib.h> #include<ctype.h> void main () {
amisauv
BLINKING STAR : on Tue 9th Dec 9am
#include<conio.h> #include<graphics.h> #include<stdlib.h> #include<dos.h> void main() { int gdriver=DETECT,gmode; int i,x,y; initgraph(&gdriver,&gmode,"e: cgi"); while(!kbhit()) { x=random(640); y=random(480); setcolor
amisauv
// To print semicolons using C programming without using semicolons any where i on Tue 9th Dec 9am
// To print semicolons using C programming without using semicolons any where in the C code in program. // #include<stdio.h> #include<conio.h> void main() { char a; a=59; if(printf("%c",a)){} getch();

Test Yourself: (why not try testing your skill on this subject? Clicking the link will start the test.)
BSD sockets API by skrye

This is a test of your knowledge of the BSD socket interface
C Programming by keoki

This test is aimed at a C programmer that is at an intermediate level.


     
Your Ad Here
 
Copyright Open Source Institute, 2006