Stories by Furkan KAMACI on Medium

Unlocking the Secrets of Cryptography: A Beginner’s Guide to Breaking Ciphers — Part 3

Furkan KAMACI — Mon, 08 May 2023 11:46:38 GMT


This is the third and final post in my breaking ciphers series. If you haven’t read the previous one, you can find it here:

https://furkankamaci.medium.com/unlocking-the-secrets-of-cryptography-a-beginners-guide-to-breaking-ciphers-part-2-a4ab2996e522

In my previous post, I showed how to break a product cipher composed of multiple Porta Ciphers. In this post, I will make the cipher much more complicated and try to break it again.

I will use the same ciphertext for this post:

https://medium.com/media/521475316a38a7e1a824557c2cf1a507/href

Cryptanalysis of Combination of Porta Cipher and Permutation Cipher

As I have demonstrated, the Product cipher of Porta Ciphers results in an idempotent system. As Shannon proposed, taking the product of substitution-type ciphers with permutation-type ciphers is a commonly used technique to fix that issue.

We have m! possibilities to search for the key of the permutation cipher, where m is the permutation key length. In total, the combination of Porta Cipher and Permutation Cipher has a key space of 13^key length ∗ m!, and this combination is no longer idempotent.

To investigate it, I created a permutation cipher using that permutation key:

At this point, you may wonder what a Permutation Cipher is. A Permutation Cipher is a type of encryption technique that involves rearranging the order of characters or blocks of text in the plaintext message to produce the ciphertext.

In a Permutation Cipher, each character or block of characters in the plaintext is assigned a unique position or index, and then the positions are shuffled or permuted according to a specific key. The resulting permutation is used to encrypt the plaintext message, resulting in a ciphertext that appears jumbled and unintelligible without knowledge of the key.

Here is how it works for my permutation key: divide the plaintext into chunks of length 6, then, within each chunk, move the 0th character to the 3rd position, the 1st character to the 2nd position, and so on.

The resulting character frequency output of the permutation cipher is:

Character frequency of permutation ciphertext

As expected, we can observe that the character frequency remains unchanged when using the permutation cipher compared to using only the Porta Cipher.

The Porta Cipher repeats its key to encrypt a text of any length. The permutation cipher, however, needs a solution when the text length is not a multiple of the permutation key length. Padding is one option, but since our text is long, we can simply remove the last 2 characters to align it with the permutation key length and use the truncated text as the input of the permutation cipher.

For the permutation cipher operations, I created the following class:

https://medium.com/media/a79b3b0d887c104b7d7ec222afdbf3a8/href
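The block rearrangement described above can be sketched in a few lines. This is only an illustration, not the class from the post: the function names are mine, and the key `(3, 2, 5, 0, 4, 1)` is hypothetical (only its first two entries, 0 to 3 and 1 to 2, come from the text). The incomplete trailing block is dropped, as discussed above.

```python
def permute_encrypt(text, key):
    """Send the character at offset i of each block to offset key[i]."""
    m = len(key)
    usable = len(text) - len(text) % m  # drop the incomplete trailing block
    out = []
    for start in range(0, usable, m):
        block = [""] * m
        for i, target in enumerate(key):
            block[target] = text[start + i]
        out.append("".join(block))
    return "".join(out)


def permute_decrypt(text, key):
    """Undo permute_encrypt: offset key[i] of each block goes back to offset i."""
    m = len(key)
    usable = len(text) - len(text) % m
    return "".join(text[s + key[i]] for s in range(0, usable, m) for i in range(m))


KEY = (3, 2, 5, 0, 4, 1)  # hypothetical key; only the first two entries come from the post
cipher = permute_encrypt("ABCDEFGH", KEY)
print(cipher)                        # DFBAEC ("GH" is dropped)
print(permute_decrypt(cipher, KEY))  # ABCDEF
```

Because letters are only moved, never replaced, the letter counts of the output are identical to those of the input.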

For a key length of 6, we can find the original key by trying at most 6! = 720 possibilities.

When I tried it with the code below, I could easily find the permutation key:

https://medium.com/media/e5e904f408da0be2f251bcb93f806242/href

I found the permutation key in epoch 412. However, since this is an exhaustive search, the number of epochs may vary depending on the key, and we still need to try m! possibilities.
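As a rough sketch of such an exhaustive search, the loop below walks through all m! permutations and stops at the first one accepted by a caller-supplied check. The `is_solution` predicate is an assumption of this sketch (for example a known-plaintext comparison or a fitness threshold); the post does not state how a correct decryption was recognized.

```python
from itertools import permutations


def decrypt_blocks(text, key):
    """Inverse of 'offset i goes to offset key[i]' applied block by block."""
    m = len(key)
    usable = len(text) - len(text) % m
    return "".join(text[s + key[i]] for s in range(0, usable, m) for i in range(m))


def exhaustive_search(ciphertext, m, is_solution):
    """Try every permutation of range(m); return (key, attempts)."""
    attempts = 0
    for key in permutations(range(m)):
        attempts += 1
        if is_solution(decrypt_blocks(ciphertext, key)):
            return key, attempts
    return None, attempts


# Toy run: the "ciphertext" is ABCDEF permuted with the hypothetical key (3, 2, 5, 0, 4, 1).
key, attempts = exhaustive_search("DFBAEC", 6, lambda text: text == "ABCDEF")
print(key, attempts)
```

In the worst case the loop runs 720 times for m = 6, and the count grows factorially with m.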

We can find the solution faster with a better search approach. Therefore, I decided to use a metaheuristic optimization algorithm to overcome this issue. There are several alternatives among metaheuristic optimization algorithms; Hill Climbing and Genetic Algorithms are among the most widely used.

Hill Climbing is a simple and efficient algorithm that can converge quickly to a local optimum, especially in small search spaces. However, it can get stuck in a local optimum and fail to find the global optimum, and it may not be effective in large or complex search spaces.

On the other hand, Genetic Algorithms are more robust and versatile than Hill Climbing. They can explore large search spaces efficiently and can potentially find global optima. However, they require more computational resources than Hill Climbing, and Genetic Algorithms can be slower in converging to a solution.

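As I mentioned, our key space is 13^key length ∗ m!. Since both the Porta key length and the permutation key length are 6, we should try:

13^6 ∗ 6! = 4,826,809 ∗ 720 = 3,475,302,480

possibilities, which is large enough to make a naive exhaustive search costly. Hence, I implemented a Genetic Algorithm as my metaheuristic optimization algorithm.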

Genetic Algorithm flow

The product cipher is a combination of two distinct ciphers, namely, the Porta Cipher and the Permutation Cipher. Although the Porta Cipher has a larger key space than the permutation cipher, it is still susceptible to frequency analysis. Conversely, the permutation cipher has a smaller key space, but it is resistant to frequency analysis. Therefore, what if we assume that the output of the product cipher is not an output of a permutation cipher and try to break it? This approach can potentially reveal a shuffled key as the result, which we can then attempt to crack to determine the correct shuffle order. As a result, we can break that product cipher not within a 13^key length ∗ m! key search space, but 13^key length + m!.

Below are the results of the Chi-squared Statistics when directly applying each key element transformation to each cluster of permutation cipher output:

So, the candidate key is all of the combinations of these letters:

One of the candidate keys is:

This is exactly a shuffled version of the original key. However, I didn’t use that information since I assumed that I don’t know how to crack the intermediate cipher, Porta Cipher.

I implemented a Genetic Algorithm as I mentioned. I created a weighted trigram approach as the fitness function. Here is the code I implemented:

https://medium.com/media/487b772a047988342b14d8974f6b158b/href
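To make the flow concrete, here is a minimal sketch of a Genetic Algorithm over permutations. It is not the code from the post: elitism, order crossover and swap mutation are my own choices, every parameter value is an assumption, and the weighted trigram fitness is replaced by a toy fitness that counts positions matching a hypothetical hidden key.

```python
import random


def order_crossover(a, b, rng):
    """Copy a random slice from parent a and fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    middle = a[i:j]
    rest = [g for g in b if g not in middle]
    return tuple(rest[:i]) + tuple(middle) + tuple(rest[i:])


def swap_mutation(perm, rng, rate):
    """With probability `rate`, swap two positions of the permutation."""
    perm = list(perm)
    if rng.random() < rate:
        x, y = rng.sample(range(len(perm)), 2)
        perm[x], perm[y] = perm[y], perm[x]
    return tuple(perm)


def genetic_search(m, fitness, population_size=20, generations=30,
                   mutation_rate=0.3, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.sample(range(m), m)) for _ in range(population_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[: population_size // 2]  # the better half survives unchanged
        children = []
        while len(elite) + len(children) < population_size:
            mother, father = rng.sample(elite, 2)
            child = order_crossover(mother, father, rng)
            children.append(swap_mutation(child, rng, mutation_rate))
        population = elite + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


TARGET = (3, 2, 5, 0, 4, 1)  # hypothetical hidden key, used only by the toy fitness


def toy_fitness(perm):
    return sum(p == t for p, t in zip(perm, TARGET))


best, history = genetic_search(6, toy_fitness)
print(best, history[-1])
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. In a real attack the fitness would score the decrypted text, for example with trigram statistics.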

With only 60 evaluations, I was able to discover the solution, which is significantly fewer than the m! = 720 possibilities that would have been required. This solution would also address the issue in the event of a lengthier key:

Decryption of product cipher with Genetic Algorithm

Here is the key that I found, which is the same as the original one:

The proposed method searches a space of 13^key length + m! instead of 13^key length ∗ m!. Furthermore, as previously mentioned, by processing the elements of the key independently, we evaluate only 13 ∗ key length candidates for the Porta Cipher. Therefore, we need to consider 13 ∗ key length + m! possibilities. Moreover, the m! part of the Permutation Cipher can be reduced using the Genetic Algorithm as demonstrated. The effect of narrowing the key search space is shown below (without adding the leverage of the Genetic Algorithm):

Improvement of key search space with proposed approach
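For the key lengths used in this post (a Porta key of length 6 and a permutation key of length 6), the three search-space sizes can be computed directly. The variable names are mine.

```python
from math import factorial

key_length = 6  # Porta key length
m = 6           # permutation key length

joint = 13 ** key_length * factorial(m)       # search both keys together
separate = 13 ** key_length + factorial(m)    # break the two ciphers one after the other
per_element = 13 * key_length + factorial(m)  # score each Porta key element on its own

print(joint)        # 3475302480
print(separate)     # 4827529
print(per_element)  # 798
```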

As I demonstrated, the proposed method exposes both keys of the Porta Cipher and Permutation Cipher.

Conclusion

I hope you enjoyed this series.

Don’t forget to follow me on Twitter: https://twitter.com/kamaci_furkan

Stay tuned!

References

[1] David Kahn. The Codebreakers: The Comprehensive History of Secret Communication from Ancient Times to the Internet. Scribner, 1996, p. 139. isbn: 9780684831305.

[2] Keith M. Martin. Everyday Cryptography. Oxford University Press, 2012, p. 142. isbn: 9780191625886.

[3] Berna Örs Yalçın. Lecture notes in Cryptography. Apr. 2023.

[4] William F Friedman. The index of coincidence and its applications in cryptography. Riverbank Laboratories. Department of Ciphers. Publ. OCLC, 1922.

[5] Practical Cryptography. Chi-squared Statistic. url: http://practicalcryptography.com/cryptanalysis/text-characterisation/chi-squared-statistic/. (Accessed: 03.04.2023).

[6] Douglas R. Stinson and Maura B. Paterson. Cryptography: Theory and Practice. 4th. Boca Raton, FL: CRC Press, 2019, pp. 75–78. isbn: 9781138197015.

[7] Vittorio Maniezzo et al. Matheuristics, Algorithms and Implementations. Springer International Publishing, 2021.

[8] Bill Waggener. Pulse Code Modulation Techniques. Springer, 1995, p. 206. isbn: 9780442014360.

Unlocking the Secrets of Cryptography: A Beginner’s Guide to Breaking Ciphers — Part 2

Furkan KAMACI — Fri, 05 May 2023 18:40:52 GMT


Photo by Tom Roberts on Unsplash

This is the second post in my breaking ciphers series. If you haven’t read the previous one, you can find it here:

https://furkankamaci.medium.com/unlocking-the-secrets-of-cryptography-a-beginners-guide-to-breaking-ciphers-part-1-67d79c470127

In my previous post, I showed how to apply frequency analysis and break a cipher. In this post, I will make the cipher more complicated and try to break it again.

I will use the same ciphertext for this post:

https://medium.com/media/521475316a38a7e1a824557c2cf1a507/href

Cryptanalysis of Product Cipher Composed of Porta Ciphers

As I defined previously, creating substrings and measuring their closeness to English letter characteristics after applying transformations allows us to gain insight and figure out key elements.

How about applying a Porta Cipher to the output of another Porta Cipher? Will it increase the security?

If both Porta Ciphers of that product cipher have the same key lengths, we can smoothly apply the previous approach because encrypting a letter with one key and then encrypting it again with another key is akin to using a different type of tableau. Here is an example:

Porta Cipher with single key

When we use that ciphertext as an input of a Porta Cipher, we will get the plain text as expected:

Reusing same key for Porta Cipher

So, to create a noteworthy product cipher, we should choose a different key than the original one.

As an example:

Applying a different key for Porta Cipher

Even though we used different keys and created a ciphertext from a ciphertext, it is similar to using only one Porta Cipher. We can demonstrate the above product cipher as:

One may assume that obtaining Q and then K is equivalent to directly acquiring K upon initial observation. However, the reality differs from this assumption.

The character frequency of the second cipher’s ciphertext, obtained through the product cipher using MUDFOG as the key for the second cipher, is presented below.

Character frequency of multiple Porta Ciphers

Upon calculating the IoC for the ciphertext, we obtain the following result:

IoC values for random key length of multiple Porta Ciphers

The initial hypothesis, that obtaining Q and then K is equivalent to directly acquiring K, appears to hold: the key length is still 6, and the resulting output is identical to using only the first cipher. However, this initial judgment is not definitive since, as previously noted, the Porta Cipher is reciprocal such that [A-M] is mapped to [N-Z] and vice versa. Therefore, there is no direct correspondence between the composition of two Porta Ciphers and the original plaintext.

To resolve this issue, I encrypted the output of the second Porta Cipher with another Porta Cipher of key length 6. Upon analyzing the resulting ciphertext, I obtained the following frequency distribution:

Character frequency of triple Porta Ciphers

The IoC values yield the following results:

IoC values for random key length of triple Porta Ciphers

We observe that the key length is consistent with the earlier results, which prompts an attempt to treat the ciphertext as the output of a non-product cipher and decipher it accordingly. Furthermore, we got the reciprocal versions of the characters compared to using two Porta Ciphers. Below are the results of the Chi-squared Statistics when applying each key element transformation to each cluster:

So, the candidate key is all of the combinations of these letters:

I tried that as a candidate key:

Therefore, this triple encryption has become idempotent, and it is evident that iterating it does not enhance its security. In summary, we have the following:

Regardless, it is equivalent to the following:

So, the key space remains 13^key length.
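The collapse can be checked letter by letter. The sketch below assumes one common layout of the Porta tableau (key pair AB maps A-M onto N-Z unshifted, and each later pair rotates the lower half by one position); the tableau used in this series is not reproduced here and may differ. Under that assumption, two passes keep every letter inside its own half of the alphabet, and three passes with key pair indices k1, k2, k3 equal a single pass with index (k1 - k2 + k3) mod 13.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def porta_letter(c, k):
    """Encrypt or decrypt one letter with key pair index k (0 = AB, 1 = CD, ...)."""
    p = ord(c) - ord("A")
    if p < 13:  # A-M maps into N-Z
        return chr(ord("N") + (p + k) % 13)
    return chr(ord("A") + (p - 13 - k) % 13)  # N-Z maps back into A-M


def double_stays_in_half(k1, k2):
    """After two passes, a letter from A-M is still in A-M (same for N-Z)."""
    return all((c < "N") == (porta_letter(porta_letter(c, k1), k2) < "N")
               for c in ALPHABET)


def triple_equals_single(k1, k2, k3):
    """Three passes behave like one pass with a combined key pair index."""
    combined = (k1 - k2 + k3) % 13
    return all(
        porta_letter(porta_letter(porta_letter(c, k1), k2), k3) == porta_letter(c, combined)
        for c in ALPHABET
    )


print(all(double_stays_in_half(a, b) for a in range(13) for b in range(13)))
print(all(triple_equals_single(a, b, c)
          for a in range(13) for b in range(13) for c in range(13)))
```

Both checks print `True`, so each key position of the triple encryption still has only 13 effective values.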

On the other hand, a question may arise regarding the case of having different key lengths. However, we should comprehend that having a key with a length of 6 is equivalent to having a key with a length of 12, achieved by repeating the original key once more. Therefore, if we have a sufficiently long ciphertext, we can aim for the least common multiple of the key lengths and attempt to break it accordingly as a solution.

Overall, we applied the Porta Cipher multiple times, but security did not increase. This property is called idempotency and was discussed by Shannon. He proposed using the product of substitution-type ciphers with permutation-type ciphers to fix that issue.

What is Next?

In the next post, I will try to break a combination of Porta Cipher and Permutation Cipher.

Don’t forget to follow me on Twitter: https://twitter.com/kamaci_furkan

Stay tuned!


Unlocking the Secrets of Cryptography: A Beginner’s Guide to Breaking Ciphers — Part 1

Furkan KAMACI — Tue, 02 May 2023 14:21:42 GMT


Photo by Mauro Sbicego on Unsplash

Historical Background

Cryptography has a long and fascinating history, dating back to ancient civilizations. The use of codes and ciphers allowed people to send messages that could not be understood by anyone who intercepted them.

One of the earliest known examples of cryptography comes from ancient Egypt, where hieroglyphs were often written in a code that only priests and scribes could understand. In ancient Greece, the famous Scytale was used as a cryptographic tool by Spartan military leaders to send secret messages during wartime.

During the Middle Ages, European monasteries were centers of learning and cryptography. Monks used simple ciphers and codes to protect their books and manuscripts from theft or unauthorized access. However, cryptography became much more sophisticated during the Renaissance, as new mathematical techniques were developed.

In the 20th century, the use of cryptography became increasingly important for government and military communication. During World War II, both the Allies and the Axis powers used sophisticated codes and ciphers to communicate securely. The most famous of these was the German Enigma machine, which was used to encrypt military messages and was famously broken by Alan Turing and his team at Bletchley Park.

In the modern era, cryptography has become an essential tool for protecting digital information, such as online transactions and confidential communications. Cryptography techniques are used in a wide variety of applications, from secure email to banking and e-commerce, and the field continues to evolve as new threats emerge and new encryption methods are developed.

Fundamental Definitions

Steganography: The term originates from the Greek words “steganos” meaning “covered” and “graphein” meaning “to write”. It is the practice of concealing the existence of a message as a means of achieving covert communication.

Cryptography: The term was coined from the Greek word “kryptos” meaning “hidden”. Its main objective is not to conceal the presence of a message, but rather to conceal its content through encryption.

Cryptanalysis: Refers to the process of deciphering ciphertext to plaintext without having complete knowledge of the encryption method used.

Cryptology: Cryptography + Cryptanalysis

Cryptosystem: A set of algorithms and protocols that are used to perform encryption and decryption of data, ensuring that the information remains secure and confidential during transmission and storage.

Example of a Basic Cipher: Shift Cipher

After explaining the historical background and presenting the fundamental definitions, let’s look at some examples from the classical era of cryptography. My first example is the Shift Cipher. A shift cipher is a simple form of encryption in which each letter in a message is replaced by another letter that is a fixed number of positions down the alphabet. For example, if the shift value is 3, then the letter ‘A’ would be replaced by the letter ‘D’, ‘B’ would become ‘E’, ‘C’ would become ‘F’, and so on. The shift value is also known as the key, and the message is considered encrypted as long as the key is kept secret. The shift cipher is one of the earliest and simplest methods of encryption and can be easily broken with a brute-force attack even when the shift value is unknown, because there are so few possible shifts. This method is also known as the Caesar cipher, after Julius Caesar, who used it in his private correspondence.

The shift cipher is a substitution cipher; in particular, it is a mono-alphabetic substitution cipher, since each character is always replaced with the same other character. The key space of the shift cipher in English is only 26 (25 if we exclude the trivial shift of 0). It is the general mono-alphabetic substitution cipher that has a key space of 26!, since the first letter can be replaced with 26 candidate characters, the second with 25, and so on. Nowadays, it is easy to break such a cipher, but Caesar used it long before the computer era. Here is my code to encrypt, decrypt, and run an exhaustive search for a Shift Cipher:

https://medium.com/media/5a46afac349b3f0b52ba4685f5b2183e/href
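As a minimal sketch of the same three operations (not the embedded code itself; the function names are mine), a shift cipher over the uppercase English alphabet could look like this:

```python
import string

ALPHABET = string.ascii_uppercase


def shift_encrypt(text, key):
    """Shift every letter by `key` positions; other characters are kept as-is."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) + key) % 26] if c in ALPHABET else c
        for c in text.upper()
    )


def shift_decrypt(text, key):
    return shift_encrypt(text, -key)


def shift_candidates(ciphertext):
    """Exhaustive search: decrypt with every one of the 26 possible shifts."""
    return {key: shift_decrypt(ciphertext, key) for key in range(26)}


cipher = shift_encrypt("ATTACK AT DAWN", 3)
print(cipher)                       # DWWDFN DW GDZQ
print(shift_candidates(cipher)[3])  # ATTACK AT DAWN
```

The exhaustive search only produces the 26 candidate plaintexts; picking the right one still needs a human reader or a scoring function.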

Such ciphers are vulnerable to frequency attacks and I will explain it later with an example.

Can we improve this cipher? What if we encode a character with different characters according to a certain rule? The Vigenère Cipher is a good example of this. The Vigenère Cipher is a poly-alphabetic substitution cipher named after Blaise de Vigenère, although it was first described by Giovan Battista Bellaso in 1553. It is a method of encrypting alphabetic text by using a series of interwoven Caesar ciphers, based on the letters of a keyword.

The Vigenère cipher works by using a keyword or key phrase that is repeated over and over again to generate a series of Caesar ciphers. Each letter of the keyword corresponds to a Caesar shift value, and the plaintext message is shifted by the corresponding shift value for each letter of the keyword. The result is a ciphertext message that is much more difficult to break than a simple Caesar cipher.
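A short sketch of this idea, assuming the text and keyword contain only the uppercase letters A-Z (the function name is mine):

```python
def vigenere(text, keyword, decrypt=False):
    """Shift each letter by the value of the matching keyword letter (A=0, ..., Z=25)."""
    sign = -1 if decrypt else 1
    out = []
    for i, c in enumerate(text):
        shift = ord(keyword[i % len(keyword)]) - ord("A")
        out.append(chr((ord(c) - ord("A") + sign * shift) % 26 + ord("A")))
    return "".join(out)


cipher = vigenere("ATTACKATDAWN", "LEMON")
print(cipher)                                  # LXFOPVEFRNHR
print(vigenere(cipher, "LEMON", decrypt=True)) # ATTACKATDAWN
```

Note how the two T's of ATTACK become X and F: the same plaintext letter is encrypted differently depending on its position.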

Cryptanalysis Time!

Another example for a poly-alphabetic cipher is Porta Cipher. Giovanni Battista Della Porta invented the Porta Cipher, a poly-alphabetic substitution cipher that utilizes 13 alphabets, in contrast to the Vigenere Cipher, which uses 26 alphabets. Due to its reciprocal nature, the cipher alphabets used in Porta Cipher enable the encryption and decryption processes to be the same.

The Porta Cipher employs the following tableau to encrypt a plaintext:

Porta tableau

To use the Porta Cipher to encrypt a text, we should follow these steps:

1. Choose a keyword
2. Repeat the keyword above the plaintext message
3. Find the corresponding value for each letter of the keyword and the plaintext letter in the tableau
4. Use these values to generate the ciphertext

Below is an example of encrypting a text using secret as the keyword:

Porta Cipher example

Since we use a tableau of 13 rows, key space of Porta Cipher is 13^key_length.
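The steps above can be sketched as follows. The tableau image is not reproduced here, so this sketch assumes one common layout of the Porta tableau (key pair AB maps A-M onto N-Z unshifted, and each later pair rotates the lower half by one position); other published variants differ, and the one used in this series may be among them. The input is assumed to contain only uppercase A-Z.

```python
def porta(text, keyword):
    """Porta is reciprocal, so this one function both encrypts and decrypts."""
    out = []
    for i, c in enumerate(text):
        # 13 key pairs: AB -> 0, CD -> 1, ..., YZ -> 12
        k = (ord(keyword[i % len(keyword)]) - ord("A")) // 2
        p = ord(c) - ord("A")
        if p < 13:  # A-M maps into N-Z
            out.append(chr(ord("N") + (p + k) % 13))
        else:       # N-Z maps back into A-M
            out.append(chr(ord("A") + (p - 13 - k) % 13))
    return "".join(out)


cipher = porta("DEFEND", "FORTIF")
print(cipher)                   # SYNNJS
print(porta(cipher, "FORTIF"))  # DEFEND
```

Since the two letters of a pair (for example A and B) select the same row, a keyword position has only 13 effective values, which is where 13^key_length comes from.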

Here is the encrypted text I used for this analysis:

https://medium.com/media/521475316a38a7e1a824557c2cf1a507/href

In order to analyze the complete process, I have implemented a statistics collector class that can be used in future processes. The code for this class is shown below:

https://medium.com/media/f7c95de5db194fb998580707f99f9e0f/href

First of all, the given plaintext is long enough for statistical analysis; we can see this from its frequency analysis, which is very close to the typical English letter frequencies. The letter probabilities of the English language, sorted in descending order, are as follows (the exact values differ between sources):

English letter probabilities

When we run a frequency analysis on the plaintext, we notice that:

Character frequency of plaintext

The plaintext has 13874 characters and 26 unique characters, and its character distribution is very close to the English letter probabilities. I used only the ciphertext to find the key of the Porta Cipher, and as an initial step, I applied a frequency analysis to the ciphertext. Here is the output:

Character frequency of ciphertext

The ciphertext has 13874 characters and 26 unique characters, like the plaintext. However, as shown, its letter frequencies are nearly evenly distributed. This tells us that the ciphertext was not generated by a mono-alphabetic substitution cipher. As mentioned, the Porta Cipher is a poly-alphabetic substitution cipher, which is consistent with that finding.

I tried to find the key length to break the Porta Cipher. I used the Index of Coincidence (IoC) method for that purpose (another alternative is using Kasiski Test). The Index of Coincidence (IoC) is a statistical measure of how similar a frequency distribution is to a uniform distribution.

The IoC is calculated simply by taking the sum of the frequencies of each letter in the ciphertext squared, divided by the total number of letters in the ciphertext:

where fi is the frequency of the i-th letter in the ciphertext, and n is the total number of letters in the ciphertext.

Using the letter probabilities, we can calculate that English text has an approximate IoC value of 0.065. On the other hand, a uniformly random text will have an IoC value of about 0.038 (that is, 1/26).

The length of the key is crucial in a poly-alphabetic substitution cipher. In such ciphers, the same plaintext letter can be encrypted to different ciphertext letters, but the pattern repeats with a period equal to the key length: letters that are a key length apart are encrypted with the same key element.

To determine the key length, I began with a random key length and created substring clusters of that length. I filled the clusters with elements that have a gap of the key length between them.

Creating substrings according to the key length will not directly solve our problem. However, it will have characteristics of a mono-alphabetic substitution cipher and should have a similar IoC value to English letters. So, I divided ciphertext into different substrings (buckets) and compared their IoC concerning being close to 0.065. Here is the complete class I created to calculate IoC:

https://medium.com/media/267a2bedb1300d70e140cfbc12bf284c/href
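A compact sketch of this key-length test is given below. It is not the embedded class: the function names are mine, and it uses the common form of the index, IoC = Σ f_i (f_i − 1) / (n (n − 1)), which may differ slightly from the formula in the original figure.

```python
from collections import Counter


def index_of_coincidence(text):
    """Probability that two letters drawn from the text without replacement are equal."""
    n = len(text)
    if n < 2:
        return 0.0
    counts = Counter(text)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


def average_cluster_ioc(ciphertext, key_length):
    """Split the text into key_length clusters (every key_length-th letter) and average."""
    clusters = [ciphertext[i::key_length] for i in range(key_length)]
    return sum(index_of_coincidence(c) for c in clusters) / key_length


def rank_key_lengths(ciphertext, max_length=15, english_ioc=0.065):
    """Order candidate key lengths by how close their average IoC is to English."""
    lengths = range(1, max_length + 1)
    return sorted(lengths,
                  key=lambda m: abs(average_cluster_ioc(ciphertext, m) - english_ioc))


print(index_of_coincidence("AABB"))         # 0.333...
print(average_cluster_ioc("ABABABAB", 2))   # 1.0
```

With the real ciphertext, `rank_key_lengths(ciphertext)` would be expected to list the true key length and its multiples near the front.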

Below is the output of the IoC values for the candidate key lengths:

IoC values for random key length

It reveals that the key lengths of 6 and 12 are in close proximity to the desired IoC values.

If the key length is 6, we can consider its multiples as possible key candidates since a 12-length key can be generated by repeating a 6-length key twice. Alternatively, we could test 12 since it may be the original key length. After analyzing these options, I initially accepted 6 as the key length and proceeded with it.

As the next step, I created substrings with a key length of 6. The figure below demonstrates my approach for the first 7 letters of the given ciphertext:

Creating substrings of key period

For each cluster of letters, I tried the 13 transformations defined in the Porta tableau. However, the result of performing those transformations will not be readable, since each cluster is a scattered sample of the candidate plaintext. One way to tell whether one of the transformations is a potential solution is to compare the cluster statistics against English letter statistics. For that purpose, I applied the Chi-squared Statistic. The Chi-squared Statistic quantifies the dissimilarity between two categorical probability distributions. If the two distributions are the same, the Chi-squared Statistic will be 0, whereas if the distributions are substantially different, the statistic will be larger.

The formula for the Chi-squared Statistic is:

$$\chi^2 = \sum_{i=A}^{Z} \frac{(C_i - E_i)^2}{E_i}$$

where C_i is the count of letter i, and E_i is the expected count of letter i. To calculate the expected count of a given letter, I multiplied the total number of letters by the letter probability defined in the English letter probabilities table.

Below is the code I implemented to calculate the Chi-squared Statistics of each cluster:

https://medium.com/media/a87bdbbf1f8505ec13d966ba69d23273/href
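For illustration, the statistic can be sketched as below. The frequency table is an assumption: it holds commonly quoted approximate percentages for A to Z, not necessarily the values of the table used in this post.

```python
from collections import Counter

# Approximate English letter frequencies in percent for A..Z (an assumption;
# sources differ slightly and the post's own table is not reproduced here).
FREQ = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
        6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07]
TOTAL = sum(FREQ)


def chi_squared(text):
    """Sum of (C_i - E_i)^2 / E_i over the 26 letters; lower means closer to English."""
    if not text:
        return 0.0
    counts = Counter(text)
    score = 0.0
    for i, percent in enumerate(FREQ):
        expected = len(text) * percent / TOTAL
        score += (counts[chr(ord("A") + i)] - expected) ** 2 / expected
    return score


print(chi_squared("EEEE") < chi_squared("ZZZZ"))  # True: E is far more likely than Z
```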

The results of the Chi-squared Statistics for applying each key element transformation to each cluster are shown below:

So, the candidate key is all of the combinations of these letters:

One of the meaningful words from that combination is OLIVER, and I accepted it as the potential key. I created the below class for Porta cipher operations:

https://medium.com/media/dbbff5b0a4d2c4d798a5c62b10d496cf/href

When I used that key, the Porta Cipher was broken, and I got an exact match with the plaintext. If I had not gotten a match, I would have had to backtrack over the candidate key elements and try different key lengths.

Overall, one of the found keys is:

As a final remark, working on each key element independently lets us consider 13 ∗ key length candidates instead of trying 13^key length possibilities.
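The per-element search can be sketched as follows: for each of the key_length clusters, try the 13 Porta rows and keep the one with the lowest Chi-squared value. Both the tableau layout (pair AB maps A-M onto N-Z unshifted, each later pair rotates by one) and the frequency table are assumptions of this sketch, and the function names are mine.

```python
from collections import Counter

# Approximate English letter frequencies in percent for A..Z (assumed values).
FREQ = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77, 4.03, 2.41,
        6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07]
TOTAL = sum(FREQ)


def porta_letter(c, k):
    """Reciprocal Porta step with key pair index k (assumed tableau layout)."""
    p = ord(c) - ord("A")
    if p < 13:
        return chr(ord("N") + (p + k) % 13)
    return chr(ord("A") + (p - 13 - k) % 13)


def chi_squared(text):
    if not text:
        return 0.0
    counts = Counter(text)
    return sum((counts[chr(65 + i)] - len(text) * f / TOTAL) ** 2 / (len(text) * f / TOTAL)
               for i, f in enumerate(FREQ))


def best_pair_per_cluster(ciphertext, key_length):
    """13 trials per cluster, so 13 * key_length trials in total."""
    result = []
    for i in range(key_length):
        cluster = ciphertext[i::key_length]
        scores = [chi_squared("".join(porta_letter(c, k) for c in cluster))
                  for k in range(13)]
        result.append(scores.index(min(scores)))
    return result


def pair_letters(k):
    """Both keyword letters that select row k, e.g. 7 -> 'OP'."""
    return chr(ord("A") + 2 * k) + chr(ord("A") + 2 * k + 1)


print([pair_letters(k) for k in (7, 5, 4, 10, 2, 8)])  # ['OP', 'KL', 'IJ', 'UV', 'EF', 'QR']
```

Each position yields a pair of letters rather than a single letter, which is why the result is a set of candidate keys from which a meaningful word such as OLIVER can be picked.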

What is Next?

In the next post, I will cover Shannon’s theory and product cryptosystems. For this purpose, I will modify our Porta Cipher example and attempt to break the modified version.

Don’t forget to follow me on Twitter: https://twitter.com/kamaci_furkan

Stay tuned!


Algorithm to Solve Sudoku

Furkan KAMACI — Thu, 03 Jun 2021 11:45:51 GMT

Sudoku is a logic-based number-placement puzzle. In classic sudoku, the objective is to fill a 9×9 grid with digits so that each column, each row, and each of the nine 3×3 sub-grids that compose the grid (also called “boxes”, “blocks”, or “regions”) contains all of the digits from 1 to 9. The puzzle setter provides a partially completed grid, which for a well-posed puzzle has a single solution [1].

Sudoku example and its solution

By the nature of the puzzle, we can fill the empty cells with numbers in many ways and obtain many solution candidates. Trying to solve this problem with a brute force approach is not feasible since it has a complexity of O(9^m), where m is the number of empty cells.

When we think about the brute force solution, we realize that some of the solution attempts involve unnecessary branching. For example, once we place a number in a row, we don’t need to derive new solution attempts that have the same number twice in that row, since they cannot be valid solutions.

So, the key point is that we are looking for an algorithm that solves the problem while pruning unnecessary branches. At this point, backtracking can help us.

Backtracking is a general algorithm for finding all (or some) solutions to some computational problems, notably constraint satisfaction problems, that incrementally builds candidates to the solutions, and abandons a candidate (“backtracks”) as soon as it determines that the candidate cannot possibly be completed to a valid solution [2].

Backtracking algorithms work as follows:

1. Iterate to construct solution attempts.
2. Do not construct a candidate if it is not valid.
3. Branch to the next candidate by moving ahead.
4. Recursively follow step 1.
5. Return if the solution is found.
6. Take one step back and try moving ahead with the next candidate.

This means that we will try to put numbers to the empty cells and continue if it is valid sudoku with existing numbers. We will recursively follow these steps until we hit an invalid solution. If so, we will take one step back and try the next number.

Solving sudoku with backtracking

Assume that we are given the sudoku as a 2D char array in which empty cells are marked with the ‘.’ (dot) character. The algorithm that solves sudoku with backtracking is as follows:

https://medium.com/media/768d12cf46dd21771c5ffb20531bd38b/href
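As a minimal sketch of the same idea in Python (not the embedded solution; the function names are mine), the board is a list of lists of single characters with ‘.’ for empty cells. The sample grid is a widely used example puzzle.

```python
def is_valid(board, row, col, digit):
    """Check that digit does not already appear in the row, column or 3x3 box."""
    for i in range(9):
        if board[row][i] == digit or board[i][col] == digit:
            return False
    top, left = 3 * (row // 3), 3 * (col // 3)
    return all(board[top + r][left + c] != digit for r in range(3) for c in range(3))


def solve(board):
    """Fill the board in place; return True if a solution was found."""
    for row in range(9):
        for col in range(9):
            if board[row][col] != ".":
                continue
            for digit in "123456789":
                if is_valid(board, row, col, digit):
                    board[row][col] = digit
                    if solve(board):
                        return True
                    board[row][col] = "."  # backtrack: undo and try the next digit
            return False  # no digit fits this empty cell
    return True  # no empty cell left


PUZZLE = [
    "53..7....", "6..195...", ".98....6.",
    "8...6...3", "4..8.3..1", "7...2...6",
    ".6....28.", "...419..5", "....8..79",
]
board = [list(line) for line in PUZZLE]
solved = solve(board)
print(solved)
print("\n".join("".join(line) for line in board))
```

The `is_valid` check is what prunes the search: a digit that already conflicts with the current board is never placed, so whole subtrees of the O(9^m) space are skipped.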

This article is a part of my series of solutions to classical algorithm problems. Check my profile to read the existing articles and stay tuned for new ones.

You can follow me on Twitter: https://twitter.com/kamaci_furkan

[1] https://en.wikipedia.org/wiki/Sudoku

[2] Gurari, Eitan (1999). “CIS 680: DATA STRUCTURES: Chapter 19: Backtracking Algorithms”. Archived from the original on 17 March 2007.

Apache NLPCraft

Furkan KAMACI — Tue, 10 Mar 2020 12:04:48 GMT

Photo by Volodymyr Hryshchenko on Unsplash

NLPCraft is an open source library for adding a Natural Language Interface to any application. Based on semantic modeling, it requires no ML/DL model training or existing text corpora.

NLPCraft is simple to use: define a semantic model and intents to interpret user input. Securely deploy this model and use REST API to explore the data using natural language from your applications.

Why NLI

Natural Language Interface (NLI) enables users to explore any type of data sources using natural language augmenting existing UI/UX with fidelity and simplicity of conversational AI.

There is no learning curve, no special rules or applications to master, no syntax or terms to remember — just a natural language that your users already speak and the tools they already use.

Key Features

Semantic Modeling

Advanced semantic modeling and intent-based matching enable deterministic natural language understanding without requiring ML/DL training or text corpora.

Strong Security

HTTPS, model deployment isolation, 256-bit encryption, and ingress-only connectivity are among the key security features in NLPCraft.

Any Data Source

Any data source, device, or service — public or private. From databases and SaaS systems to smart home devices, voice assistants and chatbots.

Model-As-A-Code

Model-as-a-code convention natively supports any system development life cycle tools and frameworks in Java eco-system.

Java-First

REST API and Java-based implementation natively support the world’s largest ecosystem of development tools, programming languages, and services.

Out-Of-The-Box Integration

NLPCraft natively integrates with OpenNLP, Google Cloud Natural Language API, CoreNLP and spaCy for base NLP processing and named entity recognition.

How It Works

There are three main software components:

Data model specifies how to interpret user input, how to query a data source, and how to format the result back. Developers use a model-as-a-code approach to build models using any JVM language like Java or Scala.

Data probe is a DMZ-deployed application designed to securely deploy and manage data models. Each probe can manage multiple models and you can have many probes.

REST server provides a REST endpoint for user applications to securely query data sources using NLI via data models deployed in data probes.
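To make the idea of a deterministic, intent-based data model more concrete, here is a minimal sketch in Python. It is not NLPCraft's API: the MODEL dictionary, the intent names, and the match_intent function are all hypothetical and only illustrate how synonym-based matching can give the same answer for the same input without any ML/DL training.

```python
import re

# Hypothetical toy model (not NLPCraft's real model format): each intent lists
# groups of synonyms, and an intent matches only if every group is represented.
MODEL = {
    "lights_on": [{"turn on", "switch on", "enable"}, {"light", "lights", "lamp"}],
    "lights_off": [{"turn off", "switch off", "disable"}, {"light", "lights", "lamp"}],
}


def normalize(text):
    """Lowercase the input and keep only alphanumeric words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def match_intent(text, model=MODEL):
    """Return the first intent whose synonym groups all occur in the text, else None."""
    padded = f" {normalize(text)} "  # pad so that only whole words and phrases match
    for intent, groups in model.items():
        if all(any(f" {synonym} " in padded for synonym in group) for group in groups):
            return intent
    return None


print(match_intent("Please turn on the lights!"))  # lights_on
print(match_intent("Switch off the lamp"))         # lights_off
print(match_intent("What time is it?"))            # None
```

Because the matching is plain set and substring logic, the result for a given sentence never changes between runs, which is the property the deterministic approach relies on.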

NLI Applications

Although NLI (Natural Language Interface) is broadly applicable to many applications and software systems, there are specific areas where it is already used today and has demonstrated its unique capabilities.

NLI-Enhanced Search

NLI-enhanced search, filter, and sort is one area where NLI has been successful for a number of years already. Look at Google Analytics, Gmail, JIRA, or many other applications that allow you to search, filter or sort their content with natural language queries. This use case is a perfect application of NLI as it naturally augments the existing UI/UX by replacing often cumbersome and hard-to-use search/filter/sort UX with a simple text box.

As a matter of fact, all major general-purpose search platforms today (e.g. Google, Bing, or Siri) use the NLI-enhanced approach to process search queries.

Chatbots

NLI is clearly at the heart of any chatbot implementation. Although most naive implementations of chatbots have failed to gain significant traction, advances in NLI technology are allowing modern chatbots to become gradually more sophisticated and outgrow the early “childhood” problems of parasitic dialogues, lack of contextual awareness, inability to comprehend spoken, free-form language, and primitive rule-based logic.

Data Reporting

Fully deterministic NLI systems like NLPCraft provide critical technology for NLI-based data reporting. Unlike data insight analytics or data exploration, data reporting typically cannot rely on the probabilistic nature of ML/DL-based approaches, as it must provide 100% correctness in all cases.

NLPCraft employs advanced semantic modeling that provides fully deterministic results and NL comprehension.

Ad-Hoc Data Exploration

One of the most exciting applications of NLI is ad-hoc data analytics, or data exploration. This is the area where a proper NLI application can bring about a fundamental change in how we explore our data and discover insights from it.

Today most data is walled off in the silos of individual, incompatible data systems, making it mostly inaccessible to all but a few “power” users. Very few people can gain access to all the different systems in a typical company, learn all the different ways to analyze the data, and master incompatible and drastically different user interfaces.

The NLI-based approach can democratize access to sprawling, siloed data with a single unified UX by allowing users to explore and analyze the data in natural language. Natural language is the only UX/UI that everyone already knows, that requires no training or learning, and that is universal regardless of the data source.

Device Control

With the popularization of consumer technologies like Amazon Alexa, Apple HomeKit, Mercedes MBUX and similar products, NLI-based control of various devices and systems is becoming the norm.

While most of these systems today can only understand rudimentary two- or three-word commands, advances in NLI technology are rapidly leading to more sophisticated interfaces. The enterprise world is starting to catch up, and NLI-based systems appear today in various manufacturing, oil and gas, pharma and medical applications.

NLPCraft has been accepted into the Apache Incubator and I will be a Mentor of the Apache NLPCraft project. I’ll supervise the NLPCraft community in order to align it with the Apache Way.

[1] https://nlpcraft.org/index.html

[2] https://medium.com/@furkankamaci/open-source-software-development-and-apache-incubator-372cc90081ae

Apache NLPCraft was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Apache NLPCraft

Furkan KAMACI — Wed, 04 Mar 2020 13:26:42 GMT

Photo by Volodymyr Hryshchenko on Unsplash

U.S. National Institute of Standards and Technology Released the Privacy Framework

Furkan KAMACI — Mon, 20 Jan 2020 14:52:28 GMT

Photo by fabio on Unsplash

The National Institute of Standards and Technology (NIST) is a physical sciences laboratory and a non-regulatory agency of the United States Department of Commerce. Its mission is to promote innovation and industrial competitiveness. NIST’s activities are organized into laboratory programs that include nanoscale science and technology, engineering, information technology, neutron research, material measurement, and physical measurement.

NIST released version 1.0 of the Privacy Framework: A Tool for Improving Privacy through Enterprise Risk Management (Privacy Framework) to enable better privacy engineering practices that support privacy by design concepts and help organizations protect individuals’ privacy. The Privacy Framework can support organizations in:

Building customers’ trust by supporting ethical decision-making in product and service design or deployment that optimizes beneficial uses of data while minimizing adverse consequences for individuals’ privacy and society as a whole;
Fulfilling current compliance obligations, as well as future-proofing products and services to meet these obligations in a changing technological and policy environment; and
Facilitating communication about privacy practices with individuals, business partners, assessors, and regulators.

Developed in collaboration with a range of stakeholders, the framework provides a useful set of privacy protection strategies for organizations that wish to improve their approach to using and protecting personal data.

The NIST Privacy Framework is not a law or regulation, but rather a voluntary tool that can help organizations manage privacy risk arising from their products and services, as well as demonstrate compliance with laws that may affect them, such as the California Consumer Privacy Act and the European Union’s General Data Protection Regulation. It helps organizations identify the privacy outcomes they want to achieve and then prioritize the actions needed to do so.

The Privacy Framework follows the structure of the Framework for Improving Critical Infrastructure Cybersecurity (Cybersecurity Framework) to facilitate the use of both frameworks together. Like the Cybersecurity Framework, the Privacy Framework is composed of three parts: Core, Profiles, and Implementation Tiers. Each component reinforces privacy risk management through the connection between business and mission drivers, organizational roles and responsibilities, and privacy protection activities.

The Core enables a dialogue — from the executive level to the implementation/operations level — about important privacy protection activities and desired outcomes.
Profiles enable the prioritization of the outcomes and activities that best meet organizational privacy values, mission or business needs, and risks.
Implementation Tiers support decision-making and communication about the sufficiency of organizational processes and resources to manage privacy risk.

In summary, the Privacy Framework is intended to help organizations build better privacy foundations by bringing privacy risk into parity with their broader enterprise risk portfolio.

Privacy Risk Management

Companies hold a tremendous amount of data, but it is scattered across different storage media and data types. Such data includes a large amount of personal data, which makes companies an attractive target.

Lagom Data Knowledge Platform can connect to any data source, both structured (e.g. databases) and unstructured (e.g. file systems, e-mails, documents). It can then search, analyze, report on, make predictions from, and take automated actions on that data via static or ML-based rules, all at high speed.

To help organizations manage the privacy risk arising from their products and services, as the Privacy Framework describes, and to demonstrate compliance with laws that may affect them, such as the CCPA and the GDPR, Lagom provides a Data Governance solution that discovers structured and unstructured data, catalogs sensitive data, takes automatic actions such as pseudonymization or anonymization, creates reports, and notifies authorized people to help prevent data breaches and support cybersecurity.

If you think that you are a good fit for Lagom and want to work with us, send your resume to career@lagom.ai

Do you want us to visit your country for a tech talk? Please send an email to contact@lagom.ai

Visit our webpage at https://lagom.ai

[1] https://www.nist.gov/system/files/documents/2020/01/16/NIST%20Privacy%20Framework_V1.0.pdf

[2] https://www.nist.gov/news-events/news/2020/01/nist-releases-version-10-privacy-framework

[3] https://nvlpubs.nist.gov/nistpubs/CSWP/NIST.CSWP.04162018.pdf

U.S. National Institute of Standards and Technology Released the Privacy Framework was originally published in LAGOM on Medium, where people are continuing the conversation by highlighting and responding to this story.

California Consumer Privacy Act (CCPA) and Comparison with GDPR

Furkan KAMACI — Thu, 28 Nov 2019 10:41:19 GMT

31 states in the United States have established laws regulating the secure destruction or disposal of personal information, and at least a dozen states impose broader data security requirements. California is a pioneer among them.

The California Consumer Privacy Act (CCPA) was enacted in 2018 and takes effect on January 1, 2020. This landmark piece of legislation secures new privacy rights for California consumers.

The CCPA grants new rights to California consumers:

The right to know what personal information is collected, used, shared or sold, both as to the categories and specific pieces of personal information;
The right to delete personal information held by businesses and by extension, a business’s service provider;
The right to opt out of the sale of personal information. Consumers are able to direct a business that sells personal information to stop selling that information. Children under the age of 16 must provide opt-in consent, with a parent or guardian consenting for children under 13.
The right to non-discrimination in terms of price or service when a consumer exercises a privacy right under CCPA.

The CCPA applies to certain businesses:

A business is subject to the CCPA if it meets one or more of the following criteria:

Has gross annual revenues in excess of $25 million;
Buys, receives, or sells the personal information of 50,000 or more consumers, households, or devices;
Derives 50 percent or more of annual revenues from selling consumers’ personal information.
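The three criteria above can be read as a simple "any of" rule. The following Python sketch encodes them exactly as listed; the function and parameter names are my own illustrative assumptions, and the code is a reading aid rather than legal advice.

```python
def is_subject_to_ccpa(gross_annual_revenue_usd, consumer_records, revenue_share_from_selling_pi):
    """Return True if at least one of the three thresholds listed above is met.

    gross_annual_revenue_usd: gross annual revenues in US dollars
    consumer_records: number of consumers, households, or devices whose
        personal information is bought, received, or sold
    revenue_share_from_selling_pi: fraction (0.0 to 1.0) of annual revenues
        derived from selling consumers' personal information
    """
    return (
        gross_annual_revenue_usd > 25_000_000      # "in excess of $25 million"
        or consumer_records >= 50_000              # "50,000 or more"
        or revenue_share_from_selling_pi >= 0.5    # "50 percent or more"
    )


print(is_subject_to_ccpa(30_000_000, 0, 0.0))      # True, revenue threshold
print(is_subject_to_ccpa(1_000_000, 50_000, 0.1))  # True, record-count threshold
print(is_subject_to_ccpa(1_000_000, 10_000, 0.1))  # False, no threshold met
```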

As proposed by the draft regulations, businesses that handle the personal information of more than 4 million consumers will have additional obligations.

CCPA and GDPR

The California Consumer Privacy Act (CCPA) and the European Union’s General Data Protection Regulation (GDPR) are separate legal frameworks with different scopes, definitions, and requirements. A business that complies with GDPR and is subject to CCPA may have additional obligations under CCPA.

For example, under GDPR, companies must undertake a data inventory and mapping of data flows in furtherance of creating records to demonstrate compliance. Additional data mapping may be important to reflect the different requirements under CCPA.
Under GDPR, companies must develop processes and/or systems to respond to individual requests for access to personal information and for erasure of personal information. These processes and/or systems may be applied to handling CCPA consumer requests, although businesses may need to review and reconcile the different definitions of personal information and applicable rules on verification of consumer requests.
Under GDPR, companies must disclose data privacy practices in a privacy policy. CCPA also requires companies to disclose specific business practices in a comprehensive privacy policy. Many California companies that operate commercial websites and online services must post a privacy policy under the California Online Privacy Protection Act, or CalOPPA, and will need to update this policy for CCPA.
Under GDPR, companies must draft and execute written contracts with their service providers (“processors”). Companies may need to review these contracts to reflect requirements under CCPA.

Solution

To help companies comply with regulations like CCPA, Lagom provides a Data Governance solution that discovers structured and unstructured data, catalogs sensitive data, takes automatic actions such as pseudonymization or anonymization, creates reports, and notifies authorized people to help prevent data breaches and support cybersecurity.

If you think that you are a good fit for Lagom and want to work with us, send your resume to career@lagom.ai

Do you want us to visit your country for a tech talk? Please send an email to contact@lagom.ai

Visit our webpage at https://lagom.ai

[1] https://oag.ca.gov/privacy/ccpa

[2] https://www.helpnetsecurity.com/2019/05/23/american-gdpr-awareness/

California Consumer Privacy Act (CCPA) and Comparison with GDPR was originally published in LAGOM on Medium, where people are continuing the conversation by highlighting and responding to this story.

Becoming Compliant with GDPR

Furkan KAMACI — Mon, 25 Nov 2019 11:15:08 GMT

Photo by Dayne Topkin on Unsplash

The GDPR is a milestone in regulating how personal data is collected, used and processed, and it gives EU citizens more control over their data. This Regulation applies to the processing of personal data in the context of the activities of an establishment of a controller or a processor in the Union, regardless of whether the processing takes place in the Union or not.

Personal data, processing, profiling, restriction of processing and pseudonymization are all defined in the GDPR.

Personal Data

Any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

Processing

Any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organization, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction.

According to Article 6 of the GDPR, there are six lawful bases for processing personal data:

1. Consent

The data subject has given consent to the processing of his or her personal data for one or more specific purposes.

2. Contracts

Processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract.

3. Legal obligation

Processing is necessary for compliance with a legal obligation to which the controller is subject.

4. Vital Interest

Processing is necessary in order to protect the vital interests of the data subject or of another natural person.

5. Public Interest

Processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller.

6. Legitimate Interest

Processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.

Processing is lawful only if, and to the extent that, at least one of these six bases applies.

Profiling

Any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyze or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behavior, location or movements.

Restriction of Processing

The marking of stored personal data with the aim of limiting their processing in the future.

Pseudonymization

The processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.
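One common way to approach this definition is to replace an identifier with a keyed hash and keep the key separately from the data. The Python sketch below uses HMAC-SHA256 from the standard library. It is only an illustration under my own assumptions: the pseudonymize function, the sample record, and the hard-coded key are hypothetical, this is not a description of how Lagom implements pseudonymization, and in practice the key would be stored separately under its own access controls.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded).

    The same value and key always give the same token, so records can still be
    joined, but the token cannot be linked back to the value without the key.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for the example only; keep the real key apart from the data.
key = b"example-key-stored-separately"

record = {"email": "alice@example.com", "plan": "premium"}
safe_record = {**record, "email": pseudonymize(record["email"], key)}

print(safe_record["plan"])        # premium
print(len(safe_record["email"]))  # 64
```

The "additional information" mentioned in the definition corresponds here to the secret key: whoever holds only the pseudonymized records cannot re-attribute them to a person.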

Penalties

Fines under the GDPR can reach up to 20 million EUR or, in the case of an undertaking, up to 4% of the total worldwide annual turnover of the preceding financial year, whichever is higher.
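The "whichever is higher" rule is easy to misread, so here is a small worked example in Python. The function name is my own, and the turnover figures are made-up inputs chosen only to show both sides of the rule.

```python
def max_gdpr_fine_eur(annual_turnover_eur: float) -> float:
    """Upper bound of the fine for an undertaking: the higher of 20 million EUR
    and 4% of the total worldwide annual turnover of the preceding financial year."""
    return max(20_000_000, annual_turnover_eur * 4 / 100)


# Turnover of 100 million EUR: 4% is 4 million, so the 20 million EUR cap applies.
print(max_gdpr_fine_eur(100_000_000))    # 20000000
# Turnover of 1 billion EUR: 4% is 40 million, which is higher than 20 million.
print(max_gdpr_fine_eur(1_000_000_000))  # 40000000.0
```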

Solution

To help companies comply with regulations like GDPR, Lagom provides a Data Governance solution that discovers structured and unstructured data, catalogs sensitive data, takes automatic actions such as pseudonymization or anonymization, creates reports, and notifies authorized people to help prevent data breaches and support cybersecurity.

If you think that you are a good fit for Lagom and want to work with us, send your resume to career@lagom.ai

Do you want us to visit your country for a tech talk? Please send an email to contact@lagom.ai

Visit our webpage at https://lagom.ai

[1] https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679

Becoming Compliant with GDPR was originally published in LAGOM on Medium, where people are continuing the conversation by highlighting and responding to this story.

Lagom Data Governance

Furkan KAMACI — Mon, 18 Nov 2019 07:14:50 GMT

Data privacy and protection are important problems for every company, especially in a world where the volume of collected data, including personal data, keeps growing.

According to the GDPR, personal data is defined as any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.

However, protecting such data, preventing information leaks, and keeping operations aligned with regulations are challenging tasks.

Lagom Data Governance Flow

Lagom does not require installing agents on your systems, and you can run it on-premises. Write permission is optional: Lagom needs it only for the operations you choose to enable, such as actions that modify data.

Our platform provides an Authority Matrix in which all access, read and write operations can be defined for more granular security. Actions taken on the platform, search operations and management of the platform itself can all be controlled through the Authority Matrix.

You can connect your existing authorization systems, e.g. LDAP implementations, and all operations respect the corresponding permissions. After that, you can connect any type of data source, e.g. SQL data sources, NoSQL data sources, documents (Word, Excel, PDF, images, etc.), e-mail, Exchange, Office365, Dropbox, Google, IBM FileNet, Alfresco or any other CMS, to start discovery.

The next step is defining procedures, policies, and concepts for your data. Procedures and policies are semantic definitions for your data governance operations. Concepts are complex search bundles that include dictionaries, advanced search queries, and functions. One of the powerful features of Lagom is that concepts can be defined at any time and do not require a re-discovery. Concept results are cataloged according to your custom or predefined definitions.

Finally, you can define actions and retentions for your cataloged data. Actions and retentions can be scheduled by time or triggered when defined conditions are met. They can include asking your users for explicit consent, archiving, pseudonymization or anonymization, or any other function you have defined. For example, you can detect whether district information appears in your data and then replace it with another district within 5 miles of the original one.

Lagom also reports analytics for both ongoing processes and the overall data. In addition, it extracts quality metrics for your data, e.g. duplicate documents or unnecessary columns in a database table.

All in all, Lagom provides a Data Governance solution that discovers structured and unstructured data, catalogs sensitive data, takes automatic actions such as pseudonymization or anonymization, creates reports, and notifies authorized people to help prevent data breaches and support cybersecurity.

If you think that you are a good fit for Lagom and want to work with us, send your resume to career@lagom.ai

Do you want us to visit your country for a tech talk? Please send an email to contact@lagom.ai

Visit our webpage at https://lagom.ai

[1] https://medium.com/lagomtech/lagom-data-knowledge-platform-3-7a09709634db

Lagom Data Governance was originally published in LAGOM on Medium, where people are continuing the conversation by highlighting and responding to this story.