Announcing AM library 1.0.0

I am happy to announce the launch of a new library called AM, or Assert-Mocks, for unit-testing Java Servlet API code. The library aims to make it much easier and faster to test Servlet API code, including (but not limited to) servlets, servlet filters and tag libraries. It removes the need to use the Mockito framework for setting up expectations.

Usage

The following two examples should help you get started easily. More examples are being added to the Github:sangupta/am repository.

Testing a simple JSP tag

AmTagLibTestHelper.testTagOutput(

    // the class implementing custom tag
    MyCustomJSPTag.class,

    // the expected String response
    expectedOutputWrittenToJspWriter,

    // set the values before invocation
    new GenericConsumer<MyCustomJSPTag>() {

        // method argument of same type as the first param
        public boolean consume(MyCustomJSPTag tag) {
            // set the properties of the tag
            tag.setProperty1(prop1);
            tag.setProperty2(prop2);

            // and so on...

            // return either true or false
            // it does not matter in this case
            return true;
        }
    }

);

Testing a Servlet Filter

// obtain an instance of the filter
MyCustomFilter filter = AmServletFilterTestHelper.getFilterForUnitTesting(MyCustomFilter.class);

// create request and response objects as the filter will need them
// the AmHttpServletRequest.getDefault() method returns a request for a server
// deployed on localhost on port 80, hit from the same machine, where the
// servlet context is `context` and the path is `/home.html`
AmHttpServletRequest request = AmHttpServletRequest.getDefault("home.html");

// the response object to which filter will write
AmHttpServletResponse response = new AmHttpServletResponse();

// filter invocation
AmServletFilterTestHelper.assertFilterChainInvoked(filter, request, response);

// assert what was set in response
Assert.assertEquals(200, response.getStatusCode());
Assert.assertEquals("myCustomHeaderValue", response.getHeader("X-Custom-Header"));

Source & Downloads

The source code for the library is available at Github:sangupta/am

The library is available at the following Maven coordinates:

<dependency>
    <groupId>com.sangupta</groupId>
    <artifactId>am</artifactId>
    <version>1.0.0</version>
</dependency>

Happy Hacking!

Releasing jerry-core 3.0.0

I am happy to announce that, exactly a year after the last release, I have a new major release for the jerry-core library: version 3.0.0. You may get hold of the library from Maven Central with the following coordinates:

<dependency>
    <groupId>com.sangupta</groupId>
    <artifactId>jerry-core</artifactId>
    <version>3.0.0</version>
</dependency>

Following are the major changes in this release:

  • Minimum supported JDK version is now 1.7, up from the earlier 1.6
  • Upgraded the dependencies to the latest available versions
  • Fixed a critical bug in Base62Encoder where conflicting codes were generated when both positive and negative numbers were encoded. This is what forces the major version bump: Base62Encoder is no longer compatible with codes generated by any of the previous versions

Other additions to the library include (but are not limited to):

  • Added ResourceUtils to read files from classpath including from packaged JARs
  • Added StringArrayIterator to iterate over a string array using Iterator
  • Added IndentedStringWriter that takes care of writing long indented text that breaks at a given line length
  • Added count(), removeAll(), ltrim() and rtrim() methods to StringUtils
  • Updated ReflectionUtils to bind values to object-wrappers for primitives
  • Allow counter names to be read from IntegerCounter and LongCounter
  • Added SimpleMultiMap.numValues()
  • Added StringUtils.wildcardMatch() method
  • Added jitpack.yml to allow consuming jerry-core via https://jitpack.io
  • Added isJDK8() and isJDK9() methods to JDKUtils
  • Added asMap() and clear() methods to IntegerCounter and LongCounter
  • Added getNextSetBit() method to BitArray and implementations
  • Added AdvancedStringReader and corresponding unit-tests

And lastly, the Javadocs have been heavily updated to add missing documentation and bring more clarity to the existing content.

Usage examples will be posted soon to this blog :)

Hope you find this release useful.

Fastest way to merge multiple integer sets

Problem

You are given multiple integer sets and you need to merge them into a single set in the fastest possible way.

Solution

The always-updated version of this article is available in the Github:sangupta/ps repo.

Suppose we have 3 integer sets (as arrays) and we need to merge them. The fastest solution runs in O(n1 + n2 + n3) time, where n1, n2, n3 are the lengths of the three sets respectively. The idea is to use a bit-vector (also called a bit-set or bit-array) in which each element is represented by its index, and then iterate over the set bits.

  • Construct a bit-array
  • Iterate over the first array and for each integer set the corresponding bit to true
  • Repeat the above process for remaining arrays
  • Any duplicate element simply sets a bit that is already set
  • The resultant bit-array is the merge of all the arrays
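
In code, using java.util.BitSet (a sparse bit-set implementation can be swapped in for large value ranges):
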
import java.util.BitSet;

public void mergeArrays(int[] array1, int[] array2, int[] array3) {
  // java.util.BitSet cannot index negative values, so this assumes
  // the sets contain non-negative integers
  BitSet bitSet = new BitSet();

  // start merging: set the bit corresponding to each element
  for(int num : array1) {
    bitSet.set(num);
  }
  for(int num : array2) {
    bitSet.set(num);
  }
  for(int num : array3) {
    bitSet.set(num);
  }

  // output: walk the set bits in ascending order
  for(int index = bitSet.nextSetBit(0); index >= 0; index = bitSet.nextSetBit(index + 1)) {
    System.out.println(index);
  }
}
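
For example, calling the method with three small sets (values picked just for illustration):

mergeArrays(new int[] { 1, 5, 9 }, new int[] { 2, 5, 7 }, new int[] { 3, 9 });
// prints 1 2 3 5 7 9, every element exactly once, in ascending order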

The solution above can be extended to as many arrays as the problem provides. The time to merge still remains O(N), where N is the total number of elements across all provided arrays.

Optimizations available

  • One can make use of sparse bit-arrays to reduce memory consumption. Refer to [brettwooldridge/SparseBitSet](https://github.com/brettwooldridge/SparseBitSet) for one such implementation.

  • If the arrays are really huge, an implementation that persists the bit-array to a file can be used. Refer to one such implementation available in the jerry-core project.

Fastest sorting of integers

Problem

We are given a huge array of bounded, non-duplicate, positive integers that need to be sorted. What’s the fastest way to do it?

Solution

The always-updated version of this article is available in the Github:sangupta/ps repo.

Most of the interview candidates I have discussed this problem with quickly answer MergeSort or divide-and-conquer, which sorts in O(N * log(N)) time. For this particular problem, that is not the fastest sorting time.

The fastest time to sort an integer array is O(N). Let me explain how.

  • Construct a boolean array sized to the bound, i.e., the maximum possible value plus one
  • For every integer n in the input array, mark the boolean at index n as true
  • Once the input iteration is complete, iterate over the boolean array and print every index whose value is true
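
In code (since the bound is not passed in, it is computed with one extra pass over the input, which keeps the total time O(N)):
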
public void sortBoundedIntegers(int[] array) {
  // the regular checks first
  if(array == null || array.length == 0) {
    return;
  }

  // find the bound (maximum value) so the flag array is large enough;
  // this extra pass is still O(N)
  int max = 0;
  for(int num : array) {
    if(num > max) {
      max = num;
    }
  }

  // build the boolean flag array: one slot per possible value
  boolean[] result = new boolean[max + 1];
  for(int num : array) {
    result[num] = true;
  }

  // print the result: every index whose flag is true, in ascending order
  for(int index = 0; index < result.length; index++) {
    if(result[index]) {
      System.out.println(index);
    }
  }
}

Additional constraints and optimizations available

  • In Java, a boolean in an array typically occupies one byte of memory. Thus, switching to a bit-vector (also called a bit-array) reduces memory consumption by a factor of 8. Check code sample 2.

  • In case the integers can also be negative, a second bit-array can be used for the negative values; the negative bit-array is then iterated from its highest set bit down, followed by the positive one in ascending order, to produce the result. Check code sample 2.

  • To reduce memory consumption further, one can make use of sparse bit-arrays. This can lead to a huge drop in memory consumption if the integers are spaced far apart. Check code sample 2. Refer to brettwooldridge/SparseBitSet for one such implementation.

  • In case the integer array contains duplicates, use a short array instead of the boolean array to hold the number of times each integer has been seen, still sorting in O(N) time (a sketch of this variant follows after code sample 2).

Code Sample 2

import java.util.BitSet;

public void sortBoundedIntegers(int[] array) {
  // run the regular null/empty checks as in code sample 1

  // java.util.BitSet is used here; a sparse implementation such as
  // brettwooldridge/SparseBitSet can be dropped in to cut memory usage
  BitSet negative = new BitSet();
  BitSet positive = new BitSet();

  // mark the bit for each value; negatives go into their own bit-array
  for(int num : array) {
    if(num < 0) {
      negative.set(-num); // store the magnitude
    } else {
      positive.set(num);
    }
  }

  // output the negatives first, from most negative to least negative,
  // by walking the negative bit-array from its highest set bit downwards
  for(int index = negative.previousSetBit(negative.length() - 1); index >= 0; index = negative.previousSetBit(index - 1)) {
    System.out.println(-index);
  }

  // then the positives in ascending order
  for(int index = positive.nextSetBit(0); index >= 0; index = positive.nextSetBit(index + 1)) {
    System.out.println(index);
  }
}
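
The duplicate-handling variant from the last optimization bullet could look roughly like the sketch below. The bound parameter and the method name are assumptions made for illustration; counting occurrences keeps the overall time at O(N), though a short counter would overflow past 32,767 repetitions of a single value.

public void sortBoundedIntegersWithDuplicates(int[] array, int bound) {
  // one counter per possible value; 'bound' is the known upper bound
  short[] counts = new short[bound + 1];

  // count how many times each value occurs
  for(int num : array) {
    counts[num]++;
  }

  // print every value as many times as it was seen
  for(int value = 0; value <= bound; value++) {
    for(int times = 0; times < counts[value]; times++) {
      System.out.println(value);
    }
  }
}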

Finding degrees of separation in a social graph

Problem

Assume two users on a social network start with zero connections and add connections at a rate of 100 per minute. Explain how you would design a solution to find the degrees of separation between two profiles, given a known function getConnections(profile.id) that returns data over the network. What would you do differently for a real-time but less consistent/optimal result versus a slower but more accurate result?

Solution

The always-updated version of this article is available in the Github:sangupta/ps repo.

I can think of many different ways to compute this; I will put them out below.

Approach 2 seems the best considering the storage and traversal costs. Approach 3 can be near real-time if we take care of the storage and add an optimal number of workers for the fan-out.

Assumptions

  • If a degree of separation cannot be found within DEPTH_TO_BREAK_AT levels (an integer constant), we break and say the users are not connected

Various Approaches

Approach 1: Crude approach - Brute Force

As stated, we already have a function getConnections(profileID) that returns a list of all users directly connected to the given user. Now, let’s say we want to compute the degrees of separation between two users u1 and u2, assuming the profile ID is a String so that we can accommodate UUIDs as well:

/**
 * Get degrees of separation between two users
 */
public int degreesOfSeparation(User u1, User u2) {
	// skipping null checks and all

	String p1 = u1.getProfileID();
	String p2 = u2.getProfileID();

	return doLevelSearch(1, p1, p2);
}

/**
 * Returns the level at which the target user was found, or -1 if the
 * two users are not connected within DEPTH_TO_BREAK_AT levels.
 */
private int doLevelSearch(int level, String p1, String p2) {
	List<String> connections = getConnections(p1);
	if(connections == null || connections.isEmpty()) {
		return -1; // no connections for this user
	}

	if(connections.contains(p2)) {
		// found the user
		return level;
	}

	if(level >= DEPTH_TO_BREAK_AT) {
		// breaking condition in case there is no connection between
		// the two users - we need to stop at some depth
		return -1;
	}

	// recurse into each direct connection (depth-first)
	for(String connection : connections) {
		int found = doLevelSearch(level + 1, connection, p2);
		if(found > 0) {
			return found;
		}
	}

	return -1;
}

This approach has the following advantages:

  • Storage costs, both on disk and in memory, are low

This approach has the following disadvantages:

  • Too slow - it may traverse the entire graph every time
  • When the data is sparse, we can end up traversing the entire graph - hence the need to code a level >= 20 style condition to break at some point
  • Needs too many queries - O(n ^ d) in the worst case, where d is the degrees of separation
  • The degrees of separation reported may not be the minimum, because we are doing a depth-first traversal

NOTE: As this is a depth-first traversal, it will be painfully slow. We MUST NOT use this approach.

Approach 2: Brute Force - Two sided

To improve upon Approach 1, we can start the traversal from both sides of the graph, keeping in memory the list of all connections seen from each side. This way we can save a lot of the queries we fire over the network.

The approach is a simple breadth-first traversal, but starting from both sides.

/**
 * Get degrees of separation between two users
 */
public int degreesOfSeparation(User u1, User u2) {
	// skipping null checks and all

	String p1 = u1.getProfileID();
	String p2 = u2.getProfileID();

	return doSearch(p1, p2);
}

/**
 * Do an iterative search for the degree of separation
 */
private int doSearch(String p1, String p2) {
	// now for the set
	List<Connection> c1 = new ArrayList<>();
	List<Connection> c2 = new ArrayList<>();

	// add users to their own owning lists
	c1.add(new Connection(p1, 0));
	c2.add(new Connection(p2, 0));

	do {
		// fetch first available user from list
		boolean hadNonScanned1 = fetchNextAvailable(c1);
		if(hadNonScanned1) {
			int hit = checkHit(c1, c2);
			if(hit > 0) {
				return hit;
			}
		}

		// check if there is at least one connection that we can scan
		boolean hadNonScanned2 = fetchNextAvailable(c2);
		if(hadNonScanned2) {
			int hit = checkHit(c1, c2);
			if(hit > 0) {
				return hit;
			}
		}

		// check if there was at least one user in any of the two lists
		// that was scanned - if not, we have reached our crawling
		// limits
		if(!hadNonScanned1 && !hadNonScanned2) {
			// we have reached breaking limit
			return -1;
		}		
	} while(true);
}

/**
 * Function that scans an arraylist and finds out the first
 * connection that has not yet been scanned. Fetches the list
 * of connections that this profile has, adds them to the end
 * of the list and returns a `true` - this makes sure that we
 * continue looping.
 *
 * If we find no connection that is scannable, we return `false`
 */
private boolean fetchNextAvailable(List<Connection> list) {
	Iterator<Connection> iterator = list.iterator();
	while(iterator.hasNext()) {
		Connection c = iterator.next();

		if(c.isScanned()) {
			continue;
		}

		// we have a non-scanned version
		if(c.level >= DEPTH_TO_BREAK_AT) {
			continue;
		}

		// we have a candidate within list
		// mark this connection scanned
		c.markScanned();

		// go ahead and scan this connection
		List<String> ids = getConnections(c.profileID);
		if(ids != null && !ids.isEmpty()) {
			for(String id : ids) {
				list.add(new Connection(id, c.level + 1));
			}
		}

		return true;
	}

	return false;
}

private int checkHit(List<Connection> c1, List<Connection> c2) {
	for(Connection x : c1) {
		for(Connection y : c2) {
			if(x.equals(y)) {
				// levels are measured from each side's root, so their
				// sum is the degree of separation
				return x.level + y.level;
			}
		}
	}

	// not found
	return -1;
}

/**
 * A simple structure to store the level and the ID of the person
 * Also, stores a boolean to say if we have scanned the list or not.
 */
public static class Connection {

	// the level at which we found the user
	int level;

	// the profileID of the user
	String profileID;

	// keeps track if we have scanned this profile for its depth
	boolean scanned;

	Connection(String profileID, int level) {
		// do basic sanity of values
		this.profileID = profileID;
		this.level = level;
		this.scanned = false;
	}

	public int hashCode() {
		return this.profileID.hashCode();
	}

	public boolean equals(Object obj) {
		if(obj == null) {
			return false;
		}

		if(this == obj) {
			return true;
		}

		if(!(obj instanceof Connection)) {
			return false;
		}

		Connection other = (Connection) obj;

		return this.profileID.equals(other.profileID);
	}

	public void markScanned() {
		this.scanned = true;
	}

	public boolean isScanned() {
		return this.scanned;
	}
}

Advantages of this method over Approach 1

  • Faster, as we are running a breadth-first traversal from both sides, which is more likely to give us a match early than running the traversal from a single side.

  • We break as soon as the two frontiers share a profile - thus the chain can be discovered early.

Optimizations

  • Needs a little extra memory on the machine calculating the degrees of separation; this can be reduced by using sparse bit-arrays if the profile IDs can be converted to sequential integer values.

  • Use a Bloom filter (or a bit-array) in the checkHit method to check faster whether we have a connection (see the sketch after this list).

  • We know more about users than just their profile ID. Thus, when scanning a connection list for the next level, we can give preference to connections from the same city, school/college, or organization - they have a higher chance of a match than users who share none of these.
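
A minimal sketch of the lookup idea for checkHit, assuming we index one side by profileID with a plain HashMap so the membership test becomes O(1); a Bloom filter (for example Guava's BloomFilter with mightContain()) could stand in for the map when memory is the bigger concern. The Connection class is the one from Approach 2.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

private int checkHit(List<Connection> c1, List<Connection> c2) {
	// index the second frontier by profileID for O(1) membership tests;
	// a Bloom filter or a bit-array could replace this map to save memory
	Map<String, Integer> levelByProfile = new HashMap<>();
	for(Connection y : c2) {
		levelByProfile.put(y.profileID, y.level);
	}

	// probe the first frontier against the index
	for(Connection x : c1) {
		Integer level = levelByProfile.get(x.profileID);
		if(level != null) {
			// levels from the two roots add up to the degree of separation
			return x.level + level;
		}
	}

	// not found
	return -1;
}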

Approach 3 - Fan out on Write

Another approach is to fan out on write, making the lookup near real-time. For every user we maintain one list of connections per degree of separation (in a distributed columnar store like Cassandra, or a Redis-style store). Whenever a connection is added, we fan out: the new connection is added to the user's own level-1 list, then to the level-2 lists of all users directly connected to them, and so on, up to a certain level.

Thus, say A has two connections, X and Y, and Y in turn has one connection, Z. Now A adds a connection M. We add M to A's list at level 1, to the lists of X and Y at level 2, and to Z's list at level 3.

To stay non-blocking and not hurt write performance, this is done in the background using a message broker for the fan-out.
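
A rough sketch of what the background fan-out worker could look like. ConnectionStore, its methods, and MAX_FANOUT_DEPTH are hypothetical names used only for illustration (the real storage would be the Cassandra/HBase/Redis lists described above); the broker consumption loop and frontier de-duplication are omitted for brevity.

import java.util.ArrayList;
import java.util.List;

// hypothetical storage abstraction over the per-user, per-level connection lists
interface ConnectionStore {

	// add 'connectionProfileID' to the level-'level' list of 'ownerProfileID'
	void addAtLevel(String ownerProfileID, String connectionProfileID, int level);

	// level-1 connections of the given user
	List<String> getDirectConnections(String ownerProfileID);
}

// consumes "A added connection M" events pulled off the message broker
class FanOutWorker {

	private static final int MAX_FANOUT_DEPTH = 4;

	private final ConnectionStore store;

	FanOutWorker(ConnectionStore store) {
		this.store = store;
	}

	void onConnectionAdded(String profileA, String profileM) {
		// level 1: A's own list
		store.addAtLevel(profileA, profileM, 1);

		// levels 2..MAX_FANOUT_DEPTH: users already connected to A,
		// expanding the frontier one level at a time
		List<String> frontier = store.getDirectConnections(profileA);
		for(int level = 2; level <= MAX_FANOUT_DEPTH && !frontier.isEmpty(); level++) {
			List<String> next = new ArrayList<>();
			for(String profile : frontier) {
				store.addAtLevel(profile, profileM, level);
				if(level < MAX_FANOUT_DEPTH) {
					next.addAll(store.getDirectConnections(profile));
				}
			}
			frontier = next;
		}
	}
}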

Initially, while the network is being built, the rate of connection additions will be high and we may need more workers to flush the queue. With time the rate of new connections will dwindle, as people stabilize with all their friends in the network, and the fan-out queue will be smaller.

The advantage over Approaches 1 and 2 is that finding a hit is just an O(1) operation; if we don’t find a match we can safely return. If the fan-out workers ensure that the fan-out queue never gets crowded, we can achieve near real-time performance.

Columnar storage like Cassandra or HBase helps because only a single row needs to be scanned. Using Redis may make it super fast, but the amount of memory needed would be huge.

Optimizations

  • Using integer IDs as profile identifiers can improve speed further and reduce the disk space cost: each level's list can be a sparse bit-array in which the bit corresponding to a connection's user ID is turned on (a minimal sketch follows). However, the storage cost increases once the degree of separation exceeds, say, 4 or 5, as the connections then tend to reach into the millions (going by practical examples from LinkedIn and Facebook).
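
A minimal sketch of that sparse bit-array representation, assuming profile IDs are (or can be mapped to) sequential integers; java.util.BitSet stands in here for a sparse implementation such as brettwooldridge/SparseBitSet.

import java.util.BitSet;

// connections of one user at one degree of separation, keyed by integer profile ID
class ConnectionLevel {

	private final BitSet connections = new BitSet();

	void add(int profileID) {
		connections.set(profileID);
	}

	boolean contains(int profileID) {
		return connections.get(profileID);
	}

	// number of connections stored at this level
	int count() {
		return connections.cardinality();
	}
}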