In this lab assignment, you will write a set of operators for SimpleDB to implement table modifications (e.g., inserting and deleting records), selections, joins, and aggregates. These will build on top of the foundation that you wrote in Lab 1 to provide you with a database system that can perform simple queries over multiple tables.
Additionally, we ignored the issue of buffer pool management in Lab 1: we have not dealt with the problem that arises when we reference more pages than we can fit in memory over the lifetime of the database. Now in Lab 2, you will design an eviction policy to flush stale pages from the buffer pool.
You do not need to implement transactions or locking in this lab.
The remainder of this document gives some suggestions about how to start coding, describes a set of exercises to help you work through the lab, and discusses how to hand in your code. This lab requires you to write a fair amount of code, so we encourage you to start early!
You will need to add these new files to your release. The easiest way to do this is to untar the new code in the same directory as your top-level simpledb directory, as follows:
$ cp -r cs133-lab1 cs133-lab2
$ wget http://www.cs.hmc.edu/~beth/courses/cs133/lab/cs133-lab2-supplement.tar.gz
$ tar -xvzkf cs133-lab2-supplement.tar.gz
Now all files from lab1 and lab2 will be in the cs133-lab2 directory.
To work in Eclipse, create a new Java project named cs133-lab2 like you did for lab1. You may need one more step to compile: right-click the project name (likely cs133-lab2) in the Package Explorer and select Properties. Choose Java Build Path on the left-hand side, and click on the Libraries tab. Push the Add JARs... button, select zql.jar and jline-0.9.94.jar, and push OK, followed by Apply and Close. Your code should now compile. If you used the Ant window in Eclipse for Lab 1, you'll want to remove that build file from the window and add the one for Lab 2.
As before, we strongly encourage you to read through this entire document to get a feel for the high-level design of SimpleDB before you write code.
We will grade your assignment by looking at your code and verifying that you have passed the tests for the ant targets test and systemtest. See Section 3.3 for a complete discussion of grading and the list of tests you will need to pass.
Here's an overview of your SimpleDB implementation for this lab; more details on the steps in this outline, including exercises, are given in Section 2.
- Implement the operators Filter and Join and verify that their corresponding tests work. The Javadoc comments for these operators contain details about how they should work. We have given you implementations of Project and OrderBy which may help you understand how other operators work.
- Implement IntegerAggregator and StringAggregator. Here, you will write the logic that actually computes an aggregate over a particular field across multiple groups in a sequence of input tuples. Use integer division for computing the average, since SimpleDB only supports integers. StringAggregator only needs to support the COUNT aggregate, since the other operations do not make sense for strings.
- Implement the Aggregate operator. As with other operators, aggregates implement the DbIterator interface so that they can be placed in SimpleDB query plans. Note that the output of an Aggregate operator is an aggregate value of an entire group for each call to next(), and that the aggregate constructor takes the aggregation and grouping fields.
- Implement the tuple insertion, deletion, and page eviction methods in BufferPool. You do not need to worry about transactions at this point.
- Implement the Insert and Delete operators. Like all operators, Insert and Delete implement DbIterator, accepting a stream of tuples to insert or delete and outputting a single tuple with an integer field that indicates the number of tuples inserted or deleted. These operators will need to call the appropriate methods in BufferPool that actually modify the pages on disk. Check that the tests for inserting and deleting tuples work properly.
Note that SimpleDB does not implement any kind of consistency or integrity checking, so it is possible to insert duplicate records into a file and there is no way to enforce primary or foreign key constraints.
By the end of this lab, you'll also be able to use the provided SQL parser to run SQL queries against your database! See Section 2.7.
Finally, you might notice that the iterators in this lab extend the abstract class Operator instead of implementing the DbIterator interface. Because the implementation of next/hasNext is often repetitive, annoying, and error-prone, Operator implements this logic generically, and only requires that you implement a simpler fetchNext method. Feel free to use this style of implementation, or just implement the DbIterator interface if you prefer, as you did with SeqScan in Lab 1. To implement the DbIterator interface, remove extends Operator from iterator classes, and in its place put implements DbIterator.
If you plan to use Operator.java, be sure to browse the code and comments to see
what is going on and which methods need to be implemented for the classes that extend it.
You can look at Project.java to see examples of how you should call super in your operators.
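For instance, a minimal sketch of Filter's fetchNext might look like this, assuming your constructor saved its arguments in fields named child and predicate (illustrative names, not part of the provided skeleton):

    protected Tuple fetchNext() throws DbException, TransactionAbortedException {
        // Pull tuples from the child until one satisfies the predicate.
        while (child.hasNext()) {
            Tuple t = child.next();
            if (predicate.filter(t))
                return t;
        }
        return null; // tells Operator there are no more tuples
    }

Returning null is how Operator's generic next/hasNext logic learns that the iterator is exhausted.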
- Filter: This operator only returns tuples that satisfy a Predicate that is specified as part of its constructor. Hence, it filters out any tuples that do not match the predicate.
- Join: This operator joins tuples from its two children according to a JoinPredicate that is passed in as part of its constructor. We only require a simple nested loops join (see the sketch below).
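Here is a hedged sketch of that nested-loops logic inside Join's fetchNext, assuming fields child1 and child2 for the outer and inner children, a field t1 holding the current outer tuple (initially null), and a helper combine() that concatenates two tuples; all of these names are illustrative:

    protected Tuple fetchNext() throws DbException, TransactionAbortedException {
        while (t1 != null || child1.hasNext()) {
            if (t1 == null) {               // advance the outer cursor
                t1 = child1.next();
                child2.rewind();            // restart the inner scan for this outer tuple
            }
            while (child2.hasNext()) {      // inner scan resumes where it left off
                Tuple t2 = child2.next();
                if (joinPredicate.filter(t1, t2))
                    return combine(t1, t2); // concatenate fields of t1 and t2 (your helper)
            }
            t1 = null;                      // this outer tuple has exhausted the inner scan
        }
        return null;
    }

Keeping t1 as a field lets the operator resume the inner scan on the next fetchNext call instead of restarting it.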
An additional SimpleDB operator implements basic SQL aggregates with a GROUP BY clause. Now you will implement the five SQL aggregates (COUNT, SUM, AVG, MIN, MAX) and support grouping. You only need to support aggregates over a single field, and grouping by a single field. You can ignore the other aggregate functions that you see in the Aggregator class.
In order to calculate aggregates, the Aggregate operator uses a helper interface called Aggregator which does the work of merging the next tuple into the existing calculation of an aggregate. Depending on the type of the field being aggregated (Type.INT_TYPE or Type.STRING_TYPE), you should create an IntegerAggregator or a StringAggregator.

The Aggregator is told during construction which operation it should use for aggregation (see Aggregator.Op). Subsequently, the Aggregate operator should call Aggregator.mergeTupleIntoGroup() for every tuple in its child iterator. After all tuples have been merged, the Aggregate operator can retrieve a DbIterator of aggregation results from the Aggregator. Each tuple in the result should be a pair of the form (groupValue, aggregateValue), unless the value of the group by field was Aggregator.NO_GROUPING, in which case the result is a single tuple of the form (aggregateValue).
Note that this implementation requires space linear in the number of distinct groups. For the purposes of this lab, you do not need to worry about the situation where the number of groups exceeds available memory. However, you should not assume all tuples can fit in memory. You should not store lists of tuples in Aggregators, but instead store "running" aggregates. For example, for SUM you could update the "sum so far" as tuples are merged in.

Note: for AVG, you should keep a running sum and running count, and perform the division just once at the end of merging all tuples.
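For example, the per-group state for AVG inside IntegerAggregator could be two maps, updated as each tuple is merged in. This is a sketch only: gbfield and afield are assumed to be the constructor's group-by and aggregate field indexes, and the maps are our own illustrative additions (with java.util.Map and java.util.HashMap imported), not part of the skeleton:

    // Running per-group state: no tuple lists, just a sum and a count per group.
    private final Map<Field, Integer> sums = new HashMap<Field, Integer>();
    private final Map<Field, Integer> counts = new HashMap<Field, Integer>();

    public void mergeTupleIntoGroup(Tuple tup) {
        // A null key stands for the single group when there is no grouping.
        Field key = (gbfield == Aggregator.NO_GROUPING) ? null : tup.getField(gbfield);
        int val = ((IntField) tup.getField(afield)).getValue();
        sums.merge(key, val, Integer::sum);   // "sum so far"
        counts.merge(key, 1, Integer::sum);   // "count so far"
        // For AVG, compute sums.get(key) / counts.get(key) (integer division)
        // only when building the result iterator, not here.
    }

Whatever representation you pick, the key property is that the state kept per group is constant-size.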
Some helpful notes:
- You may find TupleIterator useful for returning a DbIterator from the Aggregator.

Now, we will begin to implement methods to support modifying tables. If you haven't already read through Section 2.4, you may find it helpful to do so now. You will see that you will eventually write Insert and Delete operators. These operators call the insert and delete methods from BufferPool, respectively, which in turn call the appropriate methods on a HeapFile. Then the HeapFile calls insert or delete on the correct HeapPage!
We begin implementing Insert and Delete at the level of individual pages and files. There are two main sets of operations: adding tuples and removing tuples.
Removing tuples: To remove a tuple, you will need to implement deleteTuple. Tuples contain RecordIDs which allow you to find the page they reside on, so this should be as simple as locating the page a tuple belongs to and modifying the header of the page appropriately. Note that this simplicity assumes that the header is consulted when retrieving tuples or deciding where to insert a new tuple.
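Concretely, HeapPage.deleteTuple can boil down to a header-bit update. This is a hedged sketch: pid is assumed to be the page's id field, and isSlotUsed/markSlotUsed are the header helpers from your Lab 1 page code, whose exact names may differ in your skeleton:

    // Hedged sketch of HeapPage.deleteTuple: just clear the slot's header bit.
    public void deleteTuple(Tuple t) throws DbException {
        RecordId rid = t.getRecordId();
        if (rid == null || !pid.equals(rid.getPageId()) || !isSlotUsed(rid.tupleno()))
            throw new DbException("tuple is not on this page");
        markSlotUsed(rid.tupleno(), false); // frees the slot; stale tuple bytes may remain
    }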
Adding tuples: The insertTuple method in HeapFile.java is responsible for adding a tuple to a heap file. To add a new tuple to a HeapFile, you will have to find a page with an empty slot. If no such pages exist in the HeapFile, you need to create a new page and append it to the physical file on disk (use the static method HeapPage.createEmptyPageData()), but after writing it to disk, be sure to get the page via the BufferPool before inserting the tuple. You will also need to ensure that the RecordId in the tuple is updated correctly.
Note that it is important that the HeapFile.insertTuple() and HeapFile.deleteTuple() methods access pages using the BufferPool.getPage() method; otherwise, your implementation of transactions and locking in a future lab will not work properly.
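Putting this together, the page-finding loop in HeapFile.insertTuple might look roughly like the following sketch. It assumes a file field saved by your constructor and that the method returns the list of dirtied pages as described below; adapt it to your skeleton's exact signature:

    // Hedged sketch of HeapFile.insertTuple: always fetch pages through the
    // BufferPool, never by reading the file directly.
    public ArrayList<Page> insertTuple(TransactionId tid, Tuple t)
            throws DbException, IOException, TransactionAbortedException {
        ArrayList<Page> dirtied = new ArrayList<Page>();
        for (int i = 0; i < numPages(); i++) {
            HeapPageId pid = new HeapPageId(getId(), i);
            HeapPage page = (HeapPage) Database.getBufferPool()
                    .getPage(tid, pid, Permissions.READ_WRITE);
            if (page.getNumEmptySlots() > 0) {
                page.insertTuple(t); // HeapPage sets the tuple's RecordId
                dirtied.add(page);
                return dirtied;
            }
        }
        // Every existing page is full: append an empty page to the physical file...
        FileOutputStream out = new FileOutputStream(file, true); // append mode
        out.write(HeapPage.createEmptyPageData());
        out.close();
        // ...then get it via the BufferPool before inserting into it.
        HeapPageId newPid = new HeapPageId(getId(), numPages() - 1);
        HeapPage newPage = (HeapPage) Database.getBufferPool()
                .getPage(tid, newPid, Permissions.READ_WRITE);
        newPage.insertTuple(t);
        dirtied.add(newPage);
        return dirtied;
    }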
Next you'll finish the insertion/deletion in the BufferPool. These methods should call the appropriate methods in the HeapFile that belongs to the table being modified (this extra level of indirection is needed to support other types of files that could exist, such as indexes). Note that the HeapFile insert/delete methods should return which pages were dirtied so that the BufferPool can call markDirty on those pages.
Implement the insertTuple() and deleteTuple() skeleton methods in src/simpledb/BufferPool.java. At this point, your code should pass the unit tests in HeapPageWriteTest, HeapFileWriteTest, and BufferPoolWriteTest.
Now you will implement the Insert and Delete operators. For plans that implement insert and delete queries, the top-most operator is a special Insert or Delete operator that modifies the pages on disk. For their next() calls, these operators return the number of affected tuples. This is implemented by returning a single tuple with one integer field, containing the count (a sketch of building this result tuple follows the operator descriptions below).
- Insert: This operator adds the tuples it reads from its child operator to the tableid specified in its constructor. It should use the BufferPool.insertTuple() method to do this.
- Delete: This operator deletes the tuples it reads from its child operator from the tableid specified in its constructor. It should use the BufferPool.deleteTuple() method to do this.
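The single result tuple can be built as in this sketch, where count is however many tuples your operator processed (variable names are illustrative):

    // Build the one-field (count) tuple that Insert/Delete return from fetchNext().
    TupleDesc resultTd = new TupleDesc(new Type[]{ Type.INT_TYPE });
    Tuple result = new Tuple(resultTd);
    result.setField(0, new IntField(count));
    return result; // subsequent calls typically return null to signal completion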
At this point, your code should pass the unit tests in InsertTest. We do not provide unit tests for Delete. Furthermore, you should be able to pass the InsertTest and DeleteTest system tests.
In Lab 1, we did not correctly observe the limit on the maximum number of pages in the buffer pool defined by the constructor argument numPages. Now, you will choose a page eviction policy and instrument any previous code that reads or creates pages to implement your policy.
When more than numPages pages are in the buffer pool, one page should be evicted from the pool before the next is loaded. The choice of eviction policy is up to you; it does not need to be more sophisticated than the policies discussed in class, and even a random policy is acceptable. Describe your policy in the lab writeup.
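For example, a deliberately simple evictPage could evict whichever page the pool's map returns first. This sketch assumes the pool is a Map<PageId, Page> field named pages (an illustrative name); substitute LRU or another policy if you prefer:

    // Hedged sketch of a simple eviction policy: evict the first page the
    // map hands us, flushing it to disk first if it is dirty.
    private synchronized void evictPage() throws DbException {
        PageId victim = pages.keySet().iterator().next();
        try {
            flushPage(victim); // no-op if the page is clean
        } catch (IOException e) {
            throw new DbException("could not flush victim page " + victim);
        }
        pages.remove(victim);
    }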
Notice that BufferPool asks you to implement a flushAllPages() method. This is not something you would ever need in a real implementation of a buffer pool. However, we need this method for testing purposes. You should never call this method from any non-testing code.
Because of the way we have implemented ScanTest.cacheTest, you will need to ensure that your flushPage and flushAllPages methods do not evict pages from the buffer pool in order to pass this test.
flushAllPages should call flushPage on all pages in the BufferPool, and flushPage should write any dirty page to disk and mark it as not dirty, while leaving it in the BufferPool. The only method which should remove a page from the buffer pool is evictPage, which should call flushPage on any dirty page it evicts.
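In code, that contract comes out to roughly the following sketch. We assume the pool map is named pages and that the catalog exposes each table's DbFile via getDbFile(), per the Lab 1 skeleton; adjust names to match your code:

    // Hedged sketch of flushPage: write a dirty page back and mark it clean,
    // but leave it resident in the pool.
    private synchronized void flushPage(PageId pid) throws IOException {
        Page page = pages.get(pid);
        if (page == null || page.isDirty() == null)
            return; // not cached, or already clean: nothing to do
        Database.getCatalog().getDbFile(pid.getTableId()).writePage(page);
        page.markDirty(false, null); // clean now, but still in the BufferPool
    }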
Implement flushPage(), flushAllPages(), and evictPage(), as well as any updates to getPage() needed to implement page eviction, in src/simpledb/BufferPool.java. If you did not implement writePage() in HeapFile.java above, you will also need to do that now.
At this point, your code should pass the EvictionTest system test.
Since we will not be checking for any particular eviction policy, this test works by creating a BufferPool with 16 pages (NOTE: while DEFAULT_PAGES is 50, we are initializing the BufferPool with fewer!), scanning a file with many more than 16 pages, and seeing if the memory usage of the JVM increases by more than 5 MB. If you do not implement an eviction policy correctly, you will not evict enough pages, will go over the size limitation, and will thus fail the test.
You have now completed the code for this lab. Good work!
The following code implements a simple join query between two tables, each consisting of three columns of integers. (The files some_data_file1.dat and some_data_file2.dat are binary representations of the pages for these tables.) This code is equivalent to the SQL statement:
SELECT * FROM some_data_file1, some_data_file2
WHERE some_data_file1.field1 = some_data_file2.field1
AND some_data_file1.field0 > 1

For more extensive examples of query operations, you may find it helpful to browse the unit tests for joins, filters, and aggregates.
package simpledb;

import java.io.*;

public class jointest {

    public static void main(String[] argv) {
        // construct a 3-column table schema
        Type types[] = new Type[]{ Type.INT_TYPE, Type.INT_TYPE, Type.INT_TYPE };
        String names[] = new String[]{ "field0", "field1", "field2" };
        TupleDesc td = new TupleDesc(types, names);

        // create the tables, associate them with the data files
        // and tell the catalog about the schema of the tables.
        HeapFile table1 = new HeapFile(new File("some_data_file1.dat"), td);
        Database.getCatalog().addTable(table1, "t1");

        HeapFile table2 = new HeapFile(new File("some_data_file2.dat"), td);
        Database.getCatalog().addTable(table2, "t2");

        // construct the query: we use two SeqScans, which spoonfeed
        // tuples via iterators into join
        TransactionId tid = new TransactionId();

        SeqScan ss1 = new SeqScan(tid, table1.getId(), "t1");
        SeqScan ss2 = new SeqScan(tid, table2.getId(), "t2");

        // create a filter for the where condition
        Filter sf1 = new Filter(
                new Predicate(0, Predicate.Op.GREATER_THAN, new IntField(1)), ss1);

        JoinPredicate p = new JoinPredicate(1, Predicate.Op.EQUALS, 1);
        Join j = new Join(p, sf1, ss2);

        // and run it
        try {
            j.open();
            while (j.hasNext()) {
                Tuple tup = j.next();
                System.out.println(tup);
            }
            j.close();
            Database.getBufferPool().transactionComplete(tid);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Both tables have three integer fields. To express this, we create a TupleDesc object and pass it an array of Type objects indicating field types and String objects indicating field names. Once we have created this TupleDesc, we initialize two HeapFile objects representing the tables. Once we have created the tables, we add them to the Catalog. (If this were a database server that was already running, we would have this catalog information loaded; we need to load this only for the purposes of this test).
Once we have finished initializing the database system, we create a query plan. Our plan consists of two SeqScan operators that scan the tuples from each file on disk, connected to a Filter operator on the first HeapFile, connected to a Join operator that joins the tuples in the tables according to the JoinPredicate. In general, these operators are instantiated with references to the appropriate table (in the case of SeqScan) or child operator (in the case of, e.g., Join). The test program then repeatedly calls next on the Join operator, which in turn pulls tuples from its children. As tuples are output from the Join, they are printed out on the command line.
The first step is to create some data tables and a catalog. Suppose you have a file data.txt with the following contents:
1,10
2,20
3,30
4,40
5,50
5,50

You can convert this into a SimpleDB table using the convert command (make sure to type ant first!):
java -jar dist/simpledb.jar convert data.txt 2 "int,int"

This creates a file data.dat. In addition to the table's raw data, the two additional parameters specify that each record has two fields and that their types are int and int.
Next, create a catalog file, catalog.txt, with the following contents:

data (f1 int, f2 int)

This tells SimpleDB that there is one table, data (stored in data.dat) with two integer fields named f1 and f2.
Finally, invoke the parser. You must run java from the command line (ant doesn't work properly with interactive targets). From the simpledb/ directory, type:

java -jar dist/simpledb.jar parser catalog.txt

You should see output like:
Added table : data with schema INT(f1), INT(f2),
SimpleDB>

Finally, you can run a query:
SimpleDB> select d.f1, d.f2 from data d;
Started a new transaction tid = 1221852405823
ADDING TABLE d(data) TO tableMap
TABLE HAS tupleDesc INT(d.f1), INT(d.f2),
1 10
2 20
3 30
4 40
5 50
5 50

6 rows.
----------------
0.16 seconds
SimpleDB>

The parser is relatively full featured (including support for SELECTs, INSERTs, DELETEs, and transactions), but does have some problems and does not necessarily report completely informative error messages.
You will submit a tarball of your code on Gradescope for intermediate deadlines and for your final version. You only need to include your writeup for the final version.
You can generate the tarball by using the ant handin target. This will create a file called cs133-lab.tar.gz that you can submit. You can rename the tarball file if you want, but the filename must end in tar.gz.
The autograder won't be able to handle it if you package your code any other way! Click Lab 2 on your Gradescope dashboard. For deadlines besides the final version, you only need to upload or resubmit cs133-lab.tar.gz.
For the final version: click Lab 2 and then click the "Resubmit" button at the bottom of the page; upload both cs133-lab.tar.gz and writeup.txt containing your writeup.
If you worked with a partner, be sure to enter them as a group member on Gradescope after uploading your files.
Your grade for the lab will be based on the final version after all exercises are complete.
75% of your grade will be based on whether or not your code passes the test suite. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both ant test and ant systemtest.
Important: before testing, we will replace your build.xml and the entire contents of the test directory with our version of these files. This means you cannot change the format of .dat files! You should also be careful about changing our APIs. You should test that your code compiles against the unmodified tests. In other words, we will untar your tarball, replace the files mentioned above, compile it, and then grade it. It will look roughly like this:
$ tar xvzf cs133-lab.tar.gz
[replace build.xml and test]
$ ant test
$ ant systemtest
If any of these commands fail, we'll be unhappy, and, therefore, so will your grade.
An additional 25% of your grade will be based on the quality of your writeup, our subjective evaluation of your code, and on-time submission for the intermediate deadlines.
ENJOY!!