DTU Course 02819: Data mining using Python
Previously called DTU Course 02820: Python Programming
Discontinued
This course has unfortunately been discontinued and will not
run in autumn 2015.
You might want to consider DTU 02807 Computational Tools for Big Data.
Other related courses are 02450 Introduction to Machine Learning and Data Mining
and DTU 02805: Social graphs and interactions.
Practical information
Tentative schedule for autumn 2014
(to be updated)
- Introduction to the course, installation and introduction to Python
- Introduction to Python
- Numerical Python: "Python as Matlab".
Numpy, Scipy, matplotlib, ...
- Databasing, text and Web mining and Python
- Web serving
- Project work for the rest of the time.
- Exam: Sometime in December 2014, Building 321, floor 100, from
8:00-16:00(?) (Studiehåndbogen)
- Report hand-in deadline: Sometime in December 2012.
The following topics may be taken up:
- Python development
- Data analysis and the Semantic Web
- Misc.
- Pythonic Python
Course material
There is no fixed course material for the project, but we
recommend a number of books and other material:
- The Python Tutorial.
- Kevin Sheppard (2014)
Introduction to Python for Econometrics, Statistics and Data Analysis.
Good in-depth introduction to Python with installation,
datatypes, matrices, math, mathematical functions, pandas,
control structures, date time, graphics, performance, time
series.
- Mark Pilgrim (2004) Dive into Python.
A good and free introduction to Python. Basic Python,
unit testing, XML, regular expressions, etc.
- Allen Downey (2002-2008)
Think Python: How to Think Like a Computer Scientist.
(Apparently previously under GFDL and CC-BY-SA licenses,
now CC-BY-NC.) Covers the basics of the Python
language and Tkinter GUI.
Even better: An interactive version with multiple choice tests and online
programming is available.
- Hans Petter Langtangen (2008)
Python Scripting for Computational Science.
Available as an e-book through the DTU library.
Covers basic Python, numerical Python (NumPy and SciPy),
combining Python and C, CGI and GUI programming.
- Vidar Bronken Gundersen (2006) MATLAB commands in
numerical Python (NumPy). Indispensable if you are trained
in the Matlab or R languages as this short note provides a
concise Rosetta stone between Python, Matlab and R. Probably
good even if you do not have knowledge of Matlab and R.
- Online Python Tutor is an interactive code visualizer.
- Internet search engine, e.g., "Python tutorial",
"Python introduction" and "numerical python filetype:pdf"
- Finn Årup Nielsen (2014)
Data Mining with Python (working draft).
Extra:
- Toby Segaran (2007), Programming Collective Intelligence.
A commercial book with Python examples in machine learning for
Web 2.0 applications, e.g., naïve Bayes classifier and
non-negative matrix factorization.
Polyteknisk Boghandel: 279 kroner (2012)
- Steven Bird et al. (2009), Natural Language Processing with Python.
Describes the NLTK Python package and is written by the developers
of this nice language processing toolkit. Also available under a Creative
Commons licence from https://sites.google.com/site/naturallanguagetoolkit/book.
Polyteknisk Boghandel.
- Mitchell L. Model, Bioinformatics Programming Using Python:
Practical Programming for Biological Data.
Probably to be recommended for people interested in
bioinformatics. Covers basic Python presented with problems from
bioinformatics. Also covers web and database programming.
Polyteknisk Boghandel: 469 kroner (2012).
- Toby Segaran, Colin Evans and Jamie Taylor (2009),
Programming the Semantic Web.
Specialized book about the Semantic Web and how Python can be
used in that context.
Polyteknisk Boghandel: 275 kroner (2012).
- Alex Martelli (2005), Python Cookbook. Shows good code
examples. Less suitable as an introductory book.
Polyteknisk Boghandel
- Ivan Idris (2011), NumPy 1.5 Beginner's Guide.
As the name implies: a beginner's guide to NumPy.
Evaluation
You are evaluated on the written report, the printed poster and
the oral poster presentation.
The grade will be based on an overall combined assessment.
So do not expect to get a grade or "points" after the poster presentation.
You will first get the grade after we have evaluated the reports
together with the poster presentation.
The report should contain:
- A layout based on a template: this LaTeX template
(IEEE style with 10 point font size)
or one similar in style. Please do not squeeze the font size or
layout beyond the style file.
- Information on first page: name, study number, title.
- Your programming code in an appendix.
You may exclude code that contains long wordlists, etc.
- You are allowed to use code that you found on the Internet
(provided that the author distributes it under a suitable
license such as BSD or GPL), but remember to give full and
clear attribution to the author. Any text citations should
also be fully referenced and in quotation marks.
Paraphrased citations should also be fully referenced.
There are length requirements for the report:
- One-person report: 2 pages
- Two-person report: 3 pages
- Three-person report: 4 pages
These values are maximum limits.
The limits do not apply to the appendix, which can be
of any length. The appendix may contain code and automatically
generated content, e.g., from Epydoc, pylint results or other.
The text and the code should be included as text, rather than
as images, so the plagiarism detection system can read it.
The report could contain, e.g.:
- Discussion of the problem to be solved, the
data available and its features.
- Discussion of the design of the program
- Description of the implementation.
- Table or graphical overviews of modules and/or classes
- Database schema description
- Description of central parts of the code
- Screenshots of the program
- Plots of results
- Plots of code performance
- Details of the development process, e.g., editor, IDE, revision
control system, operating system, cloud service, ...
- Coverage
- ...
When looking at the code in the report, apart from the actual
functionality, we may examine the items below:
- How well-structured is it (modules/classes/functions)?
- Is it effective? Does it use the Python language
constructions effectively?
- Is it secure? In a web application you should sanitize
input and escape output. If your program receives
strange input, does it crash?
- Is the code documented? A structured way would use the
__doc__ variable and, e.g., pydoc. See the "documentation" part
of the introduction slides.
- Is the code tested? Are errors and exceptions handled
well? You have likely run the program and seen that it works,
but a structured approach would also utilize some of the testing
functionality in Python. See the "Testing" part of the
introduction slides and the testing slides.
- Is the coding "nice looking" and consistent? For inspiration
see Style Guide for Python Code.
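As an illustration of the security, documentation and testing items above,
here is a minimal sketch (not taken from the course slides). It assumes
Python 3, where the standard library provides html.escape; the function
name render_greeting is hypothetical and chosen only for this example.
The docstring becomes the function's __doc__ attribute, and the examples
inside it double as tests for the doctest module.

```python
import html


def render_greeting(name):
    """Return an HTML paragraph greeting `name`, with the name escaped.

    The examples below are both documentation and doctest test cases.

    >>> render_greeting("Ada")
    '<p>Hello, Ada!</p>'
    >>> render_greeting("<script>alert(1)</script>")
    '<p>Hello, &lt;script&gt;alert(1)&lt;/script&gt;!</p>'
    >>> render_greeting(42)
    Traceback (most recent call last):
        ...
    TypeError: name must be a string
    """
    # Reject strange input with a clear error instead of failing later.
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    # Escape on output so that user-supplied markup is shown, not executed.
    return "<p>Hello, {}!</p>".format(html.escape(name.strip()))
```

If this is saved in a file, for instance greeting.py (a hypothetical name),
then "python -m doctest -v greeting.py" runs the docstring examples as tests
and "python -m pydoc greeting" displays the documentation.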
The poster shows the design of the program and "results", e.g.,
screenshots or plots, and it must be self-contained.
On the top, the poster must contain title, names of students,
course title and date.
Examples of posters can be seen in building 321, 1st floor.
(Note that some of the posters are not from the Data Mining
using Python course but from a machine learning course.)
The external examiner will attend the presentation of the posters
and will receive the final reports.
We are able to print posters. They are printed in A1.
If you opt for this possibility, please email a PDF file
in sufficient resolution to Finn Kuno Christensen (or
Erik Lund Poulsen) at the email address
print (a) compute. dtu. dk in good time before the exam.
The printer room is in building 322.
You need to pick up the poster
yourself from building 322 room 030, where the posters are put on a
table, bring it to building 321 and find a place to hang it
up.
If you are scheduled late, do not hang your poster up early,
because there is only limited space.
Kuno and Erik tell me that, unless you hear from them, the posters
will be available at least 3 hours after you send them, within
the opening hours 9:00-15:00. To be sure that your poster is
printed in time, send it before 9:00 the day before
your exam.
If there are any special issues about printing, Erik is in
building 322 room 033 while Kuno is in building 324 room 150.
The DTU Library may also offer a printing service.
I do not know how much it costs.
As an alternative you can print the poster on multiple A4 sheets and glue them
on a poster board or hang them next to each other on the pin board we have.
There are no rules about the size of the poster, but A3 will
probably be too small. A1 is what DTU Compute will print.
A hint:
A standard issue with presenting material on a poster is
that one tends to put too much text on the poster. It is better
to have good drawings, diagrams or plots.
Poster presentation
For the poster presentation, two- and three-person teams should each give a
five-minute presentation, while a one-person "team"
should prepare a presentation of between 5 and 10 minutes. After the
presentation we (teachers and external examiner) will ask questions.
In a group, each participant should present a part of the
project/poster, i.e., no single participant should present
the entire project/poster.
Note that each participant's presentation and answers to questions
need to be individual. This is not a group exam and we may
direct individual questions to one of you in the group,
and the others should then not answer.
A hint:
It will be a good idea to rehearse exactly what you are going to
say at the poster presentation.
You can demonstrate the program on the computer. It might give
us a nice impression.
The risk is that you cannot make the computer work or that it
will take too much time to set up and draw attention away.
If you choose to also demonstrate the program on a computer, be
sure to rehearse that along with the oral presentation.
You should expect that other students follow around and listen
while you are giving a presentation.
There is also ordinary office noise in the hallway where we have
the exam.
You are welcome to follow around and listen to the other
students while they present.